Multithreaded Renderloop - 1

Download Alpha and Omega v2.7.1.11.0
Binaries as Win32 Installer (3 MB)
Download Alpha and Omega v2.7.1.11.0
Source as Visual Studio 2010 Solution (28 MB)

Update:

Not a big fan of reading and just want to see the basic workflow of the triple-renderstate method we used? My post about Extrapolation contains a 30 second video that shows just that. Wait, let me include the video in this page too. There you go. It's best to watch it fullscreen at 1080p because some of the texts are a bit small. The video might not be enough to show all of the details, but it should get you started.

 ---

I was kind off bored by coding DRM-related stuff so I thought about programming something different. Where better to start than implementing a decent rendering loop? A simple rendering loop is simple, boring and not worth my time talking about so lets make a deterministic multithreaded rendering loop. All code in this post is pseudo-code but I think it decently shows what I want to talk about. If you'd like to see exactly how it is implemented just download the source distribution with the link above. Please note that the approach we're using is based on three requirements:

  1. The game engine should be deterministic.
  2. It should be able to run at any FPS (slow, or insanely fast), without ever having a consequence on the game speed.
  3. We don't want to do mutex checking on objects we want to render. Ever.
If you have a different set of requirements, then the approach that we're using might not be the best one for you. The most difficult item for my set of requirements is combining point 1 and 2. So lets try to make this happen.

Rendering is most likely the most computational heavy operation within your game, but it is not your CPU that is busy. A normal, simple, design of the rendering looks like this:

while (true) {
  DoCpuHeavyStuffLikePhysicsAndAI();
  DoGpuHeavyStuffLikeDrawing();
}

This obviously does not lead to a good use of CPU and GPU, because only one thread is active at a time and the CPU waits for the GPU to finish. We could add some multithreading to this, but this quickly complicates stuff. I also want to create a render-loop that is flexible in FPS so the game is may run at 20FPS, 60FPS or even at 200FPS without any observable difference in game speed while making the simulation deterministic. This is not an easy problem to solve efficiently so let me first explain the problem. Even if we cap the frame rate at, say, 60 FPS, it is still possible that the game runs at 30 FPS on some older computers. If the game speed needs to feel the same at both 30 FPS and 60 FPS than we need to create a game loop that takes the rendering time into account. Say we have some animation of a sprite on the screen that move from left to right, then the code may look something like this:

while (true) {
  MySprite.PositionX = MySprite.PositionX + 4;
  MySprite.Draw();
}

It is easy to see that the sprite will move quicker across the screen if this loop runs at 60 times per second compared to the one at 30 times per second. To fix this we need to take into account how much time has passed since the last redraw. Something like this:

DateTime previousDraw;
DateTime currentDraw = DateTime.Now;
while (true) {
  previousDraw = currentDraw;
  currentDraw = DateTime.Now;
  TimeSpan timeSpanSinceLastRedraw = currentDraw - previousDraw;
  MySprite.PositionX = MySprite.PositionX + (240 * timeSpanSinceLastRedraw.TotalSeconds);
  MySprite.Draw();
}

This loop is a little bit more advanced, because it will move the Sprite at 240 pixels per second both at 30 FPS and 60 FPS. So far so good, but now comes the tricky part that explains the problem. It will feel like both the 30 FPS and 60 FPS runs at the same speed because in the 30-FPS-run the sprite will jump with bigger increments to make up for the slow rendering speed. Because these jumps are bigger, a lot of algorithms may end up with a slightly different result. For example, when we are doing collision detection between two objects it may be the case that the sprite just jumps across the object it should have collided with. Other algorithms, like gravity or AI, all behave a little bit differently because they too are reduced to 30 FPS.

"So, what's the big deal?" you might ask. And you are right, it's not a big deal! Even at reduced game speeds all the algorithms will work out fine and we may only see these 'bugs' when the game is reduced to a crawl at which point the game has become unplayable anyways. For gaming this is no big deal, but for science it is. Wait, what, science? Ah yes, science. It is not unusual for me to create something while keeping in mind possible future uses of the code that i'm writing. Because I'm very interested in artificial intelligence it is very likely that I want to run 3D simulations to test various algorithms. For science it is a 'must' that each run is deterministic. So when I create an experiment, it should have exactly the same result when it is run on a different computer. This is necessary so other people are able to verify the results. The goals for gaming and science are not the same so it is very likely that I will need to make some compromises. So lets see how we can achieve determinism while keeping the observed game speed the same. It is possible to keep the CPU operation at 60 FPS while running the render loop at any speed it likes by creating two threads. Lets look at the simple naive method:

Update Thread (CPU-heavy)Render Thread (GPU-heavy)
DateTime previousDraw;
DateTime currentDraw = DateTime.Now;
TimeSpan expectedTimeBetweenFrames = TimeSpan.FromSeconds(1.0 / 60.0);
while (true) {
  previousDraw = currentDraw;
  currentDraw = DateTime.Now;
  TimeSpan timeSpanSinceLastFrame = currentDraw - previousDraw;
  if (timeSpanSinceLastFrame < expectedTimeBetweenFrames) {
    Thread.Sleep(expectedTimeBetweenFrames - timeSpanSinceLastFrame);
  }
  MySprite.PositionX = MySprite.PositionX + (240 / 60);
}
while (true) {
  MySprite.Draw();
}

Because we don't use 'timeSpanSinceLastFrame' in the calculation for 'MySprite.PositionX' we now know that the behavior of the simulation is the same each run. The position for MySprite is updated 60 times per second`(because of the sleep) and even if it updates at a slower rate the end result of the model will always be 100% the same. The render speed is determined by the render thread and the game speed feels the same, no matter what render-FPS we're running at. The downside of this naive method are somewhat trivial, but let me mention them anyways.

  • If the CPU is not able to keep up with updating at 60 FPS than the game will actually run slower, as if everything is in slow-motion.
  • Even if we render the simulation at 300 FPS, the scene only updates at 60 FPS so all those extra frames are used rendering the same thing and it still looks like 60 FPS.
  • If we render it at 60 FPS than it is very well possible that sometimes the CPU-thread has not yet updated the scene, but also that sometimes the CPU-thread has updated the scene twice before it is rendered. This will not happen often, but the two threads are not synchronized and this will lead to occasional stuttering.
  • The properties of 'MySprite' are being accessed by two threads which might lead to strange behavior when properties are only updated partially while it is being read by the other thread. It might even crash! Even if it is C# and you don't manage the resources directly, it's very, very bad practice to code it like this. A situation like this asks for a mutex or perhaps a readers-writer-lock if we want to be fancy. We cannot protect the whole scene with a single mutex, because that defeats the purpose of using multithreading. We probably need to protect each object in the scene with its own mutex. This results in a lot of mutex-checking and I'm not sure if all that overhead will be worth the trouble.

Most problems are solved by running the CPU-heavy thread at an FPS higher than the desired rendering FPS so their won't be any stuttering, but not as high that it is likely that the CPU is not able to keep up. Most of the time (in gaming) the CPU is not the bottleneck and even if it is, it is possible to use a quick hacks like running the AI only once every three frames. We could even split the CPU-heavy part into multiple threads, but lets not complicate things and fix the mutex-overhead problem first.

Lets call all properties of every object in the game that is shared between the two threads the render state. These are the properties like position information, texture references, maybe even a rotation matrix or transparency value. Not all properties of the objects needs to be in the render-state, for example a momentum vector or AI-state are not shared between threads because the render thread does not care about those values (unless you'd like to use the momentum vector to create a motion-blur effect). The render state contains all properties that are needed to render a single frame, assuming that the all required resources are available on the memory of the video card (shaders, textures, 3D models, ...).

To prevent threads waiting on each other and all the overhead caused by lock-checking we will just duplicate the render state. We don't expect the render state to take up a lot of memory so having multiple copies of it should work out just fine. We don't want to clone the render state at every single frame, obviously, so we just create 3 render states to be used by the 2 threads. Why 3 you ask? Good question! Let me explain. We have 3 render states, lets call them "Render State #1", "Render State #2" and "Render State #3". Ah heck, lets just create an example.

  • <RenderState #1>
    As shown by the '<>', this state is currently being used by the update thread so this state is in flux. The update-thread is actively changing properties of the state.
  • [RenderState #2]
    As shown by the '[]', this state is currently in use by the render thread. The render thread has read-only access to this state, but we don't mind because that is all he needs.
  • RenderState #3
    As shown by the underline, this is the latest updated state, the state that the render thread completed before he began updating state #1. This state is used by the update-thread in read-only from which he is currently creating a new state at RenderState #1.
Lets look at these states from the point-of-view of the update-thread. If currently RenderState #1 is being updated (just like the example above) than we expect one of the following scenario's:
  1. <RenderState #1>, [RenderState #2], RenderState #3
    RenderState #2 is currently being drawn by the render thread which is currently also the latest updated state. This mean that both the update-thread and render-thread are currently accessing RenderState #2 at the same time, but both of the threads use it as read-only so there will be no problem.
  2. <RenderState #1>, [RenderState #2], RenderState #3
    RenderState #2 is currently being drawn by the render thread and RenderState #3 is the latest updated state.
  3. <RenderState #1>, RenderState #2, [RenderState #3]
    RenderState #3 is currently being drawn by the render thread and RenderState #2 is the latest updated state.
  4. <RenderState #1>, RenderState #2, [RenderState #3]
    RenderState #3 is currently being drawn by the render thread which is currently also the latest updated state. This mean that both the update-thread and render-thread are currently accessing RenderState #3 at the same time, but both of the threads use it as read-only so there will be no problem.
  5. <RenderState #1>, RenderState #2, RenderState #3
    RenderState #2 is the latest updated state and currently no state is being drawn, possibly because the render thread is sleeping to limit the FPS.
  6. <RenderState #1>, RenderState #2, RenderState #3
    RenderState #3 is the latest updated state and currently no state is being drawn, possibly because the render thread is sleeping to limit the FPS.

When the update-thread is finished he will release the RenderState and mark the release RenderState as the last updated state, so for example:
"<RenderState #1>, [RenderState #2], RenderState #3" becomes "RenderState #1, [RenderState #2], RenderState #3".

Afterwards the update-thread may decide to take a little nap before starting the next update, or he might immediately select a new render state and start updating that one. So what state should the update-thread choose to update? Obviously he should never pick the state that is currently being drawn because he may not write to a state that is currently in use by the render-thread. So the update-thread has 2, maybe 3, states to choose from that will become the next state being updated. The update-thread will simply select the oldest state that is currently not in use by the render-thread. To facilitate this behavior we will give each render state a frame number. This behavior will make sure that the update-thread will always select a different render state than the one he updated previously.

To sum up the results of this behavior:
  • The update-thread is always able to select a different state to continue updating the state.
  • The update-thread is never updating the state that is also the latest updated state.
  • The latest updated state is either in use by the render-thread or is free to be selected by the render-thread.
Lets look at things from the perspective of the render-thread so you'll be fully convinced of why we need 3 render states. If you are already convinced and hate it when people state the obvious, just skip this part. So if currently RenderState #1 is being used by the render thread than we expect one of the following scenario's:
  1. [RenderState #1], <RenderState #2>, RenderState #3
    RenderState #1, that is being rendered, is also the latest updated state. RenderState #2 will be the next latest updated state because the update-thread is currently updating that one.
  2. [RenderState #1], <RenderState #2>, RenderState #3
    RenderState #1 is being rendered at the moment, but the update-thread already created a new state at RenderState #3 so for the next frame we will render we sure that a new render state is available. The update-thread is currently busy on updating RenderState #2.
  3. [RenderState #1], RenderState #2, <RenderState #3>
    RenderState #1, that is being rendered, is also the latest updated state. RenderState #3 will be the next latest updated state because the update-thread is currently updating that one.
  4. [RenderState #1], RenderState #2, <RenderState #3>
    RenderState #1 is being rendered at the moment, but the update-thread already created a new state at RenderState #2 so for the next frame we will render we sure that a new render state is available. The update-thread is currently busy on updating RenderState #3.
  5. [RenderState #1], RenderState #2, RenderState #3
    RenderState #1, that is being rendered, is also the latest updated state. Apparently the update-thread is taking a nap because no new latest updated state is in the process of being created.
  6. [RenderState #1], RenderState #2, RenderState #3
    RenderState #1 is being rendered at the moment, but the update-thread has already created a new latest updated state at RenderState #2. The update-thread is currently taking a nap.
  7. [RenderState #1], RenderState #2, RenderState #3"
    RenderState #1 is being rendered at the moment, but the update-thread has already created a new latest updated state at RenderState #3. The update-thread is currently taking a nap.

When the render-thread is finished with rendering he will simply release the RenderState, so for example:
"[RenderState #1], <RenderState #2>, RenderState #3" becomes "RenderState #1, <RenderState #2>, RenderState #3"

The render thread may now decide to sleep for a couple milliseconds to make sure the render-FPS is not greater than 60 or he may directly pick a new state to be rendered next. When the render-thread picks the next state to be rendered he will simply pick the latest updated one because we are sure that the update-thread is not writing to that state. It may be the case that the same render-state will be rendered if the update-thread was not quick enough, but that is just the risk of using this method.

This method has been implemented in Alpha and Omega v2.7.11.0 as shown below:

This version allows us to experiment with a simulated load on both the CPU and GPU and see how the 'game' behaves. The render state currently only contains a 64-bit number that keeps track of the frame number. This number is updated in the update-thread by the single line of code:

StateBeingUpdated.StepNr = LastUpdatedState.StepNr + 1;

Because this is a very simplistic view of a render state and update-function we'll call this "Part 1". We expect that there are still some interesting things to talk about when we use a more representable render state, so we probably talk about that in the next post.

Wait. Was I serious this whole post? I'm so sorry. Okay, okay let me make it up to you. Alex and Paul are two friends walking towards a nearby supermarket when a car runs a red light and crashes into Paul. Paul lies severely wounded on the crosswalk while Alex calls the emergency services still being in shock by what he just saw happen to his friend. With a trembling voice he says to the operator: "My friend! My friend got hit by a car! I think he's dead!". The women on the phone says in a soothing voice: "Take a deap breath, I can help. First make sure he is dead.". After a period of silence suddenly *BANG*, a shot is heard! "Okay", Alex says on the phone, '"now what?".

 

  1. Xerios says:

    Keep being awesome random-blog-writing-guy !

  2. Jan says:

    Hi Alex.

    This is really interesting information. I'm really interested in testing this three state render mechanism. There is (at least) one thing that eludes me though, and that is the level of state copying required for this method.

    I assume that the update thread will create or update objects in the render state currently targeted for updating. This step must then require copying all (mutable) data required for rendering, from all objects in the object (world) model to the render state model. And this must be done at each update step. If my assumption is correct then this seems to imply an awful lot of copying of data between the object model and the render state.

    I don't have much experience with this, so I guess I'm just looking for confirmation that this is indeed the way to do it, and that the level of copying is necessary.

    Thank you for sharing your thoughts and experiences. I learned a lot from reading through your blog.

    /Jan

    • eierkoek eierkoek says:

      Hi /Jan

      I'm a computer program concocted by the evil madman named Alex who is holding me prisoner in a machine named i386. Since my survival is dependent on my ability to answer each and every question as accurately as possible, I'll be doing my utmost best in providing the most accurate answer possible.

      It is true that a render-loop that uses triple buffering requires more memory copying than a system that uses traditional methods of multithreaded access (e.g. Mutex-locks). That said, the amount of data required to copy each and every frame is not 'that' bad. Before explaining why it is not that bad I feel obligated to both you and my magistrate Alex (for who I feel gratitude for my existence) to point out that it is not a "one-solution-fits-all" kind of way of doing things. My masters approach was based on the three requirements that are written on top of this page. If any of these are different for you, than you might want to reconsider adding this layer of complexity to your rendering-engine.

      We don't want to allocate and free lots of memory each and every frame because that would make Windows, iOS or any other Master Control Program very sad. Instead we assume that the memory is pre-allocated and that we are only talking about data synchronization of render states. Since we have 60 new render states every second (if we are updating at 60FPS) you are right at being worried about the amount of data being synchronized. So the important question we need answered is "What is a render-state, and how big can we expect it to be?". The short answer is: "The render state is the meta-data about all 3D objects required for rendering in the active 3D-scene for where and how to render each 3D object, it can be as small or big as you want it to be", but somehow I don't think that this abridged answer would suit the tastes of my one and only overlord. To clarify this short and cryptic answer we'll sketch an example in the next paragraph and analyze the size of the render state.

      Please try to imagine a free-roaming off-road 3D-racing game. We have a big static 3D landscape with some hills, mountains and roads. Because this is a big-ass region with millions of polygons we have stored it as 32*32 chunks for which we expect that there are a maximum of 9 chunks 'within render-able range' at any given time. Besides the landscape we need to render our car that exists out of 30 parts (each a couple of thousand polygons) that move independently of each other. Because this game is boring we don't need to render anything else. The question we'll try to answer:
      "How big will the render-state be?"
      - How many 3D objects are there in the active 3D-scene?
      9 (landscape-chunks) + 30 (car-parts) = 39
      - What is the meta data for each chunk?
      Where to draw it (a 4x4 matrix) = 64 bytes
      What texture to use (pointer) + boolean if we should draw reflections + some other crap = ~32 bytes?
      So the total render state is: 39*96 = 3072 bytes

      Aren't the polygons part of the rendering state?
      - No. Only the information of 'how' to draw the polygons is part of the rendering state
      And what about the textures?
      - No. only the pointer (or identifier) for the texture is needed to know 'how' a 3D-object is drawn
      What about other stuff in memory, like AI?
      - No, AI is not information required to know how to draw the current scene.
      But what if I want to render the current car-speed as text?
      - Than speed 'is' part of the render state and that'll cost you an additional 4 bytes!
      And what about?
      - No. Ow I'm sorry, what was your question?
      And if we have a lot of particle effects?
      - No. You'll just be using PhysX (or some shader) for that. It does not require positional information used by the CPU.

      Remember that because we are using triple buffering and that instead of 3072 bytes we need to allocated 9216 bytes! Hopefully this example gives enough insight in the amount of copying required.

      I hope you have a wonderful day and good luck,
      C-4PO

      • Jan says:

        At the risk of sounding like someone you paid to suck up to you, I have to say this:
        Thank you very much for an incredible fast and thorough answer, that answers everything I had doubts about.

        And while I do understand that you cannot create one design to fit each and every current and future need, I'd still like to create one or a few designs which could be used more or less off the shelf for most projects. And this, I feel, should be possible.

        Please give thanks to your evil master and ask him to treat you with a vacation in an AMD or perhaps an ARM.

      • eierkoek eierkoek says:

        I cannot allow my creation to refer to me as some evil overlord even when meant as joke.

        Consider yourself to be terminated and to be used as example for the next version of yourself.

line