Multithreaded Renderloop - 2

Download Alpha and Omega v2.7.1.12.0
Binaries as Win32 Installer (3 MB)
Download Alpha and Omega v2.7.1.12.0
Source as Visual Studio 2010 Solution (28 MB)

Hi there! Nice of you to keep reading all those words coming out of my fingers. Lets continue our previous post on a deterministic multithreaded renderloop, yay! The last post ended with the explanation and implementation of an actual deterministic multithreaded render loop, but the render state was still too simplistic for a decent demonstration. If we keep track of an actual state of the world and create usable render states we quickly discover that it is more difficult than we'd like it to be. Before we dive in the abyss that is called "game engine development", lets start with a quick recap:

  • The world state is the whole state we need. It contains all objects and properties we'd like to have in our update-thread and render-thread.
  • The render state is a subset of the world state. It contains all objects and properties that the render-thread needs to render a frame (assuming all resources are loaded to the memory of the video card). We always have 3 of these so we can apply safe and efficient multithreading.
  • The render-thread has at most 1 of the 3 render-states in use and it has read-only access to it.
  • The update-thread has at most 2 of the 3 render states in use. The 1 that is updating now (where he has read-and-write access) and the render state he updated previously (read-only access).
  • We only need a mutex when we are switching states which is a very quick operation. During rendering no mutex-checking needs to be done. The update-thread also doesn't need any mutexes, unless we decide to divide the update-thread into multiple threads.
There are two additional advantages to having an deterministic rendering loop I didn't thought of when writing Part 1.
  1. It becomes a lot simpler to create replay-files because we only have to record stuff like "at update-frame #X, User Y performed action Z". We don't have to be afraid that small perturbations in algorithms will cause the replay to deviate from the actual played game.
  2. It is easier to to add online multiplayer support. As long as we ensure that update-message get received correctly, we are able to ensure that the world state for each client runs synchronized (although a bit delayed) with the server.

Word state = Render state ∪ Rest state ?

The biggest issue is that we have 1 world state and 3 render states, so we need to split our world state in two parts:

  • The render state.
  • The rest. Ugh, we don't have a nice word for this. rest state?
With their powers combined they form world state. If you are a fellow software engineer you probably hate this situation as much as I do. There is just no nice functioning object-oriented model to represent our complete world-state and still keep a separation between the two. Before we start thinking about render state and rest state lets just create an example world state so we don't have to worry about splitting it just yet. Let me first state that there isn't a one-solution-fits-all achievement that we may get if we take our time and think things through thoroughly t... (sorry can't think of other words starting with a 't'). I'll just design it with my game in mind and hopefully it becomes usable to me in other future projects.

So what should the world state look like? I have a good idea on how to come up with a world state that allows me to do most the things I want, but explaining 'why' I've chosen this particular representation will be very difficult. Most of the design-decisions are based on experience in previous projects and it may seem silly if you don't understand my reasoning.

World state

Okay here goes my first attempt of an example, followed by a quick Q&A afterwards.

Note all the strange things in this model, to name a few: No hierarchical model of the world, separation of 'static' (which might not be completely static) and 'dynamic' world. We have even separated the 2D overlay in opaque and transparent, what is that all about? We also only show the objects, what about each object's properties? Why aren't you talking about how to manage and load 3D models and textures on the video-card? Why do you have such a weird shaped head? How do you split this model into a render state and a rest state? Which thread is taking input from my game-pad that moves the player?

So many questions. And what was that about a weird shaped head, I mean, like, seriously? That hurts you know. Ah well, you have to be somewhat of a masochist to try and create a decent 3D game without the use of an engine. Okay let me try to answer a couple of questions you might have before we zoom in on the objects to see what properties we may expect.

  • Why don't  you use an hierarchical model of the world?
    An hierarchical model of the world isn't all that great. Discussing this topic alone is worth its own post (or two) and I am not willing to do that. Instead I'll just redirect you to a post from 0 FPS that explains it far better than I ever could. If we were to use a ray-tracer to render a voxel world (which I won't) and like to have a sparse oct tree, even then I would be doubtful on using oct trees.
  • What's up with the separation of static world and dynamic world?
    We actually had two reasons for doing this.  The static world will contain blocks with static landscape of opaque polygons. When rendering a scene you'd usually want to render the opaque polygons front-to-back so more pixels you try to draw are actually hidden by pixels you have drawn previously. Its not as simple as I might make it seem. A lot of things are interconnected again. For example there exists something as 'Z-Culling' (= Don't draw the pixel if it is hidden by another pixel, implemented using the z-buffer) and 'Early-Z' (= Prevent the use of fragment shader if pixels is hidden by another pixel, often combined by z-checking on the GPU-chip before using the actual z-buffer). Both Z-culling and early-Z  are things you could 'simply' enable assuming you don't use a fancy shader that modifies z-values (and off course only if the video card supports it). Even if we don't use early-Z we still want to render our opaque polygons front-to-back so the video card does not have to overwrite back-buffer and z-buffer fragments as much. Confused? You should be. This stuff isn't easy. The second reason I have is in the use of level-of-detail algorithms. It would be too naive to think that a single level-of-detail algorithm suits all our needs. Reducing polygons of landscape segments or ocean segments are two completely different algorithms and we didn't even mention other stuff like grass, trees or creatures roaming the wild. (There are other reasons for the separation, like algorithms for creating shadows, but I don't want to talk about that right now.)

    • So wait, you don't allow the use of transparent parts in your 'static' world?
      Really good question! That's right, no transparent parts in the static world. If we like to add transparent parts to the world (glass or ice maybe?) then we need to add those things to the dynamic world. Transparent parts need a different method of being drawn because can't depend on the z-buffer for those polygons.
    • So does static mean that we can't change the world? What if we want to create an explosion that makes a crater?
      The static world exists under the assumption that it does not change that much over time. A small change in the static world would mean that at least one of the polygon-chunks needs to be recreated. Video cards are better at drawing big chunks of polygons at a time (instead of lots of small chunks) so we like to encourage big chunks for increased rendering speed. On the other hand we'd like to use stuff like frustum-culling on chunks and like our world to be dynamic wherever possible. It's like finding the golden ratio without a calculator!
    • So the reason for separating the opaque and transparent 2D overlay is also done to increase rendering speed?
      Yes indeed. We want to draw the opaque 2D overlay as one of the first things (that will also update the z-buffer) so it hides a lot of pixels from the 3D scene but we need to draw the transparent 2D overlay as one of the last things. If we want to draw a fully opaque rectangle we could even create a simple 3D area on which we can apply culling techniques (similar to frustum-culling), but this is probably overkill and not worth my time.
  • Why don't you put the free roaming objects and the dynamic world together?
    The dynamic world is still bound by absolute positions. It may be the case that leaves are dancing to the wind or that the ocean creates waves, but there is no actual physics involved when these objects interact with each other. All the complex collision detection, physics, AI, ... needs to take place between free roaming objects and the rest of the world.
  • Why do you have such a weird head and what about all the other question I asked?
    Seriously, the head-thing again? Don't worry about those other questions, they will be answered in due time... probably.
  • Are you ever serious?
    Sometimes...

Lets us continue moving forward, the only sensible direction for us right now.

Separation of Render State(s) and Rest State

When we split the world state in a render state (three of them actually) and rest state then it'll look something like this:

Not all world-state-nodes needs to have a render-state-node (e.g. node #4), because not all nodes need to be rendered. Maybe node #4 keeps track of some AI-related stuff and is only used by the update thread. When the update-thread is finished and jumps to Render State #2 it'll probably look like this:

The render state will now be busy to update Render State #2 and the Rest State, using some of the information stored in Render State #1. However, it is possible that a render-state-state-node has been added or removed, so the update-thread may not blindly assume that all nodes in render-state #2 are available. The update-thread in the example above has been lucky enough that the render-state did not change but what if it did? What if the update-thread decided to remove node #5 and adds node #7 as child of node #6? Then the situation looks like this:

During the update process the update-thread needs to make Render State #2 the same as Render State #1 (and possibly make additional changes)! There are several ways to go about solving this issue, but I think it is important to keep all this complexity from the actual update-function. We don't want to worry about the possible non-existence of render-state-nodes while calculating physics! Have I already mentioned that we keep track of parent nodes and that order is important? Hah, just when you thought things were still reasonable.

So how will we fix old render states to have the structure we expected? There are several ways of approaching this but I have chosen the way where each operation that modifies the render-state will become an update-message. Before the update-procedure has its way with the render-state we will parse all update-messages so the render-state structure becomes up-to-date. So basically when node #5 gets deleted at render state #1 it will immediately gets deleted. For render state #2 and #3 an update message gets added to the todo-queues: "Delete render state at position X". The same happens when adding node #7, it will be created immediately for render state #1 and for render state #2 and #3 the update message "Add render-node of type X at position Y" will be added to the todo-queue.

There is only one last simple issue we need to address. We need a good way to encode the position of where to add/remove a node from the render-tree. We may not blindly use references to nodes in the rest state, because the node may no longer exist.

If the update thread decides to remove node #7 directly after he added it then the message "Add render-node for <ref. to rest-node #7>" for render state #3 becomes invalid. I'm implementing this on a managed language, C#, so it is not 'that' big of a deal, but it is still bad practice to ignore this because we may want to dispose the object before it gets deleted by the garbage collector. Again, there are multiple ways of solving this, but in my opinion the best way is to dispose of a rest-state-node only after all associated render-state-nodes are disposed. This implementation still allows the code to be ported to unmanaged programming languages, so yay :) . For technical details I'll redirect you to the source distribution, because it would probably be boring to show here.

The first thing he needs to do is to parse all items in the todo-queue so the structure of the render-tree is up to date. Only after this is done will we perform the normal update-procedure. Basically this should be working, but the data structure is not as nice as I'd hoped it to be. We now have 3 render-state-trees and 1 rest-state-tree completely separated from each other. For the update-thread this is not a great situation, because this means that we need to parse 3 trees simultaneously (the current rest-state, the current render-state and the previous render-state). To somewhat remidate this annoying situation we'll give each renderable rest-state-node references to the 3 associated render-state-nodes. This connects the rest state to the render-states. We have to be really careful with these reference because all those states have become a web of references and maintaining it during all tree operations may get complicated.

That took a lot of effort, don't you agree? All this hard work is totally worth it! We can now render and update the world in two separate threads without having to worry about thread-safe access to objects. The two threads can run separately from each other and never really have to wait, we only have a little bit of mutex-checking when one of the threads wants to change the state. Better yet, almost all this multithreading is hidden from you when you're making the game WHOHOOW. So are we done? Sadly no :(

Stuttering, The Problem

In the sce, sce, scenario where we set the update-FPS to 120, while the render speed is stuck at 60FPS you might expect something like this to happen in chronological order:

  • Render frame #0
  • Update to frame #1
  • Update to frame #2
  • Render frame #2
  • Update to frame #3
  • Update to frame #4
  • Render frame #4
  • ...

Because the render thread is twice as slow as the update thread we expect that the render thread to skip a frame. We do not synchronize the threads, so he might not always skip exactly 1 frame. Sometimes he may skip 2 frames, or sometimes he doesn't need to skip at all. If you expect a behavior where the update thread almost always skips 1 frame then you are sadly mistaken :( . Yeah, I'm sad too. Before we'll explain let's see the behavior that you are more likely to get:

  • Render frame #0
  • Update to frame #1
  • Update to frame #2
  • Update to frame #3
  • Render frame #3
  • Render frame #3
  • Update to frame #4
  • ...

Wow, that is quiet messed up right? No wondering that the game is stuttering! This has to do with thread-switching. The operating system decides which thread currently gets some execution-time on what core of your CPU. You may have hundreds of threads running on your quad-core processor, so each thread will be given some time to do his thing. So even if we visualize threads as processes that run completely seperatly from each other, this is probably not the case. Your OS is deciding when your thread may perform a couple of operations. Even if it may not look like it, threads do 'wait' for each other. It is not possible to tell the operating system when and when not to switch, but we are able to give a couple of 'hints. One of the possible hints we may give is "Processor affinity".

Proccessor affinity:
Processor affinity is a property you can set on a thread that tells the operating system on what processor(-core) of your CPU's the thread may run. So it is possible to create a thread that will only get executed on processor-core #3 of your CPU. For C# check out this MSDN page to see how this is done in code. Usually you don't mess around with these properties, but if you are a tweaker then you probably already know that you can also set process-affinity using the task manager. You may be shocked to hear this, but if you are programming for the XBOX360 then you must set process affinity on each thread you create! For XNA you'll have access to Thread.SetProcessorAffinity to configure on what CPU your thread will run. It is actually not 'that' shocking, they probably have specialized cores for certain tasks instead of your run-of-the-mill cores in your desktop computer. I'm currently not that focused on development for the XBOX and messing around with processor affinity just so solve the stuttering problem is probably not a good idea. So lets continue to the 'sleep' function.

Sleep(0):
It is possible to let a thread sleep for a couple of milliseconds, for C# this is done using the Thread.Sleep(...) function. This will actually execute a windows API call to let windows know that this thread is inactive for a while. This way windows knows not to give this thread a couple of cycles on a CPU when it is thread-switching, so the CPU can be busy computing stuff for other threads. When calling this function windows also knows that it is better to immediately switch to another thread, because this thread will not be busy for a while. When calling Thread.Sleep(0), you are telling Windows that you want to sleep 0 milliseconds. You may expect that this function returns immediately  but in reality windows will force a thread-switch. It may be the case that the same thread get a couple of cycles, but it may as well be another thread. This is not some obscure undocumented feature that exists as a side effect of 'sleep', it is actually well documented and a really nice feature to have.

A forced thread switch using Sleep(0) is probably a good idea. This is the first 'fix' that we'll make. After the update thread is done performing a 'step' and when the rendering thread is done rendering we'll always make sure that we'll sleep, even if it is 0 millisecond! This may help a little bit, but it won't solve the underlying problem. The problem is caused by a lack of thread synchronization. We don't want the traditional form of thread synchronization (because we don't want threads waiting for each other) so what are we able to do?

Stuttering, Solution #1: Partial Thread Synchronization

It is possible to do some thread synchronization without creating nasty situations where threads need to wait for each other. If the update thread runs at 120 FPS and the render thread at 60 FPS, then certainly under no circumstance should the render thread draw the same frame twice! If it was reversed (update-thread at 60 FPS and render thread at 120 FPS), then under no circumstance should we allow the update-thread start updating before the render-thread is starting to draw the last updated state. This is implemented using two ManualResetEventSlim's and are used as follows:

  • if desired update fps <= current render fps, prevent frame skipping
  • if desired render fps <= current update fps, prevent frame duplication
If we prevent both frame skipping and frame duplication then the two threads are completely synchronized. Note that even in this scenario it is still fully multithreaded because the update thread is still one frame ahead of the render thread. Fully synchronized threads are desirable if (and only if) the render-thread and update-thread are almost equal in speed. The condition above is a little bit strict, because this will only happen when the all FPS-values are exactly the same and because we are talking about estimated averages here, that's very unlikely. To improve the chances on full synchronization we'll state that it is okay to slow down a thread up to ~15%, if this helps to improve synchronization. The two conditions have now become:
  • if desired update fps <= current render fps * 1.15, prevent frame skipping
  • if desired render fps <= current update fps * 1.15, prevent frame duplication

So even if the CPU is able to keep up, the game may actually run a little bit slower than desired to keep 'smoothness'. This may be undesirable for online multiplayer or games that depend on a very stable game speed, so hopefully I'll remember this setting when I get to the point where I'm actually creating a game :) . Lets continue to a better solution.

Stuttering, Solution #2: Motion Extrapolation by Render Thread

If the render thread has access to the variables that are used to create motion (momentum-vector, rotation matrix, ...), then we'll be able to extrapolation the probable position of an object. We do not apply things such as physics or AI, we'll just extrapolate the expected position of an object in the render thread. This is actually not as difficult as it might seem. This solution is actually really great! Even if we render at a whopping 1000 FPS, every frame is able to render a new frame, not just a frame that we have seen before. We'll just use simple linear extrapolation. Even if the extrapolation turned out to be wrong (because of a missed collision response for example), the shown intersection will only last for a fraction of a second. We could even use input from the game-pad mouse or keyboard for extrapolation! This will increase the observed response-time of the game, so what's not to like?

Binaries and Source

The binaries and source distribution of this version contains both of these methods (download links at top of this page). I've added a couple of bouncing 'balls' so it is possible to see whether the game is stuttering. It is very likely that you'll see stuttering when you disable both thread synchronization and extrapolation. Ahhhh and everything is still deterministic... so awesome...

*Update: The next post is about deterministic behavior, but we've also written part 3 and part 4 if you like to continue reading about the multithreaded renderloop.

Multithreaded Renderloop - Part 1 Deterministic Behavior

line