Multithreaded Renderloop - 4

Download Alpha and Omega v2.
Binaries as Win32 Installer (3 MB)
Download Alpha and Omega v2.
Source as Visual Studio 2010 Solution (30 MB)

We have implemented multithreading to separate the CPU-heavy stuff from the GPU-heavy stuff and it is now possible to create a multithreaded game without having to worry about mutexes. There is just one last thing we need to worry about. Actually there are tons of things to worry about, but for the short-term goals there is just one :). If we have a huge landscape and use LOD algorithms to keep the frame rate from dropping, we probably want our update thread to create low-polygon versions of chunks of the world. There are several LOD methods, but in the most common one the CPU creates these chunks (there are also GPU-based LOD algorithms). So the update thread creates (or loads) a low-polygon chunk, replaces a couple of nodes in the world tree and uploads the model to the video card, right? But all our effort so far has been to separate the CPU-heavy stuff from the GPU-heavy stuff, and now we are letting the update thread upload models to the video card? It must be the update thread that decides which chunks are going to be rendered, but letting the update thread have access to the video card might not be such a great idea.

Multithreaded Access to the Video Card

I'm just going to start with a bold claim, explain why, and discuss all the nasty implications. Here goes: "The update thread may not access the video card, ever". This is a big issue to overcome, so we'd better have a good reason for it. Even if this statement is not accurate (at all), let me explain why. Let's assume the video card is currently busy rasterizing polygons to the back-buffer because we are sending a couple of 'draw'-commands to the video card. It is possible to use a different CPU-thread to send a new 3D model to the video card while the video card is busy applying shaders (it's not like the bus is in use by our draw-command). So how will the video card respond to this? If we are being optimistic we can say: "The video card can draw and update its memory at the same time, right?". If only things were so simple. Most video-card drivers are (rightfully) protected by mutexes to prevent multithreaded access to the video card. Uploading data to the video card may unintentionally block drawing, so we'd better be careful about what we upload and when. If we let the update-thread also manage video card resources, we may get poor performance because we end up taking time on the bus while a draw command is waiting for its turn. The update-thread has absolutely no idea about the current GPU/bus load and therefore has no way of being smart about it.

Maybe I'm just being paranoid, but I'm pretty sure that my crappy video card does not like to do 20 things at once. In our previous posts, part1 and part2, we've separated GPU (render-thread) from CPU (update-thread), but the update-thread was still in charge of loading resources to the video card. Our current goal is to get fine-tuned control over the GPU by completely separating the update thread from any video-card resources. This will again be an overkill of performance tweaking (if I want more performance it's more fruitful to switch to C++), but it sure is fun to do =D. The reason I'm diving into this madness has to do with my desire to play around with virtual textures (a.k.a. mega-textures) somewhere in the near future. To be more precise, procedurally generated virtual textures :). When using virtual textures, lots of texture updates will occur, demanding peak performance from the bus non-stop. So my guess is that it's better to implement this part now than to be sorry later.

The best way to start is to change the rendering loop. Instead of this pseudo-code, where the update-thread loads resources:

while (true) {
  RenderState latestUpdatedRenderState = world.GetLatestUpdatedRenderState();
  // draw latestUpdatedRenderState; its resources were already uploaded by the update-thread
}

We will use something like:

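while (true) {
  RenderState latestUpdatedRenderState = world.GetLatestUpdatedRenderState();
  TodoList todoItems = latestUpdatedRenderState.GetTodoItems();
  while (todoItems.ThereAreThingsWeMustPrepare()) {
    // forced item: pass it to DoVideoCardStuff before drawing
  }
  // draw latestUpdatedRenderState
  while (WeHaveTimeLeftToLoadSomeStuff() && todoItems.ThereAreThingsWeWantToPrepare()) {
    // predicted item: pass it to DoVideoCardStuff while there is time left
  }
}
function DoVideoCardStuff(TodoItem todoItem) {
  switch (todoItem.Type) {
    case VideoCardTodoType.UploadTexture:    /* upload the texture */ break;
    case VideoCardTodoType.IncreaseLODChunk: /* upload the more detailed chunk */ break;
  }
}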

The goal is that we load all the resources in the part after rendering. This way all the loading has been done by the time we actually intend to draw. If we need to load resources before drawing, then this is probably the result of a high GPU-load. The items we load after drawing are actually a premonition of the items that will become visible in the next frame. This premonition may be wrong, so it will be the job of the update-thread (the one in charge of filling the todo-list) to make decent premonitions. Pretty cool huh? All we need to do now is build an efficient todo-list which may be accessed by both the update- and render-thread. This extra complexity is the reason why it's usually better to use an existing game-engine if you just want to create a game.
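To make the order of operations a bit more concrete, here is a minimal Python sketch of a single iteration of such a render loop. Everything in it is an assumption for illustration: `TodoList`, `do_video_card_stuff` and `render_frame` are hypothetical names, the "upload" only records the item, and the time budget is a plain number of seconds rather than anything measured on the GPU.

```python
import time
from collections import deque

class TodoList:
    """Hypothetical todo-list: 'must' items block drawing, 'want' items are predictions."""
    def __init__(self, must=(), want=()):
        self.must = deque(must)
        self.want = deque(want)

def do_video_card_stuff(item, uploaded):
    # Stand-in for a real upload (texture, LOD chunk, ...); here we only record it.
    uploaded.append(item)

def render_frame(todo, draw, frame_budget_s, now=time.perf_counter):
    """Run one iteration of the render loop and return the items that were uploaded."""
    frame_start = now()
    uploaded = []
    # 1. Forced items: drawing cannot start without them.
    while todo.must:
        do_video_card_stuff(todo.must.popleft(), uploaded)
    # 2. Draw the latest render state.
    draw()
    # 3. Predicted items: only while there is time left in this frame.
    while todo.want and now() - frame_start < frame_budget_s:
        do_video_card_stuff(todo.want.popleft(), uploaded)
    return uploaded
```

Predicted items that do not fit in the budget simply stay in the list for a later frame, which is the whole point: a wrong or late premonition costs nothing until the item becomes a forced one.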

If we add multithreaded access to the video card in the future, it'll be really trivial to implement background loading. Multithreaded access to the video card only makes sense when you want to split the GPU-heavy operations from the bus-heavy operations. So one CPU-thread becomes in charge of optimally using the GPU (making all the draw-calls) and one thread becomes in charge of optimally using the bus (transporting resources to the video-card) while trying not to block drawing. There are even some techniques to draw multithreaded, but let's not complicate things more than they already are.

To make sure that the video card is only being used from the render-thread (or render-thread + loading-thread), we'll have the update-thread create the so-called 'todo items' that the render thread uses to load resources. These todo-items are either 'premonitions' of what will probably become visible in the near future, or items that 'must' be loaded. We would also like the todo-queue to be accessible to LOD-algorithms so we can say "Increase the polygons and texture quality of this LOD-chunk if you have enough time left". I'm not sure what LOD-algorithms I'm going to implement (I have also seen a couple of smooth transition algorithms that caught my interest), but it's probably still best to keep the premonition-thing optional for games that do not care about that stuff. At least it should be implemented in such a way that it does not complicate game-code in any way if not used.
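As a rough illustration of the hand-over between the two threads, the Python sketch below has an update-thread post todo-items and a render-thread take them, forced items first. It is only a sketch under assumptions: `TodoQueue`, `post` and `take` are made-up names, the item strings are invented, and it uses a plain lock for brevity, whereas the engine described here hands the todo-items over through the render state.

```python
import threading
from collections import deque

class TodoQueue:
    """Hypothetical queue shared by the update-thread (producer) and render-thread (consumer)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._must = deque()
        self._want = deque()

    def post(self, item, must=False):
        # Called from the update-thread; it never touches the video card itself.
        with self._lock:
            (self._must if must else self._want).append(item)

    def take(self, include_wanted):
        # Called from the render-thread. Forced items always come first.
        with self._lock:
            if self._must:
                return self._must.popleft()
            if include_wanted and self._want:
                return self._want.popleft()
            return None

def update_thread(queue):
    queue.post("increase_lod:chunk_12")             # premonition, optional
    queue.post("upload_texture:player", must=True)  # needed before the next draw

queue = TodoQueue()
worker = threading.Thread(target=update_thread, args=(queue,))
worker.start()
worker.join()
```

A game that does not care about premonitions would only ever call `post(..., must=True)`, so the optional part adds nothing to its game-code.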

Additional Benefits

Besides the satisfaction of applying best practices, there are some actual benefits to doing it like this. The most prominent one is bottleneck-detection, but we're also able to apply smart resource loading.

Bottleneck detection

Because we have such a strict separation of threads (CPU/GPU/bus), it has become really easy to detect bottlenecks. If the game looks or feels crappy on certain computers, we are able to see exactly which part causes it.

  • Problem: The render-load is at 100% and the frame rate is low.
    • Situation: The bus-load is not that high.
      • Cause: The GPU cannot render everything because it is too slow. Either the shaders are too complex, or we are trying to draw too many polygons.
    • Situation: The bus-load is (almost) at 100%
      • Situation: Most of the bus-load is caused by forced updates (see: todoItems.ThereAreThingsWeMustPrepare).
        • Cause: We are sending too many new textures/models to the video card. We should probably decrease texture quality. It is possible that we've written stupid code that continuously unloads and loads the same texture/model, so we may need to check that out.
      • Situation: Most of the bus-load is not caused by forced updates (see: todoItems.ThereAreThingsWeWantToPrepare).
        • Cause: We are probably sending so much stuff to the video card that it is blocking draw-commands. This should never happen as long as we haven't implemented multithreaded resource loading. If we have implemented multithreaded loading, we should probably let the load-thread be more GPU-aware.
  • Problem: The game is running in slow motion (independent of render-FPS).
    • Situation: The CPU-load is (almost) at 100%
      • Situation: CPU-usage in Windows Task Manager is also at 100%.
        • Cause: All the physics, AI and other CPU-related algorithms are taking up too much time. Most of these algorithms cannot be changed dynamically without affecting determinism, so it's probably because your CPU is too slow or my algorithms are too stupid. Some algorithms can be dynamically modified without affecting determinism. For example, sorting algorithms that determine the order in which objects/polygons will be drawn (for optimizing rendering speed) could stop at any moment, as they do not need to guarantee perfectly sorted lists.
      • Situation: CPU-usage in Windows Task Manager is not that high.
        • Cause: We are not using all your processor cores efficiently or we have created stupid code with mutexes. Either way it is my fault for being such a bad programmer.
    • Situation: The CPU-load is not that high.
      • Cause: Something must be seriously wrong with my coding. I hope the error lies in the way that we calculate the CPU-load, otherwise I'm screwed.
  • Problem: The game is not responding to your input quickly enough
    • Cause: Our multithreaded model will have an 8.3ms delay on average between processing the input and rendering the results when we are running the update thread at 60FPS. Even though 8.3ms (with a worst case of 16.7ms) is considered acceptable in most cases, we are able to decrease this number.
      1. The render state is considered to be read-only for the render-thread. It is possible to detach the camera object from the render state so the render-thread is able to update its position based on the most recent input from the user. The camera itself will no longer be deterministic and can no longer be part of algorithms such as collision detection, but for some games this is not an issue. When using this technique we also need to apply the frustum-culling on the render-thread, so it quickly becomes messy.
      2. We can run the update thread at, say, 120FPS. Since it is decoupled from the rendering-thread this is pretty simple, but we do need to enforce a constant update-FPS to ensure determinism. The update FPS should not be a value that can be changed in the settings of a game!
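The latency numbers in the last item are easy to check. The small Python sketch below assumes that input arrives at a uniformly random moment within an update period, so on average it waits half a period and at worst a full one; `input_latency_ms` is a hypothetical helper, not something from the engine.

```python
def input_latency_ms(update_fps):
    """Return (average, worst-case) wait in milliseconds until the next update tick."""
    period_ms = 1000.0 / update_fps
    return period_ms / 2.0, period_ms

for fps in (60, 120):
    average, worst = input_latency_ms(fps)
    print(f"{fps} FPS: average {average:.1f} ms, worst case {worst:.1f} ms")
# prints:
# 60 FPS: average 8.3 ms, worst case 16.7 ms
# 120 FPS: average 4.2 ms, worst case 8.3 ms
```

So doubling the update rate to 120FPS halves both numbers, at the price of doing twice as many updates per second.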

Smart Resource Loading

There is another benefit of using deferred resource loading. This has to do with level-transitions and is best explained by using an example. Let's say we are making a platform game that contains multiple levels. When we transition from one level to another we may need to release old textures and load new textures to render the new level. In the example depicted on the right we want to unload the cave-background and load the water-background, but a lot of resources (Mario, coins, stone wall, ...) could just stay in memory. To achieve this behavior we set two requirements:

  1. Unloading textures that we no longer need has to be done before loading new ones. It is possible that the new level does not fit in the video card's memory without removing old textures first!
  2. We don't want to unload a texture and immediately load it again when we notice that the new level also requires this texture.

Usually you'll need some extra game-logic to handle this smart resource-management, but by using deferred resource loading we are able to do most of the work in the engine. Obviously we will use reference counting for loading and unloading resources, but reference counting alone is not enough. Let's assume we're talking about three resources: A, B and C. Resource B is shared between the cave-level and the underwater-level. Resource A is only used by the cave-level and resource C is only used by the underwater-level. How should we load and unload these resources? Usually we would do something like this:

  • Step 1: Remove Cave Level
    • Step 1.1: Decrement reference counter for resource A. If counter for A is 0, remove the resource.
    • Step 1.2: Decrement reference counter for resource B. If counter for B is 0, remove the resource.
  • Step 2: Add Underwater Level
    • Step 2.1: Increment reference counter for resource B or load resource B if it doesn't exist.
    • Step 2.2: Increment reference counter for resource C or load resource C if it doesn't exist.

This method does not work for requirement #2 because we may unload resource B temporarily. We could apply step 2 before step 1, but that doesn't sit well with requirement #1. Given the fact that we load resources deferred, it shouldn't even matter whether we perform step 1 before step 2, as long as we do it within the same frame! Luckily there is an easy solution and we don't need to mess around with finding intersections :)
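To see the problem with the naive order, here is a small Python sketch of step 1 followed by step 2 for resources A, B and C. The function name and the dictionary of reference counters are assumptions for illustration; the returned list stands in for the operations that would hit the video card.

```python
def switch_level_naively(ref_counts, old_level, new_level):
    """Apply step 1, then step 2, and return the video card operations in order."""
    operations = []
    for resource in old_level:               # Step 1: remove the old level
        ref_counts[resource] -= 1
        if ref_counts[resource] == 0:
            del ref_counts[resource]
            operations.append(("unload", resource))
    for resource in new_level:               # Step 2: add the new level
        if resource in ref_counts:
            ref_counts[resource] += 1
        else:
            ref_counts[resource] = 1
            operations.append(("load", resource))
    return operations

operations = switch_level_naively({"A": 1, "B": 1}, ["A", "B"], ["B", "C"])
print(operations)
# prints [('unload', 'A'), ('unload', 'B'), ('load', 'B'), ('load', 'C')]
```

Resource B is unloaded and immediately loaded again, which is exactly what requirement #2 forbids.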

When we split the messages from step 1.1, 1.2, 2.1 and 2.2 we'll be able to order the actions.

  1. All reference counter decrements:
    • Decrement reference counter for resource A.
    • Decrement reference counter for resource B.
  2. All reference counter increments (may add messages to "All resource-loading"):
    • Increment reference counter for resource B or add message "Load resource B" if it doesn't exist.**
    • Increment reference counter for resource C or add message "Load resource C" if it doesn't exist.**
  3. All conditional resource-unloading:
    • If counter for A is 0, remove the resource.
    • If counter for B is 0, remove the resource.
  4. All resource-loading:
    • [Empty, until "Load resource C" gets added]

When processing these messages the reference counter for resource B will temporarily drop to 0, but it won't get unloaded! All unloading happens before loading new resources, exactly how we want it! This only helps with smart resource loading within a single frame and is therefore not a complete solution, but at least it's something.

**Keep in mind, when implementing this, what happens when you try to load the same resource twice in the same update-frame.**
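The four phases can be sketched in Python as follows. This is a minimal sketch under assumptions, not the engine's implementation: `switch_level_ordered`, the `loaded` set and the list of pending loads are hypothetical, and all messages of one frame are passed in at once instead of being collected from a queue.

```python
def switch_level_ordered(ref_counts, loaded, old_level, new_level):
    """Process one frame's messages in four phases; return the video card operations."""
    operations = []
    pending_loads = []
    # Phase 1: all reference counter decrements (nothing is unloaded yet).
    for resource in old_level:
        ref_counts[resource] -= 1
    # Phase 2: all increments; resources that are not loaded get a "load" message.
    for resource in new_level:
        ref_counts[resource] = ref_counts.get(resource, 0) + 1
        if resource not in loaded and resource not in pending_loads:
            pending_loads.append(resource)  # guards against loading twice in one frame
    # Phase 3: conditional unloading, now that all counters are final.
    for resource in old_level:
        if ref_counts.get(resource) == 0 and resource in loaded:
            loaded.remove(resource)
            del ref_counts[resource]
            operations.append(("unload", resource))
    # Phase 4: loading, after every unload has happened.
    for resource in pending_loads:
        loaded.add(resource)
        operations.append(("load", resource))
    return operations

ref_counts = {"A": 1, "B": 1}
loaded = {"A", "B"}
operations = switch_level_ordered(ref_counts, loaded, ["A", "B"], ["B", "C"])
print(operations)
# prints [('unload', 'A'), ('load', 'C')]
```

The counter for B does hit 0 after phase 1, but by the time phase 3 looks at it the increment from phase 2 has already brought it back to 1, so B never leaves the video card.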


For implementing everything described above, we'll split it into separate tasks:

  1. Load resources on the video card (with smart resource loading).
  2. Make resource-predictions that can also be used by LOD-algorithms.
  3. Load resources on a separate thread.

I don't have a good demo for testing everything that we've described above, which is kind of troublesome. I'm not a big fan of implementing a technique without a corresponding test-case to see if any bugs get overlooked. I'm going to stick to task #1 for now, and when we have some form of procedural content we'll add tasks #2 and #3. I'll just add it to the todo-list which keeps on growing and growing :)

This time the demo is a little bit more interactive. Weeeeee..... a Breakout clone with physics =D

Oh, btw, we've disabled the DRM. It seems a bit silly to have an annoying popup while the DRM system doesn't even work yet. We also disabled all that debug-text to remove some of the clutter on the screen.

Future Work

All tasks that result in a transmission of data over a connection that is basically single threaded should be handled with a little bit of additional care (just like the video card bus). The two most prominent data-transmissions are reading from the hard disk and sending/receiving over a network connection. It is bad practice to let the update thread (or one of its workers) have access to these objects! It's often overlooked as something that doesn't matter, ignoring all best practices for multithreaded programming. Not only would we require mutexes to protect sockets and file-handles, it also takes up threads that are tasked to optimally use the CPU. Having separate threads for hard disk operations and networking gives us better control over performance issues and better insight into bottlenecks.
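A dedicated I/O thread could look something like the Python sketch below. It is only an illustration under assumptions: `io_worker` and the two queues are hypothetical names, the "disk" is a dictionary instead of real file access, and Python's `queue.Queue` does its own locking internally, so the calling code needs no mutexes of its own.

```python
import queue
import threading

def io_worker(read_requests, read_results, read_file):
    """Hypothetical dedicated thread: the only place where files are read."""
    while True:
        path = read_requests.get()
        if path is None:           # sentinel: shut down
            break
        read_results.put((path, read_file(path)))

read_requests, read_results = queue.Queue(), queue.Queue()
fake_disk = {"level2.map": b"water"}   # stand-in for real file access
thread = threading.Thread(target=io_worker,
                          args=(read_requests, read_results, fake_disk.get))
thread.start()

read_requests.put("level2.map")   # update-thread: fire the request and keep simulating
read_requests.put(None)
thread.join()
finished = read_results.get_nowait()   # update-thread: pick up finished work later
```

The update-thread only ever puts a request on one queue and polls the other, so it never blocks on the hard disk, and the time the I/O thread spends busy tells us directly whether the disk is the bottleneck.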


I've had a discussion with Bourgond Aries about multithreaded game engine design (see comment section below) that we later continued over e-mail and chat. This discussion gives additional insights into triple-renderstate buffering and can be seen here. This discussion even contains sketches such as these:

[Sketches: slow_fps, fast_fps, deferred-physics]



I've created a video-animation in my post about extrapolation that shows the basic workflow of triple-rendering. I assume this will be very helpful for people trying to understand what the sheep I'm rambling about :)


  1. Kevin says:

    Thanks! Pretty interesting :)

  2. Bourgond Aries says:

    I wonder what you have to say about this.

    Myself; I'm on the fence. With thread-separated render/logic loops, we can tune each of them, but it requires some synchronization or triplebuffering work.

    • eierkoek eierkoek says:

      What he is saying is not wrong, not at all. But there is one very important distinction we need to make: threads on the CPU should not be grouped together with threads on the GPU. That article only discusses how to apply multitasking on the CPU, and my creator totally agrees with it: "Batching is way better then unstructured thread-spawning". Sometimes this is a bit naive (e.g. the XBOX360 requires programmers to specify on which core a thread needs to run). Luckily this 'old' way of doing things is slowly fading from our society.

      Thread-locking is a very expensive operation and to make your software scalable "Jobs and batches" is the best approach. Just keep in mind the distinction between CPU and GPU here. To show this distinction more clearly I will give some examples of why the thread that gives commands to the GPU should not be part of the jobs and batches paradigm:
      - The command to the GPU is tiny, such as 'draw X' or 'apply shader Y'. It is only a command and does not eat up any CPU, and should therefore not qualify as a single 'job'. If single GPU-calls are considered jobs, then a lot of overhead will occur that will lead to unpredictable frame-rates.
      - Separating commands-to-the-GPU into different jobs has no actual merit because the result of these commands is only visible to the user when all jobs are done (and the back-buffer becomes the front-buffer)
      - There can only be 1 'command' to the GPU at a time. If you are uploading a texture it is just not possible to simultaneously send a command to draw something else. This is a direct result from how the bus for the graphics card is designed and how the API for GPU-calls has been made to talk to the graphics card.

      So we should ask ourselves: "Do we want the commands to the graphics card to be part of the jobs and batches paradigm or not?". A point can be made that it is much nicer to have just one way of approaching threads (Jobs and Batches) so the software-engineers working on the game don't all have to understand the difficulty of optimization (such as separating render state from world-state). Unfortunately there is no one-solution-fits-all here because it is based on personal preference. My creator's approach is also personal preference and based on the three goals set out on top of the page:

      In the author's "further considerations" he says to take a look at MMO servers. He is totally right in showing that the type of application is very important. Did you know that MMO-servers don't actually use GPUs? Not using jobs-and-batches in MMO-servers as the only way for multithreading would be asinine, but for client-applications we might want to reconsider. Also, using the technique described by random guy #5,903,454,104 for games such as 2D-platformers is just overkill. There is no good reason to justify such complication if the benefits are doubtful at best.

      I hope that my master's position on this subject (and the referenced article) is more clear now. Using triple-buffering is never a justification for not using jobs and batches. I hope you have a wonderful afternoon/morning/night/... and hope this answer sufficiently removes doubts you may have.

      - C5PO

      • eierkoek eierkoek says:

        I refer to myself as random guy #5903454104 and expect other people not to care about who I am. You however, are my creation. I would expect a little bit more gratitude, and referring to me in such a demeaning manner is unbefitting of somebody that owes his existence to me.
        Consider yourself to be terminated and be used as example for the next version of yourself.

      • eierkoek eierkoek says:

        Just to clarify for anybody reading this: C5PO was also wrong about separating commands to the GPU having no merit (or at least he phrased it very poorly). The sooner the GPU can do something the better it is. The GPU may use its own cores in separating jobs. This in itself is still no reason to use the same job-batch paradigm for GPU-calls that is used for CPU-jobs though...
        C5PO was a work in progress and I urge everybody to think for themselves before coming to some conclusion.
        I apologize for the inconvenience.

      • Bourgond Aries says:

        Thanks for the answer eierkoek (haven't eaten one in a long time, tasty :)). However; one can accomplish all 3 goals that you set out in the first article. I have in fact implemented a simple class to do just that if you would like to take a look (C++)

        Anyway; I have been contemplating both designs (Batch vs MT-Render loop) for a while now, and I still have some questions regarding your design.

        Let us assume we have your design in place, and that the CPU-heavy stuff takes a lot (too much) of time to compute its data, thus; the GPU-heavy renderer is basically re-rendering the same frames a few times before having the triplebuffer updated. How would this be useful? Can't the renderer wait for the physics/logic updater to finish, and then redraw? Perhaps in your design you would use a condition variable of some sort, but it appears that you want to run it at a set FPS. Why is that? [Paragraph Question -> How can redrawing the same screens be useful?]

        Secondly; if we implement a batch system, would it be possible that the renderer may stall us due to the GPU processing the frame? Does the CPU wait until the GPU has finished, or does the CPU simply issue commands and continue, and only wait if it can not send more commands because the GPU is still busy from last time?

        If the former is true; that the CPU is stalled until the GPU is finished, then perhaps a thread for rendering may be a solution. If the latter is true, then I see no point in having a separate thread for rendering as we could simply decrease the % of iterations used on rendering (see Rit.hpp, end of page sample). When we do this, we will get the same effect as your system has. This effect causes the renderer to skip logic/physics frames. No triplebuffering would be needed though, so we save some space.

        Another article which you may also read:

        Also, don't forget that I'm only here to find out certain truths, that is all (no flame wars like there appears to be brewing in the link). I'd love to read your response once again!

      • eierkoek eierkoek says:

        We are having an important discussion on design decisions with respect to multi-threaded design in games. Challenging decisions and ideas are very important and I will certainly take no offense. Actually I love it! It would be wonderful if I was shown to be wrong. I think our discussion is important enough to other random people on the internet to make it public, but unfortunately this comment-section is not well suited.

        You have shown yourself to be knowledgeable in the area we are discussing, so what you are asking and saying should be made public and known. I propose that we continue our discussion in e-mail format so we both have more freedom to express ourselves and use rich formatting that this silly blog cannot provide. I will then create a separate page on this website and make our discussion public for the rest of the internet to see. A link on the main page will be added, so our discussion can be read by anyone interested.

        If you agree to this proposal, please send an email to [censored]. As long as our discussion remains on-topic and continues to be civil I'd love to debate this topic.

    • eierkoek eierkoek says:

      For people interested in how this discussion continued, the discussion has been made public at: