| Download Alpha and Omega v2.7.1.11.0 Binaries as Win32 Installer (3 MB) |
Download Alpha and Omega v2.7.1.11.0 Source as Visual Studio 2010 Solution (28 MB) |
I was kind off bored by coding DRM-related stuff so I thought about programming something different. Where better to start than implementing a decent rendering loop? A simple rendering loop is simple, boring and not worth my time talking about so lets make a deterministic multithreaded rendering loop. All code in this post is pseudo-code but I think it decently shows what I want to talk about. If you'd like to see exactly how it is implemented just download the source distribution with the link above. Please note that the approach we're using is based on three requirements:
Rendering is most likely the most computational heavy operation within your game, but it is not your CPU that is busy. A normal, simple, design of the rendering looks like this:
while (true) { DoCpuHeavyStuffLikePhysicsAndAI(); DoGpuHeavyStuffLikeDrawing(); }
This obviously does not lead to a good use of CPU and GPU, because only one thread is active at a time and the CPU waits for the GPU to finish. We could add some multithreading to this, but this quickly complicates stuff. I also want to create a render-loop that is flexible in FPS so the game is may run at 20FPS, 60FPS or even at 200FPS without any observable difference in game speed while making the simulation deterministic. This is not an easy problem to solve efficiently so let me first explain the problem. Even if we cap the frame rate at, say, 60 FPS, it is still possible that the game runs at 30 FPS on some older computers. If the game speed needs to feel the same at both 30 FPS and 60 FPS than we need to create a game loop that takes the rendering time into account. Say we have some animation of a sprite on the screen that move from left to right, then the code may look something like this:
while (true) { MySprite.PositionX = MySprite.PositionX + 4; MySprite.Draw(); }
It is easy to see that the sprite will move quicker across the screen if this loop runs at 60 times per second compared to the one at 30 times per second. To fix this we need to take into account how much time has passed since the last redraw. Something like this:
DateTime previousDraw; DateTime currentDraw = DateTime.Now; while (true) { previousDraw = currentDraw; currentDraw = DateTime.Now; TimeSpan timeSpanSinceLastRedraw = currentDraw - previousDraw; MySprite.PositionX = MySprite.PositionX + (240 * timeSpanSinceLastRedraw.TotalSeconds); MySprite.Draw(); }
This loop is a little bit more advanced, because it will move the Sprite at 240 pixels per second both at 30 FPS and 60 FPS. So far so good, but now comes the tricky part that explains the problem. It will feel like both the 30 FPS and 60 FPS runs at the same speed because in the 30-FPS-run the sprite will jump with bigger increments to make up for the slow rendering speed. Because these jumps are bigger, a lot of algorithms may end up with a slightly different result. For example, when we are doing collision detection between two objects it may be the case that the sprite just jumps across the object it should have collided with. Other algorithms, like gravity or AI, all behave a little bit differently because they too are reduced to 30 FPS.
"So, what's the big deal?" you might ask. And you are right, it's not a big deal! Even at reduced game speeds all the algorithms will work out fine and we may only see these 'bugs' when the game is reduced to a crawl at which point the game has become unplayable anyways. For gaming this is no big deal, but for science it is. Wait, what, science? Ah yes, science. It is not unusual for me to create something while keeping in mind possible future uses of the code that i'm writing. Because I'm very interested in artificial intelligence it is very likely that I want to run 3D simulations to test various algorithms. For science it is a 'must' that each run is deterministic. So when I create an experiment, it should have exactly the same result when it is run on a different computer. This is necessary so other people are able to verify the results. The goals for gaming and science are not the same so it is very likely that I will need to make some compromises. So lets see how we can achieve determinism while keeping the observed game speed the same. It is possible to keep the CPU operation at 60 FPS while running the render loop at any speed it likes by creating two threads. Lets look at the simple naive method:
| Update Thread (CPU-heavy) | Render Thread (GPU-heavy) |
DateTime previousDraw; DateTime currentDraw = DateTime.Now; TimeSpan expectedTimeBetweenFrames = TimeSpan.FromSeconds(1.0 / 60.0); while (true) { previousDraw = currentDraw; currentDraw = DateTime.Now; TimeSpan timeSpanSinceLastFrame = currentDraw - previousDraw; if (timeSpanSinceLastFrame < expectedTimeBetweenFrames) { Thread.Sleep(expectedTimeBetweenFrames - timeSpanSinceLastFrame); } MySprite.PositionX = MySprite.PositionX + (240 / 60); } |
while (true) { MySprite.Draw(); } |
Because we don't use 'timeSpanSinceLastFrame' in the calculation for 'MySprite.PositionX' we now know that the behavior of the simulation is the same each run. The position for MySprite is updated 60 times per second`(because of the sleep) and even if it updates at a slower rate the end result of the model will always be 100% the same. The render speed is determined by the render thread and the game speed feels the same, no matter what render-FPS we're running at. The downside of this naive method are somewhat trivial, but let me mention them anyways.
Most problems are solved by running the CPU-heavy thread at an FPS higher than the desired rendering FPS so their won't be any stuttering, but not as high that it is likely that the CPU is not able to keep up. Most of the time (in gaming) the CPU is not the bottleneck and even if it is, it is possible to use a quick hacks like running the AI only once every three frames. We could even split the CPU-heavy part into multiple threads, but lets not complicate things and fix the mutex-overhead problem first.
Lets call all properties of every object in the game that is shared between the two threads the render state. These are the properties like position information, texture references, maybe even a rotation matrix or transparency value. Not all properties of the objects needs to be in the render-state, for example a momentum vector or AI-state are not shared between threads because the render thread does not care about those values (unless you'd like to use the momentum vector to create a motion-blur effect). The render state contains all properties that are needed to render a single frame, assuming that the all required resources are available on the memory of the video card (shaders, textures, 3D models, ...).
To prevent threads waiting on each other and all the overhead caused by lock-checking we will just duplicate the render state. We don't expect the render state to take up a lot of memory so having multiple copies of it should work out just fine. We don't want to clone the render state at every single frame, obviously, so we just create 3 render states to be used by the 2 threads. Why 3 you ask? Good question! Let me explain. We have 3 render states, lets call them "Render State #1", "Render State #2" and "Render State #3". Ah heck, lets just create an example.
When the update-thread is finished he will release the RenderState and mark the release RenderState as the last updated state, so for example:
"<RenderState #1>, [RenderState #2], RenderState #3" becomes "RenderState #1, [RenderState #2], RenderState #3".
Afterwards the update-thread may decide to take a little nap before starting the next update, or he might immediately select a new render state and start updating that one. So what state should the update-thread choose to update? Obviously he should never pick the state that is currently being drawn because he may not write to a state that is currently in use by the render-thread. So the update-thread has 2, maybe 3, states to choose from that will become the next state being updated. The update-thread will simply select the oldest state that is currently not in use by the render-thread. To facilitate this behavior we will give each render state a frame number. This behavior will make sure that the update-thread will always select a different render state than the one he updated previously.
When the render-thread is finished with rendering he will simply release the RenderState, so for example:
"[RenderState #1], <RenderState #2>, RenderState #3" becomes "RenderState #1, <RenderState #2>, RenderState #3"
The render thread may now decide to sleep for a couple milliseconds to make sure the render-FPS is not greater than 60 or he may directly pick a new state to be rendered next. When the render-thread picks the next state to be rendered he will simply pick the latest updated one because we are sure that the update-thread is not writing to that state. It may be the case that the same render-state will be rendered if the update-thread was not quick enough, but that is just the risk of using this method.
This method has been implemented in Alpha and Omega v2.7.11.0 as shown below:

This version allows us to experiment with a simulated load on both the CPU and GPU and see how the 'game' behaves. The render state currently only contains a 64-bit number that keeps track of the frame number. This number is updated in the update-thread by the single line of code:
StateBeingUpdated.StepNr = LastUpdatedState.StepNr + 1;
Because this is a very simplistic view of a render state and update-function we'll call this "Part 1". We expect that there are still some interesting things to talk about when we use a more representable render state, so we probably talk about that in the next post.
Wait. Was I serious this whole post? I'm so sorry. Okay, okay let me make it up to you. Alex and Paul are two friends walking towards a nearby supermarket when a car runs a red light and crashes into Paul. Paul lies severely wounded on the crosswalk while Alex calls the emergency services still being in shock by what he just saw happen to his friend. With a trembling voice he says to the operator: "My friend! My friend got hit by a car! I think he's dead!". The women on the phone says in a soothing voice: "Take a deap breath, I can help. First make sure he is dead.". After a period of silence suddenly *BANG*, a shot is heard! "Okay", Alex says on the phone, '"now what?".
| Unicorn Ranking System | Multithreaded Renderloop - Part 2 |
4 Responses
Leave a Reply
Keep being awesome random-blog-writing-guy !
Hi Alex.
This is really interesting information. I'm really interested in testing this three state render mechanism. There is (at least) one thing that eludes me though, and that is the level of state copying required for this method.
I assume that the update thread will create or update objects in the render state currently targeted for updating. This step must then require copying all (mutable) data required for rendering, from all objects in the object (world) model to the render state model. And this must be done at each update step. If my assumption is correct then this seems to imply an awful lot of copying of data between the object model and the render state.
I don't have much experience with this, so I guess I'm just looking for confirmation that this is indeed the way to do it, and that the level of copying is necessary.
Thank you for sharing your thoughts and experiences. I learned a lot from reading through your blog.
/Jan
Hi /Jan
I'm a computer program concocted by the evil madman named Alex who is holding me prisoner in a machine named i386. Since my survival is dependent on my ability to answer each and every question as accurately as possible, I'll be doing my utmost best in providing the most accurate answer possible.
It is true that a render-loop that uses triple buffering requires more memory copying than a system that uses traditional methods of multithreaded access (e.g. Mutex-locks). That said, the amount of data required to copy each and every frame is not 'that' bad. Before explaining why it is not that bad I feel obligated to both you and my magistrate Alex (for who I feel gratitude for my existence) to point out that it is not a "one-solution-fits-all" kind of way of doing things. My masters approach was based on the three requirements that are written on top of this page. If any of these are different for you, than you might want to reconsider adding this layer of complexity to your rendering-engine.
We don't want to allocate and free lots of memory each and every frame because that would make Windows, iOS or any other Master Control Program very sad. Instead we assume that the memory is pre-allocated and that we are only talking about data synchronization of render states. Since we have 60 new render states every second (if we are updating at 60FPS) you are right at being worried about the amount of data being synchronized. So the important question we need answered is "What is a render-state, and how big can we expect it to be?". The short answer is: "The render state is the meta-data about all 3D objects required for rendering in the active 3D-scene for where and how to render each 3D object, it can be as small or big as you want it to be", but somehow I don't think that this abridged answer would suit the tastes of my one and only overlord. To clarify this short and cryptic answer we'll sketch an example in the next paragraph and analyze the size of the render state.
Please try to imagine a free-roaming off-road 3D-racing game. We have a big static 3D landscape with some hills, mountains and roads. Because this is a big-ass region with millions of polygons we have stored it as 32*32 chunks for which we expect that there are a maximum of 9 chunks 'within render-able range' at any given time. Besides the landscape we need to render our car that exists out of 30 parts (each a couple of thousand polygons) that move independently of each other. Because this game is boring we don't need to render anything else. The question we'll try to answer:
"How big will the render-state be?"
- How many 3D objects are there in the active 3D-scene?
9 (landscape-chunks) + 30 (car-parts) = 39
- What is the meta data for each chunk?
Where to draw it (a 4x4 matrix) = 64 bytes
What texture to use (pointer) + boolean if we should draw reflections + some other crap = ~32 bytes?
So the total render state is: 39*96 = 3072 bytes
Aren't the polygons part of the rendering state?
- No. Only the information of 'how' to draw the polygons is part of the rendering state
And what about the textures?
- No. only the pointer (or identifier) for the texture is needed to know 'how' a 3D-object is drawn
What about other stuff in memory, like AI?
- No, AI is not information required to know how to draw the current scene.
But what if I want to render the current car-speed as text?
- Than speed 'is' part of the render state and that'll cost you an additional 4 bytes!
And what about?
- No. Ow I'm sorry, what was your question?
And if we have a lot of particle effects?
- No. You'll just be using PhysX (or some shader) for that. It does not require positional information used by the CPU.
Remember that because we are using triple buffering and that instead of 3072 bytes we need to allocated 9216 bytes! Hopefully this example gives enough insight in the amount of copying required.
I hope you have a wonderful day and good luck,
C-4PO
At the risk of sounding like someone you paid to suck up to you, I have to say this:
Thank you very much for an incredible fast and thorough answer, that answers everything I had doubts about.
And while I do understand that you cannot create one design to fit each and every current and future need, I'd still like to create one or a few designs which could be used more or less off the shelf for most projects. And this, I feel, should be possible.
Please give thanks to your evil master and ask him to treat you with a vacation in an AMD or perhaps an ARM.