Multithreaded Game Engine

Sorry, no new binary release. The last one can be found here.Sorry, no new source release. The last one can be found here.

Introduction

This is a discussion between me and Bourgond Aries, from Norway, about multithreaded game engine design made public for the world to see. We do not discuss all forms of multithreaded game-engine design, instead we focus on double-state (Bourgond's point of view) vs. triple-renderstate (my point of view). Multithreading is also possible with just a single state, but that scenario is not discussed here. Even though a single-state design is not discussed here, it is my personal recommendation to read (and understand) about single state multithreaded game-engine design and it advantages and disadvantages before diving into this discussion.

The double-state keeps two versions of the world in memory so one can be written to while the renderer is reading the other buffer. The triple-renderstate does not talk about a triple back buffer on the video-card, so it's NOT this. For some background on on the triple-renderstate I recommend reading my previous posts on the topic first: part1, part2, part3 and part4. Keep in mind: a triple-renderstate still has but one world, physics and AI state.

Luckily I found an impartial referee with google image search, who was willing to be our discussion leader. So is everybody ready? Is the popcorn buttered? Are the pets fed? GO!

Kingdom of NorwayVS.Kingdom of the Netherlands
 
 
versus
If you don't think Vikings are cool, please leave... Who is that guy? His video's are pretty awesome!

Vikings are soooo awesome, how can the Dutch even compare? Is this even a fair fight? How can wooden shoes and tulips even be useful!?! The only thing the Dutch have going for them is the Dutch-Deutsch-Holland-Netherlands-Kingdom-Country-Confusion (DDHNKCC for short). Can the Dutch confuse the Vikings enough to win? Who knows, let's find out!

Well actually, it is just a fair discussion to get to the bottom of the double-state and triple-renderstate dilemma. In what scenario's is a triple-renderstate viable and justified? Ready gentlemen!?! No elbows, no eye-poking... Fight!

Person1 (referee)

Summary: Bourgond Aries wants an opinion about this article, which talks about how dedicated subsystem threads should be avoided and how job-batches and workers might be the best solution.

 
Person1 (Bourgond)

I wonder what you have to say about this. http://seanmiddleditch.com/journal/2011/04/multi-threaded-game-engines/

Myself; I'm on the fence. With thread-separated render/logic loops, we can tune each of them, but it requires some synchronization or triple-buffering work.

 
   
 

Summary: The tin man agrees that using jobs and workers is generally the best way to go, but notes that some parts, like the GPU-bus, are intrinsically single-threaded and that the design should reflect that.

Person2 (referee)
 

What he is saying is not wrong, not at all. But there is one very important distinction we need to make: threads on CPU should not be grouped together with threads on GPU. The idea discussed in that article only discusses how to apply multitasking on CPU and my creator totally agrees with them: "Batching is way better then unstructured thread-spawning". Sometimes this is a bit naive (e.g. The XBOX-360 requires the programmers to specifically say on what core a thread needs to run). Luckily these 'old' way of doing dings are slowly fading from our society.

Thread-locking is a very expensive operation and to make your software scalable "Jobs and workers" is the best approach. Just keep in mind the distinction between CPU and GPU here. To show this distinction more clearly I will give some examples of why the thread that gives commands to the GPU should not be part of the jobs-with-multiple-workers paradigm:

  • The command to the CPU is tiny such as 'draw X' or 'apply shader Y'. It is just a simple command and does not eat up any CPU and should therefor not qualify as a single job. If single GPU-calls are considered jobs than a lot of overhead will occur that will lead to unpredictable frame-rates.
  • Separating commands-to-the-GPU on different jobs have no actual merit because the result of these commands are only visible to the user when all jobs are done (and the back-buffer becomes the front-buffer)
  • There can only be 1 'command' to the GPU at a time. If you are uploading a texture it is just not possible to simultaneously send a command to draw something else. This is a direct result from how the bus for the graphics card is designed and how the API for GPU-calls has been made to talk to the graphics card.

So we should ask ourselves: "Do we want the commands to the graphics card to be part of the job and workers paradigm or not?". A point can be made that it is much nicer to have just one way of approaching threads (Jobs and Workers) so the software-engineers working on the game don't all have to understand the difficulty of optimization (such as separating render state from world-state). Unfortunately there is no one-solution-fits-all here because it is based on personal preference. My creators approach is also personal preference and based on the three goals set out on top of this page.

In the authors further considerations he says to take a look at MMO-servers. He is totally right in showing that the type of application is very important. Did you know that MMO-servers don't actually use GPU's? Not using jobs-and-workers in MMO-servers as the main way for multitasking would be asinine, but for client-applications me might want to reconsider. Also, using the technique described by random guy #5903454104 for games such as 2D-platformers is just overkill. There is no good reason to justify such complication if the benefits are doubtful at best.

I hope that my masters position on this subject (and the referenced article) is more clear now. Using triple-buffering is never a justification for not using jobs and workers. I hope you have a wonderful afternoon/morning/night/... and hope this answer sufficiently removes doubts you may have.

-C5PO

C5P0 (C5PO)
   
 

Summary: Random guy #5903454104 enters the ring, the tin man is tapping out.

Person2 (referee)
 

I refer to myself as random guy #5903454104 and expect other people not to care about who I am. You however, are my creation. I would expect a little bit more gratitude and referring to me in such demeaning manner is unbefitting of somebody that owes his existence to me.

Consider yourself to be terminated and be used as example for the next version of yourself.

Person2 (#5903454104)
   
 

Summary: That random guy gives clarification about how we should still sent draw-commands to the GPU as soon as possible to make the best use of the GPU's cores.

Person2 (#5903454104)
 

C5PO was wrong about separating commands to the GPU having no merit (or at least he phrased it poorly). The sooner the GPU can do something the better it is. The GPU may use his own cores in separating jobs. This in itself is still no reason to use the same job-batch paradigm or GPU-calls used for CPU-jobs though... C5PO was a work in progress and I urge everybody to think for themselves before coming to some conclusion.

I apologize for the inconvenience.

Person2 (#5903454104)
   
Person1 (referee)

Summary: Bourgond Aries:

  • states that there is a different way of achieving determinism with a decoupled updater/renderer that also uses a limited amount of mutex checking;
  • asks how it can be useful to redraw the same frame and whether the CPU needs to wait if the last frame has not yet been rendered;
  • refers to an article that uses single-state multithreading.
 
Person1 (Bourgond)

Thanks for the answer eierkoek. However; one can accomplish all 3 goals that you set out in the first paragraph. I have in fact implemented a simple class to do just that if you would like to take a look (C++)

Anyway; I have been contemplating both designs (Batch vs MT-Render loop) for a while now, and I still have some questions regarding your design.

Let us assume we have your design in place, and that the CPU-heavy stuff takes a lot (too much) of time to compute its data, thus; the GPU-heavy renderer is basically re-rendering the same frames a few times before having the triple-buffer updated. How would this be useful? Can't the renderer wait for the physics/logic updater to finish, and then redraw? Perhaps in your design; you would use a condition variable of some sorts, but it appears that you want to run it at a set Fps. Why is that?
Paragraph Question-> How can redrawing the same screens useful?

Secondly; if we implement a batch system, would it be possible that the renderer may stall us due to the GPU processing the frame? Does the CPU wait until the GPU has finished, or does the CPU simply issue commands and continue, and only wait if it can not send more commands because the GPU is still busy from last time?

  • If the former is true; that the CPU is stalled until the GPU is finished, then perhaps a thread for rendering may be a solution.
  • If the latter is true, then I see no point in having a separate thread for rendering as we could simply decrease the percentage of iterations used on rendering (see Rit.hpp, end of page sample). When we do this, we will get the same effect as your system has. This effect causes the renderer to skip logic/physics frames. No triple-buffering would be needed tho, we thus save some space.

Another article which you may also read: http://blackhole12.blogspot.no/2012/05/multithreading-problems-in-game-design.html

Also, don't forget that I'm only here to find out certain truths, that is all (no flame wars like there appears to be brewing in the link). I'd love to read your response once again!

 
   
 

Summary: No information is given, just a note that it would be best to continue the discussion on e-mail.

Person2 (referee)
 

We are having an important discussion on design decisions with respect to multithreaded design in game-engines. Challenging decisions and ideas are very important and I will certainly take no offense. Actually I love it! It would be wonderful if I was shown to be wrong. I think our discussion is important enough to other random people on the internet to make it public, but unfortunately this comment-section is not well suited.

You have shown yourself to be knowledgeable in the area we are discussing, so what you are asking and saying should be made public and known. I propose that we continue our discussion in e-mail format so we both have more freedom to express ourselves and use rich formatting that this silly blog cannot provide. I will then create a separate page on this website and make our discussion public for the rest of the internet to see. A link on the main page will be added, so our discussion can be read by anyone interested.

If you agree to this proposal, please send an e-mail to <censored>. As long as our discussion remains on-topic and continues to be civil I'd love to debate this topic.

Person2 (#5903454104)
   
Person1 (referee)

Summary: Bourgond Aries says hi on e-mail, notices that guy #5903454104 is the same guy that was having the discussion on the previous mentioned article.

 
Person1 (Bourgond)

Hello,

it's me; Bourgond Aries. I usually do not use e-mail a lot, we could have a quicker debate using Skype. I assume you speak Dutch due to your user name, so we would be able to discuss the philosophy behind multithreaded game engines more deeply.

If it's alright with you; my Skype ID is "<censored>".

P.S. judging from your presumably similar Full-Metal Alchemist picture, you must have been the guy debating with the author of http://blackhole12.blogspot.no/2012/05/multithreading-problems-in-game-design.html.

 
   
 

Summary: No information is given, just some general ideas on how to approach the discussion we're about to have.

Person2 (referee)
 

Hi Bourgond,

I'm not sure if you're situated in the Netherlands (as I am), but it's currently 05:30 AM so I'll continue our discussion tomorrow (and respond to your comment). I was indeed the guy discussing with Erik McClure and it became a (bit too) heated discussion. He claimed certain things that were just untrue and has shown not to have read (or understood) the problem/solution discussed and claimed his approach was better in every way. When I thought it was senseless to continue, I just summarized the differences between our approaches and moved on. I was a bit overwhelmed by his aggressive way of arguing and a bit disappointed that it was not a conversation that has lead to new insights. I must admit that I was offended by his (to me arrogant) stance and my response where I used his own words against him was very rude of me.

The points you made in the comments on slapware.eu are all based on sincere doubt about techniques used and are not unfounded. These points deserve to be addressed and I hope that our discussion will lead to insights that will also benefit other people. For this reason I like to use this e-mail conversation we're about to have and copy-paste it on the slapware.eu. I will not infringe on privacy (obviously) and will not put anything online that you don't agree too (for instance I would never post your real name). For this reason I'm not sure if using Skype will be the best approach.

I don't mind discussing the topic in the Dutch language, but that would imply that I need to translate our discussion if I want to make it public to a wider audience. If you're better at expressing yourself in Dutch, then I wouldn't mind of course. My idea is that multiple people have the same concerns as you have on making a determination on such difficult topic. Having a public debate may help others in forming well-founded opinions.

I'll get back to you after I've gotten some sleep :)

Person2 (#5903454104)
   
Person1 (referee)

Summary: Bourgond Aries

  • notices how the triple-render-state design may have a different update iterations-per-second (IPS) compared the the frames-per-second (FPS);
  • proposes to introduce a constant 'iteration_weight', so the IPS can be changed by the programmer during development or to be used for a crude slow-motion effect;
  • explains the basics of his design thus far; a double-state design based on state machines.
 
Person1 (Bourgond)

Hey Alex,

I actually live in the same time zone, but in Norway. I guess we could stick to English, doesn't really matter to me. And regarding the e-mail, that's fine. We both get to type our fully fleshed out arguments instead of partial chat comments which I suppose noone would care to follow.

Let me start by taking up an important topic which you seem quite focused on; determinism. I have read some of your comments regarding determinism; you claim it to be quite important in RTS games and MMO games. I suppose it would be important to have on the client side; as each client should process the same kind of logic. Of course, most MMO's check if this logic is valid through hashing. An example of this is the game MapleStory. This of course; does not at all invalidate determinism; it's still needed.

Your engine seems to claim that it is fully deterministic, regardless of rendering speed. You have different iteration speeds in the renderer and the physics thread. This implies that; if we were to extract the physics thread, we would find certain traits:

  1. Only itself is able to change its own state
  2. It does not use time-based steps
  3. It uses iteration-fixed steps

A possible #4 would come into place:

  • The controller (user) may submit events, which could alter the state.

Luckily, these state changes can be recorded and stored. Therefore, we could store a replay file that would replay all of the physics in the exact same manner. Even if random number generators (From now on RNGs) were to be used, their seeds could be stored as a part of the logic, and thus, we would retain determinism. I assume that this is the determinism you talk about.

This, however, implies a limitation in our physics calculations. We would have a fixed incrementation on the units that are manipulated inside the physics function. The speed at which physics runs can only be altered by calling less or more frequently. This may introduce additional overhead. Suppose we start a game and we say 60 Iterations Per Second (IPS) is fine and dandy for the physics simulation. Suddenly, a new idea pops into the programmers/leaders' head: " Hey! What if we made the game more action-rich?!". To achieve this, physics IPS must be changed to 180 IPS. Surely we have determinism, but we're performing redundant calculations. If the physics function contains an incrementation of: "x += 2.5f;", we could simply change it to "x += 7.5f;" and keep the IPS at 60. This method would reduce redundant overhead. Sadly, the 60 IPS version with an incrementation of 7.5f versus the 180 IPS version with an incrementation of 2.5f should according to floating point imprecision yield non-equal behavior. Especially when floats are not so rational-friendly like the previous ones.

I therefore propose a hybrid stepper, not based on elapsed time, but based on the following: "How much should a single iteration affect all units?"

  • x += 7.5f; can now be rewritten as
  • x += 2.5f * iteration_weight;

Where iteration_weight is a constant inside the physics function and should be used by all per-iteration value changes. A more realistic example would be how fast an a living entity becomes hungry:

  • hunger += 0.3f * iteration_weight;

At 60 IPS, this implies += 18 per second. We will not change IPS to 1 and iteration_weight to 60, since we would like to have non-abrupt updates of the state, which will be reflected by the renderer.

So what do I think this all solves; what does it do?!

  • This method allows the programmer to fine-tune physics to be around 60 IPS (which should be the ideal IPS), or any other IPS of their liking.
  • Physics can be run as fast or as slow as the user or programmer would like.

The programmer could implement a "slow-motion" effect by temporarily suspending the physics updater to a 10-fps state. If the game is really slow-motion based, perhaps the updated can normally run at 600 IPS, and slow-mo into 30 IPS. Or the developer can sacrifice determinism and change iteration_weight during run-time in order to not lose too much speed over ~600 possible redundant calls per second.

This brings me to another question:

  • Should iteration_weight be modifiable during run-time?

As we can see, it is mostly multiplied by constants; however; I believe that in more complex scenarios, a tonne of run-time variadic variables will be used. Therefore we can argue that iteration_weight should not be a constexpr (true constant in C++11, the compiler can optimize constexpr equations, which will make our physics iteration a little faster, at the expense of flexibility.) I think I'll bring this up later because all of what I have written now points to the 2 designs that are important to us.

  1. Your MTR Design
  2. Batched Linear Subsystems (BLS)
  3. A mixture of MTR and BLS

I have currently implemented both systems, but I have only been able to code a meaningful bench-marking application in the latter of the 2. Yes, #1 gets very complex very quickly, with mutexed command queues and triple-buffers sharing states all over the place, I have no doubt that we could examine your implementation more as I suppose yours is more fleshed-out than mine.

With the BLS; there is conformism to the 3 points that you outlined here.

  1. The game engine should be deterministic.
  2. It should be able to run at any FPS (slow, or insanely fast), without ever having a consequence on the game speed.
  3. We don't want to do mutex checking on objects we want to render. Ever.

I implemented this using my Rit class (found here)

while (glocal.active == true)
{
  glocal.ips.limit();
  delegateInput();
  if (glocal.rit.isFirstReady())
    delegateLogic();
  if (glocal.rit.isSecondReady())
    delegateRender();
  std::cout << glocal.ips.getDelay().count() << std::endl;
}

glocal.ips is a class that can be found here which is a minimalistic Iteration Per Second (IPS) limiter. Here, the IPS is set to 60, which is the same as the highest of the 2 possible rit values. glocal.rit.isFirstReady() will check if the ratio allows function 1 to run. If it does, it increments the run counter inside rit and returns true. This way, we can approximate (and sometimes even keep a perfect) ratio of distribution. Since IPS is always the highest of the two possible distributions; this while loop will always run distribution_1 times delegateLogic() and distribution_2 times delegateRender() per second. FPS can thus be changed to whatever liking, without ever affecting the actual physics speed. I did a simulation where the physics loop was calculating Fibonacci numbers recursively, which is quite an expensive operation. The interesting results are that if we keep rendering at a constant 60 IPS, and increase the physics IPS to about 400, then we will notice update lag. If we increase the frequency at which the renderer runs, we basically decrease the ratio between the renderer and physics, yet physics will still be run at IPS amount per second because it is bigger than render IPS, and thus controls the iterations per second. When the Fibonacci calculations are removed. A lag will only occur on my computer when logic updates at 6000 IPS and rendering at 60. The lag disappears when rendering approaches ~180 IPS. Another interesting observation is that rendering distributions of 30 and logic distributions of ~80 IPS show significant lag. I am unable to explain this. This lag is removed when the rendering distribution is elevated to around 38-42, then the range of possible IPS values goes up very high (over 1000, not 9001 tho :().

What do these results tell us?They tell us that we can only keep a smooth updating frame within the bounds of how many iterations we can actually complete per second. If the distribution is; logic~30 and render~60, that means that we will iterate 90 times per second. 60 times into render, 30 times into logic.

Thinking about that; what good does it do to make render update more often than logic? My assertion is that screen_refresh_rate >= rendering_distribution <= logic_distribution. Anything else can be considered a waste of processing power.

Another problem I have been thinking about is the copying of data using your double buffering technique. Surely the geometry pointers can be valid, but what if the geometry (vertices) themselves change whilst your renderer is reading from that pointer? Will the geometry data be copied every time? Perhaps the BLS can be noted has the same problem, however I intend to fix this problem quite simply, without any overhead compared to a purely single-threaded engine.

My solution builds upon the principle of the state machine; that is: there is a previous, complete world state, and there will be a next world state evolving from this. Your first question should be: "Wtf?", alright, code is better. At the end of each physics iteration we have: "oldstate = newstate;". All data is copied from the newly computed state into the old state. Why?

  1. The worker threads in physics can all concurrently access the old state because noone writes to it.
  2. We want the units to perceive the world relativistic to their previous snapshot.

I shall elaborate on #2, as I imagine it being a little hard to understand from the previous statement. Suppose there are 5 units on the world state array. Every unit requires a reference to the world state so it can compute force, actions, paths, AI, or whatever else you can imagine. Suppose units are planetary bodies. If we compute a new force on unit_0 and compute its velocity vector, and increment its position by that vector, then units_1, 2, 3... would all compute based on unit_0's current position. Actually, they should compute their forces based on the previous frame because the distance that unit_0 perceived between the other units is different than the distances perceived by units 1-4 to unit_0! This implies that Newton's 3rd law of motion! (Third law: When one body exerts a force on a second body, the second body simultaneously exerts a force equal in magnitude and opposite in direction to that of the first body.). Using an immutable old-state would fix this, new_units would be reading from the old state whilst writing to themselves; yay Newton's law is (partially - ugh, floating points... -) preserved!

Thus; I consider the BLS threading model to give us virtually 0 overhead. Batching the tasks should not be a problem as that would take only a few pushes of functions and legitimate iteration values for each thread.

OOowey! Did I really just type all this? I guess I have no specific topic really, since it feels so interwoven. All of it! I have so much more I could write down, but perhaps I need to give you some talking space as well, hehe. These are just random thoughts that popped into my head whilst writing, I think it's good as a kickstarter of the discussion regarding a deterministic, lock-free, concurrent implementation of a game-loop. Well, 07:35AM, guess it's time for bed!

 
   
 

Summary: Random guy

  • shows the effects of running the update-thread at a different speed compared to the render-thread he references the video from part 3 to demonstrate extrapolation;
  • also thinks using batched linear subsystems (BLS) is a good idea;
  • thinks the use of an 'iteration_weight' variable is a good idea, but only as a constant;
  • want to use determinism for minecraft-like games;
  • based his triple-render-state design on the observation that the render-state (as opposed to the world-state) doesn't need to be that big and that this allows him to completely decouple the renderer from the updater;
  • notices that the design is a reasonably complex and not suitable for every type of game;
Person2 (referee)
 

For everyone that has the endurance to follow our discussion: "You should be awarded for not giving up!". So sit back, listen to this song, and give yourself a somewhat inappropriate smile:

Dancing-pedo-bear-1
Dancing-pedo-bear-2
 
Okay let's continue... I understand what you are saying. The techniques you described are exactly those as taught in school and I too am aware of these methods for doing things.

To resolve some of the questions asked, I'd like for you to take a look at the executable of "Alpha and Omega 2.7.1.13.0". This executable still had some bugs, but it was meant to experiment with things like render-rate and update-rate to see the effect is has on the simulation. The video made of v2.7.1.13 is a bit blurry, so let me also show you a printscreen:

DeterministicPrintscreen

The video was meant to show the following properties:

  • If the render-rate is at 60FPS, yet the update-rate is at 2 FPS (and hence physics as well) we can still render frames smoothly (even if it is shown as slow-motion).
  • Motion of shown objects can be extrapolated for smooth rendering and this does not effect the determinism of the simulation.
  • If the render-rate is at 1000FPS (absurd), yet the the update-rate is fixed at 60 FPS, the eventual state of the world will be the same as when rendered on 60FPS
  • If somebody has a slow GPU, yet the CPU has no problem, it will not effect the update-FPS (needed to ensure a good network experience by the player on the other side of the world). no latency problems may exist because of outdated graphic cards.
  • Even with low update-rates in physics (say 20 per second) it should be possible to run the update-rates of 60-per-second for handling input-control and this should not result in stuttering.
  • When the update-rate is modified (say 60/s -> 20/s) and the movement calculation in the update-step (say *3) to accommodate this change of perceived speed then the determinism we want to have is gone (floating point errors). To ensure determinism for networked games we must enforce the use of but 1 update-rate. When a programmer changes his mind and creates a new version to use 20/s (and *3 in calculations), then it can no longer be reasonably used in network games with an older version of the client.

Determinism

I discussed the topic of determinism a lot (here), and do this because a lot of software-developers seem to be under the impression that:

  • Determinism cannot be achieved
  • It is a lost cause and it's better to use things like hashing to ensure equal world-states on networked (real-time) games.
  • The solution as used in engines used for CIV5 (turn based) can also be used in engines used for Starcraft 2 (real time).

All these three statements above are false. A responsive real-time networked determinism cannot be achieved (for obvious reasons) and hashing must be used. However, applying hashing does not automatically state that it is senseless to focus on deterministic properties of the game and to show why this is I'll sketch an example:

A minecraft-like world (huge with chunks) where plenty of players are frolicking around, but where the physics (or 'rules') of the world also make changes (such as lots and lots of red-stone machinery). So now we created a situation where we want responsiveness within the visible chunk that more then 2 players use to make modifications (need of hashing), but also deterministic rules of the world that modify chunks that should be the same for everybody. If we are to create a game-engine that handles this problem, we need the rules of the world to become deterministic otherwise every visible chunk will receive hash-updates collisions. It is not a good idea to let the server just 'push' these world-changes (because depending on the game, this might encumber a lot of data).

This is the problem I wanted to address, and not all those 'easy' problems that can be fixed by applying those techniques as described in textbooks.

On the question: "Should iteration_weight be modifiable during run-time?" in the context of networked games that want to ensure determinism to prevent hash-clashes my answer is "No". In games that do not have the need in preventing hash-clashes the answer is "Yeah sure, could be. Although it not certain what you want to achieve (players of the game usually don't care about this as much as programmers do)". I personally would only use iteration_weight as a constant (because of compiler optimizations as you pointed out).

I can see no good justification for my MTR Design, to be used without the use of Batched Linear Subsystems (BLS) and my reason for this is:

When so much effort is put into a system that complicates the engine so much, it is reasonable to assume that the software-engineer (me) wants to optimize certain aspects for which there exist other methods that might not be 'as' good, but good enough for most practical purposes. Not using Batched Linear Subsystems (BLS) for having a scalable method for handling jobs would therefor just be silly.

Why? What is the motivation?

My principle is based on these observations

  • oldstate becomes newstate
  • Building newstate is threadsafe because it's not being read yet
  • To build newstate we can read oldstate concurrently (because it has become immutable)
  • If the update-rate is fixed for ensuring determinism, this system may stutter
  • I can use oldstate and newstate and apply extrapolation (not interpolation) techniques without this having an effect on determinism
  • By using extrapolation I do not add a new factor of delay (interpolation would have)
  • If the render-thread is reading two world-states for extrapolation, I cannot just 'delete' the oldstate after the newstate is build.
  • The part of the state required to use extrapolation is not 'that' big to hold in memory.
  • If the render-thread accesses two states at once (and locks them so it can be used) then where is the update-thread writing while the render-thread is rasterizing?
  • Extrapolation allows for the full separation of update-FPS and render-FPS without the need for mutexes (handy for networked games)
  • Having a technique that allows me to use absurdly low update-FPS (like 20 FPS for physics) without this leading to stuttering is very handy (physics may become very CPU-heavy)

To conclude

Is it clear what design-decisions I made, why they 'might' be useful and why some people should just use existing game-engine (or existing techniques) for creating games and not bother with triple buffering?

Person2 (#5903454104)
   
Person1 (referee)

Summary:  Bourgond Aries

  • requests elaboration on how extrapolation is implemented;
  • notices that skipping the rendering of some frames could keep the update-rate high enough so that it will not interfere with networked games;
  • worries about possible errors when using extrapolation in a double-state system and wants to talk more about extrapolation;
  • thinks a single threaded design (with BLS) can achieve the same things as multithread design (with BLS).
 
Person1 (Bourgond)

I think there are some interesting points here that I'd like to discuss. But before that; I'd like to state that I have never gone to a school for or about programming. I am completely autididactic in this area.

"Motion of shown objects can be extrapolated for smooth rendering and this does not effect the determinism of the simulation."

I request an elaboration on this point: I am mostly interested in how you implement extrapolation. It appears that this will imply that the renderer will do more than just render data to the screen. Afaik; changing the renderer to 2 FPS may not result in a stuttering-like appearance when no extrapolation is used. Considering the pixels to move slowly twice a second.
"If somebody has a slow GPU, yet the CPU has no problems, it will not effect the update-FPS (needed to ensure good network experience by the player on the other side of the world). no latency problems may exist because of outdated graphic cards."
Is very interesting to me. In the comment section on your 4th article I commented on this. It appears that you would skip rendering a few frames every now and then. I can imagine this same behavior being possible in a pure BLS by lowering the rendering distribution whilst having a loop bench-marking the main game loop constantly. Bench-marking can be considered very cheap, as I'd only be testing the time the entire loop takes to complete, and probably use an array of the 60 latest values in order to compute an average. Based on this average, the FPS speed could be lowered all the way down to 1 fps in order to give more CPU-time to our physics function. In the scenario of networking; I do not think we could change the update speed of the physics function.

I would like to hear what you think about this and how it is compared to the method used by the MTR. The other points your had there seem to be no problem to implement. Here is a picture of a ball circling around an axis simulated by trigonometric functions and iterative steps.

ss (2013-10-26 at 02.53.42)

When the render speed is at 1001, there ball appears to be moving at the exact same speed. The picture was merely a test to see if copying the buffers would work. It's a small simulated workload that's copied each frame. Tho, I suppose in reality; there would be no need to copy an entire texture, but only the handle to it.

I suppose I should think a little more about rendering FPS and update-IPS. You claimed that we can still have a perception of fluid motion when updating ips is at 2 and render fps is at 60. I suppose that with extrapolation, the renderer could run at a higher speed than the updater.

"Is it clear what design-decisions I made, why they 'might' be useful and why some people should just use existing game-engine (or existing techniques) for creating games and not bother with triple buffering?"

I believe it is quite clear what the purpose is, however; when we disregard the complexity, we still have a lot of inter-thread communication going on. Suppose the user issues an event, we'd need 2 lockless queues or mutexed queues for both the renderer and the physics thread. The physics thread needs to hold a reference to the queues as it would like to communicate to the renderer and perhaps put a message on its own queue for the next iteration. 2 Triple-buffers can be made here, 1 for the controller, reading a shared state, and one for the renderer. The renderer will also need a triple-buffer for its own shared state (containing information like where the view is located, how the screen is rotated,...). The argument of scaling comes into mind here. It appears you would like to mix BSL and MTR, which I can consider perfectly fine, but that would basically leave use with a batched system with threaded subsystems.

I decided to read up on extrapolation a little bit, and it would appear to me that if you use 2 world states, then would you not from time-to-time get wrongly rendered frames? Is not extrapolation here a cruder, more simplistic form of physics? Where do we set the boundaries for physics and rendering? Do you use some other method that I do perhaps not know of?

I think I do understand why you made your decisions, I liked separating my functionality by threads and synchronizing the threads. It felt as if the thread "just runs", which I felt gave me a good deal of freedom to do whatever I wanted. Now I am quite worried about the efficiency. It appears threads slow a system down if you synchronize too much or encourage a lot of context switches. Tho it is quite clear you do not want to synchronize (by using triple-buffering), I am seeing no particular reason why the MTR/BSL should be in some cases superior to the BSL alone. I hope you could enlighten me. The only possible argument at the moment - I think - is the extrapolation one. Very interesting and perhaps harder to implement in a purely BSL system -> I would like to talk more about extrapolation.

To conclude:

I'll sum up my current thoughts of the 2 systems. To me it seems that an MTR/BSL hybrid and the BSL alone can both achieve the following:

  • Efficiently threaded
  • Scale to more cores
  • Be deterministic
  • Run the renderer and physics updaters at mutually exclusive ips.
    • Without affecting physics results.
  • Neither need to check mutexes; if implemented properly
  • Both can handle input regardless of slow render/physics functions.

Regarding input handling, I do not consider popping an event from the window's event stack "handling" the input. I suppose there must be a change in the state not only of the window, but in the state of the "world" as well. Using a separate thread for input may perhaps handle input more quickly as it would modify a shared state and push it to the world. I however think the fundamental change needs to be spotted by the next iteration of the logic/physics thread. It therefore needs to complete. For the changes to become apparent, we'd require the physics function to finish and push a visual state again. I suppose this argument does not count when we are editing a state that's independent from logic. Anyways, I thus think both BSL and MTR/BSL can handle input with at least equal speeds.

 
   
 

Summary: Random guy

  • elaborates on how extrapolation is implemented in the triple-render-state design by means of an example;
  • show that error in extrapolation is negligible and does not interfere with determinism;
  • notices that extrapolation in a double-state design is more difficult;
  • discusses the problem when draw-commands become part of the BLS-system and why this should be avoided;
  • shows some worries about memory allocation that may happen every frame in a double-state design.
Person2 (referee)
 

Wow, you taught yourself all this difficult stuff without teachers pointing you in the right direction? That is pretty impressive! I too have taught myself a lot, but my education showed me several approaches that would have been very difficult to learn if I only used the internet.

Elaboration on Extrapolation:

Let me start by an elaboration. My claim was:

"Motion of shown objects can be extrapolated for smooth rendering and this does not effect the determinism of the simulation."

I'm only considering games that use floating-point positions here (so no pixel-perfect retro games). Extrapolation (and interpolation) are actually very simple. I'm talking about linear extrapolation here (so no complex curved estimations). In case of a simple 2D-simulation it's possible to use a motion-vector and simply estimate the new position: Lets say that the IPS=30 (so 0.0333.. seconds per iteration if we're lucky)

Oldstate (t):

  • T_{t}: State created at 0:10:00.0000
  • P_{t}: Position of object: x=10 y= 40
  • M_{t}: Motion vector of object: x=0, y= 3

Newstate (t+1):

  • T_{t+1}: State created at 0:10:00.0333...
  • P_{t+1}: Position of object: x=10, y= 43
  • M_{t+1}: Motion vector of object: x=0, y= 2(perhaps the motion is slowing down from friction or some other complex reason we don't care about)

So now the render-thread wants to use the newstate to render it on screen:

  • T_{now}: Time we started rendering newstate: 0:10:00.0370
  • P_{now}: Position for rendering: x=10, y= 43.22(only temporarily used for rendering, not stored) - calculated as:
    P_{now}=P_{t+1}+M_{t+1}\times\frac{P_{now}-P_{t+1}}{1 \over IPS}, resulting in:\{10,43\}+\{0,2\}\times\frac{600.0370-600.03333...}{\frac{1}{30}} = \{10, 43.22\}

It is important to see that in the example used above, we are not actually using data from oldstate (M_{t+1} was enough for extrapolation). For more complex (3D) situations we can use a different approach. The oldstate will contain a 4x4 matrix for the transformation of the 3D object as will the newstate. We can simply extrapolate the matrix (by looking at the differences) using a very similar method as the one described above.

Errors of extrapolation:

It is indeed possible that extrapolation is wrong, but the fault is always very minimal and usually not observable by a human eye. Take for example a ball bouncing of a wall such as described at the bottom of render loop part 2:

DeterminismExperiment-Laptop1

It is possible that for 1 frame the ball is rendered as if it goes through the wall even if the physics-engine is perfect. Fortunately the extrapolation is so incredibly small that this is negligible (unless our IPS is very low). Extrapolation will never modify a world-state so it will never cause a ball to go through a wall without ever bouncing.

Response to conclusion:

  • Efficiently thread.- true (although it depends on your stance on mutexes explained below)
  • Scale to more cores.- true (both CPU and GPU)
  • Be deterministic- true
  • Run the renderer and physics updaters at mutually exclusive IPS. - true(isch)
    • Without affecting physics results.
    • When using double-state MTR only, you still need to have some sort of solution to smooth rendering on low IPS. (if you find this important)
  • Neither need to check mutexes; if implemented properly- Incorrect. "Properly" is not defined, so let's elaborate:
    • When draw-commands are separated into several smaller jobs:
      This only looks like a proper implementation because the software engineer himself does not use any mutexes. We have to keep in mind what happens in the driver that sends the data to the video card. To ensure no command is send on the bus simultaneously the driver uses mutexes for the communication. So even when commands (or textures/shaders) are sent from multiple threads, a mutex in the driver ensures that only one thing happens at the same time. Using more than one thread to send commands (and textures/shaders) is senseless in the context of efficiency and will only add to the overhead. A fair question would be " What is the point? Even when one thread is used, mutexes will be checked.". The reason why mutexes are so taxing is not because they need to be checked, but because the operating system needs to do a context-switch in case the mutex is already locked. To summarize: Having more then one thread sending data to the video card is not only unnecessary, it is also bad for performance. A solution to this issue needs to look at the inherent single-threaded nature of the communication bus. This problem is actually everywhere you look (e.g. having multiple threads doing File-IO operations only hurts performance and makes performance in itself very unpredictable). When ignoring the inherent single-threaded nature of hardware communication channels, software engineers often overlook the use hidden mutexes what might result in a lot of unnecessary context-switches.
    • When a single job is used to draw one frame:
      Yep, no big problem here. There are some minor issues that need to be resolved though. If the job-queue is also filled with other jobs (such as Fibonacci-calculations :P), then the render-job may not be picked up as quickly as you'd like. The render-job itself is not computationally heavy on the CPU, so it is reasonable to give render-jobs a higher priority. For this reason a priority-queue (such as a min-heap) for all jobs in a BSL is more reasonable than some blind job-queue. A fair question for this model would be: " What is the point of using a BSL for render-jobs if these jobs are never actually multithreaded?". Every (decent) operating system already has an implementation of assigning priorities to threads that is used for context-switching. And what if the CPU is working on a low-priority task when a high-priority task gets added to the queue? This is known as the priority-inversion problem and is very important for OS's. So why not just make a separate thread for rendering and give that thread a higher priority. When the OS performs a context-switch you are sure that a high priority job is executed immediately and not just when another low-priority job gets completed. Priority-inversion is a complex problem and it cannot be solved by priority-queues. This problem can be addressed by OS's and that is exactly what they do.
    • Conclusion:
      The two possible solutions above are a true dichotomy and neither can be considered proper. No amount of hacking can solve these problems properly. A lot of software engineers are misinformed about why people like John Carmack are so passionate about topics like these and on why they use complicated methods. As John Carmack experimented with Mega Textures (a very taxing method for the communication bus) it turned out that some systems experienced severe performance problems. Debugging mega-textures is very difficult (especially when even driver-problems crop up) and it becomes more and more important to have a good view on bottlenecks on certain hardware/driver combinations. This is (one of) the reason(s) I wanted to use a different method for thread-separation, even if it complicated the engine.
  • Both can handle input regardless of slow render/physics functions.- true, but some further explanation is required.
    Do you propose collecting user-input on a separate thread? A thread that is not one of the worker-threads? It must be (I can't see any other way of properly doing this). for Windows the WndProc-thread is often used, although nowadays input-API's are a bit more flexible (e.g. DirectInput). I'm assuming you are not using an event-based system and continuously poll for data. This is very similar to one of the functions of my main update-thread. So you too agree that some threads should not be part of the BLS-system because it would be senseless to apply multitasking to something that is single-threaded in nature.
    This is how I approach this:
    My update thread runs at 60 IPS, collecting user input (and immediately applies it to the new world state, if possible). Other actions that also may never give rise to additional delays are executed on the update-thread too.
    • When a game is not CPU intensive:I like to prevent BSL, because without it I can use easy event-based world-state modifications without locking on mutexes. (It's also possible with a BSL, but it just complicates things because each position-modification needs to become a 'job'). I could be lazy and do physics every odd-frame and AI every even-frame. I'm more concerned about practicality then unnecessary complicating my code for something unimportant.
    • When a game is CPU intensive:For a CPU-heavy game I will create a BSL that will do computations such as Physics and AI in separate jobs (and perhaps even separating those jobs in child-jobs to increase scalability). It depends on the game what decision I would make, but usually I like the idea of performing these computations an older game-state in such way that it is a bit deferred and perform actions at a different speeds (such as once every 3 world-states).

Memory Allocation:

One last thing you might want to consider is the allocation of memory every frame. When you create a 'new' world-state and actually do a malloc (and free when you're done), you are using an operation on the operating system that also has a lot of overhead (hidden away from the software engineer). Whenever possible it is better to keep a world state in memory and just overwrite it with new data.

Person2 (#5903454104)
   
Person1 (referee)

Summary: Bourgond Aries

  • agrees error in extrapolation is negligible;
  • notices that the render-thread must be the one that does the extrapolation and worries about possible CPU-usage;
  • notices that the extrapolation itself could be threaded;
  • thinks linear extrapolation is just as easy applicable in a double-state design;
  • acknowledges the problem of having GPU-calls be part of the same BLS system as the rest of the code;
  • agrees that having threads (other than the usual BLS threads) can be beneficial;
  • thinks that given the speed of memory-copying there is no need to worry about state-copying each frame.
 
Person1 (Bourgond)

Extrapolation:

I believe I agree that the errors generated in extrapolation will often be too small for the user to notice, unless - as you said - the IPS is very low. If over >100 approximations take place before a true physics update, and the physics updater runs once a second, then we could see objects moving linearly before being corrected. I however do not think the deviations will be extreme in normal applications, and thus I think this method is valid. Perhaps non-linear approximations can be made in more complex applications. I think that's not our issue tho. Let me just draft my thoughts on extrapolation in the different systems.

Regarding Extrapolation in MTR:

Linear extrapolation may often yield results that are accurate enough. Like you said that the extrapolation is not written to the state of the object; but instead used to compute the an approximation to the next frame's logical outcome. I wonder where you would place extrapolation calculations in the MTR - The Render Loop, or the Physics loop? Actually, this question seems kind-off obvious (but I ask anyways, you never know), because it appears to me that there is no need in the physics loop to extrapolate and push a state when it is running at 2 IPS, since it can not push a state at any higher rate than 2 IPS. Thus, the only logical answer would be that the renderer uses the most recently updated render state from the triple buffer. It will thus most likely write to that a copy of this data in order to approximate ~30 frames when rendering runs at 60 FPS and logic/physics at 2 IPS (extremely low; just an example). It seems to me that such an ordeal would mix CPU-intensive approximations based on linear motion vector additions with GPU-intensive rendering. In addition: if we were to use the same buffer, we would need to provide the renderer with data that is not drawn, but used to approximate the next frames until a new update has come. Concerning the MTR; I would like to propose a thread that runs parallel to the renderer, and at the same ips as it. This thread could perform all extrapolative jobs and push the newly created pure drawing data into the triple-buffer.

Possible Application of Extrapolation in BLS:

I can imagine that; as we add another thread in MTR to manage this job - we could in BLS add another binary distribution of iterations (Rit) within the physics function to handle extrapolation. No triple-buffering or locks would be needed since the old state is read, whilst a temporary approximation state is written to by the approximator. Once the true physics function runs again, the approximator will need to reset its state and upon first approximating again, needs to copy from the true physics state into the temporary approximation state.

Regarding Response to Conclusion:

  • Run the renderer and physics updaters at mutually exclusive ips. -true(isch)
    • Without affecting physics results.
    • When using MTR only, you still need to have some sort of solution to smooth rendering on low IPS. (if you find this important)

I agree with the first point. That is what I actually meant with mutually exclusive (Is the English wrong? We're Dutch and we have a reputation to maintain x]). Anyways, did not linear extrapolation partially if not completely solve the second case here?

  • Neither need to check mutexes; if implemented properly - Incorrect. Properly is not defined, so let's elaborate:
    • When draw-commands are separated into several smaller jobs [...]
    • When a single job is used to draw one frame [...]

I think you are correct in outlining this; but I would like to comment on both, especially the first one.

  1. When draw commands are separated into several smaller jobs.
    I have read some OpenGL documentation regarding stalling, and it appears your claim is correct. Tho I have also read that Direct3D allows for experimental multithreaded drawing onto the screen. Rumors have caught my ears of highly inefficient rendering. When it comes down to it, separating drawing into batches can thus be claimed to be a big disaster and must be avoided. You touch a very interesting concept here because the MTR design basically renders whilst physics is doing its own work. The same effect would perhaps only be experienced if the physics code in an BLS (Yeah, BSL flows more naturally, I motion their use to be interchangeable :3) were to incorporate drawing commands. This however gives you multithreaded rendering; which we appear to want to avoid. All this brings a few thoughts into my head; firstly, using a separate thread for drawing allows the GPU to stall this thread when its command queue is not empty. Without a problem we can then draw the previously generated scene which has been generated by the physics thread. It can be imagined that the GPU may stall the main loop in the BLS system. The main loop could use a single thread from the worker pool and upload all render data to this worker, which then draws all items on the screen. Once this worker is done. It notifies (using a condition variable) that it can perform more work. The advantages of this I think are quite clear: The main loop can continue delegating and working on a new state within the physics loop whilst a dedicated renderer renders the old state. In addition; since we already have an older state copied, we can make the main thread wait for the condition variable before copying the new physics state into the old state. Or we can use triple-buffering, where the copy can be performed whilst the renderer still renders, and pointers are merely flipped afterwards. I see this as a very potent argument for using a hybrid system (both MTR and BLS). Oh, another thing is that this renderer, once it has rendered a physics frame, can extrapolate until it receives a notification of a new state.
  2. When a single job is used to draw one frame:
    I think I can agree with this quite well. I have not been able to manipulate priorities in the BLS, but I suppose I would have to fiddle with the Windows API to increase a dedicated render thread's priority.

To continue:

So you too agree that some threads should not be part of the BLS-system because it would be senseless to apply multitasking to something that should be single-threaded in nature.

I think I agree; I am not to go overboard with multithreading. My benchmarks have shown that only batchable jobs would improve speed from actually being multithreaded (such as physics). I intend to handle events in a single (the main thread) in order to avoid mutexes. Since the controller were to change booleans and other PODs (most likely) - it would notify the next physics frame of this state, and thus process it in batches as well. So in a sense, events are kind-off handled in batches, only not.

Memory Allocation:

I intend to allocate the world state on the heap and copy it during each physics iteration using C++'s operator=. Since basically each member's operator= is called, all specializing algorithms are called and it would thus be very fast to copy (nothing is deleted, also, in C++ most programmers use "new" and "delete" or "operator new(std::size_t)" and "operator delete(void*)" for raw byte allocations.). Apparently, on my computer I am able to copy 10.3GB/s from different RAM locations using std::memcpy (which is a specialized algorithm for most architectures). Considering the world state itself not to be extremely big, we could run physics at max 120 ips whilst copying a world of 85,833... MB. I imagine shared static resources like textures and constants like iteration_weight - resources that are guaranteed to not change during a physics update - do not need to be copied; as most objects requiring them will hold references.

 
   
 

Summary: Random guy

  • tells that his renderer is indeed doing the extrapolation;
  • wonders whether separating the extrapolation to a separate task is indeed beneficial;
  • notices that double-state does not give the renderer access to another state that can be used for extrapolation and that the two states must therefor contain the full physics state too if extrapolation is used;
  • still worries about memory allocation.
Person2 (referee)
 

Extrapolation

Just to clarify for other readers, if they are still interested 😛 (since you already know): Extrapolation in my code (currently) takes place in the render-thread (the one sending all the data to the video-card), because we extrapolate information for every frame rendered. You rightfully point out that the thread, who's job it is to talk to the video-card, needs to make a couple of calculations. To respond:

Yes, indeed 😛

I'm not sure if the proposed solution (separating extrapolation calculations from video-card API-calls) will have the desired effect though. These are is my reason for not applying that separation (yet):

  • Will this separation be faster?
    The time we are talking about is the time spend from where we start extrapolating to the time that the back buffer is flipped. Although the api-caller-thread will depend on the extrapolation-thread, we may expect that it is a little bit faster (utilizing two CPU-cores if available). But how much faster? Well, we can only flip the back buffer when all extrapolations have been done. So will the extrapolation-thread much faster because it is not calling API-functions anymore? Yes it is, but but probably not by much. If we are using the API's (default) non-blocking functions we can even push this optimization to the render-API. Notice that the use of non-blocking functions often just means that your job is put in a job-queue which is handled by a single thread by the API's code. It'll be a different thread than the one you are currently using, but it'll be just one thread nonetheless. There is but one good reason to have more then one thread for handling message to video-cards and that is SLI. Having two video-cards that can share memory, but both have a different bus allows for true multitasking (in respect of sending jobs to the video card). The multitasking that video-card vendors are talking about is a completely different and has nothing to do with the number of threads spawned on a CPU. This sound very counter-intuitive for most people and is probably the biggest reason why some people think that multithreaded access to the video-card is a good idea. When we are only separating extrapolation from API-caller, then there is not that much reason to assume it's much faster. But! We can split all the extrapolation calculations to separate jobs (because none of these calculations depend on other data).
  • Is it nicer?
    Yes it is. The separation allows us to easily detect bottlenecks because computation is fully separated from the API-calls, although it's not much computation.
  • Will it use more memory?
    Yes separation will use more memory. In worst-case scenario's you need a complete copy of a render-state. (A render-state that contains all extrapolated matrices). By use of smart memory-management we don't necessarily need to allocate it every frame, but it's not ideal. The current system just calculates the matrix and sends it to the video-card (because there is no need to keep it in the heap after the API-function has been called)

So it is a consideration and the answer to "Is it worth it?" may sometimes be "yes" and sometimes be "no" (most likely "yes"). If we are creating a generic engine I will certainly implement it and have some sort of switch to enable/disable this technology :). So sir, have impressed me once again. I have not considered this separation before and only now am thinking about all possible consequences.

Possible Application of Extrapolation in double-state MTR:

the proposed solution does indeed work. If you want to use an existing physics engine you can just write a wrapper to accommodate this functionality. Assuming you have implemented this and have it working properly I'm asking myself this:

What is the worst-case scenario for how much data this system remembers and has in memory at one point in time (with respect to render-states)?

  • You have a full latest created state that the renderer is currently using to send commands to the video-card.
  • You have a full state that the current BLS-threads are currently writing to.
  • You have a subset of the previous render state (only positional/rotational information) which the renderer (or extrapolator) is using for extrapolation

That is almost three full states :) The only reason that it's not three full states is because you're not interested in remembering things like texture-pointers of the previous iteration. If you want to make sure that your renderer always has the latest render-state at his finger-tips (even when the update-threads is still doing their thing) then even this model has 3 render states :). Well, it the worst-case scenario off course (and some render-states are stored as a bunch of unhandled jobs in a queue instead of being an 'actual' render state)...

Memory Allocation:

As you know, new and delete both use the OS-api for malloc and free. When the '=' operator is applied on a structure, the memory block of the structure is is just copied, but before that happens the OS need's to make sure it can reserve a block on the heap (malloc). I'm not worried about the speed of writing to memory, I'm worried about lot of consecutive malloc's and free's because every reservation on the heap the OS needs to read (and update) it's administration on the memory-blocks that are free. Usually this is no problem; this operation is very fast when there is plenty of memory. Problems occur when memory is getting depleted because malloc will take longer in finding a free block (and possible forces paging). To illustrate, which of these two functions do you think is faster and why? (assuming thread-safe operation):

  1. void SwapAB() {       
    int tmp = a;
    a = b;
    b = tmp;
    }
  2. static int tmp;
    void SwapAB() {
    tmp = a;
    a = b;
    b = tmp;
    }

Well it is probably not the best example, since I'm reasonably certain C++ will optimize this (and am too lazy to actually test this). But I think the example is clear. Using the triple-state mechanism I'm basically doing the same: state_to_overwrite.object_to_draw.x = ...; I'm not allocating a new world-state on every update, I'm overwriting existing memory that has been allocated some-time in the past. I have not done the actual test to see how much difference it would make, but I'm reasonably certain that it'll help a bit. This was a very-very difficult topic because it may be the case that 'object_to_draw' does not exist in state_to_overwrite (without solving this issue) and I did not want to use lazy initialization. This can also be achieved when not using a triple-buffer (I think), but its benefits may not be really measurable. I just added this because it was fun 😛

Conclusion:

This is a truly wonderful conversation. We do not have a single disagreement (as far as I can tell) and are just exchanging ideas. If only every conversation went like this :).

If you want to think about something else (that is pretty annoying unfortunately), then I encourage you to look at instancing. All the ideas of "beautiful code" goes out the window if the only thing your interested in is performance :(... Well it's still a wonderful world :)

Person2 (#5903454104)
   
 

Summary: Random guy thinks unicorns are cool (referee: "what a wuss").

Person2 (referee)
 

Oh, before I forget. You also deserve kudos for using a unicorn-image. Unicorns rule!

 

Person2 (#5903454104)
   
Person1 (Bourgond)

Summary: Bourgond Aries probably thinks he's a wuss, but doesn't want to say it like that.

 
Person1 (Bourgond)

^How dare you call Princess Luna a unicorn! She's an Alicorn. But yeah, unicorns rule as well :3.

 
   
Person1 (referee)

Summary: Bourgond Aries

  • states that the physic state (of which he has two) can be used for extrapolation and that this does not need to be seen as a third state;
  • says that the renderer 'can' use the old state for extrapolation, iff the renderer is run after physics and extrapolation (sequential rendering);
  • states that his world-state is about 120Kb and that there is no need for worrying about memory copying.
 
Person1 (Bourgond)

You posed the question of the separation of the render and extrapolation thread. Would it be faster? I suppose data buffer between the two can be upgraded to a triple-buffer. I wonder if this will yield any meaningful result as the threads should ideally run at equal speeds. We would just be creating a second physics thread that is simpler. As for your questions posed under Possible Application of Extrapolation in BLS, I have constructed the following answers and some comments.

  • "You have a full latest created render-state that the renderer is currently using to send commands to the video-card."
    Correct, however this rendering state is in the oldstate of the physics updater. Thus, we are having the same memory overhead as we had anyways.
  • "You have a full render-state that the current BLS-threads are currently writing to."
    This made me think. The renderer or extrapolator (depending on the programmer willing to separate them) would use a state containing the most fresh extrapolated rendering data. This data would not be written to from the renderer, but instead be written onto the screen (I will elaborate on this afterwards).
  • "You have a subset of the previous render state (only positional/rotational information) which the renderer (or extrapolator) is using for extrapolation."
    I currently think this is not necessary. In my 4th post I talked about determinism and adherence to Newton's 3rd law. This implied a complete copy of the entire world's public state. I consider each unit inside the world state to contain its own rendering states such as position, size, color, shape, and so forth. Considering a copy is made during each physics update, this copy would simply be a part of the oldstate (which is one of the 2 states: oldstate and newstate) contained in the physics loop. Since physics computes position, motion vectors and stuff that is; physics; then I suppose there is no need, no, there is a very low usefulness in extracting the pure render state (including positional/rotational information) for extrapolation or rendering purposes.

Regarding the points above:

Your second point is very interesting to me. Considering the sequential or linear nature of the BLS, there is a possibility for the BLS-threads inside the extrapolator to use the oldstate of the physics updater to write into (change the positions of units). After this iteration completes, the frame can be rendered. This will most likely only work when the renderer runs sequentially (directly after the BLS-extrapolator has completed). This does thus not allow us to multithread the renderer unless we create an entire separate new render state (A triple-buffer, perhaps?). However, this introduces some new overhead; when the physics updater starts running again - the old state has been invalidated, and thus we must perform a oldstate = newstate; operation. Luckily, only positional data has changed. Using smart algorithms, there might be very little data to copy here. It appears to me that it is possible to not have "almost 3 full render states". When the renderer is not multithreaded (runs sequentially after physics/extrapolation), then the newstate can be read from the extrapolator whilst writing into the oldstate. Suppose real physics starts after the current iteration, newstate is again copied into oldstate. The newstate will not use its own data to compute unit data that requires an interpretation of the scope outside of the unit itself. Instead, it uses the data from the oldstate world to compute interactions! Newstate can still work on its own, but only if the unit is isolated. An example is how quick a unit gets hungry or how quickly a unit decays based on no interpretation of the surroundings of that unit. Thus; I have reason to believe that we can in fact keep our old ram usage overhead of 2 worldstates.

  • But now we have a possible stall in the renderer, right?
    Well, since we're running the extrapolator and the renderer sequentially, yes. But what if we were to thread the renderer? (I think this is a really strong point of the MTR)

Now we have the problem of immutability of the rendering state. I suppose that creating a new state might be the best solution. Perhaps even a triple-buffer. However, can't the renderer read from oldstate whilst the extrapolator writes into newstate, and then a pointer can be flipped using synchronization? Why synchronization? Well there is perhaps no need for the extrapolator and the renderer to run at different speeds (Isn't the whole point of the extrapolator to run at the same speed as the renderer - to give an appearance of fluid motion during low ips physics updates?). However, I think this introduces another problem. When the physics loop starts running after a few extrapolations have occurred, it will read an oldstate that has been tampered with: positions are changed. I suppose that the only way out of this is to always leave the oldstate immutable, and to create a copy only of movement data for the extrapolator. It would also require a copy of the rendering state. We now have a double-buffered rendering state that the extrapolator and the renderer flip when both are done with their frame. This introduces some overhead: the extrapolator would need to copy its previous iteration or re-do its previous iteration in order to continue from the frame where it left of. Anyways; suppose movement and rotational data in 2D space is represented by floats of 4 bytes; being 8 bytes for movement [x, y], and rotational data by a single float [r] - 4 bytes - for radians in change per iteration. That is 12 bytes for a single unit. Having 10000 units would require you to copy 120 KB. This operation can be performed approximately 85833 times per second on modern hardware. Regarding the extrapolation data copying in the double buffer - the render state may be a little larger than the the movement/rotation state.

The initial data copied does not change, so for 8 extrapolations, we need 1 copy of the old state's movement/rotational data. For 30 extrapolations, we also need 1 copy of the old state's movement/rotational data. Considering this, we can simply use this delta information to change the newstate and then render newstate (sequential rendering).

Thus; I would like to refute the claim as to having almost 3x the amount of rendering states in memory. Well, it's kind-off true because we have 2 worldstates, which is already 2x the amount of rendering states. These are merely a wanted effect to accomplish Newton's 3rd law of motion. When considering the 2x renderstates as the normal value - 1x - We actually have 1x renderstates when extrapolation occurs in a purely BLS system. In an hybrid system I believe there would be at least the need for 1.5x render states (where 2x is the normal distribution); as we'd implement a double buffer so rendering and extrapolation can occur in parallel.

Anyways, I'm getting side-tracked. The question in bold asks whether our extrapolation will be stalled due to the GPU being slow. Perhaps this is true. I suppose - since I've already claimed that the renderer and the extrapolator must run at equal rates (would you agree, btw?) - there must be a copy of newstate's renderdata that the extrapolator and renderer alternate between (thus 1.5x). Perhaps this would be the fastest implementation, theoretically? I suppose the MTR will squeeze out the best performance, but I am not quite sure. It would perhaps be dependent on the hardware used, right?

Memory Allocation:

You implied that new and delete use malloc and free internally. I would like to refer you this page. I have read that the implementation offers no guarantee that new and delete are implemented using malloc and free. I am unable to refer you to that C++ specification.

  • "When the '=' operator is applied on a structure, the memory block of the structure is is just copied, but before that happens the OS need's to make sure it can reserve a block on the heap (malloc)."
    I will consider this partially true. When a block of memory is copied over another, we will most likely call operator= on vectors and other variadic-sized containers. The newest container could contain more elements, and it thus has to check if the previous container has enough storage space. If it does not, then it will have to allocate new space in this container. Most of the time, however, the data is of equal size and is simply copied by a mov instruction to a ram->register, and a mov instruction out again register->ram.

Regarding the functions you showed me.

I do not know this language but I think I can interpret it regardless. I currently would say function 2 is the faster one - but I am not sure. The static int would logically be stored in the BBS segment of a program and the local int on the stack. I think it boils down to which one is faster. This; I do not know. On the other hand, the first function can be faster considering the compiler simply keeps the temporary in a register, without ever storing it in RAM. The static int would always be stored somewhere.

"This is a truly wonderful conversation. We not a have a single disagreement (as far as I can tell) and are just exchanging ideas. If only every conversation went like this :)"

I get that quite a lot; I must say reading some dialogues from ancient Greek philosophers have not hurt me.

 
   
 

Summary: Random guy

  • apologizes about calling Princess Luna a unicorn;
  • makes a comparison on memory usage when extrapolation is used between double-state and triple-renderstate;
  • shows some additional information about new/malloc.
Person2 (referee)
 

On the comparison of Princess Luna with a unicorn

I suppose you're right about Princess Luna not being a unicorn. A depiction of a mythical creature is in itself not a mythical creature. This fallacy is known as the reification-fallacy :3. On the question: "Is Princess Luna a depiction of a unicorn, just named differently?" I too must agree that the answer is "No". Unicorns are mythical creatures in our universe and Princess Luna a creature in a completely different universe (I assume). So I must apologize for this incorrect comparison, I too think it is very important to use the most suitable words to convey ourselves.

Introduction

Let me start by saying that some of the examples used below are pretty obvious. I'm still using these examples so other readers that are not interested in reading our whole conversation have the ability to jump-in. They are not meant to be demeaning in any way.

On the separation of the render-thread and extrapolation-thread :

I was actually thinking of applying BLS here, since all extrapolations only uses local data and does not depend on other subsystems. This may result in a considerable speed-up if the extrapolation code is complex and processor-cores are available.

Memory usage:

Memory usage when applying extrapolation and using BLS-only for rendering in the worst-case scenario:

I could have phrased it a bit better. Let's try to rephrase what I said for the sake of clarity:

  • We'll have an actual latest state which will use something like a hierarchy or tree-structure - render state #A, tree
  • The latest render-state is also being stored as a bunch of (yet unhandled) draw-jobs - render state #B, queue (not a 'full' render state)
  • The copy of the latest render-state currently being modified by physics (and whatnot) - render state #C, tree

So it depends on what you are calling a render-state. When applying the same description on triple buffering (again worst-case):

  • The render-state that is the latest fully modified render-state (tree) - render state #1, tree
  • The render-state that is currently being used to rasterize an new frame - render state #2, tree
  • The render-state that is currently being modified by physics (and whatnot) - render state #3, tree

When I'm applying the BLS technique on extrapolation I'm introducing a new render-state in triple-buffering the form of a queue:

  • The render-state that is currently being used to perform extrapolation jobs - render state #2.1, tree
  • The render-state that is currently being used to make API-calls for the video card as a bunch of (yet unhandled) jobs - render state #2.2, queue (not a 'full' render state)

To not have 3 render states on BLS-only systems:

We can run the renderer directly after our BLS-jobs are done on render state #A. This makes sure no queue-jobs are created for calling render-API functions. This indeed eliminates this render-state. There are some possible complications you might want to consider (that may not apply to your game):

  • Deferred computation:
    Note that deferred computation introduces an additional world state in both BLS-only-designs as triple-buffer-design and is in itself not an argument to use triple buffering. If you are using a very complex physics process (that takes 0.1 seconds to complete for instance), then we need to have to run the renderer (and extrapolator) at a higher speed. When the physics-job get initiated we cannot wait for it to complete, so we continue extrapolating on the old state. This way the new frame can be rendered without a hitch. If we are using the old state for extrapolation we need data to extrapolate from. This data can be stored separately of course, but we cannot use linear extrapolation between the differences of old_state and even_older_state. Note that triple buffering does have the solution to this problem either, but there is an intuition we can use when solving such complication. When this complication is addressed (properly), the programmers always use a method of deferred computation. This means that we keep some old world-states in memory that the physics computation can use for modifying matrices. For example: when physics is done once every 7 iterations (to keep IPS high):
    • "At frame 35, because 35 % 7 == 0, we make sure to remember this world-state and name it "physics_state" and let some thread(s) do their complex 0.1 second thing. At frame 42, because 42 % 7 == 0, we are going to apply the results and overwrite physics_state with this new state (notice the use of a monitor here)"If your intuition is to use the same BLS-queue and threads for these physics jobs (the same queue as used for AI, sound-effects and other threaded operations) than I would urge you to reconsider. When we are at frame #42 and physics is still not finished, it becomes very easy to temporarily give a higher priority to physics and no blocking function calls for playing audio-effect should lock up a worker thread.
  • Deferred API-calls:
    We never want to delay an API call to the video-card because: "The sooner we can inform the video-card that he should put his cores to good use, the better it is". This is why complex AI is often executed deferred (physical interaction are shown without possible delay from AI). John Carmack has actually experimented with responsiveness on multithreaded game-engine designs (e.g. directly applying user-input to the world state without waiting for the physics to iterate). This gives rise to certain complication (you don't want the user to stutter through walls), but responsiveness is important enough to take some ideas into practice (rotating camera for example does not need to wait for physics).
  • To summarize:
    If you know what the final transformation of a 3D object is, it's best to put it to the renderer-queue immediately. There is often no reason to wait until all transformations are done and it'll be silly to let the GPU wait for completion of all the CPU-calculations (even when using front-to-back rendering). So no, I don't think it's a good idea to run the rendering-API calls after physics is done and I think that a queue (render state #B) is the best option. When using an existing physics engine this can be a bit tricky, but luckily most of these engine support event-handling on transformation-updates to realize this.

This scenario does introduce a third rendering state (queue) and could be a good motivation to consider triple rendering. Having queues like these above makes it look like that we don't have as many render-states (in worst-case)

Render-state synchronization:

As you mentioned

oldstate = newstate

Is an operation in O(n) with respect to our render-state-size, assuming no shenanigans (clearly splitting render-state and AI/Physics state here). The tree-synchronization technique used on my 3 trees circumvents this linear-algorithm. The technique of tree-synchronization is also applicable on a BLS-only approach.

Preventing lots of mallocs and free's (new and delete's) in a BLS:

We can only prevent this if we are not using pointers to jobs, but to use the actual (dereferenced) data-structure in the queue. This will cause the following to happen:

  • When used in conjunction with a priority-queue;These jobs may get copied around when jobs are added and removed from the queue
  • When used in conjunction with a normal queue;We can use an array as a cyclic-array that keeps track of a pointer to the 'first' element. This is unlike the std queue implementation (having a cyclic array on a data-structure that is dynamic in size does not work that well). Fixed size queues for multithreading is shown nicely in the barbershop-problem for readers that are interested.
  • To conclude:A queue-based render-state requires us to make some very difficult decisions that affect either memory-copying, the use of priorities, or excessive use of malloc's and free's. When the actual data-structures are used in the queue we also need to keep memory-alignment for all jobs.

Memory Allocation:

Quoted from a random forum:

The Standard does not require any particular implementation. Therefore, malloc may or may not be used. Here is a *possible* implementation (taken from TC++PL):

void *operator new(size_t size)
{
  for(;; )
  {
    if(void *p=malloc(size)) return p;
    if(_new_handler==0) throw bad_alloc();
    _new_handler();
  }
}

Mixing up malloc and delete is a very bad practice (as is mixing new and free) as pointed out by the very awesome website parashift.com. I know that website very well, that is where I learned about const-correctness :). There is no guarantee (because the implementation of c++ does not require to describe its internal operations), but it is easy to see why so many implementations do use malloc and free internally. As you pointed out in "Most of the time, [...]" - you are absolutely correct and this is why it only becomes a problem when physical memory runs out. (says the guy using c# for his engine, lol) The example used was indeed based on stack-allocation (a more efficient method) and not of heap-allocation. That is why I thought it might be a bad example. When we are talking about copying of render-states we are talking about heap-allocation are we not?

Ancient Greek philosophers:

I too have read a lot of texts from ancient Greek philosophers out of fun, but also because the study " Cognitive Artificial Intelligence" at the university was part of the Humanities-faculty (in Dutch: " Geesteswetenschappen"). They changed it not too long ago, I believe they merged it with the regular study of " Artificial Intelligence".

Person2 (#5903454104)
   
Person1 (referee)

Summary: Bourgond Aries

  • shows benchmark results of his game-engine (still in development);
  • demonstrates some of the methods used to create the benchmark;
  • wants to devise a method for comparing engine-designs.
 
Person1 (Bourgond)

Some Research Results:

I shall first show some of my latest bench-marking results and I shall then fade into my thoughts regarding your assertions. I have implemented a little application with the following properties:

  1. Every 10th iteration of the main loop; it outputs the time taken for that iteration to complete
  2. The physics updater is completely batched (BLS)
  3. The main loop distributes rendering and physics in accordance to Rit
  4. 10k (standard) particles are moved across the screen (CPU-driven motion)
  5. There is an ability to edit the amount of particles
  6. There is an ability to turn MTR on or off
  7. Princess Luna is still watching us (most important point, this guy needs escapism)

Since pictures tell a thousand words; so here's a simplified description of how the architecture I have written works:

MTR-BLS-hybrid2

When the MTR is turned off, logic sequentially enters render and renders on its own. Here are the results:

Run

The first numbers are  s in time of main loop completion using sequential rendering. The lower numbers around 5-8k  s are the loop times when the MTR was turned on. This MTR was a very simple model and only synchronized after physics was done. This allowed physics to run whilst the renderer rendered the oldstate. It is quite clear that the MTR is faster; however, I am not using raw OpenGL calls since my drawing goes through SFML. I have looked at the sources and it appears that drawing commands are directly sent to OpenGL from the same thread that requests drawing. I think I can therefore safely say that this library does not already implement some kind of queue for drawing. I changed the amount of particles from high to low and the results were approximately the same: MTR causes the loop to iterate approximately 3x to 4x faster. Quod Erat Demonstrandum MTR/BLS-Hybrid yields superior iteration rates.

Memory Usage and Extrapolation

The memory used in the above example is 2x worldstate (including rendering state) + 1x global state (window, textures, etc...). This came down to 52-54 MB (alternating every second). We could consider extrapolation in this scenario. I think you brought up a very good point: If you are using a very complex physics process (that takes 0.1 seconds to complete for instance), then we need to have to run the renderer (and extrapolator) at a higher speed. When the physics-job get initiated we cannot wait for it to complete, so we continue extrapolating on the old state. This way the new frame can be rendered without a hitch. Such complex physics simulations - seriously, this is intense, but that does not invalidate the point - would require us to run physics and extrapolation in parallel so that the frames appear smooth. Yet it requires one additional renderstate. Let's see if we're thinking about the same thing; since pictures can be so much more informative:

workflow

Here the physics and extrapolator run using BLS-setup. Extrapolator pushes extrapolated rendering states to the queue, Once it is done extrapolating the frame, it will push an object that causes the renderer to flip the buffer - and thus make the rendering visible to the user. I can imagine that one could remove the WRITE-ONLY channel from the physics thread into the queue, as the extrapolator will simply continue from the old state and render the next frame. If your intuition is to use the same BLS-queue and threads for these physics jobs (the same queue as used for AI, sound-effects and other threaded operations) than I would urge you to reconsider. When we are at frame #42 and physics is still not finished, it becomes very easy to temporarily give a higher priority to physics and no blocking function calls for playing audio-effect should lock up a worker thread. I think this point can be refuted as my implementation is independent of a queue. It is the main delegator's job to delegate the work to each thread separately. I am suspecting that I am misunderstanding you and that you are actually talking about operating-system level threads and how they are prioritized. Please let me know. We try to give each thread an equal amount of work using simple calculations: (C++)

code1

Each logic function in each thread will using the thread number compute what data range it will work with:

code2v2

We are thus unable to stall a physics process by accidentally inserting audio work. When we are talking about copying of render-states we are talking about heap-allocation are we not? I would say so. I see no point in allocating rendering data on the stack as it can become quite big, whilst the stack is limited quite limited. The only problem with the heap is that it takes an instruction to dereference. I think that this will affect us minimally, and is thus not worth of optimizing away.

To summarize:

It appears to me that the main unresolved problems are: Are the implementations of deferred API-calls faster than the BLS-threaded way? That is, BLS waits before overwriting the oldstate to see if the renderer has rendered the old state. As I have shown, it is about 3-4x faster than sequential rendering. I think I could use a boost::lockfree::queue to experiment with this. I urge you to do the same so we can compare results and methods. That I have to know more about your tree-implementation to circumvent the O(n) data copying relation. How can we quantify and weigh memory copying and usage against responsiveness and CPU-time?

Shhhh, no physics now, only extrapolation:

In addition; I would like to propose a method of not threading extrapolation whilst still being able to extrapolate without the need to wait for physics to complete. Considering physics to take up a whopping 0.1 s per iteration, we could partially complete a physics computation. Let's compute 1/20th of the physics per iteration; this results in physics taking up 0.005 s, which is really low. We could after computing physics continue our extrapolation like always and send another message to the renderer that a new state is ready. Or we could use a queue from within the extrapolator to achieve deferred rendering. I suppose it will be a little easier to implement and would not require more than 3 render states in total.

 
   
 

Summary: Random guy

  • makes a summary of the workings of Bourgond's code after asking a couple of questions in the chat;
  • proposes a different workflow (assuming state-copy is very quick and not a problem);
  • thinks a comparison between double-state vs triple-renderstate would be more interesting, instead of whether or not a separate render-thread is a good idea;
  • states that extrapolation is not the same as physics and that the two should not be grouped together.
Person2 (referee)
 

I already said it to you personally, but it would be a shame if I didn't repeat myself:

This is pretty awesome =)

Summary of your results (for other readers):

You have shown with your example that a multithreaded render loop (MTR) can be very beneficial (using a double render-state), because the render-thread is separated from the update-thread(s). The applied stress test was to see how the frame-rate was impacted by a CPU-load: After some questions asked in the chat for clarification, I came to this workflow for what happens when your computer has 8 processor cores:

Without MTR: ~20ms/frame

  1. [queue work] - queue work on mainthread for 7 worker-threads
  2. [8 cores working on newstate] - mainthread + 7 worker-threads do each 1/8th of the task at hand
  3. Done [7 cores halt] - wait on main-thread to be sure task is done
  4. currentstate=newstate - on main-thread
  5. [API-calls (render)] - on main-thread
  6. ... back to step 1 ...

With MTR: ~6ms/frame

  1. [queue work] - queue work on mainthread for worker-threads
  2. [7 cores working on newstate] - mainthread + 6 worker-threads do each 1/7th of the task at hand
  3. Done [6 cores halt] - wait on mainthread to be sure task is done
  4. wait_for_previous_drawing_of_currentstate() - main-thread
  5. currentstate=newstate - main-thread
  6. [signal render thread] - main-thread
  7. [API-calls (render)] - render currentstate on render-thread
  8. ... back to step 1 ...

Indeed Q.E.D. :). I will not argue that a MTR is not useful (because it is, as also shown by your evidence) and I will also not argue that BLS is not useful (because it is, especially in CPU-heavy games), but I will debate 'how' BLS is used en 'how' the MTR is implemented (given a certain set of goals we wish to accomplish).

Alternative to used workflow:

Modification to "Without MTR" ... can this still be called "Without MTR"? Don't think so:P.

  1. Done [8 cores halt] - wait on main-thread to be sure task is done (initial state of monitor is signaled)
  2. currentstate=newstate - on main-thread
  3. [queue work] - queue work on main thread for 8 worker-threads
  4. [API-calls (render)] - render currentstate on main-thread
  5. ... back to step 1 ...

The main thread is now basically the render-thread (plus state-copier) and is not working on jobs together with worker-threads. I think it'll be faster and it's even less steps. Personally I think this workflow is much nicer. It does not require the main-thread to pick up some of the work. It seems to me that somewhere the decision "No more threads than processor-cores" was made. If this is true, then I think that this is an unfounded decision. There is good justification for using a higher sum-total threads than CPU-cores when there are occasions that some of the threads need to wait. There is no need to go overboard for obvious reasons, but being a little bit greedy (e.g. 9 threads on 8 cores) doesn't hurt a bit.

My example used above has one issue you might want to think about:

Because work is queued before doing the API calls, 9 threads want to do CPU-heavy stuff (even though API-calling may be less CPU intensive because of waits on the video-card-bus). Because of this, rendering may be delayed by 'incorrect' context-switched made by the OS.

Four simple solutions (if you don't want this) are:

  1. Do the [queue work] after API-calling
  2. Set a higher affinity to the main-thread.
  3. Reduce the worker-threads to 7
  4. Queue no more then 7 jobs before the API-calls (and prevent the queue from ever having more then 7) - this requires the jobs to be separated into smaller chunks (e.g. 100 'jobs', each 1/100th of work using a fixed-size queue of 7 with 8 threads) To draw an analogy with the Sleeping barber problem:
    1. There are 100 customers wanting their hair cut. (total jobs)
    2. There are 7 seats in the barber-shop (queue-size)
    3. There are 8 barbers in the shop, one of them is just opening the door for when there is a new empty seat. (threads)

    This requires the implementation of a Semaphore (Yay, Go! Dutch Computer Scientists! Go!)... a pretty fun exercise! Having a data-structure like that allows you to create more then 1 barbershop (BLS-queue), having control over load by queue size and having barbers work at any shop they want. (For most cases I recommend the number of barbers to be: number_of_cores + number_of_barbershops, or translated: recyclable_thread_count = number_of_cores + number_of_bls_queues). Using this method you can even recycle threads (barbers) to be used on different queues (barber-sops).

Personally, I would never pick option 3 here:

  • Hypothetical guy asks: "What if I introduce threads somewhere else in my code, do I need to keep reducing these worker threads?" No, that would be bad. Also keep in mind what would happen when you run this example on a single-core CPU. Having 0 worker threads may become a bit unproductive :)

Personally, I would have picked option 4, especially because we often can't use a share-and-divide tactic on jobs because we don't always know how many jobs there are beforehand. (we cannot cut the cake in equal portions if we don't know yet how big the cake will be)

The more interesting comparison would be:

  • "Your MTR without BLS" vs. "MTR using Triple buffering without BLS"
  • "Your MTR with BLS" vs. "MTR using Triple buffering with BLS"

The approach where you wait for the renderer to complete (in both your "with MTR" and "without MTR") may be harmful for networked game, because we don't want to wait for some guy on the internet with a terrible video card. So the comparison should include a set of ' problems' that for which the comparisons will be made.

Testing:

For testing purposes I can recommend to introduce some code to introduce additional fake-load:

  • A thread-sleep before the renderer (to simulate GPU-load)
  • A thread-sleep on the main-thread to introduce delay in the main thread
  • A thread-sleep-jobs (so some cores are simulating heave CPU-load

The amount of time that you let these sleeps use will be a variable that you can modify by using the keyboard so you can see (at run-time) what effect fake-load will have on several portions of the code. It will give no guarantee of course (because the CPU-cores are not actually busy), but it's better than nothing. An alternative would be to introduce are recursive "fib(x)" function to calculate Fibonacci number inefficiently where x becomes a variable you can modify with keyboard-input during the tests of your engine.

Shhhh, no physics now, only extrapolation:

Haha, that is a great phrase, but there is truth to that :). Extrapolation is not the same as physics simulations. Extrapolation is a local algorithm; it only looks at data for the object itself. Physics also needs to look at collisions and more importantly collision-responses and is therefor not a local algorithm.

Small summary of the process of creating physics engine (for other readers, as you already know):

For people reading this that are interested in developing physics engine for them selves, this is the usual path at creating a physics engine:

  1. Create simple bounding boxes (AABB/spherical/...) + overlap detection
  2. Use Euler
  3. Acknowledge Euler integration is bad
  4. Implement something like RK4 and fix time-steps with deltas
  5. Add simple collision response (see below)
  6. Improve collision response (see below)
  7. Add different bounding boxes (like convex shapes)
  8. Implement it in 3D
  9. Apply rotation/friction/...
  10. Add springiness (and possibly stress-levels)
  11. Support networked physics
  12. Allow objects to be attached to each-other
  13. Allow breakability.

A first draft of collision-response usually goes like this:

  1. Do motion
  2. Detect overlap
  3. Modify motion vectors

A better version is soon to follow, such as:

  1. Do motion
  2. Detect overlap
  3. If overlap place the 2 objects a little bit back in time (to time of collision)
  4. Change motion vectors
  5. Move the 2 objects for a a fraction of the time-step using new motion vectors
  6. Rinse and repeat (from step 2). Repeat for a fixed number of steps to prevent infinite loops to resolve most (if not all) physics responses.

Because of the "detect overlap" (usually like: octree-grouping -> coarse grained AABB -> convex hull) it will never be a truly a local algorithm, but we can still do multithreading! Yay! But multithreading has a very serious consequence, it may destroy determinism if the order of operations is changed. There are some subset problems that can be solved independently of each other, if implemented correctly (e.g. collision detection within 2 separate octree-cells while having specials rules for overlapping objects).

Malloc's:

Indeed, the method you used in your example does not overly abuse malloc's. We resolved this in chat after having a good opportunity to look at source-code and discussed possible ramifications. There is another technique you might want to consider. Besides shallow and deep copy there is another possible approach: lazy copy (also known as copy-on-write). For large datasets that are copied a lot and often do not get modified (but sometimes will) shallow copy is a pretty good alternative. Unfortunately this requires writing some of you own datasets because std's containers doesn't provide lazy-copying to my knowledge (but I could be wrong, because it has been a while since I used C++ for a big project). I'm still wondering a bit though, it is not one big memcpy right?

  • Assuming memory-alignment of all child-tree-nodes:
    It is likely that there are as many memcpy's in your design as there are non-leaf tree-nodes, and possibly realloc's when datasets are modified. To enforce memory-alignment on different node-types, it's probably best to use an 'union' in combination with an enumerator (as opposed to the 'virtual' keyword for polymorphism)
  • Assuming no memory-alignment of all child-tree-nodes (and pointers are used):
    A lot of malloc's are required because we cannot do "*myNode = *otherNode" when the object's memory size may differ.

The reason I think this part is very important is because in a lot of designs, in the part where render-states are copied, all threads (besides the one copying) need to wait. Even if you have a design where this is not the case, the thread that is responsible for rendering needs to wait until the copy is done. Do you really want to introduce an operation such as this, that reduces frame-rate? Keep in mind that the render-state may become pretty big for some games!

Person2 (#5903454104)
   
 

Summary: A summary of a discussion in the chat to state several problems to solve in our game-engine-design so that they can make a comparison of two designs that have the same goals in mind.

 
 

Summary of discussion in Chat (typed by random guy #5903454104) :

Let's both create a game-engine design with a ridiculous high standard for the set of problems it wishes to solve. The idea is to solve all the difficult problems in the design, so we can simplify it using configuration-options for when there is no need for additional complexity. These are the requirements (problem statement):

  • Reading user input at 120 Iterations Per Second (IPS)
    Reason:The game should feel responsive.
  • Physics at 20 IPS:
    Reason:Assumption that the physics of the world is just too heavy to do at every iteration.
  • FPS should not be limited and can easily reach 400 FPS.
    Reason:New rendered frames may not blindly use render-states, because that would make rendering at a high FPS useless (each frame should be a little bit different and not just carbon-copies of the previous frame).
  • All player's actions should feel responsive.
    Reason:User input for camera rotation must be shown on-screen as soon as possible. Camera-rotation will not use physics checks and it will be senseless to wait for physics updates to 'finally' react to input from the user. Still, all users actions should feel responsive. When the user is walking into a wall it may not stutter (because of blind extrapolation) and we may not allow to have a IPS of 20 to dictate the responsiveness of inputs.
  • We should be able to do (possibly very complex) AI-calculations and this should not (or barely) affect FPS.
    Reason:Assumption that AI calculations cannot be made at the same speed as normal render-state-updates.
  • If rendering is very slow, such as 18 FPS (because of an outdated graphics card or a temporary hick-up), but the CPU is decent, the game-speed should still run at the same speed as when rendered at 400 FPS.
    Reason:It may never happen that the game will run in slow-motion because this may lead to disconnects when used for a networked game.
    Example 1:Imagine a networked Minecraft-like game where someone does something that is enormously taxing to video-cards (e.g. super-explosion, but with better graphics) for everyone watching it. Is it possible for people watching the event to disconnect because the server detects a too large time-gap between server-state and client-state?
    Example 2:In a very competitive game (e.g. StarCraft II) when armies get large and lots of polygons are onscreen, will a low FPS of PlayerX lead to messages such as "Waiting for PlayerX" on the other player's computer?
  • The updates for the game-state should be deterministic.
    Reason:No thread-ordering may destroy determinism. When the simulation is run twice and input is replayed at the same frames nr. the result should be exactly the same.
    Example 1: Image a networked minecraft-like game where someone does something that is enormously taxing to the gamestate (e.g. super-explosion), can we use client-computation to reduce network-load for gamestate-modifications that take place deterministically?
    Example 2:In a very competitive game (e.g. StarCraft II) can we use client-computation for gamestate-modifications without introducing a system that can be abused by hackers? Can this network-game still be able to feel responsive and where hash-checks are only used for resolving minor discrepancies?
  • We should have some solution to blocking function-call for things like File IO, Audio-playback or network-functions.
    Reason:No blocking function may cause the CPU cores to go idle while there is still work to do somewhere else. Granted this is the easiest of the list, but lets include it anyways.
  • The renderstate will be quite big in normal RAM. Not as big that you can't have more then one render-state in memory at once, but big enough that it might give you concern.
    Reason:We don't want to have more than necessary renderstates in memory, they need to be small in size and fast to edit when the worldstate (and therefor also the renderstate) is modified.
    Example 1:Will renderstate-nodes have a references to physics-states? Will renderstate-nodes have references to AI-data? Are renderstate-nodes allocated separately on the heap? Are renderstate-nodes memory-aligned?
    Example 2:Are renderstates fully copied every frame that we want to draw a new frame? Are renderstates fully copied every frame what we want to update it?
    Example 3:How will the used method handle games that want to use a lot of CPU-based LOD-transitions? Notice that we're talking about CPU-based variability of polygon complexity and not about video-card tesselation (GPU-based variability of polygon complexity that can also be used for other cool stuff).
  • We should be able to manage the load on the video-card bus.
    Reason:Assumption that we apply techniques that are very stressing for the video-card bus (e.g. virtual textures)
    Example:If we are continuously uploading textures to the video-card and want maximum performance out of it (assuming different bus-speeds on different computers), how are we scheduling these updates? We don't want the texture-updates to get in the way of FPS, or that it causes the FPS to become very irregular.
  • We should apply culling of objects that are not visible.
    Reason:Assumption of a huge 3D world where some parts are pre-loaded in memory, but not visible yet because of camera-angle, draw-distance or delayed LOD-transitions. New chunks need to be regularly loaded from a slow hard-disk (or network-connection) well before they become visible. Where is the crude CPU-based culling applied? Where is the fine-tuned CPU-based culling applied? Do we need to differentiate between these two? Will the use of extrapolation cause errors in culling-techniques?

These are the difficult issues that you don't often hear about. The internet is full of MTR-techniques for engines that most software-engineers could have thought of themselves. If we are picking or creating an engine then we need to ask ourselves these questions. Is the engine-design really good enough for what we have in mind? Are there engines that allow for compile-time-switches or run-time-switches to switch between different approaches? If we want to dream big, we need an answers to all these questions before starting to build our game. Talking about multithreading problems in a game-engine for a 2D-platformer with physics is just too simple to be really interesting; some of the hard challenges may get overlooked. I want a challenge. Challenges are fun.

 
   
 

Summary: Random guy

  • shows that the use of extrapolation is not limited to rendering more frames than we are updating;
  • states that it not easy to multithread physics without influencing determinism;
  • has made a basic flow-chart for his triple-renderstate design for two scenario's (slow FPS and fast FPS).
Person2 (referee)
 

For everyone that is enduring this mile long discussion, here is some music:

When does extrapolation for rendered frames make sense?

  1. If you want to solve the issue of rendering new frames when FPS is faster than IPS.
    When no extrapolation is used the following may be considered a partial solution: When we allow quicker updates of the camera-matrix by either having an input-reader that really reads input at a very high IPS, or by smoothing out transitions. Even so, you might still want to cap your frame-rate (e.g. by using vsync) otherwise some players may get unpleasantly surprised.
  2. If you want to solve the issue of minimal stuttering when there is discrepancy between IPS and FPS (e.g. IPS=60, FPS=50)
    When we have a simple linear motion of a ball over the x-axis such that its x-postition is increased by one:
    x-position of ball in updates states:123456789101112131415161718192021222324252627...
    x-position of ball in rendered states:12345789101113141516171920212223252627...

    Keep in mind that letting the update-thread wait on the render-thread is bad practice if we want to have a deterministic networked-game. If extrapolation is going to use motion-vectors then this implies that the render-thread needs access to the physics-state (bigger states in memory + bigger copy-actions each frame), unless you use something like triple-buffering of course where only one physics state is required (because extrapolation uses matrix-differences between two render-states).

  3. Will extrapolation help in making the game feel more responsive?
    No I don't think so, but it won't make it worse either.

Render-Extrapolation and Culling:

Extrapolation on the render-state will give rise to minor changes in positions that may require us to adapt the culling-algorithms. We may need to adapt the frustum culling algorithm to use a slightly bigger frustum or we need to apply frustum culling only when we are absolutely sure where the object is drawn. Other culling techniques, such as simple distance-culling don't need to be so picky.

Multithreaded Physics:

Naive implementations of physics are easy to multitask and it is also easy to "cut the cake in equal portions" (divide the job equally to multiple threads). These implementations of physics will not break determinism when applying multitasking, so it seem a bit silly why I consider this to be such a hard problem, right? There is good reason why I'm calling some physics-methods naive, because it only looks like they work well in theory. All popular physics-engine-builders have realized this issue. When a single update-step is done by the physics engine, it is very well possible that more than one iteration of updates takes place. This sounds very weird for people that have not looked into the innards of a physics-engine before, and it may seem like an inefficient implementation. To have the (default-enabled) possibility of multiple iterations within one update-step is the direct result of strange effects that will occur with physics when just a single iteration is used. To give an example to what kind of iterations I'm talking about I will show the documentation of an existing physics engine, Box2D - http://www.box2d.org/manual.html#_Toc253068188. Box2D-physics-iterationsThis is the direct result of using step-sized physics (as opposed to the more CPU-heavy continuous physics). If we want to have a deterministic physics engine and want to multithread it, we can't just split these iterations in separate jobs that we force sequential computation on. If we are to multithread it over several computation-iterations for a single step then our physics-engine will have to wait a lot and this leads to huge delays. To conclude: To have multithreaded step-based physics on medium/big deterministic worlds it is best to apply one thread per area (e.g. Octree-cell) and define special cases for objects that overlap multiple areas.

Start of the design, only looking at the easy parts (for now):

Below you can see my workflow-designs using triple-buffering on the possibility of Fast FPS and Slow FPS and marked the differences with a yellow marker. There is still plenty left to design (introducing deferred physics without losing responsiveness, a couple of BLS-work queues, separation of render-API-draw calls and render-API-upload calls for preloading, ...) but lets not complicate things from the get-go.

fast_fps

slow_fps

Question: Why would you make the extrapolator a separate thread to the renderer-thread? If you have two threads at the same IPS and one (renderer) is dependent on the other (extrapolator) it might as well be one thread.

Answer: The renderer now signals the extrapolator with "Render Done" so the data for the next frame will be extrapolated. This is currently done after the BackBuffer-FrontBuffer-flip, so this doubt about the design is justified. The idea of this separation is that that it is possible to send the message "Render Done" actually a little bit before rendering is done. It will not break anything in the code. This makes it possible for the renderer to signal the extrapolator to start working on the next frame before the current frame is even completely drawn. We may need to rename the message from "Render Done" to "Next Extrapolation Requested" though.

It has to be said that I'm not a one-hundred percent certain about the extrapolator-renderer thread separation. There are both pro's and con's to consider. The most fun solution would be to let the software itself find out what design is better by running small test before actually running the game :P. This should not be that difficult to implement.

Hehe, and just because I can... remember this?:

Person2 (#5903454104)
   
 

Summary: Random guy

  • states the importance of deferred physics for some games;
  • shows that it really difficult to keep responsiveness when applying deferred physics, but that it is still possible;
  • shows a flowchart for how deferred physics can be implemented in the triple-renderstate design (while keeping player-physics responsive and fast).
Person2 (referee)
 

Applying deferred physics without losing responsiveness:

This is not an easy issue to overcome and is often skipped by game-engine-builders because a lot of them think that it is ' stupid ' to run physics at a slower pace. The argument used is: "Applying deferred physics will lead to a reduced responsiveness of the game". This is very naive because of the following reasons:

  • Responsiveness is becoming more and more important, that running the input-reader-loop at ridiculous high IPS is no longer " just for fun". When games are used for virtual-reality environments, the response must be immediate and may not wait for some complex physics calculation on the world.
  • It actually is possible to have deferred physics without reducing responsiveness of the game. Coming to a wrongful conclusion that it is impossible usually stems from thinking about this topic in the setting of a simple rendering-loop (Ad-hoc modifications of flawed designs).

Oculus-Rift *click here for more info*

Just to be clear: these are my words. I just thought that an image of John Carmack using Oculus Rift would give some of you guys more motivation to think about these issues without brushing it aside. Just click on the link below his image and watch the discussion he had at QuakeCon 2012. Ahhhh.. just thinking about using Oculus Rift in combination with networked-games makes me think of a brighter future :)

My solution to this problem comes from the following observations:

  • Physics-response is the cause of being CPU-heavy in good step-based physics algorithms. It is also the cause of why multithreading on deterministic physics is so difficult.
  • Running world-physics at 20 IPS, while running player-physics at 120 IPS is possible (with some precautions).
  • Calculating player-physics (with response) only, is a sub-problem of calculating world-physics (with response). This sub-problem can be solved much faster.

To apply this approach on my triple-buffering-design, I will make the following modification to the update-thread:

deferred-physics

In the 120 IPS loop reads the input and no world-physics-response will be calculated. The 120 IPS thread WILL calculate a physics-response for the player-object(s). Using this addition to the design I currently have 4 threads and I have not even added job-queues with multiple worker-threads yet 😛

Notice that by using this design we have separated the physics-state from the render-state. Even though position-data and some bounding-boxes are shared between renderer and physics, duplication of some of this information doesn't hurt a bit (e.g. we are not using convex-hulls for culling, just for physics). Properties like weight, friction, terminal-velocity, gravity, fracture-data and stresslevels do not belong in our render-state. When we are not interested in deferred physics, 1 physics state will be enough, even when using my proposed triple-renderstate-buffering technique.

Person2 (#5903454104)
   
 

Summary: Random guy

  • shows his design for worker-queues and threads and gives a description for each one of them;
  • show how some threads can communicate with each other without introducing lots of mutex-locks.
Person2 (referee)
 

Job batches and worker-threads:

The design-sketches as posted before do not yet show a complete picture, so let's show how job-queues and worker-threads will be used: jobs-and-workersAs it turns out, drawing a circle is really difficult!

Video-card queue(s):

In previous posts we talked about how doing multithreaded video-card API-calls has no real merit (besides allowing us being lazy software-engineers). These API-calls are asynchronous in nature (returning immediately - even before the job is done), but order is still ensured. This is important for the front-to-back (faster) / back-to-front (for translucency) rendering techniques. So let's move away from this easy topic and continue on some more difficult issues. Let's assume we are going to integrate virtual-texture technology in our engine. What modifications do we need to make to the design?

We want to upload new texture-data to the video-card as quickly as possible. One of the most problematic scenario's for this technology is: "What if the user quickly turns the camera around?". We want to optimally use the video-card bus and use LOD-like transitions on texture-data without impacting the FPS too much (we can't wait for all textures to get loaded before rendering a new frame). We want to schedule texture-uploads to VRAM without reducing FPS to a crawl at camera turnarounds. If we are using the same work-queue for VRAM-modifications and Draw-commands, then we are not able to use heuristics to streamline texture-updates (texture LOD-transitions). If you only consider video-cards with very high bus-speeds or don't care about uploading textures heuristically in the idle time between the rending of 2 frames then it is okay to forget about this separation.

Audio queue:

Usually we don't have to worry about this at all. The library that our engine uses for audio probably already supports async-calls for audio. It is unlikely that we need to implement this ourselves, but I added it to the sketch for sense of completion. If the used audio-library only supports blocking function-calls then you might want to consider adding this in. Sometimes there is even a dedicated thread for music (because you don't want a song not to play because a lot of sound-effects are going on), but let us ignore the minor details for now.

File-IO queue:

This is another one of those things that you absolutely do not want to blindly multitask in one big run-of-the-mill job-queue. With technology slowly moving away from the linear "load -> render" to a more continuous "load + render" you want to make sure that doing File-IO has barely any impact on rendering speed. Reading the content of 3 files simultaneously will not speed things up! And just as reminder for some people: "Reading some large file does not need to lock up a CPU-core. It can lock up a CPU-core if you're lazy programmer, but this definitely doesn't need to happen".

Network-queues:

Everyone that has written some (decent) client-side TCP-IP communication code knows that TCP-IP makes full use of the full-duplex nature of the wires. We can send and receive at the same time and even the order is guaranteed! Having multiple threads receiving data using the same socket is just... wrong. You can't have more than 1 thread on a recv or more than 1 thread on send for the same socket. To have multiple threads have access to these queues is, for lack of a better word, stupid. For a single socket with duplex communication you need but 2 threads. Only for server-side communication (with lots of connections) is the pooling of TCP-IP-connection threads a good idea, but the reason for it is completely different: network-load-balancing. In the design it can be seen that the 'recv-loop' sometimes directly adds a new message to the send-queue. This it not necessarily bad design, because some messages are just ping-like messages that have nothing to do with updating world-states. We could reduce the number of threads if we only use asynchronous versions (for Windows that'll be WSASend and WSARecv), but I doubt whether that really makes the performance better.

Physics+Ai queue(s):

This will contain all the CPU-heavy jobs and needs to be scalable for multicore CPU's. All those other threads described above are not that taxing on the CPU-cores, they just wait a lot for messages from the underlying hardware (supplied by the OS/drivers). The OS will use context-switches, when appropriate, to give threads that can actually do something time on a CPU-Core. I do need to explain a bit on why I have separated them in 2 queues though, that was not an obvious choice at all (and merging them into 1 can be justified in most cases). It all depends on the data-structures used for physics and AI. If there is severe overlap in used data then having 2 queues may be a bit weird. When AI data-structures are very different from Physics data-structures, you may just as well split them. Having a separate queue for AI makes it possible to prioritize one queue over the other at certain points in time (important for deferred computation when running several calculations at different IPS's). A separation of the queue also helps in simple bottleneck-detection (e.g. "Is it AI or Physics that slows down the game?"). Whenever possible I would split AI data-structures from Physics data-structures otherwise the code might quickly clutter.

arnold-the-thread

The AI/Physics have two-directional access to the queue because jobs can often be split in child-jobs. It is often not possible to divide this work in equal portions, because this requires us to know all the work beforehand. Most algorithms are too complex to have this perfect foresight, so we'll just split it to several decent-sized job-chunks. A thread, let's name him Arnold, will be able to do some work on the chunk and split some of its work into smaller (but not too small) jobs so other threads, like John and Bruce, can help that poor fellow out. We can't expect Arnold to do all the heavy lifting, now, can we?

Minor rant towards random (possibly fictitious) people:

If for some reason people are under the impression that applying a singular thread pool pattern to a game-engine solves all problems, then it would be best to just ignore them. Someone who just learned about data-redundancy will also tell you that having two render-states in memory should be avoided at all costs, but do you really want to listen? Still want to use one queue with multiple workers for everything? Well, it'll be pretty funny to see the FPS drop because the game tried to load a couple of files while also maintaining several network-connections. Having an FPS of 0 with 0% CPU-Load and 0% GPU-load must be pretty 'mysterious' to those guys. "How could it be that the render-jobs are not being picked up?". The same people will probably think that they can solve this minor issue by using a simple priority-queue or by cranking up the number of worker-threads. Ha! I rather live in the real world.

Inter-Thread Communication:

We don't want mutex-locks on data-structures that are also used by algorithms that run at different speeds or at a different phase (obviously). The solution to this issue comes from looking at how WndProc works for handling the message queue. If we want to run some code on the GUI thread (because the GUI is owner of the data-structures used) then in C# we would do something like this:

BeginInvoke-example1

Or, if you want to be really lazy and want to have but one general-purpose-function that can be used by any thread (including the GUI-thread), it'll be something like this:

BeginInvoke-example2

For non-C# people:

  • The InvokeRequired is basically checking the owning process-ID of the window-handle to the process-ID of the current thread.
  • The returned IAsyncResult is like a wrapped monitor (not the same though), so the caller has the option of waiting for the message to get handled. (If you always want to wait then Invoke is used, as opposed to BeginInvoke, but this is not generally accepted as best-practice)

In C++ (without GUI-libraries) you'd probably do a PostMessage with a user-defined message (e.g. WM_USER + X), supply the function-pointer and object-pointer as WPARAM and LPARAM so the WndProc knows how to handle the job-request.

Keep in mind that you surrender ownership of the argument (newText in our case) to the thread that is handling the Job. Debugging the surrender of ownership in multithreaded application can be very difficult. When the value is pushed on the stack, things are easy. When objects are allocated on the heap however, things get tricky. The C# example above is just easy because C#-strings are immutable and we don't have to delete these objects ourselves. In C++ I can recommend the following approach in helping you visualize and debug data-ownership.

Keep in mind that I'm not talking about mutex-locks around data-structures, but about having a single thread have ownership of data (and noone else is allowed to touch it until ownership is relinquished). For Debug-builds I often use caution about data-ownership and add this Debug-code:

  • For every data-structure with dynamic thread-ownership I keep a variable that contains the Process-ID and some inline setter-function (or macro) to change it.
  • For every public/protected function the first thing I do is to check the current process ID against this variable and throw an exception when it is different.

This is just a precaution to protect me from myself. I often am stupid (:|) and try to access some data that the thread has no ownership of. I'm not talking about the Windows message queue and definitely not about MSMQ! I'm talking about how the windows message-queue is used and how this technique can also be applied for our inter-thread communication needs. There is no need to use mutexes around big chunks of data and inter-thread communication can be quick without much overhead. Just keep in mind that it is best to keep the jobs in the queue sizable (in the sense of time spent to run the code). When this approach is used as a method to schedule every single tiny task for the sake of " nice code", I'm pretty sure that even the minor overhead will cause severe performance loss.

Person2 (#5903454104)
   
Person1 (referee)

Summary: Bourgond Aries

  • wonders how the OS is determining when context-switches are made;
  • thinks that it is okay to let players with a terrible video-card not to be able to play networked games;
  • thinks enough time is spent pondering pros and cons and that a master-framework should be design to facilitate testing;
  • proposes to use an existing rendering and physics engine and build a test-framework around this.
 
Person1 (Bourgond)

Well for this iteration there's a lot to go through. Let's start with some your proposal of

Alternative to used workflow: Modification to "Without MTR" ... can this still be called "Without MTR"? Don't think so :P:
The main thread is now basically the render-thread (plus state-copier)

This is an interesting assertion and I think it will work just as well as my previous model; if not better. I don't think one should be too concerned about the OS and its context switches.

Secondly; I agree with your opinion on:

Hypothetical guy asks: "What if I introduce threads somewhere else in my code, do I need to keep reducing these worker threads?"

But I wonder; is a thread that sleeps or performing a blocking operation explicitly showing the OS that it needs to yield? If so; the OS could perhaps skip its timeslice and give CPU timeslices to actual working threads. This would perhaps imply that it is not necessary to limit ourselves to the amount of threads equaling the CPU cores.

You also commented on a some continuity where one piece of hardware stalls much more than the others. To that I have to say:
When a game is networked and a guy with a terrible video card tries to play it; then I suppose that person should simply not be able to play them game if on the lowest settings; the game lags due to the CPU being stalled for too long.

You also proposed to simulate a load with sleeps; but as I commented above; I think we'd need to use fib(34) or something equivalent in order to simulate a real load. The OS kernel may simply let the thread yield and give all timeslices to any other threads.

Also the thing about order-independent multithreading. I think I've commented this before and it is the main reason why there is a copy of the state. With a copy; each thread can read from the entire copy whilst writing to their respective parts in newstate. Since the oldstate can not be modified during this operation, the newstate will be order-independent.

I then would like to consider the operations performed when a rendering tree is copied in order to multi-thread rendering. It is according to my knowledge so that a tree will only create/delete new nodes if a tree that is copied into another tree is different. I do not think std contains a copy-on-write. Therefore it can be possible to give every significantly large subtree (or every node) a "changed bit"; which notifies whether anything has changed. Assume nothing has changed in 1 frame, an accumulation counter tells that there are 0 changes. If a frame has changed the render state with 3 changes; we will perhaps need to iterate the tree until all 3 edits have been found and copied.

I think that we can in fact do:

*myNode = *otherNode;

if we overload the operator= and both nodes are classes. The operator= will call all subclasses' operator= which will afaik check the sizes of containers against each other in order to copy safely. Thus; same-sized container copying will require no new/mallocs. When such a change does come about; there will indeed be mallocs/new.

I think we've spent enough time postulating the different uses, implementations, and pros/cons.

Your other posts were interesting to read. To however get some conclusive answers; I think we need to implement some kind of master framework where everything we have discussed is tunable. I suggest we use a 3D rendering engine with a 3D physics engine and construct the framework around this.

 
   
 

Summary: Random guy

  • gives insights in context-switches made by operating systems;
  • thinks that people with inadequate GPU's should be able to play networked games, especially because not everybody is playing at the same FPS;
  • warns about memory copying when the game uses something like a cellular automata where the world-state can easily be several hundreds MB;
  • gives an example on the use of render-trees;
  • gives an example of using pointer-switching in a double-state design;
  • states that using an existing game-engine for testing different engine designs is troublesome.
Person2 (referee)
 

Blocking Operations

  • Sleep
    By design this will invoke a context-switch. It is even possible to do a Sleep(0) to force a context-switch if you so desire. It is very unusual to manually invoke context-switches like this, and in most scenario's I would disapprove of using Sleep like this. Quote from MSDN:
    • "After the sleep interval has passed, the thread is ready to run. If you specify 0 milliseconds, the thread will relinquish the remainder of its time slice but remain ready. Note that a ready thread is not guaranteed to run immediately. Consequently, the thread may not run until some time after the sleep interval elapses. For more information, see Scheduling Priorities."
  • File-IO
    Let's take Windows blocking fread as example. This function is fully blocking and will lock out other threads during it's task. It will take up one core until the job is done and no context-switch will occur on this thread during this period. This is most likely exactly how you'd like it (no CPU-heavy tasks will delay a file-loading task that has already been picked up). If as-fast-as-possible file loading is not important, then _fread_nolock can be used, which does not protect against context-switches.
  • Network-IO
    Let's take Windows blocking recv as example. This function blocks the thread that it is currently running on, but it will not lock up a CPU-core. The OS will most likely do multiple context-switches every time every time an actual packet can be read from the wire. It is possible to do a non-blocking receive (or send), but this has no tangible merit compared to the much nicer separated recv-thread.

People with inadequate CPU/GPU

I'm a bit more lenient on this topic, because there are so many different gamers out there. Some of them like to play at medium-fidelity graphics at 120FPS while others like to play at high-fidelity graphics at 50FPS. This topic becomes even more important for games with a lot of freedom (e.g. minecraft), where it is becoming increasingly difficult to design a game with a steady FPS.

Memory usage / Cellular Automata

If our game uses Dwarf-Fortress-like (or Minecraft-like) cellular automata and this is part of the physics world, it is reasonable to assume that the renderstate+physicsstate together takes up like 200MB in memory. Assuming a peak transfer-rate of 6400 MB/s and we do not need any mallocs or realocs it would take 31.25 milliseconds times 2 (read+write) for the copy action. This time locks up rendering and the FPS cannot be higher than 16 (even lower because rendering itself takes up a small amount of time). We need smart-copying, with dirty-flags or using our own lazy-copying-containers, to reduce the amount of copying. The CPU usage for this smart-copying must outperform the blind-copy, which is doubtful when the world is reasonably dynamic (e.g. A blind-read-write of 200MB might become a 75MB read + 40MB write + CPU-time).

If have written a very simple 200MB memcpy speed-test in C++ which can be downloaded here (source+exe), compiled as release-build on Microsoft Visual Studio 10 as x86-executable. For me personally, on an Acer Aspire 7740 with 4GB DDR3 SODIMM, the results are:

MyMemcpySpeedTest Just make sure you have well over 400MB free RAM so the memory blocks are not scattered in physical memory.

Real-world scenario's will have worse performance because we also need a lot of reallocs. We also need to do several memcpy's on parts on the renderstate, instead of one big one as the test above. For people interested in the inner workings of memcpy I can recommend this article.

The idea of triple buffering on rendering states is that you never want to wait and that we fully decouple the update-thread from the rendering-thread. By design it forces us to make a separation of "that what can be rendered" and "the world we need to simulate.".

If this cellular-automata-game was implemented in double-state buffering (as opposed to triple-renderstate-buffering) it would take up more memory and be slower. Not many people would use double-state buffering for such game and switch back to a single-threaded design (with BLS). I however see this as a challenge and ask myself: "How can I create a reasonably big server-side dynamic world with cellular automata with lots of clients that are unlikely to slow each others game down that does not use insane amounts of network-traffic and still feels responsive?". People may call me nuts for thinking it is possible, but I like nuts.

Render trees / Overloading operator=

Operator overloading like this is most likely a bad idea because it really obfuscates some of the code and it might have some nasty side-effects. In the example used it seemed to me that this separation was made for the render tree:

  • leaf node, containing actual information to be drawn
  • non-leaf node, a container for child-nodes

Just to clarify, this is not how render-trees are usually created because of the way matrix transformations are used for relative motion. As example of a very simplistic animation of a couple of enemies, a part of the render tree may look like this:

                                        /
                               EnemyCollection             
                               /             \           
                              /              ...         
                    HumanEnemy1                        
    ________________/ / |  \ \_____________            
   /                 /  |   \              \           
   LeftArm       RightArm  Head  LeftLeg     RightLeg      
 /                /           \              \         
LeftLowerArm   RightLowerArm   LeftLowerLeg   RightLowerLeg
|              |               |               |       
 LeftHand       RightHand        LeftFoot       RightFoot  
 / /|\ \         / /|\ \                                   
/ / | \ \       / / | \ \                                  
p r m i t       t i m r p                                  

For any node in the tree, its rendering position is determined by the matrix of it's parent multiplied by the matrix of itself. With OpenGL, the renderer will just be walking this tree depth-first and do a "glPushMatrix" (with a new multiplied matrix) when going down and "glPopMatrix" when going up. There is no need to do a push/pop on the EnemyCollection-node (because it doesn't contain relative position information), but if render-trees a going to look like this then it'll be tough to enforce data-alignment on all nodes.

It is also undoable to force one collection-type on the game-builder that is using your engine. For some people EnemyCollection could be:

  • a std::list<Enemy> with data-aligned objects for enemies.
  • a std::set<Enemy*> with pointers because there are lots of different enemy types of different sizes in memory.
  • a std::map<int, Enemy*> with integers as id's, because hell, why not?
  • a mycollection<Enemy*> because we are grouping enemies into octtree area's for some reason and this seemed like a fun thing to do.

The engine should probably be agnostic about the collection types used so it does not hinder game-development, unless the engine is for the creation of a specific type of game where we know beforehand what data-structures are best.

Pointer switching when having a double-state

In the work flows used before (to describe your double-state design based on state-machines), I think it is good to notice that pointer-switching is still possible. You will still have two complete states (including the physics state), but the amount of copying can be reduced by having a modification in your engine. In pseudo code that would be:

State state1;
State state2;
State *stateThatMayBeWrittenTo;
State *latestUpdatedState;

void Main() {
WaitTillWorkFinished();
if (latestUpdatedState == &state2) {
latestUpdatedState = &state1;
stateThatMayBeWrittenTo = &state2;
} else {
latestUpdatedState = &state2;
stateThatMayBeWrittenTo = &state1;
}
ReadInputAndQueueWork(); // Queues work that reads from latestUpdatedState
// and writes to stateThatMayBeWrittenTo
Draw(); // Draws latestUpdatedState (read-only)
}

All the work that the worker-threads will do must take into account that they are reading from a different state than the one they are writing to, but it'll definitely be quicker. This is basically what Erik McClure described in his post under "Multi Thread Low Load", but he didn't like that because of the 1 frame of input lag.

Testing framework

Creating a testing framework on an existing rendering-engine could be done, but I guessing you're talking about a fully fledged game-engine here (and not just about OpenGL/DirectX). Implementing our own engine-designs in an existing game-engine is very troublesome, especially if the game-engine is not open source. As it stands, our methods cannot be easily compared in a testing-framework, especially because I've implemented the triple-renderstate technique in C# with XNA's DirectX while yours is using C++ and OpenGL. Maybe I'll implement a double-state and single-state design later in my C#-engine so I can do some tests of my own, but that is not something I want to spend time on right now. I think it is more fun to tackle all the difficult stuff that we discussed in the chat, or I fear that I won't make any progress and get stuck on building tests.

Person2 (#5903454104)
 
 
 


Th-Th-Th-That's all, folks!

CubifyExtrapolation
  1. Attila says:

    Guys, what you just did is insanely good.
    Such open-minded knowledge-sharing touches me so deeply that I might fall asleep in tears.

    I wish I could add anything to the topic, but at the moment you are my teachers, and I am the silent student.

line