Digital Foundry Article Technical Discussion Archive [2015]

In TLOUR, to keep the CPU cores busy and reach 60fps, they work on several consecutive frames (parts of them, obviously) during the same 16.6 ms frame time, using fibers and heavily jobified code... and a lot of memory to keep track of several frames simultaneously.

http://gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine

No denying it, Naughty Dog loves challenging themselves and finding (or creating) paths to achieving their goals. Developer Gods they are... :yep2:
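
For anyone wondering what "jobified" code actually looks like, here is a minimal sketch of the general idea (my own illustration, not Naughty Dog's code: plain std::thread workers and atomic counters standing in for their fibers and job counters). Jobs go into a shared queue, worker threads drain it, and the caller waits on counters rather than joining threads, which is what lets work for two consecutive frames be in flight at the same time:

// A minimal illustration (not Naughty Dog's actual code): a shared job queue
// drained by worker threads, with atomic counters used to wait for batches
// of jobs. Same basic shape as the fiber/counter system in the talk, but with
// std::thread and busy-waiting standing in for real fibers.
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct JobSystem {
    std::queue<std::function<void()>> jobs;
    std::mutex m;
    std::condition_variable cv;
    bool quit = false;
    std::vector<std::thread> workers;

    explicit JobSystem(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lk(m);
                        cv.wait(lk, [this] { return quit || !jobs.empty(); });
                        if (quit && jobs.empty()) return;
                        job = std::move(jobs.front());
                        jobs.pop();
                    }
                    job();  // run outside the lock so other workers keep draining
                }
            });
    }

    // Submit a job and bump a counter; callers wait on counters rather than
    // joining threads, which is roughly what the fiber system's counters do.
    void run(std::function<void()> fn, std::atomic<int>& counter) {
        counter.fetch_add(1);
        {
            std::lock_guard<std::mutex> lk(m);
            jobs.push([fn, &counter] { fn(); counter.fetch_sub(1); });
        }
        cv.notify_one();
    }

    ~JobSystem() {
        { std::lock_guard<std::mutex> lk(m); quit = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
};

int main() {
    unsigned hw = std::thread::hardware_concurrency();
    JobSystem js(hw ? hw : 4);

    // Two frames "in flight" at once: the game update for frame N+1 overlaps
    // the render-command generation for frame N, as described in the talk.
    std::atomic<int> update_next{0}, render_current{0};
    for (int i = 0; i < 64; ++i) js.run([] { /* animation, AI, physics... */ }, update_next);
    for (int i = 0; i < 64; ++i) js.run([] { /* build command buffers... */ }, render_current);

    while (update_next.load() || render_current.load())
        std::this_thread::yield();  // a real fiber would be parked here, not spun

    std::puts("both frame stages finished");
}

As I understand the talk, the real system parks the waiting code on a fiber instead of spinning, so the core can pick up other jobs in the meantime; that is the part the presentation spends most of its time on.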
 
Fantastic presentation. Talk about a love letter to computer science. This also significantly changes my perspective on multicore processing. This will eventually need to be something every programmer is required to understand; this single-threaded stuff needs to go away.

Amazing that the CPU code was approximately 100 ms long for a PS3 game. Even on a high-powered CPU at 3.2 GHz that's still 50 ms, and 2 cores @ 3.2 GHz would be 25 ms. You are still well off the 16.6 ms mark.
 
Fantastic presentation. Talk about a love letter to computer science. This also significantly changes my perspective on multicore processing. This will eventually need to be something every programmer is required to understand; this single-threaded stuff needs to go away.
Those of us working in cluster and server farm environments knew this 15 or so years ago. Even then the writing was on the wall; it was more economical to go wider than faster. It was interesting reading developers' issues with the PlayStation 3 architecture, particularly parallelising (or 'jobifying') a basic problem into many parallel jobs. As always with parallelism, the wider you go, the more latency becomes a technical barrier. If you think managing L1 and L2 cache is difficult, pan out and try managing L4 and even L5 cache.
 
Latency-wise, I assume that if they use vsync or double/triple buffering on the PS3 version, the input lag will be higher than 60ms.
The other thing is that they are 1st party and can basically go over to the ICE Team and say "bug here, please fix", which probably helps a lot. Compare that to the 3rd-party multi-platform guy getting updates from the presenter about fixes in the upcoming SDK ;)

My impression regarding the memory is that they fixed that with the heap tag solution, and that it's not an issue anymore, i.e. they do have memory available.
The question is whether they can optimize the CPU side to cram even more out of it. I mean, it would be disappointing if they had already maxed out the PS4 :)

Also, props to the PS3: it seems to have been/is a computational beast with its SPUs.

I am not at all knowledgeable about serious programming; I write bits and bobs with bash shell, PHP, Python etc., but that fibre system was cool. Even better that it's a multiplatform thing that all developers can use. I expect it would be good on X1 also, but how does the CPU <-> GPU latency on PC fare with it, or does it not matter?
 
Latency-wise, I assume that if they use vsync or double/triple buffering on the PS3 version, the input lag will be higher than 60ms.
By latency I mean that the wider you go, the more latency becomes a problem. Single-chip dual core: L1/L2. Multi-core multi-chip: L2/L3. Multi-blade: L4. Etc, etc. Reliance on data that could be two bus accesses away is latency hell, let alone an external I/O access like the network.
 
But if I understand this presentation correctly, you can't use this job system for large-scale supercomputing, right?

In this presentation shared memory is assumed imo, which is not the case for the architectures I am using.

I basically need to decide beforehand which job every core does in order to decompose the data. I can't afford to freely move a job from one core to another, as this would typically mean taking the whole data structure with you to another physical core and its memory.

Of course, I could use hybrid shared-memory parallelization in combination with MPI and try such a technique, which I'll try to think about.
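
Roughly what I have in mind is the usual MPI-outside / threads-inside pattern. Just a sketch (the 1-D slab decomposition and the field name/size are made up for illustration): each MPI rank keeps its fixed piece of the domain, and only inside a node do the cells get scheduled dynamically across cores, which is the one level where a job-system-style approach really fits on a distributed-memory machine.

// Rough sketch of hybrid parallelism (assumed 1-D slab decomposition and a
// made-up field name/size): MPI ranks own fixed slabs of the domain, while the
// cells inside each slab are scheduled dynamically across the local cores.
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int ncells = 1000000 / nranks;          // this rank's slab (static decomposition)
    std::vector<double> field(ncells, 1.0);

    for (int step = 0; step < 10; ++step) {
        // Shared-memory level: dynamic scheduling balances uneven per-cell costs,
        // playing roughly the role a job queue does within one node.
        #pragma omp parallel for schedule(dynamic, 1024)
        for (int i = 0; i < ncells; ++i)
            field[i] = 0.5 * (field[i] + field[i] * field[i]);   // stand-in for real per-cell work

        // Distributed level: a halo exchange or reduction between steps.
        double local = 0.0, global = 0.0;
        for (int i = 0; i < ncells; ++i) local += field[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

(Built with something like mpicxx -fopenmp.)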

What seems to be impossible for me to use is the multiple-frames stuff... I don't think I can do that in a simulation, as the whole new frame depends on the outcome of the last frame...
 
Furthermore, if I understand it correctly, they introduce an extra frame of latency for input... so two 60Hz frames basically mean the latency of a single 30Hz frame, right? So in theory, the 25ms frame time they had after the first optimization rounds would have had lower input latency than the final release... weird!!! Or am I wrong?
 
But if I understand this presentation correctly, you can't use this job system for large-scale supercomputing, right? In this presentation shared memory is assumed imo, which is not the case for the architectures I am using.
I guess that entirely depends on the architecture. If you're using commercial technology like Cray CS or XC then you're reliant on shallow cache for sharing rather than memory because you're working with/around Xeon and Intel's reference bus designs. If you throw that away and implement your own bus/cache/memory system then you can make different design choices.
 
By latency I mean that the wider you go, the more latency becomes a problem. Single-chip dual core: L1/L2. Multi-core multi-chip: L2/L3. Multi-blade: L4. Etc, etc. Reliance on data that could be two bus accesses away is latency hell, let alone an external I/O access like the network.

Sorry, I was not directly talking about your comment, but an earlier one by Globalisateur. And I am not going to claim I totally understand your scenario, but yes, that seems logical with my limited knowledge :)
 
I guess that entirely depends on the architecture. If you're using commercial technology like Cray CS or XC then you're reliant on shallow cache for sharing rather than memory because you're working with/around Xeon and Intel's reference bus designs. If you throw that away and implement your own bus/cache/memory system then you can make different design choices.

Ummm, I am using a Blue Gene with 250k+ cores atm :)

I am more concerned with getting the right load balance of my work for each thread (BG supports hardware-based HT, which is amazing)... here we do use dynamic load balancing. But this is quite rare... using such a job system just doesn't make sense imo on a large-scale system?!?
 
Furthermore, if I understand it correctly, they introduce an extra frame of latency for input... so two 60Hz frames basically mean the latency of a single 30Hz frame, right? So in theory, the 25ms frame time they had after the first optimization rounds would have had lower input latency than the final release... weird!!! Or am I wrong?
Yes, I believe you are correct. Update code is happening 1 frame ahead of render code, which is happening 1 frame ahead of display. You should feel a 33.3ms input lag on a 60fps game.
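
Napkin math on that, just counting frames in flight times frame time (the single-stage 25 ms case is my reading of the pre-pipelining build asked about above, so treat it as an assumption):

// Back-of-the-envelope only: input-to-display lag of a pipelined frame loop,
// assuming every stage in flight costs one full frame time.
#include <cstdio>

double lag_ms(int stages_in_flight, double frame_ms) {
    return stages_in_flight * frame_ms;   // each stage adds one frame of delay
}

int main() {
    // Shipped game: game update runs one frame ahead of rendering, so two frames in flight.
    std::printf("pipelined 60 fps:          %.1f ms\n", lag_ms(2, 1000.0 / 60.0));  // ~33.3 ms
    // Assumed pre-pipelining build with a ~25 ms frame and no extra update frame ahead.
    std::printf("single-stage ~25 ms frame: %.1f ms\n", lag_ms(1, 25.0));           // 25.0 ms
}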
 
Ummm, I am using a Blue Gene with 250k+ cores atm :)
We have clusters in the farm that size but I can't think of any jobs we've run recently that have required more than a single cluster in the farm.

Blue Gene was originally designed for protein folding, with power efficiency as a primary goal, and I believe even the modern wide-architecture implementations are designed to accommodate problems that can be broken into smaller jobs that don't require high-bandwidth data sharing across thousands of cores.

You have different problems to me but we all have problems :LOL:
 
Yes, I believe you are correct. Update code is happening 1 frame ahead of render code, which is happening 1 frame ahead of display. You should feel a 33.3ms input lag on a 60fps game.

But the presenter says that the remastered version has less input lag than the PS3 version, since they are doing it at 60fps. Makes me wonder: how much input lag is there in the PS3 version?
 
But the presenter says that the remastered version has less input lag than the PS3 version, since they are doing it at 60fps. Makes me wonder: how much input lag is there in the PS3 version?

Is this related to triple buffering? I can't remember exactly if TLoU is triple buffered??
 
We have clusters in the farm that size but I can't think of any jobs we've run recently that have required more than a single cluster in the farm.

Blue Gene was originally designed for protein folding, with power efficiency as a primary goal, and I believe even the modern wide-architecture implementations are designed to accommodate problems that can be broken into smaller jobs that don't require high-bandwidth data sharing across thousands of cores.

You have different problems to me but we all have problems :LOL:


True!! But I am surprised that you guys have clusters this size! The cluster I am using is at one of the three national computing centers in Germany (Jülich Forschungszentrum).
 
Is this related to triple buffering? I can't remember exactly if TLoU is triple buffered??
Yeah, I just googled it; you guys are correct. It is triple buffered @ 30fps.
So the remaster is quite a bit lower in latency, but I suppose one could say it's equivalent to triple-buffered 60fps ;)
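
Rough numbers for that comparison (napkin math again, and the "three frames of delay" for triple buffering is a worst-case assumption):

// Napkin math: frames of delay times frame time. Assumes triple-buffered
// 30 fps means up to three 33.3 ms frames between input and display, and the
// pipelined 60 fps remaster roughly three 16.7 ms frames.
#include <cstdio>

int main() {
    const double ps3_ms = 3 * (1000.0 / 30.0);   // ~100 ms
    const double ps4_ms = 3 * (1000.0 / 60.0);   // ~50 ms
    std::printf("PS3, triple buffered @ 30 fps:    ~%.0f ms\n", ps3_ms);
    std::printf("PS4 remaster, pipelined @ 60 fps: ~%.0f ms\n", ps4_ms);
}

Same number of frames of delay, but each frame is half as long, which is why "equivalent to triple-buffered 60fps" sounds about right.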
 
True!! But I am surprised that you guys have clusters this size! The cluster I am using is at one of the three national computing centers in Germany (Jülich Forschungszentrum).
My facility isn't a matter of public record :nope:
 