What tasks might be offloaded to the GPU in the future?

fehu

Veteran
I've read many posts about next-gen console hardware; most of them are simply (current specs)×2, others are very conservative (Wii-like) or sci-fi machines.

This got me wondering about the load balance between CPU and GPU in modern consoles, and in future MES (monster entertainment systems).
I see that the XCPU is hampered by its slow shared cache and is sometimes a bottleneck for the GPU, while Cell has so much spare power that it can easily help the RSX with typical graphics tasks.
Microsoft thought that a weak CPU was enough, and Sony that more is better...
At the moment both are right, but I also see that many tasks are being tossed to the GPU with DX11 and OGL3.

What kinds of tasks will probably move to the GPU next gen?
Will core proliferation continue, or will it at some point become redundant next to the GPU?


(sorry if it's not too clear, but I'm speed-writing during a break)
 
This is something pretty off topic I posted in the ATI GPU Transcoding thread over in the algorithm forum. It ended up being mainly a critique of the current state of GPGPU versus host processing, and the need for a new paradigm. I think it might be more relevant here:

Based on all these quality questions and uncertainty about what is actually being done on the GPU I'm starting to think GPGPU video encoding may just be a non-starter at this point.

I was thinking last night that the really optimal system architecture needs something more than just a strong multicore OoOE CPU and a fast GPU with tons of shaders. I mean, this video compression on the GPU idea has been around since the first AVIVO converter for Radeon X1800 cards. But it's still little more than an ugly hack that sacrifices quality to a great extent.

What you can do on DX10 class shader hardware just seems to be too limited to replicate the quality possible on a CPU. I'm not a programmer, so I couldn't tell you if it's the programming model, memory constraints or what. We all know the more specialized execution hardware is, the less flexible it is. The more flexible it is, the less performance potential.

I'm starting to think there needs to be an intermediary level of hardware standing between a good general purpose CPU and a high performance GPU. What I'm thinking of is something like a chip integrating a gang of Cell-style SPEs connected directly to both the CPU and GPU by a very high speed interconnect. Since both AMD and Intel are operating without a FSB anymore this is actually more possible in the PC space than ever before. I must admit, though, I was at first thinking of the console space where custom interconnects are the norm.

I actually got to thinking about this because of things like the Physx PPU and Toshiba's SPURs Engine. We even sometimes get threads here asking questions about a game being "totally hardware accelerated" or the introduction of an "AI Accelerator". What I'm thinking about would actually make those questions moot while gaining significant performance for certain tasks which could be a lot faster given the right kind of vector unit, but which still don't work too terribly well on the GPU. Besides, why would you want to take performance away from the graphics anyway?

When I look at Larrabee I wonder if it will really deliver performance competitive with nVidia or ATI's ALU monsters. But it should be gangbusters for things like video processing and encoding or physics simulation or sound processing.

And who was it that was going to market a SPURs Engine PCI-E add-in card with video processing software? I look at that and wonder why someone doesn't write a Havok or Physx driver for it. I wonder why it doesn't have digital audio outputs and sound drivers that do positional audio and encode Dolby Digital and DTS output in real time, like sound is handled on the PS3.

I also look at the Physx acceleration on nVidia GPUs in Mirror's Edge for the PC and wonder how some really pretty flapping fabric or plastic will have any effect on fundamentals of game play. I think GPU physics look pretty and that's pretty much as far as it goes.

So here is what I envision as an optimal form of system topology.

Code:
2-4 core OoOE processor with a
large budget of L2/L3 cache. ---------- integrated ----------- DDR3 memory,
Effects strong general purpose         memory controller       128-256 bit bus
performance and coordination.
            |
            |
            |--------------------- QPI/HT/FlexIO class interconnect
            |
            |
Vector Multi-Processing Unit:          Ring bus/EIB high speed on-chip connection
Larrabee, SPURs Engine, or ----------- that also makes quick CPU-GPU communication
another ASIC in that mold              possible, bypassing the Processing Elements.
            |
            |
            |--------------------- PCI-E 2.0 or whatever is in vogue (faster is better)
            |
            |
High performance AMD or
nVidia DX10/DX11 class ---------------- Whatever VRAM setup they want:
GPU and video output.                   256-512 bit GDDR3-GDDR5 and up.

I think you get the idea. An important factor is for the vector chip to have high speed access to both pools of RAM, lacking its own memory beyond local store/cache. As a note, the VMPU is basically replacing the classic north bridge position. Sitting at the center, it's the most obvious place to hang an I/O chip with a modest HT or PCI-E connection for USB/Firewire/Ethernet/SATA/Audio (though audio should also be possible through the GPU for HDMI).

I think the advantage should be obvious over the silliness of putting PPUs, SPURs Engines and sound cards on add-in cards, which limits their ability to combine with the CPU or GPU in the way that is possible in something like the PS3. Obviously that system has deficiencies in PPE and GPU performance, but we've seen cool uses of the SPUs for post-processing effects, vertex processing, etc., due to the tighter integration of the GPU and Cell.

As I mentioned before, I'm not a programmer or an electronic engineer, so maybe this is pie in the sky, but I think this kind of architecture offers so many cool possibilities beyond the current paradigm. For example, if the physics simulation is centralized on the VMPU, both the GPU and CPU should be able to access those results very quickly to enhance both visuals and game play. It should offer a superior audio solution and realtime DTS/DD, a feature that should be simple in software even on today's systems but which is still far too rare.

Like the oft-wished-for Cell plus Xenos architecture, if you are shader limited you could move vertex calculations to the VMPU and use all your ALUs for pixel shading. It's flexible enough and powerful enough that the possibilities should be enormous. Ray tracing, distributed computing, etc., all stand to benefit from this kind of tiered performance architecture.

And finally, to bring this back on topic, it should allow for high quality transcoding at a speed far greater than possible even on the fastest quad cores. Not to mention it would not peg every core at 100% for hours, making multitasking nicer.
 
That was an impressive post above.

To the OP, I think it depends largely on exactly how close the CPU & GPU actually get in 8th gen consoles. Will they be separate, or will they be one single chip?

This generation, I don't think it makes much sense to accelerate so many general purpose applications on the GPU, as in my opinion the shader power is in much higher demand and current CPUs are very fast. Will we be saying the same thing next time round? Tbh, I'm seeing a higher likelihood that more graphical operations get offloaded onto a general purpose CPU to give more flexibility and performance (assuming the chips are separate). If they are one homogeneous chip then it wouldn't really matter either way; a balance could be found based on requirements.
 
I was supposing that, for technical and cost reasons, next gen will have a separate CPU and GPU, and nothing between them.
 
I was supposing that, for technical and cost reasons, next gen will have a separate CPU and GPU, and nothing between them.
Ok. If we're to assume they're separate, I feel what I said earlier with regard to general purpose processing staying closer to the CPU will apply (given the power & flexibility will be there). This is my personal opinion though.
 
I also look at the Physx acceleration on nVidia GPUs in Mirror's Edge for the PC and wonder how some really pretty flapping fabric or plastic will have any effect on fundamentals of game play. I think GPU physics look pretty and that's pretty much as far as it goes.

The flapping fabric and plastic in PhysX Mirror's Edge wasn't supposed to add anything fundamental to the gameplay - the gameplay fundamentals all had to be delivered on consoles and on non-nVidia, non-DX10 systems.

Anti aliasing doesn't add anything to the fundamental gameplay of any PC game where I turn it on either, but anything that helps with presentation and aids with (psychological) immersion is a bonus. Stuff moving and responding either "correctly" or in a detailed fashion can help with this too.

Stating the obvious a bit I suppose, but highly complex or massive scale physics won't change fundamentals of gameplay until someone makes a game where it does (and it'll most probably be a game with "sandbox" elements).
 
Something I forgot to mention is that, in my opinion, at least Microsoft will use a DX12-class GPU.

The next consoles will be out in 2011 or maybe 2012, as both Sony and Microsoft want to lie low for the next year and earn as much as they can from this generation to fund R&D.
By that time DX11 will be as mainstream as DX10 is now, and a generational jump (as with Xenos) is probable, at least because MS is the one that finalizes the library.

So I'm expecting something with more potential than even future (and unexplored) high end desktop GPUs, but how much more?
If the GPU can handle graphics and physics alone (maybe audio too?), what is left for the CPU?
What am I forgetting that can be moved to the CPU?
 
DX10 is mainstream? Ha. Show me one game that renders exclusively in DX10. The biggest difference in visuals from DX9 to DX10 I have seen is in Crysis, and that was not much. If it is true that DX10's biggest advantage is doing more of the stuff that DX9 did in the same time/with the same resources, then DX10 is quite the letdown.

I do wonder why we are even looking forward to DX11 and onwards. My money is on OGL, as always.
 
DX10 is mainstream? Ha. Show me one game that renders exclusively in DX10. The biggest difference in visuals from DX9 to DX10 I have seen is in Crysis, and that was not much. If it is true that DX10's biggest advantage is doing more of the stuff that DX9 did in the same time/with the same resources, then DX10 is quite the letdown.

I do wonder why we are even looking forward to DX11 and onwards. My money is on OGL, as always.
I hope you're not a betting man. OGL is pretty much dead for gaming unless it is the only option: give developers a choice between DX and OGL, and 9 times out of 10 I'll guarantee you they go DX. OpenGL 3.0 was the final nail in the coffin for games developers.

I don't suppose you've programmed with either?
 
The flapping fabric and plastic in PhysX Mirror's Edge wasn't supposed to add anything fundamental to the gameplay - the gameplay fundamentals all had to be delivered on consoles and on non-nVidia, non-DX10 systems.

Anti aliasing doesn't add anything to the fundamental gameplay of any PC game where I turn it on either, but anything that helps with presentation and aids with (psychological) immersion is a bonus. Stuff moving and responding either "correctly" or in a detailed fashion can help with this too.

Stating the obvious a bit I suppose, but highly complex or massive scale physics won't change fundamentals of gameplay until someone makes a game where it does (and it'll most probably be a game with "sandbox" elements).

Well, clearly. My question is whether GPU based physics will ever allow for that kind of massive, system wide, gameplay-affecting simulation. My contention is that what nVidia has done with Physx is fine for the pretty-making, but the physics being calculated are going straight to the displayed output and not coming back to the CPU to influence the way the game is actually played.
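
For contrast, here is a hedged sketch (CUDA-style, with all names and numbers hypothetical) of what "coming back to the CPU" would actually involve: the simulation results have to be copied back to host memory every frame so game logic can react to them.

Code:
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical contact record that gameplay code would care about.
struct Contact { int bodyA, bodyB; float impulse; };

// Placeholder kernel standing in for a GPU collision/physics pass.
__global__ void detectContacts(Contact* out, int* outCount)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *outCount = 0;  // a real kernel would append contacts here
}

void physicsStep(Contact* d_contacts, int* d_count,
                 Contact* h_contacts, int maxContacts)
{
    detectContacts<<<256, 256>>>(d_contacts, d_count);

    // This readback is exactly the step that effects-only GPU physics skips:
    // without it the CPU-side game logic never sees the simulation results.
    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    if (count > maxContacts) count = maxContacts;
    cudaMemcpy(h_contacts, d_contacts, count * sizeof(Contact),
               cudaMemcpyDeviceToHost);

    // ...game logic on the CPU can now react: damage, triggers, AI, etc.
    printf("contacts this frame: %d\n", count);
}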
 
Well, clearly. My question is whether GPU based physics will ever allow for that kind of massive, system wide, gameplay-affecting simulation. My contention is that what nVidia has done with Physx is fine for the pretty-making, but the physics being calculated are going straight to the displayed output and not coming back to the CPU to influence the way the game is actually played.
Potentially.

Yet the biggest issue is practicality when GPUs are ultimately dedicated processors for graphics.

...the further question becomes: will that ever be considered practical?
 
Almost every massively parallel task could be offloaded to the GPU. This includes common CPU tasks such as: physics simulation, collision detection, particle animation, object and particle depth sorting (and all other large sorting operations), object animation setup (complex animation systems), curved surface and line interpolation (splines, nurbs, etc), massive group AI systems for simple entities (not requiring much branching), water and cloth simulation, object and terrain deformation (as multilayered pixel shader displacement mapping techniques get more popular), large data structure generation, manipulation and combination (only appliable for some data structure types), etc, etc. Basically anything that does not need much branching or does not have lots of data dependencies (making it easy to split the task to hundreds of GPU threads).
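
To make that concrete, here is a minimal sketch (in CUDA, with made-up names and parameters rather than anything from a real engine) of the kind of branch-light, dependency-free work that splits cleanly across thousands of GPU threads:

Code:
#include <cuda_runtime.h>

// Each thread advances one particle: no branching beyond the bounds check and
// no dependencies between particles, so the work maps directly onto the GPU.
__global__ void updateParticles(float3* pos, float3* vel, int count, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    // Simple Euler integration with constant gravity (purely illustrative).
    vel[i].y -= 9.81f * dt;
    pos[i].x += vel[i].x * dt;
    pos[i].y += vel[i].y * dt;
    pos[i].z += vel[i].z * dt;
}

// Host-side launch: one thread per particle, rounded up to whole blocks.
void stepParticles(float3* d_pos, float3* d_vel, int count, float dt)
{
    int threads = 256;
    int blocks  = (count + threads - 1) / threads;
    updateParticles<<<blocks, threads>>>(d_pos, d_vel, count, dt);
}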

I would like the future console to have a relatively slow 2-4 core CPU with no dedicated vector math units (but with good branching performance) and a pair of fast GPUs with full programmer control (no SLI or Crossfire AFR modes or any other hacks like that). The GPUs should be optimized for fast memexport/streamout performance directly to the CPU-accessible system memory (or straight to the CPU cache), and there should be a fast path to transfer the CPU-processed data to the GPUs (CPU cache -> GPU memory). Both GPUs should have their own very fast render target memories (similar to Xbox 360 EDRAM), and the GPUs should be completely asynchronous from each other. The programmer could for example render and blur shadow depth maps on one GPU and render the scene (deferred shading) g-buffers on the other. During the deferred shading light rendering the other GPU could for example sort and render the particle systems (to another buffer that is combined later) or even start calculating the collision and physics for the next frame. Games with heavy game logic and physics calculations could even use GPU 2 completely for game logic, and do all graphics rendering on GPU 1.
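
A rough sketch of what "completely asynchronous" dual-GPU scheduling could look like from the host side, assuming a CUDA-style API and using placeholder kernels for the two workloads (purely illustrative, not any actual console SDK):

Code:
#include <cuda_runtime.h>

// Placeholder kernels standing in for the two independent per-frame workloads.
__global__ void shadowMapWork() { /* render/blur shadow maps */ }
__global__ void gbufferWork()   { /* fill deferred shading g-buffers */ }

void renderFrame()
{
    // GPU 0 gets the shadow map work for this frame...
    cudaSetDevice(0);
    shadowMapWork<<<128, 256>>>();

    // ...while GPU 1 independently fills the g-buffers (or runs next-frame
    // physics). Kernel launches are asynchronous, so both devices run at once.
    cudaSetDevice(1);
    gbufferWork<<<128, 256>>>();

    // Synchronise only at the point where one GPU needs the other's output.
    cudaSetDevice(0);
    cudaDeviceSynchronize();
    cudaSetDevice(1);
    cudaDeviceSynchronize();
}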

Of course, a manufacturer-supplied optimized GPGPU library should be available in the console SDK to speed up technology adoption, and a powerful, easy-to-use dual-GPU profiler would be needed to help developers optimize their games properly for the platform.
 
Potentially.

Yet the biggest issue is practicality when GPUs are ultimately dedicated processors for graphics.

...the further question becomes: will that ever be considered practical?

The answer would be hardware which sits somewhere between the CPU and GPU in terms of generalisation vs specialisation, providing the acceleration that these types of parallel computations require, but not being so far down the pipeline that it's only visual. I guess you could call the SPEs in the Cell a start towards that kind of setup.
 
Exactly. The video encode benchmarks using SpursEngine add in cards for H.264 coming out of Japan are pretty impressive for a 1.5 Ghz chip with only 4 SPEs. We're talking about 3-8x as fast as a good Core 2 Quad. Even i7 doesn't get you close to that performance level. And the SpursEngine doesn't seem to have the quality compromise associated with GPU based encoding solutions now available.

For something like video encoding you could get more bang for your buck with a couple of big OoOE cores coupled with a series of SPE-style vector units than you'd get with an i7 Quad. That, coupled with the continuing failure of GPGPU to really live up to its promise, leads me to speculate about this differing direction.

The Cell is certainly a model close to what I've suggested, but it is weak on the general processing end, and in the PS3 it is coupled with a less than ideal GPU. That's not to say I think the SPE programming model is optimal, but the SPEs represent a class of execution hardware that could be beneficial if adopted on a greater scale across multiple platforms. The paradigm is about greater flexibility than a GPU, greater performance than a CPU, and the ability to communicate efficiently with both to make these things practical.
 
...and a pair of fast GPUs with full programmer control (no SLI or Crossfire AFR modes or any other hacks like that). ...the GPUs should be completely asynchronous from each other. The programmer could for example render and blur shadow depth maps on one GPU and render the scene (deferred shading) g-buffers on the other

Hear hear. There's a lot of parallelism to be extracted in this way, and if you don't need it, you can always use them to tile.
 
The answer would be hardware which sits somewhere between the CPU and GPU in terms of generalisation vs specialisation, providing the acceleration that these types of parallel computations require, but not being so far down the pipeline that it's only visual. I guess you could call the SPEs in the Cell a start towards that kind of setup.
Indeed.

I stand firmly in the belief that the ideal systems of the future will have both more general purpose GPUs and CPUs that are still characteristically general purpose yet have much faster vector processing capabilities; indeed, something like the RSX/Cell type of setup, but better...
 
Almost every massively parallel task could be offloaded to the GPU. This includes common CPU tasks such as: physics simulation, collision detection, particle animation, object and particle depth sorting (and all other large sorting operations), object animation setup (complex animation systems), curved surface and line interpolation (splines, NURBS, etc.), massive group AI systems for simple entities (not requiring much branching), water and cloth simulation, object and terrain deformation (as multilayered pixel shader displacement mapping techniques get more popular), large data structure generation, manipulation and combination (only applicable to some data structure types), etc. Basically anything that does not need much branching or does not have lots of data dependencies (making it easy to split the task into hundreds of GPU threads).

I would like the future console to have a relatively slow 2-4 core CPU with no dedicated vector math units (but with good branching performance) and a pair of fast GPUs with full programmer control (no SLI or Crossfire AFR modes or any other hacks like that). The GPUs should be optimized for fast memexport/streamout performance directly to the CPU-accessible system memory (or straight to the CPU cache), and there should be a fast path to transfer the CPU-processed data to the GPUs (CPU cache -> GPU memory). Both GPUs should have their own very fast render target memories (similar to Xbox 360 EDRAM), and the GPUs should be completely asynchronous from each other. The programmer could for example render and blur shadow depth maps on one GPU and render the scene (deferred shading) g-buffers on the other. During the deferred shading light rendering the other GPU could for example sort and render the particle systems (to another buffer that is combined later) or even start calculating the collision and physics for the next frame. Games with heavy game logic and physics calculations could even use GPU 2 completely for game logic, and do all graphics rendering on GPU 1.

Of course, a manufacturer-supplied optimized GPGPU library should be available in the console SDK to speed up technology adoption, and a powerful, easy-to-use dual-GPU profiler would be needed to help developers optimize their games properly for the platform.

And here I thought such an idea was pure insanity... or maybe we drink the same kool-aid? ;)

More discussion of what can be done on GPUs as a positive or lateral move in terms of performance per $ (or mm²), as well as of the expected workloads/bottlenecks next gen, would be a good way to see where resources might be allocated.
 
Wouldn't a CPU without vector units be lacking in compression/decompression power?
If next gen is about offloading a bunch of things to a potent GPGPU and the CPU is left with the remaining serial/branchy tasks, wouldn't it be better to have 1-2 relatively fast CPU cores sharing the same amount of cache that two or four cores would have?

Wouldn't the best solution be a single chip, to ensure fast CPU/GPU communication?
In this case the silicon budget would be tighter, so fewer/faster CPU cores dealing with the serial/branchy parts of the code would free some silicon for the GPU part of the chip.

I read this thread and the related article yesterday:
http://www.realworldtech.com/index.cfm (thread programming the larrabee).
http://www.spectrum.ieee.org/jan09/7129.
It looks like some people consider Larrabee a GPGPU before being a GPU.
If a bunch of calculations are offloaded to the GPU, could it make sense to use a "jack of all trades" kind of GPU? (I mean not being king of the hill at graphical tasks, but making up for it with the other possibilities it offers.)
 
I'd guess that nearly everything on the SPUs could go to the GPU in the future. Here is a list of SPU work ripped from a making-of Killzone 2 video which showed an in-game debug screen (numbers are my best guess):

Code:
SPU TIME
----------------------------
AI.Cover ................... ........ 0.00%
AI.LineOfFire .............. ........ 0.00%
Anim.EdgeAnim .............. 33 ..... 2.01%
Anim.Skinning .............. 152 .... 30.68%
Gfx.DecalUpdate ............ 9 ...... 0.78%
Gfx.LightProbes ............ 396 .... 9.00%
Gfx.PB.DeferredSchedule .... 1 ...... 0.60%
Gfx.PB.Forward ............. 2 ...... 1.69%
Gfx.PB.Geometry ............ 1 ...... 18.67%
Gfx.PB.Lights .............. 1 ...... 0.66%
Gfx.PB.ShadowMap ........... 1 ...... 4.20%
Gfx.Particles.ManagerJob ... 1 ...... 3.14%
Gfx.Particles.UpdateJob .... 130 .... 12.33%
Gfx.Particles.VertexJob .... 70 ..... 20.64%
Gfx.Post.BloomCapture ...... 12 ..... 2.80%
Gfx.Post.BloomIntegrate .... 8 ...... 1.52%
Gfx.Post.DepthOfField ...... 64 ..... 12.12%
Gfx.Post.DepthToFuzzy ...... 8 ...... 0.67%
Gfx.Post.Downsample ........ 29 ..... 0.61%
Gfx.Post.GrainWeight ....... 1 ...... 0.51%
Gfx.Post.HBlur ............. 45 ..... 3.02%
Gfx.Post.ILR ............... 1 ...... 0.63%
Gfx.Post.Modulate .......... 27 ..... 1.3?%
Gfx.Post.MotionBlur ........ 46 ..... 11.31%
Gfx.Post.Unlock? ........... 1 ...... 0.01%
Gfx.Post.Upsample .......... 108 .... 9.47%
Gfx.Post.VBlur ............. 46 ..... 3.73%
Gfx.Post.Vg??lle ........... 1 ...... 1.18%
Gfx.Post.Zero .............. 16 ..... 0.64%
Gfx.Scene.Portals .......... 3 ...... 30.72%
Mesh.Decompression ......... ........ 0.00%
Physics.Collide ............ 4 ...... 2.48%
Physics.Integrate .......... 4 ...... 2.11%
Physics.KdTree ............. 8 ...... 20.50%
Physics.Raycast ............ ........ 0.00%
Snd.MP3.Stereo ............. 2 ...... 2.60%
Snd.MP3.Surround ........... 2 ...... 7.51%
Snd.?Synth ................. 35 ..... 3.23%
Snd.Reverb ................. 14 ..... 4.02%
---------------------------- 
Total Time ................. 1232 ... 227.46%
 