Probably recognize the strengths or weaknesses of the hardware and design the software accordingly.
That's a totally different discussion. On why declarative job systems (D3D/OGL) suck so much for modern GPUs. Not because GPUs are "slow".
Can you clarify what the consoles are using, at least for the first half of the upcoming gen?
That's side-tracking anyways.
The typical game workload that needs performance is graphics and everything around it: collisions, simulation, animation, particles, etc.
What is the speedup for rigid body physics running on the GPU versus CPU?
How about data management for the streaming system, or latency-sensitive input processing, or the high-speed management of the whole virtual memory subsystem the GPU relies on?
All other things can run anywhere; their impact is less than 5% (if you coded the game right).
So, if for whatever reason things don't hit this 5% figure, it must be bad code.
Is there an example of a well-coded game you can cite?
Mostly, bad code. People in the non-gamedev world usually do not optimize anything.
That's only the case if bad code is defined as any code that doesn't saturate the memory bus.
There are good algorithms that don't hit main memory for the majority of their accesses, and bad ones that do.
The cost of off-die access is so high that, for many reasonable or practical data sets, it is preferable to go for an algorithm that may be asymptotically inferior to a more parallel cache-thrasher, because it is not necessary or reasonable to bloat the working set enough to scale past the inflection point.
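As a rough, hypothetical C++ illustration of that trade-off (names and element counts invented, not taken from any shipped engine): a linear scan over a contiguous array is asymptotically worse than a node-based container, but it streams through cache lines while the tree chases pointers off-die.

#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

// Cache-friendly: O(n) scan over a small, contiguous array.
// Neighbouring elements share cache lines, so the prefetcher keeps
// the pipeline fed and almost no access goes to main memory.
bool contains_linear(const std::vector<uint32_t>& ids, uint32_t id)
{
    return std::find(ids.begin(), ids.end(), id) != ids.end();
}

// Asymptotically better, O(log n), but every step is a dependent
// pointer chase into a node allocated somewhere else in memory,
// i.e. a potential off-die miss per tree level.
bool contains_tree(const std::set<uint32_t>& ids, uint32_t id)
{
    return ids.count(id) != 0;
}

For the few hundred IDs a frame actually touches, the scan usually wins; the tree only pays off once the working set is bloated past the crossover point, which is exactly the point above.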
I'm not sure why it's a good idea for an interactive system with millisecond time budgets to saturate anything to that extent, since that either leaves no room for demand spikes or has a baseline that is way too high.
And they perform equally well if your reads are hand-picked to be in cache at exactly the right time.
Very similar to what the SPU guys tried to teach the masses.
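For what it's worth, a minimal sketch of that style on an ordinary CPU (hedged: __builtin_prefetch is only a GCC/Clang hint, and the prefetch distance is a made-up number): fetch the data for item i+K while working on item i, the same double-buffered pattern SPU code expressed with explicit DMA.

#include <cstddef>

struct Particle { float pos[3]; float vel[3]; };

// Software-pipelined update: issue a prefetch a fixed distance ahead
// so the cache line is (hopefully) resident by the time we touch it.
// PREFETCH_DISTANCE is a guess; the right value depends on the machine.
void integrate(Particle* p, std::size_t count, float dt)
{
    constexpr std::size_t PREFETCH_DISTANCE = 16;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (i + PREFETCH_DISTANCE < count)
            __builtin_prefetch(&p[i + PREFETCH_DISTANCE]);
        for (int k = 0; k < 3; ++k)
            p[i].pos[k] += p[i].vel[k] * dt;
    }
}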
Previously, I went into how the cache hierarchy gives maybe a dozen bytes of cache storage per work item.
I want to see the optimizations that can reduce everything down to that.
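For reference, a back-of-the-envelope version of that figure, assuming a GCN-style CU (the sizes are assumptions, not vendor-confirmed numbers):

16 KB L1 data cache / (40 wavefronts x 64 lanes) = 16384 B / 2560 work items ≈ 6 bytes each
512 KB shared L2 / (18 CUs x 2560 work items) = 524288 B / 46080 ≈ 11 bytes each

That is roughly where the "maybe a dozen bytes per work item" comes from.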
I can understand it when you've tried and didn't succeed.
The problem is that people usually hear this argument and then don't even try.
But if you look into the future, the number of hw threads/ports/jobs per core only increases.
There is no way to stay with the "old CPU" paradigm any longer anyway.
As nice as that may be, the designers of the hardware in question do not agree, so the platform in question does not do what you want.
Draw call overhead exists only because of "peculiar" D3D design.
You can draw things on Orbis without any overhead, just by assembling the contexts yourself.
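To make "assembling the contexts yourself" concrete, here is a purely hypothetical C++ sketch (the packet names and layout are invented for illustration; the real console command format is not public and will differ): the engine writes draw packets straight into a buffer the GPU front end consumes, with no runtime validation layer in between.

#include <cstdint>
#include <vector>

// Hypothetical packet opcodes, for illustration only.
enum class Op : uint32_t { SetVsShader, SetPsShader, SetVertexBuffer, DrawAuto };

struct CommandBuffer {
    std::vector<uint32_t> words;
    void emit(Op op, uint32_t arg)
    {
        words.push_back(static_cast<uint32_t>(op));
        words.push_back(arg);
    }
};

// The application builds the context directly: no driver call, no
// per-draw state validation, just words appended to memory that the
// command processor will later walk.
void draw_mesh(CommandBuffer& cb, uint32_t vsHandle, uint32_t psHandle,
               uint32_t vertexBufferHandle, uint32_t vertexCount)
{
    cb.emit(Op::SetVsShader, vsHandle);
    cb.emit(Op::SetPsShader, psHandle);
    cb.emit(Op::SetVertexBuffer, vertexBufferHandle);
    cb.emit(Op::DrawAuto, vertexCount);
}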
Where are the contexts assembled?
Are you really sure there is never any overhead in iterating through every single primitive, instead of utilizing an instruction or command sequence that leverages a whole hardware pipeline optimized for it?
If you have a small task, that does not need bandwidth, do it on CPU. What's the problem?
You see, that's the old way of thinking. The future is thousands of threads.
You have 6 SPU "tasks" on PS3 and 256k of storage for each.
And then you have 64 independent "thread pipelines" in Orbis with ****k cache per task.
What exactly is "better" here?
It's 256 kB per SPE, which is an independent front end and execution pipeline. Within those bounds, it has a straight-line speed quadruple what a CU can physically manage, before noting that the CU cannot perform sequential issue faster than once every four slow cycles.
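Rough arithmetic behind that, with the commonly cited clocks treated as assumptions (roughly 3.2 GHz for an SPE, roughly 0.8 GHz for the Orbis GPU):

3.2 GHz / 0.8 GHz = 4x clock-for-clock straight-line advantage for the SPE
0.8 GHz / 4 cycles per sequential issue = 0.2 GHz effective issue rate for one wavefront
3.2 GHz / 0.2 GHz = 16x for a single dependent instruction stream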
For the GPU, it's 64 front-end command pipelines that do not possess resources of their own and have not been disclosed as having any autonomy beyond taking the commands the CPU runtime gives them and using those to arbitrate with the scheduler and CU status hardware in the GPU. The CUs then perform the actual work.
I've already noted that it's **Bytes per task with Orbis.