Asynchronous Compute : what are the benefits?

Discussion in 'Console Technology' started by onQ, Sep 19, 2013.

  1. pMax

    Regular

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    not on an APU - the memory access should be the same for CPU and GPU (well, more or less).
    You're not supposed to go through PCIe from a discrete card to memory, after all.

    Right, the GPU is not suited to many small, non-uniform jobs.
    But if there aren't many of them, and there are other jobs that hide that bad latency, you could cover it, I guess.
    Of course the GPU is not the best place to run those kinds of threads, but at least you don't need to DMA its small memory buffer in and out every time before it finishes working with it...
     
  2. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,762
    Likes Received:
    2,639
    Location:
    Maastricht, The Netherlands
  3. Bumpyride

    Newcomer

    Joined:
    May 19, 2005
    Messages:
    55
    Likes Received:
    2
    Location:
    MS, USA
    I think the interesting aspect of this isn't that some problems are better for CPU and some better on GPU. The shared memory space makes sharing work between the two more efficient but it also makes approaches to programming possible that weren't before.

    I've only done two big GPGPU projects in my life, and one of them would have been implemented entirely differently on an APU. It was an adaptive beamforming project that involved a large multiplication to calculate a covariance matrix, followed by a good bit of linear algebra on the resulting matrix. The covariance matrix was much smaller than the input, so the GPU was used to brute-force the multiplication and the result was then passed back to the CPU for everything else. It would have been better to do the more irregular sorts and searches on the CPU while doing all of the coarse multiplication work on the GPU - especially if it meant not having to copy anything between device and host memory.

    My point is, within one algorithm, there were jobs that were better for the GPU and jobs that were much better for the CPU. Having both work efficiently at the same time would have structured things a lot differently than just splitting the algorithm in two and doing more work on the CPU than you would want because it would take too long to copy everything back.

    I have no idea how much faster this would have been in the end, but I don't doubt that, per unit of GPU and CPU resources (obviously the APUs are weaker than other available CPU-and-GPU pairs), it would have been faster.

    I can't wait to see whether more of this kind of thing crops up as people optimize their software for the next-gen systems.
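    The split described in that post can be sketched roughly as follows. This is a toy illustration only: all names, shapes, and the weight-solve step are assumptions, not details from the actual project.

```python
import numpy as np

# Hypothetical sketch of the adaptive-beamforming split described above.
# All names and sizes are illustrative, not from the actual project.
rng = np.random.default_rng(0)
channels, samples = 8, 4096
snapshots = rng.standard_normal((channels, samples))  # sensor data, channels x samples

# Stage 1 -- one large, regular matrix multiply: the GPU-friendly part.
# It reduces the big snapshot matrix to a small covariance matrix R.
R = snapshots @ snapshots.T / samples                 # (channels x channels)

# Stage 2 -- irregular linear algebra on the small result: the CPU-friendly part.
# On a discrete GPU, R has to be copied device -> host between the stages;
# on a shared-memory APU both stages could work on the same buffer.
steering = np.ones(channels)                          # toy steering vector
w = np.linalg.solve(R, steering)                      # MVDR-style weight solve
w /= steering @ w                                     # normalize so steering @ w == 1
```

    The point is the boundary between the stages: on a discrete card the small matrix crosses PCIe, on an APU it doesn't.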
     
  4. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Ok. Then what?

    That's a totally different discussion. On why declarative job systems (D3D/OGL) suck so much for modern GPUs. Not because GPUs are "slow".

    That's side-tracking anyways.
    Typical game workload that needs performance is graphics and everything around it: collisions, simulation, animation, particles etc.
    All other things can run anywhere, their impact is less than 5% (if you coded the game right).

    Mostly, bad code. People in non-gamedev world usually do not optimize anything.

    And they perform equally well if your reads are hand-picked to be in cache at the exact time.
    Very similar to what the SPU guys tried to teach the masses.

    I can understand when you've tried and didn't succeed.
    The problem is that people usually hear this argument and then don't even try.
    But if you look into the future, the number of hw threads/ports/jobs per core only increases.
    There's no way to stay with the "old CPU" paradigm any longer anyway.

    Draw-call overhead exists only because of the "peculiar" D3D design.
    You can draw things on Orbis without any overhead, just by assembling the contexts yourself.
    Theoretically you could even draw primitive-by-primitive without any performance penalty whatsoever.

    If you have a small task, that does not need bandwidth, do it on CPU. What's the problem?

    You have 6 SPU "tasks" on PS3, each with 256k of storage.
    And then you have 64 independent "thread pipelines" on Orbis with ****k of cache per task.
    What exactly is "better" here?
     
  5. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Last time I checked, people at ND and Sony SM shared this worldview.
    So maybe it's insulting, but not to the whole industry anyway. :)
     
  6. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    If writing better code were the solution, we'd have stuck with in-order processors; why even bother with micro-op optimization? Oh wait, that's because most devs write bad code.

    It's all so obvious now, nevermind the hundreds of thousands of pioneers in the CS field, because all you need is just to write better code.
     
  7. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    You've never spoken to them about their "worldview".
     
  8. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Sturgeon's Law.
    Does it mean you should not strive for perfection?
    And besides that, modern CPUs have a lot of legacy code to support, and it's not so easy when memory access is 150x slower per clock than it was in the '80s.
    And still they try: Netburst, or Cell...
     
  9. tuna

    Veteran

    Joined:
    Mar 10, 2002
    Messages:
    3,550
    Likes Received:
    590
    We could have better compilers instead.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Probably recognize the strengths or weaknesses of the hardware and design the software accordingly.

    Can you clarify what the consoles are using, at least for the first half of the upcoming gen?

    What is the speedup for rigid body physics running on the GPU versus CPU?
    How about data management for the streaming system, or latency-sensitive input processing, or the high-speed management of the whole virtual memory subsystem the GPU relies on?

    So, if for whatever reason things don't hit this 5% figure, it must be bad code.
    Is there an example of a well-coded game you can cite?

    That's only the case if bad code is defined as any code that doesn't saturate the memory bus.
    There are good algorithms that don't hit main memory for the majority of their accesses, and bad ones that do.
    The cost for off-die access is so high that for many reasonable or practical data sets it is preferable to go for an algorithm that may be asymptotically inferior to a more parallel cache-thrasher, because it is not necessary or reasonable to bloat the working set enough to scale past the inflection point.
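    As a toy illustration of that trade-off (in Python, so it only shows the constant-factor side; the cache and branch-predictor advantages apply to native code):

```python
import bisect
import timeit

# A linear scan is asymptotically inferior to binary search, but over a
# small, contiguous working set its constant factors are tiny, and in
# native code it is far kinder to the cache and branch predictor.
# Which one wins here depends on n and the runtime.
data = sorted(range(0, 64, 2))          # small, cache-resident sorted array

def linear_contains(xs, v):
    for x in xs:                        # O(n), predictable sequential access
        if x >= v:
            return x == v
    return False

def binary_contains(xs, v):
    i = bisect.bisect_left(xs, v)       # O(log n), scattered accesses
    return i < len(xs) and xs[i] == v

t_lin = timeit.timeit(lambda: linear_contains(data, 42), number=50_000)
t_bin = timeit.timeit(lambda: binary_contains(data, 42), number=50_000)
```

    Past some working-set size the asymptotics take over, which is exactly the inflection point the paragraph above argues many practical data sets never reach.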

    I'm not sure why it's a good idea for an interactive system with millisecond time budgets to saturate anything to that extent, since that either leaves no room for demand spikes or has a baseline that is way too high.


    Previously, I went into how the cache hierarchy gives maybe a dozen bytes of cache storage per work item.
    I want to see the optimizations that can reduce everything down to that.

    As nice as that may be, the designers of the hardware in question do not agree, so the platform in question does not do what you want.


    Where are the contexts assembled?
    Are you really sure there's never overhead iterating through every single primitive instead of utilizing an instruction or command sequence that leverages a whole hardware pipeline optimized for it?

    You see, that's the old way of thinking. The future is thousands of threads.

    It's 256kB per SPE, which is an independent front end and execution pipeline. Within those bounds, it has a straight-line speed quadruple what a CU can physically perform, before noting that the CU cannot issue sequentially faster than once every four slow cycles.
    For the GPU, it's 64 front-end command pipelines that do not possess resources of their own and have not been disclosed as having any autonomy beyond taking what commands the CPU runtime gives them and using those to arbitrate with the scheduler and CU status hardware in the GPU. The CUs then perform the work.
    I've already noted that it's **Bytes per task with Orbis.
     
  11. Esrever

    Regular

    Joined:
    Feb 6, 2013
    Messages:
    846
    Likes Received:
    647
    or better hardware instead.
     
  12. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    or just write better code.
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    ...

    Oh noes, it keeps going deeper!
    Which, by the way, GPUs are not good at.
     
  14. gurgi

    Regular

    Joined:
    Jul 7, 2003
    Messages:
    605
    Likes Received:
    1
    write better posts? :p
     
  15. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
    But can you express your ideas?
     
  16. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    I'm cool with that.

    Dunno about Xone, but the PS4 lets you go quite low-level when writing code. No "constants" or "buffers", for example.

    Rigid body physics can indeed be too simple for the GPU, but spring-based simulations of soft bodies, for example, look like a good contender.

    Most of the code revolves around drawing things anyway. It's the most performance hungry part of any game I know.

    I agree. But games are developing in another direction: tighter lighting simulations -> bigger datasets, tighter collision/force simulations -> bigger datasets, and so on.

    You have the luxury of saturating things and getting away with it on consoles (at least for now). It's not a "general computing" machine.

    That thread count was inflated too much.
    Each CU has 4 SIMD units, each 16 "work items" wide, and there are 18 CUs. Each memory controller has 128k of L2.
    So basically we get a granularity of 128 * 4 / (18 * 4) = ~7k per "thread".
    On a modern Haswell CPU it's 4 SIMD/FMA ports for each 256k of L2 = 64k per "thread".
    Yes, CPUs are much better, but only approx. one order of magnitude better.

    Why do you need to "iterate"? CUs work just like regular CPU cores: you pass a number of SIMD instructions, they get executed, you write the result to memory.
    No need for any "context" or "buffer" anyway. The AMD driver does all this "context" mumbo-jumbo on PC just to keep compatibility with the D3D architecture. Inside, it's just a compiler that spits out imperative code.

    For tasks that need performance.
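    The granularity arithmetic in that post checks out at back-of-the-envelope precision (all counts are the ones quoted in the post; real L2 partitioning is of course more complex than an even split):

```python
# Back-of-the-envelope check of the cache-per-"thread" figures quoted above.
KB = 1024

# GPU side: 4 memory controllers x 128 KB of L2, shared by 18 CUs x 4 SIMDs.
gpu_l2_total = 4 * 128 * KB
gpu_simds = 18 * 4
gpu_kb_per_thread = gpu_l2_total / gpu_simds / KB   # ~7.1 KB per SIMD

# CPU side: 256 KB of private L2 per core, split across the post's 4 ports.
cpu_kb_per_thread = 256 * KB / 4 / KB               # 64 KB

ratio = cpu_kb_per_thread / gpu_kb_per_thread       # ~9x: one order of magnitude
```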
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The SIMD units are 16-wide, but operate on a 4-cycle vector issue. The per-instruction width of the wavefronts is 64-wide.
    The CUs themselves can host 40 wavefronts each.

    For comparison, Haswell with HT has two threads and AVX-256 gives a per-instruction width of 8.
    The 256KB of L2 is backed up by an L3 of 8MB. Even if just dividing it up per-thread on die, it's 1MB.
    Since HT is optional, it can be 2MB per-thread.
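    Spelling out the width and cache figures in that post as arithmetic (numbers taken at face value from the post; the quad-core configuration is the one its 8 MB / 1 MB figures imply):

```python
# Arithmetic behind the figures in the post above.
MB = 1024 * 1024

# GCN: each SIMD is 16 lanes wide but issues a vector op over 4 cycles,
# so a wavefront is 16 * 4 = 64 work items per instruction.
wavefront_width = 16 * 4
resident_items_per_cu = wavefront_width * 40   # up to 40 wavefronts per CU

# Haswell: AVX-256 is 8 fp32 lanes per instruction, 2 threads per core via HT.
# A quad-core's 8 MB L3 split evenly per thread:
l3_per_thread_ht = 8 * MB // (4 * 2)           # 1 MB with HT on
l3_per_thread_st = 8 * MB // 4                 # 2 MB with HT off
```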
     
  18. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    948
    Likes Received:
    417
    Some problems are excessively performance hungry and totally unparallelizable. What you're conjecturing is: "you can break every performance-problematic algorithm into trivial-to-calculate pieces".
     
  19. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    That's akin to Haswell's 60-entry scheduler, per core.

    Caches beyond the L2 are not effective in games. This can be tested in the real world. :)
    Haswell has 3 AVX-256 ports per core that can run in parallel.
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    If you want to pull that, then the per-wavefront instruction queues in each CU can readily expand the GPU's number of buffered instructions.
    The wavefronts possess wholly private context and their own instruction pointer. They're as close a match to the threads in Haswell as there can be.

    Haswell can drop down to one thread without a problem.
    The CUs absolutely cannot drop below 4 wavefronts without idling fractions of their units.

    The Haswell L3 is maybe 30-40 cycles away at over 3 GHz.
    The global L2 of GCN may get under that in cycle terms, but not by much, and that is not in wall-clock time.
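    Converting those latencies to wall-clock time makes the point concrete. The GPU clock used here is an assumption (typical of GCN-era parts), not a figure from the post:

```python
# Cycle counts -> wall-clock time for the latency comparison above.
# The 0.8 GHz GPU clock is an assumed GCN-era figure, not from the post.
cpu_ghz = 3.0
l3_cycles = 35                        # mid-point of the quoted 30-40 range
l3_ns = l3_cycles / cpu_ghz           # ~11.7 ns for Haswell's L3

gpu_ghz = 0.8
gpu_l2_cycles = 30                    # hypothetical "fewer cycles" case
gpu_l2_ns = gpu_l2_cycles / gpu_ghz   # 37.5 ns: slower in wall-clock terms
```

    Even with a lower cycle count, the slower clock leaves the GPU cache behind in absolute time.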
     


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.