I think we just need to calculate the worst case performance of both.
I.e. one thread, one AVX port, one core for Haswell.
That isn't an appropriate worst case. Technically, restricting Haswell to one AVX port would render the core inoperable, because no single port supports every AVX operation.
For a single thread, Haswell gets two 256-bit FMA units, 16 software-visible 256-bit registers, and 16 software-visible integer registers. There are 168 internal rename registers for each type.
The core operates at over 3 GHz.
There is a 32KB L1, 256KB L2, and I'll leave you to decide where in the 2-8MB range you want to pick for the L3.
Treating scalar regs as noise, it's 0.5KB of Vregs + 32KB + 256KB + 2-8MB, for one thread running at >3 GHz.
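A quick tally of that per-thread storage (a back-of-envelope sketch; the L3 slice is left as the 2-8MB range, and only the 16 architectural vector registers are counted):

```python
# Haswell per-thread storage tally
vregs_kb = 16 * 32 / 1024           # 16 x 256-bit YMM regs = 0.5 KB
l1_kb, l2_kb = 32, 256
l3_kb_range = (2 * 1024, 8 * 1024)  # 2-8 MB, depending on the part

lo = vregs_kb + l1_kb + l2_kb + l3_kb_range[0]
hi = vregs_kb + l1_kb + l2_kb + l3_kb_range[1]
print(f"{lo / 1024:.2f}-{hi / 1024:.2f} MB")  # roughly 2.28-8.28 MB per thread
```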
With two FMAs per clock, it is >96 GFLOPS.
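The peak-FLOPS arithmetic above can be sketched quickly (using the 3 GHz floor of the clock range, and counting an FMA as two FLOPs):

```python
# Haswell single-core fp32 peak: 2 FMA units x 8 lanes x 2 FLOPs/FMA x clock
fma_units = 2
lanes = 256 // 32    # fp32 elements per 256-bit register
flops_per_fma = 2    # multiply + add
clock_ghz = 3.0      # floor of the ">3 GHz" figure

gflops = fma_units * lanes * flops_per_fma * clock_ghz
print(gflops)  # 96.0, matching the ">96 GFLOPS" floor
```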
The L1 can sustain 64 bytes of reads and 32 bytes of writes per cycle, at full speed.
The L2 can supply it with 64 bytes per cycle.
The L3 ring stop can provide 32 bytes per cycle.
This is at over 3GHz.
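Converting those per-cycle figures into bandwidth at the 3 GHz floor (a sketch using the byte counts above):

```python
# Haswell per-level bandwidth at the 3 GHz floor
clock_ghz = 3.0
paths = {"L1 read": 64, "L1 write": 32, "L2 read": 64, "L3 ring stop": 32}

for name, bytes_per_cycle in paths.items():
    print(f"{name}: {bytes_per_cycle * clock_ghz:.0f} GB/s")
# L1 read: 192, L1 write: 96, L2 read: 192, L3 ring stop: 96 (all GB/s)
```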
And one wavefront (duplicated the needed number of times, all essentially working on the same data), one CU, one SIMD.
This is not a worst-case for the CU. The worst case is one wavefront.
Technically, the worst case for both the CPU and CU would be where only one SIMD lane is used, but I'll leave that one out because the CU would fall to 1/256 of its throughput, whereas Haswell drops to 1/16.
16 FMADDs per cycle at 800MHz is 25.6 GFLOPS.
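The 25.6 GFLOPS figure checks out, counting an FMADD as two FLOPs:

```python
# One GCN SIMD: 16 fp32 FMADDs per cycle at 800 MHz
simd_lanes = 16        # fp32 FMADDs issued per cycle
flops_per_fmadd = 2    # multiply + add
clock_ghz = 0.8

gflops = simd_lanes * flops_per_fmadd * clock_ghz
print(gflops)  # 25.6
```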
That aside, restricting things to one SIMD means a maximum of 10 wavefronts.
There are 512 scalar registers per SIMD. At most 103 scalar registers per wavefront.
At 10 wavefronts, that's ~51 each, although depending on the mode being operated in there are half as many, and some number are devoted to wavefront masks and the like.
There are 256 256-byte vector registers in total, if divvied up equally there are ~25 per wavefront.
16KB L1, 512KB L2.
(2KB + 64KB + 16KB + 512KB)/10 = ~60KB per thread (unless you count the 64 work items per wavefront as threads), running at 800MHz.
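The per-wavefront arithmetic, spelled out (a sketch using the figures above, with the register files and caches divided evenly across 10 wavefronts):

```python
wavefronts = 10
sgpr_bytes = 512 * 4      # 512 32-bit scalar regs per SIMD = 2 KB
vgpr_bytes = 256 * 256    # 256 vector regs x 256 bytes each = 64 KB
l1_bytes = 16 * 1024
l2_bytes = 512 * 1024

per_wavefront_kb = (sgpr_bytes + vgpr_bytes + l1_bytes + l2_bytes) / wavefronts / 1024
print(f"{per_wavefront_kb:.1f} KB")          # 59.4 KB, i.e. ~60 KB per wavefront
print(512 // wavefronts, 256 // wavefronts)  # ~51 scalar and ~25 vector regs each
```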
Restricting the CU to one SIMD has one further impact: only one vector memory operation can begin every 4 cycles. This may or may not hurt the CU's bandwidth, since I'm not clear on how the memory pipeline overlays instruction issue. With a single SIMD, there's no way to start a write on the next cycle, as there would be with multiple SIMDs.
64 bytes per cycle read from the L1.
64 bytes per cycle from the L2.
This is at 800MHz.
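And at 800 MHz, those CU cache figures come to (back-of-envelope):

```python
# CU cache bandwidth at 800 MHz
clock_ghz = 0.8
l1_bytes_per_cycle = 64
l2_bytes_per_cycle = 64

print(l1_bytes_per_cycle * clock_ghz)  # 51.2 GB/s from the L1
print(l2_bytes_per_cycle * clock_ghz)  # 51.2 GB/s from the L2
```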
Then just multiply that by the number of such low-performance tasks each chip can run in parallel.
Better make sure you don't hamstring the CU any further. I might just find some code that can do otherwise parallel chunks serially.