Dazed and Computed / did GPUs take the wrong road?

liolio · Feb 15, 2013

This is kind of a follow-up to threads like "end of the GPU roadmap", the "software rendering thread", or quite a few "Larrabee" threads.

I open this thread because of late, after reading the aforementioned threads and a presentation from DICE/repi, quite a few questions have been raised in my head.

I open it here because it affects consoles too, as it seems that both the PS4 and the next Xbox are to use a GCN-based GPU (and I'm not that comfortable posting elsewhere).

To give some background to this thread I will go through a few "points", both to explain my POV so others can clear up what could be misunderstandings, and to give a proper direction to the discussion that I hope will follow between the top members here.

First, I'm not a believer in GPGPU. Especially in the console forum there is a lot of enthusiasm about HSA, but looking at the whole thing (from the outside, that is) I can't help but think that outside of a few applications it is going nowhere, and it is going nowhere fast.

Starting with hardware, AMD is loud about HSA but I don't see the same level of enthusiasm coming from the other actors.
For example both Intel and ARM have GPUs that share the same memory space as the CPU. Neither is actively trying to turn that into a major "win".
There could be reasons for that (outside of technical merit): in both cases the existing APIs do not expose that kind of feature.
Intel is more about selling their CPUs than anything else.
Still, I find quite a paradox here: the "game changer" in GPGPU computing is kind of already here, but nobody really rushes to make "the win" out of that tech happen.

There is also the cost of GPGPU on our graphics processing units. We kind of forget, but the G in GPU stands for graphics, not "Gompute".

Throughout the years a lot of things have been introduced in our GPUs for the sake of "compute".
So far I've not been able to find anybody who would give estimates of how much it "costs us" in terms of silicon (and somehow retail price).
I have a lot of questions here: what is the cost of supporting integer computation? Is that really needed for the sake of graphics alone? What is the cost of the ever-improving compliance with the IEEE standard (wrt precision)?
I would guess that one would say not much on a per-feature basis, but all together it may add up to something, something "we" (consumers of realtime graphics) really need.

Another thing that bothers me (and it could very well be a misunderstanding) is the "way" GPUs are evolving. Yesterday I was reading the links Pete gave wrt the access latency of the various on-chip memory pools; it looks like the results are still being discussed right now, but what struck me is how many on-chip memory pools there are on a GPU.

To put the two previous points together and move forward, I feel like GPUs are getting "bloated". Worse, I wonder if the way they are "growing" is really "sane". x86 CPUs were criticized for a long time because they have to deal with the x87 legacy, and overall legacy was a burden on their design.
Now, when I look at GPUs, and it is quite possibly a misunderstanding of mine, I don't think the situation is any better. In fact I think it is worse.

If I look at AMD GPU evolution, first the local and global data shares were added, and the texture cache acted (or tried to act) as a standard cache when required. It was not good enough, so with GCN AMD introduced a proper L1 cache.
I mean, CPUs are criticized, but looking forward that looks "worse" to me than supporting x87 (or whatnot). It is the whole data path that is turning into a "mess".
CPUs, even though they deal with backward compatibility too, look a lot cleaner in how they have integrated functional units and evolved throughout the years.
I mean the "path" is straightforward: you have your caches, the front end, the ALUs, etc.
If I look at Intel's latest processors, "everything is shared". No matter what the CPU is to do, it gets instructions from the I$ and data from the D$, and the CPU has only one pool of instructions to be dispatched to the different execution units.
Now GPUs... as I get it, you get some things from the L1 texture cache, others from the L1, the LDS, the GDS, the constant cache and whatnot.

So I wonder. CPUs are not "completely clean" either; the integration of various units has improved with time (it was not that clean early on), but it seems to me that they have always been pretty clean when it comes to the data path. The way the memory subsystem evolved looks pretty clean to me too: they had an L1 cache, then L2, L3, etc., so a pretty vertical integration. L1 got split in two (in most architectures), instruction and data, but it is still pretty lean. GPUs seem more horizontal and "inferior" in how different memory pools are added on chip, so I wonder about the impact that could have on their evolution going forward.

Pretty much, I wonder if there is something that could hinder the "growth/evolution" of GPUs vs CPUs.
Could it be easier to grow something general purpose (more CPU-like) than some specialized piece of hardware?
The advantage of CPUs could be that they were always general purpose; that is their main purpose and it never really changed. Adding stuff was adding "bonuses", and it shows in how lean the integration looks (at least to me...). GPUs, on the other hand, were about graphics, but they are not really adding things; they are gradually evolving toward something else, without that much of a clear view of where they are heading.

I will now try to speak about software. I read the threads I mentioned at the beginning of this post multiple times, and the other day I also read the presentation about the 5 (25) challenges in real-time rendering looking forward.
It seems to me that developers want more and more CPU-like GPUs. I think Andrew Lauritzen's posts are pretty enlightening on the matter. The question to me is to what extent something that was not general purpose by design can evolve into something that really is general purpose, especially within the restrictions GPUs are designed to function with (you have to run old code; for general purposedness you need LDS/GDS and whatnot; next come proper caches, but you need to keep everything for BC purposes, etc.).

If I look again at how AMD GPUs evolved, I see that from VLIW4/5 to GCN, a SIMD/CU keeps track of 5 times more threads/wavefronts than in the previous architecture, for the sake of "compute".
At the same time, what I understand from reading a thread like the one about software rendering (or one of the numerous ones about Larrabee) is that TLP only gets you so far; at some point you need to exploit pretty much everything: ILP, TLP, data-level parallelism, data locality (and you want caches).
Another thing (in my rough understanding) is that having many threads does not help as workloads, even for graphics, become less and less data parallel; quite the contrary.
Back to a concrete example: what did AMD, their latest effort being GCN, do with their GPUs in their quest for "general purposedness"? You guessed it: they multiplied by 5 the number of threads/wavefronts the GPU needs to function. It turns out well, with massive wins in today's (GPU) compute performance. But looking forward, does that make sense?
I might be wrong (I'm used to it), but it looks like in their quest for "general purposedness" GPUs evolve with a pretty "short view" of where they are heading: today's win is tomorrow's headache.

So if I look at the next generation of consoles, putting R&D cost aside for the moment, I see that both Sony and MSFT used GCN-based GPUs. That comes at quite a cost in silicon vs the previous AMD generation of products. One could argue that GCN does significantly better with graphics, but no one knows (AMD does, I guess) which of the 50% extra transistors are responsible for the increase in 3D performance, and in what proportion.
(Please pass on examples like BF3 using "compute" to improve graphics. As I see it, the compute part in BF3 is more about doing graphics work outside of the usual "graphics pipeline", quite different from the texture decompression in Civilization 5, for example, or true compute workloads. By the way, if the price the GPU pays is for decompressing textures it is stupid; dedicated hardware would make a lot more sense.)

Juniper (or a hypothetical 10-SIMD VLIW4 design) weighs 1 billion transistors and the matching part using GCN weighs 1.5 billion; what those 500 million transistors could have bought in terms of graphics performance alone is unknown. The total could be more, as manufacturers could also "sit" completely on compliance with the IEEE standard (precision, rounding, etc.).
We are speaking here about almost the de facto lowest end of what a gamer (a consumer of realtime 3D graphics) would use; imagine the figures for mid-range and high-end cards 8O.
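
To make the arithmetic above explicit, here is a small Python sketch. It only uses the two round transistor counts quoted in this post (about 1 billion for Juniper, about 1.5 billion for the matching GCN part); the function name is made up for illustration, and how the extra transistors split between graphics and compute remains unknown.

```python
def transistor_overhead(old_count, new_count):
    """Return (extra transistors, relative increase) between two designs."""
    extra = new_count - old_count
    return extra, extra / old_count

# Round figures quoted in the post, not exact die data.
JUNIPER_TRANSISTORS = 1.0e9   # Juniper / hypothetical 10-SIMD VLIW4 part
GCN_TRANSISTORS = 1.5e9       # matching GCN-based part

extra, ratio = transistor_overhead(JUNIPER_TRANSISTORS, GCN_TRANSISTORS)
print(f"extra transistors: {extra:.2e} ({ratio:.0%} more)")
# prints: extra transistors: 5.00e+08 (50% more)
```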

Overall, the point of all of this is: do you think that GPUs can continue to evolve the way they are, or does something need to happen? (The evolution made sense, but in some aspects it is becoming a hindrance.)

There could be a split between real GPUs and truly general purpose "compute" hardware.
The truly general purpose hardware would look pretty close to current GPUs, but it would be leaner with regard to data path and special units. The thing is, I think it is set to lose to CPUs, and pretty soon (outside of a few workloads, which may be enough of a market/niche for quite some time).
I would expect the heirs of Larrabee and POWER A2 to (mostly) seal the fate of GPGPU computing, maybe within 5 to 10 years, depending on how aggressive and successful the manufacturers of such products are.

As for the GPU, a real GPU at this point could consist of something significantly slimmed down, with the bulk of computation being handled by CPU-type units.

Either way, I fail to see how GPUs can succeed at becoming really general purpose the way they are evolving now. At the same time, for gamers, I think we are paying what could be a high price for something we don't need, and on top of it for the sake of a battle GPUs are not set to win (which doesn't mean that big cores a la Haswell are the answer to everything).
A pretty nasty side issue is the matching market dynamic: even here we can read posts from researchers like Andrew Lauritzen, who are researching solutions to please our eyes, actually find solutions, but have no proper hardware to run what they find.

What we have is that the evolution of graphics is locked down by both the market dynamic and the matching APIs, and the main driver that pushes the evolution of the hardware and the APIs is something that could very well be set to fail (i.e. turning GPUs into general purpose devices. But into what? A "really" general vector machine? One may wonder, and that could be the main reason why I think it is set to fail: it seems that the workloads of tomorrow are less and less data parallel, and anything too wide, relying too heavily on wide vector processing, would be more of a bother than anything else).

anexanhume · Feb 15, 2013

Holy wall of text Batman. I disagree with some of your base assumptions.

I don't think GPGPU is all bad. While it may not be best optimized for typical graphic loads, you have to remember that programmers can adapt to this, and GPGPU means GPU makers can expand their install base and high dollar enterprise customers. I also think it makes them a little more flexible when it comes to paradigm shifts. Their generic nature may make them more suited to shifts in philosophy such as mega textures and mega meshes taking off, deferred rendering, SVO, etc. Especially when ray tracing comes along. You don't have to try and sell a consumer a whole new card for ray tracing capabilities (a la the Ageia card with PhysX) when it comes along if GPGPU is helping your card not be so unoptimized for that huge paradigm shift.

I also think legacy isn't as much a deal as you think. People were claiming ARM was going to eat Intel's lunch because x86 and all the fat associated with it would prevent them from competing in a low power environment. Medfield is starting to show that's not true, and I expect it will get better with successive generations.

There are also rumblings of Microsoft sending Direct3D on its merry way, and people love to quote the superior performance of L4D2 on Linux with just a little tweaking in OpenGL.

So, going forward, I don't know that there's a clear "this is the best decision for graphics no question, but we'll do this instead because GPGPU."

Sony, Microsoft and Nintendo will also keep throwing huge NRE money at them for the foreseeable future as well. They're directly coupled to what the console market wants and thus what graphics applications want.

While they both like to play in compute, they both know that competitiveness to each other's offers in gaming is what still drives their core and their bottom line. They'll keep catering to that in the future, especially if things like the steam box and onerous business models or expensive consoles drive people back into the waiting arms of the PC.

liolio · Feb 15, 2013

anexanhume said:
I don't think GPGPU is all bad. While it may not be best optimized for typical graphic loads, you have to remember that programmers can adapt to this, and GPGPU means GPU makers can expand their install base and high dollar enterprise customers. I also think it makes them a little more flexible when it comes to paradigm shifts. Their generic nature may make them more suited to shifts in philosophy such as mega textures and mega meshes taking off, deferred rendering, SVO, etc. Especially when ray tracing comes along. You don't have to try and sell a consumer a whole new card for ray tracing capabilities (a la the Ageia card with PhysX) when it comes along if GPGPU is helping your card not be so unoptimized for that huge paradigm shift.

I think they are still primarily designed for graphics; that's the whole issue. How can it become clean, lean?

I think there is an issue in how they are evolving (pretty short-sighted, based on market constraints/API evolution). It may have something to do with how the jump from T&L to programmable shaders happened. Possibly both too fast and in too incremental a manner.
It was never "OK, in 20xx we want a completely programmable unit (no restriction on shader length, data type, etc.)". The manufacturers might then have come up with something lean, I would think based on CPU+SIMD or an SPU sort of hardware, with fixed function units for tricky operations.

The whole issue is that the GPU did not change (and still hasn't), it grew. The basis is still the same and assumes that 90%+ of what they do is data parallel, and then they try to shoehorn so-called general purposedness into the device.

I think Nvidia knows that this is going nowhere. I suspect they are looking toward a Larrabee type of approach and already acknowledge the fact that ultimately, if the GPU survives, it will be in a pretty "diminished" form, packing together whatever fixed function hardware covers what the general purpose hardware does a really awful job with.

GPUs still can't touch what CPUs are doing as far as compute is concerned; actually they can't even function without taking quite some resources from the CPU.

I fail to see how there could be a paradigm shift with GPUs without breaking away from legacy (which is what Nvidia may try to do for exophase, but does that qualify as a GPU, I wonder). Look at my example about GCN. You want your GPU to be more general purpose and handle cases where there are a lot of dependencies; first step => multiply by 5 the number of threads => short win, but as I understand things they can go away with that mid term => overall it is a bit of a waste.
I have come to agree with Sweeney (even if it affects the driver teams foremost): why reinvent the wheel constantly?

patsu · Feb 15, 2013

IMHO, business people may want quick answers. Technically speaking, too early to call. Wait for the games and apps to be optimized first.

Gipsel · Feb 15, 2013

liolio said:
I think they are still primarily designed for graphics; that's the whole issue. How can it become clean, lean?

If you look at GCN, the architecture is actually pretty clean.

They just added some fixed function blocks to accommodate the graphics portions, which probably still suck on general purpose hardware (setup/rasterizers, ROPs with the specialized ROP caches and the HiZ stuff, the Tessellators are really tiny and in the TMUs only the filters and texture de-/compression logic block is there for graphics). Everything else basically forms a nice multithreaded vector architecture optimized for throughput computing. For instance, they ripped out the traditional constant cache (yes, it is gone, there is no dedicated hardware for this anymore, constants are delivered by the scalar ALU through its L1 and the L2 cache).

liolio said:
It was never "OK, in 20xx we want a completely programmable unit (no restriction on shader length, data type, etc.)". The manufacturers might then have come up with something lean, I would think based on CPU+SIMD or an SPU sort of hardware, with fixed function units for tricky operations.

That's basically what modern GPUs try to be.

liolio said:
The whole issue is that the GPU did not change (and still hasn't), it grew. The basis is still the same and assumes that 90%+ of what they do is data parallel, and then they try to shoehorn so-called general purposedness into the device.

I don't see it that way. The GPUs have come a really long way from being just some fixed function hardware to (almost) fully programmable devices.

liolio said:
I fail to see how there could be a paradigm shift with GPUs without breaking away from legacy (which is what Nvidia may try to do for exophase, but does that qualify as a GPU, I wonder). Look at my example about GCN. You want your GPU to be more general purpose and handle cases where there are a lot of dependencies; first step => multiply by 5 the number of threads => short win, but as I understand things they can go away with that mid term => overall it is a bit of a waste.

They want to increase the usefulness for all problems which have really high data-level and thread-level parallelism (graphics is a subset of this). And where did you get the figure of a 5x higher number of required threads? A GCN CU needs a comparable amount of threads to hide a comparable memory latency for instance (in practice maybe 50% more as it is simply faster than the old VLIW designs). If the ALU latencies themselves were a problem, nV GPUs would be awful.

As I said, you see modern GPUs in too bad a position. That's unjustified. They are basically almost there, where you want to have them.
And I think you meant Exascale, not one of the members on this board.
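
On the latency-hiding point above, a common back-of-the-envelope model (an assumption here, not something taken from AMD documentation) says that if a wavefront does some number of cycles of ALU work between memory requests, a SIMD needs roughly 1 + latency / work wavefronts in flight to stay busy. The sketch below uses made-up cycle counts purely for illustration, and `wavefronts_to_hide_latency` is a hypothetical helper.

```python
import math

def wavefronts_to_hide_latency(memory_latency_cycles, alu_cycles_between_fetches):
    """Simple occupancy model: while one wavefront waits on memory,
    the others must supply enough ALU work to cover the wait."""
    if alu_cycles_between_fetches <= 0:
        raise ValueError("need a positive amount of ALU work per fetch")
    return 1 + math.ceil(memory_latency_cycles / alu_cycles_between_fetches)

# Illustrative numbers only (not measured on any real GPU).
baseline = wavefronts_to_hide_latency(400, 100)  # 1 + ceil(4.0)  = 5
faster_alu = wavefronts_to_hide_latency(400, 67)  # 1 + ceil(5.97) = 7
print(baseline, faster_alu)
```

In this model the required count depends on the ratio of memory latency to ALU time rather than on the architecture label: a pipeline that gets through its ALU work faster needs somewhat more wavefronts to cover the same latency.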

liolio · Feb 15, 2013

patsu said:
IMHO, business people may want quick answers. Technically speaking, too early to call. Wait for the games and apps to be optimized first.

Which games, which apps? I mean GPGPU has been there for a while; even something like Photoshop barely uses it.
Intel is not part of HSA, Nvidia neither, I'm not sure about Qualcomm (and their Adreno); at least OpenCL was adopted by all/most of the actors.
How is HSA going to do better?

Honestly, compute is going to be used next gen, but I would bet mostly to do graphics-related stuff outside of the graphics pipeline a la BF3. The relevance of compute elsewhere, I doubt it, and it seems investors do too.

I would not be surprised if a lot of companies are evaluating Xeon Phi at the moment and, within a couple of years, depending on the follow-up Intel gives it, shift to that kind of product (x86 or ARM based).

patsu · Feb 15, 2013

liolio said:
Which games, which apps? I mean GPGPU has been there for a while; even something like Photoshop barely uses it.

Well if they can achieve their target without using GPGPU specific features, I don't see anything wrong. Kinect is done on 360 without modern GPGPU features too.

As for which game/app, wait for enough samples !

I have seen a few folks avoiding GPGPU because of unnecessary overhead. We will see if HSA addresses these overheads adequately.

rpg.314 · Feb 16, 2013

liolio said:
(Please pass on examples like BF3 using "compute" to improve graphics. As I see it, the compute part in BF3 is more about doing graphics work outside of the usual "graphics pipeline", quite different from the texture decompression in Civilization 5, for example, or true compute workloads. By the way, if the price the GPU pays is for decompressing textures it is stupid; dedicated hardware would make a lot more sense.)

So what? Compute is compute. Doesn't matter in the slightest if you shade pixels or multiply matrices.

What makes you think GPUs don't have texture decompression hw? Even LRB had it.

liolio · Feb 16, 2013

Gipsel said:
If you look at GCN, the architecture is actually pretty clean.
They just added some fixed function blocks to accommodate the graphics portions, which probably still suck on general purpose hardware (setup/rasterizers, ROPs with the specialized ROP caches and the HiZ stuff, the Tessellators are really tiny and in the TMUs only the filters and texture de-/compression logic block is there for graphics). Everything else basically forms a nice multithreaded vector architecture optimized for throughput computing. For instance, they ripped out the traditional constant cache (yes, it is gone, there is no dedicated hardware for this anymore, constants are delivered by the scalar ALU through its L1 and the L2 cache).

Damn it, how could I forget? I read most of the "software rendering thread" again yesterday, especially that post... it should have kept me from thinking that legacy hardware could prove problematic for the GPU. I feel dumb now... (though I'm used to it too...).

Gipsel said:
That's basically what modern GPUs try to be.

Well, that is the whole reason for my questioning: they are still not there and, I may be wrong, but it comes at quite a cost for graphics performance (the way graphics are done now).
There have been many attempts by now at making GPGPU into a win, the latest being OpenCL, and so far it got mostly nowhere. I'm not sure HSA (with less support than OpenCL) is going to change that.

Gipsel said:
I don't see it that way. The GPUs have come a really long way from being just some fixed function hardware to (almost) fully programmable devices.
They want to increase the usefulness for all problems which have really high data-level and thread-level parallelism (graphics is a subset of this). And where did you get the figure of a 5x higher number of required threads? A GCN CU needs a comparable amount of threads to hide a comparable memory latency for instance (in practice maybe 50% more as it is simply faster than the old VLIW designs). If the ALU latencies themselves were a problem, nV GPUs would be awful.

They have come a long way, but when I read either Andrew Lauritzen's posts or Nick's, I can't help but think that they have pretty much hit their limit. In this post for example, Andrew makes a really strong point; he made quite a few other points in that very thread too (and I would expect that much of such a bright mind) about job stealing algorithms, etc.
To get efficient at the kind of complex tasks Andrew is speaking about, they need to change massively, to the point that they would pretty much be CPUs. That means compromising a lot of their "compute density" (trading it for more control flow); at the same time it may affect power consumption, etc.
Meanwhile CPUs are not standing still and they are quite possibly closer to the optimum already. Actually I think that Larrabee, at the time Intel started working on it, was the right solution.
Though they may have made mistakes or missed pieces of the puzzle at that point in time.
In Keldor's post that I just linked, he states at the very end:

Getting a CPU ecosystem to that point is one heck of an uphill battle.

I think he is right, and quite possibly it showed in Larrabee's design. A bit like MSFT with Windows, Intel is used to dealing with customers that are mostly "locked down" in its product lines; as a result, even when it tries a new market, locking down more customers is a strong concern. Sometimes too strong a concern, which translates into missed opportunities. They can be more focused on extending their traditional market (/grip) than on exploiting/creating a new one (/source of revenue).

The language Andrew is speaking about in the aforementioned thread really seems to be an enabler for changing how hardware legacy is handled in the CPU environment (where the ISA rules, though MSFT tries to move away from that constraint).
At the time Larrabee was designed, ISPC (and other languages like it) were in their infancy, but I wonder what a CPU-based compute device could achieve free of any arbitrary ISA and hardware legacy constraints (as GPUs are).
It kind of gets me to the usual argument about GPUs being on the wrong side of the AGP/PCI Express bus. It is a bit of an excuse: they are on the wrong side because they can't do what the CPU does. Being on the same side as the CPU should make things better, but they are still not there.
(Most) GPUs have relied on high-bandwidth and expensive RAM types to do their thing, which in turn limits the amount of memory they have to play with. It could have been different if GPUs were able to do most of the things CPUs do and used rendering techniques that are less wasteful wrt bandwidth (tile-based rendering, deferred or not).
GPUs have been using the same type of languages Andrew is speaking about (ISPC); in my view the jump "to general purposedness" has somehow been too gradual. They could have started using CPU-type units; at first they might not have had the budget to do a Larrabee, and it could have looked more like an Emotion Engine + fixed hardware than like Larrabee with its fully cached on-chip memory pools, but it would have been really general purpose. Quite possibly they would not have been on the wrong side of the AGP/PCI Express bus, as a lot of the work done for graphics and games would have been done within their memory pool. It is a bit like what seems to be Nvidia's plan with Maxwell (? doubt about the project name) (I say "a bit like" because Maxwell looks more like an APU, just on the other side of the PCI Express bus).

Gipsel said:
As I said, you see modern GPUs in too bad a position. That's unjustified. They are basically almost there, where you want to have them.
And I think you meant Exascale, not one of the members on this board.

I'm not sure they are. I quite possibly misunderstand, but if they want to get there they will have to turn mostly into plain CPUs and give up compute density in the process, and they will instantly face strong competition from the CPU manufacturers. I think that many-core chips (be they ARM or x86, super throughput oriented or not) are going to mostly kill GPGPU. Intel for now has Xeon Phi, which looks like a transitional product to me; I would not be too surprised if, going forward, it is replaced by their new Atom or its direct successor (which I could see reaching parity with their high-end product line as far as SIMD is concerned). On the ARM side, I would expect that kind of product to spawn too, based on their new 64-bit ISA.
Andrew made a lot of sound points about iso-power comparisons; the GPU doesn't look that good when you look at the room CPUs have to grow their SIMD capability.

You are right about Exascale, I got confused.

rpg.314 said:
So what? Compute is compute. Doesn't matter in the slightest if you shade pixels or multiply matrices.

What makes you think GPUs don't have texture decompression hw? Even LRB had it.

What I meant is that, looking for example at the cost (in silicon) of GCN vs AMD's previous designs, I'm not sure how much of a win it is for graphics, or put another way, what the cost of improving graphics alone would have been. BF3 does graphics computation using compute shaders, but for a kind of computation VLIW architectures are well suited for.
That is not the case for the custom texture decompression done in Civilization 5.
I agree that "stupid" was not the proper wording; actually the devs are doing a great job at exploiting the hardware they have at hand and what it can do. What I meant is that if decompression units are needed to handle formats the texture units can't handle (or that would be impractical or too costly to implement there), it would possibly be better to have dedicated hardware, a bit like in Durango (and actually that could be true for the CPU too; maybe programmable decompression units are needed).

I did not mean that GPUs do not support texture decompression. I know at least that textures within the texture cache are in a compressed format and are decompressed on the fly (which I guess sets restrictions on how aggressive the compression is; I would assume that the more advanced the compression scheme, the more time it takes to decompress the data).

liolio · Feb 16, 2013

Oops, I forgot part of your post (Gipsel), wrt the number of threads GCN has to deal with, and it is quite possibly another misunderstanding of mine.
An old AMD SIMD dealt with 8 "threads/wavefronts" (not sure of the proper wording), two of those being active at the same time (round-robin type of execution).
In GCN, a CU keeps track of 40, and four are active at the same time. Though the rate at which instructions are issued is the same (1 per cycle iirc), my understanding is that it needs more threads "to function" vs the previous design.

EDIT

And by the way, it is obvious that I'm neither a coder nor an engineer, so thanks for sharing your knowledge and taking the time to clarify my misunderstandings and/or present things in another light.

3dcgi · Feb 17, 2013

Ignoring memory fetches AMD's VLIW designs needed two wavefronts (warps in Nvidia terminology) per compute unit (CU) to utilize all of the ALU cycles. GCN needs 4 wavefronts so there's an increase by 2 of the minimum number of wavefronts. But you also need to consider that the 4 GCN wavefronts will use all of the ALUs while the 2 VLIW wavefronts only used all of the resources if they could co-issue.

Of course these numbers are academic as you can't ignore memory fetches and both designs need more than 4 wavefronts per CU to be efficient.
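
The two ways of counting discussed in the last few posts can be put side by side. The figures below are the ones quoted in this thread (8 vs 40 wavefronts tracked, 2 vs 4 needed to fill the ALUs when ignoring memory fetches); the dictionary layout and the `ratio` helper are only for illustration.

```python
# Figures as quoted in the thread, per SIMD (VLIW) or per CU (GCN).
designs = {
    "VLIW": {"tracked": 8, "min_to_fill_alus": 2},
    "GCN": {"tracked": 40, "min_to_fill_alus": 4},
}

def ratio(field):
    """GCN-to-VLIW ratio for one of the quoted figures."""
    return designs["GCN"][field] / designs["VLIW"][field]

print("wavefronts tracked:      ", ratio("tracked"))           # 5.0
print("minimum to fill the ALUs:", ratio("min_to_fill_alus"))  # 2.0
```

The 5x figure is about how many wavefronts the hardware can keep in flight, while the 2x figure is about the minimum needed to use every ALU cycle; neither says how many are needed in practice once memory fetches are taken into account.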

rpg.314 · Feb 17, 2013

liolio said:
What I meant is that, looking for example at the cost (in silicon) of GCN vs AMD's previous designs, I'm not sure how much of a win it is for graphics, or put another way, what the cost of improving graphics alone would have been. BF3 does graphics computation using compute shaders, but for a kind of computation VLIW architectures are well suited for.
That is not the case for the custom texture decompression done in Civilization 5.
I agree that "stupid" was not the proper wording; actually the devs are doing a great job at exploiting the hardware they have at hand and what it can do. What I meant is that if decompression units are needed to handle formats the texture units can't handle (or that would be impractical or too costly to implement there), it would possibly be better to have dedicated hardware, a bit like in Durango (and actually that could be true for the CPU too; maybe programmable decompression units are needed).

I did not mean that GPUs do not support texture decompression. I know at least that textures within the texture cache are in a compressed format and are decompressed on the fly (which I guess sets restrictions on how aggressive the compression is; I would assume that the more advanced the compression scheme, the more time it takes to decompress the data).

We are way past the point of improving IQ by throwing more shaders at it. Better compute is required for better IQ going forward.

Davros · Feb 17, 2013

liolio said:
It kind of gets me to the usual argument about GPUs being on the wrong side of the AGP/PCI Express bus. It is a bit of an excuse: they are on the wrong side because they can't do what the CPU does.

They are on the wrong side because that's where the slot is.

torbor · Feb 17, 2013

Would be ironic if Sony's past plan was the best way in the end: a powerful and versatile CPU and a "dumb" rasterizer.

patsu · Feb 17, 2013

In Cell, they had compute all done on the CPU. They found out people use the extra power mostly for enhancing visuals.

In PS4, if the rumor is true, they shift part of the compute resources to the GPU. Developers can still use the CPU for compute in parallel. They don't have to move the data between 2 pools now. They also moved generic tasks to dedicated hardware.

At a high level, PS4 should be better and more versatile than PS3. You can dedicate all or part of the GPU for compute. Same for the CPU. But it depends on the finer details to realize the full potential.

Would be great if we know more about libGCM, and the modifications done to the GPU and CPU.

liolio · Feb 17, 2013

rpg.314 said:
We are way past the point of improving IQ by throwing more shaders at it. Better compute is required for better IQ going forward.

I kind of agree, though when I read the POV of a researcher on the matter (which was given as a reference again in one of the last presentations by Dice/Repi, about the 5 (in fact 25) challenges in real-time rendering looking forward), it seems that GPUs are not close to being able to run the kind of implementations he is working on.
I think there is also a distinction between "compute" and "compute": compute shaders, it seems, are also used to "escape" the graphics pipeline as defined by the API. They can do the same graphics work GPUs were optimized to run.
I will try to go back to that point a bit later, in an effort to make my point clearer, a goal that is sadly limited by my own limits and knowledge.

Davros said:
They are on the wrong side because that's where the slot is.

I put that in quite a sarcastic form, but I think the point still somehow holds.
First, GPUs could not do anything (mostly) outside of graphics; then they could to some extent, and some languages were developed to make use of those resources more conveniently. It was not good enough for most software publishers to use those resources. Then the problem was that the CPU and GPU use different memory pools. Now we have APUs (and some already provide a shared memory space between the GPU and CPU, e.g. Intel and ARM GPUs) and we have OpenCL => not good enough.
Next on the list is HSA, as AMD may finally be able to provide a shared memory space between the CPU and GPU. It could be fruitful, though it seems it is not as widely supported as OpenCL was.

I read the paper about ISPC that Andrew shared, and I think the guys who wrote it have a point: CPUs and GPUs are still too different. If you design a language (à la OpenCL or others) supposed to run on both CPU and GPU, it has to work on the lowest common denominator; it can either favor the CPU or the GPU, or actually neither (the real lowest denominator). That is why they created ISPC in the first place: to make good use of the CPU.

So to go further: now you will have the same memory space, so what is the next "excuse"? I would bet that in a few years the issue will be what some have already pointed out: even if you get a language that gets the most out of both the CPU and the GPU (which seems a bit troublesome without the developers somehow giving hints, which is close to coding for 2 architectures, or a damned clever compiler, in which case VLIW in general computing should be reconsidered), how do you balance tasks efficiently between those different units?

I think the reason GPGPU is not taking off as fast as expected (and won't) is that some project/product managers are aware of those issues and are not willing to take that step into the darkness. Others may have already ported things, spending a lot of effort (/money) mostly on an ever-shifting software and hardware target platform, and might be wary about taking any other bet.

torbor said:
It would be ironic if Sony's plan in the past was the best way in the end: a powerful and versatile CPU and a "dumb" rasterizer.

patsu said:
In Cell, they had compute all done on the CPU. They found out people use the extra power mostly for enhancing visuals.

Well, actually I think that STI did not have a clear enough direction for the Cell programming model, or worse, had a wrong one. Though I think the idea was closer to the "truth" as I see it (so a pretty limited form of truth...). The truth about GPGPU has nothing to do with GPU hardware but could really well be in Keldor314's statement:

The bottom line is that being able to JiT code, all the time, allows you to drop binary backwards compatibility, and thereby change the high level architecture any time it's beneficial. This is the real advantage of JiTted code, but it only works if pretty much all the code in the ecosystem is JiTted. Since GPUs have used virtual assembly code from the beginning of programmable shaders, they really can ignore binary compatibility and still have backwards compatibility enforced. Getting a CPU ecosystem to that point is one heck of an uphill battle.

The key to GPGPU might have been in the software: with a defined software platform, design throughput-oriented CPU-type units that are free of legacy and hardware compatibility issues. So the point of GPGPU could have been to set up such an environment (not jumping through tinier and tinier hoops to get there, making the actors more and more wary about adopting it and porting their code to an ever-shifting platform, and letting the CPU actually get close too).

I don't want to make it sound as if it is trivial; it seems that Intel is only now finding a proper way to make use of the hardware they've been producing for a while, through efforts like ISPC (like the SPMD programming model that has been so successful on GPUs).

Had the Cell been built to accommodate that model, the result could have been different. Though, like Intel with Larrabee, IBM was bent on POWER; Toshiba was bent on MIPS but gave up. There are plenty of business arguments that favor consolidating a grip on an existing market over addressing a new one (for Cell and Larrabee, at least, it failed).

If the Cell had followed that road, it might not have included a PPE and might have been homogeneous; it might have used an ISA with some support for the aforementioned programming model, an unexposed ISA that would have freed the hardware of legacy and restricted its evolution for the best looking forward.
It would quite possibly have been closer to a many-core Emotion Engine than the Cell.
It would have been designed to indeed do everything, though it could not have been like Larrabee or Power A2; I don't think that at that point in history the silicon budget allowed for that.

3dcgi said:
Ignoring memory fetches AMD's VLIW designs needed two wavefronts (warps in Nvidia terminology) per compute unit (CU) to utilize all of the ALU cycles. GCN needs 4 wavefronts so there's an increase by 2 of the minimum number of wavefronts. But you also need to consider that the 4 GCN wavefronts will use all of the ALUs while the 2 VLIW wavefronts only used all of the resources if they could co-issue.

Of course these numbers are academic as you can't ignore memory fetches and both designs need more than 4 wavefronts per CU to be efficient.

OK, thanks. So pretty much the other wavefronts are there to give the scheduler something to play with (in a positive way, so as to improve ALU utilization).
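To put the numbers from the quote into a toy model (memory fetches ignored, exactly as 3dcgi does): the sketch below is only an illustration. The function name `alu_utilization` and the `slot_fill` parameter are made up for it; `slot_fill` stands for how much of each VLIW bundle the compiler manages to co-issue, and the 3-out-of-5 value is a hypothetical figure, not a measurement.

```python
def alu_utilization(wavefronts, min_wavefronts, slot_fill=1.0):
    """Fraction of ALU issue slots kept busy, ignoring memory fetches.

    wavefronts     -- wavefronts resident on the compute unit
    min_wavefronts -- wavefronts needed to cover every ALU cycle
                      (2 for the VLIW designs, 4 for GCN, per the quote)
    slot_fill      -- fraction of each instruction's ALU slots that is
                      filled (hypothetical knob; 1.0 means full co-issue)
    """
    if wavefronts < 0 or min_wavefronts <= 0:
        raise ValueError("wavefront counts must be positive")
    if not 0.0 <= slot_fill <= 1.0:
        raise ValueError("slot_fill must be between 0 and 1")
    occupancy = min(wavefronts, min_wavefronts) / min_wavefronts
    return occupancy * slot_fill


# GCN with its minimum of 4 wavefronts: every ALU is used.
gcn = alu_utilization(4, 4)

# VLIW with its minimum of 2 wavefronts: only full if everything co-issues.
vliw_full = alu_utilization(2, 2, slot_fill=1.0)
vliw_partial = alu_utilization(2, 2, slot_fill=3 / 5)  # assumed 3 of 5 slots

print(gcn, vliw_full, vliw_partial)
```

The only point of the model is the one made in the quote: meeting the minimum wavefront count is enough for GCN, while the VLIW designs additionally depend on co-issue.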

Though, putting latencies aside for a moment, I read that paper about ISPC, and the terminology they use is "gang" (it would be warp or wavefront in Nvidia and AMD parlance). They say that one core can spawn a "gang" whose width (number of elements) is twice the SIMD width, so their max is 16 elements with AVX units (they say the limit is arbitrary, though they chose it because they thought it was the best thing to do).
I think they are aiming at workloads that have less parallelism than what GPUs deal with (though it can deal with more too).
I wonder what GPUs would do to mimic that. In a CU the "gangs" are wider and, more importantly, there are more of them. At some point, isn't the model GPUs pursue self-defeating? They rely too much on thread-level parallelism and juggle so many threads to hide latency that it puts a lot of pressure on the working memory they have (overall, many threads have to share a tiny amount of memory => cache efficiency collapses), and when you try to reduce the number of "gangs"/"warps" they deal with, they are too reliant on TLP and their performance collapses too? (The amount of "control flow" you can do per element could matter too.)
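To make the "gang" idea and the per-element control flow question a bit more concrete, here is a minimal sketch of SPMD execution with an execution mask, written in plain Python rather than ISPC. Everything in it (the name `run_gang`, the even/odd branch, the pass counter) is a made-up illustration and not taken from the ISPC paper; the gang size of 16 is just the AVX figure mentioned above.

```python
def run_gang(values, gang_size=16):
    """Emulate SPMD lockstep execution of:

        if x is even: x = x * 2
        else:         x = x + 1

    All elements of a gang follow the same instruction stream; a mask
    decides which lanes keep the result of each branch. Returns the
    results and the number of branch passes that had to be executed.
    """
    results = []
    passes = 0
    for start in range(0, len(values), gang_size):
        gang = values[start:start + gang_size]
        mask = [v % 2 == 0 for v in gang]   # lanes taking the "if" side
        out = list(gang)
        if any(mask):                        # at least one lane wants "if"
            passes += 1
            out = [v * 2 if m else o for v, m, o in zip(gang, mask, out)]
        if not all(mask):                    # at least one lane wants "else"
            passes += 1
            out = [o if m else v + 1 for v, m, o in zip(gang, mask, out)]
        results.extend(out)
    return results, passes


coherent = run_gang([2, 4, 6, 8], gang_size=4)    # all lanes agree
divergent = run_gang([0, 1, 2, 3], gang_size=4)   # lanes disagree
print(coherent, divergent)
```

When all lanes agree, only one side of the branch is executed; when they diverge, the whole gang pays for both sides. The wider the gang (or warp/wavefront), the more likely that is.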

I do not get how GPUs can get away without doing what modern CPUs do, leveraging everything they can: ILP, DLP, TLP, etc.
So far I think that GPUs have tried to avoid what is likely to be a significant hit to their "compute" density, and they've been helped in that task by the matching evolution of the API.
But I'm not sure there is any magic that would let GPUs maintain their advantage against other types of units like CPUs down the line.

It is a bit like what I tried to say earlier about the jump not being "brutal" enough: the evolution of the API and the market overall has allowed GPUs to avoid taking that jump (down in compute density, but it could be for the best as brute force shows its limits). If I look at a project like Exascale, for example, the manufacturers are given a clear target well ahead of time and come up with their take on what is best. With GPUs the whole process looks a bit more undefined to me.
To some extent (and while I understand that the market dynamic was against such a move) it could have been better to stick a bit longer to improving T&L (/mildly programmable hardware, ~DirectX 8) if the manufacturers had been given a "definitive target" based on expected process evolution. Something like: around 2006/7 we make a massive jump to a really general-purpose compute device backed by fixed-function hardware to ease rendering.
Now, as it is, neither AMD nor Nvidia might be that willing to take the jump for competitive reasons => you may get a big, hot, costly piece of hardware which doesn't do much better at rendering games (imagine a Fermi, only a lot worse).

By the way, you worked at ATI; do you have an idea of the cost in silicon of complying with the IEEE norm? Is integer calculation that useful for graphics? Sebbbi said that he did integer computation on Xenos (working OK within a given range, with low precision), but for real-time graphics it is good enough. (I also remember some Dice or Epic(?) paper stating that some computations, rasterization for example, are done at too high a precision; I would have to search through quite a few papers to find the proper reference. For real-time graphics (as in games) it is not needed.)
I wonder how much all of this amounts to.

patsu · Feb 18, 2013

Cell programming model is pretty clear cut. The problem is it's too harsh for many developers, and does not run existing code well. During development, they profiled existing code, and ported some to run on the SPUs (We have all seen those demoes).

IMHO, GPGPU programming may be a misnomer. The architecture is suited for certain type of work. It may never be as general purpose as a CPU. The GPU binary backward compatibility is only part of the formula. Because the GPUs are thrown at mostly the same type of problems for a large part of its life, it's easier to deal with backward compatibility performance.

If people invent more variety of "GPGPU models", binary compatibility may not guarantee you great performance in all cases.

liolio · Feb 18, 2013

Well, replace GPGPU programming with the SPMD programming model, which was not invented for GPUs even if they have proved so far to be its most successful application.

3dcgi · Feb 18, 2013

liolio said:
By the way, you worked at ATI; do you have an idea of the cost in silicon of complying with the IEEE norm? Is integer calculation that useful for graphics? Sebbbi said that he did integer computation on Xenos (working OK within a given range, with low precision), but for real-time graphics it is good enough. (I also remember some Dice or Epic(?) paper stating that some computations, rasterization for example, are done at too high a precision; I would have to search through quite a few papers to find the proper reference. For real-time graphics (as in games) it is not needed.)
I wonder how much all of this amounts to.

I've never worked on the floating point units, but I don't think IEEE compliance was a huge cost. Low single digit percentage increase would be my guess, but I stress that it's just a guess. A CPU designer once told us he used to think GPU floating point hardware was efficient because it got the "wrong answer." He was impressed that it's still efficient now that it gets the "right answer."

I could be missing some features, but IMO the main compute features that are wasted area for graphics are double precision and ECC. Double precision might become useful for graphics at some point, but I have a hard time seeing ECC being important for games at least.

Also, make sure you don't confuse CPU and GPU terminology. What people mean when they say thread level parallelism is not what GPUs rely on. GPUs need data parallelism. For example, triangles from a mesh spawning thousands of pixel shaders is equivalent to data parallelism.

Gipsel · Feb 18, 2013

liolio said:
Though, putting latencies aside for a moment, I read that paper about ISPC, and the terminology they use is "gang" (it would be warp or wavefront in Nvidia and AMD parlance). They say that one core can spawn a "gang" whose width (number of elements) is twice the SIMD width, so their max is 16 elements with AVX units (they say the limit is arbitrary, though they chose it because they thought it was the best thing to do).
I think they are aiming at workloads that have less parallelism than what GPUs deal with (though it can deal with more too).

They have to target workloads with less parallelism as on the other ones it's less likely to beat GPUs.

liolio said:
I wonder what GPU would do to mimic that,

To mimic what? Variable vector lengths? That's basically an old concept of vector computers which just need a modern implementation. You may want to have a look at the presentations about future GPU generations (Einstein) from nV.

liolio said:
in a CU the "gangs" are wider and, more importantly, there are more of them. At some point, isn't the model GPUs pursue self-defeating? They rely too much on thread-level parallelism and juggle so many threads to hide latency that it puts a lot of pressure on the working memory they have (overall, many threads have to share a tiny amount of memory => cache efficiency collapses), and when you try to reduce the number of "gangs"/"warps" they deal with, they are too reliant on TLP and their performance collapses too? (The amount of "control flow" you can do per element could matter too.)

As others have noted already, GPUs rely mostly on data-level parallelism. Thread-level parallelism would be running different kernels in parallel (they can do this too, and internally wide problems get split into a lot of threads [warps/wavefronts]). And in throughput-oriented, latency-hiding architectures, you don't rely on the caches so much to keep the latency down, but to provide more bandwidth than the external RAM. For that, a pretty low cache efficiency (in CPU terms) is usually enough. With larger SIMD arrays the caches have to grow, of course, to maintain the level of efficiency they have.
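A rough back-of-the-envelope sketch of the latency-hiding idea, assuming a very simplified model in which every wavefront alternates a fixed run of ALU work with one memory fetch. The function name and the cycle counts are hypothetical, picked only for illustration; real hardware is far less regular.

```python
import math


def wavefronts_to_hide_latency(fetch_latency, alu_cycles_per_fetch):
    """Wavefronts needed so the ALUs never idle in the simplified model.

    While one wavefront waits `fetch_latency` cycles for its data, the
    others must supply that many cycles of ALU work between them, each
    contributing `alu_cycles_per_fetch` cycles before it stalls as well.
    """
    if fetch_latency < 0 or alu_cycles_per_fetch <= 0:
        raise ValueError("latency must be >= 0 and ALU cycles > 0")
    others = math.ceil(fetch_latency / alu_cycles_per_fetch)
    return 1 + others


# Hypothetical numbers: a 400-cycle fetch latency.
fetch_heavy = wavefronts_to_hide_latency(400, 40)    # little math per fetch
math_heavy = wavefronts_to_hide_latency(400, 100)    # more math per fetch
print(fetch_heavy, math_heavy)
```

In this model the latency is covered by having enough independent work in flight rather than by a high cache hit rate, which is why the cache mostly has to deliver bandwidth.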

liolio said:
I do not get how GPUs can get away without doing what modern CPUs do, leveraging everything they can: ILP, DLP, TLP, etc.

Because that is expensive and costs a lot of transistors and power. If you can get away with less effort on that, you will be more power efficient for the kind of tasks which tolerate it.
By the way, GPUs can use DLP, TLP, and ILP (the VLIW architectures relied quite heavily on it, for instance), just not as aggressively as CPUs.
