Dazed and Computed / did the GPUs take the wrong road?

I've never worked on the floating point units, but I don't think IEEE compliance was a huge cost. Low single digit percentage increase would be my guess, but I stress that it's just a guess. A CPU designer once told us he used to think GPU floating point hardware was efficient because it got the "wrong answer." He was impressed that it's still efficient now that it gets the "right answer."
OK, thanks, that's less than I thought it would be (though I didn't have a crazy figure in mind, say ~10%).
Also, make sure you don't confuse CPU and GPU terminology. What people mean when they say thread level parallelism is not what GPUs rely on. GPUs need data parallelism. For example, triangles from a mesh spawning thousands of pixel shaders is equivalent to data parallelism.
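To make that distinction concrete, here is a minimal C++ sketch (purely illustrative names and workloads, nothing from a real engine): the data-parallel part is one function applied independently to a million "pixels", which is the shape of work a GPU spreads across its lanes, while the thread-level part is different tasks running concurrently.

```cpp
// Minimal sketch of the distinction; names and workloads are made up for illustration.
#include <cmath>
#include <functional>
#include <thread>
#include <vector>

// Data parallelism: the *same* operation applied to many independent elements.
// This is the shape of a mesh spawning thousands of pixel-shader invocations,
// and it is what a GPU spreads across its SIMD lanes.
void shade_pixels(std::vector<float>& pixels) {
    for (float& p : pixels)
        p = std::sqrt(p);           // same instruction stream, different data
}

// Thread-level parallelism: *different* work running concurrently.
void update_physics() { /* ... */ }
void decode_audio()   { /* ... */ }

int main() {
    std::vector<float> pixels(1 << 20, 2.0f);

    std::thread shading(shade_pixels, std::ref(pixels)); // a data-parallel workload
    std::thread physics(update_physics);                 // an unrelated task
    std::thread audio(decode_audio);                     // another unrelated task

    shading.join();
    physics.join();
    audio.join();
    return 0;
}
```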
Thanks for pointing this out :) It is actually "relatively" clear in my mind, but indeed the terminology is quite confusing for outsiders like me.

They have to target workloads with less parallelism, since on the other ones they're less likely to beat GPUs. ;)
The point is that for compute, the basic tool everybody uses, and it's a landslide, is the CPU.
It seems to me that the gain provided by the GPU still applies to only a few types of workloads.
What is bothersome is that researchers in realtime graphics now have solutions that don't rely on the brute force approach used by GPUs. I think it kind of shifts perspectives: reading Andrew's comments, I don't feel like GPUs will get where he needs them to be for his techniques to work in the near future (he may clarify, I might be reading too much into it, though sebbbi's point of view on the matter is another legitimate source of questioning).
At the same time, he and others acknowledge that nothing (a real paradigm shift in realtime rendering) may happen before 5 or 6 years from now, and what comes next is unclear.

So I feel that "workloads with less parallelism" is where most workloads are, including those that would qualify as math heavy, and it could be where (realtime) graphics is headed too. For now, let's say there is a lock-in, as nobody will come up with a technique that performs horribly on today's (and tomorrow's) GPUs.
To mimic what? Variable vector lengths? That's basically an old concept from vector computers which just needs a modern implementation. You may want to have a look at the presentations about future GPU generations (Einstein) from nV.
I will search for that presentation :)
I mean mimic CPUs in everything they do: autonomous, able to get a lot out of data locality through caches, able to dynamically exploit ILP, able to successfully exploit "low to moderate" levels of data parallelism through their SIMD units, and also (while offering less throughput) able to deal with high levels of data parallelism. They have a lot of mechanisms to hide latency. They can "storm" through the serial parts of the code, etc. All that with one unit working on data in its cache (assuming the data/code is there, but the same applies to GPUs; if not, you have to rely on latency hiding mechanisms). Even from a power perspective, splitting work between CPU and GPU "cores" means that you may have to move data back and forth quite often (and you get headaches like load balancing).

As others have noted already, GPUs rely mostly on data level parallelism. Thread level parallelism would be running different kernels in parallel (they can do this too, and internally, wide problems get split into a lot of threads [warps/wavefronts]). And in throughput-oriented, latency-hiding architectures, you don't rely on the caches so much to keep the latency down, but to provide more bandwidth than the external RAM. For that, a pretty low cache efficiency (in CPU terms) is usually enough. With larger SIMD arrays the caches have to grow of course to maintain the level of efficiency they have.
Because that is expensive and costs a lot of transistors and power. If you can get away with less effort on that, you will be more power efficient for the kind of tasks which tolerate it.
By the way, GPUs can use DLP, TLP, and ILP (the VLIW architectures relied quite heavily on the latter, for instance), just not as aggressively as CPUs.
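To put rough numbers on the latency-hiding point above (every figure here is an illustrative assumption, not taken from any specific GPU): a throughput design needs on the order of latency x issue rate independent work items in flight, which is why it leans on massive parallelism rather than on latency-optimized caches.

```cpp
// Back-of-envelope for latency hiding; every number below is an illustrative assumption.
#include <cstdio>

int main() {
    const double memory_latency_cycles = 400.0; // assumed latency of a miss to external RAM
    const double lanes_per_simd_block  = 32.0;  // assumed SIMD width issuing 1 op/cycle/lane

    // To keep the lanes busy while loads are outstanding, you need roughly
    // latency x issue rate independent work items in flight (Little's law).
    const double in_flight = memory_latency_cycles * lanes_per_simd_block;

    std::printf("independent work items needed in flight per SIMD block: ~%.0f\n", in_flight);
    return 0;
}
```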
Thanks for further clarifying the vocabulary :)

They are turning into CPUs: scalar units, more cache. It will further affect their compute density.
CPUs are not that far behind: if the Durango CPU really has 2 FMA units per core, the throughput of 8 cores is around 200 GFLOPS. It is pretty tiny and relatively cool. 32 cores would be in the 800 GFLOPS range, and that still would not be that big or hot (by GPU standards). It would be interesting to see how that compares to something like a Fermi-type GPU for compute workloads (not graphics though).
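For what it's worth, the arithmetic behind those figures would look like this, assuming 1.6 GHz Jaguar-class cores with two 4-wide (128-bit) FMA units each and counting an FMA as 2 flops; the clock and unit counts are my assumptions, not confirmed specs.

```cpp
// Peak-throughput back-of-envelope; core counts aside, all parameters are assumptions.
#include <cstdio>

// peak = cores x clock (GHz) x FMA units per core x SIMD lanes per unit x 2 flops per FMA
double peak_gflops(int cores, double ghz, int fma_units, int lanes) {
    return cores * ghz * fma_units * lanes * 2.0;
}

int main() {
    std::printf("8 cores : ~%.0f GFLOPS\n", peak_gflops(8, 1.6, 2, 4));   // ~205
    std::printf("32 cores: ~%.0f GFLOPS\n", peak_gflops(32, 1.6, 2, 4));  // ~819
    return 0;
}
```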

And Jaguar, even with 2 FMA units, is not that throughput oriented either. Look at Cell: its compute density was really high.

After having read the ongoing "fight" on this board between the "GPU guys" and the "CPU guys", with people like T. Sweeney stating what he stated (even though some mocked him), and the enthusiasm that surrounded Larrabee overall, the "CPU guys" have won me over.

GPU are "somehow" in a pretending stance, even Nvidia went back from Fermi to Kepler wrt to compute capability, it was too costly. For the (few) tasks massively data parallel they are almost already has good as they can (what is left to win). They are already bigger than most CPU, hotter, etc. Thanks to lithography they will continue to provide more but mostly more of the same.
With the market dynamic being what it is (/all actors would need to shift to another paradigm, come to an end with the graphic pipeline completely, get rid of those thick API and driver models /it is unlikely especially as MSFT grip is loosening on the that front => more actors) they can't really afford to go for anything else that something that is still close to a machine that handles "massively data parallel problems" and only.

I would more easily put myself in the position of a business person: when I read the arguments that have been going on for years, I have to say the "GPU guys" have a tough sell, and I don't see a real game changer coming. Actually, it is the CPU that won me over. The GPU environment is a moving target; looking at what Nvidia has been doing and stating of late, I would clearly be thinking this is going nowhere / still nowhere near ready for prime time.
On the other hand, Intel shipped (at last) Xeon Phi and is likely to have follow-ups; many-core CPUs (X86 or ARM) and high bandwidth interfaces for CPUs are a couple of years ahead, and it seems that the CPU guys have rediscovered the beauty of the "GPU" programming model (SPMD, or SIMT as Nvidia calls it?).
I think that for once the business people are right; it sounds a bit to me like somebody saying "this is where our project is heading, you can rely on us (/have faith), though we aren't really willing to pay the price for the project to materialize".
Outside of a few applications, it is too shady an area of computing, with too many questions in the air, for the big money to invest in (well, financial computation works on GPUs, but GPUs already "work" for those kinds of workloads). You have the HPC "market", but outside of big projects funded by government money, it looks more like an impressively dedicated bunch of people, willing to use anything and to go through any kind of pain for the sake of advancing mankind's knowledge. It is noble, but I don't think it is enough to support the development of any specific hardware.


Overall, at this point I'm not sure one could easily change my mind; I believe the timing Intel had for the shift (with Larrabee) was the right one. Silicon budgets were getting good enough. If (with an "if" you could put Paris in a bottle / a posteriori thinking is so much easier...) there had been a consensus on the matter (MSFT had, I think, the weight at the time to push the whole industry in that direction), I think that a "Larrabee" done by the GPU guys would have removed the lock on 3D realtime researchers/developers, would have hurt the CPU business in more than a couple of applications, and would have outdone Intel's effort to extend its ISA-based grip on the industry (i.e. Larrabee).
Though speaking of business, would even AMD at the time have been OK with that (really competing with themselves with a non-X86 architecture)? It is not a given, to say the least.
 
I doubt there is a single, ideal programming model. People mix them based on the use cases.

There should not be unnecessary copying when you switch models.

e.g., In GoW3's MLAA, they did everything on the SPUs. In PhyreEngine, I believe they solve the MLAA equations using the SPUs and blend using the RSX (because it's quicker at this particular task).

The software guys will find their way around the hardware.
 
The post is already long enough so... new post.

Gipsel, I could not find the presentation from Nvidia :( My Google skills suck.

Still, I want to point out that even as machines that first and foremost handle massively data parallel problems, GPUs seem to have fallen significantly short of expectations.
I do not believe Nvidia's forecast of "I don't know how many FLOPS, somewhere between 2015 and 2020 on an 11nm process". Even in that field, scaling seems to have slowed down significantly.

As I was searching for the aforementioned presentation, a memory came back to mind: a presentation made by Johan Andersson a few years ago. IIRC, he was expecting GPUs to deliver 50 TFLOPS by 2015. We now know pretty much for sure that it won't happen.
And no one can really question Repi's cleverness or assert that he would make such a statement without first looking for proper information; he probably based his opinion on the manufacturers' expectations/roadmaps. The point is that they fell short, quite massively in fact.

On the other hand, nobody really expected to see in 2013 a quad core X86 CPU pushing close to 500 GFLOPS, supporting some form of gather and some form of transactional memory. All that while being a relatively affordable, not that power hungry, not that big Core i5, and including what seems to be a sane GPU (mid-range by the standards of a few years ago).
The higher power (in watts) versions of Haswell, with 8 cores or more, are likely to break the TFLOPS mark, which is quite unexpected.
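The same back-of-envelope for Haswell, assuming its two 8-wide (256-bit) AVX2 FMA units per core (32 single-precision flops per cycle per core); the clock speeds below are assumed for illustration.

```cpp
// Haswell-class peak throughput sketch; the clock speeds are assumptions.
#include <cstdio>

int main() {
    const int flops_per_cycle_per_core = 2 /*FMA units*/ * 8 /*SP lanes*/ * 2 /*flops per FMA*/;

    std::printf("4 cores @ 3.5 GHz: ~%.0f GFLOPS\n", 4 * flops_per_cycle_per_core * 3.5);  // ~448
    std::printf("8 cores @ 4.0 GHz: ~%.0f GFLOPS\n", 8 * flops_per_cycle_per_core * 4.0);  // ~1024
    return 0;
}
```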
 
I doubt there is a single, ideal programming model. People mix them based on the use cases.
No, there isn't, though it seems clear now (AMD having rightly moved away from the VLIW approach) that there is a winner when it comes to dealing conveniently with SIMD. Intel, Nvidia, AMD: they have all acknowledged the same solution to the same problem.

If I look at ISPC, it is possible to call C or C++ code from within an ISPC program, and the other way around. It will only get better (it is new).
So not one programming model, but languages that can be conveniently interlocked (wording issue, might not be the proper term), with semantics that are close and consistent in form and logic. Etc.
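As a concrete (and hypothetical) illustration of that interlocking: an ISPC "export" function gets a plain C-style signature, so the C++ side can call it like any other function, and ISPC code can likewise call back into C/C++. The kernel name below is made up, and the stub stands in for what would really come from the ispc-compiled object.

```cpp
// Sketch of the C++ side of ISPC/C++ interop. The kernel name "scale" is
// hypothetical; in a real build its symbol would come from the object file the
// ispc compiler emits (ispc can also generate a matching C/C++ header), and the
// C++ side would only hold the extern "C" declaration.
#include <vector>

// Stand-in body so this sketch is self-contained; with a real ISPC kernel this
// would just be:  extern "C" void scale(float out[], const float in[], int n);
extern "C" void scale(float out[], const float in[], int n) {
    for (int i = 0; i < n; ++i) out[i] = 2.0f * in[i];
}

int main() {
    std::vector<float> in(1024, 1.0f), out(1024);
    scale(out.data(), in.data(), static_cast<int>(in.size()));  // ordinary call from C++ into the "kernel"
    return 0;
}
```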

The software guys will find their way around the hardware.
I don't think so. For a game or a few other apps the developers will go through the effort; the big money won't. GPU manufacturers will face shrinking revenues from selling GPUs for "compute purposes" (maybe in the non-realtime professional market too), and the volume of discrete GPUs is also set to go down. As volume goes down, it will get tougher and tougher to compete with CPUs.

Another trend is that Nvidia and PowerVR alike have a rising interest in CPU development (PowerVR going as far as buying MIPS Technologies, and I'm more than doubtful they are trying to compete with either the ARM or X86 environment; patents might definitely be part of the reason, but I don't think they are the only reason behind the buyout). I may be reading a bit too much into it, but still it is tough not to wonder about something happening behind the curtain.

EDIT

For those who may wonder about the title:
Quite a nice piece of code in dodeca-decimal :LOL:

EDIT 2
It got me wondering what coders think about music as a language. I don't have the talent to do something interesting with it, but I have a good idea of how it works. It is really a thing of beauty in my opinion. It is kind of self-contained; the theory is "included"/built into the way music is written, with key signatures, etc. Even the simplest chords are the simplest thing you can write: stack 3 notes on adjacent lines (or spaces) of the staff, no matter what the key signature is, and you have all the major/minor and diminished chords (in any key, with the matching key signature); harmony is kind of self-explanatory too.
Beautiful, though when you think about it, it is millennia of research.
 
Ah I see what you're saying. If it's business influences, then yes GPU focus may be diluted.

From a technical standpoint, moving to solve end-to-end GPU issues is a natural next step.

If I look at ISPC, it is possible to call C or C++ code from within an ISPC program, and the other way around. It will only get better (it is new).
So not one programming model, but languages that can be conveniently interlocked (wording issue, might not be the proper term), with semantics that are close and consistent in form and logic. Etc.

Yes, they should focus on reducing communication, synchronization and data overhead as part of this holistic quest. IMHO the language extensions are mostly syntactic sugar on top of the real saving/improvement underneath.
 
Gipsel, I could not find the presentation from Nvidia :( My Google skills suck.
There is some description in this talk by Bill Dally (the interesting part starts around 30 minutes into the presentation). It was actually more detailed in another presentation available on the nV Research website, which is currently offline, or as nV says: "Updated January 24, 2013

Thanks for your patience while we continue to harden security around the NVIDIA Research site. We expect the site to be back up in the March 2013 timeframe." :rolleyes:
There they explained how the dynamic vector lengths actually work. In principle, they give up SIMD a bit. Several work-items (in OpenCL language; threads in nV's) can still share (and usually do) the same instruction stream to use the power efficiency of SIMD execution. But the hardware is actually able to provide an individual instruction down to the level of an individual work item (now the description as "thread" starts to make sense). Combine this with a vec2+vec2+L/S LIW ISA (vec2 for single precision; each of the vec2 instruction slots can also do a DP FMA), register file caches, massive on-chip memory with a configurable hierarchy, and some specialized latency optimized cores (CPU cores, very probably ARM), and you get an impression of which direction nVidia is aiming in (nothing is set in stone yet, of course).
 
GPU are "somehow" in a pretending stance, even Nvidia went back from Fermi to Kepler wrt to compute capability, it was too costly.
I wouldn't be surprised if they cut more than they intended and Maxwell will put them back on the Fermi compute trend. In other words I don't think you can conclude they believe less in GPU compute from this single data point.
 
There is some description in this talk by Bill Dally (the interesting part starts around 30 minutes into the presentation). It was actually more detailed in another presentation available on the nV Research website, which is currently offline, or as nV says: "Updated January 24, 2013

Thanks for your patience while we continue to harden security around the NVIDIA Research site. We expect the site to be back up in the March 2013 timeframe." :rolleyes:
There they explained how the dynamic vector lengths actually work. In principle, they give up SIMD a bit. Several work-items (in OpenCL language; threads in nV's) can still share (and usually do) the same instruction stream to use the power efficiency of SIMD execution. But the hardware is actually able to provide an individual instruction down to the level of an individual work item (now the description as "thread" starts to make sense). Combine this with a vec2+vec2+L/S LIW ISA (vec2 for single precision; each of the vec2 instruction slots can also do a DP FMA), register file caches, massive on-chip memory with a configurable hierarchy, and some specialized latency optimized cores (CPU cores, very probably ARM), and you get an impression of which direction nVidia is aiming in (nothing is set in stone yet, of course).
It seems the presentations are back online, or at least some of them. Still, I haven't found the paper.

I watched the video you linked and read some short presentations on the same topic on Nvidia Research, and there are a few things that bother me.
First, regarding the link you gave: while I'm miles away from questioning the guy's knowledge and competence, I feel like the presentation had a bit too much marketing in it (speaking of "cores" instead of what are mostly ALUs, etc.).

But this aside, I'm iffy (with respect to what I read here and there) about some of the assumptions made:
Basically they expect programming languages and compilers to at last "make it work". It sounds a bit optimistic to me.
Then there is something else that bothers me: as the silicon budget scales, you get more and more cores. They seem to want lots and lots of cores that are as dumb as possible, with a few "conductors" doing their thing. At the same time they speak about locality, etc.
But isn't it going to be somewhat problematic for the "conductors" (latency optimized cores, super command processors, whatever you call them) to keep track of all those execution units remotely (vs locally)? More and more information will have to travel to those brains, back and forth between them and the ALUs (/CUDA cores).
This is how I picture it in my mind: it is like an army of lobotomized ants. Real ants are at least autonomous, and there are far more complex animals in this world. What they propose is brain-dead ants that would have to receive all their information from the "queen(s)" in order to "function".
I think there is an issue here; I can't see that working. Mother Nature never selected such a model, and I think for a good reason.

There are also other things that bother me in the few presentations I read: a lot of it is marketing, and they don't acknowledge that GPU scaling has slowed massively. I've been reading quite a few tweets from Andrew Lauritzen of late, and to quote him, GPUs still scale but it is most obvious only at crazy high resolutions.
Another thing he stated: "I would argue that the "big" require too much parallelism".

Overall I'm not convinced even Nvidia is going to stick to the plan they presented. It would not be the first time (see unified shader architectures).

To make it short, at my level of understanding, what I get is that they would want a many-core design, but they are doing GPUs now (or mostly, and they are developing cores for a well established ISA, for which the prospects are good, vs going into unknown territory); making an announcement about developing CPUs could be ill-perceived by investors (as there is a lot of competition), so they kind of try to reinvent the wheel.
I think that Intel's experiments, for example with the SCC, are more forward looking, as they no longer focus on how to "achieve" high throughput on a per-core basis (they may have a good idea there from their GPU and Larrabee developments) but on how to efficiently connect many "cores" (CPUs, or in Nvidia's case SMXs). Nvidia and AMD will run into that issue soon: how do you make those (many) SMXs or CUs talk to each other efficiently? Back to what I stated: if the cores are close to brain-dead, you need to send a lot of info to the "brains", whatever they are. That is why I ultimately think CPUs as autonomous entities are going to win; they minimize the amount of information they need to function, because decisions are made locally.
I see nothing that would prevent a new CPU architecture from implementing the kind of concepts they are speaking about, nor is there a rule saying that a CPU should not or could not have access to scratchpad memory.

So in my opinion, and for what it is worth, I think that they should go with CPU cores tweaked to meet their requirements, and they have a lot more room to toy around with than Intel or ARM, as the end product would operate in a GPU type of environment, freed from the restrictions set by an ISA and legacy support for hardware and software.

I wouldn't be surprised if they cut more than they intended and Maxwell will put them back on the Fermi compute trend. In other words I don't think you can conclude they believe less in GPU compute from this single data point.
Ok, W&S :)
 