OK, thanks, less than I thought it could (though I did not have a crazy figure in mind, say ~10%).

I've never worked on the floating point units, but I don't think IEEE compliance was a huge cost. A low single digit percentage increase would be my guess, but I stress that it's just a guess. A CPU designer once told us he used to think GPU floating point hardware was efficient because it got the "wrong answer." He was impressed that it's still efficient now that it gets the "right answer."
Thanks for pointing this out. It is actually "relatively" clear in my mind, but indeed the terminology is quite confusing for outsiders like me.

Also, make sure you don't confuse CPU and GPU terminology. What people mean when they say thread level parallelism is not what GPUs rely on. GPUs need data parallelism. For example, triangles from a mesh spawning thousands of pixel shaders is a form of data parallelism.
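As a rough illustration of that distinction, here is a minimal CUDA sketch (the kernel name, image size, and the placeholder shading math are all made up, not anything from the discussion): every pixel runs the same program on its own piece of data, which is the data parallelism a GPU is built for. Thread level parallelism in the CPU sense would instead be a handful of genuinely different tasks running concurrently, e.g. unrelated kernels in separate streams.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-pixel "shader": the same code runs for every pixel,
// only the data (the pixel coordinates) differs. That is data parallelism.
__global__ void shadePixels(float* framebuffer, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Placeholder shading math; a real pixel shader would sample textures, etc.
    framebuffer[y * width + x] = (float)(x ^ y) / (float)(width * height);
}

int main()
{
    const int width = 1920, height = 1080;
    float* framebuffer = nullptr;
    cudaMalloc(&framebuffer, width * height * sizeof(float));

    // One thread per pixel: ~2 million tiny, identical work items.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    shadePixels<<<grid, block>>>(framebuffer, width, height);
    cudaDeviceSynchronize();

    cudaFree(framebuffer);
    return 0;
}
```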
The point is that for compute, the basic tool everybody uses, and by a landslide, is the CPU.

They have to target workloads with less parallelism, as on the other ones they are less likely to beat GPUs.
It seems to me that the gains provided by the GPU still apply to only a few types of workloads.
What is bothersome is that researchers in realtime graphics now have solutions that don't rely on the brute force approach used by GPUs. I think it kind of shifts perspectives. Reading Andrew's comments, I don't feel (he may clarify, I might read too much into it, though sebbbi's point of view on the matter is another legitimate source of questioning) that GPUs will get where he would need them to be for his techniques to work in the near future.
At the same time, he and others acknowledge that nothing (a real paradigm shift in realtime rendering) may happen for another 5/6 years, and what comes next is unclear.
So I feel that "workloads with less parallelism" is where most workloads are, including those that would qualify as math heavy, and it could be where (realtime) graphics are headed too. For now, let's say there is a lock-in, as nobody will come up with a technique that performs horribly on today's (and tomorrow's) GPUs.
I will search for that presentation.

To mimic what? Variable vector lengths? That's basically an old concept from vector computers which just needs a modern implementation. You may want to have a look at the presentations about future GPU generations (Einstein) from nV.
I mean mimic CPUs in everything they do: autonomous, able to make a lot out of data locality through caches, dynamically exploiting instruction-level parallelism, able to successfully exploit "low to moderate" levels of data parallelism through their SIMD units, and also (while offering less throughput) able to deal with high levels of data parallelism. They have a lot of mechanisms to hide latency. They can "storm" through the serial parts of the code, etc. All that with one unit working on data in its cache (assuming that the data/code is there, but the same applies to GPUs; if not, you have to rely on latency hiding mechanisms). Even from a power perspective, splitting work between the CPU and GPU "cores" means that you may have to move data back and forth quite often (and you have headaches like load balancing).
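To make the "moving data back and forth" cost concrete, here is a minimal CUDA sketch (the array size, kernel name, and trivial computation are invented for illustration): when the offloaded work is small, the copy in and copy out, plus launch overhead, can easily cost more time and energy than the computation itself, which is part of why splitting a workload between CPU and GPU is not free.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Trivial kernel standing in for "the part we offloaded to the GPU".
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 16;                 // deliberately small workload
    std::vector<float> host(n, 1.0f);

    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));

    // Copy in, compute, copy out: for small n, the two transfers (and the
    // kernel launch overhead) dominate; the actual FLOPs are almost incidental.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return 0;
}
```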
As others have noted already, GPUs rely mostly on data level parallelism. Thread level parallelism would be running different kernels in parallel (they can do this too, and internally, wide problems get split into a lot of threads [warps/wavefronts]). And in throughput oriented latency hiding architectures, you don't rely on the caches so much to keep the latency down, but to provide more bandwidth than the external RAM. For that, a pretty low cache efficiency (in CPU terms) is usually enough. With larger SIMD arrays the caches have to grow of course to maintain the level of efficiency they have.
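A small sketch of what that latency hiding looks like in practice (again just an illustrative CUDA program, not from the discussion; names and sizes are made up): each warp issues loads and stalls, but because far more warps are resident than there are ALUs, the schedulers always have someone ready to run, so the caches mainly need to amplify bandwidth rather than cut per-access latency.

```cuda
#include <cuda_runtime.h>

// Streaming SAXPY-style kernel: memory bound, so each warp spends most of its
// time waiting on loads. The GPU hides that latency by switching between the
// many resident warps rather than by relying on cache hits.
__global__ void saxpy(const float* x, float* y, float a, int n)
{
    // Grid-stride loop: correct for any n regardless of the launch size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;
    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    // Heavily oversubscribe the ALUs: thousands of resident warps mean the
    // schedulers always have ready work while other warps wait on memory.
    saxpy<<<1024, 256>>>(x, y, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```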
Thanks for further clarifying the vocabulary.

Because that is expensive and costs a lot of transistors and power. If you can get away with less effort on that, you will be more power efficient for the kinds of tasks which tolerate it.
By the way, GPUs can use DLP, TLP, and ILP (the VLIW architectures relied quite heavily on the latter, for instance), just not as aggressively as CPUs.
They are turning into CPUs: a scalar unit, more cache. It will further affect their compute density.
CPUs are not that far behind: if the Durango CPU really has 2 FMA units per core, the throughput of 8 cores is around 200 GFLOPS. It is pretty tiny and relatively cool. 32 cores would be in the 800 GFLOPS range, and it still would not be that big or hot (by GPU standards). It would be interesting to see how that compares to something like a Fermi-type GPU for compute (not graphics though).
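For reference, the back-of-the-envelope arithmetic behind those figures, assuming 128-bit (4-wide single precision) FMA units and a ~1.6 GHz clock (the clock is an assumption, it is not stated above):

```latex
% Assumed: 2 FMA units per core, 128-bit (4 single-precision lanes) each,
% 2 flops per FMA, ~1.6 GHz clock.
\[
  8~\text{cores} \times 2~\tfrac{\text{FMA}}{\text{cycle}}
    \times 4~\tfrac{\text{lanes}}{\text{FMA}}
    \times 2~\tfrac{\text{flops}}{\text{lane}}
    \times 1.6~\text{GHz} \approx 205~\text{GFLOPS}
\]
\[
  32~\text{cores}: \quad 4 \times 205~\text{GFLOPS} \approx 820~\text{GFLOPS}
\]
```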
And Jaguar, even with 2 FMA units, is not that throughput oriented either. Look at Cell: its compute density was really high.
After reading the ongoing "fight" on this board between the "GPU guys" and the "CPU guys", having people like T. Sweeney state what he stated (even though some mocked him), and the enthusiasm that surrounded Larrabee overall, the "CPU guys" have won me over.
GPU are "somehow" in a pretending stance, even Nvidia went back from Fermi to Kepler wrt to compute capability, it was too costly. For the (few) tasks massively data parallel they are almost already has good as they can (what is left to win). They are already bigger than most CPU, hotter, etc. Thanks to lithography they will continue to provide more but mostly more of the same.
With the market dynamic being what it is (all actors would need to shift to another paradigm, do away with the graphics pipeline completely, and get rid of those thick API and driver models; it is unlikely, especially as MSFT's grip is loosening on that front, which means more actors), they can't really afford to go for anything other than something that is still close to a machine that handles "massively data parallel problems" and only those.
I would more easily put myself in the position of a business person: when I read the arguments that have been going on for years, I have to say the "GPU guys" have a tough sell, and I don't see a real game changer coming. Actually, it is the CPU that won me over. The GPU environment is a moving target; looking at what Nvidia has been doing and stating of late, I would clearly be thinking this is going nowhere / still nowhere near ready for prime time.
On the other hand, Intel shipped (at last) Xeon Phi and is likely to have follow-ups; many-core CPUs (x86 or ARM) and high bandwidth interfaces for CPUs are only a couple of years away, and it seems that the CPU guys have rediscovered the beauty of the "GPU" programming model (SPMD, or SIMT as Nvidia calls it?).
I think that for once the business people are right. It sounds a bit to me like somebody saying "this is where our project is heading, you can rely on us (/have faith), though we don't really want to pay the price for the project to materialize".
Outside of a few applications, it is too shady an area of computing, with too many questions in the air, for the big money to invest (well, financial computation works on GPUs, but they already "work" for those types of workloads). You have the HPC "market", but outside of big projects funded by government money, it looks more like an impressively dedicated bunch of people, willing to use anything and to go through any kind of pain for the sake of advancing mankind's knowledge. It is noble, but I don't think it is enough to support the development of any specific hardware.
Overall, at this point I'm not sure one could easily change my mind; I believe the timing Intel had for the shift (with Larrabee) was the right one. Silicon budgets were getting good enough. If (with an "if" one could put Paris in a bottle / a posteriori thinking is so much easier...) there had been a consensus on the matter (MSFT had roughly the weight at that time, I think, to push the whole industry in that direction), I think that a "Larrabee" done by the GPU guys would have removed the lock on realtime 3D researchers/developers, would have hurt the CPU business in more than a couple of applications, and would have outdone Intel's effort to extend its ISA-based grip on the industry (hence Larrabee).
Though speaking of business, would even AMD at the time have been OK with that (really competing with themselves with a non-x86 architecture)? It is not a given, to say the least.