NVIDIA Fermi: Architecture discussion

How can the compiler construct load clauses when it doesn't know what the program will need to load at runtime? I'm not talking about static texture lookups in a shader here.
I'm still not sure you really understand it. The compiler is not telling the GPU what to load; it's telling it when to load.

For example, it knows that it can't load values needed in ALU instructions 17 and 28 until ALU instructions 15 and 16 determine the addresses for those loads. Hence there is a tex clause consisting of two loads before ALU instruction 17. After instruction 16, the batch is put aside until the two loads arrive, and then it continues. Basically the compiler makes a big dependency graph and groups together loads when it can. The average group size effectively multiplies latency hiding.
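As a concrete illustration, here is a toy sketch of that grouping step (the instruction representation, the Load fields, and form_clauses are all invented for illustration - this is not any real ISA or vendor compiler):

```python
# Toy clause former over a hypothetical instruction format. Each load
# records which ALU instruction produces its address and which ALU
# instruction first consumes its result; independent loads whose
# addresses are ready early enough get batched into one clause.
from dataclasses import dataclass

@dataclass
class Load:
    addr_ready_at: int   # ALU instruction that computes the address
    first_use_at: int    # ALU instruction that first consumes the result

def form_clauses(loads):
    clauses = []                                  # (issue_after_alu, [loads])
    pending = sorted(loads, key=lambda l: l.addr_ready_at)
    while pending:
        first = pending[0]
        # batch every load whose address is ready before the earliest
        # pending result is actually needed
        group = [l for l in pending if l.addr_ready_at < first.first_use_at]
        clauses.append((max(l.addr_ready_at for l in group), group))
        pending = pending[len(group):]
    return clauses

# The example above: addresses ready after ALU 15 and 16, results needed
# at ALU 17 and 28 -> a single clause of two loads issued after ALU 16.
print(form_clauses([Load(15, 17), Load(16, 28)]))
```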

More complex hardware would issue the loads at instructions 15 and 16, put the batch aside until one of them gets back, then have the option of either waiting until the next load came back or executing up to instruction 27 and then waiting. This flexibility is important if you don't have enough threads to saturate either the ALU or TEX throughput with the previous method, but if you do have enough threads then it's overkill.

I made a little GPU simulation program for Jawed about a year ago to illustrate how all this affects latency hiding and thus efficiency for any given program, but it was based on the simple scheduler. Maybe I'll add a more complicated scheduler to see what kind of difference it makes.
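For a flavour of what such a toy model can look like, here is a minimal sketch of the same idea (to be clear, this is not that program; the latency and issue-rate numbers are made up, and the scheduler is the simple "run a batch until its clause stalls it" one):

```python
# One ALU, N batches. A batch executes its ALU block, issues its clause,
# and is put aside for the memory latency; meanwhile the others get the ALU.

LATENCY = 400       # assumed memory latency, in cycles
CYC_PER_ALU = 4     # assumed cycles per ALU instruction per batch

def alu_utilization(n_batches, alu_per_clause, iters=50):
    work = alu_per_clause * CYC_PER_ALU   # ALU cycles between stalls
    work_left = [work] * n_batches
    ready_at = [0] * n_batches            # cycle when a batch's loads return
    stalls = [0] * n_batches
    cycle = busy = 0
    while min(stalls) < iters:
        for b in range(n_batches):
            if cycle >= ready_at[b]:      # this batch has its data, run it
                work_left[b] -= 1
                busy += 1
                if work_left[b] == 0:     # clause issued: put the batch aside
                    ready_at[b] = cycle + LATENCY
                    work_left[b] = work
                    stalls[b] += 1
                break
        cycle += 1
    return busy / cycle

# Utilization climbs with batch count until the latency is fully hidden:
for n in (4, 8, 12, 16):
    print(n, "batches:", round(alu_utilization(n, alu_per_clause=8), 2))
```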
 
I don't really see any common desktop application needing that kind of power, because we don't even have much software that can use the 100 GFlops of today's top CPUs, let alone 3 TFlops of OpenCL power.

Since I come from an image processing background, I can think of a lot of (not yet) common desktop applications which could use a lot of computation power. For example, good image denoising is one of them. Right now, a nice profile-based image denoising algorithm takes at least a few seconds on the fastest CPU with all cores utilized. When you apply it to video, it takes even more time. Another application is real-time face tracking and face recognition. Fast image search is also greatly needed by many, but currently it's very hard to do any meaningful classification on personal pictures. Or even some crazier applications, such as real-time eye tracking for head-position-dependent 3D rendering, or gesture-based user interfaces. Currently we don't have these applications because current computers are just not fast enough.
 
Yep, but demarcation is just the first part of it, and that happens in the compiler. What does the scheduler gain by juggling lots of tiny clauses? And on the flip side, why can't a fine-grained compiler/scheduler replicate claused behavior by simply moving loads to the most convenient place in the instruction stream (which I assume is happening anyway)? If I understand Nvidia's architecture correctly, multiple concurrent memory fetches can be in progress from the same warp.
Maybe you misunderstood what I was saying. The more complicated method that I mentioned is indeed what NVidia does, AFAIK. They can replicate claused behaviour, but why would they? Their hardware is capable of fancier scheduling, which is better in low-thread-count situations.

Tiny clauses are not good. Not that they hurt - clauses can only help - but TEX-5xALU-TEX-4xALU is worse than 2xTEX-9xALU.
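To put rough numbers on that (everything here is assumed: say a 400-cycle memory latency and 4 cycles per ALU instruction per batch), the number of batches needed to cover one batch's stall scales with the ALU work between stalls:

```python
# Back-of-envelope with assumed numbers: batches needed is roughly
# latency / (ALU cycles between stalls) + 1.
latency, cyc = 400, 4
print(latency / (9 * cyc) + 1)  # 2xTEX-9xALU: one stall per 9 ALU -> ~12
print(latency / (4 * cyc) + 1)  # TEX-5xALU-TEX-4xALU, worst gap of 4 ALU -> 26
```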
 
If GPGPU starts to move towards more OO-style programming, clauses may get smaller, as OO programs tend to feature more indirection (polymorphic dispatch) and more "heap"-style accesses (external loads) vs. "stack"-style accesses. I've always thought that the ideal language for GPUs is pure-functional, not OO, since in functional programming, mutating external state is the exception rather than the rule, and the lack of aliasing and mutation permits equational reasoning in the compiler, which can convert many record updates into local allocations.
 
I mean, if you really want to be pessimistic, there is simply no need, currently, for people on the desktop to have 2 TFlops of computing power, be it Fermi or R8xx. Most of it goes underutilized, no one is truly taking advantage of all that power on the desktop, and frankly, outside of games, no one has demonstrated much of a need for that kind of power. On consoles, it might be a different story, but today, Intel, Nvidia, and AMD are building products that consumers don't really need.
Taking on from your last sentence, I don't think you can say that. AMD and NV (Intel are yet to) have been catering to a market that was/is quite viable, but it has reached its saturation point. In this regard, Nvidia are acting quite logically - they realize there's only so much room in the gaming enthusiast market. GPGPU computing is not only a fashionable development; GPU chip/IP companies actually depend on its wide adoption for their long-term survival. To return to your statement: consumers do not need GPGPU, but they want the benefits of the technology, just like they did not need motion-controlled gaming, but discovered they enjoyed it once a consumer product offered it.
 
I wonder what the scheduling rate is on ATI HW: clauses would seem to allow ATI significantly more time (vs. NVidia) to look over a set of wavefronts and determine which one should be scheduled next. That, in combination with no instruction-level dependency checking, should translate to a smaller, more power-efficient scheduler on the ATI side.
 
Taking on from your last sentence, I don't think you can say that. AMD and NV (Intel are yet to) have been catering to a market that was/is quite viable, but it has reached its saturation point. In this regard, Nvidia are acting quite logically - they realize there's only so much room in the gaming enthusiast market.

My point is more subtle than that. If you shipped consoles with TFlops of power, like an XBox720 or PS4, developers would find ways to utilize the power. The PC desktop market is so fragmented that games end up using HW very inefficiently by comparison. You could ship PCs with a 2 TFlops GPU and 8-16 CPU cores, but no game is going to be designed with content that targets this. Scaling resolution up is a poor substitute for utilizing this power.
 
You could ship PCs with a 2 TFlops GPU and 8-16 CPU cores, but no game is going to be designed with content that targets this. Scaling resolution up is a poor substitute for utilizing this power.
I think that's one of the reasons why Nvidia is pushing 3D Vision and PhysX as hard as they do.
 
You mean like a Vantage Extreme score of 10,000 at mediocre clock speeds? No.

I mean like any hardware improvements that are specifically meant to enhance graphics. One of the tech papers on Nvidia's site states that such improvements have been made.
 
my humble 2 cents.

Well, from what I could gather from this thread (this is most likely a given), it looks like NVidia made Fermi with exploiting the supercomputing business in mind, since they are so hell-bent on putting double-precision capabilities into it. That market carries a very nice price premium, from what I've seen, and comparatively their solution is cheaper than AMD's or Intel's. The reports/rumors about their ability to disable parts of the chip also point to their gaming line. I don't know; it seems to make sense to me that way.
 
And, as you were able to confirm many times in the past, all SPs are MIMD, right?
No, I wrote "MIMD-similar" units, and you will understand it when Nvidia launches the card.

BTW, according to your article, the clock speeds have been changed compared to the previous stepping (as 'reported' here on May 15th). Now the chip that was shown this week was also an A1 stepping, yet produced in week 35? Isn't that quite a bit later than May 15th?
I reported it on May 15th; the tape-out was of course before that (exactly when, I do not know).

What about the reported 2547 GigaFlops that were confirmed in the same earlier article? And the gigantic bandwidth due to the 512-bit bus, also reported in May? Back then you reported about 2.4B transistors. Did they add a whole lot of logic in between the first stepping and the next first stepping?
I am confused these days. Could it be that the 2.4B transistors and 512-bit bus were planned for the desktop card, and the reported 3.0B transistors and 384-bit bus are based on the Tesla card with its many double-precision units?
The 2547 GigaFlops figure is wrong. I speculate that the next-generation chip will still have a MADD and a MUL per core.
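(For reference, a quick sanity check on what a 2547 GFlops figure would have to assume; the 512 SP count used here is the rumored one, nothing confirmed:)

```python
# Assumed, not confirmed: 512 SPs; implied hot clock for the 2547 figure.
sps, gflops = 512, 2547
for flops_per_clock, label in ((3, "MADD+MUL"), (2, "MADD only")):
    print(label, round(gflops / (sps * flops_per_clock), 2), "GHz implied")
# MADD+MUL  -> ~1.66 GHz, in line with rumored shader clocks
# MADD only -> ~2.49 GHz, far above anything rumored
```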
 
Well, from what I could gather from this thread (this is most likely a given), it looks like NVidia made Fermi with exploiting the supercomputing business in mind, since they are so hell-bent on putting double-precision capabilities into it. That market carries a very nice price premium, from what I've seen, and comparatively their solution is cheaper than AMD's or Intel's. The reports/rumors about their ability to disable parts of the chip also point to their gaming line. I don't know; it seems to make sense to me that way.
With AMD not playing that game, they can't afford to cripple double precision for consumer cards, at least not performance-wise... they could disable exceptions or something, I guess.
 
No, I wrote "MIMD-similar" units, and you will understand it when Nvidia launches the card.
There's nothing "MIMD-similar" about Fermi at all. You're a willing part of the misinformation and speculation, and some of your sources about hardware are clearly wrong or playing you. Double check your facts (you know, the basic journalism bit of your endeavour) before publishing.
 
If GPGPU starts to move towards more OO-style programming, clauses may get smaller, as OO programs tend to feature more indirection (polymorphic dispatch) and more "heap"-style accesses (external loads) vs. "stack"-style accesses.
Tons of research has been done on how either of these can be optimized out. The only price is giving up the ability to load classes at runtime (as in the JVM/.NET/Ruby/Python etc.).
I've always thought that the ideal language for GPUs is pure-functional, not OO, since in functional programming, mutating external state is the exception rather than the rule, and the lack of aliasing and mutation permits equational reasoning in the compiler, which can convert many record updates into local allocations.
I have no intention of acting as a programming-paradigm-definition-Nazi, but the three mentioned language traits (purity, functional, and OO) can be (and are!) mixed in any combination. Choosing the OO paradigm doesn't bind you to side effects, just like choosing the functional paradigm doesn't bind you to a Hindley-Milner type system (despite the fact that in most existing cases the opposite is true).
 
No, I wrote "MIMD-similar" units, and you will understand it when Nvidia launches the card.
Yet the only thing about which Nvidia has been unusually open is the computing architecture. Read David Kanter's article. There's absolutely nothing MIMD about Fermi.
 
I don't think so. If you can run 16 kernels in parallel on Fermi, then how is it not MIMD-like? Perhaps it would be more appropriate to say that it is a transition from SPMD to MPMD.
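To make the terminology concrete, here is a purely conceptual sketch (not how the hardware actually dispatches work; run_sm/run_chip and the toy kernels are invented): different kernels can be resident on different SMs at the same time, while each SM still applies one instruction stream to a whole warp in lockstep.

```python
# Conceptual only: "multiple programs" across SMs, SIMD/SIMT within an SM.

def run_sm(kernel, warp):
    # one SM: the same kernel applied to every lane of the warp in lockstep
    return [kernel(lane) for lane in warp]

def run_chip(assignments):
    # the chip: each SM may be assigned a *different* kernel concurrently
    return {sm: run_sm(kernel, warp)
            for sm, (kernel, warp) in assignments.items()}

blur = lambda x: 0.5 * x    # stand-in kernel A
step = lambda x: x + 9.8    # stand-in kernel B

print(run_chip({0: (blur, [1.0, 2.0, 3.0, 4.0]),
                1: (step, [1.0, 2.0, 3.0, 4.0])}))
```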
 
That market carries a very nice price premium, from what I've seen, and comparatively their solution is cheaper than AMD's or Intel's. The reports/rumors about their ability to disable parts of the chip also point to their gaming line. I don't know; it seems to make sense to me that way.

You mean cheaper than using AMD/Intel CPUs, right?
 