NVIDIA Fermi: Architecture discussion

I don't think so. Clauses are a great way to minimize batch switching and maximize latency hiding. I'm sure hyperthreading works in a similar way, where as many loads as possible for one thread are issued while another thread is occupying the execution units.

For a lot of applications clauses don't matter because there could be enough threads to saturate tex throughput even with dependent ALU-TEX-ALU-TEX sequences, but some register-heavy programs may not have many threads, and for those clauses will help immensely.
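As a rough back-of-envelope (every number here is an invented placeholder, not a figure for any real chip):

Code:
# Rough back-of-envelope for the "enough threads" point above. All numbers
# are invented placeholders, not figures for any real chip.

tex_latency_cycles = 200        # assumed round-trip latency of one fetch
alu_cycles_per_batch = 8        # assumed ALU work between dependent fetches

# While one batch waits on its fetch, other batches must supply ALU work,
# so you need roughly latency / ALU-work batches in flight.
batches_needed = -(-tex_latency_cycles // alu_cycles_per_batch)   # ceiling division
print(batches_needed)           # ~25 batches hide the latency completely

# A register-heavy kernel limits how many batches actually fit:
register_file = 16384           # assumed registers per SIMD
regs_per_thread = 64            # a register-heavy kernel
threads_per_batch = 64
print(register_file // (regs_per_thread * threads_per_batch))     # only ~4 batches fit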

But that only works if you can predict latencies at compile time. How does that work when you have no idea what data you will be loading (and from where) at runtime?
 
http://insidehpc.com/2009/10/06/interview-with-hpc-pioneer-justin-rattner/

GPUs, exascale, and the HPC of tomorrow
Although Intel stopped producing commercial supercomputers in 1996, it hasn’t stopped shaping the industry. The architectures pioneered by the company dominate today’s Top500 lists, and chips made by Intel were in 79.8% of all systems in the most recent Top500 list (Jun 2009). One of the challengers to the hegemony of Intel’s processors in HPC today is the GP-GPU. Rattner doesn’t see the GPU as a long-term solution, however. “For straight vector graphics GPUs work great, but next generation visual algorithms won’t compute efficiently” on the SIMD architecture of the GPU, he says, drawing a parallel between the era of vector computers and the Touchstone system and today’s processor/GPU debate. As he sees it the GPU restricts the choice of algorithm to one which matches the architecture, not necessarily the one that provides the best solution to the problem at hand. “The goal of our next generation Larrabee is to take a MIMD approach to visual computing,” he says (That’s a Larrabee die pictured at right being held by Intel’s Pat Gelsinger). Part of Intel’s motivation for this decision is that the platform scales from mobile devices all the way up to supercomputers. And they have early performance results that will be presented at an IEEE conference later this year that show that the Larrabee outperforms both the Nehalem and NVIDIA’s GT280 on volumetric rendering problems.
 
At this point I'm modestly interested in how one can say Larrabee is somehow free of SIMD baggage without some intellectual dishonesty.
Maybe I'm getting hung up on the 512-bit SIMD unit per core.

With Fermi allowing 16 kernels to operate on-chip at the same time, how non-MIMD would it be compared to Larrabee?
 
If you have a killer application of GPGPU, and get your code functional on NVidia hardware, wouldn't you choose AMD for your final product if you can tweak the same code to run more cost-effectively on it?

Maybe if you are running your own cluster and this is an HPC scenario. As I said before, this is super-niche. GPGPU scenarios are not going to make NV much money. And even if you allow that this is the target, the enterprise computing market (say, big pharma) doesn't make big purchases on cost/performance alone, otherwise no one would have bought Sun E10000 servers. You're thinking like an engineer and not a middle manager.
 
But that only works if you can predict latencies at compile time. How does that work when you have no idea what data you will be loading (and from where) at runtime?
Making the hardware deal efficiently with lots of memory accesses distributed randomly throughout the program is expensive; clauses don't make it any more expensive (it's just a couple of bits in the end). It's just a hint which helps make life easier on the hardware for when life can be easy on the hardware ...
 
Maybe if you are running your own cluster and this is an HPC scenario. As I said before, this is super-niche. GPGPU scenarios are not going to make NV much money.
Then why is NVidia going down this road? AMD only has to worry about GPGPU performance and development tools if it does indeed become a big market that approaches the size of the GPU market. Unless I'm mistaken, that assumption is a prerequisite to this whole line of discussion.
 
But that only works if you can predict latencies at compile time. How does that work when you have no idea what data you will be loading (and from where) at runtime?
Why do you need to predict latencies? Clauses are just a way to have a simple scheduling system that increases latency hiding ability compared to fixed round-robin scheduling (like R100-R400, NV20, and probably NV30 & G70). It works like this:
1. Run as many ALU instructions as you can without more data
2. Issue loads for as much data as you can without needing more ALU instructions, and wait for the data to arrive. Note that some of this may only be needed later in the program stream.
3. Repeat
Batches simply alternate between 1 and 2. Hopefully you have enough batches to make sure that you are maximizing either ALU or fetch throughput. If so, then you're running as fast as possible and hiding all latency.
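A toy sketch of that alternation in Python (purely illustrative; it bears no relation to the actual sequencer):

Code:
# Toy model of the 1/2 alternation above: each batch is a stream of clauses,
# and the scheduler round-robins over batches one clause at a time.
# Everything here is invented for illustration.

def schedule(batches):
    pending = list(enumerate(batches))
    while pending:
        still_running = []
        for batch_id, clause_stream in pending:
            clause = next(clause_stream, None)
            if clause is None:
                continue                       # batch finished
            kind, count = clause
            if kind == "ALU":
                print(f"batch {batch_id}: run {count} ALU instructions back to back")
            else:  # "TEX"
                print(f"batch {batch_id}: issue {count} fetches, sleep until the data arrives")
            still_running.append((batch_id, clause_stream))
        pending = still_running

# Two batches with the dependent ALU-TEX-ALU-TEX pattern discussed above:
program = [("ALU", 8), ("TEX", 2), ("ALU", 8), ("TEX", 2), ("ALU", 8)]
schedule([iter(list(program)) for _ in range(2)])

With enough batches in that list, the ALU clauses of some batches overlap the fetch waits of the others, which is all the latency hiding you need.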

Abandoning clauses for a more sophisticated scheduling system (e.g. where you don't wait for loads to complete until you actually need them) will help for some extreme cases, but increasing the register file often does the same thing by allowing more threads, so whether it's worth it to do the former, the latter, or neither depends on the workload.
 
I agree with Mintmaster. Clauses make *a lot* of sense in a data parallel world when you don't need some super clever scheduling mechanism to hide all sorts of latencies and I don't see this drastically changing any time soon.
 
2. Issue loads for as much data as you can without needing more ALU instructions, and wait for the data to arrive. Note that some of this may only be needed later in the program stream.

Yes, I understand the benefits of clauses, but this snippet above is what I don't get. Isn't the predictability of loads going to fall dramatically as workloads become less neat and uniform than they are with graphics today? How can the compiler construct load clauses when it doesn't know what the program will need to load at runtime? I'm not talking about static texture lookups in a shader here.
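To make it concrete, here's the kind of thing I mean (a hypothetical kernel sketched in Python, not any real workload):

Code:
# Hypothetical irregular workload: which path runs, and which records get
# fetched, depends on each element's attributes, so the compiler can't know
# at build time which loads to group into a clause. remote.fetch() is a
# stand-in for any data-dependent access.

def process(elements, remote):
    results = []
    for e in elements:
        if e["kind"] == "arithmetic":
            results.append(e["value"] * e["scale"] + e["bias"])   # pure ALU path
        else:
            record = remote.fetch(e["key"])                       # load decided at runtime
            results.append(record["value"] - e["value"])
    return results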

Also, what would a scheduler look like if it has to support both clauses and instruction level dependency checks?
 
I think the point here is that clauses don't hurt. They just help for the workloads where it is statically predictable. They obviously can't help when workloads are irregular.
 
Then why is NVidia going down this road? AMD only has to worry about GPGPU performance and development tools if it does indeed become a big market that approaches the size of the GPU market. Unless I'm mistaken, that assumption is a prerequisite to this whole line of discussion.

I was assuming that the purpose of Fermi is still the consumer's desktop, not the cloud. It doesn't make a whole lot of sense for NVidia to target the cloud with GPGPUs, any more than it makes sense with LRB. There's a market for that kind of stuff, but it is dwarfed by the consumer market.

If you add up the largest clusters in the world today: Microsoft, Google, Yahoo, Amazon, it might total 3-5 million machines. Everyone else is collectively dwarfed by them. NVidia would have to sell GPUs at $500 per chip to replace their current revenues. At that price, they are uncompetitive with commodity x86 clusters.

This is an ancillary revenue stream; it cannot be the primary one. A special use case might exist for renderfarms, but look, Pixar rendered Cars on a 3,000-CPU farm IIRC. If NVidia's proposition is "buy Fermi and you only need 300 machines instead of 3,000", that would constitute only $3 million in revenue if GPUs were $10k per piece.

I would put the size of the 'pure GPGPU' (no graphics) market roughly on par with the "workstation" segment, where people in bioinformatics, finance, et al. might like to run some kernels. In the special effects industry, real-time scene preview would be very valuable.

No, the real win, if there is one, is to find killer applications for GPGPU on the desktop, ones where AMD is at a disadvantage. Silicon Graphics showed the folly of trying to address the server market only.
 
Exactly. Seems like the scheduler would just be issuing clauses rather than single instructions and an instruction depending on the results of a load would terminate a clause. I'm not sure how variable latency of loads would be a problem unless ATI is doing round-robin scheduling over clauses. I guess statically scheduling a clause could be a problem if arithmetic instructions have data-dependent latency; e.g. I believe divides on Intel CPUs are faster when divisors are small - is there anything like that in GPU land?
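Conceptually I picture it like this (invented instruction tuples in Python, and it glosses over the separate ALU vs. fetch clause types):

Code:
# Conceptual clause splitting: cut a new clause at the first instruction that
# consumes the result of an outstanding load. Register names and the
# instruction format are invented for illustration.

def split_into_clauses(instructions):
    clauses, current, pending_loads = [], [], set()
    for op, dst, srcs in instructions:
        if any(s in pending_loads for s in srcs):
            clauses.append(current)            # dependent use terminates the clause
            current, pending_loads = [], set()
        current.append((op, dst, srcs))
        if op == "LOAD":
            pending_loads.add(dst)             # result not needed yet, keep going
    if current:
        clauses.append(current)
    return clauses

prog = [("MUL",  "r0", ["r1", "r2"]),
        ("LOAD", "r3", ["r0"]),
        ("ADD",  "r4", ["r1", "r2"]),          # independent of the load, same clause
        ("ADD",  "r5", ["r3", "r4"])]          # uses r3 -> a new clause starts here
print(split_into_clauses(prog))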
 
Exactly. Seems like the scheduler would just be issuing clauses rather than single instructions and an instruction depending on the results of a load would terminate a clause. I'm not sure how variable latency of loads would be a problem unless ATI is doing round-robin scheduling over clauses. I guess statically scheduling a clause could be a problem if arithmetic instructions have data-dependent latency; e.g. I believe divides on Intel CPUs are faster when divisors are small - is there anything like that in GPU land?

I'm drawing parallels to one of my systems where we invoke various processes that manipulate the data based on attributes of each data element. Those processes can be all arithmetic or hit a remote datasource. So you don't know in advance what instructions will be executed or which loads will be needed, so how can you define clauses around things you don't know? Clauses are fine for current workloads where nearly everything is static, but you guys seem to be saying you can use clauses for a more unpredictable environment too. I just don't see how that works, since clause demarcation happens in the compiler.
 
I was assuming that the purpose of Fermi is still on the consumer's desktop, not in the cloud. It doesn't make a whole lot of sense for NVidia to target the cloud with GPGPUs, anymore than it makes sense with LRB. There's a market for that kind of stuff, but it is dwarfed by the consumer market.
Okay, so you aren't talking about niche applications then.

If you're talking about making software that runs on an already installed hardware base, then how does it help NVidia to sacrifice their GPU competitiveness? A software dev will definitely want to double their market by making it work, and work well, on AMD hardware. If these applications become prevalent enough to affect purchasing decisions, then ATI will adjust their hardware when necessary.

When I was talking about a product, I wasn't talking about replacing existing farms from Microsoft, Google, etc. I was talking about a killer HPC product with mass market appeal. Imagine automated driving, for example. I thought this was the type of GPGPU market explosion that you were talking about. I don't really see any common desktop application needing that kind of power, because we don't even have much to use 100 GFlops on top CPUs today, let alone 3 TFlops of OpenCL power.
 
There must be some overhead associated with clause processing such that the cost of, for example, 5 one-instruction clauses is higher than that of 5 ungrouped instructions.
If there is such a cost, then there is a threshold where clauses don't make sense.

I think there is some number of cycles that have to be hidden by other work, but I don't think the number is disclosed.
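To put a number on it (the per-clause cost here is completely hypothetical, since as noted the real figure isn't disclosed):

Code:
# Hypothetical illustration of the threshold: if every clause carries a fixed
# switch cost, short clauses spend proportionally more time on it. The cost
# and instruction counts below are made up.

clause_cost = 40                              # assumed cycles per clause switch
instrs = 100
for clause_len in (1, 5, 20, 100):
    clauses = instrs // clause_len
    print(f"clause length {clause_len:3}: {clauses * clause_cost} overhead cycles per {instrs} instructions")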
 
It's just groups of instructions ... all heavier branching does is make clauses shorter.

Yep, but demarcation is just the first part of it and that happens in the compiler. What does the scheduler gain by juggling lots of tiny clauses? And on the flip side, why can't a fine-grained compiler/scheduler replicate claused behavior by simply moving loads to the most convenient place in the instruction stream (which I assume is happening anyway)? If I understand Nvidia's architecture correctly, multiple concurrent memory fetches can be in progress from the same warp.
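Something like this toy load hoist is what I have in mind (invented instruction tuples again, and it ignores memory ordering and register reuse for simplicity):

Code:
# Toy "move the loads earlier" pass: hoist each LOAD up to just after the
# instruction that defines its last source operand, so more independent work
# sits between the load and its first use. Purely illustrative.

def hoist_loads(instructions):
    result = []
    def_pos = {}                                   # register -> index of its definition
    for op, dst, srcs in instructions:
        if op == "LOAD":
            pos = max((def_pos.get(s, -1) for s in srcs), default=-1) + 1
            result.insert(pos, (op, dst, srcs))
            # shift recorded positions at or after the insertion point
            def_pos = {r: (i + 1 if i >= pos else i) for r, i in def_pos.items()}
            def_pos[dst] = pos
        else:
            result.append((op, dst, srcs))
            def_pos[dst] = len(result) - 1
    return result

prog = [("MUL",  "r0", ["r1", "r2"]),
        ("ADD",  "r4", ["r1", "r2"]),
        ("LOAD", "r3", ["r0"]),                    # gets hoisted right after the MUL
        ("ADD",  "r5", ["r3", "r4"])]
print(hoist_loads(prog))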
 
If these applications become prevalent enough to affect purchasing decisions, then ATI will adjust their hardware when necessary.

But this assumes that adjusting their hardware preserves all of their current benefits in terms of cost. In other words, you assume that NVidia can't really make their implementation more space efficient, but ATI can add all of the benefits of NVidia's architecture for the class of application we're theorizing about, yet not suffer from it.

When I was talking about a product, I wasn't talking about replacing existing farms from Microsoft, Google, etc. I was talking about a killer HPC product with mass market appeal. Imagine automated driving, for example. I thought this was the type of GPGPU market explosion that you were talking about. I don't really see any common desktop application needing that kind of power, because we don't even have much to use 100 GFlops on top CPUs today, let alone 3 TFlops of OpenCL power.

HPC for me means scientific applications in computing clusters. I've never heard the term HPC applied to embedded computing. Typically HPC clusters work on problems like n-body simulation, computational fluid dynamics, weather simulation, molecular biology, etc.

Sure, if NVidia could get every Toyota to use a Fermi for visual processing, that would be a large contract, but I was assuming that there is still a large market for media applications, be it games or photo processing. If Nvidia could demonstrate sheer superiority in physics with a killer app, or say, vastly accelerated OpenCL-based face recognition for photo libraries, it would be a good use case.

I mean, if you really want to be pessimistic, there is simply no need, currently, for people on the desktop to have 2 TFlops of computing power, be it Fermi or R8xx. Most of it goes underutilized; no one is truly taking advantage of all that power on the desktop, and frankly, outside of games, no one has demonstrated much of a need for that kind of power. On consoles, it might be a different story, but today, Intel, Nvidia, and AMD are building products that consumers don't really need.
 