G80 Architecture from CUDA

So now that CUDA's out, I expect a whole bunch of interesting stuff about G80 to be tested. Here are some interesting bits from a first scan of the docs.

Memory sizes:
The amount of shared memory available per multiprocessor is 16 KB divided into 16 banks (see Section 6.1.2.4);
The amount of constant memory available is 64 KB with a cache working set of 8 KB per multiprocessor;
The cache working set for 1D textures is 8 KB per multiprocessor;
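
Just to make those numbers concrete, here's a rough sketch of how they show up in a kernel. Purely illustrative: the array sizes are made-up values chosen to sit comfortably inside the 64 KB constant / 16 KB shared budgets, not anything from the docs.

Code:
// Illustrative sizes only, chosen to fit the documented limits.
__constant__ float coeffs[8192];          // 8192 * 4 B = 32 KB, well within 64 KB of constant memory

__global__ void fill_from_constants(float *out)
{
    // 2048 * 4 B = 8 KB, i.e. half the 16 KB of shared memory per multiprocessor,
    // so at most two blocks of this kernel could be resident per multiprocessor.
    __shared__ float tile[2048];

    int tid = threadIdx.x;                // assumes blockDim.x is no bigger than the tile
    tile[tid] = coeffs[tid];
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = tile[tid];
}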

Confirms that it's 8 double-pumped ALUs, not 16 ALUs.
Each multiprocessor is composed of eight processors running at twice the clock frequencies mentioned above, so that a multiprocessor is able to process the 32 threads of a warp in two clock cycles.

Shared memory banking:
In the case of the shared memory space, the banks are organized such that successive 32-bit words are assigned to successive banks and each bank has a bandwidth of 32 bits per clock cycle.
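
To make that concrete, a toy sketch (just an illustration, assuming the 16 banks described above and a 16-thread block): consecutive threads reading consecutive 32-bit words hit 16 different banks, while a stride of 16 words puts every thread in the same bank and serializes the access.

Code:
__global__ void bank_example(float *out)
{
    __shared__ float data[256];
    int tid = threadIdx.x;                // imagine tid = 0..15, one half-warp

    for (int i = tid; i < 256; i += 16)   // fill the array so reads below are defined
        data[i] = (float)i;
    __syncthreads();

    // Stride 1: word i lives in bank (i % 16), so 16 threads hit 16 different banks.
    float no_conflict = data[tid];

    // Stride 16: (tid * 16) % 16 == 0 for every thread, so all 16 accesses
    // land in bank 0 and have to be serialized.
    float conflict = data[tid * 16];

    out[tid] = no_conflict + conflict;
}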

Number of threads in-flight:
The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 concurrent threads per multiprocessor to hide them.
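
In other words, you want to size the launch so each multiprocessor always has a few warps ready to swap in. A minimal sketch (saxpy is just a placeholder kernel of my own; 192 is the number from the quote):

Code:
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];           // dependent ops here get their latency
}                                         // hidden by switching to other warps

void launch_saxpy(int n, float a, const float *x, float *y)
{
    int threadsPerBlock = 192;            // at least 192 threads per multiprocessor
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, a, x, y);
}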

I'm sure there are lots more interesting tidbits once I read the whole thing.

Edit: and some more from the RelNotes:
Q: Does CUDA support Double Precision Floating Point arithmetic?
A: CUDA supports the C "double" data type. However on G80 (e.g. GeForce 8800) GPUs, these types will get demoted to 32-bit floats. NVIDIA GPUs supporting double precision in hardware will become available in late 2007.

Talk of a Quadro (not too surprising):
The release and debug configurations require a GeForce 8800 Series GPU (or equivalent G8X-based Quadro GPU) to run properly.
 
192 threads to hide a RAW hazard? How many cycles is that? 192 at least? Sounds like a huge latency to me (coming from CPU world).
 
192 threads to hide a RAW hazard? How many cycles is that? 192 at least? Sounds like a huge latency to me (coming from CPU world).

Well, since apparently it can execute an instruction over a 32-thread "warp" (new term for batch?) in two slow clock cycles, 192 threads is six warps, which works out to a 12-cycle latency. It is higher than CPUs, but not too far out of the range. And CPUs put a huge amount of effort (== transistors) into keeping their latency down; G80 gets to spend those on math horsepower instead.

Remember, CPUs are designed to run a single thread very fast (low latency). GPUs are designed to run many threads very fast (high throughput).
 
Not sure I'm following you here about the "warps"... but 12 sounds a lot more reasonable, in line with expectations assuming a 10-stage ALU as described in their patent. I have not read anything though, I'm just commenting off the summary in the first post. Isn't a thread = a batch?
 
CUDA said:
NVIDIA GPUs supporting double precision in hardware will become available in late 2007.
Wow. :oops: Does this mean every ALU and data path is extended to 64-bit, or would they somehow trade precision for performance? I'm also curious what precision the transcendental functions would/could be...
 
Wow. :oops: Does this mean every ALU and data path is extended to 64-bit, or would they somehow trade precision for performance? I'm also curious what precision the transcendental functions would/could be...
They will almost certainly be half speed, because I don't think any sane circuit designer would waste all those transistors idling away when not being used for GPGPU apps. What I'm wondering is if they will be fully IEEE 754 compliant (yes, denormals too!)?
 
Not sure I'm following you here about the "warps"... but 12 sounds a lot more reasonable, in line with expectations assuming a 10-stage ALU as described in their patent. I have not read anything though, I'm just commenting off the summary in the first post. Isn't a thread = a batch?

The CUDA docs use thread to mean a single member of a batch. So warp == batch == SIMD group, and in graphics terms a thread would correspond to a single vertex or pixel. I think that's consistent with how Nvidia has used 'thread' in the past, but ATI has traditionally used 'thread' to mean 'batch'. Very confusing.

I hadn't seen the term warp before. It makes a funny kind of sense, taking the 'thread' analogy a step further (think of the weaving definition of 'warp').
 
They will almost certainly be half speed, because I don't think any sane circuit designer would waste all those transistors idling away when not being used for GPGPU apps. What I'm wondering is if they will be fully IEEE 754 compliant (yes, denormals too!)?

Learnings for make benefit education of non-math person: how are denormals used in real life? Is there a way to deal with them, say, in normal C programs, and how does that work in practice?
 
It's actually slightly more complicated than that for warps. Each warp corresponds to 32 threads, while the ALUs are 8-wide (or, if you prefer to see it this way, each multiprocessor is composed of 8 processors). Furthermore, the ALUs run at twice the clock rate of the scheduler. So, 32 threads is a warp, but that takes two clock cycles to execute. The result of that is every ALU operation on a warp hides at least 2 clock cycles of memory latency. The documentation further lists that the memory latency can go as high as 200-300 cycles. As far as I can see, for truly ideal latency hiding, you'll want 512 threads running on a single multiprocessor, which amounts to 32 clock cycles. How it hides the rest of the latency is likely related to ILP...

And CUDA article coming in a couple of hours - don't expect the longest greatest thing ever given the amazing forward warning we had that it was coming out today :| Anyhow, better get back to writing!
 
Learnings for make benefit education of non-math person: how are denormals used in real life? Is there a way to deal with them, say, in normal C programs, and how does that work in practice?
Denormals allow for very small numbers to be represented, which would normally not be representable using the normalized FP format. Say you subtract 2 FP numbers which are very close to each other; the result is a very small number, sometimes too small to be represented in the normalized format. Trying to normalize it would generate an underflow (it will be rounded to 0, IIRC). So denormals allow the number to be represented as 0.XXXX × 2^(smallest exponent) instead of the normalized FP format of 1.XXXX × 2^(exponent).

Sorry if my attempt to explain it is confusing :oops:

In hardware where speed is more important than absolute accuracy, like GPUs, denormal numbers are simply rounded to 0, since 0.xxxx × 2^(-126) (for FP32) is a really small value and such a small difference will not be obvious in a pixel. Supporting denormals may slow the common case down a lot due to the added complexity.
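
A tiny host-side example of the "two nearly-equal numbers" case (plain C, just to show the effect; both inputs are normal floats, their difference is not):

Code:
#include <stdio.h>
#include <float.h>

int main(void)
{
    float a = 1.5e-38f;     /* normal: FLT_MIN is about 1.1755e-38        */
    float b = 1.4e-38f;     /* normal as well                             */
    float d = a - b;        /* roughly 1e-39: smaller than FLT_MIN        */

    /* With IEEE 754 gradual underflow d survives as a denormal;
       hardware that flushes denormals to zero would give 0 here instead. */
    printf("d = %g, FLT_MIN = %g\n", d, FLT_MIN);
    return 0;
}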
 
A warp is the set of threads currently executing on a multiprocessor, or..?

I'm still trying to wrap my head around that. I 'get' thread grids and thread blocks... warps just strike me as a still smaller set of threads, but I'm not totally sure what distinguishes them (unless they are the set currently executing or ready to execute).
 
The docs say a warp is 32 threads on G80. 8 would make more sense to me, though.

edit - the beyond3d article is up now:

http://www.beyond3d.com/articles/cuda-intro/index.php?p=01

It explains warps as:

Furthermore, what is not listed on that figure is that threads within a block are further grouped in 'warps'.

While this might seem quite messy on first glance, it really is not. Warps correspond to the group of threads that are scheduled together, which implies that branching ideally should be fully coherent within a single warp for maximal performance. Remember the branch coherence of the G80 is 32 threads.
 
A warp is equal to the width of a processor. i.e. 8 in G80.
Read page 57, section 5.1:
CUDA said:
The warp size is 32 threads;

The docs say:
CUDA said:
Each block is split into SIMD groups of threads called warps; each of these warps contains the same number of threads, called the warp size, and is executed by the multiprocessor in a SIMD fashion;
Basically the core acts as a SIMD (single instruction, multiple data) processor of width 32, but split across 4 cycles of 8 each. This is similar to how Intel and AMD CPUs used to treat SSE instructions: SSE deals with 128-bit data types, but the SSE units were only 64 bits wide, so it took 2 cycles.

What this means is that the core only has to keep 1 set of state (program counter, etc.) for this warp of 32 threads and just does the same thing across all of them. This saves you from having to track as much state, with the obvious downside that if all 32 threads don't branch the same way, you have a significant amount of overhead.
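
A quick sketch of that divergence point (illustrative only): with one program counter per warp, a condition that varies inside a warp forces both paths to run, while a condition that is uniform across each 32-thread warp costs nothing extra.

Code:
__global__ void divergence_example(float *out)
{
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Divergent: even and odd threads share a warp, so the warp has to run
    // the if-side and then the else-side, masking threads off each time.
    if (tid % 2 == 0)
        out[gid] = 1.0f;
    else
        out[gid] = 2.0f;

    // Coherent: tid / 32 is the warp index within the block, so all 32
    // threads of a warp agree on the condition and only one path runs.
    if ((tid / 32) % 2 == 0)
        out[gid] += 10.0f;
    else
        out[gid] += 20.0f;
}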
 
It seems I was one level too far down the threading hierarchy :oops: and lunch interrupted my reading at page 35 :oops:

Jawed
 
And Beyond3D's article on CUDA is live, too... Might help a bit on this, although it's primarily based on the documentation and isn't really focused on this point.
Hopefully it is a bit clearer than the docs wrt warps/grids/etc. though! ;)
 
It seems to me that __syncthreads() shouldn't cause a pipeline flush. This is because the scheduler can see the "__syncthreads() instruction" coming in advance, so it knows not to issue any of the succeeding instructions: instead it should schedule another warp immediately after __syncthreads().
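
For reference, a minimal sketch (assuming a 256-thread block, purely illustrative) of the usual pattern __syncthreads() guards: a shared-memory write phase followed by a read phase. While one warp sits at the barrier, the scheduler is free to keep issuing warps that haven't reached it yet.

Code:
__global__ void reverse_within_block(float *data)
{
    __shared__ float buf[256];            // assumes blockDim.x == 256
    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    buf[tid] = data[base + tid];          // write phase
    __syncthreads();                      // every thread's write is now visible

    // Read phase: each thread consumes a slot written by a different thread,
    // which is only safe after the barrier above.
    data[base + tid] = buf[blockDim.x - 1 - tid];
}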

Jawed
 
Good article Arun, certainly solidified my understanding.

I'm not totally sure how you use syncthreads to enforce mutual exclusion, though, for example. Say I want each thread to examine a piece of data and write to a location in a shared buffer depending on that data... do you just place syncthreads before and after the write, or..?
 