G80 Architecture from CUDA

I'm not totally sure how you use __syncthreads to enforce mutual exclusion, though, for example. Say I want each thread to examine a piece of data and write to a location in a shared buffer depending on that data... do you just place __syncthreads before and after the write, or...?
Well, a practical example might make this easier to understand.
Code:
// Copy input to shared mem.
shared[threadIdx.x] = values[threadIdx.x];

// We just wrote to shared memory and other threads might access it -> sync.
__syncthreads();
float myVar = shared[threadIdx.x + 24];

// No __syncthreads needed here, because we are only reading; not writing.
myVar *= 3.14f;
shared[threadIdx.x] = myVar;

// We just wrote to shared memory and another thread might access it -> sync.
__syncthreads();

[...]
 
They will almost certainly be half speed, because I don't think any sane circuit designer would leave all those transistors idling away when they're not being used for GPGPU apps.
Quarter speed, area for multipliers increases quadratically with width.

BTW, is there really no caching at all for scattered writes with the G80? Not even an output cache for write-combining?
 
Quarter speed, area for multipliers increases quadratically with width.
You are correct, sir. I was a little excited and thinking it would be done the way SSE units do it. But for a GPU it would be a waste. If they make it fully IEEE-754, it will probably be even less than quarter speed, like Cell.
 
BTW, is there really no caching at all for scattered writes with the G80?
No caching at all for writes, and reads are only cached when using pseudo-textures/constants through their respective dedicated caches.
Not even an output cache for write-combining?
Nope - the idea, as far as I can see, is that consecutive accesses will be optimal anyway, because the burst length is <= 4*16 (bytes * threads...).
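To make that concrete, here's a minimal sketch (my own illustration, not anything from the docs) of the access pattern I mean:
Code:
// Each of the 16 threads in a half-warp writes one consecutive 4-byte float,
// so the half-warp's writes fall into a single burst (4 bytes * 16 threads).
__global__ void coalescedCopy(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];   // thread k writes to base + 4*k
}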
 
Arun, if you want to write to consecutive addresses, why would you be using scatter? ;) There are applications which won't be able to coalesce writes from threads into a single write access but still have spatial coherence which a write-combining cache could make use of.

Silent_guy, I'm assuming they will use 26-bit multipliers and simply resort to a form of matrix multiplier for the doubles (which would go quarter speed). I always assumed that's how area-efficient single/double-precision FP multipliers worked, but I have to admit I have never bothered to actually look it up.
 
There are applications which won't be able to coalesce writes from threads into a single write access but still have spatial coherence which a write-combining cache could make use of.
Yeah, I agree. Honestly, I'm not sure whether G80+CUDA does that. The documentation clearly states that reads and writes are uncached, but this is an implementation detail, and many software programmers might not even think of this as "caching" if you mentioned it to them.
 
The delay of a Wallace-tree-based multiplier is only O(log n), so with respect to speed it's really not that bad.

He's referring to the amount of hardware that increases quadratically with width (assuming a single cycle implementation), not the delay. Simplistic estimation for the mantissa calculation:
FP32 - 24*24 full adders = 576
FP64 - 53*53 full adders = 2809

About 5 times more hardware just for the mantissa.
 
He's referring to the amount of hardware that increases quadratically with width (assuming a single cycle implementation), not the delay. Simplistic estimation for the mantissa calculation:
FP32 - 24*24 full adders = 576
FP64 - 53*53 full adders = 2809

About 5 times more hardware just for the mantissa.

I was referring to the "quarter speed".
 
Stack data structure from CUDA

Does CUDA allow one to write a program with a stack-type data structure? A lot of algorithms require a stack-like structure. How about hash tables?
 
Does CUDA allow one to write a program with a stack-type data structure? A lot of algorithms require a stack-like structure. How about hash tables?

Explicit or implicit stack? An explicit stack, yes: just use scratchpad RAM to make one. An implicit stack, as in CPU-style stack frames? I don't think so; there is apparently no program-counter control / computed-jump instruction. Someone on GPGPU basically said no implicitly recursive functions.

As for a hash table, again, as long as you've got scatter/gather, you can write one yourself.
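Something like this rough sketch is what I mean by the explicit case (the depth, block size and kernel are made up for illustration): each thread keeps a small fixed-depth stack in the scratchpad, interleaved with a stride of one block width so consecutive threads touch different banks.
Code:
#define STACK_DEPTH 8
#define THREADS_PER_BLOCK 64

__global__ void explicitStackExample(const int* input, int* output)
{
    // One small fixed-depth stack per thread, carved out of shared memory.
    __shared__ int stack[STACK_DEPTH * THREADS_PER_BLOCK];
    int top = -1;   // empty stack for this thread

    // Push one value.
    ++top;
    stack[top * THREADS_PER_BLOCK + threadIdx.x] = input[threadIdx.x];

    // Pop it again.
    int value = stack[top * THREADS_PER_BLOCK + threadIdx.x];
    --top;

    output[threadIdx.x] = value;
}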
 
Stacks are hard on the GPU. Even with the PDC (Parallel Data Cache), you have to share the space with all the threads in the warp, and you have to be careful about conflicts on bank access. In the GPGPU community, we adapt data structure traversal to support "restart" or "backtrack" methods; see Foley et al.'s paper from Graphics Hardware last year or Horn et al.'s paper from the upcoming I3D, both on k-d tree raytracing. The latter emulates a small fixed-size stack using the register file, and uses looping constructs instead of pure streaming.

With scatter and gather, you could emulate stacks in GPU memory (and even in host memory on ATI chips with fine-grained scatter), but it becomes *extremely* expensive. You are now talking about tremendous amounts of latency to cover, and you are still talking about defining a bounded region of memory for each thread, basically allocating for the worst-case stack usage. Someone can probably extend Lefohn's GLift GPU data structure work to make this easier to use, but it's likely still going to be expensive.

The main issue with recursion or stacks is that the memory requirements are unbounded. On the CPU this really isn't a problem, but as you have to handle 100s-1000s of threads, the required space on a GPU or a CPU way into the future gets quite high.
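To make the "short stack in the register file plus looping constructs" idea above a bit more concrete, here's a much-simplified sketch (not the actual code from those papers; the node layout, the depth of 4 and the traversal are made up):
Code:
#define SHORT_STACK 4

struct Node { int left, right, isLeaf; float value; };

__device__ float traverse(const Node* nodes, float query)
{
    int stack[SHORT_STACK];   // tiny fixed-size per-thread stack
    int top = -1;
    int node = 0;             // start at the root
    float result = 0.0f;

    while (true)              // looping construct instead of recursion
    {
        if (nodes[node].isLeaf)
        {
            result = nodes[node].value;   // placeholder "work" at the leaf
            if (top < 0)
                break;                    // stack empty -> traversal done
            node = stack[top--];          // backtrack to a deferred node
        }
        else
        {
            int nearChild = (query < nodes[node].value) ? nodes[node].left
                                                        : nodes[node].right;
            int farChild  = (query < nodes[node].value) ? nodes[node].right
                                                        : nodes[node].left;
            if (top < SHORT_STACK - 1)
                stack[++top] = farChild;  // defer the far child if there's room
            // (a real implementation falls back to a restart when the short
            //  stack overflows, instead of silently dropping the far child)
            node = nearChild;
        }
    }
    return result;
}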
 
Memory isn't always (dynamically) unbounded with recursion. With tail-recursion it's constant, and even with some almost-tail-recursive functions (tail-recursion+mod, linear logic, etc) it's predictable.

What you describe as "restart" or "backtrack" reminds me of a form of STM (Software Transactional Memory); I assume you mean that if you have a thread collision, or if you run out of space, you recompute the data structure/computation from a known-good state.
 
Well, a practical example might make this easier to understand.
Code:
// Copy input to shared mem.
shared[threadIdx.x] = values[threadIdx.x];

// We just wrote to shared memory and other threads might access it -> sync.
__syncthreads();
float myVar = shared[threadIdx.x + 24];

// No __syncthreads needed here, because we are only reading; not writing.
myVar *= 3.14f;
shared[threadIdx.x] = myVar;

// We just wrote to shared memory and another thread might access it -> sync.
__syncthreads();

[...]

Cheers! So basically you sync after writes to shared memory. Thanks again, this will be useful.
 
Memory isn't always (dynamically) unbounded with recursion. With tail-recursion it's constant, and even with some almost-tail-recursive functions (tail-recursion+mod, linear logic, etc) it's predictable.

What you describe as "restart" or "backtrack" reminds me of a form of STM (Software Transactional Memory); I assume you mean that if you have a thread collision, or if you run out of space, you recompute the data structure/computation from a known-good state.

You are correct, but tail recursion is generally converted into loop form by the compiler to avoid the stack-frame push, and hence is no longer recursion from a resource perspective.

Restart and backtrack in this case are not related to transactional memory, as we aren't talking about creation/manipulation of the data structure, just traversal. You rely on the properties of the algorithm to effectively change the query being issued through the tree. In the case of a k-d tree this can be done by changing the values of tMin and tMax during traversal.
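Just to illustrate the tail-recursion point with a trivial made-up example (plain C, nothing CUDA-specific): the tail call has nothing pending after it, so the compiler can turn it into a loop and the stack never grows.
Code:
// Tail-recursive form: the recursive call is the very last thing done.
int sumTail(int n, int acc)
{
    if (n == 0)
        return acc;
    return sumTail(n - 1, acc + n);   // tail call -- nothing pending after it
}

// What the compiler can turn it into: a plain loop with constant stack usage.
int sumLoop(int n)
{
    int acc = 0;
    while (n != 0)
    {
        acc += n;
        --n;
    }
    return acc;
}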
 
According to the docs, recursion is not supported in the current nvcc compiler, from what I recall.
 
Cheers! So basically you sync after writes to shared memory. Thanks again, this will be useful.

Just a note - having worked on this for a little while, that proposed solution does not actually enforce mutual exclusion at all. __syncthreads() alone won't do it; it's just a barrier. I'm probably going to have to look into the bakery algorithm or something similar, since there are no atomic instructions here at all, AFAIK.
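To show what I mean with a minimal made-up sketch (assuming a single block and 16 bins): the barriers order the phases, but nothing serialises the read-modify-write in the middle, so threads that hit the same bin clobber each other.
Code:
__global__ void naiveHistogram(const int* data, int* counts)
{
    __shared__ int sCounts[16];
    if (threadIdx.x < 16)
        sCounts[threadIdx.x] = 0;
    __syncthreads();                    // barrier: everyone sees the zeroed bins

    int bin = data[threadIdx.x] & 15;
    sCounts[bin] = sCounts[bin] + 1;    // RACE: read-modify-write, not atomic

    __syncthreads();                    // again just a barrier, not a lock --
                                        // updates from threads sharing a bin are lost
    if (threadIdx.x < 16)
        counts[threadIdx.x] = sCounts[threadIdx.x];
}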
 