G80 Architecture from CUDA

I'm not totally sure how you use __syncthreads to enforce mutual exclusion, though, for example. Say I want each thread to examine a piece of data and write to a location in a shared buffer depending on that data... do you just place __syncthreads before and after the write, or...?
Well, a practical example might make this easier to understand.
Code:
// Copy input to shared mem.
shared[threadIdx.x] = values[threadIdx.x];

// We just wrote to shared memory and other threads might access it -> sync.
__syncthreads();
float myVar = shared[threadIdx.x + 24];

// No __syncthreads needed here, because we are only reading; not writing.
myVar *= 3.14f;
shared[threadIdx.x] = myVar;

// We just wrote to shared memory and another thread might access it -> sync.
__syncthreads();

[...]
 
They will almost certainly be half speed, because I don't think any sane circuit designer would leave all those transistors idling away when they're not being used for GPGPU apps.
Quarter speed, area for multipliers increases quadratically with width.

BTW, is there really no caching at all for scattered writes with the G80? Not even an output cache for write-combining?
 
Quarter speed, area for multipliers increases quadratically with width.
You are correct, sir. I was a little excited and thinking it would be done the way SSE units do it. But for a GPU it would be a waste. If they make it fully IEEE-754, it will probably be even less than quarter speed, like Cell.
 
BTW, is there really no caching at all for scattered writes with the G80?
No caching at all for writes, and reads are only cached when using pseudo-textures/constants through their respective dedicated caches.
Not even an output cache for write-combining?
Nope - the idea, as far as I can see, is that consecutive accesses will be optimal anyway, because the burst length is <= 4*16 (bytes * threads...).
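To make that concrete, here's a minimal sketch (my own illustration, not anything from the docs) of the access pattern I mean:
Code:
// Each of the 16 threads in a half-warp writes one consecutive 4-byte float,
// so the half-warp's writes fall into a single burst (4 bytes * 16 threads).
__global__ void coalescedCopy(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];   // thread k writes to base + 4*k
}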
 
Arun, if you want to write to consecutive addresses, why would you be using scatter? ;) There are applications which won't be able to coalesce writes from threads into a single write access but still have spatial coherence which a write-combining cache could make use of.

Silent_guy, I'm assuming they will use 26-bit multipliers and simply resort to a form of matrix multiplier for the doubles (which would go quarter speed). I always assumed that's how area-efficient single/double-precision FP multipliers worked, but I have to admit I have never bothered to actually look it up.
 
There are applications which won't be able to coalesce writes from threads into a single write access but still have spatial coherence which a write-combining cache could make use of.
Yeah, I agree. Honestly, I'm not sure whether G80+CUDA does that. The documentation clearly states that reads and writes are uncached, but this is an implementation detail, and many software programmers might not even think of this as "caching" if you mentioned it to them.
 
The delay of a Wallace-tree-based multiplier is only O(log n), so with respect to speed it's really not that bad.

He's referring to the amount of hardware that increases quadratically with width (assuming a single cycle implementation), not the delay. Simplistic estimation for the mantissa calculation:
FP32 - 24*24 full adders = 576
FP64 - 53*53 full adders = 2809

About 5 times more hardware just for the mantissa.
 
He's referring to the amount of hardware that increases quadratically with width (assuming a single cycle implementation), not the delay. Simplistic estimation for the mantissa calculation:
FP32 - 24*24 full adders = 576
FP64 - 53*53 full adders = 2809

About 5 times more hardware just for the mantissa.

I was referring to the "quarter speed".
 
Stack data structure from CUDA

Does CUDA allow one to write a program with a stack-type data structure? A lot of algorithms require a stack-like structure. How about hash tables?
 
Does CUDA allow one to write a program with a stack-type data structure? A lot of algorithms require a stack-like structure. How about hash tables?

Explicit or implicit stack? An explicit stack, yes: just use scratchpad RAM to make one. An implicit stack, as in CPU-style stack frames? I don't think so; there is apparently no program-counter control / computed-jump instruction. Someone on GPGPU basically said no implicitly recursive functions.

As for a hash table, again, as long as you've got scatter/gather, you can write one yourself.
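Something like this rough sketch is what I mean by the explicit case (the depth, block size and kernel are made up for illustration): each thread keeps a small fixed-depth stack in the scratchpad, interleaved with a stride of one block width so consecutive threads touch different banks.
Code:
#define STACK_DEPTH 8
#define THREADS_PER_BLOCK 64

__global__ void explicitStackExample(const int* input, int* output)
{
    // One small fixed-depth stack per thread, carved out of shared memory.
    __shared__ int stack[STACK_DEPTH * THREADS_PER_BLOCK];
    int top = -1;   // empty stack for this thread

    // Push one value.
    ++top;
    stack[top * THREADS_PER_BLOCK + threadIdx.x] = input[threadIdx.x];

    // Pop it again.
    int value = stack[top * THREADS_PER_BLOCK + threadIdx.x];
    --top;

    output[threadIdx.x] = value;
}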
 
Stacks are hard on the GPU. Even with the PDC (Parallel Data Cache), you have to share the space with all the threads in the warp, and you have to be careful about conflicts on bank access. In the GPGPU community, we adapt data structure traversal to support "restart" or "backtrack" methods; see Foley et al.'s paper from Graphics Hardware last year or Horn et al.'s paper from the upcoming I3D, both on k-d tree raytracing. The latter emulates a small fixed-size stack using the register file, and uses looping constructs instead of pure streaming.

With scatter and gather, you could emulate stacks in GPU memory (and even in host memory on ATI chips with fine-grained scatter), but it becomes *extremely* expensive. You are now talking about tremendous amounts of latency to cover, and you are still talking about defining a bounded region of memory for each thread, basically allocating for the worst-case stack usage. Someone can probably extend Lefohn's GLift GPU data structure work to make this easier to use, but it's likely still going to be expensive.

The main issue with recursion or stacks is that the memory requirements are unbounded. On the CPU this really isn't a problem, but as you have to handle 100s-1000s of threads, the required space on a GPU or a CPU way into the future gets quite high.
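To make the "short stack in the register file plus looping constructs" idea above a bit more concrete, here's a much-simplified sketch (not the actual code from those papers; the node layout, the depth of 4 and the traversal are made up):
Code:
#define SHORT_STACK 4

struct Node { int left, right, isLeaf; float value; };

__device__ float traverse(const Node* nodes, float query)
{
    int stack[SHORT_STACK];   // tiny fixed-size per-thread stack
    int top = -1;
    int node = 0;             // start at the root
    float result = 0.0f;

    while (true)              // looping construct instead of recursion
    {
        if (nodes[node].isLeaf)
        {
            result = nodes[node].value;   // placeholder "work" at the leaf
            if (top < 0)
                break;                    // stack empty -> traversal done
            node = stack[top--];          // backtrack to a deferred node
        }
        else
        {
            int nearChild = (query < nodes[node].value) ? nodes[node].left
                                                        : nodes[node].right;
            int farChild  = (query < nodes[node].value) ? nodes[node].right
                                                        : nodes[node].left;
            if (top < SHORT_STACK - 1)
                stack[++top] = farChild;  // defer the far child if there's room
            // (a real implementation falls back to a restart when the short
            //  stack overflows, instead of silently dropping the far child)
            node = nearChild;
        }
    }
    return result;
}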
 
Memory isn't always (dynamically) unbounded with recursion. With tail-recursion it's constant, and even with some almost-tail-recursive functions (tail-recursion+mod, linear logic, etc) it's predictable.

What you describe as "restart" or "backtrack" reminds me of a form of STM (Software Transactional Memory); I assume you mean that if you have a thread collision, or if you run out of space, you recompute the data structure/computation from a known-good state.
 
Well, a practical example might make this easier to understand.
Code:
// Copy input to shared mem.
shared[threadIdx.x] = values[threadIdx.x];

// We just wrote to shared memory and other threads might access it -> sync.
__syncthreads();
float myVar = shared[threadIdx.x + 24];

// No __syncthreads needed here, because we are only reading; not writing.
myVar *= 3.14f;
shared[threadIdx.x] = myVar;

// We just wrote to shared memory and another thread might access it -> sync.
__syncthreads();

[...]

Cheers! So basically you sync after writes to shared memory. Thanks again, this will be useful.
 
Memory isn't always (dynamically) unbounded with recursion. With tail-recursion it's constant, and even with some almost-tail-recursive functions (tail-recursion+mod, linear logic, etc) it's predictable.

What you describe as "restart" or "backtrack" reminds me of a form of STM (Software Transactional Memory); I assume you mean that if you have a thread collision, or if you run out of space, you recompute the data structure/computation from a known-good state.

You are correct, but tail recursion is generally converted into loop form by the compiler to avoid the stack-frame push, and hence is no longer recursion from a resource perspective.

Restart and backtrack in this case are not related to transactional memory, as we aren't talking about creation/manipulation of the data structure, just traversal. You rely on the properties of the algorithm to effectively change the query being issued through the tree. In the case of a k-d tree this can be done by changing the values of tMin and tMax during traversal.
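Just to illustrate the tail-recursion point with a trivial made-up example (plain C, nothing CUDA-specific): the tail call has nothing pending after it, so the compiler can turn it into a loop and the stack never grows.
Code:
// Tail-recursive form: the recursive call is the very last thing done.
int sumTail(int n, int acc)
{
    if (n == 0)
        return acc;
    return sumTail(n - 1, acc + n);   // tail call -- nothing pending after it
}

// What the compiler can turn it into: a plain loop with constant stack usage.
int sumLoop(int n)
{
    int acc = 0;
    while (n != 0)
    {
        acc += n;
        --n;
    }
    return acc;
}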
 
According to the docs, recursion is not supported in the current nvcc compiler, from what I recall.
 
Cheers! So basically you sync after writes to shared memory. Thanks again, this will be useful.

Just a note - having worked on this for a little while, that proposed solution does not actually enforce mutual exclusion at all. __syncthreads() alone won't do it; it's just a barrier. I'm probably going to have to look into the bakery algorithm or something similar, since there are no atomic instructions here at all, AFAIK.
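To show what I mean with a minimal made-up sketch (assuming a single block and 16 bins): the barriers order the phases, but nothing serialises the read-modify-write in the middle, so threads that hit the same bin clobber each other.
Code:
__global__ void naiveHistogram(const int* data, int* counts)
{
    __shared__ int sCounts[16];
    if (threadIdx.x < 16)
        sCounts[threadIdx.x] = 0;
    __syncthreads();                    // barrier: everyone sees the zeroed bins

    int bin = data[threadIdx.x] & 15;
    sCounts[bin] = sCounts[bin] + 1;    // RACE: read-modify-write, not atomic

    __syncthreads();                    // again just a barrier, not a lock --
                                        // updates from threads sharing a bin are lost
    if (threadIdx.x < 16)
        counts[threadIdx.x] = sCounts[threadIdx.x];
}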
 