I'm not totally sure how you use __syncthreads to enforce mutual exclusion, though. For example, say I want each thread to examine a piece of data and write to a location in a shared buffer depending on that data... do you just place __syncthreads before and after the write, or...?

Well, a practical example might make this easier to understand.
// Copy input to shared mem.
shared[threadIdx.x] = values[threadIdx.x];
// We just wrote to shared memory and other threads might access it -> sync.
__syncthreads();
float myVar = shared[threadIdx.x + 24];
// No __syncthreads needed here, because we are only reading, not writing.
myVar *= 3.14f;
shared[threadIdx.x] = myVar;
// We just wrote to shared memory and another thread might access it -> sync.
__syncthreads();
[...]
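For what it's worth, here is a rough sketch of how a fragment like that might sit inside a complete kernel. The kernel name, BLOCK_SIZE, and the output buffer are my own placeholders, not part of the original example, and I've added one extra barrier between the read and the write-back so no thread can overwrite an element another thread has not read yet.

#define BLOCK_SIZE 256

__global__ void scale_shared(const float *values, float *out)
{
    __shared__ float shared[BLOCK_SIZE];
    int tid = threadIdx.x;

    // Copy input to shared mem.
    shared[tid] = values[blockIdx.x * BLOCK_SIZE + tid];
    // Barrier: every thread's write must be visible before anyone reads
    // another thread's element.
    __syncthreads();

    // Read a neighbouring thread's element (wrapped so it stays in bounds).
    float myVar = shared[(tid + 24) % BLOCK_SIZE];
    myVar *= 3.14f;

    // Barrier before overwriting slots that other threads may still be reading.
    __syncthreads();
    shared[tid] = myVar;

    // Sync once more, as in the original fragment, before any later
    // cross-thread use of shared[].
    __syncthreads();
    out[blockIdx.x * BLOCK_SIZE + tid] = shared[tid];
}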
They will almost certainly be half speed, because I don't think any sane circuit designer would waste all those transistors idling away when not being used for GPGPU apps.

Quarter speed; the area for multipliers increases quadratically with width.

You are correct, sir. I was a little excited and thinking it would be done the way SSE units do it, but for a GPU that would be a waste. If they make it fully IEEE-754 compliant, it will probably be even less than quarter speed, like Cell.
BTW, is there really no caching at all for scattered writes with the G80?

No caching at all for writes, and reads are only cached when using pseudo-textures/constants, through their respective dedicated caches.

Not even an output cache for write-combining?

Nope; the idea, as far as I can see, is that consecutive accesses will be optimal anyway, because the burst length is <= 4*16 (bytes * threads).
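To make the "consecutive accesses" point concrete, here is a hypothetical pair of kernels (not from the thread): with one 4-byte word per thread at consecutive addresses, the 16 threads of a half-warp can be served by a single burst, whereas data-dependent scattered writes cannot be merged, because there is no write cache to combine them.

__global__ void write_coalesced(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 1.0f;        // thread k writes word k: one transaction per half-warp
}

__global__ void write_scattered(float *out, const int *perm)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[perm[i]] = 1.0f;  // data-dependent address: each write can end up as
                          // its own memory transaction
}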
There are applications which won't be able to coalesce writes from different threads into a single access but still have spatial coherence that a write-combining cache could make use of.

Yeah, I agree. Honestly, I'm not sure whether G80+CUDA does that. The documentation clearly states that reads and writes are uncached, but this is an implementation detail, and many software programmers might not even think of this as "caching" if you mentioned it to them.

Quarter speed; the area for multipliers increases quadratically with width.
The delay of a Wallace-tree-based multiplier is only O(log n), so with respect to speed it's really not that bad.
He's referring to the amount of hardware that increases quadratically with width (assuming a single-cycle implementation), not the delay. A simplistic estimate for the mantissa multiplier:

FP32: 24*24 = 576 full adders
FP64: 53*53 = 2809 full adders

That's about 5 times more hardware just for the mantissa.
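The same estimate in a couple of lines of host-side C, for anyone who wants to plug in other mantissa widths; the width-squared full-adder count is the simplistic single-cycle assumption from above, not a claim about how any real GPU multiplier is built:

#include <stdio.h>

// Simplistic area model: a single-cycle array multiplier needs roughly
// width * width full adders for the mantissa product.
static int full_adders(int mantissa_bits)
{
    return mantissa_bits * mantissa_bits;
}

int main(void)
{
    int fp32 = full_adders(24);   // 24-bit FP32 mantissa (with hidden bit)
    int fp64 = full_adders(53);   // 53-bit FP64 mantissa
    printf("FP32: %d, FP64: %d, ratio: %.1fx\n", fp32, fp64, (double)fp64 / fp32);
    return 0;
}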
Does CUDA allow one to write programs with a stack-type data structure? A lot of algorithms require a stack-like structure. How about hash tables?
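The thread doesn't answer this directly, but the usual workaround when there is no real call stack or per-thread dynamic allocation is an explicit, fixed-capacity stack in a per-thread array. The sketch below is hypothetical (the kernel name, the implicit-tree node numbering, and the 1024-node cap are all made up for illustration); on G80 such an array lives in local memory, so it pays to keep the capacity small.

#define MAX_DEPTH 32   // fixed capacity: no per-thread dynamic allocation

// Each thread walks its own implicit binary tree iteratively, using a small
// per-thread array as an explicit stack instead of recursion.
__global__ void walk_tree(const int *roots, int *visited_count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    int stack[MAX_DEPTH];
    int top = 0;                 // number of valid entries
    stack[top++] = roots[tid];   // push the root node index

    int count = 0;
    while (top > 0) {
        int node = stack[--top]; // pop
        ++count;

        // Children of node n are 2n+1 and 2n+2 in this made-up implicit
        // layout, capped at 1024 nodes; push them if there is room.
        int left  = 2 * node + 1;
        int right = 2 * node + 2;
        if (left  < 1024 && top < MAX_DEPTH) stack[top++] = left;
        if (right < 1024 && top < MAX_DEPTH) stack[top++] = right;
    }
    visited_count[tid] = count;
}

Hash tables are trickier on that generation of hardware, since concurrent inserts need some way to resolve collisions; without atomic operations the usual fallbacks are per-block tables in shared memory or sorting-based approaches.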
Memory isn't always (dynamically) unbounded with recursion. With tail recursion it's constant, and even with some almost-tail-recursive functions (tail-recursion+mod, linear logic, etc.) it's predictable.
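A toy illustration of the tail-recursion point (mine, not from the thread): when the recursive call is the very last thing a function does, it can be rewritten mechanically as a loop, so the space it needs is constant no matter how deep the recursion goes. In plain C:

// Tail-recursive sum over [0, n): the recursive call is the last operation,
// so no state has to be kept per call frame.
int sum_rec(int i, int n, int acc)
{
    if (i == n) return acc;
    return sum_rec(i + 1, n, acc + i);
}

// The mechanical loop form: constant space regardless of n, which is the
// shape that maps naturally onto hardware without a call stack.
int sum_loop(int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += i;
    return acc;
}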
What you describe as "restart" or "backtrack" reminds me of a form of STM (Software Transactional Memory). I assume you mean that if you have a thread collision, or if you run out of space, you recompute the data structure/computation from a known good state.
Cheers! So basically you sync after writes to shared memory. Thanks again, this will be useful.