Why would you need that? Isn't a pointer just an index into global memory (or some UAV in this case)? And with the interlocked counters you'll have no problem "allocating" new blocks either. I see no problem in either building or traversing such structures.
Why is newegg in stock of GTX 200 series cards when nvidia is supposed to be close to releasing their next-gen cards? These cards weren't in stock before. Why are there quantities now?
Meh, yeah, I guess for something like an octree you could use an append buffer to add child nodes and just update the parent with the index of each child. Same effect as a pointer, but ugly.
Btw, do GPUs currently support mutexes? That would be a requirement for rebuilding the acceleration structure on the GPU, no? (Say two threads want to create the same child node in a kd-tree to insert two different objects that fall into that cell.)
I don't believe they currently support CPU-style lightweight mutexes built on LL/SC or FetchAndAdd/CompareExchange, but they do support sync primitives that could be used as a heavyweight alternative. In general, though, I think you're currently better off not playing around with any linked-list-style structures (lists, n-ary trees, lattices, etc.).
Ballot is used to set a bit mask based on an arbitrary predicate; however, it doesn't actually perform the scan. syncthreads_count(), on the other hand, can be used to run a prefix sum on that mask. That comes in handy if you want to count the number of elements in an array that meet a certain criterion (e.g. val < pivot in a quicksort).
See US patents 2009008952 and 20090132878. The PSCAN operation described there is pretty much equivalent to syncthreads_count except that the latter runs across the entire block and not just a single warp.
Even without explicit support you can do a form of test and set with atomics.
Actually, I don't think you'd want to use a syncthreads_count() to do a prefix sum on the mask. syncthreads_count() only returns the reduction of all the predicates in the thread block, which is not the same thing at all.
Ballot is actually very useful for prefix sums, though: you can use it in conjunction with a boolean mask and a popcount to do a warp scan efficiently, for example:
uint laneId = threadIdx.x & 0x1f;    // index of this thread within its warp
uint laneMask = (1 << laneId) - 1;   // mask of all lanes below this one
...
int predicate = foo();
uint ballot = __ballot(predicate);   // one bit per lane whose predicate is true
uint earlier = ballot & laneMask;    // earlier lanes that voted true
int warpScan = __popc(earlier);      // exclusive prefix count for this lane
GTX200 series with two renaming schemes = future GTX380/GTX 360
They need GT200 chips for their Quadro and Tesla business. And not every chip is good enough.
That would be real fun, because I read somewhere that GT200b is EOL.
"Even without explicit support you can do a form of test and set with atomics."

Not so easy with compute shaders. For instance, you are not allowed to have a loop whose exit condition is a function of a value that has been fetched from a UAV, for example via atomics.
Cool, I wasn't aware that the __popc function existed. So yeah, in conjunction with __ballot you get a warp-level sum. I still don't understand why you think syncthreads_count isn't useful, though. There will be times when you want to evaluate the mask over the entire block. Essentially it's doing everything above as well as a final sum of all the individual warpScan values for the last thread of each warp, across all warps.
If you mean that syncthreads_count isn't useful as a scan on a single warp level mask produced by __ballot then yeah, absolutely. I should've worded my original comments better to make it clear that I wasn't suggesting otherwise.
The result of syncthreads_count, while still useful, only gives you the total number of results the block will be producing. They're fundamentally different things.
CUDA supports atomicCAS(), but it is relatively slow at the moment. Building mutexes out of it is not a particularly good idea on the current generation.
"Why would they care about the size of Cypress? Why would nVidia care about anything that any other company does when they design their chips?"

It's prudent for any company to care what their competition will do.
"Still waiting for someone to code up 15 infinite loops plus another kernel and see how it really works."

What would that prove, other than showing it can't retire kernels out of order? There can still be 16 running in parallel even if the chip hangs.
http://www.inet.se/recensioner/5408811/xfx-geforce-fermin-4gb
Original source: http://www.semiaccurate.com/2010/02/26/sweden-gets-world-exclusive-geforce-fermin-card/
The numbers there are making my head spin....
So because of their Quadro and Tesla lines there is now a surplus of GTX 200 series GPUs floating about on newegg? So close to their next-gen release? Bah, I don't get it myself... how can there be no stock for months and then, all of a sudden, close to the next-gen release, there is stock? I get the impression these will be their mid-range GPUs...