NVIDIA GF100 & Friends speculation

Why does newegg have GTX 200 series cards in stock when nvidia is supposed to be close to releasing their next-gen cards? These cards weren't in stock before. Why are there quantities now :?:
 
Why would you need that? Isn't a pointer just an index into global memory? :) (or into some UAV in this case). And with the interlocked counters you'll have no problem "allocating" new blocks either. I see no problem in either building or traversing such structures.

Meh, yeah I guess for something like an octree you could use an append buffer to add child nodes and just update the parent with the index for each child. Same effect as a pointer but ugly :p
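To make the "index instead of pointer" idea concrete, here's a rough CPU-side sketch in C (all names and sizes are made up for illustration); the bump counter stands in for the interlocked counter / append buffer you'd use on the GPU:

```c
/* Hypothetical sketch: an octree stored in a flat node pool, with child
 * indices standing in for pointers. The "append buffer" is modelled by a
 * plain bump counter; on the GPU this would be an interlocked/atomic add. */
#define MAX_NODES 1024
#define NO_CHILD  (-1)

typedef struct {
    int children[8];   /* indices into the pool, NO_CHILD if absent */
} OctNode;

static OctNode pool[MAX_NODES];
static int pool_count = 0;

/* Allocate a fresh node and return its index (the pointer substitute). */
int alloc_node(void) {
    int idx = pool_count++;          /* atomic increment on the GPU */
    for (int i = 0; i < 8; i++)
        pool[idx].children[i] = NO_CHILD;
    return idx;
}

/* Create a child in the given octant and patch the parent with its index. */
int add_child(int parent, int octant) {
    int child = alloc_node();
    pool[parent].children[octant] = child;
    return child;
}
```

Same effect as chasing pointers, just with array indexing everywhere.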

Btw, do GPUs currently support mutexes? That would be a requirement for rebuilding the acceleration structure on the GPU, no? (say two threads want to create the same child node in a kd-tree to insert two different objects that fall into that cell)
 
Why does newegg have GTX 200 series cards in stock when nvidia is supposed to be close to releasing their next-gen cards? These cards weren't in stock before. Why are there quantities now :?:

They need GT200 chips for their Quadro and Tesla business. And not every chip is good enough.
 
Meh, yeah I guess for something like an octree you could use an append buffer to add child nodes and just update the parent with the index for each child. Same effect as a pointer but ugly :p

Btw, do GPUs currently support mutexes? That would be a requirement for rebuilding the acceleration structure on the GPU, no? (say two threads want to create the same child node in a kd-tree to insert two different objects that fall into that cell)

I don't believe they currently support the CPU-style lightweight atomic primitives such as LL/SC or FetchAndAdd/CmpExchange, but they do support sync primitives that could be used as a heavyweight alternative. In general, though, I think you're currently better off not playing around with any linked-list-style structures (linked lists, n-ary trees, lattices, etc.).
 
I don't believe they currently support the CPU-style lightweight atomic primitives such as LL/SC or FetchAndAdd/CmpExchange, but they do support sync primitives that could be used as a heavyweight alternative. In general, though, I think you're currently better off not playing around with any linked-list-style structures (linked lists, n-ary trees, lattices, etc.).

CUDA supports atomicCAS(), but it is relatively slow at the moment. Building mutexes out of it is not a particularly good idea on the current generation.
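For illustration, this is roughly what "building a mutex out of atomicCAS()" means, sketched on the CPU with C11 atomics standing in for the CUDA intrinsic (and, per the above, not something you'd actually want threads spinning on with the current generation):

```c
#include <stdatomic.h>

/* Illustrative CPU sketch of a spin lock built from compare-and-swap;
 * atomic_compare_exchange_weak plays the role of CUDA's atomicCAS(). */
static atomic_int lock_word = 0;

void spin_lock(atomic_int *l) {
    int expected = 0;
    /* Try to swing 0 -> 1; on failure, reset expected and retry. */
    while (!atomic_compare_exchange_weak(l, &expected, 1))
        expected = 0;
}

void spin_unlock(atomic_int *l) {
    atomic_store(l, 0);
}
```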
 
Ballot is used to set a bit mask based on an arbitrary predicate; however, it doesn't actually perform the scan. syncthreads_count() on the other hand can be used to run a prefix sum on that mask. That comes in handy if you want to count the number of elements in an array that meet a certain criterion (e.g. val < pivot in a quicksort).

See US patents 2009008952 and 20090132878. The PSCAN operation described there is pretty much equivalent to syncthreads_count except that the latter runs across the entire block and not just a single warp.


Actually, I don't think you'd want to use a syncthreads_count() to do a prefix sum on the mask. syncthreads_count() only returns the reduction of all the predicates in the thread block, which is not the same thing at all.

Ballot is actually very useful for prefix sums, though: you can use it in conjunction with a boolean mask and a popcount to do a warp scan efficiently, for example:

uint warpId = threadIdx.x & 0x1f;
uint warpMask = (1 << warpId) - 1;
...
int predicate = foo();
uint ballot = __ballot(predicate);
uint parents = ballot & warpMask;
int warpScan = __popc(parents);
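To see what that snippet computes, here's a small CPU model of it in C (the popc helper and the loop over lanes just emulate __popc and the per-thread execution); each lane ends up with the exclusive prefix sum of the predicates of the lanes below it:

```c
#include <stdint.h>

#define WARP_SIZE 32

/* Portable popcount, standing in for CUDA's __popc. */
static int popc(uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* CPU model of the snippet above: given one predicate per lane, build the
 * ballot mask, then give each lane the exclusive prefix sum of predicates. */
void warp_scan(const int pred[WARP_SIZE], int out[WARP_SIZE]) {
    uint32_t ballot = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (pred[lane]) ballot |= 1u << lane;     /* __ballot(predicate) */
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        uint32_t warpMask = (1u << lane) - 1;     /* lanes below this one */
        out[lane] = popc(ballot & warpMask);      /* __popc(parents) */
    }
}
```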
 
Why does newegg have GTX 200 series cards in stock when nvidia is supposed to be close to releasing their next-gen cards? These cards weren't in stock before. Why are there quantities now :?:

GTX200 series with two renaming schemes = future GTX380/GTX 360
 
Even without explicit support you can do a form of test and set with atomics.

Interesting, nice workaround :)
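In case it helps, one way to read "test and set with atomics" is as an atomic exchange: swap in a 1 and look at what was there before. A CPU sketch with C11 atomics (on the GPU, atomicExch() would play the same role):

```c
#include <stdatomic.h>

static atomic_int flag = 0;

/* Returns 1 if we acquired the flag (old value was 0), 0 otherwise. */
int test_and_set(atomic_int *f) {
    return atomic_exchange(f, 1) == 0;
}
```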

Actually, I don't think you'd want to use a syncthreads_count() to do a prefix sum on the mask. syncthreads_count() only returns the reduction of all the predicates in the thread block, which is not the same thing at all.

Ballot is actually very useful for prefix sums, though: you can use it in conjunction with a boolean mask and a popcount to do a warp scan efficiently, for example:

uint warpId = threadIdx.x & 0x1f;
uint warpMask = (1 << warpId) - 1;
...
int predicate = foo();
uint ballot = __ballot(predicate);
uint parents = ballot & warpMask;
int warpScan = __popc(parents);

Cool, I wasn't aware that the __popc function existed. So yeah, in conjunction with __ballot you get a warp level sum. I still don't understand why you think syncthreads_count isn't useful though. There will be times when you want to evaluate the mask over the entire block. Essentially it's doing everything above as well as doing a final sum of all the individual warpScan values for thread(warpsize) across all warps.

If you mean that syncthreads_count isn't useful as a scan on a single warp level mask produced by __ballot then yeah, absolutely. I should've worded my original comments better to make it clear that I wasn't suggesting otherwise.
 
Ballot is used to set a bit mask based on an arbitrary predicate; however, it doesn't actually perform the scan. syncthreads_count() on the other hand can be used to run a prefix sum on that mask. That comes in handy if you want to count the number of elements in an array that meet a certain criterion (e.g. val < pivot in a quicksort).

See US patents 2009008952 and 20090132878. The PSCAN operation described there is pretty much equivalent to syncthreads_count except that the latter runs across the entire block and not just a single warp.

http://forum.beyond3d.com/showthread.php?p=1327227#post1327227

See that post for the correct links to the patents being referred to.

Yes, the description of __ballot() doesn't match with the PSCAN instruction. I'm confused why PSCAN in that code snippet is not in the CUDA 3.0 guide :???:

Is it missing?

Jawed
 
RecessionCone's macro seems to be the closest thing to it. There's no single instruction that accomplishes what PSCAN does. Also, I was wrong about syncthreads_count, as it returns a single value to all threads - the count of true predicate evaluations. At first I thought it was similar to rank() from the patent, but meh.
 
They need GT200 chips for their Quadro and Tesla business. And not every chip is good enough.

GTX200 series with two renaming schemes = future GTX380/GTX 360

That would be real fun, because I read somewhere that GT200b is EOL. :LOL:

So because of their Quadro and Tesla line there is now a surplus of GTX 200 series GPUs floating about on newegg? So close to their next-gen release? Bah, I don't get it myself... how can there be no stock for months and then, all of a sudden, stock right before the next-gen release? I get the impression these will be their mid-range GPUs...
 
Even without explicit support you can do a form of test and set with atomics.
Not so easy with compute shaders. For instance, you are not allowed to have a loop whose exit condition is a function of a value that has been fetched from a UAV, for example via atomics.
 
Cool, I wasn't aware that the __popc function existed. So yeah, in conjunction with __ballot you get a warp level sum. I still don't understand why you think syncthreads_count isn't useful though. There will be times when you want to evaluate the mask over the entire block. Essentially it's doing everything above as well as doing a final sum of all the individual warpScan values for thread(warpsize) across all warps.

If you mean that syncthreads_count isn't useful as a scan on a single warp level mask produced by __ballot then yeah, absolutely. I should've worded my original comments better to make it clear that I wasn't suggesting otherwise.

I do think syncthreads_count is useful, but I'm trying to point out that it does something very different: it's a reduction, rather than a prefix sum. Just to be explicit:

x = [1 2 3 4 5]
reduce(x) = 15 # You can get this with syncthreads_count
prefix_sum(x) = [1 3 6 10 15] #You can get this with ballot, etc.

You can derive all the elements in a prefix sum across a warp using ballot, as I pointed out. You can't derive all the elements in a prefix sum, either across a warp or across a thread block, with syncthreads_count. It produces a single scalar, the reduction of all the predicates from all threads in the block, after the block synchronizes. This is much less information than you can get with a proper prefix sum.

For example, you can use a predicate prefix sum to compact results: if every thread may be producing 0 or 1 elements, the prefix sum gives the address to which each particular thread should store its result. The result of syncthreads_count, while still useful, only gives you the total number of results the block will be producing. They're fundamentally different things.
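A tiny serial C sketch of that compaction pattern (the loop plays the role of the threads, and the running counter is the exclusive prefix sum; the total it ends with is what syncthreads_count would report):

```c
/* Sketch of compaction via a predicate prefix sum: each "thread" i
 * produces 0 or 1 elements (pred[i]); the exclusive prefix sum of pred
 * gives the output slot for each producing thread, and the final total
 * is the block's element count. */
int compact(const int *vals, const int *pred, int n, int *out) {
    int scan = 0;                 /* running exclusive prefix sum */
    for (int i = 0; i < n; i++) {
        if (pred[i])
            out[scan] = vals[i];  /* scan value == destination address */
        scan += pred[i];
    }
    return scan;                  /* the reduction, i.e. the total count */
}
```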

In any case, I'm happy about both of them. =) You might as well do some computation as you hit the barrier, so I'm glad they're providing the any, all, and count barrier variants. But that's very different from a prefix sum. :smile:
 
Why would they care about the size of Cypress? Why would nVidia care about anything that any other company does when they design their chips?
It's prudent for any company to care what their competition will do.

Still waiting for someone to code up 15 infinite loops plus another kernel and see how it really works.
What would that prove other than showing it can't retire kernels out of order? There can still be 16 running in parallel even if the chip hangs.
 
So because of their Quadro and Tesla line there is now a surplus of GTX 200 series GPUs floating about on newegg? So close to their next-gen release? Bah, I don't get it myself... how can there be no stock for months and then, all of a sudden, stock right before the next-gen release? I get the impression these will be their mid-range GPUs...

They have a whole GTX 3x0 series to fill out. The GT200b series will make another name change. I think, though, they might end up getting DX10.1 added to them.
 