NVIDIA GF100 & Friends speculation

Discussion in 'Architecture and Products' started by Arty, Oct 1, 2009.

  1. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    692
    Likes Received:
    30
    Why does Newegg have GTX 200 series cards in stock when NVIDIA is supposed to be close to releasing its next-gen cards? These cards weren't in stock before. Why are there quantities now :?:
     
    #2141 ECH, Feb 27, 2010
    Last edited by a moderator: Feb 27, 2010
  2. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Meh, yeah I guess for something like an octree you could use an append buffer to add child nodes and just update the parent with the index for each child. Same effect as a pointer but ugly :razz:
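A minimal host-side C++ sketch of that idea, assuming a node pool that only ever grows at the end (standing in for an append buffer) and children stored as indices rather than pointers. The names `OctreeNode`, `NodePool`, `addChild` and `kNoChild` are made up for illustration, and the code is sequential, so it does not address the two-threads-one-child race raised in the next paragraph.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical sentinel for "no child in this octant".
constexpr std::uint32_t kNoChild = 0xFFFFFFFFu;

// Children are indices into the pool, not pointers.
struct OctreeNode {
    std::array<std::uint32_t, 8> child;
    OctreeNode() { child.fill(kNoChild); }
};

// Stand-in for an append buffer: nodes are only added at the end,
// so an index handed out once stays valid for the pool's lifetime.
struct NodePool {
    std::vector<OctreeNode> nodes;

    std::uint32_t append() {
        nodes.emplace_back();
        return static_cast<std::uint32_t>(nodes.size() - 1);
    }
};

// Append a new child and store its index in the parent's slot for `octant`.
// If that child already exists, return its index instead of creating another.
std::uint32_t addChild(NodePool& pool, std::uint32_t parent, int octant) {
    const std::uint32_t existing = pool.nodes[parent].child[octant];
    if (existing != kNoChild) return existing;
    // append() may reallocate the vector, so re-index the parent afterwards
    // rather than holding a reference across the call.
    const std::uint32_t idx = pool.append();
    pool.nodes[parent].child[octant] = idx;
    return idx;
}
```

Because only indices are stored, the whole pool can be copied or reallocated without fixing up any links, which is the "same effect as a pointer" part.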

    Btw, do GPUs currently support mutexes? That would be a requirement for rebuilding the acceleration structure on the GPU, no? (Say two threads want to create the same child node in a kd-tree to insert two different objects that fall into that cell.)
     
  3. Sontin

    Banned

    Joined:
    Dec 9, 2009
    Messages:
    399
    Likes Received:
    0
    They need GT200 chips for their Quadro and Tesla business. And not every chip is good enough.
     
  4. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    I don't believe they currently support the CPU-style lightweight mutexes such as LL/SC or FetchAndAdd/CmpExchange, but they do support sync primitives that could be used as a heavyweight alternative. But in general I think you are currently better off not playing around with any linked-list style structures (LL, n-ary trees, lattices, etc.).
     
  5. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Even without explicit support you can do a form of test and set with atomics.
     
  6. ap_

    ap_
    Newcomer

    Joined:
    Feb 17, 2010
    Messages:
    9
    Likes Received:
    0
    CUDA supports atomicCAS(), but it is relatively slow at the moment. Building mutexes out of it is not a particularly good idea on the current generation.
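For reference, the general shape of a mutex built on compare-and-swap looks roughly like the sketch below. This is a host-side C++ analogue using `std::atomic`, not CUDA code; the class name `CasSpinLock` is hypothetical. It only shows the test-and-set pattern and says nothing about how such a lock would perform, or behave under warp divergence, on a GPU.

```cpp
#include <atomic>

// Hypothetical spinlock: state 0 = unlocked, 1 = locked.
class CasSpinLock {
public:
    // Single attempt: atomically replace 0 with 1.
    // Returns true if this caller acquired the lock.
    bool tryLock() {
        int expected = 0;
        return state_.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire);
    }

    // Spin until the compare-and-swap succeeds.
    void lock() {
        while (!tryLock()) {
            // busy-wait
        }
    }

    // Release by storing 0 so another caller's CAS can succeed.
    void unlock() { state_.store(0, std::memory_order_release); }

private:
    std::atomic<int> state_{0};
};
```

Every failed attempt is another round trip through the atomic unit, which is why a slow CAS makes this pattern unattractive.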
     
  7. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189


    Actually, I don't think you'd want to use a syncthreads_count() to do a prefix sum on the mask. syncthreads_count() only returns the reduction of all the predicates in the thread block, which is not the same thing at all.

    Ballot is actually very useful for prefix sums, though: you can use it in conjunction with a boolean mask and a popcount to do a warp scan efficiently, for example:

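    uint laneId = threadIdx.x & 0x1f;     // lane index within the warp (0..31)
    uint laneMask = (1u << laneId) - 1;   // bits of all lower-numbered lanes
    ...
    int predicate = foo();
    uint ballot = __ballot(predicate);    // bit i is set if lane i's predicate is non-zero
    uint parents = ballot & laneMask;     // keep only the votes from lower lanes
    int warpScan = __popc(parents);       // exclusive prefix sum of the predicates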
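
To see what each lane ends up with, here is a small host-side C++ emulation of the snippet above. It is only a sketch: `emulateBallot`, `popcount32` and `warpScanExclusive` are hypothetical stand-ins for the hardware vote and popcount, and a loop over lanes replaces the 32 threads of a warp.

```cpp
#include <array>
#include <cstdint>

constexpr int kWarpSize = 32;

// Host stand-in for a warp vote: bit i is set when lane i's predicate is non-zero.
std::uint32_t emulateBallot(const std::array<int, kWarpSize>& predicate) {
    std::uint32_t ballot = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (predicate[lane] != 0) ballot |= (1u << lane);
    }
    return ballot;
}

// Portable population count: clears the lowest set bit until none remain.
int popcount32(std::uint32_t v) {
    int n = 0;
    while (v != 0) {
        v &= v - 1;
        ++n;
    }
    return n;
}

// For each lane, count the true predicates in lower-numbered lanes.
// That count is the exclusive prefix sum of the predicates across the warp.
std::array<int, kWarpSize> warpScanExclusive(
        const std::array<int, kWarpSize>& predicate) {
    const std::uint32_t ballot = emulateBallot(predicate);
    std::array<int, kWarpSize> scan{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::uint32_t lowerLanes = (1u << lane) - 1u;
        scan[lane] = popcount32(ballot & lowerLanes);
    }
    return scan;
}
```

With predicates true in lanes 0, 2 and 3, lanes 0 through 4 get 0, 1, 1, 2, 3. Adding a lane's own predicate to its value gives the inclusive scan instead.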
     
  8. Vincent

    Newcomer

    Joined:
    May 28, 2007
    Messages:
    235
    Likes Received:
    0
    Location:
    London
    GTX200 series with two renaming schemes = future GTX380/GTX 360
     
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Interesting, nice workaround :)

    Cool, I wasn't aware that the __popc function existed. So yeah, in conjunction with __ballot you get a warp level sum. I still don't understand why you think syncthreads_count isn't useful though. There will be times when you want to evaluate the mask over the entire block. Essentially it's doing everything above as well as doing a final sum of all the individual warpScan values for thread(warpsize) across all warps.

    If you mean that syncthreads_count isn't useful as a scan on a single warp level mask produced by __ballot then yeah, absolutely. I should've worded my original comments better to make it clear that I wasn't suggesting otherwise.
     
  10. Sontin

    Banned

    Joined:
    Dec 9, 2009
    Messages:
    399
    Likes Received:
    0
    This would be real fun because I read somewhere that GT200b is EOL. :lol:
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London

    http://forum.beyond3d.com/showthread.php?p=1327227#post1327227

    For the correctly linked patents being referred to.

    Yes, the description of __ballot() doesn't match the PSCAN instruction. I'm confused why the PSCAN in that code snippet is not in the CUDA 3.0 guide :???:

    Is it missing?

    Jawed
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    RecessionCone's macro seems to be the closest thing to it. There's no single instruction that accomplishes what PSCAN does. Also, I was wrong about syncthreads_count, as it returns a single value to all threads: the count of true predicate evaluations. At first I thought it was similar to rank() from the patent but meh.
     
  13. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    692
    Likes Received:
    30
    So because of their Quadro and Tesla line there is now a surplus of GTX 200 series GPUs floating about on Newegg? So close to their next-gen release? Bah, I don't get it myself... how can there be no stock for months and then all of a sudden, close to the release of their next gen, there is stock? I get the impression these will be their mid-range GPUs...
     
  14. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Not so easy with compute shaders. For instance, you are not allowed to have a loop whose exit condition is a function of a value that has been fetched from a UAV, for example via atomics.
     
  15. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    I do think syncthreads_count is useful, but I'm trying to point out that it does something very different: it's a reduction, rather than a prefix sum. Just to be explicit:

    x = [1 2 3 4 5]
    reduce(x) = 15 # You can get this with syncthreads_count
    prefix_sum(x) = [1 3 6 10 15] # You can get this with ballot, etc.

    You can derive all the elements in a prefix sum across a warp using ballot, as I pointed out. You can't derive all the elements in a prefix sum either across a warp or across a thread block with syncthreads_count. It produces a single scalar, the reduction of all the predicates from all threads in the block, after the block synchronized. This is much less information than you can get with a proper prefix sum.

    For example, you can use a predicate prefix sum to compact results: if every thread may be producing 0 or 1 elements, you can use the prefix sum to calculate the address to which each particular thread should store its result. The result of syncthreads_count, while still useful, only gives you the total number of results the block will be producing. They're fundamentally different things.
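
The compaction case can be sketched on the host in plain C++ as below. Assumptions: `Compacted` and `compact` are hypothetical names, `keep` holds one 0/1 predicate per element and has the same length as `input`, and a sequential loop stands in for the per-thread work.

```cpp
#include <cstddef>
#include <vector>

struct Compacted {
    std::vector<int> values;  // surviving elements, in their original order
    std::size_t count;        // total kept: all that a count-style reduction provides
};

// Compact `input`, keeping element i only where keep[i] is non-zero.
Compacted compact(const std::vector<int>& input, const std::vector<int>& keep) {
    const std::size_t n = input.size();

    // Exclusive prefix sum of the predicates:
    // offset[i] = number of kept elements before position i.
    std::vector<std::size_t> offset(n, 0);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offset[i] = running;
        running += (keep[i] != 0) ? 1 : 0;
    }

    // Scatter: each "thread" i writes to its own slot, so no two writes collide.
    Compacted out{std::vector<int>(running), running};
    for (std::size_t i = 0; i < n; ++i) {
        if (keep[i] != 0) out.values[offset[i]] = input[i];
    }
    return out;
}
```

The final value of `running` tells you how much output space to reserve, but only the per-element `offset` array tells each writer where to store.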

    In any case, I'm happy about both of them. =) You might as well do some computation as you hit the barrier, so I'm happy they're providing any and all and count barriers. But that's very different from a prefix sum. :smile:
     
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Yep, thanks. I realized my mistake a little while ago :oops:. syncthreads_count was appropriately named :)
     
  17. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    That's because it's probably doing a barrier and then the op.
     
  18. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    It's prudent for any company to care what their competition will do.

    What would that prove other than showing it can't retire kernels out of order? There can still be 16 running in parallel even if the chip hangs.
     
  19. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
  20. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    13,878
    Likes Received:
    4,727
    They have a whole GTX 3x0 series to fill out. The GT200b series will get another name change. I think they might end up getting DX 10.1 added to them, though.
     