G80 Architecture from CUDA

Discussion in 'Architecture and Products' started by Rufus, Feb 16, 2007.

  1. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    968
    Likes Received:
    54
    Location:
    Canada
    Yes, nice article.
     
  2. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Well, a practical example might make this easier to understand.
    Code:
    // Each thread handles one element; use one index name throughout.
    unsigned int tid = threadIdx.x;
    
    // Copy input to shared mem.
    shared[tid] = values[tid];
    
    // We just wrote to shared memory and other threads might access it -> sync.
    __syncthreads();
    float myVar = shared[tid + 24];
    
    // No __syncthreads needed for this line: it only touches a register.
    myVar *= 3.14f;
    
    // Before overwriting shared[tid], every thread must be done reading
    // the old values -> sync again.
    __syncthreads();
    shared[tid] = myVar;
    
    // We just wrote to shared memory and another thread might access it -> sync.
    __syncthreads();
    
    [...]
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Quarter speed; the area of a multiplier increases quadratically with its width.

    BTW, is there really no caching at all for scattered writes with the G80? Not even an output cache for write-combining?
     
  4. Farhan

    Newcomer

    Joined:
    May 19, 2005
    Messages:
    152
    Likes Received:
    13
    Location:
    in the shade
    You are correct sir. I was a little excited and thinking it would be done the way SSE units do it. But for a GPU it would be a waste. If they make it fully IEEE-754, it will probably be less than quarter speed even, like Cell.
     
  5. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    No caching at all for writes, and reads are only cached when using pseudo-textures/constants through their respective dedicated caches.
    Nope - the idea, as far as I can see, is that consecutive accesses will be optimal anyway, because a half-warp's worth of consecutive accesses (4 bytes × 16 threads = 64 bytes) fits within the burst length.
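    To make that concrete, here is a minimal sketch of the two access patterns being discussed (CUDA; kernel names and the index array are purely illustrative, not from the thread):

```cuda
// Each thread of a half-warp (16 threads) reads/writes a consecutive
// 4-byte word, so the 16 accesses fall inside one 64-byte burst and
// can be serviced as a single memory transaction.
__global__ void coalesced_copy(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];  // thread i touches byte offset 4*i: contiguous
}

// Scatter: idx[] can point anywhere, so with no write caching each
// thread's store may turn into its own memory transaction.
__global__ void scattered_copy(const float *in, float *out, const int *idx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx[i]] = in[i];
}
```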
     
  6. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    The delay of a Wallace-tree-based multiplier is only O(log n), so with respect to speed it's really not that bad.
     
  7. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Arun, if you want to write to consecutive addresses, why would you be using scatter? ;) There are applications which won't be able to coalesce writes from threads into a single write access, but which still have spatial coherence that a write-combining cache could make use of.

    Silent_guy, I'm assuming they will use 26-bit multipliers and simply resort to a form of matrix multiplier for the doubles (which would go quarter speed). I always assumed that's how area-efficient single/double-precision FP multipliers worked, but I have to admit I've never bothered to actually look it up.
     
    #27 MfA, Feb 17, 2007
    Last edited by a moderator: Feb 17, 2007
  8. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Yeah, I agree. Honestly, I'm not sure whether G80+CUDA does that. The documentation clearly states that reads and writes are uncached, but this is an implementation detail, and many software programmers might not even think of this as "caching" if you mentioned it to them.
     
  9. Farhan

    Newcomer

    Joined:
    May 19, 2005
    Messages:
    152
    Likes Received:
    13
    Location:
    in the shade
    He's referring to the amount of hardware that increases quadratically with width (assuming a single cycle implementation), not the delay. Simplistic estimation for the mantissa calculation:
    FP32 - 24*24 full adders = 576
    FP64 - 53*53 full adders = 2809

    About 5 times more hardware just for the mantissa.
     
  10. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    I was referring to the "quarter speed".
     
  11. BeyondEnergy

    Newcomer

    Joined:
    Nov 19, 2006
    Messages:
    5
    Likes Received:
    0
    Stack data structure from CUDA

    Does CUDA allow one to write programs with a stack-type data structure? A lot of algorithms require a stack-like structure. How about hash tables?
     
  12. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Speed as in throughput.
     
  13. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Explicit or implicit stack? Explicit stack, yes: just use scratchpad RAM to make one. Implicit stack, as in CPU-style frame stacks? I don't think so; there is apparently no program-counter control/computed-jump instruction. Someone on GPGPU basically said no implicitly recursive functions.

    As for hash tables: again, as long as you've got scatter/gather, you can write one yourself.
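    A minimal sketch of the explicit-stack idea in CUDA (all names and sizes hypothetical): carve a fixed-size per-thread stack out of shared memory and manage the stack pointer yourself.

```cuda
#define THREADS_PER_BLOCK 64
#define STACK_DEPTH 8  // worst-case depth, fixed at compile time

__global__ void explicit_stack_demo(const int *in, int *out)
{
    // One STACK_DEPTH-entry stack per thread. Interleaving entries by
    // thread index keeps consecutive threads in different banks.
    __shared__ int stack[STACK_DEPTH * THREADS_PER_BLOCK];
    int sp = 0;  // per-thread stack pointer lives in a register

    // push: no bounds check shown; overflow must be ruled out by design,
    // i.e. you allocate for the worst case.
    int v = in[threadIdx.x];
    stack[sp++ * THREADS_PER_BLOCK + threadIdx.x] = v;

    // pop:
    int top = stack[--sp * THREADS_PER_BLOCK + threadIdx.x];

    out[threadIdx.x] = top;
}
```

    Note the space cost: STACK_DEPTH entries per thread come out of the same shared memory every thread in the block shares.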
     
  14. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    there's no 'real' stack on G80, you can just code your own implementation
     
  15. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    Stacks are hard on the GPU. Even with the PDC (Parallel Data Cache), you have to share the space with all threads in the warp, and you have to be careful about conflicts on bank access. In the GPGPU community, we adapt data-structure traversal to support "restart" or "backtrack" methods; see Foley et al.'s paper from Graphics Hardware last year or Horn et al.'s paper from the upcoming I3D, both on k-d tree raytracing. The latter emulates a small fixed-size stack using the register file and looping constructs instead of pure streaming.

    With scatter and gather, you could emulate stacks in GPU memory (and even in host memory on ATI chips, with fine-grained scatter), but it becomes *extremely* expensive. You are now talking about tremendous amounts of latency to cover, and you still have to define a bounded region of memory for each thread, basically allocating for the worst-case stack usage. Someone could probably extend Lefohn's GLift GPU data-structure work to make this easier to use, but it's likely still going to be expensive.

    The main issue with recursion or stacks is that the memory requirements are unbounded. On the CPU this really isn't a problem, but when you have to handle 100s-1000s of threads, the required space on a GPU (or on a CPU far enough into the future) gets quite high.
     
  16. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Memory isn't always (dynamically) unbounded with recursion. With tail-recursion it's constant, and even with some almost-tail-recursive functions (tail-recursion+mod, linear logic, etc) it's predictable.

    What you describe as "restart" or "backtrack" reminds me of a form of STM (Software Transactional Memory): I assume you mean that if you have a thread collision, or if you run out of space, you recompute the data structure/computation from a known-good state.
     
  17. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    Cheers! So basically you sync after writes to shared memory. Thanks again, this will be useful.
     
  18. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    You are correct, but tail recursion is generally converted into a loop form by the compiler to avoid the stack frame push, and hence no longer recursion from a resource perspective.

    Restart and backtrack in this case are not related to transactional memory, as we aren't talking about creation/manipulation of the data structure, just traversal. You rely on the properties of the algorithm to effectively change the query being issued through the tree. In the case of a k-d tree this can be done by changing the values of tMin and tMax during traversal.
     
  19. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    According to the docs, recursion is not supported in the current nvcc compiler, from what I recall.
     
  20. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    Just a note - having worked on this for a little while: that proposed solution does not actually enforce mutual exclusion at all. __syncthreads() alone won't do it; it's just a barrier. I'm probably going to have to look into the bakery algorithm or something similar, since there are no atomic instructions here at all, AFAIK.
     