LRB - ditching x86?

Discussion in 'Architecture and Products' started by rpg.314, Jun 9, 2009.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Only the block size is limited to 512 threads. Multiple blocks can execute simultaneously per core if registers and shared memory allow. On GT200, blocks_per_core * threads_per_block <= 1024.

    And Jawed, why would you need 1024 threads to hide latency?
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,999
    Likes Received:
    975
    Location:
    London
    I think the mixture of increments against intensely random addresses and a moderate collision rate makes histogramming heavily dependent on the memory system, which means parallel speed-up can only come if you can hide the latency.

    Maybe you can hide all the latency with just a few threads per core and a heavy-weight hashing function?

    Jawed
     
  3. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,850
    Likes Received:
    285
    Location:
    Taiwan
    I think the problem, past a certain number of threads, is no longer about hiding latency. It's more about the inefficiency of random access. In the worst case you may have to read, say, 32 bytes to use only 4 bytes, so even if you can hide the latency completely, you still lose memory bandwidth.

    However, judging from the results of the histogram sample in the CUDA SDK, it reaches only about 10GB/s, which is less than 10% of the theoretical bandwidth. The major issue is that with atomic shared memory operations, serialization becomes serious. In the worst case, if all eight threads access the same bank, you need 8 clocks to complete the operations instead of one.

    Histogramming is a problem that looks easily parallelizable but is actually very hard to parallelize over multiple cores, as the major bottleneck is the memory subsystem, not the ALUs.
     
  4. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Isn't the worst case rather the 16 threads of a half-warp having a 16-way shared memory bank conflict on the current architecture?
     
    #104 TimothyFarrar, Jun 17, 2009
    Last edited by a moderator: Jun 17, 2009
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,999
    Likes Received:
    975
    Location:
    London
    Which is why having lots of latency hiding is important - it's the life-blood of GPUs. Maybe do extra work to re-arrange the data so that it's less painful in the bin-increment phase.

    Jawed
     
  6. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,850
    Likes Received:
    285
    Location:
    Taiwan
    I did a simple experiment with a 16-bit histogram test (i.e. 65536 bins). The data set is 128M 16-bit numbers; the "image" is random data.

    CPU: Core i7 920
    CPU (1 thread): ~800MB/s.
    CPU (2 threads): ~1150MB/s.
    CPU (4 threads): ~1650MB/s.
    CPU (8 threads): ~1750MB/s.

    GPU: GeForce GTX 285
    First algorithm: multiple histograms, no atomics
    GPU (256 threads): ~ 500MB/s
    GPU (512 threads): ~ 650MB/s
    GPU (1024 threads): ~ 700MB/s

    Second algorithm: single histogram, with global atomics
    GPU (256 threads): ~ 500MB/s
    GPU (512 threads): ~ 850MB/s
    GPU (1024 threads): ~ 900MB/s
    GPU (16384 threads): ~ 900MB/s

    I'm using 64-bit Windows 7 and the profiler isn't working for some mysterious reason, so I can't make additional optimizations.

    Interestingly, even with the first algorithm using 1024 threads (which needs 256MB for storing histograms!) the reduction is extremely quick, since the data access pattern is completely linear.
     
  7. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Interesting results, although that's even worse than I expected for "random" data. At the very least this data demonstrates that this is one place where GPUs - and I'd argue partly due to the programming model - are doing far worse than they ought to relative to CPUs.
     
  8. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    Wow, hold on, the GTX 285 has trouble beating a single i7 core?!
    This is indeed worse than I expected it to be.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,999
    Likes Received:
    975
    Location:
    London
    Ooh, very interesting.

    On the CPU did you give each thread a local histogram and then do a reduction at the end?

    So atomics are faster once the GPU gets to 512 threads, but 1024 threads seems to be the limit for scaling - that's ~1 warp per core, which is pretty tragic.

    Doing this with Dickens would require indexing/hashing of the words, which should add a lot of computational work. Would this sway things more in favour of the GPU?

    I wonder how this would work on ATI, where there are only 10 cores fighting over memory?

    Jawed
     
  10. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,850
    Likes Received:
    285
    Location:
    Taiwan
    Yes, there is no need to let a puny 4 cores fight over the same histogram :)

    I think the problem is still the memory. Unfortunately, since the profiler is not working, I can't get more detailed information on why it's so slow (GT200 does have a quite impressive array of profiling registers). I'll try it under Windows XP when I have some free time.

    Possibly, but it's also more complex. Since word lengths are not equal, the memory access pattern is going to be even more unpredictable. Also, while the GPU does have more computation power, even the CPU has a lot of unused computation power when computing a normal histogram, which may already be enough for the indexing/hashing work.

    I'm not sure about this, but the problem is still mostly about how the memory controller copes with random access.
     
  11. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Results are interesting.

    128M 16-bit shorts as input into 65536 32-bit int bins. Did I read that correctly?

    And what algorithm did you use for this test?

    With regards to memory performance, does anyone know (or want to speculate) how the MCs carve up global memory based on address on the GT200 series? Jawed mentioned before that it has 8 64-bit MCs (makes sense, given that global atomics have 64-bit operations). That would be 64 bytes wide, yet they can do a 32-byte wide segment. So is it just two groups of 4 MCs, always fetching a 32-byte wide segment?

    Reason I'm asking is to get an idea of how global atomics work in parallel close to the MCs. Seems like they should be able to do at least 8 global atomics in parallel assuming they hit 2 32-byte segments without address collision. I wonder if the MCs can do 2 parallel 32-bit atomics or just one 32-bit or 64-bit atomic.

    A little OT, but I wonder if future GPU workflow would see a benefit to less shared address lines and smaller minimum segment size with the extra area and pin cost?
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,999
    Likes Received:
    975
    Location:
    London
    900MB/s is 900 / 2 = 450M samples per second. GTX285 is theoretically capable of 1476 * 240 = 354,240M instructions per second, so that's ~787 cycles per sample.

    While huge bin sets are unable to fit in a core's shared memory, you could partition bins across the cores' shared memories - so bin 21574 only exists in core 9's shared memory (65536 / 30 = 2185 bins per core). In this scenario every core evaluates every sample to see if it's owned by that core and does a shared memory atomic update if so.

    Would this be faster than 787 / 30 cores = 26 cycles per sample?

    But this approach doesn't scale - e.g. a GPU with only 4 cores can't do this.

    So then you need some more elaborate scheme. I can't help thinking that you could simply multi-pass the samples, with 4096 bins per core, e.g. with 4 cores the first pass does bins 1-16384, the second pass does 16385-32768, etc.

    Jawed
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,999
    Likes Received:
    975
    Location:
    London
    This is my understanding: each MC is connected to a pair of memory chips. Each chip has a 32-bit interface. The minimum burst length is 4, which gives a 16-byte burst per chip, or a 32B burst per MC.

    GDDR5 has a minimum burst length of 8, so if you want 32B bursts then you want to have 32-bits of IO per MC. Otherwise you end up with 64B bursts.

    Jawed
     
  14. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    So am I understanding this correctly then:

    For the 8 MCs, global memory would be banked 8-way with 32B per way,
    compared with shared memory being banked 16-way with 4B per way?

    It's too bad I'm not at home, I could just test CUDA write performance to figure this stuff out...
     
  15. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I am no CUDA guru, but I think this is an unfair match-up (for the GPU): 64K bins at 4 bytes each = 256KB, which fits in the private L2 cache of each core on an i7. The image data is just streamed through.

    IMHO, a better way would be to make the number of blocks proportional to the amount of data along the x axis and have 16 blocks in the y direction. Along the y axis, split up the bins so that each block row tackles only 4K bins (to keep them in shared memory) and use shared atomics. Global memory latency is known to be 500-700 cycles, so Jawed's estimate of ~787 cycles per sample suggests that here you are experiencing the full blast of the memory latency. Using shared memory would be much better, IMHO.

    IIRC, there have been histogram generation apps demoed on CUDA Zone which show a real speed-up. Maybe looking at those would be a good idea.
     
  16. nutball

    Veteran Subscriber

    Joined:
    Jan 10, 2003
    Messages:
    2,271
    Likes Received:
    638
    Location:
    en.gb.uk
    It's not an unfair match-up if that's the real-world problem you happen to have to solve.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,999
    Likes Received:
    975
    Location:
    London
    Seems right to me.

    The next question is how memory addresses are tiled/striped across MCs... What happens on GPUs with non-power-of-2 MC count?

    Also, since banking is a feature of the internal architecture of memory chips, it can get confusing to talk about "banks" in DDR - just bear that potential for confusion in mind.

    Jawed
     
  19. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,850
    Likes Received:
    285
    Location:
    Taiwan
    I was thinking about this, and I have an idea. Although the shared memory is 16KB, it can't be completely used because some of it is reserved for other purposes. And since a GeForce GTX 285 has 30 multiprocessors, each core needs to store 2185 bins.

    To avoid repeated reading from the same addresses, maybe the texture cache can be used. However, I'm not sure whether the texture cache is shared among the cores. On the other hand, it's really not limited by read bandwidth right now, so even a 30x waste is, maybe, acceptable :)
     
  20. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Isn't TEX L1 shared across 3 cores, and TEX L2 shared across all?
     