Nvidia GT300 core: Speculation

Discussion in 'Architecture and Products' started by Shtal, Jul 20, 2008.

Thread Status:
Not open for further replies.
  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Was there the expectation that it wouldn't serialize accesses to the same location during atomic ops?
     
  2. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Good point.

    Grid was {512,1} blocks and {16,16} threads per block (8 warps, 256 threads). I think this should have been enough to saturate both cards.

    Actually you are right that an increase in the number of parallel units (ROPs or MCs or whatever the hardware uses) able to do atomics would likely factor into the increased performance in the non-100%-collide case.

    Also I'm not sure how CUDA distributes blocks to cores, so this would likely also affect how many atomic operations collided on the same memory segment (in my test).
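
For reference, the launch geometry quoted in this post can be checked with a few lines of arithmetic. This is a minimal sketch in Python, not the original test code; the function name is hypothetical and the warp size of 32 is assumed to be the usual CUDA value.

```python
WARP_SIZE = 32  # assumed: threads per warp on CUDA hardware

def launch_geometry(grid, block, warp_size=WARP_SIZE):
    """Return (threads_per_block, warps_per_block, total_blocks, total_threads)."""
    threads_per_block = block[0] * block[1]
    # Round up: a partially filled warp still occupies a whole warp.
    warps_per_block = -(-threads_per_block // warp_size)
    total_blocks = grid[0] * grid[1]
    total_threads = total_blocks * threads_per_block
    return threads_per_block, warps_per_block, total_blocks, total_threads

# Grid {512,1} and block {16,16}, as stated in the post.
geometry = launch_geometry(grid=(512, 1), block=(16, 16))
print(geometry)  # (256, 8, 512, 131072)
```

The 256 threads and 8 warps per block match the figures given in the post.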
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    It's serialising memory accesses. If this was fully cached then it would be cheaper.

    If this was a pixel blend, then I guess it's flushing the result to memory upon completion, on the assumption that cache isn't big enough to hold the pixel until the next blend operation arrives. So it has to read the pixel back every time a new blend (atomic) comes along.

    The multiple atomics case is just like having lots of pixels in flight being blended separately.

    Jawed
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    It would seem Nvidia went for the simplest route in this case.
    Write back with no route for an on-chip RAW attempt.
     
  5. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    BTW, a few days ago I did at least one quick test on the 8600 GTS in which I added instructions into the loop to see if I could get concurrent execution in the 100% collide case, and it seemed as if this was the case. I could vary the number of instructions following the global atomicAdd and wouldn't see an effect on run-time until I got around/over the expected number of cycles of latency.

    I really need to double check those results again, but it seemed to me as if the ALUs were issuing the atomic operations without serializing execution (so atomic operation ALU work happening outside the ALUs)?
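
The reasoning behind this experiment can be illustrated with a toy model. The sketch below is only that: the function names, the 400-cycle atomic latency and the 4 cycles per instruction are all assumptions chosen for illustration, not measurements from the 8600 GTS. If the atomic is issued without stalling the ALUs, the following instructions overlap with its latency and run-time stays flat until the added work exceeds that latency; if the ALUs stalled instead, every added instruction would lengthen the loop.

```python
def overlapped_iteration_cycles(extra_instructions, atomic_latency,
                                cycles_per_instruction=4):
    """Toy model: ALU work after the atomic overlaps with the atomic's latency."""
    alu_cycles = extra_instructions * cycles_per_instruction
    return max(atomic_latency, alu_cycles)

def serialized_iteration_cycles(extra_instructions, atomic_latency,
                                cycles_per_instruction=4):
    """Contrast case: the ALU waits for the atomic, so the two costs add up."""
    return atomic_latency + extra_instructions * cycles_per_instruction

ASSUMED_LATENCY = 400  # hypothetical atomic latency in cycles

for n in (0, 50, 100, 150):
    print(n,
          overlapped_iteration_cycles(n, ASSUMED_LATENCY),
          serialized_iteration_cycles(n, ASSUMED_LATENCY))
```

In the overlapped case the per-iteration cost only starts rising once the added instructions cost more than the assumed latency, which is the flat-then-rising pattern described in the post.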
     
  6. KonKort

    Newcomer

    Joined:
    Dec 29, 2008
    Messages:
    89
    Likes Received:
    0
    Location:
    Germany, Ennepetal
    GT300 will have around 2.4 billion transistors and be under 490 sq mm.

    Source: Hardware-Infos
     
  7. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
    Sounds very promising. A lesser die size gap than GT200 vs RV770 with (from the sounds of it) a larger performance gap.
     
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,062
    Likes Received:
    3,119
    Location:
    New York
    2400m at 490mm^2 will still be at a significant trans/mm^2 disadvantage (based on RV740). Notwithstanding the uselessness of that particular metric, it still points to the likelihood that they weren't able to match AMD's density.
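
The density comparison can be made concrete with a short calculation. A hedged sketch: the GT300 figures are the rumoured ones quoted in this thread, while the RV740 figures (826 million transistors, 137 mm^2) are not stated here and are assumed from its commonly cited specifications.

```python
def density_mtrans_per_mm2(transistors_millions, die_area_mm2):
    """Transistor density in millions of transistors per square millimetre."""
    return transistors_millions / die_area_mm2

# Rumoured GT300 figures from the thread, treating "under 490" as 490.
gt300_rumoured = density_mtrans_per_mm2(2400, 490)

# Assumed RV740 figures (not given in the thread).
rv740_assumed = density_mtrans_per_mm2(826, 137)

print(round(gt300_rumoured, 2), round(rv740_assumed, 2))
print(round(gt300_rumoured / rv740_assumed, 2))
```

A smaller actual die area would raise the GT300 figure, so the result is best read as a lower bound on its density under these assumptions.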
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    With <=32 warps per multiprocessor (but 24 on 8600GT), that's 4 blocks per multiprocessor. At any one time there are 4 blocks per multiprocessor * 30 multiprocessors = 120 blocks in flight.
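
The blocks-in-flight estimate is simple integer arithmetic, sketched here in Python. The function name is hypothetical, the 8 warps per block come from the test's 16x16 thread blocks, and the count of 4 multiprocessors for the 8600-class part is an assumption not stated in the thread. The sketch considers only the warp limit and ignores register and shared-memory limits.

```python
def blocks_in_flight(max_warps_per_mp, warps_per_block, multiprocessors):
    """Return (blocks_per_multiprocessor, total_blocks_in_flight).

    Only the warp limit is modelled; register and shared-memory
    limits are ignored.
    """
    blocks_per_mp = max_warps_per_mp // warps_per_block
    return blocks_per_mp, blocks_per_mp * multiprocessors

# GT200 figures from the post: 32 warps per multiprocessor, 30 multiprocessors.
print(blocks_in_flight(32, 8, 30))  # (4, 120)

# 24-warp limit from the post; 4 multiprocessors is an assumption.
print(blocks_in_flight(24, 8, 4))   # (3, 12)
```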

    The only issue with this is occupancy. How many registers are allocated for the kernel? If you can allocate a huge amount of shared memory per thread, that might be a way to falsely lower the number of simultaneous warps/blocks per multiprocessor. Dunno if this will actually work to be honest.

    It should be possible to work out the number of colliding addresses for each of your overlaps. Then you can get an average number of collisions for each of the 7 MCs, assuming that memory segments are distributed evenly, round-robin.

    Blocks can't be split across multiprocessors. The collision count is purely defined by overlap and the size of variable. Presumably you used a 32-bit atomic (only 32- and 64-bit global atomic variables are possible I think). GT200 has a 128-byte segment size for 32-bit variables, but it can halve that for independent memory operations, so 16 addresses:
    • overlap = 1 : 512 blocks / 16 addresses per segment = 32 colliding addresses and 16 * 8 warps * 2 half-warps = 256 collisions per address
    • overlap = 16 : 512 blocks / 1 address per segment = 512 colliding addresses and 8 warps * 2 half-warps = 16 collisions per address
    Erm, I think that's how it works...
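
The two cases above can be reproduced with a small calculation. This sketch follows the arithmetic exactly as given in the post and makes no claim about how the hardware actually coalesces atomics; the function name is hypothetical, and the 16 addresses per segment are derived as a 64-byte half-segment divided by 4-byte variables.

```python
SEGMENT_BYTES = 128 // 2  # halved segment, as described in the post
WORD_BYTES = 4            # 32-bit atomic variable
ADDRESSES_PER_SEGMENT = SEGMENT_BYTES // WORD_BYTES  # 16

def collision_estimate(blocks, addresses_per_segment,
                       warps_per_block=8, half_warps_per_warp=2):
    """Return (colliding_addresses, collisions_per_address) as in the post."""
    colliding_addresses = blocks // addresses_per_segment
    collisions_per_address = (addresses_per_segment
                              * warps_per_block * half_warps_per_warp)
    return colliding_addresses, collisions_per_address

overlap_1 = collision_estimate(512, ADDRESSES_PER_SEGMENT)  # 16 addresses share a segment
overlap_16 = collision_estimate(512, 1)                     # one address per segment
print(overlap_1)   # (32, 256)
print(overlap_16)  # (512, 16)
```

In both cases the product of the two numbers is 8192, which is 512 blocks times 16 half-warps per block, so the estimate only redistributes the same set of accesses.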

    Jawed
     
  10. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,249
    Likes Received:
    3,419
    Well, it says "less than 490mm^2", doesn't it...
     
  11. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,742
    Likes Received:
    152
    Yes, but probably not by much...

    If it was less than 480mm^2, surely it would have read < 480mm^2. Let's split the difference and say 485mm^2.

    But... if we are to believe the specs then GT300 should offer a minimum of 3x RV740 performance to a maximum of 4x. 3x RV740 die size would be ~411mm^2, 4x 548mm^2, and 3.5x 480mm^2. From a performance/mm^2 standpoint they appear quite similar. After all it's not how many transistors so much as what you do with them.
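
The scaled die sizes can be checked as follows. A minimal sketch: the RV740 die area of 137 mm^2 is not stated in the post but is implied by its "3x ... ~411mm^2", and the 485 mm^2 GT300 figure is the split-the-difference guess from the same post, not a confirmed number.

```python
RV740_DIE_MM2 = 411 / 3  # 137 mm^2, implied by the post's 3x figure

def scaled_die_area(scale, base_mm2=RV740_DIE_MM2):
    """Die area of a hypothetical chip that is `scale` times an RV740."""
    return scale * base_mm2

for scale in (3, 3.5, 4):
    print(scale, scaled_die_area(scale))

# How many RV740-sized dies the guessed 485 mm^2 corresponds to.
guessed_gt300_mm2 = 485
equivalent_scale = guessed_gt300_mm2 / RV740_DIE_MM2
print(round(equivalent_scale, 2))
```

Under these assumptions the guessed die size lands near the middle of the 3x to 4x performance range, which is the basis for the claim that performance per mm^2 looks similar.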
     
  12. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,249
    Likes Received:
    3,419
    Hint: GT200b is 490mm^2. So it's more like "less than GT200b" really.

    How's that? We don't know anything about G300's architecture at the moment.
     
  13. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,742
    Likes Received:
    152
    Should I really have to do this?

    Since you apparently know GT300's die size why don't you go ahead and post it?
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    http://www.brightsideofnews.com/news/2009/5/12/nvidias-gt300-is-smaller2c-faster-than-larrabee.aspx

    Later the article raises an eyebrow over the single-PCB GTX295 that's coming. Yes, it definitely would be interesting if NVidia launched a "GTX395" concurrently with "GTX380", using a single board for 2 GPUs.

    But, if a single board is so good, why hasn't NVidia done it already?

    Jawed
     
  15. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Seems to me that they went for the quick 'n' dirty solution.
    Who knows... Perhaps they thought they weren't going to sell enough of them to do a major rework of the GT200 PCB. And perhaps they reconsidered when they were selling more GTX295s than they predicted... Or perhaps because the GT300 will use a PCB very similar to the GT200, and they figured they'd get a headstart on a 2-GPU PCB this time...
     
  16. ChrisRay

    ChrisRay R.I.P. 1983-
    Veteran

    Joined:
    Nov 25, 2002
    Messages:
    2,234
    Likes Received:
    26
    I personally don't think a single board is better for the consumer unless you water cool. I think dual PCBs are generally superior from a thermal standpoint and PCB/weight considerations. But it might be more cost effective for Nvidia.
     
  17. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Cooling is only an issue when you try to vent through the back ... for a gamer card I just don't see why they would have to religiously stick to that.
     
  18. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    FWIW, I wouldn't trust Theo's or Hardware-Infos' info. They've been wrong on too many occasions and according to CJ are just taking stabs in the dark, emailing him asking for confirmation on their guesses. :huh:

    Maybe CJ can enlighten us what is going on in the background? :runaway:
     
  19. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
  20. bowman

    Newcomer

    Joined:
    Apr 24, 2008
    Messages:
    141
    Likes Received:
    0
    I don't even click 'Bright Side of News' links any more, it's almost always more Theo Valich drooling.
     