AMD: R7xx Speculation

Discussion in 'Architecture and Products' started by Unknown Soldier, May 18, 2007.

Thread Status:
Not open for further replies.
  1. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
    On paper maybe, but in real-world apps there are worlds between the efficiency of the two solutions.
     
  2. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Nope, the Ultra is quite a bit faster than the GTX at Vantage Extreme. The GTX score looks too low though; it should be around 2500.
     
  3. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    While R580 got a fair boost from adding ALUs they were also math bound at the time. I'm not sure I'd go as far as calling RV670 limited by shader power.

    I could see all AA/AF being performed by the actual shaders, but they're still left with actually feeding data to those shaders quickly enough. I wouldn't be surprised if the TMUs got turned into a bunch of really large point samplers. That might even trim enough transistors that the 800 SPs become more reasonable. RV670 seemed to scale reasonably well with higher levels of AA/AF, so if they could feed in data faster things should balance out. Especially since the higher levels rely more on cached results than on fetching new unique values.
     
  4. MarkoIt

    Regular

    Joined:
    Mar 1, 2007
    Messages:
    392
    Likes Received:
    0
    But someone also made some calculations on the leaked chip footage... according to these calculations the chip won't exceed a 275 mm^2 die size.

    According to some sources, HD4850 scores about X26xx, so HD4870 could have a score higher than X3200 (+30% on HD4850).
    Today some Chinese source claimed that HD4870 could outperform the 8800U by 35% ...
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yeah, it's been bugging me too. So 5 SIMDs of 160 ALUs with the horrible batch size is still a contender.

    As I understand it, there's 3 levels to the control logic:
    1. Command Processor - this interfaces with the CPU (and other GPUs?). It configures pipeline state, e.g. the loading of shader programs into the 3 shader-types: VS, GS, PS; per shader-type thread counts; Z blend mode
    2. Sequencer(s) - this manages extant threads and how they are issued to the ALU and TU SIMDs. It tracks the status of each thread (can the thread be issued?, what's it waiting for?) and makes control-flow (dynamic branching) decisions, e.g. if all the elements in a hardware thread have completed a dynamic loop then exit the loop. It also handles the DMA of all data in and out of the register file and constant cache and it also initiates on-die data movements (e.g. moving completed fragments out of the register file into the RBEs)
    3. SIMDs - these take a clause of code (up to 128 ALU instructions on ALUs or 8 TU instructions on TUs) and run it to completion. In terms of control there's: instruction fetch/decode; operand fetching (which has complex rules due to the limited port bandwidth); constant fetching; constant buffer fetching; predication update; program counter
    Then there's the memory system, where every memory client has virtualised access to memory. I presume this is reasonably costly.

    I've failed so far to produce any kind of model of the costs. I just can't squeeze the 800-ALU-quart into the 250mm2 pint pot :sad:

    In my opinion reducing the clocking of batches is as unlikely as a different clock for the ALUs. I think it would be a huge architectural change.

    But hey, I thought 800 was impossible for space reasons - without doing a "custom" re-implementation. I suppose it's possible they have utilised custom logic...

    Jawed
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Register file memory should be very dense. Though I'm guessing that for porting reasons there are multiple copies of the register file (writes go to all copies - in reading, separate instances of the register file fetch distinct operands for the same clock).

    RV670 has 1MB of register file, before taking account of porting. If there's 4 ports created using separate instances, then that's 4MB. That's huge.

    RV770 would need 2.5MB (10MB) which is starting to look insane...
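
    The scaling above can be sketched in a few lines. This is only a back-of-the-envelope model: the 1MB figure and the 4-copy porting guess are taken from the post, the 320 and 800 ALU counts are the ones assumed elsewhere in this thread, and linear scaling of register file size with ALU count is itself an assumption.

```python
# Register-file scaling sketch (not a statement about the real hardware).
RV670_ALUS = 320          # ALU count assumed in this thread
RV770_ALUS = 800          # rumoured ALU count under discussion
RV670_REGFILE_MB = 1.0    # figure quoted in the post
PORT_COPIES = 4           # the post's guess: one register-file copy per port


def regfile_mb(alus, copies=1):
    """Scale the RV670 register file linearly with ALU count (assumption)."""
    return RV670_REGFILE_MB * alus / RV670_ALUS * copies


rv770_logical = regfile_mb(RV770_ALUS)                 # before porting
rv770_physical = regfile_mb(RV770_ALUS, PORT_COPIES)   # with 4 copies

print(f"RV770 register file: {rv770_logical} MB logical, "
      f"{rv770_physical} MB with {PORT_COPIES} copies")
```

    If the register file turns out not to be replicated per port, only the first number applies.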

    Jawed
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
  8. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Ah you're right (http://www.digit-life.com/articles3/video/vantage2.html) - 9800GTX around ~2600 (looks like the sucky memory management of the G80/G92 is an issue here with "only" 512MB video RAM) vs. ~2900 for the 9800U. But in any case the difference shouldn't be as large as in this table.
     
  9. leoneazzurro

    Regular

    Joined:
    Nov 3, 2005
    Messages:
    518
    Likes Received:
    25
    Location:
    Rome, Italy
    I meant that TMU+Logic (including register file, etc.)+caches have a big impact on total size, more than the ALUs themselves; I did not speak for the TMUs or caches alone (even if comparing different processes is difficult, it's known that Intel's caches are denser even at the same "nominal" process node). Finally, look at the RV635->RV670 comparison: size goes up only 70 mm^2 and you have almost 2.5 times the shaders, double the TMUs and double the bus width.
    So it's easy to believe that the shaders themselves are not very big.
     
  10. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Now that you mention it - that seems about the biggest point against 800/40/16 in RV770 - the 256mm² as a given.


    Because going from RV635 to RV670 you'd cram 200 ALUs, 8 TMUs and a 128-bit memory interface into the 70mm² (which I take for granted, I don't know the die size of RV635 myself).

    Now, RV770 is supposed to be about 64mm² larger than RV670, and the only thing you could take off the list is the 128-bit memory interface, which allegedly stays the same in RV770 (except for the mysterious non-Crossfire Crossfire card R700, which according to some would need some kind of chip-to-chip interconnect through the ring buses).

    So, with even less additional die space, you not only want to add 200 shaders and 8 TMUs but at least 480 shaders and 16 TMUs. Hm.
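
    To put rough numbers on that comparison, here is a crude sketch using only the figures quoted in this post. It charges all of the added die area to the ALUs, which is an assumption made purely for illustration: the TMUs, and in the RV670 step the extra memory interface, take their share too.

```python
# Crude "ALUs added per mm² of extra die area" comparison.
# All inputs are the figures quoted in the post; charging the whole
# area increase to ALUs is an illustrative simplification.

def alus_per_mm2(added_alus, added_area_mm2):
    """ALUs gained per mm² of additional die area."""
    return added_alus / added_area_mm2


rv670_step = alus_per_mm2(200, 70)   # RV635 -> RV670
rv770_step = alus_per_mm2(480, 64)   # RV670 -> rumoured RV770
ratio = rv770_step / rv670_step

print(f"RV635->RV670: {rv670_step:.2f} ALUs/mm²")
print(f"RV670->RV770: {rv770_step:.2f} ALUs/mm²")
print(f"RV770 step needs {ratio:.2f}x the density of the RV670 step")
```

    If a large part of the 70mm² went to the wider memory interface, the real gap is smaller than this ratio suggests.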

    Besides: does anyone here believe AMD's not going to stick to their FP16 single-cycle TMUs?
     
  11. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
    Does the use of a different 55nm node compared to RV670 change anything at all die-size-wise? (more tightly/loosely packed transistors?)
     
  12. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,455
    Likes Received:
    471
    What about additional 12 ROPs?

    Kaotik: Exactly. There were rumours about new process libraries many months ago. There was also a mention of a new transistor design. Maybe this is true, and the reason for it is higher transistor density. That could be the cause of the conservative clock speeds (like NV40 vs. R420 - same size, different density, a difference of 62 million transistors, lower clock speed for NV40).
     
  13. Wirmish

    Newcomer

    Joined:
    May 4, 2007
    Messages:
    160
    Likes Received:
    0
    Psycho is right, the RV670 is not square... my bad. :oops:

    [​IMG]

    [​IMG]

    RV670:
    Size: ----> ~14.44mm x ~13.30mm = 192mm²
    Pixels ---> 89 x 82

    RV770:
    Pixels ---> 103 x 102
    Length -> 103 / 89 = 1.1573 * 14.44mm -> ~16.7115mm
    Width ---> 102 / 82 = 1.2439 * 13.30mm -> ~16.5439mm
    Size -----> 16.7115mm * 16.5439mm = ~276.47mm²

    RV770 ~= 276 mm² (±5%) :yes:
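
    The same estimate as a small script, for anyone who wants to plug in their own pixel measurements. The dimensions and pixel counts are the ones listed above; the function name is made up for this sketch, and it assumes both die shots are at the same scale and scales each axis independently, as the calculation above does.

```python
# Estimate RV770 die size by scaling RV670's known dimensions with the
# pixel measurements taken from the two die shots.
RV670_MM = (14.44, 13.30)   # length, width in mm (approximate, from the post)
RV670_PX = (89, 82)         # measured pixels for RV670
RV770_PX = (103, 102)       # measured pixels for RV770


def scale_die(ref_mm, ref_px, target_px):
    """Scale each reference dimension by the target/reference pixel ratio."""
    return tuple(mm * t / r for mm, r, t in zip(ref_mm, ref_px, target_px))


length, width = scale_die(RV670_MM, RV670_PX, RV770_PX)
area = length * width

print(f"RV770: {length:.2f}mm x {width:.2f}mm = {area:.1f}mm²")
print(f"±5% band: {area * 0.95:.0f} to {area * 1.05:.0f} mm²")
```

    A one-pixel measurement error on either axis moves the result by roughly 1%, so the ±5% band is not overly cautious.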
     
  14. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    50% faster? That's what whocares' numbers are suggesting, and there's no data out there supporting that assertion.
     
  15. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Somehow I counted them under the memory interface, but you're right, they should be mentioned separately. But nonetheless - the ROPs seem to need a little rework from RV670 to RV770 too, so that does not change my overall view.
     
  16. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    Dave has explained multiple times that the ALU configuration of R6xx is super scalar.

    scalar | scalar | scalar | scalar | scalar+
     
  17. ZerazaX

    Regular

    Joined:
    Oct 29, 2007
    Messages:
    280
    Likes Received:
    0
    While I don't know exactly how much the 128-bit memory interface takes, do keep in mind exactly how much the die size shrank from 80nm to 55nm, from R600 to RV670, with the removal of the 256-bit external/512-bit internal ring bus. Remember there's an internal component too, so I'm sure it's not a small part of the die at all.

    Plus, add in the fact that supposedly they are on TSMC's premium 55nm process, and densities might be increased as well.
     
  18. Magnum_Force

    Newcomer

    Joined:
    Mar 12, 2008
    Messages:
    104
    Likes Received:
    70
    I have no idea how cache/register files/control logic work, but is it possible ATi built surplus capacity into their architecture for easy scaling (like RV635 to RV670, for example) and thus didn't really need to change it all that much?

    They put excess bandwidth into R600, so is it possible, or doesn't it work like that?
     
  19. revan

    Newcomer

    Joined:
    Nov 9, 2007
    Messages:
    55
    Likes Received:
    18
    Location:
    look in the sunrise ..will find me
    I think 800 SPs make sense.

    480 SPs at a 750MHz clock will not be enough to beat the 9800GTX by a 30% margin, as some rumors suggested.
    Also, 800 SPs / 750MHz clock / 256-bit GDDR5 would fit the 'terascale' claim in some leaked ATi marketing papers...
    More than that... we know from experience that ATi likes to refresh products at the end of the year; that should be easy to do by raising clocks, easier anyway than changing the marchitecture.
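
    The 'terascale' argument is easy to check with a quick peak-throughput calculation. This is a sketch with one stated assumption: each SP is counted as one multiply-add (2 flops) per clock. The SP counts and the 750MHz clock are the rumoured figures from this post.

```python
# Peak shader throughput for the two rumoured configurations.
# Assumption (not from the post): one multiply-add, i.e. 2 flops,
# per SP per clock.

def peak_gflops(sps, clock_mhz, flops_per_sp_per_clock=2):
    """Theoretical peak GFLOPS for a given SP count and clock."""
    return sps * clock_mhz * flops_per_sp_per_clock / 1000.0


g480 = peak_gflops(480, 750)
g800 = peak_gflops(800, 750)

for sps, gflops in ((480, g480), (800, g800)):
    verdict = "reaches" if gflops >= 1000 else "falls short of"
    print(f"{sps} SPs @ 750MHz: {gflops:.0f} GFLOPS, {verdict} 1 TFLOPS")
```

    Under that assumption only the 800 SP configuration crosses the 1 TFLOPS mark at 750MHz.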
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I guess that control logic doesn't grow very much. e.g. there's only 1 more SIMD in RV670, although the SIMDs do get twice as wide. The latter would indicate the SIMD control logic cost is halved per ALU.


    Compare the real RV635 and an imaginary one:
    • real: 3 SIMDs (each of 2 quads) + 2 quads TU (128KB of L2 cache) + 1 RBE
    • imaginary: 2 SIMDs (each of 3 quads) + 3 quads TU (192KB of L2 cache) + 1 RBE
    The imaginary one has fewer SIMDs, which implies less control logic (since each SIMD is independent). But the imaginary one has more TUs and more cache, and perhaps more control logic for the TUs.

    The imaginary RV635 should perform better on the face of it. Perhaps it would be bigger though, implying that the TUs cost more than control logic.

    Or, maybe when there's an L2 a 2:1 ALU:TEX ratio is not high enough - i.e. the performance gain isn't there?
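
    The two configurations can be laid side by side with a small sketch. The unit counts are the ones in the bullets above; the class and field names are invented for this illustration and say nothing about relative die area.

```python
# Compare the real and imaginary RV635 layouts from the bullets above.
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    simds: int            # independent SIMDs (each with its own control logic)
    quads_per_simd: int   # ALU quads per SIMD
    tu_quads: int         # texture unit quads
    l2_kb: int            # L2 texture cache in KB

    @property
    def alu_quads(self):
        return self.simds * self.quads_per_simd

    @property
    def alu_tex_ratio(self):
        return self.alu_quads / self.tu_quads


real = Config("real", simds=3, quads_per_simd=2, tu_quads=2, l2_kb=128)
imaginary = Config("imaginary", simds=2, quads_per_simd=3, tu_quads=3, l2_kb=192)

for c in (real, imaginary):
    print(f"{c.name}: {c.simds} SIMDs, {c.alu_quads} ALU quads, "
          f"{c.tu_quads} TU quads, ALU:TEX = {c.alu_tex_ratio:.0f}:1")
```

    Both layouts have the same ALU throughput; the imaginary one trades a SIMD's worth of control logic for an extra TU quad and 64KB more L2.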

    Jawed
     