AMD: R8xx Speculation

Discussion in 'Architecture and Products' started by Shtal, Jul 19, 2008.


How soon will Nvidia respond with GT300 to upcoming ATI-RV870 lineup GPUs

Poll closed Oct 14, 2009.
  1. Within 1 or 2 weeks

    1 vote(s)
    0.6%
  2. Within a month

    5 vote(s)
    3.2%
  3. Within couple months

    28 vote(s)
    18.1%
  4. Very late this year

    52 vote(s)
    33.5%
  5. Not until next year

    69 vote(s)
    44.5%
  1. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    Juniper is the 180mm2 sku (probably the wafer they showed) and Cypress is supposed to be 300-350mm2, which obviously means Cypress is the highest end single chip solution.
     
  2. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,971
    Likes Received:
    1,458
    Location:
    Guess...
    Ah I see. And what's GT3xx supposed to be again? About 450mm2?

    I see what all the fuss is about, then, if Cypress is a single-chip solution. GT3xx will indeed have to be very quick to make its size disadvantage worth it. And of course an X2 variant of Cypress could be extremely fast.
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,018
    Likes Received:
    1,010
    Location:
    London
    Scatter isn't coherent across strands in any meaningful sense. Not only that, but scatter implies arbitrary write order to colliding addresses. D3D11 basically says you're on your own if you use colliding addresses from distinct strands. So there's no coherency to maintain, as far as I can tell.

    During normal pixel shading, render back end (colour write, blend) operations happen in cluster memory associated with the TU/RBE. Fetching data from memory is no different from fetching a tile of texels, conceptually. Say the fetch is actioned through L2, which lives close to the MCs.

    Atomics are obviously different, because the memory system is explicitly told to serialise accesses to each given address (but not to serialise the entire set of all accesses by all strands). This functionality would stay outside of the clusters.

    I presume that if a kernel does a non-atomic read of an address at the same time as that address is a candidate for atomic updates by that kernel, then it's up to the programmer to fence these properly, otherwise suffer the consequences of indeterminate ordering.
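    The hazard described here can be sketched with a CPU-side analogy (Python threads, with a lock standing in for the memory system's per-address serialisation - this is illustrative, not GPU code):

```python
# CPU-side analogy of atomic updates to one address: the lock models the
# memory system serialising colliding read-modify-writes, while the order
# in which threads win the lock remains arbitrary.
import threading

counter = 0                      # the "address" receiving atomic updates
atomic_unit = threading.Lock()   # serialises colliding read-modify-writes

def atomic_add(n):
    global counter
    for _ in range(n):
        with atomic_unit:        # each individual update is indivisible...
            counter += 1         # ...but inter-thread ordering is arbitrary

threads = [threading.Thread(target=atomic_add, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
# An unfenced read of `counter` here could observe any intermediate value --
# exactly the indeterminate ordering the programmer must fence against.
for t in threads:
    t.join()                     # the join acts as the fence
total = counter                  # only now is the value well defined
```

    Each update is exact because it is serialised per address, but only the fence (the join) makes the final value observable deterministically.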

    The key aspect to me is that a pixel shader is allowed to fetch data from its position in all render targets currently bound (8 distinct buffers + Z/stencil). This is logically the same as fetching from a set of non-compressed textures. The pixel shader is then able to update all of those buffers, again solely for its position.

    These operations are a bit like reading/writing shared memory. I think it was Trinibwoy who suggested a while back that NVidia could do ROP processing in the multiprocessors using shared memory as a buffer.

    Currently, in ATI's architecture, shared memory is idle while pixel shading. It's only usable when running a compute kernel. So, LDS might be a candidate for this kind of usage.

    I suspect the more-stringent texture-filtering requirements of D3D11 might make ATI return to single cycle fp16 filtering, which then provides the precision required to perform 8-bit pixel blending at full speed.

    Z-test, hierarchical-Z, colour/z (de-)compression for render targets sounds like something that should stay close to the MCs. Clearly the tiled nature of rendering buffers makes it possible to separate the reading-from-memory/de-compression of buffers from the RBE operations and then the compression/writing-to-memory. These operations are all atomic at the tile level, and nominally only one cluster is performing atomic updates on any given tile. The question then becomes one of the added latencies that arise in moving render buffer data into clusters and then back. I'm not convinced the latencies matter, per se.

    The simple case of append, with any kind of structured data, has no strict ordering defined as far as I can tell. i.e. each cluster can generate a local tile of data to be appended - when the tile is full it can be posted to the memory system to be slotted (and compacted?) into its destination in memory.
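    The tile-at-a-time append could be sketched like this (the producer count, tile size, and slot lock are all invented for illustration, not anything AMD has documented):

```python
# Sketch: each producer fills a local tile, and posts it to the shared
# destination when full. All records arrive, but their interleaving in the
# destination is unspecified -- matching append's lack of strict ordering.
import threading

TILE = 4
items = list(range(32))                 # records to append, split among producers
global_buffer = []                      # the destination in "memory"
slot_lock = threading.Lock()            # reserves destination space atomically

def producer(chunk):
    tile = []                           # cluster-local tile
    for x in chunk:
        tile.append(x)
        if len(tile) == TILE:           # tile full: post it to memory
            with slot_lock:
                global_buffer.extend(tile)
            tile = []
    if tile:                            # flush any partial tile at the end
        with slot_lock:
            global_buffer.extend(tile)

chunks = [items[i::4] for i in range(4)]
threads = [threading.Thread(target=producer, args=(c,)) for c in chunks]
for t in threads: t.start()
for t in threads: t.join()
```

    The contents of `global_buffer` are complete and compact, but the order depends on which producer posted each full tile first.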

    I can't see why both ATI and NVidia GPUs couldn't work this way, to be honest.

    Jawed
     
  4. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Add 100mm2 to it and be safe.

    ding-ding-ding-ding-ding!
     
  5. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,248
    Likes Received:
    3,193
    Location:
    Finland
    The latest takes on the codenames I've seen were
    Hemlock = "R800"
    Cypress = "RV870"
    Juniper = "RV830"
    Redwood = "RV810"
    Cedar = "RV810"

    And
    Cypress = "R800"
    Redwood = "RV870"
    Juniper = "RV830"
    Cedar = "RV810"
    Hemlock = "RV810"
     
  6. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    There you go.
     
    #1746 neliz, Aug 18, 2009
    Last edited by a moderator: Aug 18, 2009
  7. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,248
    Likes Received:
    3,193
    Location:
    Finland
    So you're suggesting Cedar will be a higher-performance part than Juniper?
     
  8. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,971
    Likes Received:
    1,458
    Location:
    Guess...
    Hmm, so Cypress is expected, based purely on the Vantage numbers, to be a single-chip GPU approximately 30-40% faster than a GTX 285, while Hemlock would be an X2 variant of that?

    Sounds pretty good if true, especially if the size difference is as pronounced as expected. On the other hand, I'm expecting GT300 to be somewhere between 50 and 80% faster than a 285, so it's going to be an interesting battle. It certainly seems unlikely that GT300 will be able to keep up with Hemlock, but it's also going to be smaller (cheaper) overall, as well as single versus dual chip.

    In many ways it sounds similar to this generation before the 295 was introduced. Except this time AMD will be first out of the gate.
     
  9. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Suggesting?
     
  10. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,248
    Likes Received:
    3,193
    Location:
    Finland
    Knowing, guessing, suggesting, letting us know, all the same to me :lol:
     
  11. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Yes, the order of writes is arbitrary. However, if reads/writes pass through the RBEs, and if the RBEs shared tiles (or cachelines, or whatever), they would have to later join the results of those accesses on the shared tiles (rough coherency: colliding write order doesn't matter, but writes cannot be lost when the RBEs compress/write a tile back to GDDR). Atomics break the idea of sharing tiles. Clearly I don't think this is what they are doing...

    Assuming only self-update (say, a full-screen pass) and no overlapping pixels, one could do this on current hardware, assuming the API allowed it.

    A possible DX11 problem case is when, say, one is tessellating a surface and fragments, in the same patch for example, overlap the same destination pixel/sample location. Even though the ROP is ordered, fetching prior render target data is unordered (to my knowledge), and thus much less useful.

    Note that the above tessellation case would cause problems for a software ROP in the ALUs. Locking down a tile/line/block of memory to ensure atomic access during ROP processing would be crazy bad (less bad if all ALU ops for the ROP are done together, like an ATI clause, with tile/line/block access through L2). The ATI clause model could actually be very powerful with regards to atomics if the developer had native API access...
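    A minimal CPU-side sketch of the tile-locking idea, assuming a commutative (additive) blend so the result is order-independent; the framebuffer size, tile size, and lock-per-tile scheme are hypothetical:

```python
# Sketch: a per-tile lock makes each ROP read-modify-write atomic for that
# tile. With a commutative blend (addition), colliding order doesn't matter,
# so the final framebuffer is deterministic.
import threading

W = H = 8
TILE = 4
framebuffer = [[0] * W for _ in range(H)]
# one lock per tile, standing in for locking a tile during ROP processing
tile_locks = {(tx, ty): threading.Lock()
              for tx in range(W // TILE) for ty in range(H // TILE)}

def blend(fragments):
    for (x, y, v) in fragments:
        lock = tile_locks[(x // TILE, y // TILE)]
        with lock:                  # read-modify-write is atomic per tile
            framebuffer[y][x] += v  # additive blend: order-independent

# three "clusters" each blend one fragment of value 1 into every pixel
frags = [(x, y, 1) for y in range(H) for x in range(W)]
threads = [threading.Thread(target=blend, args=(frags,)) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```

    With a non-commutative blend (e.g. "over"), the lock alone would not be enough - fragments would also need to be ordered, which is the sorting problem discussed below.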
     
  12. w0mbat

    Newcomer

    Joined:
    Nov 18, 2006
    Messages:
    234
    Likes Received:
    5
    Too much speculation for me today - I'm passing out.
     
  13. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    Not on this board though ;)
     
  14. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,515
    Likes Received:
    442
    Location:
    Varna, Bulgaria
    Come out of bed and speculate like real men! :lol:
     
  15. mao5

    Regular

    Joined:
    Apr 14, 2004
    Messages:
    276
    Likes Received:
    5
  16. psolord

    Regular

    Joined:
    Jun 22, 2008
    Messages:
    444
    Likes Received:
    55
    So now we have these rumored scores:

    Is it me or is there a hole somewhere between Cypress and Juniper?

    So if Cypress fills the $300 segment, Juniper should be around the $150 segment. In the last generation there was the 4850 at $200 and the 4870 at $300, fairly priced as well, since the 4870 was about 30% faster, hence deserving the premium.

    So from $150 to $300 there is a gap, both performance-wise and product-wise. Methinks the 4850 was ATI's most profitable part of the last generation, and they should have a product to replace it.

    Unless of course these Cypress scores of ~P16xxx - P17xxx - P18xxx suggest that they could be something more like P15xxx to P20xxx or more, and they are once again hiding facts!

    I still don't see how there could be a lower-end Cypress, bar one with lower core and memory frequencies. That should be easily overcome with overclocking and voltage tweaks, though. I don't think GDDR3 is an option again for the "5850".

    PS Hey i can edit now, Yeepeeee! :p
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,018
    Likes Received:
    1,010
    Location:
    London
    The sharing is transient - any fenced-clause that does scatter implicitly makes all reads incoherent within that clause, if they fetch from any address that is the target of a scatter. The only exception is if the scatter address(es) is explicitly private to the strand.

    Even without tessellation this is a basic problem. The same pixel might appear on several triangles that are in flight at the same time. I honestly don't know how developers are going to make much use of pixel read/write when the render target is constantly changing.

    I don't understand what it is about the tessellation case that creates a problem here. Tessellation is followed by setup then rasterisation. Tessellation is transparent as far as a fragment is concerned.

    The LDS "exchange" mechanism is atomic (i.e. write/read are indivisible), because a clause encapsulates the LDS instructions. But ATI cannot currently support an atomic: "import data into registers, execute ALU cycles, export data from registers" because that's 3 clauses - ALU instructions cannot operate on non-register memory, except for fetching from constant buffers.

    Apart from that the other problem with a software back end is that the pixel operations need to be sorted by primitive ID. NVidia treats this problem in two parts (not all fragments need to be sorted and if any do they're score-boarded for exclusive pixel shading) but I don't know how ATI handles it. I think it's handled by post-shading arbitration.

    I'm still unclear how many triangles can own fragments in a pixel shading thread on ATI, I think it may only be 1. If so that means there are, at most, 128 triangles in flight at a time, per cluster. Post-shading arbitration presumably checks fragments' triangle IDs as the fragments arrive and decides if they're allowed to proceed to back end or have to wait until their predecessors have all been processed. Still, in theory this is potentially an absurd amount of data hanging around waiting for one wanton triangle's fragments to arrive - so I'm puzzled what ATI's actually doing here. Some of this problem can be solved by making the thread scheduler constrain ordering, e.g. preferring to stall pixel shading on threads that are newer than the wanton thread, to prevent the arbiter's fragment queue from filling.

    So a software RBE solution would require that the fragments are effectively sorted by triangle ID. It would be like a mini A-buffer, each pixel having a private queue of fragments.
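    That per-pixel fragment queue could work roughly like this sketch (the fragment data and the blend are invented for illustration): fragments arrive in arbitrary shading-completion order, and the resolve step sorts them by triangle ID before blending:

```python
# Sketch of a mini A-buffer resolve for one pixel: fragments arrive in
# arbitrary order, and sorting by triangle ID restores submission order
# before the (non-commutative) "over" blend is applied.
fragments = [
    (2, 0.5, (0, 0, 255)),   # (triangle_id, alpha, colour) -- hypothetical data
    (0, 1.0, (255, 0, 0)),
    (1, 0.5, (0, 255, 0)),
]

def resolve(frags):
    colour = (0, 0, 0)
    for _, alpha, src in sorted(frags):          # restore submission order
        colour = tuple(int(alpha * s + (1 - alpha) * d)
                       for s, d in zip(src, colour))
    return colour

pixel = resolve(fragments)   # tri 0 opaque red, then green over, then blue over
```

    Without the sort, the same three fragments would blend to a different colour depending on completion order, which is exactly why the queue has to be keyed by primitive ID.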

    So a unified TU/RBE unit would need an arbiter on its input, but would then happily, and by default, perform atomic back-end operations.

    Jawed
     
  18. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Oh dear, now the Asians listen to Charlie!?!

    Just like there's a hole between RV770 and RV730?
     
  19. psolord

    Regular

    Joined:
    Jun 22, 2008
    Messages:
    444
    Likes Received:
    55
  20. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.