AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

Total voters: 155. Poll closed.
I may be completely wrong, but I thought Juniper was the highest-end single-chip solution and Cypress was going to be a dual-chip solution?

If that's the case, then what makes this better than the previous generation?
Juniper is the 180 mm² SKU (probably the wafer they showed) and Cypress is supposed to be 300-350 mm², which obviously means Cypress is the highest-end single-chip solution.
 
Ah, I see. And what's GT3xx supposed to be again? About 450 mm²?

I see what all the fuss is about if Cypress is a single-chip solution, then. GT3xx will indeed have to be very quick to make its size disadvantage worthwhile. And of course an X2 variant of Cypress could be extremely fast.
 
Given a unified TU/RBE, random RT access implies read/write from non-local RBEs (or local fetch and some crazy "tile" cache coherency, which I'm guessing is highly unlikely).
Scatter isn't coherent across strands in any meaningful sense. Not only that, but scatter implies arbitrary write order to colliding addresses. D3D11 basically says you're on your own if you use colliding addresses from distinct strands. So there's no coherency to maintain, as far as I can tell.
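To make the collision case concrete, here's a minimal CUDA-flavoured sketch (my illustration only, nothing to do with any particular TU/RBE design): when two strands (threads) scatter to the same index, whichever write retires last wins, and nothing defines which that is.

```
// Illustrative CUDA kernel: many "strands" (threads) scatter into one buffer.
// When two threads collide on the same address, the surviving value is
// whichever write the memory system happens to retire last - no ordering
// is defined, exactly as D3D11 warns.
__global__ void scatter(int *dst, const int *indices, const int *values, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[indices[i]] = values[i];   // colliding indices => arbitrary winner
}
```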

During normal pixel shading, render back-end (colour write, blend) operations happen in cluster memory associated with the TU/RBE. Fetching data from memory is no different from fetching a tile of texels, conceptually. Say the fetch is actioned through the L2, which lives close to the MCs.

Atomics are obviously different, because the memory system is explicitly told to serialise accesses to each given address (but not to serialise the entire set of all accesses by all strands). This functionality would stay outside of the clusters.

I presume that if a kernel does a non-atomic read of an address at the same time as that address is a candidate for atomic updates by that kernel, then it's up to the programmer to fence these properly, otherwise suffer the consequences of indeterminate ordering.
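Something like this CUDA-style sketch is what I have in mind (again just an analogy, not a claim about ATI's memory pipeline): the atomics are serialised only per address, and a plain load of a location that other strands are atomically updating is only well-defined once the reader has synchronised, e.g. at a kernel boundary or via an explicit fence/flag protocol.

```
// Illustrative CUDA kernel: atomics serialise per address only. Two atomicAdds
// to different bins proceed independently; two to the same bin are queued.
__global__ void histogram(unsigned *bins, const unsigned char *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // serialised only against the same bin

    // A plain, unfenced load of a bin here would race with the atomics above
    // and could observe any intermediate value; a grid-wide result is only
    // well-defined after the kernel completes (or after an explicit
    // fence/flag handshake between producer and reader).
}
```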

How do you see a unified TU/RBE working?
The key aspect to me is that a pixel shader is allowed to fetch data from its position in all render targets currently bound (8 distinct buffers + Z/stencil). This is logically the same as fetching from a set of non-compressed textures. The pixel shader is then able to update all of those buffers, again solely for its position.

These operations are a bit like reading/writing shared memory. I think it was Trinibwoy who suggested a while back that NVidia could do ROP processing in the multiprocessors using shared memory as a buffer.

Currently, in ATI's architecture, shared memory is idle while pixel shading. It's only usable when running a compute kernel. So, LDS might be a candidate for this kind of usage.
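Roughly what I imagine that looking like, as a CUDA-style sketch with shared memory standing in for LDS (my own illustration of Trinibwoy's idea, not a description of real hardware): one block owns one screen tile, stages it once, blends against it locally, and writes it back once.

```
// Sketch of "ROP in the multiprocessor" with shared memory as the tile buffer.
// One block owns one 16x16 tile of the colour buffer: fetch the tile, blend
// incoming fragments against it locally, then write the tile back once.
#define TILE 16

__global__ void blendTile(float4 *colourBuf, const float4 *fragments,
                          int pitch, int tilesPerRow)
{
    __shared__ float4 tile[TILE * TILE];          // stands in for LDS

    int tx = threadIdx.x, ty = threadIdx.y;
    int x = (blockIdx.x % tilesPerRow) * TILE + tx;
    int y = (blockIdx.x / tilesPerRow) * TILE + ty;
    int local  = ty * TILE + tx;
    int global = y * pitch + x;

    tile[local] = colourBuf[global];              // fetch/"decompress" the tile
    __syncthreads();

    float4 src = fragments[global];               // incoming fragment colour
    float4 dst = tile[local];
    float  a   = src.w;                           // simple src-alpha blend
    tile[local] = make_float4(src.x * a + dst.x * (1.f - a),
                              src.y * a + dst.y * (1.f - a),
                              src.z * a + dst.z * (1.f - a),
                              a + dst.w * (1.f - a));
    __syncthreads();

    colourBuf[global] = tile[local];              // write the tile back once
}
```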

I suspect the more-stringent texture-filtering requirements of D3D11 might make ATI return to single cycle fp16 filtering, which then provides the precision required to perform 8-bit pixel blending at full speed.

Z-test, hierarchical-Z, colour/z (de-)compression for render targets sounds like something that should stay close to the MCs. Clearly the tiled nature of rendering buffers makes it possible to separate the reading-from-memory/de-compression of buffers from the RBE operations and then the compression/writing-to-memory. These operations are all atomic at the tile level, and nominally only one cluster is performing atomic updates on any given tile. The question then becomes one of the added latencies that arise in moving render buffer data into clusters and then back. I'm not convinced the latencies matter, per se.

The simple case of append, with any kind of structured data, has no strict ordering defined as far as I can tell. i.e. each cluster can generate a local tile of data to be appended - when the tile is full it can be posted to the memory system to be slotted (and compacted?) into its destination in memory.
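As a sketch of that "fill a local tile, then post it" append (a CUDA analogy with a hypothetical kernel, not a statement about either vendor's implementation): each block stages its surviving records in shared memory, reserves a contiguous range in the destination with a single atomicAdd, and copies them out; the order of ranges between blocks is whatever the atomics happen to produce, i.e. unordered.

```
// Unordered append with block-local staging. Assumes blockDim.x <= 256.
__global__ void appendCompact(int *out, unsigned *outCount,
                              const int *in, int n)
{
    __shared__ int      localBuf[256];
    __shared__ unsigned localCount;
    __shared__ unsigned base;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0) localCount = 0;
    __syncthreads();

    if (i < n && (in[i] & 1)) {                    // keep odd values, say
        unsigned slot = atomicAdd(&localCount, 1u);
        localBuf[slot] = in[i];                    // fill the local "tile"
    }
    __syncthreads();

    if (threadIdx.x == 0)                          // one global reservation
        base = atomicAdd(outCount, localCount);
    __syncthreads();

    for (unsigned j = threadIdx.x; j < localCount; j += blockDim.x)
        out[base + j] = localBuf[j];               // post the tile to memory
}
```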

I can't see why both ATI and NVidia GPUs couldn't work this way, to be honest.

Jawed
 
I may be completely wrong, but I thought Juniper was the highest-end single-chip solution and Cypress was going to be a dual-chip solution?

If that's the case, then what makes this better than the previous generation?

The latest codename mappings I've seen were:
Hemlock = "R800"
Cypress = "RV870"
Juniper = "RV830"
Redwood = "RV810"
Cedar = "RV810"

And
Cypress = "R800"
Redwood = "RV870"
Juniper = "RV830"
Cedar = "RV810"
Hemlock = "RV810"
 
Hmm, so Cypress is expected, based purely on the Vantage numbers, to be a single-chip GPU approximately 30-40% faster than a GTX 285, while Hemlock would be an X2 variant of that?

Sounds pretty good if true, especially if the size difference is as pronounced as expected. On the other hand, I'm expecting GT300 to be somewhere between 50 and 80% faster than a 285, so it's going to be an interesting battle. It certainly seems unlikely that GT300 will be able to keep up with Hemlock, but it's also going to be smaller (and cheaper) overall, as well as single-chip versus dual-chip.

In many ways it sounds similar to this generation before the 295 was introduced, except this time AMD will be first out of the gate.
 
Scatter isn't coherent across strands in any meaningful sense. Not only that, but scatter implies arbitrary write order to colliding addresses. D3D11 basically says you're on your own if you use colliding addresses from distinct strands. So there's no coherency to maintain, as far as I can tell.

Yes, the order of writes is arbitrary. However, if read/write passes through the RBEs, and if RBEs shared tiles (or cache lines, or whatever), they would have to later join the results of those accesses on the shared tiles (rough coherency: colliding write order doesn't matter, but writes cannot be lost when RBEs compress/write a tile back to GDDR). Atomics break the idea of sharing tiles. Clearly I don't think this is what they are doing...

The key aspect to me is that a pixel shader is allowed to fetch data from its position in all render targets currently bound (8 distinct buffers + Z/stencil). This is logically the same as fetching from a set of non-compressed textures. The pixel shader is then able to update all of those buffers, again solely for its position.

Assuming only self-update (say, a full-screen pass) and no overlapping pixels, one could do this on current hardware if the API allowed it.

A possible DX11 problem case is when, say, one is tessellating a surface and fragments (in the same patch, for example) overlap the same destination pixel/sample location. Even though ROP is ordered, fetching prior render-target data is unordered (to my knowledge), and thus much less useful.

These operations are a bit like reading/writing shared memory. I think it was Trinibwoy who suggested a while back that NVidia could do ROP processing in the multiprocessors using shared memory as a buffer.

Note that the above tessellation case would cause problems for software ROP in the ALUs. Locking down a tile/line/block of memory to ensure atomic access during ROP processing would be crazy bad (less bad if all ALU ops for ROP are done together like an ATI clause, with tile/line/block access through L2). The ATI clause model could actually be very powerful with regard to atomics if the developer had native API access...
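To illustrate how bad the naive version gets (a deliberately crude CUDA sketch, not a proposal): a per-tile spinlock via atomicCAS fully serialises every block that maps to the same tile, and the spinning itself hammers the memory system. It only becomes tolerable if a whole tile's worth of ROP work is done under one acquisition, which is the clause-like grouping mentioned above.

```
// Crude per-tile locking: one thread per block spins on the tile's lock,
// the whole block does its "ROP" work, then releases. Every block that maps
// to the same tile is fully serialised.
__device__ void lockTile(int *lock)
{
    while (atomicCAS(lock, 0, 1) != 0) { /* spin */ }
}

__device__ void unlockTile(int *lock)
{
    __threadfence();                  // make the tile's writes visible first
    atomicExch(lock, 0);
}

__global__ void ropWithTileLocks(float *tileData, int *tileLocks,
                                 const int *tileOfBlock)
{
    int tile = tileOfBlock[blockIdx.x];

    if (threadIdx.x == 0) lockTile(&tileLocks[tile]);   // one locker per block
    __syncthreads();

    tileData[tile * blockDim.x + threadIdx.x] += 1.0f;  // stand-in ROP work

    __syncthreads();
    if (threadIdx.x == 0) unlockTile(&tileLocks[tile]);
}
```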
 
So now we have these rumored scores:

Cypress ~P16xxx - P17xxx - P18xxx
Juniper XT ~P95xx
Redwood ~P46xx

Is it me or is there a hole somewhere between Cypress and Juniper?

So if Cypress fills the $300 segment, Juniper should be around the $150 segment. In the last generation there was the 4850 at $200 and the 4870 at $300, fairly priced as well, since the 4870 was about 30% faster and hence deserving of the premium.

So from $150 to $300 there is a gap, both performance-wise and product-wise. Methinks the 4850 was ATI's most profitable part during the last generation, and they should have a product to replace it.

Unless, of course, these Cypress scores of ~P16xxx - P17xxx - P18xxx suggest that they could be something more like P15xxx to P20xxx or more, and they are once again hiding facts!

I still don't see how there could be a lower-end Cypress, other than one with lower core and memory frequencies. That should be easily overcome with overclocking and voltage tweaks, though. I don't think there is the option of GDDR3 again for the "5850".

PS: Hey, I can edit now, yippee! :p
 
Yes, the order of writes is arbitrary. However, if read/write passes through the RBEs, and if RBEs shared tiles (or cache lines, or whatever), they would have to later join the results of those accesses on the shared tiles (rough coherency: colliding write order doesn't matter, but writes cannot be lost when RBEs compress/write a tile back to GDDR). Atomics break the idea of sharing tiles. Clearly I don't think this is what they are doing...
The sharing is transient - any fenced-clause that does scatter implicitly makes all reads incoherent within that clause, if they fetch from any address that is the target of a scatter. The only exception is if the scatter address(es) is explicitly private to the strand.

A possible DX11 problem case is when, say, one is tessellating a surface and fragments (in the same patch, for example) overlap the same destination pixel/sample location. Even though ROP is ordered, fetching prior render-target data is unordered (to my knowledge), and thus much less useful.
Even without tessellation this is a basic problem. The same pixel might appear on several triangles that are in flight at the same time. I honestly don't know how developers are going to make much use of pixel read/write when the render target is constantly changing.

Note that the above tessellation case would cause problems for software ROP in the ALUs. Locking down a tile/line/block of memory to ensure atomic access during ROP processing would be crazy bad (less bad if all ALU ops for ROP are done together like an ATI clause, with tile/line/block access through L2). The ATI clause model could actually be very powerful with regard to atomics if the developer had native API access...
I don't understand what it is about the tessellation case that creates a problem here. Tessellation is followed by setup then rasterisation. Tessellation is transparent as far as a fragment is concerned.

The LDS "exchange" mechanism is atomic (i.e. write/read are indivisible), because a clause encapsulates the LDS instructions. But ATI cannot currently support an atomic: "import data into registers, execute ALU cycles, export data from registers" because that's 3 clauses - ALU instructions cannot operate on non-register memory, except for fetching from constant buffers.

Apart from that, the other problem with a software back end is that the pixel operations need to be sorted by primitive ID. NVidia treats this problem in two parts (not all fragments need to be sorted, and if any do they're score-boarded for exclusive pixel shading), but I don't know how ATI handles it. I think it's handled by post-shading arbitration.

I'm still unclear how many triangles can own fragments in a pixel-shading thread on ATI; I think it may only be 1. If so, that means there are, at most, 128 triangles in flight at a time, per cluster. Post-shading arbitration presumably checks fragments' triangle IDs as the fragments arrive and decides if they're allowed to proceed to the back end or have to wait until their predecessors have all been processed. Still, in theory this is potentially an absurd amount of data hanging around waiting for one wayward triangle's fragments to arrive - so I'm puzzled what ATI's actually doing here. Some of this problem can be solved by making the thread scheduler constrain ordering, e.g. preferring to stall pixel shading on threads that are newer than the wayward thread, to prevent the arbiter's fragment queue from filling.

So a software RBE solution would require that the fragments are effectively sorted by triangle ID. It would be like a mini A-buffer, each pixel having a private queue of fragments.
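The closest existing analogue I can think of is the per-pixel linked-list trick, sketched below in CUDA (my analogy, not ATI's mechanism): each fragment allocates a node with atomicAdd and pushes itself onto its pixel's list with atomicExch, and a later resolve pass walks each list, sorts by primitive ID (or depth) and applies the back-end ops in order.

```
// Per-pixel fragment queues built as linked lists. Each arriving fragment
// grabs a node and splices itself in as the new head of its pixel's list;
// a separate resolve pass (after this kernel completes) walks and sorts
// each list before applying the back-end operations.
struct FragNode {
    float4 colour;
    float  depth;
    int    primId;
    int    next;        // index of the previous head, -1 terminates the list
};

__global__ void pushFragments(FragNode *nodes, unsigned *nodeCount,
                              int *headPerPixel,
                              const float4 *colours, const float *depths,
                              const int *primIds, const int *pixelOfFrag,
                              int numFrags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numFrags) return;

    unsigned slot = atomicAdd(nodeCount, 1u);            // allocate a node
    nodes[slot].colour = colours[i];
    nodes[slot].depth  = depths[i];
    nodes[slot].primId = primIds[i];

    // Atomically publish the node as the new head of this pixel's queue.
    nodes[slot].next = atomicExch(&headPerPixel[pixelOfFrag[i]], (int)slot);
}
```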

So a unified TU/RBE unit would need an arbiter on its input, but would then happily perform atomic back-end operations by default.

Jawed
 