AMD: R9xx Speculation

You failed to pay attention to the emphasis in "right in between GF104 and GF110"! It is a play on words, isn't it? Otherwise: :oops:

Nope, I didn't fail, but maybe the numbers I used were off? GF110 is ~520mm^2 according to DailyTech and most sites say GF104 is around 330mm^2. Do you have better numbers?
 
Small triangles (which are problematic for both, the rasterizers as well as the ROPs) tend to cluster on the screen ;). It's therefore preferable to use smaller checkerboard patterns for a more even distribution of the load.
Your example would group 4x4 tiles of 8x8 pixels (at least, maybe you were thinking of stripes) for one rasterizer.
I'm not worried about stripes versus checkerboard.

Cypress:

http://v3.espacenet.com/publication...=B1&FT=D&date=20091215&DB=EPODOC&locale=en_V3

Rasterisation in either the A or B SEs is aligned with screen tiles.

I can't find anything about mapping of ROPs though.

The probability that one rasterizer is completely busy and the second one has nothing to do is much higher in that case (or you would need much larger buffers).
I think the rasteriser is working on 8x2, 4x4 or 2x8 contiguous blocks of pixels, per cycle. That's why I made the tiles 8x8. 8x8 also matches the hardware thread size.
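To make the tile-to-rasterizer mapping being debated concrete, here is a minimal sketch of a checkerboard assignment of 8x8-pixel screen tiles to two rasterizers. The tile size and the two-engine count are assumptions taken from this discussion, not confirmed hardware detail.

```python
TILE = 8  # assumed screen-tile edge in pixels

def rasterizer_for_pixel(x: int, y: int) -> int:
    """Checkerboard assignment: neighbouring tiles go to different rasterizers."""
    tx, ty = x // TILE, y // TILE
    return (tx + ty) & 1  # engine 0 or engine 1

# Horizontally adjacent tiles land on different engines:
assert rasterizer_for_pixel(0, 0) != rasterizer_for_pixel(8, 0)
# Diagonal neighbours share an engine:
assert rasterizer_for_pixel(0, 0) == rasterizer_for_pixel(8, 8)
```

Stripes would simply use `tx & 1` (or `ty & 1`) instead of the sum; the checkerboard differs only in also alternating along the second axis.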

You talk about distributing the load for small triangles, but each SE has its own triangle buffer to feed its own rasteriser. Load-balancing depends merely upon each rasteriser having triangles to rasterise.

I don't see how balancing the load on the ROPs comes into this. Their source is pixel-shading and shader-export and they're load-balancing that input along with memory operations.

Against all that I can't provide anything, other than to say that to minimise inter-SE contention for ROPs it's easiest to make the ROPs private to SEs, aligning ROPs within screen-space tiles, at least for standard render target operations.

Though one of the main architectural diagrams (press slide) shows Shader Export as a single block - not as two distinct blocks as I would expect, one per SE. It also shows a single UTDP, when in fact there are two.

Neighbouring tiles definitely should be assigned to different rasterizers. But one may make the ROP tiles larger than that. It depends a bit on the buffers/color caches there. But generally the same rule should apply.
Truthfully, without doing some specific tests, I don't think we'll get a resolution to the ROP mapping question. I don't think it matters, because if that rumour about Cayman is true ROP mapping becomes totally scalable and independent of SIMD arrangement.
 
Nope, I didn't fail, but maybe the numbers I used were off? GF110 is ~520mm^2 according to DailyTech and most sites say GF104 is around 330mm^2. Do you have better numbers?
I just wanted to refer to the "right" in "right in between" as neliz emphasized it. I would emphasize it that way if I wanted to indicate that it is in fact more on the right side ;). But that is something I can't imagine right now.
 
I just wanted to refer to the "right" in "right in between" as neliz emphasized it. I would emphasize it that way if I wanted to indicate that it is in fact more on the right side ;). But that is something I can't imagine right now.

Ah, I see ;)
 
Oh... well that explains it. Still, I believe that foreigners should stop pretending to speak my language. I want reveals to come from English-language sites first! :D

Google translate messes up on my PC. :(

Don't worry, we should be launching an English version of Muropaketti (a box of cereal in English) shortly. I added a direct English quote from AMD to the original news post to avoid confusion from the Google translation:

"Please find an announcement from AMD below regarding the change of Cayman NDA:

Demand for the ATI Radeon HD 5800 series continues to be very strong, the ATI Radeon HD 5970 remains the fastest graphics card in the world and the newest members of the AMD graphics family, the AMD Radeon HD 6850 and HD 6870, have set new standards for performance at their respective price points, and are available in volume.

With that in mind, we are going to take a bit more time before shipping the AMD Radeon HD 6900 series. As of today, the NDA lift for information relating to the AMD Radeon HD 6950 and HD 6970 will be week 50. We will be providing additional information on these products, including the exact date and time of the NDA lift, in the weeks prior to launch."


http://plaza.fi/muropaketti/muropak...a-6970-naytonohjaimet-julkaistaan-viikolla-50
 
I think the rasteriser is working on 8x2, 4x4 or 2x8 contiguous blocks of pixels, per cycle. That's why I made the tiles 8x8. 8x8 also matches the hardware thread size.
Yes, that's also the reason I mentioned that I suspect 8x8 Pixel screen tiles before.
You talk about distributing the load for small triangles, but each SE has its own triangle buffer to feed its own rasteriser. Load-balancing depends merely upon each rasteriser having triangles to rasterise.
Those triangles simply may not be there.
Imagine you have a batch of a lot of tiny triangles sitting very close to each other (as they have to, because they are so small), which directly translates into them sitting in a single screen tile. What happens then is that one rasterizer, and therefore one shader engine, starts to idle because it doesn't get any new tris. Of course, some buffering will lessen the effect, but if the contiguous areas belonging to a rasterizer are smaller, the needed buffer space also gets smaller. It is therefore preferable to have relatively small contiguous areas for the rasterizers, hence my preference for a checkerboard.
Of course you don't want to split triangles unnecessarily between rasterizers (and reduce performance in that case). For this reason the screen tiles should be a bit larger than the pixels/clock capacity of each scanline converter. A 1:4 ratio (16 pixels/clock raster, 64-pixel screen tiles) doesn't look too bad to me.
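A rough illustration of the argument above: a cluster of tiny triangles spread over a small screen region keeps both rasterizers busy only if the contiguous area owned by each rasterizer is small. The tile sizes, triangle positions and the two-engine split are made up purely for the sketch.

```python
def count_per_rasterizer(tri_positions, tile):
    """Count triangles landing on each of two checkerboarded rasterizers."""
    counts = [0, 0]
    for (x, y) in tri_positions:
        counts[((x // tile) + (y // tile)) & 1] += 1
    return counts

# 256 tiny triangles clustered in a 32x32-pixel region:
cluster = [(x, y) for x in range(0, 32, 2) for y in range(0, 32, 2)]

small = count_per_rasterizer(cluster, 8)    # 8x8-pixel tiles
large = count_per_rasterizer(cluster, 64)   # 64x64-pixel contiguous areas

print(small)  # [128, 128] - evenly balanced
print(large)  # [256, 0] - the whole cluster lands on one engine
```

With 8x8 tiles the cluster straddles several tile boundaries and splits evenly; with large contiguous areas one rasterizer gets everything while the other idles.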
I don't see how balancing the load on the ROPs comes into this.
Imagine that one rasterizer gets loaded with a huge number of tiny tris while the other rasterizer has a somewhat easier task because the tris are not evenly distributed on screen. What is better? That only half of the ROPs are tied to an already-limiting shader engine and have to do the heavy lifting of handling a lot of small tris (which reduces the effectiveness of the ROPs, especially with MSAA) while the other half sits half idle? Or, as an alternative, that you distribute the work of both shader engines over all ROPs, balancing it out a bit on the backend?

As you said, the crossbar between shader export and ROPs is there. Wouldn't it make much more sense to use it to avoid the raster/ROP "aliasing" as a potential pitfall? The Fermis do it that way, and I see no reason it should be much different in Cypress.
Against all that I can't provide anything, other than to say that to minimise inter-SE contention for ROPs it's easiest to make the ROPs private to SEs, aligning ROPs within screen-space tiles, at least for standard render target operations.

Though one of the main architectural diagrams (press slide) shows Shader Export as a single block - not as two distinct blocks as I would expect, one per SE. It also shows a single UTDP, when in fact there are two.
It may be easier to implement if you restrict the access from one shader engine to only half of the ROPs/memory interface (one would need two crossbars of half the size instead of a single big one), but it would also be slower. And we know the infrastructure has to be there (you can export from each shader engine to an arbitrary memory location). So why shouldn't one use it?

Btw., later slides show two separate UTDPs and still a unified shader export block.
I don't think it matters, because if that rumour about Cayman is true ROP mapping becomes totally scalable and independent of SIMD arrangement.
Do you refer to the rumour that the ROPs go into the render engines?
I don't see how it would be more independent of the SIMD arrangement than it probably is now. Actually, quite the opposite. But it would add some flexibility to scale the ROP count really independently of the memory controller width. One would still need a crossbar similar to today's connecting all those outputs from the render engines to the memory channels (which requires twice the throughput of today for blending!). This change would be more a theoretical advantage than a real one.
 
I lost my reply to a keyboard-mouse accident, so this is going to be terse :p

Those triangles simply may not be there.
Well, hard luck. You can also write a heavy texturing shader that hits one memory channel hard, leaving the others idle.

Imagine you have a batch of a lot of tiny triangles sitting very close to each other (as they have to, because they are so small), which directly translates into them sitting in a single screen tile. What happens then is that one rasterizer, and therefore one shader engine, starts to idle because it doesn't get any new tris. Of course, some buffering will lessen the effect, but if the contiguous areas belonging to a rasterizer are smaller, the needed buffer space also gets smaller. It is therefore preferable to have relatively small contiguous areas for the rasterizers, hence my preference for a checkerboard.
Of course you don't want to split triangles unnecessarily between rasterizers (and reduce performance in that case). For this reason the screen tiles should be a bit larger than the pixels/clock capacity of each scanline converter. A 1:4 ratio (16 pixels/clock raster, 64-pixel screen tiles) doesn't look too bad to me.
Imagine that one rasterizer gets loaded with a huge number of tiny tris while the other rasterizer has a somewhat easier task because the tris are not evenly distributed on screen. What is better? That only half of the ROPs are tied to an already-limiting shader engine and have to do the heavy lifting of handling a lot of small tris (which reduces the effectiveness of the ROPs, especially with MSAA) while the other half sits half idle? Or, as an alternative, that you distribute the work of both shader engines over all ROPs, balancing it out a bit on the backend?
Cypress setup-hierarchical-Z-rasteriser-ROP-tiling is R300 version 5 (or whatever). It was never designed for a wodge of teeny triangles squished into a single tile.

As you said, the crossbar between shader export and ROPs is there. Wouldn't it make much more sense to use it to avoid the raster/ROP "aliasing" as a potential pitfall? The Fermis do it that way, and I see no reason it should be much different in Cypress.
It may be easier to implement if you restrict the access from one shader engine to only half of the ROPs/memory interface (one would need two crossbars of half the size instead of a single big one), but it would also be slower. And we know the infrastructure has to be there (you can export from each shader engine to an arbitrary memory location). So why shouldn't one use it?
In this scenario each SE is trying to throttle pixel export in response to ROP queue-depth, but neither SE knows what the other is doing. So when one throttles, the other gets greedy. And the first then runs dry. Anti-load-balancing.

Do you refer to the rumour that the ROPs go into the render engines?
No, nothing in particular.

I don't see how it would be more independent of the SIMD arrangement than it probably is now. Actually, quite the opposite. But it would add some flexibility to scale the ROP count really independently of the memory controller width. One would still need a crossbar similar to today's connecting all those outputs from the render engines to the memory channels (which requires twice the throughput of today for blending!). This change would be more a theoretical advantage than a real one.
I see that as unlikely for a long time.
 
Well, hard luck. You can also write a heavy texturing shader that hits one memory channel hard, leaving the others idle.
But that is similarly difficult, as textures are tiled over all memory channels too. It would get simpler, and thus more probable, if you used huge tiles, a simple division, or striping instead of the space-filling curves which are apparently used ;)
In this scenario each SE is trying to throttle pixel export in response to ROP queue-depth, but neither SE knows what the other is doing. So when one throttles, the other gets greedy. And the first then runs dry. Anti-load-balancing.
I don't get your scenario.
I would think that nothing gets throttled until the queues (in the ROPs) are full. After that it is a matter of arbitration, but the ROPs work as hard as they can either way, with no unnecessary idle time. And it's not that the export of one render engine stalls as a whole; more probably just the export to the one ROP that gets too much work. That's also a reason one should split the output of each render engine over all ROPs: it simply reduces the chance of congestion.
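A toy model of this trade-off: two shader engines producing pixel exports at different rates, drained by 8 ROPs of equal throughput. The work amounts and ROP count are made up purely to illustrate private vs. shared ROP assignment, not real Cypress numbers.

```python
def finish_time(producer_work, rop_groups):
    """Cycles until all exports drain, given how many ROPs each producer may use."""
    return max(work / n_rops for work, n_rops in zip(producer_work, rop_groups))

work = [800, 200]  # imbalanced load: SE0 exports 4x the pixels of SE1

private = finish_time(work, [4, 4])        # each SE owns half the ROPs
shared = finish_time([sum(work)], [8])     # both SEs feed all 8 ROPs

print(private)  # 200.0 cycles: SE0's four ROPs are the bottleneck, SE1's idle
print(shared)   # 125.0 cycles: the backend evens the load out
```

With private ROPs the busy engine's half saturates while the other half idles after 50 cycles; sharing all ROPs absorbs the rasterizer-side imbalance at the backend.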

Actually, the best option would be to additionally redistribute the pixel shader load over all shader engines after the rasterizers, i.e. not tie a shader engine to a rasterizer. That should show even better scalability, more closely resembling the Pomegranate concept. But that is probably a bit further into the future.
I see that as unlikely for a long time.
Me too.
 
Don't worry, we should be launching English version of Muropaketti (a box of cereal in English) shortly. I added direct English quote from AMD to original news post to avoid confusion in the Google Translation:

"Please find an announcement from AMD below regarding the change of Cayman NDA:

Demand for the ATI Radeon HD 5800 series continues to be very strong, the ATI Radeon HD 5970 remains the fastest graphics card in the world and the newest members of the AMD graphics family, the AMD Radeon HD 6850 and HD 6870, have set new standards for performance at their respective price points, and are available in volume.

With that in mind, we are going to take a bit more time before shipping the AMD Radeon HD 6900 series. As of today, the NDA lift for information relating to the AMD Radeon HD 6950 and HD 6970 will be week 50. We will be providing additional information on these products, including the exact date and time of the NDA lift, in the weeks prior to launch."


http://plaza.fi/muropaketti/muropak...a-6970-naytonohjaimet-julkaistaan-viikolla-50

I can confirm too that the HD 6900s are now officially delayed. For the record: 1h prior to the GTX 580 launch, AMD called to reassure us that the launch would go as expected on the 22nd and to give us a date for the press samples. Glad to see that 48h later they suddenly realized that the HD 5800s are selling well and that the HD 5970 is faster than a board they didn't expect to see on the market when they planned the Nov 22nd launch…
 
But that is similarly difficult as textures are tiled over all memory channels too. But it would get simpler and thus more probable if you would huge tiles are a simple division or striping instead of the spacefilling curves which are apparently used ;)
Yes, and that problem's easier to solve, because they're read-only, etc. ;)

I agree with you on how things can be more efficient. But Cypress was designed for big triangles, not small triangles. Because it's just an iteration of R300. It's a minimal change.

The good stuff we're discussing may be in Cayman, with a bit of luck.

You might enjoy this:

http://www.graphicshardware.org/previous/www_2005/presentations/bando-hexagonal-gh05.pdf

I don't get your scenario. I would think that nothing gets throttled until the queues (in the ROPs) are full.
Well, that's a purely binary approach: full-speed or stalled.

After that it is a matter of arbitration, but the ROPs work as hard as they can either way, with no unnecessary idle time. And it's not that the export of one render engine stalls as a whole; more probably just the export to the one ROP that gets too much work. That's also a reason one should split the output of each render engine over all ROPs: it simply reduces the chance of congestion.
Once one stalls it'll cascade stalls for writes to the others, simply because a hardware thread's export will cover multiple ROP-quads.

Actually, the best option would be to additionally redistribute the pixel shader load over all shader engines after the rasterizers, i.e. not tie a shader engine to a rasterizer. That should show even better scalability, more closely resembling the Pomegranate concept. But that is probably a bit further into the future.
The question boils down to how many parameters do you want to put into the load-balancing formula - and how far across the chip are you going to macro-load-balance. Tiling and hierarchical-tiling approaches localise load-balancing because it's practical, and the workloads they're designed for make good use of the whole chip.

It's similar to "right-sizing" the inter-stage buffers in a pipeline, rather than going all out to provide optimal performance in the worst-possible case.

But Fermi shows quite nicely that harsh geometric workloads don't respect the small-scale, fragmented, approach of old: hence the multiply-staged and load-balanced processing of geometry and the use of L1/L2 cache as both a shock-absorber and a communications network.
 
Once one stalls it'll cascade stalls for writes to the others, simply because a hardware thread's export will cover multiple ROP-quads.
No, not necessarily. In the simplest case a wavefront consists of pixels from a single ROP tile. And that case only gets more probable with smaller triangles (simply because you can easily fill your wavefronts with quads from a single screen tile).
But Fermi shows quite nicely that harsh geometric workloads don't respect the small-scale, fragmented, approach of old: hence the multiply-staged and load-balanced processing of geometry and the use of L1/L2 cache as both a shock-absorber and a communications network.
Right. But what does that have to do with the assignment of each ROP to a certain rasterizer which you suspect for Cypress?
 
No, not necessarily. In the simplest case a wavefront consists of pixels from a single ROP tile.
It does?

Right. But what does that have to do with the assignment of each ROP to a certain rasterizer which you suspect for Cypress?
It was a comparison of load-balancing techniques, staging and granularity of buffering.
 
Of course!

A wavefront (hardware thread) consists of 16 quads, i.e. 64 pixels maximum. Assuming a ROP screen tile is just 8x8 pixels (as said, it could also be larger), this tile gets assigned to a single rasterizer, which creates at least one, possibly more(*) wavefronts out of it, all accessing the same tile.

(*) This is simply caused by the fact that each triangle uses all quads it touches. That means an 8x8 tile with 128 triangles half a pixel in size would create at least 4 wavefronts, all accessing the same ROP tile.
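A quick check of the footnote's arithmetic, under the stated assumptions: a wavefront holds 16 quads (64 pixels) and every triangle occupies each quad it touches in full. With each of the 128 tiny triangles touching exactly one quad, this gives 8 wavefronts, comfortably above the "at least 4" lower bound.

```python
import math

QUADS_PER_WAVEFRONT = 16  # 64 pixels / 4 pixels per quad

def min_wavefronts(n_triangles: int, quads_touched_each: int) -> int:
    """Lower bound on wavefronts when every triangle claims its quads in full."""
    total_quads = n_triangles * quads_touched_each
    return math.ceil(total_quads / QUADS_PER_WAVEFRONT)

# 128 half-pixel triangles in one 8x8 tile, each touching at least one quad:
print(min_wavefronts(128, 1))  # 8 wavefronts, all hitting the same ROP tile
```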
 
Hmm, I thought your starting point was the tightest-possible interleaving of ROP tiles so that a sequence of small triangles lands on as many different ROP-quads as possible :???:

And in order to maximise the coverage of ROP-quads per triangle, have all ROPs shared by both SEs :???:
 
Hmm, I thought your starting point was the tightest-possible interleaving of ROP tiles so that a sequence of small triangles lands on as many different ROP-quads as possible :???:
But a screen tile is still at least 8x8 pixels, irrespective of how you interleave the individual tiles.
And in order to maximise the coverage of ROP-quads per triangle, have all ROPs shared by both SEs :???:
:???:
You want to minimize the possible stalls which could happen if you restricted the access of one SE to only half of the ROPs (half of the ROPs working, the other half idle).
That way each ROP gets only half of its work from one SE, so there is a chance that an asymmetric load on the rasterizers/shader engines evens out across all ROPs and is fairly symmetric at the ROP level again.
 
I can confirm too that the HD 6900s are now officially delayed. For the record: 1h prior to the GTX 580 launch, AMD called to reassure us that the launch would go as expected on the 22nd and to give us a date for the press samples. Glad to see that 48h later they suddenly realized that the HD 5800s are selling well and that the HD 5970 is faster than a board they didn't expect to see on the market when they planned the Nov 22nd launch…

I have a feeling that AMD were taken aback (again - it's happened again and again since NV40 and G80) by the GTX 580's results, and now they have to decide on clock speeds to try to be competitive.

That or TSMC screwed them over again.

I wonder if I get a prize? I reckon the delay is a BIOS flash to speed the cards up.

*please note that this is my own opinion*
 
Changing clock speeds is not something you do on a whim, nor can it necessarily be done quickly. Depending on where you are in a qualification cycle, changing clocks will have major ramifications that can result in potentially months of schedule alteration.
 