You failed to pay attention to the emphasis in "right inbetween GF104 and GF110"! It's not a play on words, is it? Otherwise:

So around 425mm^2 :smile:
I'm not worried about stripes versus checkerboard.

Small triangles (which are problematic for both the rasterizers and the ROPs) tend to cluster on the screen. It's therefore preferable to use smaller checkerboard patterns for a more even distribution of the load.
Your example would group 4x4 tiles of 8x8 pixels (at least, maybe you were thinking of stripes) for one rasterizer.
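To make the stripes-versus-checkerboard point concrete, here is a minimal Python sketch of the two tile-to-rasterizer mappings being argued about. The 8x8 tile size and the two-rasterizer count follow the discussion; the cluster coordinates are made up for illustration.

```python
# Illustrative only: map an 8x8-pixel screen tile to one of two rasterizers.
# Tile coordinates (tx, ty) index 8x8-pixel tiles across the screen.

def checkerboard(tx, ty):
    # Neighbouring tiles alternate between the two rasterizers.
    return (tx + ty) & 1

def stripes(tx, ty, tiles_per_stripe=4):
    # Groups of 4 adjacent tile columns (32 pixels) go to the same rasterizer.
    return (tx // tiles_per_stripe) & 1

# A cluster of tiny triangles covering a 2x2-tile neighbourhood:
cluster = [(tx, ty) for tx in (10, 11) for ty in (20, 21)]
for name, f in (("checkerboard", checkerboard), ("stripes", stripes)):
    loads = [sum(1 for t in cluster if f(*t) == r) for r in (0, 1)]
    print(name, "tiles per rasterizer:", loads)
# The checkerboard splits this cluster 2/2; the stripe scheme can give 4/0.
```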
I think the rasteriser is working on 8x2, 4x4 or 2x8 contiguous blocks of pixels, per cycle. That's why I made the tiles 8x8. 8x8 also matches the hardware thread size.

The probability that one rasterizer is completely busy and the second one has nothing to do is much higher in that case (or you would need much larger buffers).
Truthfully, without doing some specific tests, I don't think we'll get a resolution to the ROP mapping question. I don't think it matters, because if that rumour about Cayman is true, ROP mapping becomes totally scalable and independent of SIMD arrangement.

Neighbouring tiles definitely should be assigned to different rasterizers. But one may make the ROP tiles larger than that. It depends a bit on the buffers/color caches there. But generally the same rule should apply.
I just wanted to refer to the "right" in "right inbetween" as neliz has emphasized it. I would do it if I wanted to indicate that it is in fact more on the right side. But that is something I can't imagine right now.

Nope, I didn't fail, but maybe the numbers I used were off? GF110 is ~520mm^2 according to Dailytech and most sites say GF104 is around 330mm^2. Do you have better numbers?
Oh... well, that explains it. Still, I believe that foreigners should stop pretending to speak my language. I want reveals to come from English-language sites first!
Google Translate messes up on my PC.
Yes, that's also the reason I mentioned that I suspect 8x8 Pixel screen tiles before.

I think the rasteriser is working on 8x2, 4x4 or 2x8 contiguous blocks of pixels, per cycle. That's why I made the tiles 8x8. 8x8 also matches the hardware thread size.
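A quick bit of arithmetic behind that, assuming the 16 pixels/clock per-rasterizer rate mentioned elsewhere in this thread and the usual 64-wide hardware thread:

```python
# Rough arithmetic behind the 8x8 tile choice (assumes 16 pixels/clock per rasterizer).
wavefront_size = 64                        # pixels (threads) per hardware thread
tile = (8, 8)                              # screen tile in pixels
scan_footprints = [(8, 2), (4, 4), (2, 8)] # candidate per-clock scan shapes

assert tile[0] * tile[1] == wavefront_size  # one tile fills exactly one wavefront
for w, h in scan_footprints:
    print(f"{w}x{h}: {w*h} pixels/clock -> {wavefront_size // (w*h)} clocks per full 8x8 tile")
# Every footprint is 16 pixels/clock, so a fully covered 8x8 tile takes 4 clocks either way.
```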
Those triangles simply may not be there.

You talk about distributing the load for small triangles, but each SE has its own triangle buffer to feed its own rasteriser. Load-balancing depends merely upon each rasteriser having triangles to rasterise.
Imagine that one rasterizer gets loaded with a huge amount of tiny tris while the other rasterizer has a somewhat easier task because the tris are not evenly distributed on screen. Which is better: only half of the ROPs being tied to an already limiting shader engine and having to do the heavy lifting of handling a lot of small tris (which reduces the effectiveness of the ROPs, especially with MSAA) while the other half sits half idle, or distributing the work of both shader engines over all ROPs to balance it out a bit at the backend?

I don't see how balancing the load on the ROPs comes into this.
It may be easier to implement if you restrict the access from one shader engine to only half of the ROPs/memory interface (one would need two crossbars of half the size instead of a single big one), but it would also be slower. And we do know that the infrastructure has to be there (you can export from each shader engine to an arbitrary memory location). So why shouldn't one use it?

Against all that I can't provide anything, other than to say that to minimise inter-SE contention for ROPs it's easiest to make the ROPs private to SEs, aligning ROPs within screen-space tiles, at least for standard render target operations.
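Just to put numbers on the "two crossbars of half the size" remark: a quick link count, assuming 2 shader engines and 8 ROP/memory-channel groups purely for illustration.

```python
# Illustrative crossbar sizing (2 shader engines, 8 ROP/memory-channel groups assumed).
ses, rop_groups = 2, 8

full_crossbar = ses * rop_groups           # every SE can reach every ROP group
two_halves = 2 * (1 * (rop_groups // 2))   # each SE wired only to its own half of the ROPs

print("full crossbar links:", full_crossbar)   # 16
print("two half crossbars :", two_halves)      # 8
# The private-ROP arrangement needs half the wiring, but an export aimed at the
# "other" half of the screen/memory has no cheap path.
```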
Though one of the main architectural diagrams (press slide) shows Shader Export as a single block - not as two distinct blocks as I would expect, one per SE. It also shows a single UTDP, when in fact there are two.
Do you refer to the rumour that the ROPs go into the render engines?

I don't think it matters, because if that rumour about Cayman is true, ROP mapping becomes totally scalable and independent of SIMD arrangement.
Well, hard luck. You can also write a heavy texturing shader that hits one memory channel hard, leaving the others idle.

Those triangles simply may not be there.
Cypress setup-hierarchical-Z-rasteriser-ROP-tiling is R300 version 5 (or whatever). It was never designed for a wodge of teeny triangles squished into a single tile.

Imagine you have a batch of a lot of tiny triangles sitting very close to each other (as they have to, because they are so small), which directly translates into them sitting in a single screen tile. What happens then is that one rasterizer, and therefore one shader engine, starts to idle as it doesn't get any new tris. Of course, some buffering will lessen the effect, but if the contiguous areas belonging to a rasterizer are smaller, the needed buffer space also gets smaller. It is therefore preferable to have relatively small contiguous areas for the rasterizers, hence my preference for a checkerboard.
Of course you don't want to split triangles unnecessarily between rasterizers (and reduce performance in that case). For this reason the screen tiles should be a bit larger than the possible pixel/clock capacity of each scanline converter. A 1:4 ratio (16 pixel/clock raster, 64-pixel screen tiles) doesn't look too bad to me.
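A rough sanity check of that 1:4 ratio, under the simplifying (and admittedly unrealistic) assumption that a small triangle's bounding box lands at a uniformly random pixel position:

```python
# Back-of-the-envelope: chance a small triangle's bounding box straddles a tile edge,
# assuming a w x h bounding box dropped at a uniformly random integer offset
# (a simplification - real triangle distributions differ).
def straddle_probability(w, h, tile=8):
    # Offsets (mod tile) for which the box stays inside a single tile:
    fit = max(tile - w + 1, 0) * max(tile - h + 1, 0)
    return 1.0 - fit / (tile * tile)

for size in (2, 4, 8):
    print(f"{size}x{size} bbox: {straddle_probability(size, size):.0%} chance of crossing an 8x8 tile boundary")
# The smaller the triangle relative to the 64-pixel tile, the less often it straddles
# a tile and has to be split between rasterizers.
```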
Imagine that one rasterizer gets loaded with a huge amount of tiny tris while the other rasterizer has a somewhat easier task because the tris are not evenly distributed on screen. Which is better: only half of the ROPs being tied to an already limiting shader engine and having to do the heavy lifting of handling a lot of small tris (which reduces the effectiveness of the ROPs, especially with MSAA) while the other half sits half idle, or distributing the work of both shader engines over all ROPs to balance it out a bit at the backend?
In this scenario one SE is trying to throttle pixel export in response to ROP queue-depth. But each SE doesn't know what the other is doing. So when one throttles, the other gets greedy. And the first then runs dry. Anti-load-balancing.

As you said, the crossbar between shader export and ROPs is there. Wouldn't it make much more sense to avoid the raster/ROP "aliasing" as a potential pitfall? The Fermis do it that way and I see no reason it should be much different in Cypress.
It may be easier to implement if you restrict the access from one shader engine to only half of the ROPs/memory interface (one would need two crossbars of half the size instead of a single big one), but it would also be slower. And we do know that the infrastructure has to be there (you can export from each shader engine to an arbitrary memory location). So why shouldn't one use it?
No, nothing in particular.

Do you refer to the rumour that the ROPs go into the render engines?
I see that as unlikely for a long time.

I don't see how it would be more independent of the SIMD arrangement than it probably is now. Actually quite the opposite. But it would add some flexibility to scale the ROP count really independently of the memory controller width. But one would still need a similar crossbar as today connecting all those outputs from the render engines to the memory channels (and since blending is a read-modify-write of the destination, that would require twice the throughput of today!). This change would be more a theoretical advantage than a real one.
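A small sketch of why blending is the problem case for that crossbar, assuming plain 4-byte RGBA8 pixels and no framebuffer compression:

```python
# Why blending roughly doubles crossbar traffic if the ROPs move to the render-engine side
# (assumes 4-byte RGBA8 pixels, no framebuffer compression, purely for illustration).
bytes_per_pixel = 4

# ROPs next to the memory controllers: only the shaded colour crosses the crossbar;
# the destination read and the final write stay local to the memory channel.
rop_at_mc = bytes_per_pixel

# ROPs inside the render engines: the destination must be read across the crossbar
# and the blended result written back across it.
rop_in_engine = bytes_per_pixel + bytes_per_pixel

print("crossbar bytes per blended pixel:", rop_at_mc, "vs", rop_in_engine)
```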
Are these specs possible... 6GHz GDDR5!
It's from the e-mail they sent: "the NDA lift for information relating to the AMD Radeon HD 6950 and HD 6970 will be the week of December 13th."
But that is similarly difficult, as textures are tiled over all memory channels too. But it would get simpler and thus more probable if one would use huge tiles with a simple division or striping instead of the space-filling curves which are apparently used.

Well, hard luck. You can also write a heavy texturing shader that hits one memory channel hard, leaving the others idle.
I don't get your scenario.

In this scenario one SE is trying to throttle pixel export in response to ROP queue-depth. But each SE doesn't know what the other is doing. So when one throttles, the other gets greedy. And the first then runs dry. Anti-load-balancing.
Me too.

I see that as unlikely for a long time.
Don't worry, we should be launching an English version of Muropaketti (a box of cereal in English) shortly. I added a direct English quote from AMD to the original news post to avoid confusion in the Google translation:
"Please find an announcement from AMD below regarding the change of Cayman NDA:
Demand for the ATI Radeon HD 5800 series continues to be very strong, the ATI Radeon HD 5970 remains the fastest graphics card in the world and the newest members of the AMD graphics family, the AMD Radeon HD 6850 and HD 6870, have set new standards for performance at their respective price points, and are available in volume.
With that in mind, we are going to take a bit more time before shipping the AMD Radeon HD 6900 series. As of today, the NDA lift for information relating to the AMD Radeon HD 6950 and HD 6970 will be week 50. We will be providing additional information on these products, including the exact date and time of the NDA lift, in the weeks prior to launch."
http://plaza.fi/muropaketti/muropak...a-6970-naytonohjaimet-julkaistaan-viikolla-50
Yes, and that problem's easier to solve, because they're read-only, etc.

But that is similarly difficult, as textures are tiled over all memory channels too. But it would get simpler and thus more probable if one would use huge tiles with a simple division or striping instead of the space-filling curves which are apparently used.
Well, that's a purely binary approach: full-speed or stalled.

I don't get your scenario.

I would think that nothing gets throttled until the queues (in the ROPs) are full.
Once one stalls it'll cascade stalls for writes to the others, simply because a hardware thread's export will cover multiple ROP-quads.

After that it is a matter of arbitration, but the ROPs work as hard as they can either way with no unnecessary idle time. And it's not the export of one render engine as a whole that stalls, but more probably just the export to the ROP which gets too much work. That's also a reason one should split the output of each render engine over all ROPs. It simply reduces the chance of congestion.
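A toy illustration of that congestion argument, with made-up numbers:

```python
# Rough illustration: if one screen region is hot, a render engine whose exports fan
# out over all ROPs sees the backlog spread thinner than one tied to only half of them.
# (Numbers are made up; this only shows the proportionality.)
hot_pixels = 10_000     # pixels exported into the hot region per unit of time
rops_total = 8

private = hot_pixels / (rops_total // 2)  # the hot engine limited to its own half of the ROPs
shared  = hot_pixels / rops_total         # hot exports spread over every ROP

print(f"queue growth per ROP: private={private:.0f}, shared={shared:.0f}")
# Spreading each engine's output over all ROPs halves the per-ROP backlog for the same work.
```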
The question boils down to how many parameters you want to put into the load-balancing formula - and how far across the chip you are going to macro-load-balance. Tiling and hierarchical-tiling approaches localise load-balancing because it's practical, and the workloads they're designed for make good use of the whole chip.

Actually, the best would be to additionally redistribute the pixel shader load over all shader engines after the rasterizers, i.e. not tie a shader engine to a rasterizer. That should show even better scalability, more closely resembling that Pomegranate concept. But that is probably a bit further into the future.
No, not necessarily. A wavefront consists of pixels from a single ROP tile in the simplest instance. And that case only gets more probable with smaller triangles (simply because you can easily fill your wavefronts with quads from a single screen tile).

Once one stalls it'll cascade stalls for writes to the others, simply because a hardware thread's export will cover multiple ROP-quads.
Right. But what does that have to do with the assignment of each ROP to a certain rasterizer you suspect for Cypress?

But Fermi shows quite nicely that harsh geometric workloads don't respect the small-scale, fragmented approach of old: hence the multiply-staged and load-balanced processing of geometry and the use of L1/L2 cache as both a shock-absorber and a communications network.
It does?

No, not necessarily. A wavefront consists of pixels from a single ROP tile in the simplest instance.
It was a comparison of load-balancing techniques, staging and granularity of buffering.

Right. But what does that have to do with the assignment of each ROP to a certain rasterizer you suspect for Cypress?
Of course!

It does?
But a screen tile is still at least 8x8 pixels, irrespective of how you interleave the individual tiles.

Hmm, I thought your starting point was the tightest-possible interleaving of ROP tiles, so that a sequence of small triangles lands on as many different ROP-quads as possible.
And in order to maximise the coverage of ROP-quads per triangle, have all ROPs shared by both SEs.
I can confirm too that the HD 6900s are now officially delayed. For detail: 1h prior to the GTX 580 launch, AMD called to reaffirm to us that the launch would go as expected on the 22nd and to give us a date for the press samples. Glad to see that 48h later they suddenly realize that the HD 5800s are selling well and that the HD 5970 is faster than a board they didn't expect to see on the market when they planned the Nov 22nd launch…
I have a feeling that AMD were taken aback (again, as has happened again and again since NV40 and G80) by the GTX 580 results, and now they have to decide on clock speeds to try to be competitive.
That or TSMC screwed them over again.