You failed to pay attention to the emphasis in "right inbetween GF104 and GF110"! It's not a play on words, is it? Otherwise:

So around 425mm^2 :smile:
I'm not worried about stripes versus checkerboard.

Small triangles (which are problematic for both the rasterizers and the ROPs) tend to cluster on the screen. It's therefore preferable to use smaller checkerboard patterns for a more even distribution of the load.
Your example would group 4x4 tiles of 8x8 pixels (at least, maybe you were thinking of stripes) for one rasterizer.
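To make the stripes-versus-checkerboard point concrete, here is a minimal Python sketch of the two tile-to-rasterizer mappings being argued about. The 8x8 tile size and the two-rasterizer count follow the discussion; the cluster coordinates are made up for illustration.

```python
# Illustrative only: map an 8x8-pixel screen tile to one of two rasterizers.
# Tile coordinates (tx, ty) index 8x8-pixel tiles across the screen.

def checkerboard(tx, ty):
    # Neighbouring tiles alternate between the two rasterizers.
    return (tx + ty) & 1

def stripes(tx, ty, tiles_per_stripe=4):
    # Groups of 4 adjacent tile columns (32 pixels) go to the same rasterizer.
    return (tx // tiles_per_stripe) & 1

# A cluster of tiny triangles covering a 2x2-tile neighbourhood:
cluster = [(tx, ty) for tx in (10, 11) for ty in (20, 21)]
for name, f in (("checkerboard", checkerboard), ("stripes", stripes)):
    loads = [sum(1 for t in cluster if f(*t) == r) for r in (0, 1)]
    print(name, "tiles per rasterizer:", loads)
# The checkerboard splits this cluster 2/2; the stripe scheme can give 4/0.
```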
I think the rasteriser is working on 8x2, 4x4 or 2x8 contiguous blocks of pixels, per cycle. That's why I made the tiles 8x8. 8x8 also matches the hardware thread size.

The probability that one rasterizer is completely busy and the second one has nothing to do is much higher in that case (or you would need much larger buffers).
Truthfully, without doing some specific tests, I don't think we'll get a resolution to the ROP mapping question. I don't think it matters, because if that rumour about Cayman is true, ROP mapping becomes totally scalable and independent of SIMD arrangement.

Neighbouring tiles definitely should be assigned to different rasterizers. But one may make the ROP tiles larger than that. It depends a bit on the buffers/color caches there. But generally the same rule should apply.
I just wanted to refer to the "right" in "right inbetween" as neliz has emphasized it. I would do it if I wanted to indicate that it is in fact more on the right side. But that is something I can't imagine right now.

Nope, I didn't fail, but maybe the numbers I used were off? GF110 is ~520mm^2 according to Dailytech and most sites say GF104 is around 330mm^2. Do you have better numbers?
Oh... well, that explains it. Still, I believe that foreigners should stop pretending to speak my language. I want reveals to come from English-language sites first!
Google Translate messes up on my PC.
Yes, that's also the reason I mentioned that I suspect 8x8 Pixel screen tiles before.

I think the rasteriser is working on 8x2, 4x4 or 2x8 contiguous blocks of pixels, per cycle. That's why I made the tiles 8x8. 8x8 also matches the hardware thread size.
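A quick bit of arithmetic behind that, assuming the 16 pixels/clock per-rasterizer rate mentioned elsewhere in this thread and the usual 64-wide hardware thread:

```python
# Rough arithmetic behind the 8x8 tile choice (assumes 16 pixels/clock per rasterizer).
wavefront_size = 64                        # pixels (threads) per hardware thread
tile = (8, 8)                              # screen tile in pixels
scan_footprints = [(8, 2), (4, 4), (2, 8)] # candidate per-clock scan shapes

assert tile[0] * tile[1] == wavefront_size  # one tile fills exactly one wavefront
for w, h in scan_footprints:
    print(f"{w}x{h}: {w*h} pixels/clock -> {wavefront_size // (w*h)} clocks per full 8x8 tile")
# Every footprint is 16 pixels/clock, so a fully covered 8x8 tile takes 4 clocks either way.
```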
Those triangles simply may not be there.

You talk about distributing the load for small triangles, but each SE has its own triangle buffer to feed its own rasteriser. Load-balancing depends merely upon each rasteriser having triangles to rasterise.
Imagine that one rasterizer gets loaded with a huge amount of tiny tris while the other rasterizer has a somewhat easier task because the tris are not evenly distributed on screen. Which is better: only half of the ROPs being tied to an already limiting shader engine and having to do the heavy lifting of handling a lot of small tris (which reduces the effectiveness of the ROPs, especially with MSAA) while the other half sits half idle, or distributing the work of both shader engines over all ROPs to balance it out a bit at the backend?

I don't see how balancing the load on the ROPs comes into this.
It may be easier to implement if you restrict the access from one shader engine to only half of the ROPs/memory interface (one would need two crossbars of half the size instead of a single big one), but it would also be slower. And we do know that the infrastructure has to be there (you can export from each shader engine to an arbitrary memory location). So why shouldn't one use it?

Against all that I can't provide anything, other than to say that to minimise inter-SE contention for ROPs it's easiest to make the ROPs private to SEs, aligning ROPs within screen-space tiles, at least for standard render target operations.
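Just to put numbers on the "two crossbars of half the size" remark: a quick link count, assuming 2 shader engines and 8 ROP/memory-channel groups purely for illustration.

```python
# Illustrative crossbar sizing (2 shader engines, 8 ROP/memory-channel groups assumed).
ses, rop_groups = 2, 8

full_crossbar = ses * rop_groups           # every SE can reach every ROP group
two_halves = 2 * (1 * (rop_groups // 2))   # each SE wired only to its own half of the ROPs

print("full crossbar links:", full_crossbar)   # 16
print("two half crossbars :", two_halves)      # 8
# The private-ROP arrangement needs half the wiring, but an export aimed at the
# "other" half of the screen/memory has no cheap path.
```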
Though one of the main architectural diagrams (press slide) shows Shader Export as a single block - not as two distinct blocks as I would expect, one per SE. It also shows a single UTDP, when in fact there are two.
Do you refer to the rumour that the ROPs go into the render engines?

I don't think it matters, because if that rumour about Cayman is true, ROP mapping becomes totally scalable and independent of SIMD arrangement.
Well, hard luck. You can also write a heavy texturing shader that hits one memory channel hard, leaving the others idle.

Those triangles simply may not be there.
Cypress setup-hierarchical-Z-rasteriser-ROP-tiling is R300 version 5 (or whatever). It was never designed for a wodge of teeny triangles squished into a single tile.

Imagine you have a batch of a lot of tiny triangles sitting very close to each other (as they have to, because they are so small), which directly translates into them sitting in a single screen tile. What happens then is that one rasterizer, and therefore one shader engine, starts to idle as it doesn't get any new tris. Of course, some buffering will lessen the effect, but if the contiguous areas belonging to a rasterizer are smaller, the needed buffer space also gets smaller. It is therefore preferable to have relatively small contiguous areas for the rasterizers, hence my preference for a checkerboard.
Of course you don't want to split triangles unnecessarily between rasterizers (and reduce performance in that case). For this reason the screen tiles should be a bit larger than the possible pixel/clock capacity of each scanline converter. A 1:4 ratio (16 pixel/clock raster, 64-pixel screen tiles) doesn't look too bad to me.
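A rough sanity check of that 1:4 ratio, under the simplifying (and admittedly unrealistic) assumption that a small triangle's bounding box lands at a uniformly random pixel position:

```python
# Back-of-the-envelope: chance a small triangle's bounding box straddles a tile edge,
# assuming a w x h bounding box dropped at a uniformly random integer offset
# (a simplification - real triangle distributions differ).
def straddle_probability(w, h, tile=8):
    # Offsets (mod tile) for which the box stays inside a single tile:
    fit = max(tile - w + 1, 0) * max(tile - h + 1, 0)
    return 1.0 - fit / (tile * tile)

for size in (2, 4, 8):
    print(f"{size}x{size} bbox: {straddle_probability(size, size):.0%} chance of crossing an 8x8 tile boundary")
# The smaller the triangle relative to the 64-pixel tile, the less often it straddles
# a tile and has to be split between rasterizers.
```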
Imagine that one rasterizer gets loaded with a huge amount of tiny tris while the other rasterizer has a somewhat easier task because the tris are not evenly distributed on screen. Which is better: only half of the ROPs being tied to an already limiting shader engine and having to do the heavy lifting of handling a lot of small tris (which reduces the effectiveness of the ROPs, especially with MSAA) while the other half sits half idle, or distributing the work of both shader engines over all ROPs to balance it out a bit at the backend?
In this scenario one SE is trying to throttle pixel export in response to ROP queue-depth. But each SE doesn't know what the other is doing. So when one throttles, the other gets greedy. And the first then runs dry. Anti-load-balancing.

As you said, the crossbar between shader export and ROPs is there. Wouldn't it make much more sense to avoid the raster/ROP "aliasing" as a potential pitfall? The Fermis do it that way and I see no reason it should be much different in Cypress.
It may be easier to implement if you restrict the access from one shader engine to only half of the ROPs/memory interface (one would need two crossbars of half the size instead of a single big one), but it would also be slower. And we do know that the infrastructure has to be there (you can export from each shader engine to an arbitrary memory location). So why shouldn't one use it?
No, nothing in particular.

Do you refer to the rumour that the ROPs go into the render engines?
I see that as unlikely for a long time.

I don't see how it would be more independent of the SIMD arrangement than it probably is now. Actually quite the opposite. But it would add some flexibility to scale the ROP count really independently of the memory controller width. But one would still need a similar crossbar as today connecting all those outputs from the render engines to the memory channels (and since blending is a read-modify-write of the destination, that would require twice the throughput of today!). This change would be more a theoretical advantage than a real one.
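A small sketch of why blending is the problem case for that crossbar, assuming plain 4-byte RGBA8 pixels and no framebuffer compression:

```python
# Why blending roughly doubles crossbar traffic if the ROPs move to the render-engine side
# (assumes 4-byte RGBA8 pixels, no framebuffer compression, purely for illustration).
bytes_per_pixel = 4

# ROPs next to the memory controllers: only the shaded colour crosses the crossbar;
# the destination read and the final write stay local to the memory channel.
rop_at_mc = bytes_per_pixel

# ROPs inside the render engines: the destination must be read across the crossbar
# and the blended result written back across it.
rop_in_engine = bytes_per_pixel + bytes_per_pixel

print("crossbar bytes per blended pixel:", rop_at_mc, "vs", rop_in_engine)
```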
Are these specs possible... 6GHz GDDR5!
It's from the e-mail they sent: "the NDA lift for information relating to the AMD Radeon HD 6950 and HD 6970 will be the week of December 13th."
But that is similarly difficult, as textures are tiled over all memory channels too. But it would get simpler and thus more probable if one would use huge tiles with a simple division or striping instead of the space-filling curves which are apparently used.

Well, hard luck. You can also write a heavy texturing shader that hits one memory channel hard, leaving the others idle.
I don't get your scenario.

In this scenario one SE is trying to throttle pixel export in response to ROP queue-depth. But each SE doesn't know what the other is doing. So when one throttles, the other gets greedy. And the first then runs dry. Anti-load-balancing.
Me too.

I see that as unlikely for a long time.
Don't worry, we should be launching an English version of Muropaketti (a box of cereal in English) shortly. I added a direct English quote from AMD to the original news post to avoid confusion in the Google translation:
"Please find an announcement from AMD below regarding the change of Cayman NDA:
Demand for the ATI Radeon HD 5800 series continues to be very strong, the ATI Radeon HD 5970 remains the fastest graphics card in the world and the newest members of the AMD graphics family, the AMD Radeon HD 6850 and HD 6870, have set new standards for performance at their respective price points, and are available in volume.
With that in mind, we are going to take a bit more time before shipping the AMD Radeon HD 6900 series. As of today, the NDA lift for information relating to the AMD Radeon HD 6950 and HD 6970 will be week 50. We will be providing additional information on these products, including the exact date and time of the NDA lift, in the weeks prior to launch."
http://plaza.fi/muropaketti/muropak...a-6970-naytonohjaimet-julkaistaan-viikolla-50
Yes, and that problem's easier to solve, because they're read-only, etc.

But that is similarly difficult, as textures are tiled over all memory channels too. But it would get simpler and thus more probable if one would use huge tiles with a simple division or striping instead of the space-filling curves which are apparently used.
Well, that's a purely binary approach: full-speed or stalled.

I don't get your scenario.

I would think that nothing gets throttled until the queues (in the ROPs) are full.
Once one stalls it'll cascade stalls for writes to the others, simply because a hardware thread's export will cover multiple ROP-quads.

After that it is a matter of arbitration, but the ROPs work as hard as they can either way with no unnecessary idle time. And it's not the export of one render engine as a whole that stalls, but more probably just the export to the ROP which gets too much work. That's also a reason one should split the output of each render engine over all ROPs. It simply reduces the chance of congestion.
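A toy illustration of that congestion argument, with made-up numbers:

```python
# Rough illustration: if one screen region is hot, a render engine whose exports fan
# out over all ROPs sees the backlog spread thinner than one tied to only half of them.
# (Numbers are made up; this only shows the proportionality.)
hot_pixels = 10_000     # pixels exported into the hot region per unit of time
rops_total = 8

private = hot_pixels / (rops_total // 2)  # the hot engine limited to its own half of the ROPs
shared  = hot_pixels / rops_total         # hot exports spread over every ROP

print(f"queue growth per ROP: private={private:.0f}, shared={shared:.0f}")
# Spreading each engine's output over all ROPs halves the per-ROP backlog for the same work.
```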
The question boils down to how many parameters you want to put into the load-balancing formula - and how far across the chip you are going to macro-load-balance. Tiling and hierarchical-tiling approaches localise load-balancing because it's practical, and the workloads they're designed for make good use of the whole chip.

Actually, the best would be to additionally redistribute the pixel shader load over all shader engines after the rasterizers, i.e. not tie a shader engine to a rasterizer. That should show even better scalability, more closely resembling that Pomegranate concept. But that is probably a bit further into the future.
No, not necessarily. A wavefront consists of pixels from a single ROP tile in the simplest instance. And that case only gets more probable with smaller triangles (simply because you can easily fill your wavefronts with quads from a single screen tile).

Once one stalls it'll cascade stalls for writes to the others, simply because a hardware thread's export will cover multiple ROP-quads.
Right. But what does that have to do with the assignment of each ROP to a certain rasterizer you suspect for Cypress?

But Fermi shows quite nicely that harsh geometric workloads don't respect the small-scale, fragmented approach of old: hence the multiply-staged and load-balanced processing of geometry and the use of L1/L2 cache as both a shock-absorber and a communications network.
It does?

No, not necessarily. A wavefront consists of pixels from a single ROP tile in the simplest instance.
It was a comparison of load-balancing techniques, staging and granularity of buffering.

Right. But what does that have to do with the assignment of each ROP to a certain rasterizer you suspect for Cypress?
Of course!

It does?
But a screen tile is still at least 8x8 pixels, irrespective of how you interleave the individual tiles.

Hmm, I thought your starting point was the tightest-possible interleaving of ROP tiles, so that a sequence of small triangles lands on as many different ROP-quads as possible.
And in order to maximise the coverage of ROP-quads per triangle, have all ROPs shared by both SEs.
I can confirm too that the HD 6900s are now officially delayed. For detail: 1h prior to the GTX 580 launch, AMD called to reaffirm to us that the launch would go as expected on the 22nd and to give us a date for the press samples. Glad to see that 48h later they suddenly realize that the HD 5800s are selling well and that the HD 5970 is faster than a board they didn't expect to see on the market when they planned the Nov 22nd launch…
I have a feeling that AMD were taken aback (again, as has happened again and again since NV40 and G80) by the GTX 580 results, and now they have to decide on clock speeds to try to be competitive.
That or TSMC screwed them over again.