AMD: R9xx Speculation

They are the monsters with 100 arms and NH did talk about a codename "Radeon 100". Hm.


100 hands and 50 heads.

Hundred hands... maybe they want people to assume they are staying with their current shader setup, since every hand has 5 fingers?
 
You are really speculating on the type and number of SPs using the description of an old Greek giant?
 
Of course. They also had one hundred eyes per ogre. Maybe Eyefinity 2.0 will support up to 100 displays
 
Fudo is kind of right..

There's refreshes and new architecture.

Are you implying Heca' is the name for the 32nm refresh of Evergreen?

And the ghosts that chased them were Inky, Pinky, Blinky and Clyde. Clyde was replaced in the mid-cycle refresh by Sue.

-Charlie

Touché, thrust, and parry, Mr. D.

When you go to Greek mythos everything is related, and it all seems to come in threes: the three main winds, the three... That just seemed like a logical guess, even if it's not even semi-accurate.


Wait, are you telling me Fudo is wrong and the next-gen architecture is named Pac-man and there are five chips, with RV970 being Clyde and Rv990 being Sue?! Quick, someone inform VR-ZONE to post an update!

;)
 
I don't have much knowledge, but I can't see ATI giving up on the "SIMDness" of their architecture. They have a compute density advantage over every other actor (Nvidia and Larrabee), and I can't see them taking that much risk when they don't need to.
Couldn't they just adjust their architecture to work at a granularity of 16 instead of 64? Would that really be that costly? Keep 8 "threads" active per SIMD instead of 2 (2 threads per quad of VLIW units)?
They could rework how the thread dispatcher works, add more arbiters/sequencers per SIMD, add 3 more branch units per SIMD, etc. (By the way, the 4 quads of shader cores would no longer qualify as a single SIMD, but keeping the physical organization seems a good idea, as it allows them to achieve pretty high density.) They may lose some density, but they have room, no? Or would this be awfully complicated to achieve?
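To sanity-check the "8 threads instead of 2" figure, here is the arithmetic I'm assuming, as a small sketch (the 16-wide SIMD and the batch sizes are the only real numbers in there; everything else is hand-waving):

```python
# Latency hiding needs the same number of work-items in flight per SIMD,
# regardless of batch size; smaller batches just means more of them.
simd_width = 16            # VLIW5 units per SIMD (RV770/RV870 style)
items_in_flight = 64 * 2   # today: two 64-wide batches ping-ponging per SIMD

for batch in (64, 16):
    batches_needed = items_in_flight // batch      # 2 today, 8 with 16-wide batches
    cycles_per_instr = batch // simd_width         # 4 clocks today, 1 with 16-wide batches
    print(f"batch {batch:2d}: {batches_needed} batches in flight, "
          f"{cycles_per_instr} clock(s) of the SIMD per instruction")
```

So the sequencers would have to juggle four times as many batches, which is exactly where the extra arbiters/sequencers above would come in.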
As my wife is out with her friends, I'm left alone speculating without much knowledge to play with... Anyway, shit happens...

So, in my late-evening loneliness and "armchair" status, the next ATI could be something like 4 arrays of 8 "SIMDs" (not accurate wording) working on batches of 16 elements.
They may do as Intel and give up on "specific" caches, going with bigger L1 and L2 caches and focusing on bandwidth between SIMD <=> L1 <=> L2. They could also give up on the RBEs, or attach their "remains" to a SIMD (as with the texture units). While doing all this they should try (if possible; I've basically no clue) to stay close to their "sweet spot" strategy and keep the chip a bit under 300mm².
I'm sure marketers could claim more than one hundred "compute/logic/whatever cores" based on the number of quads in the chip (32x4).
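A rough count for that layout, just to show where the "more than one hundred cores" claim comes from and how much raw compute it would give (the HD 5870-like 850 MHz clock is purely my assumption):

```python
# Hypothetical layout: 4 arrays x 8 SIMDs, each SIMD made of 4 quads of VLIW5 units.
arrays          = 4
simds_per_array = 8
quads_per_simd  = 4
alus_per_quad   = 4 * 5                      # 4 lanes x 5 ALUs (VLIW5)

simds  = arrays * simds_per_array            # 32
quads  = simds * quads_per_simd              # 128 -> the "hundred-plus cores" marketers could claim
alus   = quads * alus_per_quad               # 2560
tflops = alus * 2 * 0.85 / 1000              # 2 flops/ALU/clock (MAD) at an assumed 850 MHz
print(quads, alus, round(tflops, 2))         # 128, 2560, ~4.35
```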

But I'm a fool anyway.
EDIT: I forgot to add four rasterizers and two triangle setup units :LOL:
 
I have severe doubts there's any real use for 4 rasterizing units and 2 setup units anytime soon (but would of course stand corrected).

A little birdie told me that in order to hit a setup limit on a 5870 you'd need over 14M polys per frame, which doesn't sound to me like there's a need for higher setup rates in the foreseeable future.

As for rasterizing units, as far as I've understood it, when you have non-overlapping triangles within tiles, tiles get split up between the two rasterizing units. For 4 units to make sense, my understanding is that the amount of non-overlapping geometry would have to be idiotically high.
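For what it's worth, here's where a figure in that ballpark could come from, assuming the usual 1 triangle per clock at the 5870's 850 MHz and a 60 fps target (my assumptions, not the birdie's):

```python
# Setup-limited triangle budget per frame under the assumptions above.
setup_rate_per_s = 850e6 * 1        # 850 MHz core clock * 1 triangle/clock
fps = 60
print(setup_rate_per_s / fps / 1e6) # ~14.2 million triangles per frame
```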
 
I could see them going 4-wide (xyz + t) with 40-bit FP units so they can all do 32-bit int and increase the likelihood of 100% unit utilisation.

Reading the recent B3D RV740/870 articles, it seems that a lot of the time, even where the code parallelism is there to go 5-wide, there is no register port to feed the t unit, so why not dump one of the 4 skinnies and save a bit on scheduling/die space/compiler complexity?

Maybe more than one t unit to enable full double-precision FP math capability?
ttt may be doable if a t unit is twice the size of a skinny.

I guess if the t unit is significantly more than twice the size of a skinny unit, it's not going to be worthwhile losing so much potential processing power though.
 
I have severe doubts there's any real use for 4 rasterizing units and 2 setup units anytime soon (but would of course stand corrected).

A little birdie told me that in order to hit a setup limit on a 5870 you'd need over 14M polys per frame, which doesn't sound to me like there's a need for higher setup rates in the foreseeable future.

As for rasterizing units, as far as I've understood it, when you have non-overlapping triangles within tiles, tiles get split up between the two rasterizing units. For 4 units to make sense, my understanding is that the amount of non-overlapping geometry would have to be idiotically high.
It was half a joke to add some cheese, and also based on comments I read here. It seems that some members would have happily welcomed a faster setup rate.
Basically my idea was to push the "x2 mantra" further: the RV8xx has one setup unit and two rasterizers (from my understanding they are tied to their own SIMD array... I could be completely wrong, as I must read Alex's article about the RV800 again). If their next chip were made out of four SIMD arrays, I think it could make sense.
About having more tiles I'm not sure overlapping geometry would be that much of a concern.
You would have quite some resources to deal with it, as compute power grows faster than the "rest". A chip like the one I describe would have a max throughput of ~4.3 TFLOPS (say it runs at the same speed as the HD 5870; it's basically 2.7/20x32), so the chip could keep up with the extra work.

But yes, there is something I don't get about the setup rate and why some members wanted it to be faster. If we assume a resolution of 1080p, that's ~2 million pixels per frame, and if triangles tinier than four pixels destroy quad efficiency/utilisation, you would think that having more than 500,000 triangles per frame is a waste. If I use that as a max figure, that's about 15 million triangles/s @30fps, 30 million @60fps. Basically something doesn't add up, so I must be missing some bottlenecks or problems...
Alex managed to get the RV7xx to set up 485 Mtris/s while dealing with a 3,651,324-triangle mesh.
If some members could chime in, it would be nice :)
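Here's that budget arithmetic spelled out, with the ~850 Mtris/s setup figure as my own ballpark for a 1 tri/clk part at 5870 clocks:

```python
# Naive per-frame triangle budget at 1080p if triangles stay quad-friendly (>= 4 px).
pixels_per_frame = 1920 * 1080                         # ~2.07M
tris_per_frame   = pixels_per_frame / 4                # ~518k

for fps in (30, 60):
    print(f"{fps} fps: {tris_per_frame * fps / 1e6:.1f} Mtris/s needed")

print("vs. ~850 Mtris/s of raw setup throughput")       # which is why it doesn't add up
```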

EDIT
I've come to think about something (probably dumb... but I'd rather it come across as a bit too candid...).
I read a presentation from Epic about what they could plan for their next engine, and they speak of using something close to REYES. They were talking about using 4 triangles per pixel, and they could end up with 4-million-triangle characters. Now that's a lot. Do you think it could be clever, instead of breaking quad efficiency and no longer working on groups of at least four pixels, to generate a lot more fragments?
I mean, the SIMDness of GPUs is what has allowed them to reach such high throughput so far; breaking the SIMD is likely to bring a lot of problems on the hardware side. You would end up dealing with a hell of a lot more discrete units, a lot of communication/synchronization burden, a lot of extra logic, etc. Compute density may end up suffering a lot. Wouldn't it be better to have setup work in such a way that, when it encounters tiny triangles, it acts as if the resolution were higher (step by step), until the triangle "resolves" into enough fragments to not destroy quad efficiency/utilization?
That's a lot of extra work for some pixels (like 16x SSAA in Epic's case), but not all of the scene will be made out of such a dense mesh, and the technique would not break efficiency for the rest of the frame. My idea (which could be completely stupid, I concede) is that when dealing with tiny triangles, instead of wasting resources (improperly fed quads), why not do some useful extra work instead (generating more and tinier fragments), especially if you have resources to spare?
In that case a faster setup rate and faster rasterizers could make sense. It would also have the advantage that the architecture could work the "classic" way for less tortured meshes.

(By the way, I'm speaking of Epic here, but this may have some relevance if tessellation becomes big in the real-time rendering realm.)
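Just to put numbers on the quad-efficiency problem (a toy model; real coverage depends on how triangles straddle quad boundaries):

```python
import math

# Pixel shading is dispatched in 2x2 quads, so a triangle covering very few pixels
# still lights up a whole quad's worth of lanes.
def quad_utilisation(px_per_tri):
    quads = max(1, math.ceil(px_per_tri / 4))   # quads the triangle's fragments occupy
    return px_per_tri / (quads * 4)

for px in (16, 4, 1, 0.25):                      # 0.25 ~ Epic's 4 triangles per pixel
    print(f"{px:>5} px/tri -> {quad_utilisation(px) * 100:.0f}% of shaded lanes useful")
```

So with Epic-style micro-triangles most of the shading work would be masked-off lanes, which is basically what the "supersample the tiny triangles" idea above is trying to recover.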
 
I think the order to get setup limited was >14M Polys / frame @60fps.

About having more tiles I'm not sure overlapping geometry would be that much of a concern.

If I start thinking of macro & micro tiles, possible tile sizes per case, triangle size and what not my poor layman's head will explode.

In case you'd want to pull an evil joke on someone, you could remind him that a Hemlock will eventually have 4 rasterizing units altogether. Just don't mention that there's a difference between SFR and AFR :devilish:
 
You guys don't hang around. :LOL:

So can we expect a "double digit" (i.e. 10+) teraflop single-board (dual GPU) design by the end of 2010 then? Crazy to think we're talking in those sorts of terms already. How long do people think this doubling of compute capability every year from ATI can continue?

What fixed-function units will be next for the axe? I found it quite fascinating how removing some fixed-function hardware (texture interpolators) in RV870 removed a bottleneck; any other strong candidates that may give the same benefits?
 
You guys don't hang around. :LOL:

So can we expect a "double digit" (i.e. 10+) teraflop single-board (dual GPU) design by the end of 2010 then? Crazy to think we're talking in those sorts of terms already. How long do people think this doubling of compute capability every year from ATI can continue?

As long as TSMC keeps pushing manufacturing process nodes that allow for doubling of transistor density ;)

What fixed-function units will be next for the axe? I found it quite fascinating how removing some fixed-function hardware (texture interpolators) in RV870 removed a bottleneck; any other strong candidates that may give the same benefits?

I'd say Larrabee's a good example of what a future GPU may look like as it's a very forward-looking design. The only fixed function hardware are the texture units. I don't know if interpolation is handled by these units or if it's done in the SIMDs, though. ROPs may be the next thing to disappear from future NV/ATi designs (note they're already not in Larrabee), but they are pretty cheap from a die area perspective so maybe they'll stick around after all.
 
As long as TSMC keeps pushing manufacturing process nodes that allow for doubling of transistor density ;)

That makes enough sense, I guess. :smile:

Heck, if chopping away fixed-function units is the way to go (and by all accounts it is, to some degree at least) then perhaps we'll be lucky enough to see a 3x increase! An 8 teraflop single-GPU solution for less than $400 in 2010? Yes, please! :D Oh, and I know, I know, raw flops don't mean all that much in the grand scheme of things, there's a hell of a lot more to it, but it is fascinating to see the crazy pace at which that peak number is rising.

Whilst I've been mighty impressed with what RV870 can achieve with its modest increase in memory bandwidth, surely this is going to become an increasing issue with the math side of things rising at a much faster rate? Any chance ATI could go with something more akin to PowerVR's tile-based deferred rendering method in future generations to alleviate this problem? Or do we expect it to be left up to the developer to come up with lots of fancy new rendering methods? I suppose removal of the ROPs would speed this along, but if their area cost isn't very large anyway, perhaps it doesn't make sense to potentially jeopardise performance in current games.
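To put the worry in numbers, using the commonly quoted peak figures for the last two generations (rough, and peak numbers at that):

```python
# Peak ALU throughput vs. memory bandwidth, board-level quoted figures.
boards = {
    "HD 4870 (RV770)": (1.20e12, 115.2e9),   # FLOPS, bytes/s
    "HD 5870 (RV870)": (2.72e12, 153.6e9),
}
for name, (flops, bw) in boards.items():
    print(f"{name}: {flops / bw:.1f} flops per byte of bandwidth")
# ~10.4 -> ~17.7 flops/byte: the math really is outpacing the bandwidth.
```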
 
Whilst I've been mighty impressed with what RV870 can achieve with its modest increase in memory bandwidth, surely this is going to become an increasing issue with the math side of things rising at a much faster rate? Any chance ATI could go with something more akin to PowerVR's tile-based deferred rendering method in future generations to alleviate this problem?

As long as faster DDR exists and bus widths can be scaled, why would they? The spots where you see a 512-bit bus are the exception rather than the rule for the time being.

Apart from that, as long as PowerVR is stuck in embedded environments/SoCs, there's no real comparison or even "threat" to the other IHVs.

Some would expect someone like me to evangelize PowerVR's TBDR to the point of it being the ultimate solution for bandwidth headaches. I won't though, because I merely consider it a very interesting alternative with its own advantages, in the exact same way as IMRs have their own advantages too in my mind.

Today they haven't even managed to re-enter the PC space, even if it would just be as an SoC, and I don't think Intel is all that likely to go that far. It's perfectly fine to have the currently used IP tied up in netbooks only, with ridiculously underperforming drivers. It's pretty convenient too.
 
I think the order to get setup limited was >14M Polys / frame @60fps.
That's when you're 100% setup limited, i.e. none of your triangles are big.

The reality is that there's always a wide distribution of triangle:pixel ratios, and setup limits the parts with long stretches of tiny/clipped/culled triangles. So if you have a scene with 6M triangles that you're rendering at 60 fps, but only 1M of those triangles are generating 95% of the pixel load (requiring 12 ms) and 4M triangles are in a bunch of setup-limited clumps (requiring 4.7 ms), then doubling your setup speed will give you a 16% performance boost.

That's nothing to sneeze at, particularly if it only costs 5% more silicon (100M transistors, the size of the 1 tri per clk R300 ;) ) and 2-3% more total video card cost.
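A quick sketch reproducing those numbers (the frame breakdown is from the post above; the ~850 Mtris/s setup rate is my assumption for a 1 tri/clk part at 5870 clocks):

```python
# Frame time with and without doubled setup throughput.
pixel_limited_ms = 12.0                     # 1M "big" triangles doing 95% of the pixel work
setup_rate = 850e6                          # triangles per second (assumed)
setup_limited_ms = 4e6 / setup_rate * 1e3   # 4M tiny/clipped/culled triangles -> ~4.7 ms

base   = pixel_limited_ms + setup_limited_ms        # ~16.7 ms/frame
faster = pixel_limited_ms + setup_limited_ms / 2    # double the setup speed
print(f"{base:.1f} ms -> {faster:.1f} ms, i.e. a {base / faster - 1:.0%} speedup")
```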
 
Could it be that a merely refreshed DX11 part (RV870) will carry the abandoned R900 name? Shouldn't we look at what plant, coral or stock bread AMD is now thinking about? At least with their PR naming they don't need anything new in their next-gen architecture; they could name it Fermi+ and they would still be pretty original, just like the AXP/A64+++ series :rofl:

Somehow I doubt R900 is a brand-new architecture; more likely a lightweight, pure VLIW design on the R600 heritage, without unnecessary things like RBEs and TMUs, maybe with DX12 compliance. So the "just a DX11 refresh" guess drops out here. And it will be all that R800 with DX11 compliance didn't need to be. Probably 256-bit wide SIMD units with 2x 32-bit wide SFUs for transcendentals (instead of just one in R600-based cores). Or maybe fully armed 512-bit wide SIMDs :LOL: Meh, that's overkill IMHO and certainly not the way to go. So while we currently have only 20 cores x 16 SIMDs x 5 SPs, we could have 32 SIMDs x 8 SPs per core and, let's say, 32 cores (an extra 8 cores included to substitute for the missing ROP functionality) = 8192 SPs.

I don't know where they'll fit it, but considering RBEs & TMUs take up 40% of the die space in RV770, and with triple-packed SPs per core, I guess they could fit it pretty nicely under 300mm² @28nm and with a 1.2-1.5GHz clock rate. Especially if we assume that 28nm will have extremely reduced leakage, considering even the 32nm shrink claims almost 40% less leakage than TSMC's 45nm process.
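Spelling out that SP count (labels follow the post's use of "core"; everything here is hypothetical):

```python
# Cypress today vs. the speculated layout above.
cypress_sps  = 20 * 16 * 5    # 20 "cores" x 16 SIMDs x 5 SPs = 1600
proposed_sps = 32 * 32 * 8    # 32 "cores" x 32 SIMDs x 8 SPs = 8192
print(cypress_sps, proposed_sps, proposed_sps / cypress_sps)   # 1600, 8192, 5.12x
```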

R600 was delayed because of a bug in the 90nm design library. Not at all ATi's fault.

And it wasn't even designed for 90nm. Great AMD fog unit execution at their prime :LOL:
 
The mainstream chips of R9xx will probably be built for Fusion, so they will use an SOI manufacturing process, but I have no idea how this could impact the architecture.

A long time ago they announced Fusion would be based on the R600 architecture. True, that was long before RV770 came to life, and they delayed Fusion from H1 2009 to H2 2010, and now it's announced for somewhere in 2011 AFAIR :oops: But considering that ATi implements last-gen graphics features in their top-notch chipsets - the 790GX featured only DX10 (RV610-based) in the RV770 age, and a month before the DX11 Cypress they released the DX10.1-only 785G (RV615-based?) - I'd bet we won't see anything of their fancy brand-new architecture in Fusion One :rofl: in H1 2011, and we could be pretty happy if we saw an R600-based descendant with DX11 support, aka an RV840-based product (128-bit dedicated memory bus), fused with 4 'Dozer cores, hopefully over an HT bus (wishful thinking).
 