AMD: R9xx Speculation

Jawed · Jul 19, 2010

CarstenS said:
Tell me, did we leave "the early days" of DX10's Geometry Shader? (I don't have to spell out the analogies here, do I?)

And the analysis wasn't much good back then, either. See a trend?

Apparently, people associate bad-ass slowness with "in software", and that's what I was talking about, not whether or not Fermi might use transistors for stuff other than the tessellation stage.

Well, the idea that AMD would be misled by NVidia doing software tessellation at any time during the development of Evergreen is laughable. AMD expecting to be ahead of NVidia? Well, not only is this the natural expectation but, until Fermi's architecture was revealed, it was true, as AMD was pretty boastful about tessellation.

Frankly, I have no idea what you're saying here. Which difference? Whose raster perf is poor? And which z-rates are poorly utilized?

Rasterisation rate in RV770 versus GT200 (and 64-fragment hardware threads are just less efficient than 32-fragment hardware threads when triangles are small) and very high Z-rate in GT200.

Since only multiplying setup/rasterizer doesn't get you anywhere if you do not also reinforce the necessary infrastructure… And to do it properly, you'll have to walk the painful way, I guess.
Anyways, I had the same question and they said: 10% more compared to an approach analogous to GT200/RV790.

Well, I still can't rationalise that 10% into anything meaningful. Still haven't really woken up though.

You forgot one very important key change: the number of units. Granted, it's a rather obvious thing, but if you have a performance, cost and yield target, you also have to factor in exactly how many of the engineers' dreams you can incorporate into the new design in order to meet these goals.

Yeah.

It occurred to me, afterwards, that 1. might have suffered due to 40nm woes merely as a risk factor - i.e. if it wasn't right could it be fixable with metal or would it really spoil things? So just keeping change minimal.

The D3D11 functionality in 1. was met and a few sharp corners were rounded off.

Also there's the chance that some changes in 1. would align with changes in 2., 3. and 4.

Jawed

Jawed · Jul 19, 2010

3dcgi said:
Only the pixel shader thread generation is bound by rasterization. Compute is not.

What is the rate of work item generation?

Jawed · Jul 19, 2010

AlexV said:
Patents are nice and all, but their materialization in hardware is an unknown quantity.

In this case the setup-rasteriser architecture documented and described by AMD staff matches the patent documents:

http://forum.beyond3d.com/showthread.php?p=1383812#post1383812
http://forum.beyond3d.com/showthread.php?p=1374951#post1374951

nVidia tends to recommend the same thing with regards to not going under 8-pixel triangles in general. Does that mean they're small-triangle unfriendly? (Hint: Fermi's epic setup rate happens with really small triangles, so if that were the only consideration, that'd be what they'd want always.)

The difference is that full speed rasterisation in ATI is >=16 fragments per triangle (if both rasterisers are working on different triangles) but it's >=8 in NVidia.
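As a rough way to read those two thresholds, here is a toy model. It is an assumption for illustration, not a description of either chip: a rasteriser spends at least one clock on each triangle and reaches full speed at the fragment counts quoted above. The function name and the triangle sizes are made up for the example, and quad granularity is ignored.

```python
# Toy model (assumption): one triangle per clock at best, and full fragment
# throughput only once a triangle covers `full_rate_fragments` fragments.
def raster_efficiency(fragments_per_triangle: int, full_rate_fragments: int) -> float:
    """Fraction of peak fragment rate achieved for triangles of a given size."""
    if fragments_per_triangle <= 0 or full_rate_fragments <= 0:
        raise ValueError("fragment counts must be positive")
    return min(fragments_per_triangle, full_rate_fragments) / full_rate_fragments


ATI_FULL_RATE = 16     # threshold quoted in the post for ATI
NVIDIA_FULL_RATE = 8   # threshold quoted in the post for NVidia

if __name__ == "__main__":
    for size in (1, 4, 8, 16, 32):
        ati = raster_efficiency(size, ATI_FULL_RATE)
        nv = raster_efficiency(size, NVIDIA_FULL_RATE)
        print(f"{size:>2} fragments/triangle: ATI {ati:.2%}, NVidia {nv:.2%}")
```

Under this simplification an 8-fragment triangle runs at full rate against the lower threshold but only half rate against the higher one.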

Do you base your definitive statement about what happens the instant a triangle falls under a particular area on actual experience with the hardware? If so, please detail it, because that's definitely what I (and others) are seeing in practice.

No. Where's your documented experience? (And I presume that last clause has inverted meaning from your intention.)

I'm talking a stream of small triangles that fit entirely within a screen-space tile (obviously some will straddle tiles).

Buffering in Cypress in combination with a mixture of triangle sizes allows a rasteriser to spend more than one cycle on a triangle. Once that starts then rasterisation in the two rasterisers can be simultaneous even when the triangles don't straddle a tile. None of which helps with a stream of tiny, ~1 pixel, triangles.

Tessellation means a shitload more than setup - raster,

Did I say otherwise?

and there are multiple potential sticking points with regards to data flow that can and apparently do hamper Cypress performance, or rather, expose the parts where it has a less graceful performance decline compared to Fermi.

Performance cliff.

I don't think that we should be using SDK samples that are written for readability rather than performance, and which do quite a few things, to underline specific architectural traits. If we want to talk about setup/raster, we need to use something that isolates those portions as well as possible, and start from there.

Intrigued to see what your next round of testing shows.

Jawed

ShaidarHaran · Jul 19, 2010

Novum said:
It's very unlikely that existing FS do that, because you couldn't render with DP precision until very recently.

You are correct. Consumer FS absolutely do not use double precision. Earlier contrary posts were likely speculation.

Gipsel · Jul 19, 2010

Jawed said:
Start here:

http://forum.beyond3d.com/showthread.php?p=1416950#post1416950

quite a few posts. There's a lot of "new functionality" in Cypress's ALUs - it looks like functionality that's almost, but not quite, enough to ditch T - well that's my interpretation anyway.

Another possibility is XYZT

Actually, I just had another look at those strings in the driver. And it looks like I misinterpreted that a bit last year. The wording changed and is quite unambiguous now ("HW doesn't support trans unit slot" instead of "An ALU instruction was issued to the scalar slot. This feature is scheduled for removal in Northern Islands!" in Cat 9.8).
So it really looks like NI is a pure 4 slot design. But the 4 slots are not created equal: the w (merged with t?) unit can't process a few instructions. I don't know which ones are missing, but generally it looks almost like the t slot today.

By the way, what do you expect from things like "SET_PRIORITY__NI", "INTERRUPT_AND_SLEEP__NI" or "INTERRUPT__NI"? A step in the direction of better managed multitasking on a GPU? And NI gets int24 multiplication in the ALUs, not only uint24 like Evergreen (which is still not usable from IL, is it?).

Edit:
Anyone heard of some "Sumo" and "Wrestler" codenames in connection with fusion?

neliz · Jul 19, 2010

Gipsel said:
Edit:
Anyone heard of some "Sumo" and "Wrestler" codenames in connection with fusion?

Why the rolleyes? That should be the codenames for the graphics part in Ontario.

Sumo and Wrestler; the latter, which seems to be a newer (more high-end?) part, is 9802.
They were "leaked" in April as part of the Beta Catalyst drivers for June.

"SUMO 9640" = ati2mtag_Sumo_Desktop, PCI\VEN_1002&DEV_9640
"SUMO 9641" = ati2mtag_Sumo_Mobile, PCI\VEN_1002&DEV_9641
"SUMO 9642" = ati2mtag_Sumo_Desktop, PCI\VEN_1002&DEV_9642
"SUMO 9643" = ati2mtag_Sumo_Mobile, PCI\VEN_1002&DEV_9643
"SUMO 9644" = ati2mtag_Sumo_Desktop, PCI\VEN_1002&DEV_9644
"SUMO 9645" = ati2mtag_Sumo_Mobile, PCI\VEN_1002&DEV_9645

http://www.rage3d.com/board/showthread.php?t=33962012

Gipsel · Jul 19, 2010

neliz said:
That should be the device IDs

Sumo 9640, 9642 and 9644 and Wrestler 9802

Ontario should be Sumo based, 9641, 9643 and 9645.

They were "leaked" in April as part of the Beta Catalyst drivers for June.

http://www.rage3d.com/board/showthread.php?t=33962012

Thanks, I missed that. I've just seen "Fusion SUMO" and "Fusion WRESTLER" in the release 8.74 package (Cat 10.6, it's not in the .inf files there) and wondered if they were known already.

3dcgi · Jul 20, 2010

Jawed said:
Rasterisation rate in RV770 versus GT200 (and 64-fragment hardware threads are just less efficient than 32-fragment hardware threads when triangles are small) and very high Z-rate in GT200.

The wavefront/warp width doesn't have anything to do with rasterization efficiency, just with pixel shader efficiency with divergent code.

3dcgi · Jul 20, 2010

Jawed said:
What is the rate of work item generation?

This info should be public, but since I don't know for sure I'll leave it for someone else to answer.

Jawed · Jul 20, 2010

3dcgi said:
The wavefront/warp width doesn't have anything to do with rasterization efficiency, just with pixel shader efficiency with divergent code.

That would only be true if a hardware thread can be fully populated with quads from other triangles when one triangle doesn't entirely fill it.

If that's true then in the worst case with small triangles it's going to take 16 cycles to fill a hardware thread on ATI, because the rasteriser handles one triangle at a time. Short shaders make this worse.
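The 16-cycle figure can be reproduced with a little arithmetic. This is only a sketch under stated assumptions: fragments are packed as 2x2 quads, each tiny triangle contributes a single quad, and the rasteriser emits one triangle per cycle. The function and parameter names are hypothetical.

```python
import math

QUAD_FRAGMENTS = 4  # assumption: fragments are shaded in 2x2 quads


def cycles_to_fill(thread_width: int,
                   quads_per_triangle: int = 1,
                   triangles_per_cycle: int = 1) -> int:
    """Cycles needed to gather enough quads to fill one hardware thread."""
    quads_needed = thread_width // QUAD_FRAGMENTS
    quads_per_cycle = quads_per_triangle * triangles_per_cycle
    return math.ceil(quads_needed / quads_per_cycle)


if __name__ == "__main__":
    # 64-fragment hardware thread (ATI) versus 32-fragment (NVidia),
    # worst case of one quad per triangle.
    print("64-wide:", cycles_to_fill(64), "cycles")
    print("32-wide:", cycles_to_fill(32), "cycles")
```

With short shaders the ALUs can finish a thread's work before the next thread has been filled, which is why the fill time matters more in that case.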

Jawed · Jul 20, 2010

3dcgi said:
This info should be public, but since I don't know for sure I'll leave it for someone else to answer.

Several slide decks allude to a variety of patterns of fragment/work-item creation (pixel shader versus compute shader, basically) but none of them, as far as I can tell, allude to enhanced throughput.

Jawed · Jul 20, 2010

Gipsel said:
Actually, I just had another look at those strings in the driver. And it looks like I misinterpreted that a bit last year. The wording changed and is quite unambiguous now ("HW doesn't support trans unit slot" instead of "An ALU instruction was issued to the scalar slot. This feature is scheduled for removal in Northern Islands!" in Cat 9.8).

Ooh, tasty.

At face value that seems to imply that the next chip is NI. Which would imply that SI was cancelled. But what if SI comes after NI, or SI was a mirage, or NI is in there as well as SI but HD6xxx is SI? Oh, and what happened to Hecatoncheires?

Some rumours suggest that NI followed SI quite closely, i.e. well within one year.

Also, if x86 APU is supposed to get an annual refresh to the GPU section, then what we could be seeing is first APU being Evergreen based then second version being NI-based, skipping SI. Lead time on APU (tape-out to retail) is ~twice GPU so it seems likely that some skipping would be required to deal with the 40nm fuckup and the cancellation of 32nm (for GPUs).

i.e. the GPU roadmap has been redrawn so much that the APU guys have picked on something not caught in the middle.

So it really looks like NI is a pure 4 slot design. But the 4 slots are not created equal: the w (merged with t?) unit can't process a few instructions. I don't know which ones are missing, but generally it looks almost like the t slot today.

In Evergreen there are slight differences between the XYZW lanes - this is because of pairings and other intricacies in some of the new instructions. So the difference you're seeing for W might be nothing more than that.

Also, if there really is no "transcendental" lane, then sub-functions of transcendental computation would be spread asymmetrically across the other lanes. And there's always the possibility that transcendentals are a macro (just like double-precision divide is a macro - or, indeed, like a 1-ULP SIN or COS is a macro).

By the way, what do you expect from things like "SET_PRIORITY__NI", "INTERRUPT_AND_SLEEP__NI" or "INTERRUPT__NI"? A step in the direction of better managed multitasking on a GPU?

Ooh, more tasty stuff. Allowing the driver/operating system to stop one kernel/application totally killing responsiveness.

And NI gets int24 multiplication in the ALUs, not only uint24 like Evergreen (which is still not usable from IL, is it?).

I've not seen uint24 so far.

Also, while I'm at it, it's been suggested that the SIMDs would be 24 wide instead of 16 wide, in order to keep a 20-SIMD design. 1024 work items is the upper limit on workgroup size - that and other common power-of-two workgroup sizes don't accord with multiples of 24.
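To make the mismatch concrete, here is a small sketch. It assumes a wavefront is executed over 4 cycles, so its size is 4 times the SIMD width (64 for a 16-wide SIMD, which would make it 96 for a hypothetical 24-wide one). The helper name is made up.

```python
def idle_lanes(workgroup_size: int, simd_width: int,
               cycles_per_wavefront: int = 4) -> int:
    """Work-item slots left empty when a workgroup is rounded up to whole wavefronts."""
    wavefront = simd_width * cycles_per_wavefront
    wavefronts = -(-workgroup_size // wavefront)  # ceiling division
    return wavefronts * wavefront - workgroup_size


if __name__ == "__main__":
    for size in (64, 128, 256, 512, 1024):
        print(f"workgroup {size:>4}: "
              f"16-wide wastes {idle_lanes(size, 16):>2}, "
              f"24-wide wastes {idle_lanes(size, 24):>2}, "
              f"size % 24 = {size % 24}")
```

None of the common power-of-two sizes is a multiple of 24, so under this assumption every such workgroup would leave part of its last wavefront empty.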

If a shader engine remains at 10 SIMDs, with 16 ROPs, then 30 SIMDs (each of 64 ALU lanes) sort of matches up with having a 384-bit bus, i.e. 48 ROPs.
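The arithmetic behind that match can be sketched as follows. The per-engine figures come from the post; the 8 bus bits per ROP is an assumption taken from Cypress's publicly listed 256-bit bus and 32 ROPs, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    shader_engines: int
    simds_per_engine: int = 10   # from the post
    rops_per_engine: int = 16    # from the post
    lanes_per_simd: int = 64     # 16 wide x 4 slots, from the post
    bus_bits_per_rop: int = 8    # assumption: 256-bit / 32 ROPs on Cypress

    @property
    def simds(self) -> int:
        return self.shader_engines * self.simds_per_engine

    @property
    def alu_lanes(self) -> int:
        return self.simds * self.lanes_per_simd

    @property
    def rops(self) -> int:
        return self.shader_engines * self.rops_per_engine

    @property
    def bus_width_bits(self) -> int:
        return self.rops * self.bus_bits_per_rop


if __name__ == "__main__":
    cypress = GpuConfig(shader_engines=2, lanes_per_simd=80)  # 16 wide x 5 slots
    speculated = GpuConfig(shader_engines=3)
    for name, cfg in (("Cypress", cypress), ("3 SEs, 4-slot", speculated)):
        print(name, cfg.simds, "SIMDs,", cfg.alu_lanes, "lanes,",
              cfg.rops, "ROPs,", cfg.bus_width_bits, "bit bus")
```

Changing `shader_engines` and `simds_per_engine` also lets you try the 2 x 15 layout, which gives the same lane count but only 32 ROPs under these assumptions.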

Each shader engine presumably has quite a bit of "control" overhead. The way the die shot of RV770 shows things, it's extremely hard to see how much overhead there is. We can see the ALUs and TUs, but I can't work out how much of the "uncore" is the stuff that makes a shader engine work (the Sequencer). Obviously the more shader engines, the more of this overhead. And there are other overheads for integration of each SE into the entire GPU.

Anyway, generally speaking, increasing the count of SEs is going to lower the per-mm² ALU efficiency. So some of the gain in efficiency due to increased ALU utilisation with 4 lanes instead of 5 is going to be lost with both the increased count of SIMDs and the likely increased count of SEs. Though 2 SEs, each of 15 SIMDs, is theoretically an option.

3dcgi · Jul 20, 2010

Jawed said:
That would only be true if a hardware thread can be fully populated with quads from other triangles when one triangle doesn't entirely fill it.

If that's true then in the worst case with small triangles it's going to take 16 cycles to fill a hardware thread on ATI, because the rasteriser handles one triangle at a time. Short shaders make this worse.

Correct, it would be dumb to only allow one triangle per wavefront/warp.

jaredpace · Jul 20, 2010

So the original NI was supposed to be 2560 SPs (640 4D ALUs) at 32nm or 28nm, a 2:1 increase in ALU count over Evergreen. Now, for the sake of die size while having to remain at 40nm, it is 480 ALUs / 1920 SPs (3/4 of that) to keep it under 400mm². This gives merit to the rumors of it being a 'half generation'. The planned increase in ALU count was the same 2:1 as RV770 to RV870; now, due to the process node, it is 1.5:1. Scrap 320 > 640; it is now 320 > 480.
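The figures in that post can be checked with a few lines. The unit counts are the ones quoted in the post; treating Evergreen as 320 five-slot units is an assumption consistent with the "4 lanes instead of 5" remark earlier in the thread, and the function name is made up.

```python
def stream_processors(vliw_units: int, slots_per_unit: int) -> int:
    """Marketing 'SP' count: VLIW units times slots per unit."""
    return vliw_units * slots_per_unit


EVERGREEN_UNITS = 320   # assumption: 5-slot VLIW units
NI_PLANNED_UNITS = 640  # rumoured 32nm/28nm design, 4 slots
NI_40NM_UNITS = 480     # rumoured 40nm design, 4 slots

if __name__ == "__main__":
    print("Evergreen :", stream_processors(EVERGREEN_UNITS, 5), "SPs")
    print("NI planned:", stream_processors(NI_PLANNED_UNITS, 4), "SPs")
    print("NI at 40nm:", stream_processors(NI_40NM_UNITS, 4), "SPs")
    print("planned unit ratio:", NI_PLANNED_UNITS / EVERGREEN_UNITS)
    print("40nm unit ratio   :", NI_40NM_UNITS / EVERGREEN_UNITS)
```

Note that the 2:1 and 1.5:1 ratios apply to the count of VLIW units; in SP terms the step from five slots to four makes the increase smaller.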

Gipsel · Jul 20, 2010

jaredpace said:
So the original NI was supposed to be 2560 SPs (640 4D ALUs) at 32nm or 28nm, a 2:1 increase in ALU count over Evergreen. Now, for the sake of die size while having to remain at 40nm, it is 480 ALUs / 1920 SPs (3/4 of that) to keep it under 400mm². This gives merit to the rumors of it being a 'half generation'. The planned increase in ALU count was the same 2:1 as RV770 to RV870; now, due to the process node, it is 1.5:1. Scrap 320 > 640; it is now 320 > 480.

Or the top Northern Islands GPU still has 640 four-slot VLIW units or even more (as NI is supposed to use 28nm) and SI is just a mildly upgraded Evergreen to stay within a reasonable die size budget. Alternatively, we may not see the HD68x0 GPUs until 28nm, so the top-of-the-line GPU may be an HD6770 or something like that to limit the die size.

neliz · Jul 20, 2010

wouldn't be surprised to see the 1920SP part launch as 67xx and 28nm parts as 68xx as Gipsel suggests (I think I said that a couple of months ago as well :/)

no-X · Jul 20, 2010

It is expected to be under 400mm² and perform ~20% better than GTX480? It would be nice, but isn't this scenario overly optimistic?

neliz · Jul 20, 2010

no-X said:
It is expected to be under 400mm² and perform ~20% better than GTX480? It would be nice, but isn't this scenario overly optimistic?

I think GF104 shows that GF100 isn't exactly a shining example of perf/mm².

no-X · Jul 20, 2010

That's true, but try to imagine it in a slightly different way: this GPU (~25% bigger than GF104) would be almost as fast as HD5970. That would be another "RV770"...

LordEC911 · Jul 20, 2010

neliz said:
wouldn't be surprised to see the 1920SP part launch as 67xx and 28nm parts as 68xx as Gipsel suggests (I think I said that a couple of months ago as well :/)

Yes you did. Back in April I believe.

no-X said:
It is expected to be under 400mm² and perform ~20% better than GTX480? It would be nice, but isn't this scenario overly optimistic?

Neliz also said that back around April.
I have heard two different versions though: 20% faster than a current GTX480 and 20% faster than a 512SP GF100.
