AMD: R9xx Speculation

Jawed · Nov 1, 2010

3dcgi said:
The barycentrics don't go through LDS, they pre-populate the GPRs.

Here:

http://www.gamedev.net/community/forums/topic.asp?topic_id=540832

is an HS and DS. Looking at the ISA for the DS:

Code:

00 ALU: ADDR(32) CNT(19) 
      0  x: ADD         T0.x, -R0.y,  1.0f      
         y: MUL_e       T0.y,  R0.x,  (0x40400000, 3.0f).x      
         z: ADD         T0.z, -R0.x,  1.0f      
         w: MUL_e       T0.w,  R0.y,  (0x40400000, 3.0f).x      
         t: MOV         R1.x,  0.0f      
      1  x: MUL_e       ____,  R0.x,  PV0.y      
         y: MUL_e       ____,  PV0.z,  PV0.z      
         z: MUL_e       ____,  R0.y,  PV0.w      
         w: MUL_e       T1.w,  PV0.x,  PV0.x      
         t: MOV         R1.y,  1      
      2  x: MUL_e       R3.x,  T0.w,  PV1.w      
         y: MUL_e       R5.y,  T0.z,  PV1.x      
         z: MUL_e       R5.z,  T0.y,  PV1.y      
         w: MUL_e       R5.w,  T0.x,  PV1.z      
         t: MUL_e       R4.y,  T0.z,  PV1.y      
      3  x: MUL_e       R2.x,  R0.x,  R0.x      
         y: MUL_e       R2.y,  R0.y,  R0.y      
         w: MUL_e       R2.w,  T0.x,  T1.w      VEC_120 
[...]

indicates that R0.xy contains the vertex output by TS. Sigh, first time I've actually studied the shader closely enough.

So LDS is not holding the output from TS in this case, it's being written directly into GPRs. This is pretty similar to what the interpolator block used to do, creating allocations in the register file and populating shader inputs with data by setting GPRs.

Taken at face value (and ignoring the registers HS uses), the 8 GPR allocation for this DS means that 32 threads can be in flight on the SIMD. Which is 2048 vertices output from TS.

So the count of vertices that a SIMD can accept from TS is determined by GPR allocation for the DS.

This shader also has an integer in R0.z as input. What is it?

AnarchX · Nov 1, 2010

Could the off-chip buffering of Cayman Gen8 tessellator a kind of delay stream?
http://docs.google.com/viewer?a=v&q...OqIoyw&sig=AHIEtbTh-tW2F6Gzszm9N_PuLgZulp8ZvQ

caveman-jim · Nov 1, 2010

Jawed said:
Any links?

Nah, came from email from NV.

CarstenS · Nov 1, 2010

From what I understand, it sounds more like something to reduce overdraw.

Mintmaster · Nov 1, 2010

Jawed said:
That's the spec for computation, I believe, not storage. And since we were talking about the space consumed in LDS, then 8 bytes per vertex is the truth.

It's still doesn't mean much for BW, especially for NI's 1-2 triangles per clock peak.

It's impossible, anyway, currently, since LDS is too small when TF=64.
...
Well, for a square patch the correct number would be 450*16 (thanks to AlexV for the correction - doh for not even thinking about it). Clearly LDS can't store them all.

I don't think you're reading my post properly. There is no reason to store all the vertices produced by a patch.

Of course. DS shaders tend to take a while.

Not much more than typical vertex shaders.

And when TS is working its way through the patch that results in thousands of vertices, it can't load-balance itself and switch to another patch, when the destination LDS for the first patch runs out of capacity.

For patches like that, you just have to write the 3 or 4 control points to GDS. Then any SIMD can do the DS.

It's not a compilation problem.

In a way it is. If you compile shaders to use the same number of registers per fragment, then you can basically have an ubershader to work on wavefronts using any of those shaders.

So off-die buffering is irrelevant?
...
Which you still haven't eludicated.

When did I say it was irrelevent? Yes, I did give you a possibility: there could be a data path bottleneck somewhere. You could have bank conflicts, or maybe limitations from all fragments in the DS accessing the same control point. Caches used for regular vertex processing (where all data comes from RAM) may alleviate that.

Jawed · Nov 1, 2010

Mintmaster said:
It's still doesn't mean much for BW, especially for NI's 1-2 triangles per clock peak.

Well you've probably seen my reply to 3dcgi by now, which indicates that each vertex output by TS is effectively consuming 16 bytes, an entire register, as it's written to the register file, not LDS. And that allocation is multiplied by the entire allocation of registers for the DS.

I don't think you're reading my post properly. There is no reason to store all the vertices produced by a patch.

The basis of my position isn't that all the vertices need to be stored. It's that when a huge lump of vertices are the result of one or a few patches owned by a single SIMD, TS throughput can be affected. Subject to the total count of SIMDs that can accept TS output.

In the trivial case (which I suspect isn't realistic): if there's only one SIMD running HS/DS then TS has to stop while waiting for DS threads to complete processing. More realistically several SIMDs will be there to take on the workload of DS.

Regardless of the number of SIMDs occupied by HS/DS, when a patch results in TS outputting more than X vertices (X dependent on register allocation of DS), TS is going to stall because it can't multi-thread patches - it treats them strictly sequentially. That's my interpretation, and I suspect it's a major factor in the performance cliff we see. Cypress falls to 1/6 to 1/10 of GTX480 throughput in the worst cases.

Fermi probably stalls its TSs too, when amplification is very high. There's certainly warnings:

http://www.highperformancegraphics.org/media/Hot3D/HPG2010_Hot3D_NVIDIA.pdf

Not much more than typical vertex shaders.
For patches like that, you just have to write the 3 or 4 control points to GDS. Then any SIMD can do the DS.

Migration/sharing of workload, something like Fermi, but which appears not to be part of Evergreen. Although, to be fair, there's no hard evidence for this.

In a way it is. If you compile shaders to use the same number of registers per fragment, then you can basically have an ubershader to work on wavefronts using any of those shaders.

I imagine the architecture would have to be re-jigged for that kind of support. The implicit inputs to a pixel shader, say, aren't like the inputs to a DS. This ubershader would be of a type distinct from all those currently implemented.

When did I say it was irrelevent? Yes, I did give you a possibility: there could be a data path bottleneck somewhere.

Under "Hardware Tessellator Progression", the slide says "Gen 8 - AMD Radeon HD 6900 Series - Scalability and off-chip buffering". You keep trying to dodge around "off-chip buffering" as though it has nothing to do with making tessellation faster. If moving data off-die is a performance win it's probably for the same reason as seen in GS: coarse granularity data, in huge wodges, is too voluminous to keep on-die.

Anyway, it turns out that LDS wasn't the buffer under strain, it was the register file, which appears to make the strain worse...

You could have bank conflicts,

Yes, those can happen, definitely. Why would they scale with tessellation factor? LDS reads are solely for HS params that are inputs to DS, i.e. control-points and tessellation factors.

or maybe limitations from all fragments in the DS accessing the same control point.

Broadcast is fine in ATI as far as I know.

Caches used for regular vertex processing (where all data comes from RAM) may alleviate that.

The L1 texture/vertex cache you mean?

The DS I referenced earlier does read stuff from RAM (two VFETCH instructions). I suspect it might be something to do with the original stream of patches - perhaps it's just scale/bias and offset data for the patch buffer, to enable calculation of the right address in LDS to fetch HS params from.

In the same vein, the HS I referenced earlier reads two ints using a single VFETCH, which appear to be to generate LDS write addresses.

In both cases VFETCH doesn't look like it could be a bottleneck. There's a lot of ALU work in HS/DS for the SIMD to hide VFETCH latency.

SirPauly · Nov 1, 2010

Jawed said:
Any links?

http://www.gamersdailynews.com/story-20361-HAWX-2-DirectX-11-Optimized-Features.html

R300King! · Nov 1, 2010

shiznit said:
found this on techreport.com, slide says VLIW4 shaders on 6900 series.

http://www.hardware-infos.com/news.php?news=3748

any truth to this?

From that link.. googled of course.

Solid information on the number of Shader units are still unavailable. Hottest rumor is the fact that 4D-480 units (1920 stream processors) instead of 320 units 5D (1600 stream processors) are installed. That would bring the shader performance increase in frequency between themselves 20 and 40 percent percent - einberechnet an also planned, higher-performance Tessellation did not.

Umm, why would anyone think the 6970 would have the same number of shaders as the 5870?

I don't get it. I think it'll probably be 4-D shaders but the total will be more than 1920. How many more, not sure, maybe 2240.

TOAO_Cyrus · Nov 2, 2010

R300King! said:
From that link.. googled of course.

Umm, why would anyone think the 6970 would have the same number of shaders as the 5870?

I don't get it. I think it'll probably be 4-D shaders but the total will be more than 1920. How many more, not sure, maybe 2240.

I dont get what your saying, it doesnt have the same number of shaders, its got 50% more if the 480 number is to be believed. Correct me if I am wrong but removing one lane probably doesnt save that much die area so its a pretty substantial increase for the same process.

Mintmaster · Nov 2, 2010

Jawed said:
The basis of my position isn't that all the vertices need to be stored. It's that when a huge lump of vertices are the result of one or a few patches owned by a single SIMD, TS throughput can be affected.

In that case, what you're saying is that the DS doesn't run fast enough. Could have saved a lot of typing and miscommunication that way...

I don't agree with the assertion that a vertex must be domain shaded by the SIMD that created the patch from which it was born. You can put some control point data in the GDS, and the workload sharing is very similar to vertices being pumped into several SIMDs for regular geometry sans geometry shader or tessellation. It's not some Fermi exclusive technology.

You keep trying to dodge around "off-chip buffering" as though it has nothing to do with making tessellation faster.

I've done no such thing. In fact, my entry into this discussion was about the fact that the BW required by off-chip buffering of TS output is very low, so it may well make things faster. They don't have the potentially 50-100 byte vertex size of non-tessellated geometry.

Anyway, it turns out that LDS wasn't the buffer under strain, it was the register file, which appears to make the strain worse...

By your own calculations, the buffer can still be very big. In your scenario, TS is not stalling because the buffer is inadequate, it's stalling because the DS isn't working fast enough. Unless you increase the speed of the DS, more buffer space won't help.

Yes, those can happen, definitely. Why would they scale with tessellation factor? LDS reads are solely for HS params that are inputs to DS, i.e. control-points and tessellation factors.

What concrete evidence do we have about this scaling? Recall this thread:
http://forum.beyond3d.com/showthread.php?t=57035
Cypress showed the same 6 clocks per triangle added when the factor is 25, 50, and 100.

In both cases VFETCH doesn't look like it could be a bottleneck. There's a lot of ALU work in HS/DS for the SIMD to hide VFETCH latency.

What I was suggesting is that a VS with VFETCH is fast enough to do at least one vert per clock, so regardless of what is holding back the TS/DS, reading the tessellated triangles (and maybe even the control points) from memory has the potential to hit the same rate. This basically transforms the DS into a VS.

3dcgi · Nov 2, 2010

Jawed said:
This shader also has an integer in R0.z as input. What is it?

Something has to tell you where to fetch HS output data from.

R300King! · Nov 2, 2010

TOAO_Cyrus said:
I dont get what your saying, it doesnt have the same number of shaders, its got 50% more if the 480 number is to be believed. Correct me if I am wrong but removing one lane probably doesnt save that much die area so its a pretty substantial increase for the same process.

The actual number of shaders is less than 50% increase. Only if your counting the clusters from 320 to 480, then yes. But the 320 had 5, and the 480 only has 4. So the actual increase in shaders is 20%.

And my first statement was..

The article was saying that the hottest rumor was that the NEW 6970 has 1920 4D and not 1600 5D. Why on earth would the 6970 have 1600 5D shaders? WHY? The OLD 5870 had that, why would they even think the new chip would have the same damn thing. That's what I was saying.

hkultala · Nov 2, 2010

R300King! said:
The actual number of shaders is less than 50% increase. Only if your counting the clusters from 320 to 480, then yes. But the 320 had 5, and the 480 only has 4. So the actual increase in shaders is 20%.

And my first statement was..

The article was saying that the hottest rumor was that the NEW 6970 has 1920 4D and not 1600 5D. Why on earth would the 6970 have 1600 5D shaders? WHY? The OLD 5870 had that, why would they even think the new chip would have the same damn thing. That's what I was saying.

Cypress has 20 processor cores, each of which are 16-way SIMD, and for each SIMD line there is 6-way VLIW capable of 5 FP ALU operations and one branch.

6970 might be 30 processor cores, each having 16-way SIMD, and for each SIMD line 5-way VLIW, capable of 4 FP ALU operations and one branch.

The number of processors, the total number of SIMD lines, the number of threads operated simultaneously will grow by 50%.

But:

I think the shared counts are not very interesting here:

6870 series got almost same performance as 5870 series, with about 80% of the pixel shader power, 80% of the texturing power, with 87.5% of the memory bandwidth, and 106% of theoretical pixel output performance, and with some tweaks to internal buffering, thread handling etc.

So I don't see much point of increasing shader power much from 5800 to 6900 series, as 5870 seems to already have "too much" shading power, and especially if they stay at 32 ROPs, there just would not be any use for much more shader power in any games coming out soon. Of course, GPGPU performance , where ROPs are not used at all would increase, but it's no a priority.

Going from 32 to 64 ROPs would propably give much bigger performance increase than going from 480x4 to some >500x4 shaders.

So, I think either of the following makes sense:

1) Just the internal buffering and thread handling tweaks Barts got, better geometry performance, and no big increase in shader count, and quite small increase of die size. maximum of 480x4 shaders, maybe only 400x4.

2) considerably bigger chip, doubling of ROPs from 32 to 64, at least 480x4 shaders, either very high-clocked 256-bit memory bus or 512-bit memory bus. And of course all the buffering and thread handling tweaks Barts has and better geometry performance

hkultala · Nov 2, 2010

Another thought/speculation about Barts:

We know, that originally Barts was supposed to be "6700" manufactured with 32nm.

So when the 32nm process was cancelled, what was changed.

I think the 256-bit memory bus and "slow-clocked" memory controller came with that:

With 32nm the chip might have been too small for the padding of 256-bit memory bus, so the original plan was probably 128-bit bus, same 32 ROPs, but a very high-clocked memories for the 6770 sku.

When they had to change to 40nm and the chip size increased, they had space for the pads 256-bit memory bus required. And then they could use the "lower-end" memory controller and slower clocked memories and still get better bandwidth than 128-bit high-cloked memory.

AnarchX · Nov 2, 2010

We already calculated that the 256-Bit ~4Gbps pads might fit on a ~200mm² die. So the 32nm performance chip might was such a sized die, but maybe with some more SIMDs (2x8 ~1280SPs) or 4D design.

no-X · Nov 2, 2010

hkultala: Now I think, that Barts was originally a refresh part, which was planned for launch around Easter. I think it's possible, that Bart wasn't prepared for 32nm manufacturing, but it was 40nm part from the start. And during its development it was completed by some NI's features (better filtering, UVD3.0 etc.), belated and launched as a part of NI's family. Because of this reasons it could be quite misleading to use Barts as a basis for extrapolating Cayman's performance (or anything)...

Jawed · Nov 2, 2010

Mintmaster said:
In that case, what you're saying is that the DS doesn't run fast enough. Could have saved a lot of typing and miscommunication that way...

Yes. I originally contemplated using TF=64 as my argument, but I was expecting people to reject that as totally unreasonable "because TF=64 is something no-one would ever use". Only later did I realise that in something like terrain rendering would TF=64 be completely normal - as would TF=1024 if it were available, because application of a height-field to a plane isn't helped by a high density of control points.

I don't agree with the assertion that a vertex must be domain shaded by the SIMD that created the patch from which it was born. You can put some control point data in the GDS, and the workload sharing is very similar to vertices being pumped into several SIMDs for regular geometry sans geometry shader or tessellation.

The shaders I've played with show that HS is writing patch params to LDS and DS is reading them from LDS. Sure, there may be some magic happening behind the scenes. The inconvenient truth here is that Evergreen tanks when given high TF, and in my view that's consistent with TS stalling because the sink buffer is too small.

It's not some Fermi exclusive technology.

It shouldn't be, no. But Cypress had features cut due to the 40nm fuck-up...

I've done no such thing. In fact, my entry into this discussion was about the fact that the BW required by off-chip buffering of TS output is very low, so it may well make things faster. They don't have the potentially 50-100 byte vertex size of non-tessellated geometry.

So what is the mechanism by which an off-chip buffer increases tessellation performance (or allows it to increase along with other changes for better scaling in the architecture)?

By your own calculations, the buffer can still be very big. In your scenario, TS is not stalling because the buffer is inadequate, it's stalling because the DS isn't working fast enough. Unless you increase the speed of the DS, more buffer space won't help.

The usual way to increase DS throughput is to increase the count of SIMDs running DS. But this is where coarse-grained granularity hits hard: a high TF overflows the capacity of any single SIMD, unless DS throughput is >1 vertex per cycle (in which case 1 SIMD would be enough anyway). Once a SIMD is chock full of vertices from a patch, TS has got nowhere to send more vertices, because a single SIMD is the only place those vertices can go (according to the locked HS/DS theory).

If TS could just switch to another SIMD for DS execution, then it wouldn't have to stall. Well, unless there's not enough SIMDs with available capacity (which is more likely on the smaller GPUs).

An off-die buffer can soak-up the peaks caused by high-TF. Without it, even occasional high-TF patches will cause grief. That's the basis of my interpretation of that line on the slide. Producer-consumer relies upon big-enough intermediate buffers, "80:20 rule", whatever. The original ATI GS design uses an off-die buffer to avoid stalling caused by large amplification.

Obviously there's the underlying argument in some people's eyes that Cypress/Barts achieves adequate performance in its 80:20 compromise. If Cayman is re-designed for "scalability" in its tessellation performance, and buffering is re-worked to achieve that increased performance, then it points to the current design having inadequate buffering to support scaling.

What concrete evidence do we have about this scaling? Recall this thread:
http://forum.beyond3d.com/showthread.php?t=57035
Cypress showed the same 6 clocks per triangle added when the factor is 25, 50, and 100.

6 clocks (you later adjusted to 6.5 clocks) is so slow, even at LOD 25, that it can't get any slower? A comparison with Juniper and Redwood could be useful, as only math would vary.

What is TF for LOD 25? Is the count of patches constant regardless of LOD?

What I was suggesting is that a VS with VFETCH is fast enough to do at least one vert per clock, so regardless of what is holding back the TS/DS, reading the tessellated triangles (and maybe even the control points) from memory has the potential to hit the same rate. This basically transforms the DS into a VS.

And the other crucial factor being that the GPU can use as many or as few SIMDs for DS as are necessary - without the "static" allocation issue that the locked-HS/DS theory implies. It's real load-balancing, whereas the current architecture appears to behave like a non-unified GPU - adequate only for low TFs.

From that thread:

Mintmaster said:
So it looks like the numbers are legit. ATI's tessellator is either faster at integer tesselation or the bottleneck is somewhere else, like feeding the domain shader.

Jawed · Nov 2, 2010

3dcgi said:
Something has to tell you where to fetch HS output data from.

Thanks, yes, you've probably since noticed the penny had dropped last night.

trinibwoy · Nov 2, 2010

hkultala said:
6870 series got almost same performance as 5870 series, with about 80% of the pixel shader power, 80% of the texturing power, with 87.5% of the memory bandwidth, and 106% of theoretical pixel output performance, and with some tweaks to internal buffering, thread handling etc.

Sure it might look rosy if you compare to the 5870. What happens when you consider the 6870 has 124% of the clock-speed and fill-rate of the 5850, 97% of the flops and texture rate, 105% bandwidth and is only 6-7% faster? Looks to me that pixel fill-rate isn't the determining factor at all and bandwidth/texturing still rules. Couple that with the much higher clock speed and Barts isn't doing anything magical with respect to its theoretical numbers vs Cypress.

For Cayman to achieve higher performance than its theoretical numbers would indicate vs Cypress it would need to have more dramatic efficiency improvements than we've seen in Barts.

Dave Baumann · Nov 2, 2010

hkultala said:
Another thought/speculation about Barts:

We know, that originally Barts was supposed to be "6700" manufactured with 32nm.

So when the 32nm process was cancelled, what was changed.

You know these things?

Barts did not exist on the roadmap in 32nm. Barts in 40nm turned up before the 32nm cancellation.

AMD: R9xx Speculation

Jawed

AnarchX

caveman-jim

CarstenS

Moderator

Mintmaster

Jawed

SirPauly

R300King!

TOAO_Cyrus

Mintmaster

3dcgi

R300King!

hkultala

hkultala

AnarchX

no-X

Jawed

Jawed

trinibwoy

Meh

Dave Baumann

Gamerscore Wh...

Similar threads