AMD: Volcanic Islands R1100/1200 (8***/9*** series) Speculation/ Rumour Thread

7850 with what CPU...

What does it matter? The CPU unleashes the GPU's full potential; it doesn't make it faster than it is. It seems that as long as you have a half-decent tri-core or better, you can max out a 7850.

It's not logical to assume the consoles are CPU-limited in this game; if they were, they'd be running at higher resolutions than they are. Put another way, the resolution (or other graphical effects) will clearly be dialled up in the console versions until the GPUs are maxed out at the target framerate.
 
The hardware tessellator only needs tessellation factors.
That's what I meant by patch data.
It outputs UVs and the DS does the rest.
I was giving a suggestion for avoiding even the UV calculation in the tessellator, but if AMD does do it, then that's great. The problem is that their data-flow management for tessellation sucks if it needs off-chip storage.
The DS needs access to the HS output, and if you only have a few wavefronts in flight, performance will suck. Hence you want DS waves from the same patch to execute on multiple CUs/SMXs in parallel.
I don't see how this is so different from triangles from vertex shaders generating pixels that run on multiple CUs/SMXs.

AMD chips don't push polygons into an off-chip buffer there. They run the VS on however many CUs are deemed appropriate, fill a FIFO on chip to temporarily store the output, and when it's full the CUs stop doing VS work while the rasterizers start emptying the FIFO by creating pixel wavefronts for the CUs to process.

This is how tessellation should work: run the HS, fill a FIFO, and when it's full, stop HS work and let the tessellators empty the FIFO by creating DS wavefronts.
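The flow control described above can be sketched as a toy simulation: an HS producer filling a small bounded FIFO, and a tessellator consumer draining it. The FIFO capacity and drain rate are invented purely for illustration; nothing here reflects real hardware sizing.

```python
from collections import deque

FIFO_CAPACITY = 4  # patches the on-chip FIFO can hold (assumed, illustrative)

def run_pipeline(num_patches, drain_period):
    """Cycle-by-cycle toy sim: the HS pushes one patch per cycle if there
    is room; the tessellator pops one patch every `drain_period` cycles."""
    fifo = deque()
    produced = consumed = hs_stalls = cycle = 0
    while consumed < num_patches:
        cycle += 1
        if produced < num_patches:
            if len(fifo) < FIFO_CAPACITY:
                fifo.append(produced)  # HS emits a patch record
                produced += 1
            else:
                hs_stalls += 1         # FIFO full: HS work pauses
        if fifo and cycle % drain_period == 0:
            fifo.popleft()             # tessellator consumes a patch
            consumed += 1              # ...and would spawn DS wavefronts
    return cycle, hs_stalls
```

With a tessellator that keeps pace (`drain_period=1`) the HS never stalls; a slower drain backs the FIFO up and throttles the HS, which is exactly the load-balancing behaviour the rasterizer path already exhibits.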
 
That's what I meant by patch data.
I was giving a suggestion for avoiding even the UV calculation in the tessellator, but if AMD does do it, then that's great. The problem is that their data-flow management for tessellation sucks if it needs off-chip storage.

I don't see how this is so different from triangles from vertex shaders generating pixels that run on multiple CUs/SMXs.

AMD chips don't push polygons into an off-chip buffer there. They run the VS on however many CUs are deemed appropriate, fill a FIFO on chip to temporarily store the output, and when it's full the CUs stop doing VS work while the rasterizers start emptying the FIFO by creating pixel wavefronts for the CUs to process.

This is how tessellation should work: run the HS, fill a FIFO, and when it's full, stop HS work and let the tessellators empty the FIFO by creating DS wavefronts.
HSes can be very expensive, so you need a lot of them in flight to perform well in low-tessellation cases. The FIFO you suggest is the limiting factor for how many HS waves can execute in parallel.

Say you have 4 control points per patch, 64 bytes of output per control point, plus 6 32-bit tessellation factors per patch and you need 1000 control points in flight to hide the latency of the HS. That's 70000 bytes of storage. You really want it to be double that, because you can't free the storage until the DSes are done with it. That's 140000 bytes for fairly skinny control points, so why not use your cache or spill to memory if necessary? This way all that storage is available when you're not using tessellation.
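The arithmetic above can be checked in a few lines; every figure is taken straight from the post:

```python
# Storage estimate for keeping 1000 control points in flight.
cp_per_patch = 4
bytes_per_cp = 64
tf_bytes_per_patch = 6 * 4                        # six 32-bit tessellation factors
cps_in_flight = 1000
patches_in_flight = cps_in_flight // cp_per_patch # 250 patches

storage = cps_in_flight * bytes_per_cp + patches_in_flight * tf_bytes_per_patch
print(storage)      # 70000 bytes
print(storage * 2)  # 140000 bytes once double-buffered for DS consumption
```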

As the tessellation level rises you need less storage in the FIFO, but you don't know that until the HS calculates the tessellation factors.
 
HSes can be very expensive, so you need a lot of them in flight to perform well in low-tessellation cases. The FIFO you suggest is the limiting factor for how many HS waves can execute in parallel.
First of all, that's a failure of optimization if AMD is sacrificing high-factor performance to improve low-factor performance. Their own devrel material suggests not using tessellation at low factors, and poor HS performance won't matter then anyway.

Say you have 4 control points per patch, 64 bytes of output per control point, plus 6 32-bit tessellation factors per patch and you need 1000 control points in flight to hide the latency of the HS.
The DS is where displacement mapping happens, so I don't see why the HS needs 1000 control points to hide latency. The vast majority of HSes will be pure arithmetic.

Anyway, HS output going to memory doesn't explain AMD's performance in the least. The higher the tessellation factor, the more DS verts per patch, and the less that the BW/latency of off-chip HS data will affect the triangle throughput of the tessellator.

In reality, AMD's tris per clock goes drastically down with tessellation factor. That must mean they are storing tessellator output off chip (maybe 5-6 bytes per DS vert), not HS output. There's no need to do this with properly designed hardware.
 
Anyway, HS output going to memory doesn't explain AMD's performance in the least. The higher the tessellation factor, the more DS verts per patch, and the less that the BW/latency of off-chip HS data will affect the triangle throughput of the tessellator.
Correct. This is why it doesn't make sense to have a large on chip FIFO.

In reality, AMD's tris per clock goes drastically down with tessellation factor. That must mean they are storing tessellator output off chip (maybe 5-6 bytes per DS vert), not HS output. There's no need to do this with properly designed hardware.
If the DS output went off chip you'd see performance scale with memory bandwidth.
 
Yet Hawaii is scaling just fine with occlusion culling, with or without tessellation. It's only when the data stream hits the rasterizer that the rate seems to be capped.
 
I still don't understand how "driver issues" can affect one SKU when others based on the same chip aren't similarly affected.

I guess this will be explained by the launch reviews - or look at Kaotik's reply. Very insightful he has proven time and again. ;)
 
Correct. This is why it doesn't make sense to have a large on chip FIFO.
It doesn't need to be large. Your calculations were dependent on having 1000 HS wavefronts, which is at least 50x too many. It's extremely rare for an HS to need texture access.

If the DS output went off chip you'd see performance scale with memory bandwidth.
BW wouldn't be an issue (4 verts/clk is < 24 GB/s), unless there is an internal restriction. Latency and/or extra clocks for export/import probably are.
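The "< 24 GB/s" figure can be sanity-checked as 4 DS verts per clock at roughly 6 bytes each. The ~1 GHz clock and the per-vertex byte count are my assumptions, not from the post:

```python
# Streamout bandwidth for tessellator output, one direction only.
verts_per_clk = 4
bytes_per_vert = 6    # two 16-bit UVs plus ~2 bytes overhead (assumed)
clock_hz = 1.0e9      # ~1 GHz core clock (assumed)

streamout_bw = verts_per_clk * bytes_per_vert * clock_hz
print(streamout_bw / 1e9)  # 24.0 GB/s
```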
 
It doesn't need to be large. Your calculations were dependent on having 1000 HS wavefronts, which is at least 50x too many. It's extremely rare for an HS to need texture access.

BW wouldn't be an issue (4 verts/clk is < 24 GB/s), unless there is an internal restriction. Latency and/or extra clocks for export/import probably are.
I said 1000 control points not 1000 HS wavefronts. If you don't think bandwidth is an issue then I'm not sure why you think that would be a stupid design.
 
I said 1000 control points not 1000 HS wavefronts. If you don't think bandwidth is an issue then I'm not sure why you think that would be a stupid design.
My mistake. Ignore what I wrote in that paragraph.

But sustaining 1000 control points is silly for a shader without texture loads in a load-balanced system (even if it does initially issue that many), and even if you're right that the tens of kB needed to buffer enough control points is too much of a cost, problems with HS streamout would result in a t-factor effect on performance that trends opposite of what we're seeing.

BW for tessellator streamout is not an issue in theoretical tests, but it'll still sap significant BW (BTW, I forgot to multiply those numbers by 2 for write and read, but it's still a lot less than total BW. I assumed two 16-bit UVs plus a couple other bytes for indexing/overhead). So it still is stupid to stream UVs out, especially when they're so tiny and can be created on demand by the tessellator and packed directly into DS wavefronts.
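Applying the write+read correction from the paragraph above and comparing against total board bandwidth (the ~320 GB/s figure is a Hawaii-class assumption on my part, as is the ~1 GHz clock):

```python
# Two-way streamout traffic: the vertex data is written out, then read back.
verts_per_clk = 4
bytes_per_vert = 6     # two 16-bit UVs plus indexing/overhead (from the post)
clock_hz = 1.0e9       # assumed ~1 GHz core clock
total_bw = 320e9       # assumed Hawaii-class memory bandwidth

traffic = 2 * verts_per_clk * bytes_per_vert * clock_hz  # write + read
print(traffic / 1e9)       # 48.0 GB/s
print(traffic / total_bw)  # 0.15 -> significant, but well short of total BW
```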

But why are you denying that there's a problem? I'm still waiting for your explanation, but AFAICS the only reason to have lower triangle throughput with high t-factor is UV streamout (high t-factor, in fact, reduces data flow in the pipeline), with stalls appearing either when writing out or reading back in.

It's a flawed design.
 
My mistake. Ignore what I wrote in that paragraph.

But sustaining 1000 control points is silly for a shader without texture loads in a load-balanced system (even if it does initially issue that many), and even if you're right that the tens of kB needed to buffer enough control points is too much of a cost, problems with HS streamout would result in a t-factor effect on performance that trends opposite of what we're seeing.

BW for tessellator streamout is not an issue in theoretical tests, but it'll still sap significant BW (BTW, I forgot to multiply those numbers by 2 for write and read, but it's still a lot less than total BW. I assumed two 16-bit UVs plus a couple other bytes for indexing/overhead). So it still is stupid to stream UVs out, especially when they're so tiny and can be created on demand by the tessellator and packed directly into DS wavefronts.

But why are you denying that there's a problem? I'm still waiting for your explanation, but AFAICS the only reason to have lower triangle throughput with high t-factor is UV streamout (high t-factor, in fact, reduces data flow in the pipeline), with stalls appearing either when writing out or reading back in.

It's a flawed design.
Don't underestimate how much latency there is in a graphics chip, nor how complicated some HSes are, especially once subdivision surfaces start being used more and there are a lot of control points per patch.

I'm not denying there is a problem and that the design is flawed. It's just not flawed due to off-chip buffering. I can't give more detail than that. I am curious to see if other vendors make a similar mistake as they add tessellation support. Nvidia did not, though their first design wasn't perfect either. You can see how Nvidia improved between the 580 and 680.
http://techreport.com/review/22653/nvidia-geforce-gtx-680-graphics-processor-reviewed/6
 
Don't underestimate how much latency there is in a graphics chip, nor how complicated some HSes are, especially once subdivision surfaces start being used more and there are a lot of control points per patch.
Again, I'm accepting your claim that HS output may warrant going off chip. And latency is exactly why I think UV streamout can cause stalls.

But once HS data is read back, it makes no sense for high tessellation factors to have reduced triangle throughput over low factors. The HS overhead gets amortized over more triangles, so throughput should go up.
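The amortization argument is easy to illustrate: triangle count per patch grows roughly with the square of the tessellation factor, so a fixed per-patch HS cost shrinks per triangle. Both the `2 * f * f` triangle estimate and the 100-cycle HS cost are illustrative assumptions, not measured numbers.

```python
def hs_cost_per_tri(factor, hs_cycles=100):
    """Fixed per-patch HS cost spread over the patch's triangles."""
    tris = 2 * factor * factor  # rough triangle yield of a quad patch
    return hs_cycles / tris

# Per-triangle HS overhead collapses as the factor rises.
for f in (2, 8, 32, 64):
    print(f, hs_cost_per_tri(f))
```

So if HS streamout were the bottleneck, the penalty should fade at high factors rather than grow, which is the point being made above.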

I'm not denying there is a problem and that the design is flawed. It's just not flawed due to off-chip buffering.
Well, improved off-chip buffering is really the only thing AMD has mentioned between generations, so that's what I assumed. I guess it could be related to DS wavefront generation and distribution, or bank conflicts, or whatever, but it's hard to think of anything that makes throughput go down with higher factors.

Whatever the issue is, AMD must be doing something stupid in the tessellator, and it has now been there for four generations.
 