AMD: R9xx Speculation

Maybe Firaxis are using something like what is proposed on page 19:
http://developer.download.nvidia.com/presentations/2010/gdc/Tessellation_Performance.pdf

Screen Space Adaptive Tessellation
• Triangles under 8 pixels are not efficient
• Consider limiting the global maximum TessFactor by screen resolution
• Consider the screen space patch edge length as a scaling factor
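The two "Consider" bullets could be combined into something like the sketch below (function name, the 8-pixel target and the 64 cap are my assumptions for illustration, not anything from the presentation or from Firaxis):

```python
def screen_space_tess_factor(edge_len_px, target_px=8.0, max_tf=64.0):
    """Hypothetical helper: scale the TessFactor by the screen-space
    patch edge length so generated triangles stay near target_px pixels,
    and clamp to a global maximum (which could itself be lowered at
    smaller resolutions)."""
    tf = edge_len_px / target_px
    return max(1.0, min(tf, max_tf))
```

So a patch edge covering 256 pixels would get a factor of 32, tiny edges clamp to 1, and huge ones to the global cap.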

But then, it's unlikely that a GTX 470 is faster than a GTX 480, or that the overclocked EVGA GTX 460 is slower than a stock GTX 460 with 1GB. *sigh*
I would think adjusting the max TessFactor by screen resolution would mean less tessellation at lower resolutions, which is the opposite of the results showing a performance increase with resolution.
 
That's how you write the shader, but internally each of the two coordinates is a 16-bit fixed-point number from 0 to 1.
The DS I have here is reading floats from LDS, not ints. It's not reading ints and doing i2f conversions.
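If the domain coordinates really are 16-bit fixed point in [0,1] as claimed above, the int-to-float step the hardware would do before the DS sees floats might look like this (the exact encoding isn't public; a plain u/65535 normalization is assumed here):

```python
def fixed16_to_float(u):
    """Convert a 16-bit fixed-point domain coordinate in [0, 1] to float.
    Assumes a straightforward u/65535 normalization, which is a guess."""
    assert 0 <= u <= 0xFFFF
    return u / 65535.0
```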

You're assuming that the tessellated triangles from many patches are generated in parallel.
No, I'm simply describing scenarios where multiple patches are shaded by a single HS hardware thread. These would, presumably, form a contiguous set of entries in TS's input buffer (GDS, 64KB). If that's the case, then TS cannot output vertices to the patch-originating SIMD once LDS is full.

TS could only continue working if there's enough patches from differing SIMDs. In theory TS should be able to select patches freely from GDS, rather than treating it as a FIFO, so TS shouldn't stall due to a lack of patches or lack of SIMD destinations for vertices - unless there's not enough SIMDs outputting patches.

That would be the methodology of tessellation emulated by the GS, but when you have fixed function hardware that's not how it works.

The tessellator will have a stream of input patches (edge/face tessellation factors and nothing else), read one, generate coordinate pairs one at a time to create a triangle list until the patch is complete, and then repeat. I would think that it wouldn't take many transistors to generate one barycentric coordinate pair per clock this way.
I'm not disagreeing with any of that. Though it should be obvious that running only 4 control points in an HS hardware thread is quite wasteful, if there's capacity to execute 64. Presumably that's resolved in the driver...
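A serial tessellator of the kind described could be sketched like this, for a quad domain with a single integer TF (real D3D11 tessellation handles per-edge and fractional factors, which this toy version ignores, and this says nothing about what the actual hardware does):

```python
def tessellate_quad(tf):
    """Stream (u, v) domain points for a uniform quad patch, one pair
    per step, then describe the triangle list as index triples.
    Simplified: one integer TF, no per-edge factors, no fractional modes."""
    n = tf + 1  # points per row
    points = [(i / tf, j / tf) for j in range(tf + 1) for i in range(tf + 1)]
    tris = []
    for j in range(tf):
        for i in range(tf):
            a, b = j * n + i, j * n + i + 1
            c, d = (j + 1) * n + i, (j + 1) * n + i + 1
            tris.append((a, b, c))
            tris.append((b, d, c))
    return points, tris
```

For this uniform scheme, tf=64 gives 65×65 = 4225 domain points and 2×64×64 = 8192 triangles per patch.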

Like I said, the TS output stream is then very small. Whether it goes to GDS or an off chip ring buffer before being read into LDS shouldn't matter. You still need the control point data of the hull shader, but that should be able to remain in the LDS. For tessellation this shouldn't be an issue. The throughput needed on Cypress (one tri per clock) is very tiny. If it can balance vertex shaders, then it can balance domain shaders.
Scheduling granularity appears to be the issue for tessellation though - once a SIMD is allocated some HS work, then it's committed for a highly uncertain period to all the DS work that derives from the patch(es) that that instance of HS produces.

If off-die buffering of tessellation data is a performance enhancement for Cayman (as apparently advertised), then in my view that points directly to reduced sensitivity to granularity.

It might be that the combination of 64KB for patch parameters in GDS plus the fragmented LDSs of 32KB each are jointly creating the granularity problem, rather than either on its own. Hard to say...
 
Maybe Firaxis are using something like what is proposed on page 19:
http://developer.download.nvidia.com/presentations/2010/gdc/Tessellation_Performance.pdf

Screen Space Adaptive Tessellation
• Triangles under 8 pixels are not efficient
Anyone know their average area in HAWX2?

• Consider limiting the global maximum TessFactor by screen resolution
• Consider the screen space patch edge length as a scaling factor
The GTC presentation goes into this subject at some length.

But then, it's unlikely that a GTX 470 is faster than a GTX 480, or that the overclocked EVGA GTX 460 is slower than a stock GTX 460 with 1GB. *sigh*
Do you think PCGH will do any benchmarking?
 
The DS I have here is reading floats from LDS, not ints. It's not reading ints and doing i2f conversions.
Well, I don't know what to say. I'm just telling you the DX11 specs for tessellation.

No, I'm simply describing scenarios where multiple patches are shaded by a single HS hardware thread. These would, presumably, form a contiguous set of entries in TS's input buffer (GDS, 64KB). If that's the case, then TS cannot output vertices to the patch-originating SIMD once LDS is full.
You're making a lot more assumptions than that. There's no reason to simultaneously store all the triangles produced from tessellating a thread full of patches. You have an HS thread producing patch data; store that in the GDS until the input buffer is full (64KB is overkill), and then let the HS thread sit idle until there's room again. In the meantime, TS is cranking out barycentric points one at a time, DS is consuming them 64 at a time, and it has no need to finish the shader any faster than 64 per clock because setup can't consume them any faster than that.
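That pipeline is essentially a bounded producer/consumer chain, which can be modelled with a toy simulation (all sizes and rates below are invented for illustration, not hardware numbers):

```python
from collections import deque

def simulate(num_patches, pts_per_patch, fifo_cap):
    """Toy model of the chain above: HS fills a bounded patch FIFO
    (standing in for GDS) and stalls when it is full; TS drains one
    patch at a time, emitting one point per 'clock'. Rates and sizes
    are illustrative only."""
    fifo = deque()
    produced = emitted = hs_stalls = clocks = 0
    ts_left = 0  # points remaining in the patch TS is working on
    while emitted < num_patches * pts_per_patch:
        clocks += 1
        # HS side: one patch per clock unless the FIFO is full
        if produced < num_patches:
            if len(fifo) < fifo_cap:
                fifo.append(pts_per_patch)
                produced += 1
            else:
                hs_stalls += 1
        # TS side: grab the next patch if idle, then emit one point
        if ts_left == 0 and fifo:
            ts_left = fifo.popleft()
        if ts_left:
            ts_left -= 1
            emitted += 1
    return clocks, hs_stalls
```

With a tiny FIFO the HS producer stalls while TS streams one point per clock without ever starving, which is exactly the behaviour being argued for here.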

There's no need to store 337*16 triangles.
TS could only continue working if there's enough patches from differing SIMDs. In theory TS should be able to select patches freely from GDS, rather than treating it as a FIFO, so TS shouldn't stall due to a lack of patches or lack of SIMD destinations for vertices - unless there's not enough SIMDs outputting patches.
:?: enough patches from differing SIMDs? Nothing wrong with a FIFO. Why would the TS stall? It's by far the slowest stage in the pipeline.
Scheduling granularity appears to be the issue for tessellation though - once a SIMD is allocated some HS work, then it's committed for a highly uncertain period to all the DS work that derives from the patch(es) that that instance of HS produces.
So what? There's a lot of SIMDs per chip, and you could even be clever in your compiling so that the HS exists alongside another shader in the same SIMD. Occupation of SIMDs is not the performance problem of Evergreen or NI. It's triangle throughput.

You're being really narrow-minded if you think off-die buffering can only improve performance if granularity sensitivity is a problem. There's all sorts of possibilities. If ATI GPUs can do 1 tri per clock with a regular VS using VBs/IBs from RAM, then they should be able to do the same with a DS, whether it fetches the barycentric coords from RAM, GDS or LDS. IMO the only possible bottleneck is the tessellator itself or a data path.
 
Using Civ 5 as a benchmark...well tbh I dunno how they even do it and get repeatable numbers.

I've seen zooming in, zooming out, leader benchmarks, all giving wildly inconsistent numbers. I for one can barely tell any difference between max dx11 + tessellation and low settings dx9, apart from having no AA in the dx9 game.
 
ATI Still Suffers from TSMC's 40nm Problems - Company

Even though Taiwan Semiconductor Manufacturing Company claims that the yields on its most advanced 40nm node are high enough for mass production of advanced chips, actual designers and suppliers of graphics processing units still seem to have problems.

"We have had very good demand for our Radeon HD 5700-series and 5800-series. [...] We could not meet all our demand in the third quarter of this year on the Radeon HD 5800- and 5700-series. But the overall situation was getting better in Q3," said Matt Skynner, corporate vice president and general manager of GPU division at AMD in an interview with X-bit-labs.

In the third quarter of 2010, demand for TSMC’s wafers continued to increase, and wafer shipments in all major semiconductor market segments, except computer, increased from their second quarter levels. The latter fact means that companies like Advanced Micro Devices and Nvidia Corp. are shifting orders from older manufacturing processes to newer ones.

However, 40nm production capacities may not be enough for the two. TSMC's 40nm chips increased their share to 17% of the company's earnings, which means that actual revenue from the 40nm process tech increased by 13.6% in Q3 2010, to $623.25 million. In the meantime, in the third quarter Nvidia released several new mainstream and performance-mainstream graphics processors built using 40nm process tech, whereas its rival ATI started to manufacture its new performance-mainstream graphics processing unit, the Radeon HD 6870.

It should be noted that while AMD does not expect to meet demand on its newest chip this quarter, it does not blame TSMC for that.

"We have just launched a new product. With the launch of every new product the demand often outstrips supply as we ramp up production. I do not see it as a 40nm capacity issue, but it is a normal issue during the ramp up process. There could be issues with availability of our new products, but those may not be capacity issues," added Mr. Skynner.

Source: http://www.xbitlabs.com/news/video/...uffers_from_TSMC_s_40nm_Problems_Company.html
 
If ATI GPUs can do 1 tri per clock with a regular VS using VBs/IBs from RAM, then they should be able to do the same with a DS, whether it fetches the barycentric coords from RAM, GDS or LDS. IMO the only possible bottleneck is the tessellator itself or a data path.
On top of that, an off-die buffer would seem completely backwards, considering that the advantages tessellation offers are a lower memory footprint and lower bandwidth requirements compared to a "full precision" mesh... hell, why not pre-tessellate, or not tessellate at all and use full-precision meshes from the start, then?

That's not a reason to push the slider to the 64 mark just because we can, though, nor to give cars their shapes from nothing but boxes and textures, as that creates new issues, with shadowing for example.
 
Well, I don't know what to say. I'm just telling you the DX11 specs for tessellation.
That's the spec for computation, I believe, not storage. And since we were talking about the space consumed in LDS, then 8 bytes per vertex is the truth.

You're making a lot more assumptions than that. There's no reason to simultaneously store all the triangles produced from tesselating a thread full of patches.
It's currently impossible anyway, since LDS is too small when TF=64.

You have a HS thread producing patch data, store that in the GDS until the input buffer is full (64k is overkill), and then let the HS thread sit idle until there's room again. In the meantime, TS is cranking out barycentric points one at a time, DS is consuming them 64 at a time, and it has no need to finish the shader any faster than 64 per clock because setup can't consume them any faster than that.
Of course. DS shaders tend to take a while.

There's no need to store 337*16 triangles.
Well, for a square patch the correct number would be 450*16 (thanks to AlexV for the correction - doh for not even thinking about it). Clearly LDS can't store them all. A ring buffer would have no trouble storing them all, of course. Or the 8192 vertices that TF=64 produces.
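The LDS-capacity claim can be sanity-checked with the thread's own figures (8 bytes per vertex from the earlier post, a 32KB per-SIMD LDS; these are the numbers as quoted here, not official specs):

```python
# Back-of-envelope check of the capacity argument above.
vertices = 8192          # TF=64 output, per the post
bytes_per_vertex = 8     # two 32-bit floats per domain point
lds_bytes = 32 * 1024    # per-SIMD LDS size quoted in this thread

needed = vertices * bytes_per_vertex
assert needed == 64 * 1024   # 64 KiB — exactly double the LDS
assert needed > lds_bytes
```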

And when TS is working its way through a patch that results in thousands of vertices, it can't load-balance itself and switch to another patch when the destination LDS for the first patch runs out of capacity.

:?: enough patches from differing SIMDs? Nothing wrong with a FIFO. Why would the TS stall? It's by far the slowest stage in the pipeline.
B3D article says HS is 5% arithmetic throughput - i.e. 1 SIMD on Cypress. I think that's prolly too low for the generic case and potentially more of a reflection of the workload balance they gave the chip in the test. But still, there's likely to be a tendency to minimise the number of SIMDs running locked pairs of HS/DS.

If enough SIMDs are allocated to HS/DS then even with high TF there should be enough patches for TS to consume. But we've got zero data on SIMD allocation in high-tessellation scenarios...

So what? There's a lot of SIMDs per chip, and you could even be clever in your compiling so that the HS exist alongside another shader in the same SIMD.
That's not a compilation problem.

It's an open question whether a SIMD can support more than 2 different types of shaders. With HS and DS appearing to be a locked pair, a third type seems unlikely... This is primarily a register allocation problem: if you're allocating registers, how do you segregate the shader types and avoid fragmentation, with more than 2 shader types?

If Cayman uses off-die buffering when tessellation is active, one possibility is that all output from HS and TS goes off-die. This would allow asymmetric configurations of HS/DS to occupy SIMDs and allow more-fluid load-balancing of HS and DS workload.

Occupation of SIMDs is not the performance problem of Evergreen or NI. It's triangle throughput.
So off-die buffering is irrelevant?

You're being really narrow-minded
:oops: Cayman's T ALU says hi :LOL:

if you think off-die buffering can only improve performance if granularity sensitivity is a problem. There's all sorts of possibilities.
Which you still haven't elucidated.

If ATI GPUs can do 1 tri per clock with a regular VS using VBs/IBs from RAM, then they should be able to do the same with a DS, whether it fetches the barycentric coords from RAM, GDS or LDS. IMO the only possible bottleneck is the tessellator itself or a data path.
The "data path". Why does ATI's GS always push data off die?
 