AMD: Volcanic Islands R1100/1200 (8***/9*** series) Speculation/Rumour Thread

In ATI/AMD GPU architectures the rasteriser and ROPs go together; they are tightly linked. NVidia keeps them independent of each other.

Hawaii and Tahiti are no different in this respect. The ROPs are also on the shader side of the shader/memory crossbar, which is why the ROPs in Tahiti don't line up with the memory channels.

Both diagrams also show that all geometry engines communicate with all rasterisers.

The fundamental structure of these things hasn't changed. There's just more of them.

So is the pixel-to-shader mapping static for AMD? With the rasterizer feeding fragments to statically assigned shaders?
 
Does 1MB of L2 have any effect there? Will we ever find out?

I dare say it seems a bit ironic, but we're likely to find out more about this architecture (in graphics) because of the consoles than we've learnt so far.

I'm tempted to say we'd have found out already; it's been 2 years, and Hawaii is only 33% bigger in this respect.

On the other hand, again, once console developers start digging...

The L2 was scaled up by 33%, matching the 33% increase in compute.

We are not going to see more cache until the compute workloads start to get a bit more complex. And even then, I am sceptical. LRB or even KC is too far from Kepler/GCN; the latter are still too optimized for massively parallel workloads.
 
Actually, I'll temper that a bit. 8MP may well be where this chip shines. Developers these past few years mostly haven't been writing compute-heavy graphics, and when they do, reviewers leave that option off because NVidia is screwed. So games are compute-light, which means that 8MP (4K) monitors are home territory for Hawaii.

Absolutely. L2 and compute scaled by a middling 30%. Most of the work here seems to have gone into the frontend, geometry and ROPs. Frontend, probably because it is cheap and, after the consoles, they had the IP lying around. Geometry will help at 4K, and the ROPs seem made for 4K.
 
Four rasterisers with pixels locked to those rasterisers' ROPs certainly looks like the corner that NVidia studiously painted away from. It doesn't give an impression of robustness when presented with tricky workloads.
So you think Nvidia doesn't tie its ROPs to specific pixels?

Well, I'm hoping that putting the geometry units inside the shader engines is symbolic of AMD reworking the data flow instead of algorithmically cramming tessellation into the geometry shader streamout model.
That's just it, though: AMD's method needlessly uses a lot of on-chip space when it should barely use any at all.

You don't need to start with 32 or 64 patch verts and generate all the tessellated triangles from that (which could require a lot of space or even streamout). The wavefronts should be post-tessellated vertices (you know how many there are from the tess factors) which read patch parameters (possibly as few as three) to calculate their barycentric coords. There absolutely should not be any performance degradation with higher scaling factors, even beyond the D3D max.
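To put that concretely, here is a minimal sketch (C++, names made up) of how a lane could derive its barycentric coords from nothing but its post-tessellation vertex index and the patch's tess factor. It assumes a plain uniform grid over the tri domain rather than the exact D3D11 point pattern, so the counts won't match the real tessellator exactly:

```cpp
#include <cstdint>
#include <cstdio>

// Vertex count for a uniformly tessellated tri patch with integer factor f
// (simple grid: points (i, j) with i + j <= f).
static uint32_t vertexCount(uint32_t f) { return (f + 1) * (f + 2) / 2; }

// Map a linear post-tessellation vertex index to barycentric coordinates.
// A domain-shader lane only needs (patch id, vertex index, tess factor);
// everything else is derivable, so no per-triangle buffering is required.
static void vertexToBarycentric(uint32_t f, uint32_t index,
                                float& u, float& v, float& w) {
    uint32_t row = 0;
    uint32_t rowSize = f + 1;       // row 0 holds f+1 vertices, row 1 holds f, ...
    while (index >= rowSize) {      // walk down the rows of the triangular grid
        index -= rowSize;
        --rowSize;
        ++row;
    }
    u = float(index) / float(f);    // position along the row
    v = float(row) / float(f);      // which row
    w = 1.0f - u - v;
}

int main() {
    const uint32_t f = 4;
    printf("factor %u -> %u vertices\n", f, vertexCount(f));
    float u, v, w;
    for (uint32_t k = 0; k < vertexCount(f); ++k) {
        vertexToBarycentric(f, k, u, v, w);
        printf("v%2u: (%.2f, %.2f, %.2f)\n", k, u, v, w);
    }
    return 0;
}
```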
What makes you think AMD requires more on chip space than Nvidia?

I don't understand your second paragraph. Which wavefronts are you saying should be post-tessellated vertices? I assume you're referring to DS waves as that's what you're describing, but I'm not sure of the link between your last sentence and the rest of the paragraph.
 
So is the pixel-to-shader mapping static for AMD? With the rasterizer feeding fragments to statically assigned shaders?
No, simple round-robin, as long as the compute unit has space. I'm not sure if that's what you're asking though. The mapping from pixel to wavefront index (work item ID within a wavefront) is static, because hierarchical-Z maps its tile hierarchy to render target pixels statically.
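Purely to illustrate what "static" means here (the real tile size and lane order aren't public, so everything below is an assumption), the lane could be a fixed function of the pixel's screen coordinates, e.g. an 8x8 tile packed as 2x2 quads:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical illustration only: if each 64-wide pixel wavefront covered one
// 8x8 screen tile, packed as 2x2 quads, the lane would be a pure function of
// the pixel coordinates, independent of which triangle produced the fragment.
struct PixelSlot {
    uint32_t tileX, tileY;  // which 8x8 tile of the render target
    uint32_t lane;          // work-item ID within the 64-wide wavefront
};

static PixelSlot mapPixel(uint32_t x, uint32_t y) {
    uint32_t lx = x % 8, ly = y % 8;                 // position inside the tile
    uint32_t quad   = (ly / 2) * 4 + (lx / 2);       // which 2x2 quad (0..15)
    uint32_t inQuad = (ly % 2) * 2 + (lx % 2);       // position within the quad
    return { x / 8, y / 8, quad * 4 + inQuad };
}

int main() {
    PixelSlot a = mapPixel(13, 5);
    printf("pixel (13,5) -> tile (%u,%u), lane %u\n", a.tileX, a.tileY, a.lane);
    return 0;
}
```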

We are not going to see more cache until the compute workloads start to get a bit more complex.
Yes, console compute is going to take a while to kick in. Consoles are "stuck at 1920x1080", so there'll probably be a relatively rapid climb in interest in graphics-compute, but games have 1-5+ year development cycles... On the other hand, there will be compute on the console GPUs which may be left as CPU compute when those games are transferred to PC.

Absolutely. L2 and compute scaled by a middling 30%. Most of the work here seems to have gone into the frontend, geometry and ROPs. Frontend, probably because it is cheap and, after the consoles, they had the IP lying around. Geometry will help at 4K, and the ROPs seem made for 4K.
When you say frontend, are you referring to the increase in ACEs? That should be compute friendly (e.g. in "guaranteeing" response times for certain compute tasks), but again that's going to take a while.
 
So you think Nvidia doesn't tie its ROPs to specific pixels?
Titan has 5 rasterisers and 48 ROPs.

I've realised I've been sloppy and should have been referring to fragments. Fragments in AMD are locked to ROPs. I don't see anything like that in NVidia.

I suppose it's possible NVidia has a fixed tiling of ROPs to render target pixels, but the hierarchy of rasterisation, render back end, L1, memory crossbar, L2 and memory channels doesn't seem to require that.

NVidia's implementation of hierarchical-Z could be a factor here, fixing certain things. So, maybe I'm missing something there.
 
Rasterizers should be locked to specific pixels (a set of render target tiles), and ROPs are also locked to specific pixels (another set of render target tiles). ROPs are tied to memory channels in nV's architecture, so a ROP can only access a subset of the render target anyway; even on Tahiti the crossbar between ROPs and memory controllers is not complete, so the same should be true there to some extent. But nV doesn't have to use the same tile sets for front-end and back-end pixel processing, resulting in some interleaving scheme distributing the load. ;)
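As a toy illustration of that split (tile size, channel count and the interleave functions below are all made up), the back end could hash render-target tiles onto ROP/channel pairs with one fixed function while the front end distributes screen tiles over rasterisers with a different one:

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative sketch only: if each ROP partition sits next to one memory
// channel, the render target can be statically tiled so every channel owns an
// interleaved set of tiles, while the rasteriser front end uses a different
// tiling of its own.
static const uint32_t kTileSize    = 16;  // pixels per tile edge (assumption)
static const uint32_t kNumChannels = 8;   // ROP/memory channel pairs (assumption)

// Back-end mapping: which ROP/channel owns this render-target tile.
static uint32_t ropChannelForPixel(uint32_t x, uint32_t y) {
    uint32_t tx = x / kTileSize, ty = y / kTileSize;
    return (tx + ty * 3) % kNumChannels;   // skewed interleave spreads bursts
}

// Front-end mapping: which rasteriser owns this screen tile (a different set).
static uint32_t rasteriserForPixel(uint32_t x, uint32_t y, uint32_t numRast) {
    uint32_t tx = x / kTileSize, ty = y / kTileSize;
    return (tx ^ ty) % numRast;            // any fixed function of (tx, ty) works
}

int main() {
    printf("pixel (100, 40): channel %u, rasteriser %u of 4\n",
           ropChannelForPixel(100, 40), rasteriserForPixel(100, 40, 4));
    return 0;
}
```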
 
The Hawaii diagram indicates that all geometry engines can feed all rasterisers
Well that's a given. You can't avoid that.

You're apparently suggesting a non-FF tessellator, I think.
By FF you mean fixed function? Actually, I'd be even happier with that, but I don't think AMD did FF. I think AMD made a few small tweaks to the geometry shader so that tessellation could be done with shader code.

The ideal solution, AFAICS, is to have a FIFO with patch tessellation factors (one wavefront's worth is more than plenty) and params, and then some FF-tessellator logic (which should be very tiny, given the limited data paths needed) generates barycentric coords at a rate of 1-4 per clock. Those coords stuff a new wavefront and the domain shader proceeds from there.

Alternatively (and this is what I suggested in my previous post), there could be some FF logic that just generates simple indices for each vertex to be created by the tessellator, stuffs them into wavefronts, and a shader generates the barycentric coords (possibly even at the front of the domain shader itself).
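Here is a toy model of the first scheme, just to show how little buffering it needs; the FIFO contents, emission rate and wave size are all assumptions, and the grid is the simple uniform one again:

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Toy model only: pop (patch, factor) entries from a small FIFO, have a tiny
// fixed-function walker emit barycentric coords at up to kCoordsPerClock per
// "clock", and launch a domain-shader wave every time 64 lanes are filled.
// The only storage needed is the FIFO plus one partially filled wave.
struct DomainVert { uint32_t patch; float u, v; };

static const uint32_t kCoordsPerClock = 4;   // assumed peak emission rate
static const uint32_t kWaveSize       = 64;

int main() {
    std::vector<std::pair<uint32_t, uint32_t>> fifo = {{0, 4}, {1, 4}, {2, 7}};

    std::vector<DomainVert> wave;
    uint32_t waves = 0, clocks = 0, emitted = 0;

    for (const auto& entry : fifo) {
        uint32_t patch = entry.first, f = entry.second;
        for (uint32_t row = 0; row <= f; ++row) {
            for (uint32_t i = 0; i + row <= f; ++i) {
                wave.push_back({patch, float(i) / f, float(row) / f});
                if (++emitted == kCoordsPerClock) { emitted = 0; ++clocks; }
                if (wave.size() == kWaveSize) { ++waves; wave.clear(); }
            }
        }
    }
    if (!wave.empty()) ++waves;               // launch the final partial wave
    printf("%u domain-shader waves, roughly %u clocks of coordinate generation\n",
           waves, clocks);
    return 0;
}
```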

But when we see bad scaling with tessellation factor
[image: 6j0Udjb.png — tessellation-factor scaling graph (broken link, "no deeplinking please!" placeholder)]

and talk about off-chip buffer support for high tessellation levels, it's clear to me that AMD is doing something wrong.

There is no need for more than a token amount of buffering space. I think they are generating all triangles from a set of control points, and then doing the domain shader after they are generated. That's the only explanation I can think of for needing off-chip storage.

What makes you think AMD requires more on chip space than Nvidia?
See the graph above.

I don't understand your second paragraph. Which wavefronts are you saying should be post-tessellated vertices? I assume you're referring to DS waves as that's what you're describing, but I'm not sure of the link between your last sentence and the rest of the paragraph.
Okay, I probably didn't describe it well.

Suppose you have a hull shader wavefront with 64 tripatches (I don't know if this figure is correct, but let's assume so). I think AMD processes tessellation in parallel, so if they all had a factor of 4 (24 tris per patch, 19 verts), we'll get 1536 tris. For higher factors, this number gets out of control, so AMD dumps the generated tessellation (uv pairs to become verts via the DS) to RAM.

What I think they should be doing is just buffering the tessellation factors for the 64 tripatches, and then have a FF unit go through the patches one by one, i.e. patch #1 has 19 verts, patch #2 has 19 verts, etc, and then put together a domain shader wavefront of 64 verts (19 from each of patches #1-3, 7 from patch #4). This wavefront will either have barycentric coords calculated by a FF unit or simply indices so that the shader can calculate the coords. It will also generate an index buffer (alternatively, you could use more verts per patch and do implicit tristrips). Now you don't need to store 1536 tris.
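For what it's worth, here is that bookkeeping written out, using the per-patch figures from above (64 tri-patches at factor 4, 19 verts each). A lane only ever needs (patch index, local vertex index), so nothing like 1536 triangles has to be stored anywhere:

```cpp
#include <cstdint>
#include <cstdio>

// Sketch of the packing described above. kVertsPerPatch is the figure quoted
// in the post for tess factor 4; the barycentric coords and the index buffer
// can be derived later from (patch, local vertex index).
static const uint32_t kWaveSize      = 64;
static const uint32_t kPatches       = 64;   // HS wavefront of 64 tri-patches
static const uint32_t kVertsPerPatch = 19;   // factor 4, figure from the post

int main() {
    uint32_t totalVerts = kPatches * kVertsPerPatch;            // 1216
    uint32_t waves = (totalVerts + kWaveSize - 1) / kWaveSize;  // 19 DS waves

    printf("%u verts -> %u domain-shader waves of %u lanes\n",
           totalVerts, waves, kWaveSize);

    // Show how the first wave straddles patch boundaries (0-indexed here):
    // 19 verts each from patches 0-2, then 7 verts from patch 3.
    for (uint32_t lane = 0; lane < kWaveSize; ++lane) {
        uint32_t patch    = lane / kVertsPerPatch;   // which patch this lane serves
        uint32_t localIdx = lane % kVertsPerPatch;   // which vertex within it
        if (lane < 4 || lane > 59)                   // print a few sample lanes
            printf("lane %2u -> patch %u, vertex %2u\n", lane, patch, localIdx);
    }
    return 0;
}
```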

I believe NVidia is doing something of this sort. Their polygon throughput is constant with tessellation factor.
 
Whatever picture you tried to post didn't work; you instead got a "no deeplinking please!" placeholder.
 
No, simple round-robin, as long as the compute unit has space. I'm not sure if that's what you're asking though. The mapping from pixel to wavefront index (work item ID within a wavefront) is static, because hierarchical-Z maps its tile hierarchy to render target pixels statically.

If the pixel to wavefront ID is mapped statically, then how do they fill the wavefronts fully? A triangle is quite likely to not fill a wavefront fully, and multiple triangles will generate a lot of superfluous fragments along the edges due to quad shading. Those superfluous fragments will have to fill the next wavefront, which won't have fragments from the bulk of the triangle to fill up.

When you say frontend, are you referring to the increase in ACEs? That should be compute friendly (e.g. in "guaranteeing" response times for certain compute tasks), but again that's going to take a while.
Well, tbh, with the single-threaded graphics dispatch, they aren't going to do any good. Maybe that's where Mantle will help: true parallel submission.
 
Titan has 5 rasterisers and 48 ROPs.

I've realised I've been sloppy and should have been referring to fragments. Fragments in AMD are locked to ROPs. I don't see anything like that in NVidia.

I suppose it's possible NVidia has a fixed tiling of ROPs to render target pixels, but the hierarchy of rasterisation, render back end, L1, memory crossbar, L2 and memory channels doesn't seem to require that.

NVidia's implementation of hierarchical-Z could be a factor here, fixing certain things. So, maybe I'm missing something there.

If ROPs sit on the other side of the shader-memory crossbar, then ROPs have to be statically tiled, at least when the ROP count is a multiple of the memory channels. So I would think they have fixed tiling on nV but not on AMD, which has ROPs on the shader side.
 
If the pixel to wavefront ID is mapped statically, then how do they fill the wavefronts fully? A triangle is quite likely to not fill a wavefront fully, and multiple triangles will generate a lot of superfluous fragments along the edges due to quad shading. Those superfluous fragments will have to fill the next wavefront, which won't have fragments from the bulk of the triangle to fill up.
I never got a full answer on this topic, but I think up to 4 triangles each of up to 16 fragments can share a wavefront, or combinations thereof (on the basis that the rasteriser has a granularity of 16 fragments, and they are derived from a single triangle, per clock).

The edges are a question I don't know how to answer.

Arguably the simplest solution is to say that my 1:1 mapping from earlier is wrong.
 
I never got a full answer on this topic, but I think up to 4 triangles each of up to 16 fragments can share a wavefront, or combinations thereof (on the basis that the rasteriser has a granularity of 16 fragments, and they are derived from a single triangle, per clock).

The edges are a question I don't know how to answer.

Arguably the simplest solution is to say that my 1:1 mapping from earlier is wrong.
Each pixel wave can be made up of as many as 16 triangles. The smallest granularity is a quad.
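A toy model of that packing (not the real hardware, just the arithmetic): 16 quads of 2x2 pixels make a 64-lane wave, so at one extreme 16 single-quad triangles share a wave, while a large triangle fills waves on its own:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy model of quad-granularity wave packing: quads from successive triangles
// are appended into waves of 16 quads (64 lanes); quads are never split.
struct QuadRecord { uint32_t triangle; uint32_t quadId; };

int main() {
    // Hypothetical rasteriser output: covered quads produced by each triangle.
    std::vector<uint32_t> quadsPerTriangle = {1, 1, 3, 20, 1, 6};

    std::vector<std::vector<QuadRecord>> waves(1);
    for (uint32_t tri = 0; tri < quadsPerTriangle.size(); ++tri) {
        for (uint32_t q = 0; q < quadsPerTriangle[tri]; ++q) {
            if (waves.back().size() == 16)        // 16 quads = 64 lanes: wave full
                waves.emplace_back();
            waves.back().push_back({tri, q});
        }
    }

    for (uint32_t w = 0; w < waves.size(); ++w) {
        uint32_t tris = 0, lastTri = UINT32_MAX;
        for (const QuadRecord& r : waves[w])
            if (r.triangle != lastTri) { ++tris; lastTri = r.triangle; }
        printf("wave %u: %zu quads (%zu lanes) from %u triangle(s)\n",
               w, waves[w].size(), waves[w].size() * 4, tris);
    }
    return 0;
}
```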
 