AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

  Total voters: 155
  Poll closed.
Setup does a coarse rasterisation, identifying all the tiles that a triangle at least partially covers, then giving the rasteriser(s) a list of tiles and triangle data in order to rasterise. I suspect the rasteriser has a tile-centric view of rasterisation, not a triangle-centric view. That's because threads of 16 quads of fragments need to be despatched, and those need to be strictly tile-aligned (because the render target is tiled). Though I also expect it to handle triangles in strict order.
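To make the coarse rasterisation step concrete, here is a minimal Python sketch. It is not based on any R800 documentation: the 8x8 tile size, the function names and the conservative corner test are all assumptions chosen for illustration. It walks the tiles inside the triangle's bounding box and drops any tile whose four corners all lie outside the same edge.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function: positive when P lies to the left of the directed edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def coarse_raster(tri, tile=8):
    """Return the (tx, ty) tiles a triangle may at least partially cover.

    Hypothetical helper: tile size and the conservative test are assumptions.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    # Force counter-clockwise winding so that "inside" means edge() >= 0.
    if edge(x0, y0, x1, y1, x2, y2) < 0:
        (x1, y1), (x2, y2) = (x2, y2), (x1, y1)
    verts = [(x0, y0), (x1, y1), (x2, y2)]

    xs = [v[0] for v in verts]
    ys = [v[1] for v in verts]
    tx_min, tx_max = int(min(xs) // tile), int(max(xs) // tile)
    ty_min, ty_max = int(min(ys) // tile), int(max(ys) // tile)

    tiles = []
    for ty in range(ty_min, ty_max + 1):
        for tx in range(tx_min, tx_max + 1):
            corners = [
                (tx * tile, ty * tile),
                ((tx + 1) * tile, ty * tile),
                (tx * tile, (ty + 1) * tile),
                ((tx + 1) * tile, (ty + 1) * tile),
            ]
            rejected = False
            for i in range(3):
                ax, ay = verts[i]
                bx, by = verts[(i + 1) % 3]
                # A tile is certainly empty if every corner is outside one edge.
                if all(edge(ax, ay, bx, by, cx, cy) < 0 for cx, cy in corners):
                    rejected = True
                    break
            if not rejected:
                tiles.append((tx, ty))
    return tiles
```

For example, coarse_raster([(1, 1), (14, 1), (1, 14)]) keeps three of the four tiles in the bounding box and rejects tile (1, 1), which lies entirely beyond the diagonal edge.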

Is that based on some documentation you saw? Why would setup need to determine tile coverage? Shouldn't it be: vertices -> setup -> triangle -> raster -> tiles -> shaders?

One of the key questions that's still unanswered is whether a thread of fragments can refer to more than one triangle (e.g. 5 adjacent small triangles from a strip).

Nvidia had a patent on something like that so people are thinking about it.

I think it might be a matter of practicality in instancing a block of hardware rather than re-jigging things for 32-rasterisation. I don't think the number 32 is problematic (since other ATI GPUs have 4-, 8- and 12-rasterisers) merely that scaling isn't free of latency/pipelining issues across the entire width of the unit.

Jawed

Oh where'd you get those tile sizes from?

It's a pity setup rate wasn't doubled (I still think it would be easy, given that the rasterizers deal with different tile sets), but meh ... Being a bitch about getting implementation details just because they aren't relevant to performance strikes me as counter-productive; I'd still rather hear them than not.

I want details. You want details. I don't see the difference.....

It's not like it was a paper launch where misunderstandings could fester for months ...

Heh, I'm sure you'll be able to sleep again soon once the DP throughput is confirmed :LOL:
 
I want details. You want details. I don't see the difference.
If it's just an increase in scan conversion throughput it's stupid.
We know now that it's just an increase in scan conversion throughput ... regardless of whether it's the truth, you are saying it's stupid that the double rasterizers were in the diagram, AFAICS.
 
We know now that it's just an increase in scan conversion throughput ... regardless of whether it's the truth, you are saying it's stupid that the double rasterizers were in the diagram, AFAICS.

Nope, not exactly. I'm saying that if it's the same kind of increase that we've always been getting then it's weird that they suddenly chose to market it like this, given how much people have been hyping the setup bottleneck. If (as people are speculating) the implementation is tangibly different in some meaningful way then I'm just as interested as you to know more.
 
Re: Bandwidth

Look what 5870 does with only +33% bandwidth compared to 4870. So I'm guessing that a new card with slightly more bandwidth than 4770 can deliver some nice gains.

I'm kinda curious what's going to replace 4350, 4550 and 4650.
 
If (as people are speculating) the implementation is tangibly different in some meaningful way then I'm just as interested as you to know more.
It reduces fanout inside the rasterizer and allows you to get the rasterizers closer to the shader cores. It was also probably less work.
 
Is that based on some documentation you saw? Why would setup need to determine tile coverage? Shouldn't it be: vertices -> setup -> triangle -> raster -> tiles -> shaders?
No, I haven't seen any documentation on this for R800. I was thinking in terms of each rasteriser working solely on the tiles it's given. The alternative is that both rasterisers get all triangles and decide which tiles they own by first doing the coarse rasterisation themselves. Since you queried it, I now think the latter approach is more likely.
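As a rough illustration of the second approach, where each rasteriser sees every triangle and keeps only its own tiles, here is a small Python sketch. The checkerboard ownership rule and the function names are purely assumptions for the example; nothing here describes how R800 actually assigns tiles.

```python
def owner(tx, ty, n_rasterisers=2):
    """Hypothetical ownership rule: checkerboard tiles across the rasterisers."""
    return (tx + ty) % n_rasterisers


def split_tiles(covered_tiles, n_rasterisers=2):
    """Give each rasteriser only the covered tiles it owns, preserving order."""
    work = [[] for _ in range(n_rasterisers)]
    for tx, ty in covered_tiles:
        work[owner(tx, ty, n_rasterisers)].append((tx, ty))
    return work
```

A triangle covering a 2x2 block of tiles splits evenly, but one that fits inside a single tile, e.g. split_tiles([(3, 2)]), leaves one rasteriser with nothing to do for that triangle.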

Oh where'd you get those tile sizes from?
Those aren't tile sizes, they're rasterisation rates. The rasterisation rate equals the colour fillrate on ATI. So my X1950Pro has 12 RBEs and therefore needs a 12-rasteriser. The tiles it works with might be 8x8 pixels or larger. Thread size is 48.

The mapping of tiles and threads is something I don't understand. e.g. does this mean that tiles are 12x12 pixels on X1950Pro (144 pixels is 3x48) and restricted to such multiples? Or does this point to more flexibility with the way quads of fragments are assigned to threads?
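The arithmetic behind that question is easy to check. The sketch below (a hypothetical helper, assuming square tiles with even sides so that they divide into 2x2 quads) lists which tile sizes hold a whole number of threads for a given thread size.

```python
def whole_thread_tiles(thread_size, max_side=32):
    """List (side, threads_per_tile) for square tiles that hold whole threads.

    Only even sides are considered, since fragments are grouped in 2x2 quads.
    """
    return [
        (side, side * side // thread_size)
        for side in range(2, max_side + 1, 2)
        if (side * side) % thread_size == 0
    ]
```

With a thread size of 48, only sides that are multiples of 12 qualify (12x12 holds 3 threads, 24x24 holds 12), whereas a thread of 16 quads (64 fragments) fits 8x8, 16x16, 24x24 and 32x32 tiles exactly.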

If distinct (discontiguous) quads of fragments from multiple tiles can share a thread, then it's only a small step from that capability to being able to run multiple triangles' fragments together in one thread, even if they share an edge.

If that's the case, then there would be a speed-up from two rasterisers on small triangles, contrary to my view in the last post.

But I don't know how loose or strict quad assignment to threads is.

My original understanding of ATI's architecture is that tile sizes are small and thread sizes are no bigger, and this goes right back to R300. This is in stark contrast with NVidia's approach, which on older GPUs has threads of as many as thousands of fragments (NV40), a regime under which the need to have multiple triangles' fragments share a thread is considerably more important.

A key feature of R300 onwards until R800 is that a fragment is fully described before pixel shading starts. Every vertex attribute is interpolated per pixel location, i.e. every fragment is independent of all other fragments and independent of the triangle that generated it. Though I think Primitive ID is carried forwards to be used in RBE, in order that primitives are written in their original order (as defined by setup) into the render target.

With that kind of independence, there's no reason why multiple triangles' fragments can't share a thread.

Except that Primitive ID may be something held by the scheduler (rather than per quad of fragments) - apart from anything else it's a key upon which to prioritise scheduling. But it also enables correct ordering for RBE. In which case there's no support for multiple triangles' fragments per thread.

The preponderance of small triangles in a tessellating pipeline might have made the designers bias towards Primitive ID-keyed quads of fragments. Alternatively they might have left the design alone, on the basis that the architecture is already happy with small triangles. Maybe the change was made in R600 when the tessellator first appeared?

R800's deletion of SPI means fragments are no longer independent of the triangle that generated them once pixel shading is under way. That theoretically means that each thread, at minimum, has triangle data associated with it (attributes + barycentrics). It may be that each quad of fragments is keyed for triangle, in order to perform interpolation, which brings us back to having multiple triangles' fragments sharing a thread.
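To show what "attributes + barycentrics" amounts to, here is a minimal Python sketch of interpolating one attribute at a pixel from the three per-vertex values. It assumes plain screen-space (not perspective-correct) interpolation and a non-degenerate triangle; the function names are hypothetical and this is not a description of the actual hardware path.

```python
def barycentrics(tri, p):
    """Barycentric weights of point p with respect to a 2D triangle.

    Assumes the triangle has non-zero area (otherwise this divides by zero).
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    px, py = p
    area = (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0)
    # Each weight is the sub-triangle area opposite that vertex over the total.
    w0 = ((x1 - px) * (y2 - py) - (y1 - py) * (x2 - px)) / area
    w1 = ((x2 - px) * (y0 - py) - (y2 - py) * (x0 - px)) / area
    w2 = 1.0 - w0 - w1
    return w0, w1, w2


def interpolate(vertex_values, weights):
    """Blend the three per-vertex attribute values with the given weights."""
    return sum(v * w for v, w in zip(vertex_values, weights))
```

The point is that the fragment needs its triangle's three attribute values at shading time, so a thread mixing fragments from several triangles would need several sets of them.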

The problem with that is that attribute interpolation (thread-wide) has variable throughput if a variable number of triangles shares the thread (similar to waterfalling fetches when constants are dynamically indexed). That's a compromise that isn't untenable, but it makes me wonder if it's worth handling. Attribute/barycentrics fetch for interpolation may be handled by the constant hardware, in which case it's not a special case and easily handled.
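The variable-throughput concern can be sketched as follows. This is a toy model, assuming one interpolation pass per distinct triangle in the thread (by analogy with waterfalling); the function name and the one-pass-per-triangle cost are assumptions, not known hardware behaviour.

```python
def waterfall_passes(quad_prim_ids):
    """Group the quads of one thread by the triangle that generated them.

    Returns a list of (prim_id, [quad indices]) pairs; under the assumed
    model, each pair costs one interpolation pass for the whole thread.
    """
    passes = {}
    for quad, prim in enumerate(quad_prim_ids):
        passes.setdefault(prim, []).append(quad)
    return list(passes.items())
```

A thread whose 16 quads all come from one triangle needs a single pass, while a thread filled from five small triangles would need five, which is the variable cost being weighed above.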

So, I dunno...

Jawed
 
fusion.gif


*runs for cover*
 