AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

  Total voters: 155 (poll closed)
There is slightly better co-issue, slightly better precision for FP, an additional instruction, etc.
The co-issue, MUL and dependent-ADD, I don't really understand. Is that producing two resultants in one cycle, i.e. the result of the MUL and the result of the MUL+ADD?

There's been a patent application for something like the SAD instruction for years now. Can I be bothered to dig it up and check it out?...

Then there is the reduced-precision integer math on the slim units that appears to be a step up from RV770, where only shifts were available.
This is important for addressing arithmetic, so will produce a very welcome speed-up in compute.
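
As a rough illustration (not ATI's actual ISA or the real unit widths), address math of the form base + index*stride only needs enough integer bits to span a buffer, so a reduced-precision multiply usually suffices:

#include <cstdint>

// Hypothetical sketch: a 24-bit integer multiply (in the spirit of mul24-style
// shader instructions) is enough for typical addressing arithmetic, since
// buffer indices rarely need more than 24 bits.
uint32_t mul24(uint32_t a, uint32_t b) {
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu);   // only the low 24 bits of each operand matter
}

uint32_t elementAddress(uint32_t base, uint32_t index, uint32_t stride) {
    return base + mul24(index, stride);          // base + index*stride without a full 32-bit multiplier
}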

This might mean that, as time drags on, AMD inches towards 32-bit integer operations across all units in the SIMD, as process nodes afford more space.
It's vastly more expensive than the double-precision arithmetic, though, so we could be waiting a long time.

I'm still waiting to hear what has doubled the complexity of Evergreen's scheduler.
20 clusters instead of 10?

Jawed
 
No. I demonstrated that nVidia is making a significant loss, not a profit. That's all.

ATi's slight loss can be explained by the manufacturing of DX11 GPUs, which have been stockpiled so far and will be sold soon.

You don't account for costs until you sell the stock (cost of sales). That's not how it works in financial reporting...
 
Yeah, looks problematic to me. Can't think of any justification other than "scaling bandwidth to 20 clusters is extremely hard".

Is prunedtree's matrix multiplication L2->L1 bandwidth-limited?
The bandwidth numbers appear to be in excess of the L2-L1 bandwidth, so maybe not to a first-order approximation.
I'm not sure how well it would hold if the doubled ALU capabilities in Cypress effectively halve the L2 bandwidth per SIMD.

The granularity of such accesses may not have changed, either. There may be some internal latency due to cache port collisions, which would be more numerous with 20 SIMDs.
 
The SIMD array is apparently banked, with two separate arrays.

The ROP groups do not appear to be similarly divided.
Is this just a diagram simplification, or is there something more to this?
My first guess is that we are looking at two independent Setup engines (being fed by a common vertex stream) followed by two independent shader engines (fragment shading). Since RBEs have to accept pixels from any cluster in R700, R600 etc., R800 is no different in that RBEs are screen-space tiled but can be fed by many clusters.

We don't know how screen space tiles are mapped across clusters in R700, for example. There's a 4-way tiling across RBEs, but there are 10 clusters.

So R800 has 4 RBEs, sinking pixels from 20 clusters.

The 2 rasterisers can each have distinct regions of screen space. This means that the two hierarchical-Z/stencil systems can work independently, yet still accelerate a common back buffer.

My problem with this is how the two rasterisers get their work. Until you've done a coarse rasterisation (identifying a screen-space tile, the first stage of hierarchical rasterisation), you don't know which rasteriser to use. Whoops: coarse rasterisation is itself the first step in rasterisation, so it's a chicken-and-egg problem.

A simple solution is that the two exchange coarse rasterisation results with each other, when they discover that they have tiles which they don't own screenspace for. The data at this point is still cheap to exchange (i.e. vertex coordinates). Alternatively, coarse rasterisation is so cheap it's easiest for every triangle to be coarse rasterised by both rasterisers and then the screen space tile or tiles (since a triangle could easily span tiles owned by both rasterisers) resulting from this first step determine how the remaining rasterisation steps proceed.
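
A toy sketch of that second option, with both rasterisers coarse-rasterising every triangle and keeping only the tiles they own; the tile size and the checkerboard ownership rule here are pure assumptions for illustration:

#include <algorithm>
#include <utility>
#include <vector>

// Toy model: the screen is divided into 64x64-pixel tiles, with ownership
// split between the two rasterisers by a checkerboard rule (an assumption,
// not anything ATI has described).
constexpr int kTileSize = 64;

struct Triangle { float x[3], y[3]; };

bool ownsTile(int rasteriserId, int tx, int ty) {
    return ((tx + ty) & 1) == rasteriserId;           // illustrative ownership rule only
}

// Coarse rasterisation: take the triangle's tile-aligned bounding box, then
// keep just the tiles this rasteriser owns for the remaining (finer) steps.
std::vector<std::pair<int, int>> coarseRasterise(int rasteriserId, const Triangle& t) {
    int tx0 = int(*std::min_element(t.x, t.x + 3)) / kTileSize;
    int tx1 = int(*std::max_element(t.x, t.x + 3)) / kTileSize;
    int ty0 = int(*std::min_element(t.y, t.y + 3)) / kTileSize;
    int ty1 = int(*std::max_element(t.y, t.y + 3)) / kTileSize;

    std::vector<std::pair<int, int>> myTiles;
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            if (ownsTile(rasteriserId, tx, ty))
                myTiles.emplace_back(tx, ty);          // these tiles get fine-rasterised here
    return myTiles;
}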

Jawed
 
The L shaped purple blocks on the side of each ROP block apparently have links to both hierarchical Z blocks.
These blocks should be colour/Z/stencil buffer caches.

If the ROPs still work as a unit, it could be that one SIMD bank is running fill-rate-limited code while the other is running ALU- or texture-limited work, with the ROPs concentrating on one side of the chip over the other.
This is already the case in R700 and earlier GPUs. There's no way to make all clusters produce complete pixels at an equal rate.

Also bear in mind that in R700 there are 10 ALU cycles per RBE cycle. In R800 the ratio is the same.

Jawed
 
It's funny how the old 1 byte per flop (per second) bandwidth rule is now more a 0.5 bit per flop rule ;)
Well prunedtree's MM re-balances that somewhat...

Also I expect that "rule" was based on ADD/MUL capable ALUs, not on MAD capable ALUs.
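
As a back-of-the-envelope check, using ballpark Cypress numbers that are only assumptions here (~2.72 TFLOPS single precision, ~153.6 GB/s):

#include <cstdio>

// Rough arithmetic behind the "0.5 bit per flop" observation, using assumed
// ballpark figures rather than confirmed specs.
int main() {
    const double flops       = 2.72e12;   // assumed peak single-precision FLOPS
    const double bytesPerSec = 153.6e9;   // assumed memory bandwidth
    std::printf("%.2f bits per flop\n", bytesPerSec * 8.0 / flops);   // prints ~0.45
    return 0;
}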

Jawed
 
My problem with this is how the two rasterisers get their work. Until you've done a coarse rasterisation (identifying a screen-space tile, the first stage of hierarchical rasterisation), you don't know which rasteriser to use. Whoops: coarse rasterisation is itself the first step in rasterisation, so it's a chicken-and-egg problem.
(Coarse) rasterization probably doesn't happen in a single clock cycle, so that doesn't sound like a big issue per se. Evenly distributing the work among the two rasterizers is technically more challenging; I wonder if they have simply split the screen into two zones or if they use some tiling (which would obviously be better...).
 
For architectures like RV870, MCM designs make no sense at all imo (regarding the performance-scaling / die-scaling ratio...).
You can connect them with a lot more wires on a flip-chip substrate, and faster too. Of course, as long as you use AFR, that's not really an important issue.
 
Which begs the question, what's the tessellator doing? Generating a stream of interpolation requests, to be performed by a special tessellation kernel?
Dunno what subset of the work it's doing, but I don't see it existing as a fixed-function unit for many generations to come :)
 
(Coarse) rasterization probably doesn't happen in a single clock cycle, so that doesn't sound like a big issue per se. Evenly distributing the work among the two rasterizers is technically more challenging; I wonder if they have simply split the screen into two zones or if they use some tiling (which would obviously be better...).
Why wouldn't they use tiling? Two multiplications and truncations per vertex and 3 integer comparisons per triangle to determine whether the triangle is entirely inside a tile isn't exactly a lot of work.

PS. On reflection, doing it for split screen is exactly the same amount of work; they are just very large, very poorly load-balancing tiles.
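
A sketch of that per-triangle test (tile size and layout are assumptions, and packing x/y into a single tile ID would get it down to the handful of comparisons mentioned):

// Sketch of the test described above: a multiply and truncation per vertex
// coordinate to get tile indices, then integer comparisons to see whether
// all three vertices land in the same tile. The tile size is an assumption.
constexpr float kInvTileSize = 1.0f / 64.0f;

struct Vertex { float x, y; };

bool entirelyInsideOneTile(const Vertex v[3]) {
    int tx0 = int(v[0].x * kInvTileSize), ty0 = int(v[0].y * kInvTileSize);
    int tx1 = int(v[1].x * kInvTileSize), ty1 = int(v[1].y * kInvTileSize);
    int tx2 = int(v[2].x * kInvTileSize), ty2 = int(v[2].y * kInvTileSize);
    return tx0 == tx1 && tx1 == tx2 && ty0 == ty1 && ty1 == ty2;
}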
 
"Fast GDDR5 Link Retraining - Allows Voltage & Clock Switching on the Fly without Glitches"

Do they actually change the voltage too? That would be a welcome change. No cards I'm aware of do this with GDDR3, I think simply because those DRAM parts don't come with voltage/frequency lists in the datasheet (they have a fixed voltage and a maximum frequency there). It might not be possible to lower the voltage much even at a fairly low frequency, though surely at least the factory-overvolted parts could run at the default voltage at a lower frequency.
 
Haven't looked at the specs of the Volterra PWM controller for the GDDR5, but no problems with a programmable input. Don't know if it's dynamic.
 
Why wouldn't they use tiling? Two multiplications and truncations per vertex and 3 integer comparisons per triangle to determine whether the triangle is entirely inside a tile isn't exactly a lot of work.

PS. On reflection, doing it for split screen is exactly the same amount of work; they are just very large, very poorly load-balancing tiles.
Sure, I wasn't saying that tiling is harder than simply splitting the screen into two parts, but things are never simple; there might be interactions with parts of the architecture we know nothing about that would make one or the other method more feasible.

Given that this is a part meant to push highly tessellated meshes, it's nice to see that they have put a second rasterizer on board (and perhaps a second setup unit as well...), though I wonder if their fixed-function hardware is smart enough to skip unnecessary hops (i.e. hierarchy levels) and directly rasterize small triangles.
 
The co-issue, MUL and dependent-ADD, I don't really understand. Is that producing two resultants in one cycle, i.e. the result of the MUL and the result of the MUL+ADD?

If I had to guess, I'd say it means that operations like

a=b*c;
d=a+x;

can now be coalesced into a single FMA.
 
To me tessellation is not like vanilla linear interpolation. In lerp, you have two numbers and a parameter, and the result is a single number.

The tessellator produces a variable number of outputs depending on the parameter.
Which begs the question, what's the tessellator doing? Generating a stream of interpolation requests, to be performed by a special tessellation kernel?

Jawed
I assume you mean here that the tessellator generates a bunch of equivalent linear interpolation parameters which are then actually interpolated by the shader cores. If so, then it seems like a more efficient use of hardware to me. After all, the linear interpolators can possibly be reused for texture filtering as well. ;)
 
Jawed said:
A simple solution is that the two exchange coarse rasterisation results with each other, when they discover that they have tiles which they don't own screenspace for.
It's a little more complicated than that, because you need to also keep the triangles in the correct order. Z buffering by itself isn't sufficient for correctness.
 
If I had to guess, I'd say it means that operations like

a=b*c;
d=a+x;

can now be coalesced into a single FMA.
That's basically what I was saying, but not in the form of a single FMA; after all, fused means that "a" is lost. What I imagine happens is that "a" is copied to another lane to use that lane's store unit. Meanwhile the original lane's ADD after the MUL takes a and x to make d.
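
In scalar terms, the speculated behaviour amounts to one issue producing both values (how the product gets a second write port is the lane-copy guess above); a trivial sketch, with names made up for illustration:

// Sketch of the speculated MUL plus dependent-ADD co-issue: one issue yields
// both the product and the product-plus-addend. The routing of the extra
// result (the lane-copy guess above) is pure speculation.
struct MulAddResult { float product; float sum; };

MulAddResult mulDependentAdd(float b, float c, float x) {
    float a = b * c;     // result of the MUL, still visible to later code
    float d = a + x;     // dependent ADD consuming the fresh product
    return { a, d };     // both results available after the single co-issue
}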

This is similar to the way that dot product has to move data across lanes.

I don't understand why this wasn't already in the ALUs, to be honest; I don't see what's novel, as the wiring was already there for DP4 as far as I can tell. "More flexible dot products" implies to me that a DP3 will no longer waste a lane - currently DP3 is implemented as a DP4 with the fourth lane idle. I've never investigated what happens with the DP2 instructions (I think there's a couple).

Jawed
 
To me tessellation is not like vanilla linear interpolation. In lerp, you have two numbers and a parameter, and the result is a single number.

The tessellator produces a variable number of outputs depending on the parameter.
Yes, the interpolation of a vertex attribute (position, colour, normal etc.) simply has more input data. The math is of the same form. So the interpolation to create new vertices takes the form of vertex attribute interpolation, but only position is interpolated by the TS stage.

So, the tessellator first converts the tessellation factor (e.g. 4.7) into a list of interpolation weights (a weight per new vertex, i.e. 2 weights for 2 new vertices in this case). It then assembles vertex coordinates and weights for an interpolation kernel to consume. Weights are more complicated, though, when tessellation is non-uniform.
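
A toy sketch of that first step, turning an edge tessellation factor into per-new-vertex weights; uniform spacing is assumed here, and as noted the real fractional/non-uniform case is more involved:

#include <algorithm>
#include <cmath>
#include <vector>

// Toy version of the described first step: convert an edge tessellation
// factor into a list of interpolation weights, one per new vertex along the
// edge. Uniform spacing and simple rounding are assumptions; fractional
// tessellation distributes the weights differently.
std::vector<float> edgeWeights(float tessFactor) {
    int segments = std::max(1, int(std::ceil(tessFactor)));
    std::vector<float> weights;
    for (int i = 1; i < segments; ++i)
        weights.push_back(float(i) / float(segments));   // weight for each new vertex
    return weights;   // an interpolation kernel then lerps positions with these
}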

I think we're agreeing, but you're seeing interpolation from the texture filtering point of view, while I'm seeing it from the vertex attribute rasterisation point of view.

I assume you mean here that the tessellator generates a bunch of equivalent linear interpolation parameters which are then actually interpolated by the shader cores. If so, then it seems like a more efficient use of hardware to me. After all, the linear interpolators can possibly be reused for texture filtering as well. ;)
No, not for texture filtering, but for all vertex attributes.

So, of course, this is exactly what NVidia does with interpolation, with an ALU (multifunction interpolator) that produces interpolated vertex attributes. Does ATI need a dedicated "interpolate" instruction, or are there enough ALUs to just do it as normal math over several cycles? Since ALU:TEX hasn't changed with this architecture, it would seem likely ATI has a dedicated instruction.


Jawed
 
It's a little more complicated than that, because you need to also keep the triangles in the correct order. Z buffering by itself isn't sufficient for correctness.
I'm still not sure about this, but I think RBEs sort pixels by primitive ID in existing GPUs. I've never really got to grips with the mechanics of this, as the numbers appear to be treacherous. No idea if the per-cluster load-balancing of fragment shading heavily biases in favour of primitive ID (I'm assuming that there's only one primitive per batch of <=64 fragments), nor whether this is commoned across all clusters. If both techniques are used then that would make RBE-sorting less arduous.

Jawed
 