AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

  Total voters: 155
  Poll closed.
So Early-Z is querying RBE-Z before the shader starts processing the pixel? Yuck, long long long latency.
No, it doesn't work that way. Early-Z means do the full RBE Z before the pixel shader.
Isn't that exactly what I said? :???:

This way you reduce shader workload. Late-Z means do the full RBE Z after the shader. These are independent of HiZ and have been around since at least R300.
Hierarchical-Z is a coarse-grained (e.g. quad-level) low-precision (e.g. 12-bit) conservative rejection test and any kind of RBE Z is an acceptance test.

So for a short shader (typical of DX8, I guess) Late-Z is recommended: the conservative hierarchical-Z rejection can still be made before shading, but the full-precision acceptance test is only performed afterwards.
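As a minimal sketch of that ordering (hypothetical Python, not ATI's actual hardware logic; the fragment fields, the tile bookkeeping and the less-than depth test are all assumptions), both modes get the cheap HiZ rejection and only differ in which side of the shader the full-precision RBE-Z test sits:

```python
# Hypothetical sketch of Early-Z vs Late-Z ordering; names are illustrative.

def hiz_reject(tile_max_z, frag_z):
    """Hierarchical-Z: coarse, conservative rejection. In hardware this works
    on quads/tiles with low-precision Z; it can only say 'definitely hidden'
    or 'maybe visible', never accept."""
    return frag_z > tile_max_z

def rbe_z_accept(depth_buffer, x, y, z):
    """Full RBE Z: the full-precision acceptance test against the real depth
    buffer (assuming a 'less-than' depth test)."""
    if z < depth_buffer.get((x, y), float("inf")):
        depth_buffer[(x, y)] = z
        return True
    return False

def process_fragment(frag, depth_buffer, tile_max_z, early_z, shade):
    if hiz_reject(tile_max_z, frag["z"]):        # happens in both modes
        return None
    if early_z:
        # Early-Z: full RBE Z *before* the pixel shader, so hidden fragments
        # never consume shader cycles - a win when the shader is expensive.
        if not rbe_z_accept(depth_buffer, frag["x"], frag["y"], frag["z"]):
            return None
        return shade(frag)
    # Late-Z: shade first, then the full RBE Z acceptance test.
    colour = shade(frag)
    return colour if rbe_z_accept(depth_buffer, frag["x"], frag["y"], frag["z"]) else None
```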

Jawed
 
I noticed something that the critics of the dual-slot cooling solutions on the Radeon 5000 family members pictured thus far have failed to notice.

You can't fit 4 display outputs on a single slot backplate and exhaust the heat out of the case.

But why let reason get in the way of a good bashing session? ;)
 
I honestly gave up worrying over single vs. dual slot a few GPUs ago; I figure any high-end card from any maker will be dual-slot from now on.
 
Tom Forsyth had a presentation at SIGGRAPH 2008.
Found it, I keep forgetting that :cry:

Wouldn't allocating triangles to a bin require the rasterization portion of the workload as well?
Forsyth's slides apparently included this in the front-end estimate.
It's tile-level rasterisation only, so it's very cheap in terms of rasterisation work, but it requires that all vertex shading that affects the position attribute has been computed.

Slide 26 says front-end is ~10% of the entire compute effort.

The actual cost I see is the creation of a bin and then having any core pick up a bin for processing. Both would be more expensive to do.

Forsyth's slides also indicated that a bin contains tris, shaded verts, and rasterized fragments.
I'm not sure if the fragments would be a concern for the distribution phase that might be passing over the interconnect.
The amount of data in a bin varies; you've described a heavy-weight bin. A flimsy bin containing nothing more than triangle IDs would be cheap in a multi-chip solution. This would make the back-end more compute-heavy, which would hide some of the latency associated with NUMA. This trade-off between light and heavy is seen in current games, where developers elect either to compute all attributes during vertex shading or to leave some of them (attributes derived from other attributes, normally) to be computed during pixel shading - you can view this as a form of compression of the per-vertex data.
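To put rough numbers on that light/heavy trade-off, here's a back-of-the-envelope sketch; every size below is my own assumption for illustration, not a figure from Forsyth's slides:

```python
# Back-of-the-envelope only: per-triangle bin payload for a "flimsy" bin
# (triangle IDs only, attributes re-fetched/re-shaded by the back end) versus
# a heavy-weight bin (shaded verts + fragments shipped along). All sizes are
# assumptions for illustration.

TRI_ID_BYTES     = 4     # flimsy bin: just an index into the triangle stream
ATTRS_PER_VERTEX = 8     # assumed: position plus a handful of interpolants
BYTES_PER_ATTR   = 16    # assumed: 4 x fp32 per attribute
VERTS_PER_TRI    = 3
FRAGS_PER_TRI    = 20    # assumed average covered pixels per binned triangle
BYTES_PER_FRAG   = 8     # assumed: packed position + Z

flimsy_bytes_per_tri = TRI_ID_BYTES
heavy_bytes_per_tri  = (VERTS_PER_TRI * ATTRS_PER_VERTEX * BYTES_PER_ATTR
                        + FRAGS_PER_TRI * BYTES_PER_FRAG)

print(f"flimsy bin: {flimsy_bytes_per_tri} B/tri")
print(f"heavy bin:  {heavy_bytes_per_tri} B/tri "
      f"({heavy_bytes_per_tri / flimsy_bytes_per_tri:.0f}x more link traffic)")
```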

Since the memory subsystem should maintain a coherent image of memory across the chips, there is no algorithmic reason why it would be single-chip.
The costs of this work have been evaluated as being sufficiently low only for a single-chip scenario, however.
Consumption of vertex data is basically a streaming problem, i.e. quite latency-tolerant, if you have some decent buffers. Vertex data, due to the connectivity of triangles, strips, etc., never fits neatly into cache lines, so the best approach is just to read big-ish chunks rather than individual vertices/triangles. So two chips (conventional or Larrabee) consuming from a common stream are going to be slightly more wasteful in this regard - this is similar to the wastage that occurs with different vertex orderings in the post-transform vertex cache (PTVC).
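A sketch of the "read big-ish chunks" idea, with made-up names and sizes: the reader pulls the vertex stream in large blocks and serves individual (possibly shared) vertices out of that local buffer, rather than issuing one small read per vertex:

```python
# Illustrative chunked vertex-stream reader (names/sizes are assumptions).
# Because indexed triangles reuse vertices and never align to cache lines,
# it's cheaper to stream large blocks and serve vertices from a local buffer
# than to issue one small memory read per vertex.

class ChunkedVertexReader:
    def __init__(self, vertex_buffer, stride, chunk_verts=256):
        self.vb = vertex_buffer          # bytes-like vertex buffer
        self.stride = stride             # bytes per vertex
        self.chunk_verts = chunk_verts   # big-ish read granularity
        self.base = None                 # first vertex index held in the chunk
        self.chunk = b""

    def _fetch_chunk(self, first_vertex):
        start = first_vertex * self.stride
        end = start + self.chunk_verts * self.stride
        self.chunk = self.vb[start:end]  # one large, latency-tolerant read
        self.base = first_vertex

    def vertex(self, index):
        # Refill only when the requested vertex falls outside the buffered chunk.
        if self.base is None or not (self.base <= index < self.base + self.chunk_verts):
            self._fetch_chunk(index)
        off = (index - self.base) * self.stride
        return self.chunk[off:off + self.stride]
```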

But Larrabee can run multiple render states in parallel. So most trivially you can have the two chips working independently. Whether two successive render states are working with the same vertex inputs (e.g. shadow buffer passes, one per light?) or whether they're independent vertex inputs, the wastage is down purely to NUMA effects.
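As a toy illustration of that trivial split (all names here are hypothetical), each chip takes a whole render state and writes only its own target, so the shared vertex stream is the main cross-chip read:

```python
# Hypothetical sketch: two independent render states (e.g. one shadow-buffer
# pass per light) dispatched to two chips, each writing only its own target.
from multiprocessing import Pool

RENDER_STATES = [
    {"pass": "shadow_light_0", "target": "shadow_map_0"},
    {"pass": "shadow_light_1", "target": "shadow_map_1"},
]

def run_pass(args):
    chip_id, state = args
    # ... bin, shade and resolve this pass entirely on chip_id (elided) ...
    return f"chip {chip_id} rendered {state['pass']} into {state['target']}"

if __name__ == "__main__":
    with Pool(2) as chips:                      # stand-in for two chips
        for line in chips.map(run_pass, enumerate(RENDER_STATES)):
            print(line)
```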

The flexibility of the software pipeline is the reason for Forsyth's estimate for front-end work being so wide.
It's 10% if attribute, vertex, and tessellation work is deferred to the back end. It's variable because those three can be done in either the front or the back end.
Bin size would be the most amenable for sending to another chip if this work is deferred, but back-end burden and bin spread would be worse.

If done in the front-end, bin size becomes much larger and more costly to send to a remote pool of cores, though the bins themselves would be much better-behaved.
Bin spread should fall if flimsy bins are used, since the tiles can then be larger.

Back-end burden would be perfectly spread across both chips. Sure, two chips won't achieve 100% scaling - we aren't expecting that. Even Intel's estimates/simulations for scaling with core count on a single chip aren't linear...

If the front-end is duplicated, we roughly double the computation required for PrimSet dispersal and front-end work, but with minimal increase in the synchronization or bandwidth burden on the interface. The developer would be much freer to decide where to put work between the front and back ends.
I don't understand how you get double.

The PrimSet distribution by one core is actually well-suited to the likely ring-bus configuration Larrabee will use.
I don't understand what you mean by PrimSet distribution. Each PrimSet can run independently on any core. The data each produces is a stream of bins. They consume vertex streams and, if they already exist, render target tiles.

Some scheduler, somewhere, must then assign bin sets to cores. This is not a heavy task. The back-end has to consume the bin and create/update the render-target tile; the scheduler isn't delivering bin data to the cores tasked with back-end work.
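A trivial sketch of why that scheduling is cheap (the queue, thread pool and descriptor format here are all illustrative, not Larrabee's actual mechanism): the front end publishes small bin descriptors, and whichever core picks one up does the heavy back-end work itself:

```python
# Hypothetical sketch: the scheduler only hands out *descriptors* of ready
# bins; the back-end core then reads the bin/tile data itself from memory.
import queue
import threading

ready_bins = queue.Queue()          # (tile_id, bin_address) descriptors only

def front_end(primset_id, tiles_touched):
    # ... bin this PrimSet's triangles (elided) ...
    for tile_id in tiles_touched:
        ready_bins.put((tile_id, f"bin:{primset_id}:{tile_id}"))  # tiny message

def back_end(core_id):
    while True:
        tile_id, bin_addr = ready_bins.get()    # cheap: a few words per bin
        if tile_id is None:
            break
        # The heavy work (fetching the bin, shading fragments, updating the
        # render-target tile) happens here, on whichever core grabbed it.
        print(f"core {core_id} processing tile {tile_id} from {bin_addr}")

workers = [threading.Thread(target=back_end, args=(i,)) for i in range(4)]
for w in workers:
    w.start()
front_end(primset_id=0, tiles_touched=range(8))
for _ in workers:
    ready_bins.put((None, None))    # sentinel to stop the workers
for w in workers:
    w.join()
```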

It's also the case that if a bin is set up and ready for back-end processing, a scheme that is not aware of multi-chip NUMA is going to have much more traffic over the interconnect--something that will not happen if the setup scheme has duplicate front-ends that specifically minimize inter-chip rendering traffic.
If these are flimsy bins then I don't see the issue. If these are bins with in-progress render target tiles, then that's a bit more costly. Clearly heavy-weight bins are going to be the most costly. There's zero reason to build a multi-chip non-NUMA-aware software pipeline - Intel clearly intends not to build a one-size-fits-all software pipeline. Though I'll happily agree that multi-chip is a low priority until single-chip is working really well, apart from anything else because it's harder.

Tessellation is about creating more triangles. At some level, amplifying the number of triangles and then turning them into a bandwidth+latency cost is a liability that any scheme that apportions work heedless of chip location will take on.

It would be functional so long as Intel keeps inter-chip coherence, but Larrabee's bandwidth savings would be mitigated if the chip link is saturated, even if the absolute GB/s consumption is lower.

I guess in theory Intel could massively overspecify the inter-chip connections, but that sounds expensive.
Overall, though, I would expect that a memory-bandwidth:link-bandwidth ratio of X would serve Larrabee better than traditional GPUs. You have a huge amount of programmer freedom with Larrabee to account for the vicissitudes of NUMA.

Jawed
 
So Early-Z is querying RBE-Z before the shader starts processing the pixel? Yuck, long long long latency.
It's a long long latency whichever side of the pixel shader you do it on though, so if the pixel shader is light (both in time and context) then doing it in front might make sense. Either way you are going to have to store data until the Z check is resolved.
The two kinds of shader shouldn't be able to overlap in their execution, because a state change, implying a pipeline flush, is required to switch between these two modes.
Since the pipeline is virtual it's not a huge deal though ... the shaders can simply start doing something else (or rendering to a different part of the screen which has nothing in the pipeline).
 
Yep, it is natural at this early stage of GDDR5.
This happened in the past also; for example, the 4870 had 4Gbps Qimonda ICs and ATI clocked them at 900MHz (instead of 1GHz).
I was talking about which ICs it is logical for ATI to buy from companies like Samsung and Hynix at this stage.
Which only intensifies the need for architectural improvements. For what it's worth, doubling the RBEs per unit bus width is doing that.
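For reference, the arithmetic behind that 4870 example: GDDR5 moves four bits per pin per command-clock cycle, so 900MHz on a 256-bit bus works out as follows (a quick sanity check, nothing more):

```python
# GDDR5 bandwidth arithmetic for the HD 4870 example quoted above.
bus_width_bits   = 256        # HD 4870 memory bus
command_clock_hz = 900e6      # 4 Gbps-rated ICs run at 900 MHz instead of 1 GHz
bits_per_clock   = 4          # GDDR5 transfers 4 bits per pin per command clock

gbps_per_pin = command_clock_hz * bits_per_clock / 1e9
bandwidth_gb = bus_width_bits * command_clock_hz * bits_per_clock / 8 / 1e9

print(f"{gbps_per_pin:.1f} Gbps/pin")   # 3.6 Gbps instead of the rated 4.0
print(f"{bandwidth_gb:.1f} GB/s")       # ~115.2 GB/s on a 256-bit bus
```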

Thanks, seems reasonable and seems to indicate that drivers haven't changed the balance between the two in a significant way.

Jawed
 
It's a long long latency whichever side of the pixel shader you do it on though, so if the pixel shader is light (both in time and context) then doing it in front might make sense. Either way you are going to have to store data until the Z check is resolved.
Except that for ATI cards the documentation is quite explicit that early-Z is not recommended for short shaders. I take this to imply that the buffers post-rasterisation/pre-interpolation are too small compared with the buffers post-pixel-shading. OpenGL guy hasn't given a reason, so far.

Since the pipeline is virtual it's not a huge deal though ... the shaders can simply start doing something else (or rendering to a different part of the screen which has nothing in the pipeline).
The pipeline is only virtual in this sense on Larrabee (anything else?). It's a real, single-pixel-shader-at-a-time pipeline on current ATI GPUs, excepting the virtualisation required to share the unified shaders amongst VS/GS/PS (and others).

Jawed
 
Except that for ATI cards the documentation is quite explicit that early-Z is not recommended for short shaders. I take this to imply that the buffers post-rasterisation/pre-interpolation are too small compared with the buffers post-pixel-shading. OpenGL guy hasn't given a reason, so far.
There are reasons to use late-Z in such cases; you can contact AMD developer relations for more info.
The pipeline is only virtual in this sense on Larrabee (anything else?). It's a real, single-pixel-shader-at-a-time pipeline on current ATI GPUs, excepting the virtualisation required to share the unified shaders amongst VS/GS/PS (and others).
You're certain of this? A close look at the reg specs may give more information.
 
XFX version is being bundled with Dirt 2.

http://www.xfxforce.com/en-us/Features/RadeonHD5870.aspx#1

[Attached: screenshots of the XFX Radeon HD 5870 product page]
 
HDMI used to use the same transmitters as single-link DVI, but when pixel counts and bit depths were increased with HDMI 1.3 they stayed with just one transmitter and raised its clock rate to 340MHz, thus diverging from the DVI specs.
Well, actually dual-link HDMI exists too (since HDMI 1.0), but it requires a different connector and I don't think a single piece of consumer electronics with such a connector exists :).
I'm wondering, though: does RV8xx (or the older parts, for that matter) support those increased clock rates for HDMI 1.3? Certainly some 30" monitors don't, making their HDMI input rather useless.
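For context on why those clock rates matter for a 30" panel, a quick sanity check assuming CVT reduced-blanking timings for 2560x1600@60Hz (the exact totals vary by monitor):

```python
# Rough pixel-clock check for a 30" 2560x1600 panel at 60 Hz, assuming
# CVT reduced-blanking totals (2720 x 1646); actual monitor timings vary.
h_total, v_total, refresh_hz = 2720, 1646, 60
pixel_clock_mhz = h_total * v_total * refresh_hz / 1e6

SINGLE_LINK_TMDS_MHZ = 165    # DVI single link / HDMI up to 1.2
HDMI_1_3_TMDS_MHZ    = 340    # HDMI 1.3 raised the single-transmitter limit

print(f"2560x1600@60 needs ~{pixel_clock_mhz:.0f} MHz")   # ~269 MHz
print("fits single-link DVI / old HDMI:", pixel_clock_mhz <= SINGLE_LINK_TMDS_MHZ)
print("fits HDMI 1.3 single link:      ", pixel_clock_mhz <= HDMI_1_3_TMDS_MHZ)
```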
Also, how do these active DP -> dual-link DVI converters work? Are they "true" DP devices, so the graphics card output works in DisplayPort mode and they just translate to dual-link DVI? Only Apple really seems to sell them in any quantity. In any case, how do things like HDCP work?
 
Also, how do these active DP -> dual-link DVI converters work? Are they "true" DP devices, so the graphics card output works in DisplayPort mode and they just translate to dual-link DVI? Only Apple really seems to sell them in any quantity. In any case, how do things like HDCP work?
Yes, they have DP receivers and transmitters for the type of output they are going to. Likewise, they will need their own HDCP keys.

Apple may be the only prominent vendor at the moment, but we are working with others and we expect more to become available.
 
Some independent benchmarks from one lucky XS member:

First, a verification pic: [attached photo DSC00205.JPG]


now some scores,

HD5870 stock - i7 965 stock (3.2GHz)

3DMark06 - 16xAF forced in CCC
22383 3DMarks
SM 2.0 Score 8704
SM 3.0 Score 10655
CPU Score 6282

No AF forced in CCC
22549 3DMarks

DMC 4 1920x1080 8xAA 16xAF
scene 1 162.78
scene 2 123.86
scene 3 221.67
scene 4 118.59

resident evil 5 1920x1080 max details 8xAA
97fps
 
XFX version is being bundled with Dirt 2.

I wonder if Dirt 2 will carry a redistributable Dx11 package?

Also, wouldn't it be really cool if ATI did a tessellation demo similar to the old Toy Store one they had way back when? That's still one of the best graphics demos I've ever seen.

Regards,
SB
 