AMD: R8xx Speculation

Poll: How soon will Nvidia respond with GT300 to the upcoming ATI RV870 GPU lineup?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

Total voters: 155. Poll closed.
Simply preceding rasterisation with a tile-rasteriser would solve a whole load of these kinds of problems in the dual-rasteriser of RV870. A small amount of queuing on inputs and outputs should be enough to always keep both rasterisers running at full speed, assuming "average" primitives aren't all hitting one tile, or one rasteriser's tiles.
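As a rough illustration of that idea, here's a toy Python sketch of a coarse tile pass feeding two rasterisers through small queues. The tile size, queue depth and checkerboard ownership rule are my own assumptions, not anything AMD has described.

Code:
from collections import deque

TILE = 32          # assumed screen-tile size in pixels
QUEUE_DEPTH = 8    # assumed per-rasteriser input FIFO depth

def owner(tile_x, tile_y):
    # Assumed ownership rule: a checkerboard split, so neighbouring tiles
    # alternate between the two rasterisers and "average" primitives spread
    # across a few adjacent tiles feed both of them.
    return (tile_x + tile_y) & 1

def tiles_touched(bbox):
    x0, y0, x1, y1 = bbox
    for ty in range(y0 // TILE, y1 // TILE + 1):
        for tx in range(x0 // TILE, x1 // TILE + 1):
            yield tx, ty

queues = [deque(maxlen=QUEUE_DEPTH), deque(maxlen=QUEUE_DEPTH)]

def tile_rasterise(prim_id, bbox):
    # Coarse pass: decide which rasteriser(s) need to see this primitive.
    # A real design would stall on a full FIFO; deque(maxlen=...) just drops
    # the oldest entry, which is good enough for a sketch.
    for r in {owner(tx, ty) for tx, ty in tiles_touched(bbox)}:
        queues[r].append((prim_id, bbox))

tile_rasterise(0, (10, 10, 20, 20))    # small triangle: one rasteriser
tile_rasterise(1, (0, 0, 100, 100))    # large triangle: both rasterisers
print(len(queues[0]), len(queues[1]))  # -> 2 1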

I suppose there could be a tile-rasterizer block that AMD chose not to show in the diagram.

Well, the rule of thumb is each doubling of cache produces 10% performance gain, and the RBEs already have colour/Z/stencil caches. Trouble is, this has absolutely no visibility outside of the IHVs. Also, if it was that easy (memory is cheap) wouldn't it already be in place?
The evolutionary pressures were different in the past, and devoting SRAM to merely holding ROP tiles would not have been the best use, as bandwidth scaling was still ramping and SRAM is not the most dense form of memory.
EDRAM at finer geometries in an era where GPUs become strangled by bandwidth might tip the scales if designers want to defer abandoning the forward renderers for a generation or so.
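For what it's worth, the rule of thumb compounds like this; the 10%-per-doubling figure is just the one quoted above, and the cache sizes in the example are made up.

Code:
from math import log2

def expected_gain(old_kb, new_kb, per_doubling=0.10):
    # Compound the quoted 10%-per-doubling rule of thumb.
    return (1.0 + per_doubling) ** log2(new_kb / old_kb) - 1.0

# e.g. quadrupling a colour/Z cache would only buy ~21% by this rule
print(f"{expected_gain(32, 128):.1%}")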


The PrimSets are stored off chip until their time comes to be rasterised. What Larrabee lacks is essentially the ability to connect multiple chips and pool their resources - otherwise the binning scheme seems fine to ride on top of such infrastructure.

Bear in mind that tiles are allocated to cores dynamically - there's no locality of bins/tiles for cores. So there'd be no particular locality in a multi-chip scheme.
There is a locality to off-chip memory with a NUMA setup.
There is likely to be a significant bandwidth and latency difference between local memory and the links between the chips.
I doubt Intel can afford a ring-bus-sized QPI link between two chips to provide the freedom of movement that the ring offers within one die.
 
How about this: consider a GT400 (say) that, instead of bulking up on registers, goes cache-heavy to store context (like LRB), but with a micro-RBE attached per tile. The RBE talks only to the on-chip tile, yet is more area/power efficient than doing the RBE's work in the ALUs. It could be configured by the driver at shader compile time.
A key part of Larrabee's scheme is the binning. Without binning you'll be constantly swapping tiles into tile cache, which eats bandwidth, which is where GPUs are now.

Jawed
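To put a toy number on the "constantly swapping tiles" point, here's a little Python model counting tile-cache refills for fragments arriving in submission order versus the same fragments visited bin by bin. The cache capacity and fragment stream are invented for illustration.

Code:
import random
from collections import OrderedDict

TILE_CACHE_ENTRIES = 4   # assumed on-chip tile-cache capacity (made up)

def tile_refills(fragment_tiles, capacity=TILE_CACHE_ENTRIES):
    # Count how often a tile has to be (re)loaded into an LRU tile cache.
    cache, refills = OrderedDict(), 0
    for t in fragment_tiles:
        if t in cache:
            cache.move_to_end(t)
        else:
            refills += 1
            cache[t] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict least-recently-used tile
    return refills

random.seed(1)
# Immediate mode: fragments arrive in submission order, touching tiles all over the screen.
immediate = [random.randrange(64) for _ in range(10_000)]
# Binned: the same fragments, but all work for a tile is done before moving on.
binned = sorted(immediate)

print("immediate-mode refills:", tile_refills(immediate))
print("binned refills:        ", tile_refills(binned))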
 
I can't tell, the R6xx_3D_Registers document is fairly opaque. See if you can work it out:
As far as I can see they don't actually recommend turning off Hi-Z ...

The DB needs hints from the driver as to when to enable certain performance optimizations such as EarlyZ, Re-Z, HiZ and HiS. Most of these are in DB_SHADER_CONTROL, which has descriptions in the register spec. In general, Late Z is best for very short shaders, Early Z for middle-length shaders, and Re-Z is only good for very long shaders. The Z_ORDER field should be derived from the required shader setup.

You can have late Z checking with Hi-Z (semantically Hi-Z is early Z-checking, but technically it isn't).
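If I read that right, the driver-side choice might look something like the toy heuristic below. The instruction-count thresholds and the returned labels are invented for illustration; they are not the real Z_ORDER encodings.

Code:
def pick_z_order(alu_count, shader_writes_depth, uses_kill_or_alpha_test):
    # Toy driver heuristic following the quoted guidance. Thresholds and
    # labels are invented; they are not the actual register field values.
    if shader_writes_depth:
        return "LATE_Z"                 # shader-exported depth can't be tested early
    if uses_kill_or_alpha_test:
        return "LATE_Z_WITH_HIZ"        # coverage unknown until the shader runs,
                                        # but Hi-Z rejection still applies
    if alu_count < 8:
        return "LATE_Z"                 # very short shader: the early test buys little
    if alu_count < 200:
        return "EARLY_Z"
    return "RE_Z"                       # very long shader: test before and after

print(pick_z_order(300, False, False))  # -> RE_Z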
 
As far as I can see they don't actually recommend turning off Hi-Z ...

You can have late Z checking with Hi-Z (semantically Hi-Z is early Z-checking, but technically it isn't).
What does late-Z with hierarchical-Z mean? Update hierarchical-Z, but don't use it for culling?

I presume your technical point about hierarchical-Z is merely relating to the lack of precision, meaning that it has to be conservative.

Jawed
 
There is a locality to off-chip memory with a NUMA setup.
There is likely to be a significant bandwidth and latency difference between local memory and the links between the chips.
I doubt Intel can afford a ring-bus-sized QPI link between two chips to provide the freedom of movement that the ring offers within one die.
Generating and reading bins shouldn't suffer at the hands of NUMA - it's lightweight, isn't it?

Jawed
 
What does late-Z with hierarchical-Z mean? Update hierarchical-Z, but don't use it for culling?
You can almost always do early rejection, but not always early acceptance. Thus, HiZ with late Z makes sense. Think about alpha test or tex kill.
I presume your technical point about hierarchical-Z is merely relating to the lack of precision, meaning that it has to be conservative.
Being conservative is good for many reasons.
 
What does late-Z with hierarchical-Z mean? Update hierarchical-Z, but don't use it for culling?
Early/late Z is the normal per-pixel Z check. Hierarchical Z is the conservative per-tile Z check, and you can do that and still do a late per-pixel Z check. Since the hierarchical Z check never goes to memory (i.e. it has negligible latency, relatively speaking), it would be kind of weird to disable it; the extra storage required to buffer the tris while it's being done doesn't really seem relevant in the big scheme of things.

PS. I guess somewhere in there there also has to be a scoreboard to block early-Z shading of quads still owned by a running late-Z shader; this is probably done inside the hierarchical Z unit at tile resolution.
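A toy Python model of that split, assuming one conservative max-Z value per tile (the tile size and representation are my assumptions): the per-tile check can only reject, while the exact per-pixel test runs late, after kill/alpha test has been resolved.

Code:
TILE = 8
hiz_max = {}    # (tile_x, tile_y) -> conservative farthest depth for that tile
depth = {}      # (x, y) -> exact per-pixel depth

def hiz_reject(x, y, z):
    # Early, conservative check: can only reject, never accept.
    tile_max = hiz_max.get((x // TILE, y // TILE), 1.0)
    return z > tile_max          # definitely behind everything in the tile

def late_z_test_and_write(x, y, z):
    # Exact per-pixel test; here it runs late, after kill/alpha test is resolved.
    if z >= depth.get((x, y), 1.0):
        return False
    depth[(x, y)] = z
    # Recompute the per-tile bound from the pixel depths. Real hardware does this
    # more cheaply, and letting it go stale (too far) is still safe for rejection.
    tx, ty = x // TILE, y // TILE
    hiz_max[(tx, ty)] = max(depth.get((tx * TILE + i, ty * TILE + j), 1.0)
                            for j in range(TILE) for i in range(TILE))
    return True

# Fill one tile with near geometry; a later far fragment is rejected before shading,
# while a nearer one falls through to the late per-pixel test.
for j in range(TILE):
    for i in range(TILE):
        late_z_test_and_write(i, j, 0.25)
print(hiz_reject(3, 3, 0.9))   # True: skip the shader entirely
print(hiz_reject(3, 3, 0.1))   # False: must run the shader, then late Z decides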
 
A key part of Larrabee's scheme is the binning. Without binning you'll be constantly swapping tiles into tile cache, which eats bandwidth, which is where GPUs are now.

Jawed

Good point. I think the rasterizer (SW or HW) would be the best place to sort fragments into bins: stream out the fragments being generated by the rasterizer into global memory and then sort them, or maybe sort them on chip. At least the fragments generated in a single draw call should have a large amount of spatial locality.
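Something like this sketch, then, with the tile size and fragment layout just assumptions for illustration: bucket the streamed-out fragments by the tile they fall in, so each bin can later be shaded against an on-chip tile.

Code:
from collections import defaultdict

TILE = 16   # assumed bin/tile size in pixels

def bin_fragments(fragments):
    # Bucket a streamed-out fragment list by the screen tile it falls in,
    # so each bin can later be shaded against an on-chip tile.
    bins = defaultdict(list)
    for x, y, payload in fragments:
        bins[(x // TILE, y // TILE)].append((x, y, payload))
    return bins

# Fragments from one draw call tend to cluster, so most land in a few bins.
stream = [(x, y, "attribs") for x in range(30, 40) for y in range(5, 12)]
bins = bin_fragments(stream)
print(sorted(bins), [len(v) for k, v in sorted(bins.items())])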
 
RV770 has 40 vec4 32-bit TUs. L1 bandwidth into the ALUs is 480GB/s, 48GB/s per cluster, 64 bytes per clock per TU.
It has 10*16 FP32 samplers, but only 10*4 address units and filters.
AMD gives a figure of 120 GTex/s for the 32-bit texture fetch rate - there's no mention of whether this holds for FP as well as INT textures.

Additionally, I see strong indications that, apart from texture fetch, other operations are only full-speed for 32-bit INT.
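For reference, those numbers tie together if you assume the 750 MHz core clock of the HD 4870; the unit counts are the ones quoted above.

Code:
# Sanity-checking how the quoted RV770 numbers fit together.
clk_ghz   = 0.75     # assumed HD 4870 core clock
clusters  = 10
fetch_per_cluster  = 16   # 32-bit samplers per cluster (from the post above)
filter_per_cluster = 4    # address/filter units per cluster

fetch_rate = clusters * fetch_per_cluster * clk_ghz         # G fetches / s
l1_bw      = fetch_rate * 4                                 # 4 bytes per 32-bit fetch
print(fetch_rate, "Gfetch/s ->", l1_bw, "GB/s L1")          # 120.0 -> 480.0
print(fetch_per_cluster * 4, "bytes/clock per cluster,",    # 64 bytes/clock
      fetch_per_cluster * 4 * clk_ghz, "GB/s per cluster")  # 48.0 GB/s
print(clusters * filter_per_cluster * clk_ghz, "GTex/s bilinear-filtered")  # 30.0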
 
The first shop to offer the HD5870 for pre-order in The Netherlands charges a mere 319.99 Euro. This is quite promising.

Edit: the product page has been removed.
 
Generating and reading bins shouldn't suffer at the hands of NUMA - it's lightweight, isn't it?

Some slides by Tom Forsyth gave figures of about 10-20% of the work being in the front-end, depending on what is being done.

This amount of work can remain relatively unchanged if the bin setup occurs on one chip and the other chip or chips wait for work to be farmed out to them.
Creating a producer/consumer relationship between the chips will have some additive effect, as cores on the other chips will have to wait until various bins are set up by the main core and then stored off to their local memory pool.

A bigger concern that occurred to me is that I'm not sure whether the bins themselves are a worse bandwidth burden than just sending the command and vertex streams to both chips.
If the front-end is at the lower 10% end of that range, the bins may be more compact, since at that level the front-end only does triangle placement and defers other vertex work and tessellation.

If at the higher end, it means other vertex work and tessellation were done in the front-end, and then with sufficient amplification the bins may pose a bandwidth bottleneck over the interconnect.
I'm pretty sure the interconnect can handle the raw command stream, as the worst this can be is the maximum the PCI-E bus can carry.
A large number of tessellation-amplified bins could potentially be a problem.
Even without tessellation, I am curious whether a bin is more compact than the commands that spawned it.

The dual front-end case would also overlap more setup latency, as opposed to the transfer latency that is injected by a single setup process sending to all chips.
There is potentially less synchronization necessary here as well.
The output phase would need some amount of synchronization (not sure if this is more than would be needed in any other multi-chip case), but the rest of the process may be allowed to flow pretty independently.
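A back-of-envelope sketch on the "is a bin more compact" question, with every size below invented for illustration since the real bin format isn't public; the answer flips depending on whether a bin record replicates full post-transform vertices per tile or just references a shared pool.

Code:
# All sizes are assumptions picked for illustration, not a known Larrabee format.
tris             = 1_000_000     # triangles in the frame
index_bytes      = 3 * 4         # indexed draw: three 32-bit indices per triangle
in_vertex_bytes  = 32            # ~8 floats of input attributes per vertex
verts_per_tri    = 0.7           # indexed meshes reuse vertices heavily
post_xform_bytes = 64            # ~16 floats of post-transform attributes per vertex
bin_overlap      = 1.3           # average tiles touched per triangle record

cmd_stream   = tris * (index_bytes + verts_per_tri * in_vertex_bytes) / 1e6
bins_fat     = tris * bin_overlap * 3 * post_xform_bytes / 1e6   # full verts replicated per tile
bins_indexed = (tris * bin_overlap * index_bytes                 # per-tile refs into a shared
                + tris * verts_per_tri * post_xform_bytes) / 1e6 # post-transform vertex pool

print(f"command+vertex stream ~{cmd_stream:.0f} MB")
print(f"bins, fat records     ~{bins_fat:.0f} MB")
print(f"bins, indexed records ~{bins_indexed:.0f} MB")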
 
I added up all of CJ's comparable benchmarks of the 5870 vs the GTX285 across 11 games and came up with the 5870 being 54% faster on average, from a high of 125% faster in Stormrise to a low of 23% faster in Crysis Warhead.

However, I then tossed the Stormrise results, since it's DX10.1, not a usual benchmark game, got mediocre reviews and low sales, and was the biggest outlier. Across the remaining 10 games, the 5870 averaged 44% faster.

The Crysis Warhead results are a little surprising to me, since that was one game (at least the original, non-Warhead Crysis) where the 4890 excelled.

If anything the 5870 typically increased its lead the higher the resolution/AA/AF settings were turned up, so I don't see a bandwidth-limiting problem, at least versus the current competition.
 
Yeah, it's only $70 more than the US retail price :p

Although I am much more interested in the Radeon HD 5870 SIX with Mini-DisplayPort :)

Doesn't EU usually get a 1:1 Euro-dollar conversion? Of course if so the implications of a 319 Euro 5870 are obvious.
 
Yeah, it's only $70 more than the US retail price :p

Not really. This includes VAT (19%), and as a rule you can say that the USD price translates directly to the EUR price. To compare, the cheapest GTX285 costs EUR 262, and the cheapest GTX295 costs EUR 375. The cheapest HD4890 costs EUR 153, and the cheapest HD4870X2 costs EUR 240.
 
Doesn't EU usually get a 1:1 Euro-dollar conversion? Of course if so the implications of a 319 Euro 5870 are obvious.

Yeah, I wasn't complaining about the price. That was to be expected ;)

But usually they charge a bit more for preorders.

Not really. This includes VAT (19%), and as a rule you can say that the USD price translates directly to the EUR price. To compare, the cheapest GTX285 costs EUR 262, and the cheapest GTX295 costs EUR 375. The cheapest HD4890 costs EUR 153, and the cheapest HD4870X2 costs EUR 240.

It could be much worse, I agree. We pay 25% VAT here in Denmark, so we are used to high prices.
 
AMD already has a license for Rambus stuff.

http://www.rambus.com/us/news/press_releases/2006/060103.html

(Though it was mostly to avoid legal issues with GDDR-type memory, which Rambus thinks it owns :mad:, but it does mention XDR.)

Does the fact that Rambus has patents on octal and hex data rates mean no one else could do it, or that it could never be an open standard like GDDR? (For example, could GDDR6 be made to transfer 8x per clock - not technically, but legally?)
 