AMD: R8xx Speculation

3dilettante · Sep 16, 2009

Jawed said:
Simply preceding rasterisation with a tile-rasteriser would solve a whole load of these kinds of problems in the dual-rasteriser of RV870. A small amount of queuing on inputs and outputs should be enough to always keep both rasterisers running at full speed, assuming "average" primitives aren't all hitting one tile, or one rasteriser's tiles.

I suppose there could be a tile-rasterizer block that AMD chose not to show in the diagram.

Well, the rule of thumb is each doubling of cache produces 10% performance gain, and the RBEs already have colour/Z/stencil caches. Trouble is, this has absolutely no visibility outside of the IHVs. Also, if it was that easy (memory is cheap) wouldn't it already be in place?

The evolutionary pressures were different in the past, and devoting SRAM to merely holding ROP tiles would not have been the best use, as bandwidth scaling was still ramping and SRAM is not the most dense form of memory.
EDRAM at finer geometries in an era where GPUs become strangled by bandwidth might tip the scales if designers want to defer abandoning the forward renderers for a generation or so.
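
The "10% per doubling" rule of thumb quoted above can be put into numbers. This is only a sketch: it assumes the gain compounds with each doubling, which the thread does not state, and the function name is made up for illustration.

```python
import math

def estimated_speedup(cache_ratio, gain_per_doubling=0.10):
    """Speedup predicted by the rule of thumb, assuming gains compound."""
    doublings = math.log2(cache_ratio)
    return (1.0 + gain_per_doubling) ** doublings

for ratio in (2, 4, 8, 16):
    print(f"{ratio:>2}x cache -> {estimated_speedup(ratio):.3f}x performance")
```

Under that reading, even a 16x larger cache buys less than a 1.5x speedup, which fits the point that SRAM spent purely on ROP tiles was a poor trade while bandwidth was still scaling.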

The PrimSets are stored off chip until their time comes to be rasterised. The lack in Larrabee is essentially the connecting of multiple chips and pooling of resources - the binning scheme seems fine, otherwise, to ride on top of such infrastructure.

Bear in mind that tiles are allocated to cores dynamically - there's no locality of bins/tiles for cores. So there'd be no particular locality in a multi-chip scheme.

There is a locality to off-chip memory with a NUMA setup.
There is likely to be a significant bandwidth and latency difference between local memory and the links between the chips.
I doubt Intel can afford a ring bus sized QPI link between two chips to afford the freedom of movement that it offers within one die.

Jawed · Sep 16, 2009

rpg.314 said:
How about this: consider a GT400 (say) that, instead of bulking up on registers, goes cache-heavy to store context (like LRB). But you attach, say, a micro RBE per tile. The RBE talks to the on-chip tile only, but is more area/power efficient than doing the RBE's work in ALUs. It can be configured by the driver at shader compile time.

A key part of Larrabee's scheme is the binning. Without binning you'll be constantly swapping tiles into tile cache, which eats bandwidth, which is where GPUs are now.

Jawed

MfA · Sep 16, 2009

Jawed said:
I can't tell, the R6xx_3D_Registers document is fairly opaque. See if you can work it out:

As far as I can see they don't actually recommend turning off Hi-Z ...

The DB needs hints from the driver to when to enable certain performance optimizations such as EarlyZ, Re-Z, HiZ
and HiS. Most of these are in DB_SHADER_CONTROL, which has descriptions in the register spec. In general
Late Z is best for very short shaders, Early Z for middle length shaders and ReZ is only good for very long shaders.
The Z_ORDER field should be derived from the required shader setup.

You can have late Z checking with Hi-Z (semantically Hi-Z is early Z-checking, but technically it isn't).

Jawed · Sep 17, 2009

MfA said:
As far as I can see they don't actually recommend turning off Hi-Z ...

You can have late Z checking with Hi-Z (semantically Hi-Z is early Z-checking, but technically it isn't).

What does late-Z with hierarchical-Z mean? Update hierarchical-Z, but don't use it for culling?

I presume your technical point about hierarchical-Z is merely relating to the lack of precision, meaning that it has to be conservative.

Jawed

Jawed · Sep 17, 2009

3dilettante said:
There is a locality to off-chip memory with a NUMA setup.
There is likely to be a significant bandwidth and latency difference between local memory and the links between the chips.
I doubt Intel can afford a ring bus sized QPI link between two chips to afford the freedom of movement that it offers within one die.

Generating and reading bins shouldn't suffer at the hands of NUMA - it's lightweight isn't it?

Jawed

OpenGL guy · Sep 17, 2009

Jawed said:
What does late-Z with hierarchical-Z mean? Update hierarchical-Z, but don't use it for culling?

You can almost always do early rejection, but not always early acceptance. Thus, HiZ with late Z makes sense. Think about alpha test or tex kill.

I presume your technical point about hierarchical-Z is merely relating to the lack of precision, meaning that it has to be conservative.

Being conservative is good for many reasons.

MfA · Sep 17, 2009

Jawed said:
What does late-Z with hierarchical-Z mean? Update hierarchical-Z, but don't use it for culling?

Early/Late-Z is the normal per-pixel Z-check. Hierarchical Z is the conservative per-tile Z-check; you can do this and still do a late per-pixel Z check. Since the hierarchical Z-check never goes to memory (i.e. has negligible latency, relatively speaking), it would be kind of weird to disable it. The extra storage required to buffer the tris while it's being done doesn't really seem relevant in the big scheme of things.
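
To make the distinction concrete, here is a minimal software sketch of a conservative per-tile reject followed by an exact late per-pixel test. It is not how the hardware is implemented: the tile size, the class and function names, and the single-tile setup are all assumptions for illustration, and it models a LESS depth test only.

```python
TILE = 4  # tile edge in pixels (hypothetical size)

class DepthBuffer:
    """Per-pixel depth plus a coarse per-tile maximum, for a LESS depth test."""

    def __init__(self, width, height, far=1.0):
        self.width, self.height = width, height
        self.z = [[far] * width for _ in range(height)]
        tiles_x = (width + TILE - 1) // TILE
        tiles_y = (height + TILE - 1) // TILE
        self.tile_max = [[far] * tiles_x for _ in range(tiles_y)]

    def hiz_reject(self, tx, ty, min_z):
        """Conservative: True only if no pixel in the tile could pass."""
        return min_z >= self.tile_max[ty][tx]

    def late_z(self, x, y, z):
        """Exact per-pixel test, run after the shader had its chance to kill."""
        if z >= self.z[y][x]:
            return False
        self.z[y][x] = z
        tx, ty = x // TILE, y // TILE
        xs = range(tx * TILE, min((tx + 1) * TILE, self.width))
        ys = range(ty * TILE, min((ty + 1) * TILE, self.height))
        self.tile_max[ty][tx] = max(self.z[j][i] for j in ys for i in xs)
        return True


def draw_tile(db, tx, ty, fragments, shader):
    """Shade (x, y, z) fragments of one tile; shader returns False to kill."""
    if not fragments or db.hiz_reject(tx, ty, min(z for _, _, z in fragments)):
        return 0  # whole batch culled before any shading
    written = 0
    for x, y, z in fragments:
        if shader(x, y) and db.late_z(x, y, z):
            written += 1
    return written


def full_tile(z):
    """All fragments of the single tile at a constant depth."""
    return [(x, y, z) for y in range(TILE) for x in range(TILE)]

db = DepthBuffer(TILE, TILE)
# Alpha-tested draw at z=0.5: the shader kills every odd column.
alpha_tested = draw_tile(db, 0, 0, full_tile(0.5), lambda x, y: x % 2 == 0)
# Opaque draw behind it at z=0.7: must still fill the killed columns.
opaque = draw_tile(db, 0, 0, full_tile(0.7), lambda x, y: True)
# Draw at z=0.9: now fully hidden, rejected per tile without shading.
hidden = draw_tile(db, 0, 0, full_tile(0.9), lambda x, y: True)
print(alpha_tested, opaque, hidden)
```

Had depth been written before the alpha-tested shader ran, the killed pixels would hold 0.5 and the second draw would be wrongly occluded. That is the case where early acceptance is unsafe while the conservative tile reject remains valid.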

PS. I guess somewhere in there there has to be a scoreboard too to block early Z shading of quads still owned by a running late Z shader, this is probably done inside the hierarchical Z unit at a tile resolution.

rpg.314 · Sep 17, 2009

Jawed said:
A key part of Larrabee's scheme is the binning. Without binning you'll be constantly swapping tiles into tile cache, which eats bandwidth, which is where GPUs are now.

Jawed

Good point. I think the rasterizer (sw or hw) would be the best place to sort fragments into bins. Stream out the fragments being generated by the rasterizer into global memory and then sort them. Or maybe sort them on chip. At least the fragments generated in a single draw call should have a large amount of spatial locality.
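
For reference, the binning step under discussion can be sketched as below. Note the swap: this bins whole triangles by screen-space bounding box rather than sorting individual fragments as suggested in this post. The tile size, function name and data layout are assumptions for illustration and are not taken from Larrabee or from any GPU.

```python
from collections import defaultdict

TILE_SIZE = 64  # tile edge in pixels (hypothetical)

def bin_triangles(triangles, width, height, tile=TILE_SIZE):
    """Return {(tile_x, tile_y): [triangle indices]} in submission order.

    Binning is conservative: a triangle lands in every tile that its
    bounding box touches, even if the triangle itself misses the tile.
    """
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen's tile grid.
        tx0 = max(0, int(min(xs)) // tile)
        tx1 = min(max_tx, int(max(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        ty1 = min(max_ty, int(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)

triangles = [
    [(10, 10), (50, 10), (10, 50)],     # fits inside tile (0, 0)
    [(60, 60), (130, 60), (60, 130)],   # bounding box spans a 3x3 block of tiles
]
bins = bin_triangles(triangles, 256, 256)
for key in sorted(bins):
    print(key, bins[key])
```

Once every tile has its list, a tile can be loaded into on-chip storage once and all of its primitives processed before it is written back, which is the bandwidth saving being argued for.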

CarstenS · Sep 17, 2009

Jawed said:
RV770 has 40 vec4 32-bit TUs. L1 bandwidth into the ALUs is 480GB/s, 48GB/s per cluster, 64 bytes per clock per TU.

It has 10*16 FP32 samplers, but only 10*4 address units and filters.
AMD gives a figure of 120 GTex/s for the 32-bit texture fetch rate, with no mention of whether this applies to FP or INT textures.

Additionally, I see strong indications that, apart from texture fetch, other operations seem to run at full speed only for 32-bit INT.
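
The figures quoted in this post can be cross-checked with a little arithmetic. The 750 MHz engine clock is an assumption (the HD 4870's clock; it is not stated in the thread), and so is reading "per TU" in the quote as "per cluster", which is the reading under which the numbers agree.

```python
CORE_CLOCK_HZ = 750e6              # assumed engine clock, not from the thread
CLUSTERS = 10
BYTES_PER_CLOCK_PER_CLUSTER = 64   # "64 bytes per clock per TU", read per cluster
SAMPLERS_PER_CLUSTER = 16

per_cluster_gb_s = BYTES_PER_CLOCK_PER_CLUSTER * CORE_CLOCK_HZ / 1e9
total_gb_s = per_cluster_gb_s * CLUSTERS
fetch_rate_g = CLUSTERS * SAMPLERS_PER_CLUSTER * CORE_CLOCK_HZ / 1e9
bytes_per_fetch = total_gb_s / fetch_rate_g

print(f"L1 bandwidth per cluster: {per_cluster_gb_s:.0f} GB/s")
print(f"L1 bandwidth total:       {total_gb_s:.0f} GB/s")
print(f"Fetch rate:               {fetch_rate_g:.0f} G fetches/s")
print(f"Bytes per fetch:          {bytes_per_fetch:.0f}")
```

Under those assumptions the 48 GB/s, 480 GB/s and 120 GTex/s figures are mutually consistent, and they work out to 4 bytes (32 bits) per fetch.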

ferro · Sep 17, 2009

The first shop to offer the HD5870 for pre-order in The Netherlands charges a mere 319.99 Euro. This is quite promising.

Edit: the product page has been removed.

3dilettante · Sep 17, 2009

Jawed said:
Generating and reading bins shouldn't suffer at the hands of NUMA - it's lightweight isn't it?

Some slides by Tom Forsyth gave a figure of about 10-20% of the work being in the front-end, depending on what is being done.

This amount of work can remain relatively unchanged if the bin setup occurs on one chip and the other chip or chips wait for work to be farmed out to them.
Creating a producer/consumer relationship between the chips will have some additive effect, as cores on the other chips will have to wait until various bins are set up by the main core and then stored off to their local memory pool.

A bigger concern that occurred to me is that I'm not sure if the bins themselves are a worse bandwidth burden than just sending the command and vertex streams to both chips.
If the front-end is at the lower 10% range, the bins may be more compact, as at that range the front end only does triangle placement and defers other vertex work and tessellation.

If at the higher range, it means other vertex work and tessellation was done in the front-end, and then with sufficient amplification the bins may pose a bandwidth bottleneck over the interconnect.
I'm pretty sure the interconnect can handle the raw command stream, as the worst this can be is the maximum the PCI-E bus can carry.
A large number of tessellation-amplified bins can potentially be a problem.
Even without tessellation, I am curious whether a bin is more compact than the commands that spawned it.

The dual front-end case would also overlap more setup latency, as opposed to the transfer latency that is injected by a single setup process sending to all chips.
There is potentially less synchronization necessary here as well.
The output phase would need some amount of synchronization (not sure if this is more than would be needed in any other multi-chip case), but the rest of the process may be allowed to flow pretty independently.

Rangers · Sep 17, 2009

I added up all CJ's comparable benchmarks of 5870 vs GTX285 across 11 games and came up with 5870 being 54% faster on average. The range ran from a high of 125% faster on average in Stormrise to a low of 23% faster on average in Crysis Warhead.

However, I then tossed the Stormrise results, since it is DX10.1, not a usual benchmark game, got mediocre reviews and low sales, and was the biggest outlier. Across the remaining 10 games, 5870 averaged 44% faster.

The Crysis Warhead results are a little surprising to me since that was one game, at least the non-Warhead Crysis, where 4890 excelled.

If anything, 5870 typically increased its lead the higher the resolution/AA/AF settings were turned up, so I don't see a bandwidth limiting problem, at least versus the current competition.

Pressure · Sep 17, 2009

ferro said:
The first shop to offer the HD5870 for pre-order in The Netherlands charges a mere 319.99 Euro. This is quite promising.

Yeah, it's only $70 more than the US retail price

Although I am much more interested in the Radeon HD 5870 SIX with Mini-DisplayPort

bowman · Sep 17, 2009

Prices are way higher than US prices as expected. ATI can keep this one.

Rangers · Sep 17, 2009

Pressure said:
Yeah, it's only $70 more than the US retail price

Although I am much more interested in the Radeon HD 5870 SIX with Mini-DisplayPort

Doesn't EU usually get a 1:1 Euro-dollar conversion? Of course if so the implications of a 319 Euro 5870 are obvious.

Bouncing Zabaglione Bros. · Sep 17, 2009

bowman said:
Prices are way higher than US prices as expected. ATI can keep this one.

Is that ATI or initial retailer gouging? If it's the latter, you can expect prices to come down as stock gets out and retailers compete. I personally think that price is too high as it works out to £285.

ferro · Sep 17, 2009

Pressure said:
Yeah, it's only $70 more than the US retail price

Not really. This includes VAT (19%), and as a rule you can say that the USD price translates directly to the EUR price. To compare, the cheapest GTX285 costs EUR 262, and the cheapest GTX295 costs EUR 375. The cheapest HD4890 costs EUR 153, and the cheapest HD4870X2 costs EUR 240.

Pressure · Sep 17, 2009

Rangers said:
Doesn't EU usually get a 1:1 Euro-dollar conversion? Of course if so the implications of a 319 Euro 5870 are obvious.

Yeah, I wasn't complaining about the price. That was to be expected.

But usually they charge a bit more for preorders.

ferro said:
Not really. This includes VAT (19%), and as a rule you can say that the USD price translates directly to the EUR price. To compare, the cheapest GTX285 costs EUR 262, and the cheapest GTX295 costs EUR 375. The cheapest HD4890 costs EUR 153, and the cheapest HD4870X2 costs EUR 240.

It could be much worse, I agree. We pay 25% VAT here in Denmark, so we are used to high prices.

MarkoIt · Sep 17, 2009

I have read your concerns about the next iteration of ATI's current architecture, due to the limits of bandwidth availability with GDDR5. But what if ATI replaces GDDR5 with faster types of memory like XDR 2?
http://www.rambus.com/us/products/xdr2/xdr2_vs_gddr5.html
http://www.rambus.com/us/products/tbi/index.html
Wouldn't that partially resolve the problem?

Mat3 · Sep 17, 2009

AMD already has a license for Rambus stuff.

http://www.rambus.com/us/news/press_releases/2006/060103.html

(Though it was mostly to avoid legal issues with GDDR type memory, which Rambus thinks it owns, but it does mention XDR.)

Does the fact that Rambus has patents on octal and hex data rates mean no one else could do it, or that it could never be an open standard like GDDR? For example, could GDDR6 be made to transfer 8x per clock (not technically, but legally)?

