AMD: R8xx Speculation

elsence · Sep 16, 2009

Jawed said:
What do you perceive? I'm getting other feedback that beyond 5Gbps is troublesome.

From the one AMD slide that I saw,

I think that AMD is talking about the techniques they used in the 5XXX memory controller / interface, which enabled them to compensate (be more efficient) in the temperature department as GDDR5 escalates its speed.

The 5Gbps comment, I think, is about what ATI could achieve with this particular 40nm 5XXX GPU's memory controller / interface design, based on the GDDR5 memory ICs that ATI & its partners can buy at the right price & quantity now or next quarter (60nm & some 50nm probably). As the memory manufacturers move to smaller processes for memory (below 50nm), they will be able to achieve much higher clocked ICs in the same thermal envelope.

I don't see a problem for the GDDR5 future. I haven't heard anywhere that Samsung or Hynix have problems executing their roadmap plans regarding GDDR5.

Also, ATI is getting more and more experienced with GDDR5 based controllers with each design they make, so I guess that in their future products, based on smaller processes, they will be able to use similar and improved techniques for temperature compensation in their memory controllers / interfaces.

That's my point of view about the slide, but I may have misunderstood it since I don't have a technical background.

About Larrabee and the future design direction of ATI & NV GPUs: I guess this is not the right topic to expand on. If I find the time tomorrow I'll post more about them, if I can find similarly themed topics.

Take care,

Elsense

MfA · Sep 16, 2009

3dilettante said:
The worst-case would be the updates from the RBE, which is what I was saying when I stated that it was the cap.

Are you still assuming they aren't using tiling for distributing the work between them? I just don't see why anyone would make that assumption ... it's trivially parallelizable (fixed latency, unlike normal rasterization). Just having both the rasterizers tile all triangles and only rasterize those which actually belong to its tiles is trivial, order is maintained, efficiency of HZ is unchanged. Done. Why make it any more difficult than that?

Or are you simply saying that shaders which mess with Z can make HZ redundant in the worst case? That's a truism, but is completely unaffected by having 2xHZ.
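
A rough sketch of the kind of tile split described above, just to show why order is preserved. Everything here is an assumption for illustration (the 8-pixel tile size, the checkerboard ownership rule, the conservative bounding-box test, the names `tiles_touched`, `owner`, `distribute`); it is not a claim about what the hardware does.

```python
import math

TILE = 8  # hypothetical tile edge in pixels

def tiles_touched(tri, tile=TILE):
    """Tiles overlapped by the triangle's bounding box (conservative test)."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    x0, x1 = math.floor(min(xs) / tile), math.floor(max(xs) / tile)
    y0, y1 = math.floor(min(ys) / tile), math.floor(max(ys) / tile)
    return [(tx, ty) for ty in range(y0, y1 + 1) for tx in range(x0, x1 + 1)]

def owner(tx, ty, n_rasterizers=2):
    """Hypothetical checkerboard assignment of tiles to rasterizers."""
    return (tx + ty) % n_rasterizers

def distribute(triangles, n_rasterizers=2):
    """Every rasterizer sees every triangle, but keeps only its own tiles.

    Each work list holds (triangle_index, tile) pairs in submission order,
    so per-tile ordering is maintained without any extra sorting.
    """
    work = [[] for _ in range(n_rasterizers)]
    for index, tri in enumerate(triangles):
        for tx, ty in tiles_touched(tri):
            work[owner(tx, ty, n_rasterizers)].append((index, (tx, ty)))
    return work

triangles = [
    ((1, 1), (6, 1), (1, 6)),   # fits inside tile (0, 0)
    ((1, 1), (14, 1), (1, 6)),  # spans tiles (0, 0) and (1, 0)
]
work = distribute(triangles)
```

The second triangle crosses a tile boundary, so it shows up in both work lists, once per tile, and each rasterizer only touches the part that falls in its own tiles.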

rpg.314 · Sep 16, 2009

Jawed said:
Ultimately, yes, but in the meantime they (or at least AMD) are looking like they want to drag out the maximal-FF approach for as long as possible. Perhaps with improved multi-chip to give it a shot in the arm, in AMD's case. I wouldn't mind, but I suspect this misses a trick when it comes to CS/OpenCL.

Jawed

Maybe not.

How about this: consider a gt400 (say) that, instead of bulking up on registers, goes cache heavy to store context (like lrb). But you attach, say, a micro RBE per tile. The RBE talks to the on-chip tile only, but is more area/power efficient than doing the RBE's work in ALUs. It can be configured by the driver at shader compile time.

If they make a single level of cache, unlike the 2 in lrb, and add a DMA unit to pull in the large render tiles, it becomes a hybrid of an SPE, a lrb core and a GPU.

Rasterizer and triangle setup are harder to unlock. Maybe have a dedicated micro rasterizer/tessellator per core too. Or maybe just give up on this and use sw load balancing.

So there is a definite possibility that there will be a place for ff hw, if it is used smartly. I am sure both ATI and nv know about this and the present hole of ff inefficiencies.

Regarding OCL, real shared memory parallelism in GPUs will be a heaven-sent gift for all those who use ocl. Let's hope it comes true.

LordEC911 · Sep 16, 2009

Jawed said:
What do you perceive? I'm getting other feedback that beyond 5Gbps is troublesome.

Jawed

Regarding what?
Power and voltage?

Jawed · Sep 16, 2009

Differently sized balls

The forum software is still fucking up pasted links so:

http://v3.espacenet.com/publication...T=D&date=20090903&CC=US&NR=2009218689A1&KC=A1

The idea is to use bigger balls where mechanical stress is greatest (e.g. due to temperature changes) or where currents are highest. But use smaller balls most of the time, to maintain overall density.

So be on the look out for big balls. Pix of naked chips are urgently sought. etc.

Jawed

Jawed · Sep 16, 2009

elsence said:
From the one AMD slide that I saw,

I think that AMD is talking about the techniques they used in the 5XXX memory controller / interface, which enabled them to compensate (be more efficient) in the temperature department as GDDR5 escalates its speed.

The 5Gbps comment, I think, is about what ATI could achieve with this particular 40nm 5XXX GPU's memory controller / interface design, based on the GDDR5 memory ICs that ATI & its partners can buy at the right price & quantity now or next quarter (60nm & some 50nm probably). As the memory manufacturers move to smaller processes for memory (below 50nm), they will be able to achieve much higher clocked ICs in the same thermal envelope.

I don't see a problem for the GDDR5 future. I haven't heard anywhere that Samsung or Hynix have problems executing their roadmap plans regarding GDDR5.

Also, ATI is getting more and more experienced with GDDR5 based controllers with each design they make, so I guess that in their future products, based on smaller processes, they will be able to use similar and improved techniques for temperature compensation in their memory controllers / interfaces.

That's my point of view about the slide, but I may have misunderstood it since I don't have a technical background.

I think that's fine, but it seems to me there are implications that GDDR5's clocks are scaling slowly, because engineering the GPU end is harder - to a certain extent it seems the memory chip is "relatively passive" - I don't know if that's fair though. 6Gbps chips are supposedly available. Or maybe with the disappearance of Qimonda they aren't, now.

Additionally, AMD has built something that is "2x" HD4890 but it's strangled by significantly less bandwidth. GDDR5 bandwidth, even if it gets to 7Gbps, is only about 40% higher than HD5870 will have.

A single chip with a 256-bit bus of the current architecture can't scale dramatically with that prospect of available bandwidth. And that's the best case imaginable.
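
To put numbers on that "about 40%": a quick sketch, assuming a 256-bit bus and taking the 5Gbps figure discussed in this thread as the baseline data rate (an assumption for the arithmetic, not a confirmed HD5870 spec). The function name is made up for the example.

```python
def bandwidth_gb_s(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: per-pin data rate times bus width, in bytes."""
    return data_rate_gbps * bus_width_bits / 8

BUS_WIDTH_BITS = 256   # from the post above
BASELINE_GBPS = 5.0    # assumption: the 5Gbps figure discussed in this thread
BEST_CASE_GBPS = 7.0   # the optimistic GDDR5 ceiling mentioned above

baseline = bandwidth_gb_s(BASELINE_GBPS, BUS_WIDTH_BITS)
best_case = bandwidth_gb_s(BEST_CASE_GBPS, BUS_WIDTH_BITS)
gain = best_case / baseline - 1

print(f"{baseline:.0f} GB/s -> {best_case:.0f} GB/s, +{gain:.0%}")
```

With the bus width fixed, the gain is just the ratio of data rates, so the only way to do better than that is a wider bus or more chips.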

It seems to me that RV770's GDDR5 implementation was a kind of baby step, hence the problems with power consumption and glitches with varying clocks. So RV870 solves those problems and is, overall, refined.

Though it has to be said, even if RV870 supported 6Gbps GDDR5, that would still be quite a constraint on overall performance.

Jawed

3dilettante · Sep 16, 2009

MfA said:
Are you still assuming they aren't using tiling for distributing the work between them?

I'm going off on a speculative tangent.
I'm assuming only that such a scheme is possible and that it is a scheme that would require the least amount of change to the rasterizer and the least effort overall.

I just don't see why anyone would make that assumption ... it's trivially parallelizable (fixed latency, unlike normal rasterization). Just having both the rasterizers tile all triangles and only rasterize those which actually belong to its tiles is trivial, order is maintained, efficiency of HZ is unchanged. Done. Why make it any more difficult than that?

You've added a rasterizer-aware tile check and some partial rasterization or arbitration for triangles that cross boundaries.
It's all perfectly doable.

I think my idea is way more trivial: "add another rasterizer+HZ and immediately stop caring".

Or are you simply saying that shaders which mess with Z can make HZ redundant in the worst case? That's a truism, but is completely unaffected by having 2xHZ.

It means that the worst-case is unchanged from the single-rasterizer case, meaning that things at that extreme don't get any worse.

I'm also going by the arrows of the simplified AMD slides, which only show arrows between the rasterizers and their local HZs, and arrows linking those HZs to the RBE caches.
I didn't speculate on other paths for updates that had no arrows drawn.

MfA · Sep 16, 2009

3dilettante said:
I think my idea is way more trivial: "add another rasterizer+HZ and immediately stop caring".

Depends on how they maintain sort order at the moment

3dilettante · Sep 16, 2009

MfA said:
Depends on how they maintain sort order at the moment

If it requires a change downstream, it could be more expensive.
I was under the impression that this was something that was already being tracked.

Jawed · Sep 16, 2009

3dilettante said:
Larrabee's bandwidth savings by keeping a screen tile on-chip until it is completed would be something most architectures will probably go towards.
I don't know if there are any good ways to do so without extending tiling throughout the GPU architecture.

That's basically tiling for setup->rasterisation->shading->back-end.

Shading and back-end are tiled, so now it appears there's a good chance that rasterisation is tiled in RV870. Who knows...

It's interesting that Larrabee can supposedly play fast and loose with the placement of rasterisation in the software pipeline. And it's also noteworthy that PrimSets can't be constructed without doing the most coarse rasterisation of triangles. So the default rasterisation in Larrabee actually consists of two distinct rasterisation phases.

Simply preceding rasterisation with a tile-rasteriser would solve a whole load of these kinds of problems in the dual-rasteriser of RV870. A small amount of queuing on inputs and outputs should be enough to always keep both rasterisers running at full speed, assuming "average" primitives aren't all hitting one tile, or one rasteriser's tiles.

The least elegant way to defer tiling would be to hope the foundry masters EDRAM and then slather a big ROP tile cache that holds the framebuffer up to some ridiculously high resolution.
A much larger cache in general would reduce bandwidth needs, as has been found in other computing realms.
It's not great, but it is a hack that is available that requires little more effort than allocating it die space.

Well, the rule of thumb is each doubling of cache produces 10% performance gain, and the RBEs already have colour/Z/stencil caches. Trouble is, this has absolutely no visibility outside of the IHVs. Also, if it was that easy (memory is cheap) wouldn't it already be in place?
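
For a feel of what that rule of thumb implies, here is a tiny sketch. It assumes the 10% gains compound multiplicatively with each doubling, which is my reading of the rule rather than anything measured; `speedup_from_cache` is a made-up name.

```python
import math

GAIN_PER_DOUBLING = 0.10  # the rule-of-thumb figure quoted above

def speedup_from_cache(size_factor: float) -> float:
    """Estimated speedup from growing a cache by size_factor (2 = doubled)."""
    doublings = math.log2(size_factor)
    return (1 + GAIN_PER_DOUBLING) ** doublings

for factor in (2, 4, 16, 64):
    print(f"{factor:>3}x cache -> {speedup_from_cache(factor):.2f}x")
```

Under that assumption, 16x the cache buys less than 1.5x the performance, which is one way to see why simply throwing die area at it isn't an obvious win.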

There is still a H-Z block per rasterizer. If each rasterizer is allowed to update its local copy as the arrows in the diagram indicate, maybe the design assumes that with an even distribution to each rasterizer each local H-Z will start to approach a similar representation in high-overdraw situations.
This would lead to an incremental decrease in effectiveness for short-run situations, and then there is the chance of a long run of pair-wise overlapping triangles pathologically alternating between rasterizers.
The cap would be the RBE z-update latency.

And the problem is that this latency could be so long that it effectively means hierarchical-Z is "off". If overdraw is typically 5 (though some would argue it's lower - and it's obviously lower for early-Z based engines, such as Source) then 2 or 3 of that overdraw could be within the timespan of a set of batches of fragments that are concurrently in flight (shape of a character).
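
A toy model of that latency effect, for illustration only. It assumes a single tile, opaque layers that each cover the whole tile at one depth, smaller depth meaning closer, and an update latency counted in layers; none of this is taken from AMD documentation and `culled_count` is a made-up name.

```python
def culled_count(depths, latency):
    """Count layers rejected by the coarse (hierarchical-Z) test for one tile.

    depths:  depth of each full-tile layer, in draw order (smaller = closer).
    latency: how many later layers are tested before a layer's depth
             becomes visible to the coarse test.
    """
    hz = float("inf")   # coarse buffer starts out "infinitely far"
    pending = []        # (layer index at which the update lands, depth)
    culled = 0
    for i, z in enumerate(depths):
        # Apply every update whose latency has elapsed.
        while pending and pending[0][0] <= i:
            hz = min(hz, pending.pop(0)[1])
        if z > hz:
            culled += 1  # hidden according to the coarse buffer
        else:
            pending.append((i + 1 + latency, z))
    return culled

front_to_back = [1, 2, 3, 4, 5]  # overdraw of 5, best-case draw order
for latency in (0, 2, 4):
    print(latency, culled_count(front_to_back, latency))
```

With no latency, four of the five layers are rejected. With a latency of two layers only two are, and with a latency of four nothing is rejected at all, which is the "effectively off" case.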

Another argument might be that hierarchical-Z is no longer important because developers do early-Z themselves and there are so many deferred engines being used (for which hierarchical-Z would seem to be too slow).

Who knows?...

I don't think these problems have been fully solved for any multi-chip GPU solution.
Not even Intel has shown a path, as Larrabee's binning scheme has been defined only for a single-chip solution.

The PrimSets are stored off chip until their time comes to be rasterised. The lack in Larrabee is essentially the connecting of multiple chips and pooling of resources - the binning scheme seems fine, otherwise, to ride on top of such infrastructure.

The rasterizer portion of the scheme may need to have a local run on the full stream on each chip, with a quick reject of primitives that do not fall within a chip's screenspace.

Bear in mind that tiles are allocated to cores dynamically - there's no locality of bins/tiles for cores. So there'd be no particular locality in a multi-chip scheme.

On-chip solutions have much more leeway.

Sure, which is why I suggested that RV870's two rasterisers update each other continuously, resulting in minimal hierarchical-Z latency.

Cypress could send two triangles at the same time to be rasterized. Each rasterizer gets a copy of this pair, and an initial coarse rasterization stage can allow each rasterizer to decide if it will pick or punt each triangle.
Such a process would be much more expensive if crossing chips.
It's a matter of a duplicate data path and an additional coarse check if on-chip.

As MfA says, the initial tile check should be devastatingly cheap and fast. There would need to be extra buffering around the tiling-rasterisation, and it's obviously a bottleneck, but there's no reason it couldn't have twice or higher throughput to ensure the rasterisers aren't idling.

Jawed

Jawed · Sep 16, 2009

3dilettante said:
It means that the worst-case is unchanged from the single-rasterizer case, meaning that things at that extreme don't get any worse.

The worst-case might not be worse but there might not be a best-case.

As shaders get longer, it becomes more important (the recommendation is to not bother with hierarchical-Z for short shaders).

Jawed

CarstenS · Sep 16, 2009

LordEC911 said:
Nothing about type, just AA patterns.

He could possibly without violating any sort of NDA he might be under.

Jawed said:
This is part of the confusion with the texel rates that Vantage reports, I believe, something that's "never been fixed" because the numbers are right, I think. ATI can fetch way more fp16 or int8 texels than NVidia per clock per unit, because the hardware is designed to be full speed fetching fp32 texels.

Are you talking solely about fetching or filtering also? At any rate - all the results from various programs I can remember right now seem to indicate that at least the assumption of FP32 accelerating the FP16 or INT8 fetch rate is incorrect.

Kaotik said:
http://home.akku.tv/~akku38901/hemlock.jpg
(hosting it myself, original at tweakers.net forum)

Is it hemlock? It's not HD5870 for sure, nor the SIX model, only other option would be HD5850 but the length seems to match the old leak Hemlock-pic

The upper one is supposed to be hemlock - not HD 5870 like in here:
http://www.pcgameshardware.de/aid,6...t/Grafikkarte/News/bildergalerie/?iid=1192326

Jawed · Sep 16, 2009

LordEC911 said:
Regarding what?
Power and voltage?

Simply making it work. I don't know any details I'm afraid.

Jawed

MfA · Sep 16, 2009

Jawed said:
As shaders get longer, it becomes more important (the recommendation is to not bother with hierarchical-Z for short shaders).

You can turn off HZ culling per draw call? Does it still update even if you turn the culling off?

Jawed · Sep 16, 2009

CarstenS said:
Are you talking solely about fetching or filtering also?

Fetching. In that Vantage test, RV770 is fetching 26 fp16 texels per clock. It only has 16 samplers, but they are 32-bit samplers that can do 2x rate for fp16 or 4x rate for int8 (I presume, haven't seen that explicitly stated anywhere, at least that I can remember right now). I don't know why it only achieves 26 texels per clock and not 32...

At any rate - all the results from various programs I can remember right now seem to indicate that at least the assumption of FP32 accelerating the FP16 or INT8 fetch rate is incorrect.

That directly contradicts AMD's own materials though (even if those materials are actually quoting Vantage).

Jawed

CarstenS · Sep 16, 2009

As far as I know (and that's only remembering right now - not extensive research!), R600 and RV670 had 16 full speed FP16-TUs plus an additional 16 FP32-Samplers, which could only assist in unfiltered fetching, but as soon as you're going to filter the results, you need to go the slow and painful way. RV770 had 40 INT8-TUs (and could interpolate for 32 of them at a time).

LordEC911 · Sep 16, 2009

Jawed said:
Simply making it work. I don't know any details I'm afraid.

Jawed

Interesting... I haven't heard anything of the sort.
Wish we had some details because that seems very strange.
Guess that is why we aren't seeing any of those 6-7GHz ICs on the new cards.

Thanks for the info!

Jawed · Sep 16, 2009

MfA said:
You can turn off HZ culling per draw call? Does it still update even if you turn the culling off?

I can't tell, the R6xx_3D_Registers document is fairly opaque. See if you can work it out:

http://www.x.org/docs/AMD/

I presume Z is always kept up to date and there's a resummarisation bit-blit operation (DB_RENDER_OVERRIDE:FORCE_Z_READ) that I guess queries Z to update hierarchical-Z, which I guess would be needed to re-align hierarchical-Z after it's been off for a while. Erm, dunno...

I did find this intriguing pair of registers though:

PREFETCH_WIDTH 11:6 none The Prefetch window width. Prefetcher tries to keep this window around the last rasterized htile in cache at all times.

PREFETCH_HEIGHT 17:12 none The Prefetch window height. Prefetcher tries to keep this window around the last rasterized htile in cache at all times.

But I think that's merely for caching hierarchical-Z tiles in the hierarchical-Z buffer, i.e. when the render target is so large that the on-chip hierarchical-Z buffer isn't big enough.

Jawed

Alistair · Sep 16, 2009

Alexko said:
Wow, 75°C at full load and 30% fan speed, that's quite an improvement from RV770/790.

Sorry if this is the wrong thread, but I'm interested in the effect of the rumoured low power draw of these cards on thermals. If a particular period of, say, gameplay can be rendered at 150 fps, is there any thermal advantage to a chip downclocking in some way to restrict itself to 60fps? Can I expect a chip to do this, and save the thermal headroom for scenes that require more rendering oomph?

Jawed · Sep 16, 2009

CarstenS said:
As far as I know (and that's only remembering right now - not extensive research!), R600 and RV670 had 16 full speed FP16-TUs plus an additional 16 FP32-Samplers, which could only assist in unfiltered fetching, but as soon as you're going to filter the results, you need to go the slow and painful way.

I don't know the detailed form of the Vantage test, i.e. is it filtering?...

RV770 had 40 INT8-TUs (and could interpolate for 32 of them at a time).

RV770 has 40 vec4 32-bit TUs. L1 bandwidth into the ALUs is 480GB/s, 48GB/s per cluster, 64 bytes per clock per TU.
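
A quick sanity check of how those figures hang together. The 750MHz core clock is an assumption (the HD 4870 engine clock; it is not stated in the post), and the variable names are made up for the example.

```python
CORE_CLOCK_GHZ = 0.75      # assumption: HD 4870 engine clock, not from the post
TOTAL_L1_GB_S = 480.0      # from the post
PER_CLUSTER_GB_S = 48.0    # from the post
NUM_TUS = 40               # from the post

clusters = TOTAL_L1_GB_S / PER_CLUSTER_GB_S
bytes_per_clock_per_cluster = PER_CLUSTER_GB_S / CORE_CLOCK_GHZ
bytes_per_clock_per_tu = TOTAL_L1_GB_S / (NUM_TUS * CORE_CLOCK_GHZ)

print(f"clusters: {clusters:.0f}")
print(f"bytes/clock per cluster: {bytes_per_clock_per_cluster:.0f}")
print(f"bytes/clock per TU (of 40): {bytes_per_clock_per_tu:.0f}")
```

Under that clock assumption, the 64 bytes per clock figure comes out per cluster (four TUs), and each of the 40 TUs gets 16 bytes per clock, i.e. one vec4 of 32-bit components.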

Jawed
