AMD: R8xx Speculation

Pantagruel's Friend · May 3, 2009

CarstenS said:
May I ask whether or not you've tested also without any AA/AF? Because that would greatly benefit HD3870s architectural traits (despite it's nominal greater bandwidth) as you know.

And of course, the choice of tests, resolutions etc.pp. also influences the results.

As for AF, I always turn it up to 16x if the game has such an option. If not, I'm not forcing it from the CCC / NCP.
For AA, I do my tests with 0xAA and also 4xAA. And 8xAA with stronger cards. And yes, my tests also show that the 4670 is closer to the 3870 with 4xAA than without.
Resolutions, it's 1280x1024, 1680x1050 and 1920x1200.

About my test cases (also @Jawed), I do use built in benchmarks whenever it's feasible (in 6 of the current 12 games - though three of these are the CoJ benchmark, the Tropics benchmark and Clear Sky's sun rays benchmark). When it's not, then I look for a section in the game where it's most taxing on the VGA, define a test path and run the tests there.

Here are the results - this is not a test actually, only the two result sets pitted against each other. Click on the bars for % values.

Sound_Card · May 4, 2009

JasonCross said:
Like, how deep/big the SIMD units are (is it 4 TMUs per SIMD unit, or 2 TMUs per SIMD and twice as many of them that are half as deep?). How big the local memory is. How big other caches are, what the scheduler is like, if the ROPs have any other improvements to them, etc. I wouldn't expect the RV870 to simply be a "bigger RV770 that does the minimum required to be DX11 compliant." I'd be surprised if there weren't other important tweaks, even if the general architecture isn't radically new.

You know it's funny you bring that up. I was talking to someone about this before.
I was thinking three times the SIMD cores, but half as wide with half the TMU's. This gives you 240 ALU's (1200), and 60 TU'S. Assuming they have the same die size target as before, I believe it fits quite nicely.

TimothyFarrar · May 4, 2009

Jawed said:
If NVidia was ultra-radical and deleted the blending/Z-testing part of the ROPs in favour of using the ALUs, would that save enough area? Can some of the texturing calculations be done in the ALUs, too, e.g. LOD/bias and filtering?

So is it possible for NVidia to be radical enough to leave the ROPs to handle the messy parts of interacting with memory (coalescing/caching, (de)compression) and the TMUs the messy part of memory (caching/decompression)?

As for ROPs,

For any ROP to be efficient, it is going to have to limit transactions to main memory. Which means some form of coalescing. One nice thing about ROP blending, is that the results are not needed back in the ALU units. In contrast to this is CUDA's atomic*() operations (or PTX's atom.* instructions) which return the result of an atomic operation. Note, PTX also has red.* instructions (strangely labeled as "reduction") which do atomic operations without returning any value. In theory these are closer to what an ROP unit might do, except that red.global.* operations can write to areas of memory readable by other cores (I'm making the assumption here that rasterization work distribution to core is fixed, so ROP output region to core mapping is fixed as well).

I don't know how G80 to GT200 ROPs work, but I'd guess that blend inputs themselves were coalesced in an in-order queues, and applied to large enough granularity regions as to optimize the global memory traffic.

Now if NVidia wanted to move ROP operations into ALUs I'd guess that it would involve fragment shader outputs going to a queue based on output region, and a second kernel fetching from the queue, and servicing the ROP operations. Raster step insures that warp sized groupings of outputs from pixel shaders (read from the queue) can do bank conflict free scatter/gather into a local CUDA shared memory tile of the frame buffer.

CarstenS · May 4, 2009

A side question from one not so apt at programming and chipdesign: I take it, that running "other" operations in the shaders, such as Rasterizer, ROP or TMU functionality would be concurrent with normal shading, correct? If so, then those additional operations would be issued just as threads/wavefronts/whatevers like normal shader programms, right? And this in turn would mean that every additional function carried out in the shaders increases register pressure...?

I mean, surely nobody would want to make this rather drastic step forward and then devote one or more programmable SIMDs solely to emulating ff-hardware, wouldn't one?

Jawed · May 4, 2009

TimothyFarrar said:
For any ROP to be efficient, it is going to have to limit transactions to main memory. Which means some form of coalescing.

With screen-space tiling and quad-oriented pixel-shading, on-die buffers can coherently cache render targets very effectively, making it relatively trivial for the ROPs/MCs to use memory efficiently.

One nice thing about ROP blending, is that the results are not needed back in the ALU units. In contrast to this is CUDA's atomic*() operations (or PTX's atom.* instructions) which return the result of an atomic operation. Note, PTX also has red.* instructions (strangely labeled as "reduction")

Why "strangely"? In stream terminology (or in SQL

) they're reductions.

Now if NVidia wanted to move ROP operations into ALUs I'd guess that it would involve fragment shader outputs going to a queue based on output region, and a second kernel fetching from the queue, and servicing the ROP operations. Raster step insures that warp sized groupings of outputs from pixel shaders (read from the queue) can do bank conflict free scatter/gather into a local CUDA shared memory tile of the frame buffer.

Yep it makes sense that one cluster or one multiprocessor owns an area of screen space, just like what's been described for Larrabee so that L2 can hold that tile entirely in memory for the duration of all rendered pixels.

Why not just add "output merger" instructions onto the tail end of the pixel shader? This saves on workload associated with destroying and then creating states and allows the memory latencies to be hidden by more instructions.

Jawed

nAo · May 4, 2009

TimothyFarrar said:
One nice thing about ROP blending, is that the results are not needed back in the ALU units.

That's not a nice thing, It's a huge limitation

Jawed · May 4, 2009

CarstenS said:
A side question from one not so apt at programming and chipdesign: I take it, that running "other" operations in the shaders, such as Rasterizer, ROP or TMU functionality would be concurrent with normal shading, correct? If so, then those additional operations would be issued just as threads/wavefronts/whatevers like normal shader programms, right? And this in turn would mean that every additional function carried out in the shaders increases register pressure...?

Yes.

The payback is that the fixed function units, which often sit there doing nothing are deleted. So you use that space for more programmable ALUs and register space.

The key question is what's the break-even point for GPUs that have a lot of other fixed-function units. The expectation is that Larrabee's going to be unable to compete because it's coming before the break-even point. Larrabee is spending so little on graphics-dedicated fixed function hardware, though, that it may not be such an uneven match.

I mean, surely nobody would want to make this rather drastic step forward and then devote one or more programmable SIMDs solely to emulating ff-hardware, wouldn't one?

You'd never be devoting anything. It varies with workload.

If the frame-rate is 60fps and you're drawing 4 million pixels per frame (240MPixel/s) why do you make the GPU able to do 20GP/s+?

Obviously MSAA and pure Z-fill scenarios demand way more fillrate than that, so there's your answer (and things like G-buffer creation in deferred shading engines can happily soak the ROPs in work). And with the right design for the ROPs the colour/Z/MSAA functionality all overlaps to the extent that the colour rate, per se, is pretty much "free".

Jawed

3dilettante · May 4, 2009

Jawed said:
Yes.

The payback is that the fixed function units, which often sit there doing nothing are deleted. So you use that space for more programmable ALUs and register space.

The fallacy here is that having some amount of silicon doing nothing is the worst thing that could happen.

I'm all for utilization, but there's a difference between doing useful work and just making busy work.

All the extra ALUs in the world that can be gained by dropping ROP sections doesn't mean a thing if the chip is power limited.
The TDP for Tesla chips that have raster-specific portions idling does decrease a fair amount.
What has not been determined is what the power draw would be if those sections were replaced by actively switching vector units, particularly if a larger amount of ALU silicon winds up running loops to emulate the ROPs.

Have we determined that the programmable pipeline results in a power savings at a given level of performance? This is more important so long as chips are bumping up at the limits of acceptable power draw.

I'd rather have silicon doing nothing (that is stupid), than having it have it contribute to hitting TDP limits for a net loss in performance.

Jawed · May 4, 2009

The real fallacy is stamping out yet more and more of the dumb units to run fixed-function graphics pipelines when developers want fully flexible pipes.

It's why GPUs now have programmable shader pipes, not the fixed-function straight-jackets of yesteryear. The tragedy seems to be that D3D and OpenGL are still being weaned off the bizarre oddities of that fixed-function mindset.

So D3D11 introduces read/write render targets/UAVs into pixel shading...

Jawed

3dilettante · May 4, 2009

Jawed said:
The real fallacy is stamping out yet more and more of the dumb units to run fixed-function graphics pipelines when developers want fully flexible pipes.

They say can what they want.
They'd also want threads that never deadlock, can share infinite amounts of data with no latency, and produce deterministic results regardless of how many random pointers they chase in their code.

It doesn't change the market dynamics that require evolutionary change, nor does it alter the fact that so much of the consumer parallel programming pool can barely handle two cores.

It's why GPUs now have programmable shader pipes, not the fixed-function straight-jackets of yesteryear. The tragedy seems to be that D3D and OpenGL are still being weaned off the bizarre oddities of that fixed-function mindset.

So D3D11 introduces read/write render targets/UAVs into pixel shading...

I'm loathe to knock what John Carmack has lauded as the most successful attempts managing multithreaded execution in recent decades. Not because it is "teh Carmack" but because the point is a good one.

The abstractions they supply are useful. The success of other attempts at similar programming on a commercial scale is far more limited.
The tool chains they have evolved are established commercial and training interests.
Larrabee I is going to be primarily an OGL and D3D chip for its commercial lifetime.

Intel is smart to get some input into the graphics API process now, before Nvidia and AMD start pushing for extensions to the APIs specifically designed to undermine coherent SMP multicore chips.

nAo · May 4, 2009

ROPs are really a bad example if compared to what, for instance, LRB does, giving that keeping the data close to the ALUs instead of sending it out to an external device will likely save a ton of bandwidth and power.

3dilettante · May 4, 2009

If the numbers back it up, then that would be the case.
The crossover point has not yet been found outside of theory land.

TimothyFarrar · May 4, 2009

Jawed said:
Yep it makes sense that one cluster or one multiprocessor owns an area of screen space, just like what's been described for Larrabee so that L2 can hold that tile entirely in memory for the duration of all rendered pixels.

Except that quads are gathered/grouped (at least on non-LRB GPUs) into vectors for efficient pixel shading (of small triangles), and thus have to be scattered during ROP. So something like Larrabee's L1/L2 arrangement looses efficiency in that if all the gathered quads arn't on the same cache line then you have to eat that gather (blend) scatter cost.

Why not just add "output merger" instructions onto the tail end of the pixel shader? This saves on workload associated with destroying and then creating states and allows the memory latencies to be hidden by more instructions.

This is actually what I was hoping for myself. This is exactly what the CUDA PTX .surf memory space (and not documented dedicated R/W surface instructions) was for. But after it wasn't in GT200 hardware, I started to wonder if it ever was going to get implemented.

Back in July 2007 there was this thread on the CUDA forum which seemed to verify that it will indeed be in future hardware,

prkipfer : "10. The .surf state space is not supported currently. Am I correct to assume that this one will refer to framebuffer memory in the future? Will this include multisample buffers?"

Simon Green (NVidia) : "Yes. I'm not sure about support for multisample buffers."

Also good info from that post, didn't know G80 had 16-bit instructions/registers which could also could be quite useful when doing blending in the ALUs,

Simon Green (NVidia) : "Section 7.6 refers to the 16-bit integer types { .b16, .u16, .s16, } for instructions {add, sub, mul, mad, div, rem, sad, min, max, set, setp, shr, shl, mov, ld, st, tex, cvt, ... }

The 16-bit PTX instructions generally read and write 16-bit PTX registers. PTX 16-bit registers use half the space of 32-bit registers.

Current 8-series GPUs support 16-bit registers, but future GPUs may implement them as 32-bit registers. Section 7.6 is trying to say that the semantics of the 16-bit PTX instructions are specified such that a GPU may promote 16-bit registers and instructions to 32-bit, which allows some results like shift right to be machine-specific rather than strictly as pure 16-bit width. The 8-series GPUs execute 16-bit instructions with the same performance as 32-bit instructions, so the main value is reducing register space."

Jawed · May 5, 2009

3dilettante said:
They say can what they want.
They'd also want threads that never deadlock, can share infinite amounts of data with no latency, and produce deterministic results regardless of how many random pointers they chase in their code.

I'm loathe to knock what John Carmack has lauded as the most successful attempts managing multithreaded execution in recent decades. Not because it is "teh Carmack" but because the point is a good one.

The abstractions they supply are useful. The success of other attempts at similar programming on a commercial scale is far more limited.
The tool chains they have evolved are established commercial and training interests.
Larrabee I is going to be primarily an OGL and D3D chip for its commercial lifetime.

I agree with all that - no sense throwing the baby out with the bathwater.

But, for example, imagine CUDA without shared memory...

It doesn't change the market dynamics that require evolutionary change, nor does it alter the fact that so much of the consumer parallel programming pool can barely handle two cores.

Nothing about D3D10 mandated unified shaders...

Intel is smart to get some input into the graphics API process now, before Nvidia and AMD start pushing for extensions to the APIs specifically designed to undermine coherent SMP multicore chips.

What extensions might they be?

Jawed

Jawed · May 5, 2009

TimothyFarrar said:
Except that quads are gathered/grouped (at least on non-LRB GPUs) into vectors for efficient pixel shading (of small triangles), and thus have to be scattered during ROP. So something like Larrabee's L1/L2 arrangement looses efficiency in that if all the gathered quads arn't on the same cache line then you have to eat that gather (blend) scatter cost.

Multiples of 10s of cycles compared with always having to hide ~500 cycles? It's not even close. And Larrabee has 4 hardware threads to help amortise L2 latencies. Just as those threads amortise branch mis-prediction latency. And it's not as if Intel is forced to stick with 4 hardware threads with later versions.

This is actually what I was hoping for myself. This is exactly what the CUDA PTX .surf memory space (and not documented dedicated R/W surface instructions) was for. But after it wasn't in GT200 hardware, I started to wonder if it ever was going to get implemented.

It was just a matter of time. UAVs and render targets in D3D11.

Back in July 2007 there was this thread on the CUDA forum which seemed to verify that it will indeed be in future hardware,

prkipfer : "10. The .surf state space is not supported currently. Am I correct to assume that this one will refer to framebuffer memory in the future? Will this include multisample buffers?"

Simon Green (NVidia) : "Yes. I'm not sure about support for multisample buffers."

Also good info from that post, didn't know G80 had 16-bit instructions/registers which could also could be quite useful when doing blending in the ALUs,

Simon Green (NVidia) : "Section 7.6 refers to the 16-bit integer types { .b16, .u16, .s16, } for instructions {add, sub, mul, mad, div, rem, sad, min, max, set, setp, shr, shl, mov, ld, st, tex, cvt, ... }

The 16-bit PTX instructions generally read and write 16-bit PTX registers. PTX 16-bit registers use half the space of 32-bit registers.

Current 8-series GPUs support 16-bit registers, but future GPUs may implement them as 32-bit registers. Section 7.6 is trying to say that the semantics of the 16-bit PTX instructions are specified such that a GPU may promote 16-bit registers and instructions to 32-bit, which allows some results like shift right to be machine-specific rather than strictly as pure 16-bit width. The 8-series GPUs execute 16-bit instructions with the same performance as 32-bit instructions, so the main value is reducing register space."

Aha, I've either forgotten about 16-bit registers (for space reasons) or didn't know. Well I think that explains why warps started off as 16-wide ("half-warps" are G80's unit of execution), as operand collector bandwidth and convoy organisation of operand collection/instruction-despatch seem like they were based on 16-bit banks (instead of 32-bit, now) fetched from per clock.

Jawed

CarstenS · May 5, 2009

Jawed said:
You'd never be devoting anything. It varies with workload.

That was not the way I meant it.

I was trying to imply, that one could (theoretically) limit this workload to not occupy more than a given number of SIMD so as to not load up the register space needed for other per pixel work.

Somehow I have a feeling that this is being done already for vertex work, hence some of the strange results in 3Dmarks feature test or AMDs own DX10-shadertest when comparing older and newer drivers.

Jawed said:
The real fallacy is stamping out yet more and more of the dumb units to run fixed-function graphics pipelines when developers want fully flexible pipes.

Sorry to interject this statement, but most of the time over the last two years I keep seeing console ports instead of genuine new technology for the pc space in terms of game development. Until the consoles have that all new, all flexible and shiny pipelines, I think what you're referring to is the wet dreams of developers before having spoken to financial departments and publishers.

Jawed · May 5, 2009

CarstenS said:
That was not the way I meant it. I was trying to imply, that one could (theoretically) limit this workload to not occupy more than a given number of SIMD so as to not load up the register space needed for other per pixel work.

Somehow I have a feeling that this is being done already for vertex work, hence some of the strange results in 3Dmarks feature test or AMDs own DX10-shadertest when comparing older and newer drivers.

I guess I've forgotten what the anomalies with 3DMark are.

Is it worth asking the AMD driver team what's happening?

Sorry to interject this statement, but most of the time over the last two years I keep seeing console ports instead of genuine new technology for the pc space in terms of game development. Until the consoles have that all new, all flexible and shiny pipelines, I think what you're referring to is the wet dreams of developers before having spoken to financial departments and publishers.

Often the PC ports contain extra graphics features.

Yeah it seems games like Stormrise and Cryostatis are victims of wet dreams of one sort or another.

Jawed

CarstenS · May 5, 2009

We've discussed this previously here:
http://forum.beyond3d.com/showthread.php?p=1260970&highlight=2900#post1260970
The high results of HD2900 XT @techreport.com are reproducable at my rig, the enormous drop-off (to ~1/5th) with later drivers also.

Additionally, lopri showed interesting results, with HD 4890 having a much higher Vertex throughput but being less able with more complex vertex shaders:
http://forum.beyond3d.com/showthread.php?t=54065

That's what I meant with strange behaviour.

Jawed · May 5, 2009

Strange, but seemingly deliberate.

Jawed

3dilettante · May 5, 2009

Jawed said:
Nothing about D3D10 mandated unified shaders...

That decision was helpful in increasing the orthogonality of D3D.
I suppose Microsoft could have allowed emulation of a unified pipeline on top of physically segregated units. I'm not clear if the specification required physical unification.

What extensions might they be?

None that I know of right now, but my view that the established players have an interest in not making it easy to emulate everything they do in software. If they can set up certain parts of functionality that serve the majority of the market (their cards) but just happen to conveniently perform at a mediocre level on a new entrant, they'd be remiss if they didn't try.

The first avenue I see that most directly can hurt Larrabee's flexibility is adding new incompatible texture formats and then evangelizing the heck out of them.
Sure Larrabee could emulate support on the main cores, but not without a performance cost.
Intel might even go along with this. Forced obsolescence will sell more Larrabees than if they can be upgraded in software forever.
This one isn't necessarily a problem with Larrabee's cache structure unless the format conveniently is a bad fit for 64 byte cache lines.

The future direction I think the GPU makers can take is them getting cute with writes to shared memory or shared registers, with multiple small scatters that cause different threads to modify elements within the same cache line.

Larrabee could handle this case correctly, but the penalties from false sharing would rise, particularly if the current shared memory or global register pools it has to emulate remain relatively small.

Granted, current GPUs like RV790 don't like writing to shared memory in a non-aligned and dynamically determined manner. However, it's not like AMD has tried mightily sell the world on the current LDS setup it has currently...

Wacky writing to weird formats is something ROPs could be used to accellerate.

AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to upcoming ATI-RV870 lineup GPUs

Within 1 or 2 weeks

Within a month

Within couple months

Very late this year

Not until next year

Pantagruel's Friend

Sound_Card

TimothyFarrar

CarstenS

Moderator

Jawed

nAo

Nutella Nutellae

Jawed

3dilettante

Jawed

3dilettante

nAo

Nutella Nutellae

3dilettante

TimothyFarrar

Jawed

Jawed

CarstenS

Moderator

Jawed

CarstenS

Moderator

Jawed

3dilettante

Similar threads