AMD: R8xx Speculation

nicolasb · Jul 23, 2008

Freak'n Big Panda said:
I'm also curious to know how well games scaled from the single-chip Voodoo 4 to the two-chip Voodoo 5. It should be somewhere in the neighborhood of 20-30% if SFR/tiling scaling on modern hardware is anything to go by.

I suspect it was a hell of a lot better than that, but things were simpler in those days....

NocturnDragon · Jul 23, 2008

Freak'n Big Panda said:
Do you know if the Voodoo 5 suffered from the geometry scaling problem that SLI/Xfire setups suffer from when using SFR or a tiling approach? I would assume it would...

Geometry processing was still done on the CPU, so every GPU didn't have to do it...

nicolasb · Jul 23, 2008

NocturnDragon said:
Geometry processing was still done on the CPU, so every GPU didn't have to do it...

(nod)

The GeForce 256 and GeForce 2 were doing geometry acceleration by then, but Voodoo 5 didn't - one of the reasons it didn't sell very well....

Freak'n Big Panda · Jul 23, 2008

NocturnDragon said:
Geometry processing was still done on the CPU, so every GPU didn't have to do it...

Oh that explains a lot, thanks!

I checked some benchmarks here and here, and scaling is around 40-50%. The cards are also horribly CPU-limited compared to GF2 hardware, which makes sense given the CPU-based geometry processing.

nicolasb · Jul 23, 2008

Voodoo 5 came out very late. If it had come out when it was originally intended to it would have been competing against GeForce 256 (which failed to hit its target clock speeds and had only nominal geometry acceleration) and would have done very well. But it had no chance against GF2.

Getting off-topic, sorry!

Kaotik · Jul 23, 2008

nicolasb said:
The ill-fated Glaze 3D was multi-chip as well.

It was? The two chips that got to silicon, Pyramid3D and Axe, were both single chips, at least.

Mat3 · Jul 23, 2008

With two chips working on different frames, if you wanted them to share memory, what besides textures would need to be shared? Assuming that's all you wanted to send from one to the other, how much bandwidth total (bi-directional) would be needed for that?

Pressure · Jul 23, 2008

fellix said:
The primary benefit of that "looped" bus was relaxing the trace wiring between hosts and clients within the chip, wasn't it?

I vaguely remember someone saying something about hot spots on the core.
Could be wrong though.

Ailuros · Jul 23, 2008

Nice glimpse into the past with all the antiquities here in this thread.

I haven't the faintest idea what RV8x0 could look like, but I for one wasn't laughing at all when the 800SP rumours appeared; those were quite rare compared to the 480 nonsense that circulated in tons. We had seen internal notes saying that they wanted to improve, among other things, texturing and Z fillrates. Increasing the number of TMUs in a design like R6x0/RV6x0 sounded like a one-way street unless one was willing to invest ungodly amounts of R&D resources, given the simple fact that ALUs and TMUs are tied together in the SIMD logic.

There's also a note somewhere that they supposedly tried to increase the ALU frequency in RV670 and failed. I'm not saying they did or didn't, yet if you sit back and consider that 1000 SPs at twice the frequency can do as much work as 2000 SPs at half that frequency, I personally consider the 1000 SPs not too little but quite an interesting question mark and quite a scary prospect for the foreseeable future.
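As a back-of-the-envelope check of that equivalence, here is a minimal sketch. It assumes peak throughput is simply SP count times FLOPs per SP per clock times clock, with 2 FLOPs per clock for a multiply-add. The clock values and the function name `peak_gflops` are made up for illustration and are not rumoured specs.

```python
def peak_gflops(num_sps, clock_mhz, flops_per_sp_per_clock=2):
    """Theoretical peak in GFLOPS, assuming one multiply-add (2 FLOPs) per SP per clock."""
    return num_sps * flops_per_sp_per_clock * clock_mhz / 1000.0


# Hypothetical clocks, chosen only to show the trade-off.
narrow_fast = peak_gflops(1000, 1500)  # fewer SPs, doubled clock
wide_slow = peak_gflops(2000, 750)     # twice the SPs, half that clock
print(narrow_fast, wide_slow)          # the two peaks come out identical
```

Peak numbers like these say nothing about how well either configuration is fed in practice, so treat this as arithmetic only.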

Bottom line: there are many scenarios that could eventually make sense; without knowing the exact parameters there's nothing in my mind that sounds "too little" or "too much".

kyetech · Jul 23, 2008

Ailuros,

What exactly is your point? That you think R8xx could double the FLOP performance of R7xx?

Jawed · Jul 23, 2008

Pressure said:
I vaguely remember someone saying something about hot spots on the core.
Could be wrong though.

I, for one, went down a blind alley with a misinterpretation along those lines.

The conclusion was that it was purely about routing congestion - too many wires trying to fit into a small area - a routing hotspot.

Jawed

Pressure · Jul 24, 2008

Jawed said:
I, for one, went down a blind alley with a misinterpretation along those lines.

The conclusion was that it was purely about routing congestion - too many wires trying to fit into a small area - a routing hotspot.

Jawed

Ah ok, thanks for clearing that up.

It made sense at the time though.

nicolasb · Jul 24, 2008

nicolasb said:
The ill-fated Glaze 3D was multi-chip as well.

Kaotik said:
It was? The two chips that got to silicon, Pyramid3D and Axe, were both single chips, at least.

Glaze3D designs were a long time after Pyramid 3D: this was the "Extreme Bandwidth Architecture" part, with embedded-DRAM. I definitely recall discussions about how the multi-chip versions would divide the screen up into tiles, and that this choice was made because it would thrash the texture caches less than (say) the Voodoo 2 SLI approach of rendering alternate horizontal scan-lines.

My memory is a little hazy but I think they may have talked about a 4-chip version of this, as well as 2-chip.
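To make the cache argument concrete, here is a rough sketch that counts how many distinct texels each chip touches under the two screen splits. It is a toy model, not a description of either product: it assumes a 1:1 pixel-to-texel mapping, a 2x2 bilinear footprint per pixel, a 32-pixel checkerboard tile, and the helper names are made up.

```python
def scanline_owner(x, y, num_chips):
    """Voodoo 2 SLI style split: alternate horizontal scan-lines per chip."""
    return y % num_chips


def tile_owner(x, y, num_chips, tile=32):
    """Checkerboard of square tiles (tile size is an assumption)."""
    return (x // tile + y // tile) % num_chips


def texels_fetched(width, height, owner, num_chips=2):
    """Sum over chips of the distinct texels each chip has to fetch."""
    touched = [set() for _ in range(num_chips)]
    for y in range(height):
        for x in range(width):
            chip = owner(x, y, num_chips)
            # 2x2 bilinear footprint with a 1:1 pixel-to-texel mapping
            touched[chip].update({(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)})
    return sum(len(s) for s in touched)


scanline_total = texels_fetched(128, 128, scanline_owner)
tiled_total = texels_fetched(128, 128, tile_owner)
print(scanline_total, tiled_total)
```

Under these assumptions every chip in the scan-line split ends up fetching nearly the whole texture region, while with tiles the duplicated fetches are limited to tile edges.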

Kaotik · Jul 24, 2008

nicolasb said:
Glaze3D designs were a long time after Pyramid 3D: this was the "Extreme Bandwidth Architecture" part, with embedded-DRAM. I definitely recall discussions about how the multi-chip versions would divide the screen up into tiles, and that this choice was made because it would thrash the texture caches less than (say) the Voodoo 2 SLI approach of rendering alternate horizontal scan-lines.

My memory is a little hazy but I think they may have talked about a 4-chip version of this, as well as 2-chip.

I know it was after Pyramid3D, but it was also earlier than Axe.
Anyway, I checked, and indeed there were apparently plans for multi-chip, with Glaze3D and a Thor chip, where Thor would be both the T&L unit and the bridge for multi-chip solutions.

Humus · Jul 24, 2008

Mat3 said:
With two chips working on different frames, if you wanted them to share memory, what besides textures would need to be shared? Assuming that's all you wanted to send from one to the other, how much bandwidth total (bi-directional) would be needed for that?

Anything that one GPU writes to and the other GPU reads from needs to be shared. Render targets would be the most common thing, but they don't need to be shared if they are cleared and rendered to each frame, which should be true for most render targets. In DX10 it could be StreamOut buffers as well.

I would say the main problem with AFR is not the actual copying that may be necessary and the bandwidth needed for that, but the synchronization. For instance take a simple exposure implementation. GPU0 renders its frame. Then it averages the pixels to compute overall exposure. This ends up in a 1x1 render target. In the next frame the frame brightness is adjusted using this render target as input. GPU1 now needs to wait until GPU0 is finished rendering to the render target. Although the data copied only amounts to just one pixel, each GPU ends up idle most of the frame just because it doesn't have all its data ready from the other GPU. Even if the GPUs had a shared memory pool it wouldn't help, you'd still see scaling of say less than 10%.
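A toy timeline model of that stall, as a sketch only: it assumes every frame costs the same, that frame n runs on GPU n % num_gpus, and that the inter-frame dependency can be described by two fractions of the frame's work (`produce_at`, where the shared result is written, and `consume_at`, where the next frame reads it). All names and parameters here are invented for illustration.

```python
def afr_total_time(num_frames, num_gpus, frame_time=1.0,
                   produce_at=1.0, consume_at=0.0, dependent=True):
    """Time to finish num_frames under AFR with an optional frame-to-frame dependency."""
    gpu_free = [0.0] * num_gpus  # when each GPU finishes its current frame
    prev_ready = 0.0             # when the previous frame's shared result exists
    end = 0.0
    for n in range(num_frames):
        gpu = n % num_gpus
        start = gpu_free[gpu]
        stall = 0.0
        if dependent and n > 0:
            reach = start + consume_at * frame_time
            stall = max(0.0, prev_ready - reach)  # idle until the data arrives
        # A stall that happens before the produce point delays the result too.
        delay = stall if produce_at >= consume_at else 0.0
        prev_ready = start + produce_at * frame_time + delay
        end = start + frame_time + stall
        gpu_free[gpu] = end
    return end


def speedup(num_gpus, frames=200, **kw):
    """Throughput of num_gpus relative to a single GPU in the same model."""
    return afr_total_time(frames, 1, **kw) / afr_total_time(frames, num_gpus, **kw)


print(speedup(2, dependent=False))  # no shared data between frames
print(speedup(2))                   # result needed at the very start of the next frame
print(speedup(2, consume_at=0.9))   # result needed only late in the next frame
```

In this model the outcome depends heavily on where in the frame the value is consumed: if the next frame needs it right away the two GPUs serialize completely, regardless of how little data is copied.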

Lukfi · Jul 24, 2008

A lame question: what is that exposure good for?

Humus · Jul 24, 2008

To compress the dynamic range to something the monitor can show. Otherwise HDR would not look any different from traditional rendering since the highlights would just clip.

nAo · Jul 24, 2008

Humus said:
I would say the main problem with AFR is not the actual copying that may be necessary and the bandwidth needed for that, but the synchronization. For instance take a simple exposure implementation. GPU0 renders its frame. Then it averages the pixels to compute overall exposure. This ends up in a 1x1 render target. In the next frame the frame brightness is adjusted using this render target as input. GPU1 now needs to wait until GPU0 is finished rendering to the render target. Although the data copied only amounts to just one pixel, each GPU ends up idle most of the frame just because it doesn't have all its data ready from the other GPU. Even if the GPUs had a shared memory pool it wouldn't help, you'd still see scaling of say less than 10%.

Good example, but this can be fixed quite easily, as you don't really need the GPU to read that value back.
Let the CPU do it (in the following frame(s)) and send it back to the GPU(s) as a pixel shader constant. No sync points between GPU(s) and no need to sample exposure on a per-pixel basis anymore while tone mapping. Double win.
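A minimal CPU-side sketch of that idea, assuming the average luminance of each frame becomes readable a fixed number of frames later. The class name, the `latency_frames` parameter and the key value of 0.18 are illustrative assumptions, and no real graphics API is involved.

```python
from collections import deque


class DelayedExposure:
    """Keeps the exposure constant a few frames behind the GPU, so nobody waits."""

    def __init__(self, latency_frames=2, key=0.18, initial=1.0):
        self.pending = deque()         # luminance results still "in flight"
        self.latency = latency_frames  # assumed readback delay, in frames
        self.key = key                 # target mid-grey (assumption)
        self.current = initial         # value uploaded as the shader constant

    def end_frame(self, avg_luminance):
        """Queue this frame's result and consume the oldest one once it is old enough."""
        self.pending.append(avg_luminance)
        if len(self.pending) > self.latency:
            ready = self.pending.popleft()
            self.current = self.key / ready
        return self.current


exposure = DelayedExposure()
for lum in [1.0, 2.0, 4.0, 8.0]:
    constant = exposure.end_frame(lum)
    print(constant)  # constant to set for the next frame's tone-mapping pass
```

The constant always comes from a result that is already finished, which is what removes the sync point; the price is that exposure reacts a couple of frames late.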

Mintmaster · Jul 24, 2008

Humus, I see you advocate doing HDR the same way I do. FP10 and similar formats give you plenty of range this way, as the scale factor from that 1x1 lets you span as many orders of magnitude of brightness as you want.

For this particular application, though, it won't make much difference if you use the 1x1 texture from two frames ago. This is especially true when you consider the time constant for exposure adjustment, as two GPUs will render twice as fast. I suppose there are some minor drawbacks, as you could get some funny stuff happening with, for example, muzzle flash that goes off every other frame.
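Here is a small sketch of both halves of that claim. It assumes a simple exponential adaptation toward `key / luminance` and compares reading the stored exposure from the previous frame (`lag=1`) with reading it from two frames ago (`lag=2`). The adaptation rate, key value and luminance numbers are all made up.

```python
def used_exposures(luminances, lag, rate=0.1, key=0.18, initial=1.0):
    """Exposure applied to each frame when adaptation reads the value stored `lag` frames earlier."""
    stored = []  # adapted exposure written at the end of each frame
    used = []    # exposure actually applied when tone mapping each frame
    for n, lum in enumerate(luminances):
        prev = stored[n - lag] if n >= lag else initial
        used.append(prev)
        target = key / lum
        stored.append(prev + rate * (target - prev))
    return used


def flicker(exposures):
    """Frame-to-frame exposure jump at the end of the run."""
    return abs(exposures[-1] - exposures[-2])


steady = [1.0] * 400
# Hypothetical muzzle flash that is bright on every other frame.
flash = [10.0 if n % 2 == 0 else 1.0 for n in range(400)]

for name, scene in (("steady", steady), ("flash", flash)):
    print(name, flicker(used_exposures(scene, 1)), flicker(used_exposures(scene, 2)))
```

With steady lighting the two variants settle on the same exposure. With the alternating flash, the two-frames-ago variant effectively runs two separate adaptation chains, one per GPU, and they drift apart, which is the funny stuff mentioned above.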

Mintmaster · Jul 24, 2008

nAo said:
Let the CPU do it (in the following frame(s)) and send it back to the GPU(s) as a pixel shader constant.

I've considered precisely the same thing before, but what kind of latency is there for GPU readback? Can it be done asynchronously like HDD access or will it stall the CPU during this time?
