AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks

    Votes: 1 0.6%
  • Within a month

    Votes: 5 3.2%
  • Within a couple of months

    Votes: 28 18.1%
  • Very late this year

    Votes: 52 33.5%
  • Not until next year

    Votes: 69 44.5%

  • Total voters: 155
  • Poll closed.
I'm also curious to know how well games scaled from the one-chip Voodoo 4 to the two-chip Voodoo 5. It should be somewhere in the neighborhood of 20-30% if SFR/tiling scaling on modern hardware is anything to go by.
I suspect it was a hell of a lot better than that, but things were simpler in those days....
 
Dragon said:
Geometry processing was still done on the CPU, so every GPU didn't have to do it...

Oh that explains a lot, thanks!

I checked some benchmarks here and here, and scaling is around 40-50%. The cards are also horribly CPU limited compared to GF2 hardware, which makes sense given the CPU-based geometry processing.
 
Voodoo 5 came out very late. If it had come out when it was originally intended to, it would have been competing against the GeForce 256 (which failed to hit its target clock speeds and had only nominal geometry acceleration) and would have done very well. But it had no chance against the GF2.

Getting off-topic, sorry! :oops:
 
With two chips working on different frames, if you wanted them to share memory, what besides textures would need to be shared? Assuming that's all you wanted to send from one to the other, how much bandwidth total (bi-directional) would be needed for that?
 
Nice glimpse into the past with all the antiquities here in this thread.

I haven't the slightest idea what RV8x0 could look like, but I for one wasn't laughing at all when the 800 SP rumours appeared, which were quite rare compared to the 480 nonsense that circulated in tons. We had seen internal notes that they wanted to improve, among other things, texturing and Z fillrates. Increasing the number of TMUs in a design like R6x0/RV6x0 sounded like a dead end unless one was willing to invest ungodly amounts of R&D resources, given the simple fact that the ALUs and TMUs are tied together in the SIMD logic.

There's also a note somewhere that they supposedly tried to increase the ALU frequency in RV670 and failed. I'm not saying they did or didn't, yet if you sit back and consider that 1000 SPs at twice the frequency can do as much work as 2000 SPs at half that frequency, then I personally consider 1000 SPs not too little but a quite interesting question mark, and a quite scary prospect for the foreseeable future.
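(To spell out the arithmetic with made-up clocks: 1000 SPs × 1.5 GHz gives the same peak instruction slots per second as 2000 SPs × 750 MHz, i.e. 1000 × 1.5 = 2000 × 0.75 = 1500 billion slots/s; the half-width design simply trades ALU count for clock headroom.)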

Bottom line, there are many scenarios that could eventually make sense; without knowing the exact parameters, nothing sounds to me like "too little" or "too much".
 
I vaguely remember someone saying something about hot spots on the core.
Could be wrong though.
I, for one, went down a blind alley with a misinterpretation along those lines.

The conclusion was that it was purely about routing congestion - too many wires trying to fit into a small area - a routing hotspot.

Jawed
 
Jawed said:
I, for one, went down a blind alley with a misinterpretation along those lines.

The conclusion was that it was purely about routing congestion - too many wires trying to fit into a small area - a routing hotspot.

Ah ok, thanks for clearing that up :)

It made sense at the time though.
 
nicolasb said:
The ill-fated Glaze 3D was multi-chip as well.
It was? The two chips that got to silicon, Pyramid3D and Axe, were at least both single chips.
Glaze3D designs were a long time after Pyramid 3D: this was the "Extreme Bandwidth Architecture" part, with embedded DRAM. I definitely recall discussions about how the multi-chip versions would divide the screen up into tiles, and that this choice was made because it would thrash the texture caches less than (say) the Voodoo 2 SLI approach of rendering alternate horizontal scan-lines.

My memory is a little hazy, but I think they may have talked about a 4-chip version of this, as well as 2-chip.
 
Glaze3D designs were a long time after Pyramid 3D: this was the "Extreme Bandwidth Architecture" part, with embedded DRAM. I definitely recall discussions about how the multi-chip versions would divide the screen up into tiles, and that this choice was made because it would thrash the texture caches less than (say) the Voodoo 2 SLI approach of rendering alternate horizontal scan-lines.

My memory is a little hazy, but I think they may have talked about a 4-chip version of this, as well as 2-chip.

I know it was after Pyramid3D, but it was also earlier than Axe.
Anyway, I checked and indeed there apparently were plans for multichip, with Glaze3D and a Thor chip, where Thor would be both the TnL unit and the bridge for multichip solutions.
 
With two chips working on different frames, if you wanted them to share memory, what besides textures would need to be shared? Assuming that's all you wanted to send from one to the other, how much bandwidth total (bi-directional) would be needed for that?

Anything that one GPU writes and the other GPU reads needs to be shared. Render targets would be the most common thing, but they don't need to be shared if they are cleared and rendered to each frame, which should be true for most render targets. In DX10 it could be StreamOut buffers as well.

I would say the main problem with AFR is not the actual copying that may be necessary and the bandwidth needed for it, but the synchronization. For instance, take a simple exposure implementation: GPU0 renders its frame, then averages the pixels to compute the overall exposure, which ends up in a 1x1 render target. In the next frame the frame brightness is adjusted using this render target as input, so GPU1 now needs to wait until GPU0 has finished rendering to that render target. Although the copied data amounts to just one pixel, each GPU ends up idle for most of the frame simply because it doesn't have all its data ready from the other GPU. Even if the GPUs had a shared memory pool it wouldn't help; you'd still see scaling of, say, less than 10%.
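To make the stall concrete, here's a purely schematic AFR timeline for that exposure example:

Code:
GPU0: [ render frame N ][ reduce to 1x1 exposure ]
GPU1:  ......idle, waiting on frame N's exposure...... [ render frame N+1 ]

The two frames serialise on a single pixel of data, which is why the copy bandwidth itself is beside the point.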
 
To compress the dynamic range to something the monitor can show. Otherwise HDR would not look any different from traditional rendering since the highlights would just clip.
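As a concrete example of that compression step (hypothetical code, and just the classic Reinhard-style curve rather than anything a particular title necessarily ships), it can be as simple as:

Code:
#include <cstdio>

// Scale by the measured exposure, then map [0, inf) into [0, 1) so that
// highlights roll off smoothly instead of clipping at 1.0.
float Tonemap(float hdr, float exposure)
{
    float c = hdr * exposure;
    return c / (1.0f + c);
}

int main()
{
    // A pixel 8x brighter than mid-grey still lands below 1.0 on screen.
    std::printf("%.3f %.3f\n", Tonemap(1.0f, 0.5f), Tonemap(8.0f, 0.5f));
}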
 
I would say the main problem with AFR is not the actual copying that may be necessary and the bandwidth needed for it, but the synchronization. For instance, take a simple exposure implementation: GPU0 renders its frame, then averages the pixels to compute the overall exposure, which ends up in a 1x1 render target. In the next frame the frame brightness is adjusted using this render target as input, so GPU1 now needs to wait until GPU0 has finished rendering to that render target. Although the copied data amounts to just one pixel, each GPU ends up idle for most of the frame simply because it doesn't have all its data ready from the other GPU. Even if the GPUs had a shared memory pool it wouldn't help; you'd still see scaling of, say, less than 10%.
Good example, but this can be fixed quite easily, as you don't really need the GPU to read back that value.
Let the CPU do it (in the following frame(s)) and send it back to the GPU(s) as a pixel shader constant. No sync points between GPU(s), and no need to sample exposure on a per-pixel basis anymore while tone mapping. Double win :)
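A minimal D3D9-flavoured sketch of what that could look like, assuming a 1x1 D3DFMT_A32B32G32R32F exposure target; all the names here are mine, and drivers vary on how asynchronous GetRenderTargetData really is, so the multi-frame delay is what keeps the Lock cheap:

Code:
#include <d3d9.h>

static const UINT NUM_STAGING = 3;
static IDirect3DSurface9* g_staging[NUM_STAGING]; // 1x1 D3DPOOL_SYSTEMMEM
// surfaces, created once via:
// dev->CreateOffscreenPlainSurface(1, 1, D3DFMT_A32B32G32R32F,
//                                  D3DPOOL_SYSTEMMEM, &g_staging[i], NULL);

// Call once per frame, after the exposure pass has written the 1x1 target.
void FeedExposureBack(IDirect3DDevice9* dev, IDirect3DSurface9* exposureRT,
                      UINT frame)
{
    // Queue this frame's GPU->CPU copy of the 1x1 exposure target.
    dev->GetRenderTargetData(exposureRT, g_staging[frame % NUM_STAGING]);

    if (frame + 1 < NUM_STAGING)
        return; // nothing old enough to read yet

    // Lock the copy queued NUM_STAGING-1 frames ago; by now it should have
    // completed, so neither GPU in an AFR pair has to wait on the other.
    IDirect3DSurface9* old = g_staging[(frame + 1) % NUM_STAGING];
    D3DLOCKED_RECT lr;
    if (SUCCEEDED(old->LockRect(&lr, NULL, D3DLOCK_READONLY)))
    {
        const float lum = static_cast<const float*>(lr.pBits)[0];
        old->UnlockRect();
        float exposure[4] = { lum, 0.0f, 0.0f, 0.0f };
        // The tone-mapping shader reads this constant instead of sampling
        // the 1x1 texture, as described above.
        dev->SetPixelShaderConstantF(0, exposure, 1);
    }
}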
 
Humus, I see you advocate doing HDR the same way I do. FP10 and similar formats give you plenty of range this way, as the scale factor from that 1x1 target lets you span as many orders of magnitude as you want for brightness.

For this particular application, though, it won't make much difference if you use the 1x1 texture from two frames ago. This is especially true when you consider the time constant for exposure adjustment, as two GPUs will render twice as fast. I suppose there are some minor drawbacks, as you could get some funny stuff happening with, for example, muzzle flash that goes off every other frame.
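For what it's worth, the adaptation itself is typically just an exponential approach towards the measured value (hypothetical code; the time constant spanning many frames is exactly why a frame or two of staleness doesn't matter):

Code:
#include <cmath>

// Move the current exposure towards the newly measured target with time
// constant tau. At 60+ fps and a tau of, say, 0.5 s, each step is tiny, so
// a target measured a frame or two earlier changes the result negligibly.
float AdaptExposure(float current, float target, float dt, float tau)
{
    return current + (target - current) * (1.0f - std::exp(-dt / tau));
}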
 
Let the CPU do it (in the following frame(s)) and send it back to the GPU(s) as a pixel shader constant.
I've considered precisely the same thing before, but what kind of latency is there for GPU readback? Can it be done asynchronously like HDD access or will it stall the CPU during this time?
 