GPU speculation

nAo · May 15, 2005

Apoc said:
It's not inadequate, but seeing the 6800U, knowing that G70 and R520 are gonna have 24 or 32 pipes, and after watching people say that it'd be a lot more powerful than PC graphics cards, I thought it'd have at least 12 pipes.

Knowing what? No one except people under NDA knows what G70 and R520 will have. Moreover, the number of pipelines doesn't directly equal the number of pixels output per clock.
R500 has plenty of fill rate (and eDRAM bandwidth too!) to support 720p.

Apoc · May 15, 2005

Fafalada said:
Only 8 pipes? Having to render at 720p I thought it'd have at least 12 pipes.

On what grounds?
720p is only 3x higher than 480p, and chips with ~1 GPix/s have done the job quite well at the latter, even though the rasterizers were rather primitive, only capable of 1-2 operations per pixel at peak rate, and not extremely efficient at reaching peak rates either.

Console chips don't have the necessity to waste transistors on accelerating legacy software like PC ones do.

Ok, I think I'm used to the inefficiency of pc games.

DemoCoder · May 15, 2005

nAo said:
NV40: 2 ALUs per pipeline; the first ALU is capable of a vec4 fmadd per clock, the other one is capable of a vec4 mul per clock, so we have 12 ops per pipeline x 16 pipelines x 425 MHz = 81.6 Gigaflop/s (+ 20 Gigaflop/s if we factor in vertex shaders too)

What about the free NRM_PP and some of the special mini-ALU PS1.1 ops? With 24 pipes, we have 12 * 24 * 425 = 122.4 GFLOP/s. If you count NRM and BIAS, that's an extra 2, so 14 * 24 * 425 = 142.8 GFLOP/s.

R500: 48 ALUs, each ALU is capable of a vec4 and a scalar fmadd, so we have 10 ops per pipeline x 16 x 500 Mhz = 240 GigaFlop/s

Shouldn't that be written as 10 * 48 * 500? There's not 16 pipes.
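The peak figures being traded here can be re-derived with a few lines of Python. This is only a sketch of the arithmetic: `peak_gflops` and the constant names are made up for illustration, a multiply-add is counted as two flops per component, and the unit counts and clocks are simply the ones quoted in the thread (the 24-pipe part is a hypothetical, not a confirmed chip).

```python
# Flops per clock, counting a multiply-add as 2 flops per component.
VEC4_MADD = 8    # 4 components x (mul + add)
VEC4_MUL = 4     # 4 components x mul
SCALAR_MADD = 2  # 1 component x (mul + add)


def peak_gflops(flops_per_unit_per_clock, units, clock_mhz):
    """Theoretical peak in GFLOP/s; real shaders rarely sustain this."""
    return flops_per_unit_per_clock * units * clock_mhz / 1000.0


# NV40 pixel shaders as quoted: 16 pipes at 425 MHz, MADD + MUL per pipe.
nv40 = peak_gflops(VEC4_MADD + VEC4_MUL, 16, 425)

# Hypothetical 24-pipe part at the same clock, without and with the
# two extra ops (NRM and BIAS) counted per pipe.
nv40_24 = peak_gflops(12, 24, 425)
nv40_24_extra = peak_gflops(14, 24, 425)

# R500 as quoted: 48 ALUs at 500 MHz, vec4 + scalar MADD per ALU.
r500 = peak_gflops(VEC4_MADD + SCALAR_MADD, 48, 500)

# The formula as originally typed (16 units instead of 48).
r500_as_typed = peak_gflops(10, 16, 500)

if __name__ == "__main__":
    for name, value in [
        ("NV40, 16 pipes", nv40),
        ("24 pipes", nv40_24),
        ("24 pipes + NRM/BIAS", nv40_24_extra),
        ("R500, 48 ALUs", r500),
        ("R500 as typed, 16 units", r500_as_typed),
    ]:
        print(f"{name}: {value:.1f} GFLOP/s")
```

With 48 units the R500 line comes out at the quoted 240 GFLOP/s, while the formula as typed with 16 units would give only a third of that, which is why the factor matters even though the quoted result is right.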

nAo · May 15, 2005

DemoCoder said:
What about the free NRM_PP and some of the special mini-ALU PS1.1 ops? With 24 pipes, we have 12 * 24 * 425 = 122.4 GFLOP/s. If you count NRM and BIAS, that's an extra 2, so 14 * 24 * 425 = 142.8 GFLOP/s.

I deliberately ignored it because I don't know how R500 ALUs perform reciprocal and bias (maybe it uses the scalar path).
The weird thing is that it seems NV40 can do 2 reciprocals at the same time:

● Each pipeline is capable of performing a four-wide, coissue-able multiply-add (MAD)
or four-term dot product (DP4), plus a four-wide, coissue-able and dual-issuable
multiply instruction per clock in series, as shown in Figure 30-11. In addition, a
multifunction unit that performs complex operations can replace the alpha channel
MAD operation. Operations are performed at full speed on both fp32 and fp16 data,
although storage and bandwidth limitations can favor fp16 performance sometimes.
In practice, it is sometimes possible to execute eight math operations and a texture
lookup in a single cycle.

● Dedicated fp16 normalization hardware exists, making it possible to normalize a
vector at fp16 precision in parallel with the multiplies and MADs just described.

● An independent reciprocal operation can be performed in parallel with the multiply,
MAD, and fp16 normalization described previously.

Shouldn't that be written as 10 * 48 * 500? There's not 16 pipes.

It should, my fault; the result is correct though.

DemoCoder · May 15, 2005

Fafalada said:
Only 8 pipes? Having to render at 720p I thought it'd have at least 12 pipes.

On what grounds?
720p is only 3x higher than 480p, and chips with ~1 GPix/s have done the job quite well at the latter, even though the rasterizers were rather primitive, only capable of 1-2 operations per pixel at peak rate, and not extremely efficient at reaching peak rates either.

Console chips don't have the necessity to waste transistors on accelerating legacy software like PC ones do.

Well, but next-gen games should be capable of dynamic soft shadows and global illumination algorithms, all of which require many fill passes, for example to fill shadow buffers or lay down stencil volumes, and also to generate cube maps for doing real-time reflections. If I've got 16 light sources, I need 16 passes for them. Plus, each cube map needs 6 passes. Also depth complexity should go way up as well.

If next-gen consoles only render scenes using similar illumination and shadow quality as previous generation consoles, just at 720p, I will be thoroughly disappointed.

Guden Oden · May 15, 2005

Apoc said:
It's not inadequate

So what are you complaining about?

It's extremely overpowered for the resolution it's targeted at; 72 fills per frame at 60 frames/sec is total overkill. Agreed, yes/no?

and after watching people say that it'd be a lot more powerful than PC graphics cards, I thought it'd have at least 12 pipes.

Never listen to people. People don't know shit.

Why would you need 12 pipes anyway? Would 108 fills per frame make graphics look even one bit better? You only have 22 GB/s of system bandwidth anyway, and lots of it is going to be eaten up by those three CPU cores; even with the eDRAM there's a good chance the system would be bandwidth-limited. Eight pipes at a half-gigahertz clock give tremendous fill performance, no doubt about it. The main strength is in the shading units anyway, so why waste die space on more pixel pipes? Plain textured fill is, as I already demonstrated, high enough as it is for any purpose you could reasonably think of.
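For reference, the 72 and 108 figures fall out of a simple division. The sketch below assumes one pixel written per pipe per clock, a 500 MHz clock, and a 1280x720 target at 60 frames/sec; `fills_per_frame` is a name invented here for illustration.

```python
def fills_per_frame(pipes, clock_hz, width, height, fps):
    """How many times the whole screen could be filled in one frame,
    assuming every pipe writes one pixel on every clock (peak rate)."""
    pixels_per_second = pipes * clock_hz
    pixels_per_frame_per_second = width * height * fps
    return pixels_per_second / pixels_per_frame_per_second


eight_pipes = fills_per_frame(8, 500e6, 1280, 720, 60)
twelve_pipes = fills_per_frame(12, 500e6, 1280, 720, 60)

if __name__ == "__main__":
    print(f"8 pipes:  about {eight_pipes:.1f} fills per frame")
    print(f"12 pipes: about {twelve_pipes:.1f} fills per frame")
```

These are peak numbers with no overdraw, blending, or multipass work taken into account.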

DemoCoder · May 15, 2005

Guden Oden said:
Apoc said:

It's not inadequate

So what are you complaining about?

It's extremely overpowered for the resolution it's targeted at; 72 fills per frame at 60 frames/sec is total overkill. Agreed, yes/no?

That's only if you have no overdraw and you are not doing any unified shadows or illumination. If you have an overdraw of, say, 10, reduced to, say, 5 by EarlyZ, you're down to 14 fills. And if you are doing shadow buffers or stencil shadow volumes, you need 1 pass for each light source. Let's assume D3-style with 2-3 light sources. Now you're down to 3-5 fills per frame. And what if you need auxiliary render targets, like making some cubemaps? You might want to do supersampled AA as well. And what of HDR? That might slash your fillrate in half.
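As a rough sketch of how that budget shrinks: the code below starts from the 72 fills per frame quoted above, divides by the effective overdraw, then by the number of full-screen passes. Treating the passes as one base pass plus one per light is an assumption made here to reproduce the 3-5 range; the post itself only says one pass per light source. `remaining_fills` is a hypothetical helper name.

```python
def remaining_fills(raw_fills, effective_overdraw, lights):
    """Full-screen fills left per frame after overdraw and lighting passes.

    Assumption (not from the post): one base pass plus one extra
    full-screen pass per light source.
    """
    after_overdraw = raw_fills / effective_overdraw
    passes = 1 + lights
    return after_overdraw / passes


after_early_z = 72 / 5                      # overdraw of 10 cut to 5
two_lights = remaining_fills(72, 5, 2)      # D3-style, 2 lights
three_lights = remaining_fills(72, 5, 3)    # D3-style, 3 lights
three_lights_hdr = three_lights / 2         # HDR halving fill rate

if __name__ == "__main__":
    print(f"after EarlyZ:      {after_early_z:.1f}")
    print(f"2 lights:          {two_lights:.1f}")
    print(f"3 lights:          {three_lights:.1f}")
    print(f"3 lights with HDR: {three_lights_hdr:.1f}")
```

Cube maps, supersampling, and other auxiliary render targets would cut into what is left further; they are not modelled here.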

The fact that D3, HL2, and UE3 are still heavily influenced by fillrate means that there is still a demand for more fillrate. Having lots of shader power is great, but we are seeing more global lighting solutions, and those require multipass and lots of rendertarget work.

Plain textured fill is, as I already demonstrated, high enough as it is for any purpose you could reasonably think of.

Demonstrably false, unless you think a GeForce 6600 has enough fillrate for any game you could possibly think of.

nAo · May 15, 2005

We don't even know if fill rate may double when color writes are off; 8 Gigapixel/s (sustained) wouldn't be that bad.

Jawed · May 15, 2005

The leak suggests that R500 can write 16 z/stencil per clock, as opposed to 8 colour/z/stencil, i.e. with no colour it's writing twice as fast.
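If the leaked per-clock figures are right, and assuming the 500 MHz clock used elsewhere in this thread, the two rates work out as in this small sketch (`samples_per_second` is a name invented here):

```python
CLOCK_HZ = 500e6  # assumed clock, as used elsewhere in the thread


def samples_per_second(samples_per_clock, clock_hz=CLOCK_HZ):
    """Peak write rate for a given number of samples per clock."""
    return samples_per_clock * clock_hz


colour_rate = samples_per_second(8)    # colour/z/stencil writes
z_only_rate = samples_per_second(16)   # z/stencil only, colour off

if __name__ == "__main__":
    print(f"colour + z: {colour_rate / 1e9:.0f} G samples/s")
    print(f"z only:     {z_only_rate / 1e9:.0f} G samples/s")
```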

Jawed

DemoCoder · May 16, 2005

Is that only with AA tho? The same was true since the R300. It was Nvidia that introduced the "double pumped" Z/Stencil on the NV30 in non-AA modes.

Jawed · May 16, 2005

DemoCoder said:
Is that only with AA tho? The same was true since the R300. It was Nvidia that introduced the "double pumped" Z/Stencil on the NV30 in non-AA modes.

I don't know of a way to find out from the leak. Perhaps someone knows better.

To be quite honest I don't have a clue what the deal is with the double-speed z/stencil when AA is turned on, in R300 and up. How come that even works? What makes it possible?

Presumably R300's ability to do double-Z/stencil when AA is on doesn't actually help with D3's shadowing engine.

Jawed

DemoCoder · May 16, 2005

When 2xMSAA is turned on, each R300 pipeline can write 2 z-values per clock, just like the NV30 when 2xMSAA is turned on. The only difference is, when MSAA is turned off, the NV30 can still write 2 z-values per clock, whereas the R300 can't. Nothing magic about the R300's support, other than the fact that it can write 2 samples in MSAA mode with no fillrate loss.

What I was asking is, can the R500 write 16 zvalues when MSAA is turned off, and if so, can it write 32 values when 2x MSAA is turned on?

ninelven · May 16, 2005

erased

Jawed · May 16, 2005

DemoCoder said:
When 2xMSAA is turned on, each R300 pipeline can write 2 z-values per clock, just like the NV30 when 2xMSAA is turned on. The only difference is, when MSAA is turned off, the NV30 can still write 2 z-values per clock, whereas the R300 can't. Nothing magic about the R300's support, other than the fact that it can write 2 samples in MSAA mode with no fillrate loss.

What I was asking is, can the R500 write 16 zvalues when MSAA is turned off, and if so, can it write 32 values when 2x MSAA is turned on?

My problem is trying to understand where the "free" doubling of z/stencil comes from when AA is turned on in R300.

Unless it's not free, it's just a relabelling of something that's happening anyway. In ATI's edge MSAA the AA samples have no colour, they only have Z - and R300 does 2xAA per "loop". So, ahem, 2x Z rate just drops out of the AA sample technique, which is a fairly meaningless way of counting Z bandwidth, argh!!!! as it's solely about writing AA samples.

Well, that's how it seems to me. Is that it?...

Jawed

DemoCoder · May 16, 2005

Yes, so now I'm asking: is that how the R500 works as well, or is something about the way it can write z-values at 2x the rate fundamentally different from the way the R300/R420 could?

Jawed · May 16, 2005

DemoCoder said:
Yes, so now I'm asking: is that how the R500 works as well, or is something about the way it can write z-values at 2x the rate fundamentally different from the way the R300/R420 could?

Hey, I know that's what you're asking (ever since you brought the topic up). I just don't know...

Jawed
