R520 launch delayed?

Pete said:
Forgive me for not keeping up with all of the internet, but 3? I know 8 pixels per clock and 48 ALUs, but I can't figure out 3. 3-pass, 8x MSAA?
Imagine 48 ALUs grouped in 3 blocks..
 
nAo said:
Pete said:
Forgive me for not keeping up with all of the internet, but 3? I know 8 pixels per clock and 48 ALUs, but I can't figure out 3. 3-pass, 8x MSAA?
Imagine 48 ALUs grouped in 3 blocks..

. . .it's easy if you try.
No real pipelines below us; above us only clipped sky.
Imagine all the functional units, living cycles in peace.

<woohoo>

You may say I'm a dreamer, but I'm not the only one.

. . .err, damn those liquid lunches.
 
..Imagine 1 thread per ALU 'group', 3 threads running at the same time out of 4.. ;)
(no, I'm not dreaming, just read last year X360 leaked specs)
 
nAo said:
Imagine 48 ALUs grouped in 3 blocks..
Hmmm. Why would they be grouped this way? Because not all are "general purpose"? For memory access reasons? For manufacturing reasons (the die has four blocks for 64 ALUs, but one defective block is disabled)?

Or is this related to the triple-core CPU?

Edit: OK, you posted the answer to my Q at the same time as I asked it, even though you were answering geo's tribute to the muses of grains and hops. Way to freak me out. :)

But how does this relate to framerate? How do three threads compare to four quads or 16 pipes? Does this mean fewer but far more heavily processed pixels per clock, or can a triple-threaded GPU run as fast as, say, a (500MHz, 8 ROP) 6600GT in a shaderless title like Quake 3?
 
My hypothesis: each thread is a batch of (16?) pixels or vertices, and ALUs are reassigned to work on pixel or vertex batches on a per-group basis.
Completed threads generate shaded vertices or shaded pixels; shaded vertices are sent to the primitive setup engine (and then to the rasterizer, from which the to-be-shaded pixels are sent back to the ALUs as new threads), while shaded pixels are sent to the ROPs.

EDIT: corrected 'shaded pixels' to 'shaded vertices'
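
To make that hypothesis concrete, here's a rough Python sketch of the flow being described. Everything in it is illustrative: the batch size, the 3-group figure from the rumour above, and the helper names (setup, rasterize, rops_write) are placeholders, not anything from ATI.

```python
from collections import deque

NUM_GROUPS = 3     # the rumoured 3 ALU groups (16 ALUs each, 48 total)
BATCH_SIZE = 16    # hypothetical: one thread = one batch of ~16 pixels or vertices

ready_threads = deque()   # threads (vertex or pixel batches) waiting for an ALU group

def submit(kind, items):
    """Split incoming work into fixed-size batches; each batch becomes one thread."""
    for i in range(0, len(items), BATCH_SIZE):
        ready_threads.append((kind, items[i:i + BATCH_SIZE]))

def run_one_clock():
    """Hand ready threads to free ALU groups, then route the finished results."""
    for _ in range(NUM_GROUPS):
        if not ready_threads:
            break
        kind, batch = ready_threads.popleft()
        if kind == "vertex":
            # shaded vertices -> primitive setup -> rasterizer -> new pixel threads
            submit("pixel", rasterize(setup(batch)))
        else:
            # shaded pixels -> ROPs
            rops_write(batch)

# Placeholder fixed-function stages, just so the sketch runs.
def setup(verts):    return verts
def rasterize(prim): return list(prim)
def rops_write(px):  pass

submit("vertex", list(range(64)))  # 64 vertices -> 4 vertex threads
run_one_clock()                    # 3 of the 4 run this clock (one per group)
```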
 
[Attached diagram: b3d18.gif]


Jawed
 
Jawed said:

Can you explain what each of the blocks is? Especially the red, blue and yellow ones in the shader units. I'm guessing 2 texture units and 2 texture ALUs, but what about the yellow block?
 
Xmas said:
The ALU stuff doesn't add up. There should be 48 (vec3+scalar) ALUs.

Well, if you have a better idea as to how 48 ALUs work together to produce 8 fragments per clock, then go ahead...

I'm assuming that 8 fragments per clock is the peak theoretical fill-rate.

I'm also assuming that R500 will only render pixel command threads in quads.

Jawed
 
NV4x can dual-issue various combinations of Vec3 or Vec2 + scalar. R420 can do Vec3+scalar or two Vec2. R420 is limited because it can only dual-issue PS1.4 ops. R420's maximum capacity is 5 scalars, or Vec3 + 2 scalars. So R420 lags NV4x.

I'm suggesting that R500 has a symmetrical dual-issue capability. As opposed to NV40's asymmetric ALUs, with one ALU further tied up by texturing.

I'm referring to this:

http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=10

As I see it, R500 is focussed on ensuring that stalls in the ALU Engine are short-lived and can be filled rapidly by an alternative command thread (if the stall is caused by branch prediction failure). Similarly stalls caused by texturing (which are entirely predictable by the compiler) can be filled-out with a context switch to a different command thread, hence no stall ever takes place due to texturing.

So in my diagrams I'm guessing that R500 can dual-issue texture-address + Vec4 + scalar. Completely symmetrically, unlike NV4x. So I'm guessing that R500's peak capability is to execute 16 "lines" of code per clock, as each of the eight Unified Shader Pipelines dual-issues 2 "lines" per clock.

The biggest doubt in my mind is to do with the texture address ALU. I'm not sure if such an ALU is redundant in SM3 (I suspect it might be...). If it is then plainly my guess needs re-organising, e.g. to two Vec + scalar co-issue (3 ALUs), in a symmetrical dual-issue configuration, i.e. 6 ALUs total.
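
For what it's worth, a quick arithmetic check of that guess (my numbers, derived only from the figures above): either ALU arrangement happens to land on the 48-ALU total Xmas mentions.

```python
PIPELINES = 8        # guessed unified shader pipelines
DUAL_ISSUE = 2       # symmetrical dual-issue: 2 "lines" of code per clock each
ALUS_PER_SLOT = 3    # either tex-address + vec4 + scalar, or two vec + scalar

print(PIPELINES * DUAL_ISSUE * ALUS_PER_SLOT)  # 48 ALUs either way
print(PIPELINES * DUAL_ISSUE)                  # 16 "lines" of code per clock, peak
```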

Jawed
 
Jawed said:
NV4x can dual-issue various combinations of Vec3 or Vec2 + scalar. R420 can do Vec3+scalar or two Vec2. R420 is limited because it can only dual-issue PS1.4 ops. R420's maximum capacity is 5 scalars, or Vec3 + 2 scalars. So R420 lags NV4x.
NV4x can do either vec4 or co-issue vec3+scalar or vec2+vec2 in each pixel ALU. R3xx/4xx can only do vec4 or co-issue vec3+scalar; it has no vec2+vec2 capability.
In the vertex pipeline, both are vec4+scalar.

I'm suggesting that R500 has a symmetrical dual-issue capability. As opposed to NV40's asymmetric ALUs, with one ALU further tied up by texturing.
The first ALU isn't tied up by texturing; it's just that there is no datapath for texture coordinates other than through this ALU, which means you can still multiply the texcoords in the same cycle.

R500 doesn't have this limitation, but there are three times more (full vec4+1, thanks nAo) ALUs than texture samplers.

As I see it, R500 is focussed on ensuring that stalls in the ALU Engine are short-lived and can be filled rapidly by an alternative command thread (if the stall is caused by branch prediction failure). Similarly stalls caused by texturing (which are entirely predictable by the compiler) can be filled-out with a context switch to a different command thread, hence no stall ever takes place due to texturing.
Stalls caused by texturing aren't predictable at all. Think cache misses, AF.

So in my diagrams I'm guessing that R500 can dual-issue texture-address + Vec4 + scalar. Completely symmetrically, unlike NV4x. So I'm guessing that R500's peak capability is to execute 16 "lines" of code per clock, as each of the eight Unified Shader Pipelines dual-issues 2 "lines" per clock.

The biggest doubt in my mind is to do with the texture address ALU. I'm not sure if such an ALU is redundant in SM3 (I suspect it might be...). If it is then plainly my guess needs re-organising, e.g. to two Vec + scalar co-issue (3 ALUs), in a symmetrical dual-issue configuration, i.e. 6 ALUs total.
Each of these green blocks is a full vec+scalar ALU, but I'm not sure how they distribute the instructions.

Demirug is hinting at the decoupling of pipelines and ROPs in NV4x. NV43 can output 4 pixels but has 8 pipelines. R500 is constrained by the bandwidth to the eDRAM (which the ROPs are part of); it only allows 8 pixels with color+z/stencil per clock. But that doesn't mean it can't work on more than 8 pixels in parallel.
 
Xmas said:
Jawed said:
NV4x can dual-issue various combinations of Vec3 or Vec2 + scalar. R420 can do Vec3+scalar or two Vec2. R420 is limited because it can only dual-issue PS1.4 ops. R420's maximum capacity is 5 scalars, or Vec3 + 2 scalars. So R420 lags NV4x.
NV4x can do either vec4 or co-issue vec3+scalar or vec2+vec2 in each pixel ALU. R3xx/4xx can only do vec4 or co-issue vec3+scalar; it has no vec2+vec2 capability.
Wrong. Listen to Richard Huddy at GDC05.

I'm suggesting that R500 has a symmetrical dual-issue capability. As opposed to NV40's asymmetric ALUs, with one ALU further tied up by texturing.
The first ALU isn't tied up by texturing; it's just that there is no datapath for texture coordinates other than through this ALU, which means you can still multiply the texcoords in the same cycle.
No, this ALU is unusable whilst texturing is being performed by the TMU. The compiler can only issue Vec2 (or scalars?) whilst texturing. R420 can continue executing any code whilst waiting for texture results, until a dependent operation arises. NV4x is comparatively limited in what code it can execute during texturing latency.

As I see it, R500 is focussed on ensuring that stalls in the ALU Engine are short-lived and can be filled rapidly by an alternative command thread (if the stall is caused by branch prediction failure). Similarly stalls caused by texturing (which are entirely predictable by the compiler) can be filled-out with a context switch to a different command thread, hence no stall ever takes place due to texturing.
Stalls caused by texturing aren't predictable at all. Think cache misses, AF.
Sigh, you really should go read the R500 patent threads. The whole point of unified shaders is that the compiler knows that a texture operation will cause a stall. You don't need to know how long a stall is to switch the thread out of ALU context (and into TMU context) and bring in another thread (or 2 or 5, doesn't matter) for the ALU Engine to execute, thus filling the texture latency.
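
Here's a minimal Python sketch of the scheduling idea I'm describing, just to show why the compiler never needs to know the stall length. The queue names and the thread interface (next_op(), execute(), is_texture) are made up for illustration; this isn't lifted from the patents.

```python
from collections import deque

alu_ready = deque()    # threads with ALU work ready to issue ("ALU context")
tex_pending = []       # threads parked with the TMUs, awaiting texture results

def alu_engine_tick():
    """One clock of the ALU engine: if the current thread hits a texture op,
    park it and immediately pull in another ready thread, so the ALUs only
    idle when there is literally nothing ready."""
    if not alu_ready:
        return
    thread = alu_ready.popleft()
    op = thread.next_op()
    if op.is_texture:
        tex_pending.append(thread)   # switch out of ALU context, into TMU context
    else:
        thread.execute(op)
        alu_ready.append(thread)     # back into the ready pool (round-robin)

def texture_result_arrives(thread):
    """Whenever the TMUs finish (however long that took), the thread simply
    becomes ready again. The stall length never mattered to the scheduler."""
    tex_pending.remove(thread)
    alu_ready.append(thread)
```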

So in my diagrams I'm guessing that R500 can dual-issue texture-address + Vec4 + scalar. Completely symmetrically, unlike NV4x. So I'm guessing that R500's peak capability is to execute 16 "lines" of code per clock, as each of the eight Unified Shader Pipelines dual-issues 2 "lines" per clock.

The biggest doubt in my mind is to do with the texture address ALU. I'm not sure if such an ALU is redundant in SM3 (I suspect it might be...). If it is then plainly my guess needs re-organising, e.g. to two Vec + scalar co-issue (3 ALUs), in a symmetrical dual-issue configuration, i.e. 6 ALUs total.
Each of these green blocks is a full vec+scalar ALU, but I'm not sure how they distribute the instructions.
I drew those diagrams.

Demirug is hinting at the decoupling of pipelines and ROPs in NV4x. NV43 can output 4 pixels but has 8 pipelines. R500 is constrained by the bandwidth to the eDRAM (which the ROPs are part of); it only allows 8 pixels with color+z/stencil per clock. But that doesn't mean it can't work on more than 8 pixels in parallel.
I'll wait till Demirug says what he means.

Jawed
 
Jawed said:
Wrong. Listen to Richard Huddy at GDC05.
I haven't found any evidence of a vec2+vec2 split in the PDFs. Is it only in the video? I don't have the bw for that, sorry.

http://www.beyond3d.com/reviews/ati/r420_x800/index.php?p=8

No, this ALU is unusable whilst texturing is being performed by the TMU. The compiler can only issue Vec2 (or scalars?) whilst texturing. R420 can continue executing any code whilst waiting for texture results, until a dependent operation arises. NV4x is comparatively limited in what code it can execute during texturing latency.
I might be wrong on the mul of texcoords, though I don't see why there would be such a restriction. R420 is different in that it still uses the phase concept and TMU and ALU are different parts of the pipeline.
What do you mean with "during texturing latency"? Trilinear/AF/FP multi-cycle reads?

Stalls caused by texturing aren't predictable at all. Think cache misses, AF.
Sigh, you really should go read the R500 patent threads. The whole point of unified shaders is that the compiler knows that a texture operation will cause a stall. You don't need to know how long a stall is to switch the thread out of ALU context (and into TMU context) and bring in another thread (or 2 or 5, doesn't matter) for the ALU Engine to execute, thus filling the texture latency.
Ah, ok. If by stall you mean the usual texture latency this makes sense. I was looking at it from a more traditional POV where latency for a usual bilinear texture fetch is hidden by the pipeline architecture.

I think you meant that's the point of a shader unit pool, not unified shaders (which are about load balancing and identical capabilities).

I drew those diagrams.
Well, I thought they at least were built on other, less specific/non-labelled "official" diagrams. You said the second one was just a guess, but it made sense (apart from the ALUs) in the context of the first one.

I don't think it's actually built like that.
 
Xmas said:
Jawed said:
Wrong. Listen to Richard Huddy at GDC05.
I haven't found any evidence of a vec2+vec2 split in the PDFs. Is it only in the video? I don't have the bw for that, sorry.
Yes, just the video. He talks about making sure you explicitly mask operations (e.g. ".xy") if you know you're only using two channels. This enables the compiler to combine two .xy masked ops into one cycle.

So, R420 is as flexible as NV4x in terms of combining vector and scalar ops, but R420 has a more limited issue capacity because the second vector ALU is PS1.4.

http://www.beyond3d.com/reviews/ati/r420_x800/index.php?p=8

No, this ALU is unusable whilst texturing is being performed by the TMU. The compiler can only issue Vec2 (or scalars?) whilst texturing. R420 can continue executing any code whilst waiting for texture results, until a dependent operation arises. NV4x is comparatively limited in what code it can execute during texturing latency.
I might be wrong on the mul of texcoords, though I don't see why there would be such a restriction. R420 is different in that it still uses the phase concept and TMU and ALU are different parts of the pipeline.
What do you mean with "during texturing latency"? Trilinear/AF/FP multi-cycle reads?
Yeah.

These results from GPUBench:

http://graphics.stanford.edu/projects/gpubench/results/6800Ultra-7189/
http://graphics.stanford.edu/projects/gpubench/results/X800XT-4955-PCIe/

Specifically the Fetch Cost graphs, e.g.:

[Fetch Cost graph: 6800 Ultra]

[Fetch Cost graph: X800 XT PE]

Note how the X800's graph would pass through the origin, whereas the 6800 Ultra's doesn't.

I think you meant that's the point of a shader unit pool, not unified shaders (which are about load balancing and identical capabilities).

Yes!

Jawed
 
Demirug, the leak says that R500 has a peak rate of 2 quads per cycle. This happens to match the peak write rate to the eDRAM module; the two figures are stated separately.
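
A rough back-of-envelope of how those two figures line up (my assumptions: a 500MHz core clock and 32-bit colour plus 32-bit Z/stencil per pixel, neither of which is in the quoted figures):

```python
pixels_per_clock = 2 * 4        # 2 quads per cycle = 8 pixels
bytes_per_pixel = 4 + 4         # assumed: 32-bit colour + 32-bit Z/stencil
core_clock = 500e6              # assumed 500MHz, not stated in the leak

peak_write = pixels_per_clock * bytes_per_pixel * core_clock
print(peak_write / 1e9)         # 32.0 GB/s peak write rate into the eDRAM module
```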

Jawed
 
Jawed said:
Yes, just the video. He talks about making sure you explicitly mask operations (e.g. ".xy") if you know you're only using two channels. This enables the compiler to combine two .xy masked ops into one cycle.

So, R420 is as flexible as NV4x in terms of combining vector and scalar ops, but R420 has a more limited issue capacity because the second vector ALU is PS1.4.
There are several documents from ATI stating there is a 3:1 split only. The reason why it is still important to use write masks is that the compiler can often combine multiple identical ops. Say you have three scalar MULs: there is a chance that the compiler can pack them into a single vec3 MUL. The same thing applies to two vec2 MULs. However, there are limitations on register usage that sometimes block this. NV40 doesn't have those limitations and can combine two different vec2 ops, e.g. DP2 and MAD.
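
A toy illustration of that packing, in Python: a greedy packer keyed only on write-mask width. Purely illustrative — it ignores the register-usage limitations just mentioned and is not ATI's compiler.

```python
def pack_identical_muls(ops):
    """Pack runs of identical MULs into 4-wide issue slots, using each op's
    write-mask width. Toy model: real hardware also has register-usage
    restrictions that can block this packing on R3xx/R4xx."""
    slots, current, used = [], [], 0
    for op, width in ops:                     # e.g. ("MUL", 1) = scalar .x MUL
        if op != "MUL" or used + width > 4:   # can't pack: flush the open slot
            if current:
                slots.append(current)
            current, used = [], 0
        if op == "MUL":
            current.append((op, width))
            used += width
        else:
            slots.append([(op, width)])       # anything else issues on its own
    if current:
        slots.append(current)
    return slots

# Three scalar MULs (.x masks) pack like a single vec3 MUL -> 1 slot:
print(len(pack_identical_muls([("MUL", 1)] * 3)))   # 1
# Two vec2 MULs (.xy masks) also share one slot:
print(len(pack_identical_muls([("MUL", 2)] * 2)))   # 1
# Without masks they are treated as full vec4 writes -> one slot each:
print(len(pack_identical_muls([("MUL", 4)] * 3)))   # 3
```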

These results from GPUBench:

http://graphics.stanford.edu/projects/gpubench/results/6800Ultra-7189/
http://graphics.stanford.edu/projects/gpubench/results/X800XT-4955-PCIe/

Specifically the Fetch Cost graphs, e.g.:

[Fetch Cost graph: 6800 Ultra]

[Fetch Cost graph: X800 XT PE]

Note how the X800's graph would pass through the origin, whereas the 6800 Ultra's doesn't.
There's an interesting thread on this already. Tridam pointed out that NV40 can execute independent math instructions while waiting for a texture fetch, but obviously stalls with dependent math. R3xx/4xx does not stall on dependent math, only on dependent texture reads.
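
To spell out the difference Tridam described, here's a deliberately crude Python model: the shader, the 4-clock fetch latency and the flag name are all made up for illustration.

```python
FETCH_LATENCY = 4   # arbitrary toy latency for a texture fetch, in clocks

def issue_clocks(shader, hides_dependent_math):
    """Count clocks to issue a toy shader. hides_dependent_math=True models the
    R3xx/4xx behaviour described above (only dependent texture reads stall);
    False models the NV40 behaviour (dependent math waits for the fetch)."""
    t, fetch_ready = 0, {}
    for op, dst, deps in shader:
        if op == "TEX":
            fetch_ready[dst] = t + FETCH_LATENCY
        elif not hides_dependent_math:
            for d in deps:                       # wait for any fetch we depend on
                t = max(t, fetch_ready.get(d, 0))
        t += 1                                   # one issue per clock otherwise
    return t

shader = [
    ("TEX", "r0", []),        # texture fetch
    ("MUL", "r1", []),        # independent math: fine on both architectures
    ("MAD", "r2", ["r0"]),    # math dependent on the fetch result
]
print(issue_clocks(shader, hides_dependent_math=False))  # NV40-style: 5 (waits)
print(issue_clocks(shader, hides_dependent_math=True))   # R3xx/4xx-style: 3 (no stall)
```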
 