nAo said: Imagine 48 ALUs grouped in 3 blocks..
Pete said: Forgive me for not keeping up with all of the internet, but 3? I know 8 pixels per clock and 48 ALUs, but I can't figure out 3. 3-pass, 8x MSAA?
Hmmm. Why would they be grouped this way? Because not all are "general purpose?" For memory access reasons? For manufacturing reasons (die has four blocks for 64 ALUs, but one defective block is disabled)?
nAo said: Imagine 48 ALUs grouped in 3 blocks..
Xmas said: The ALU stuff doesn't add up. There should be 48 (vec3+scalar) ALUs.
NV4x can do either vec4 or co-issue vec3+scalar or vec2+vec2 in each pixel ALU. R3xx/4xx can only do vec4 or co-issue vec3+scalar, it has no vec2+vec2 capability.
Jawed said: NV4x can dual-issue various combinations of Vec3 or Vec2 + scalar. R420 can do Vec3+scalar or two Vec2. R420 is limited because it can only dual-issue PS1.4 ops. R420's maximum capacity is 5 scalars, or Vec3 + 2 scalars. So R420 lags NV4x.
The first ALU isn't tied up by texturing, it's just that there is no other datapath for texture coordinates than through this ALU. Which means you can still multiply the texcoords in the same cycle.
Jawed said: I'm suggesting that R500 has a symmetrical dual-issue capability. As opposed to NV40's asymmetric ALUs, with one ALU further tied up by texturing.
Stalls caused by texturing aren't predictable at all. Think cache misses, AF.
Jawed said: As I see it, R500 is focussed on ensuring that stalls in the ALU Engine are short-lived and can be filled rapidly by an alternative command thread (if the stall is caused by branch prediction failure). Similarly stalls caused by texturing (which are entirely predictable by the compiler) can be filled-out with a context switch to a different command thread, hence no stall ever takes place due to texturing.
Each of these green blocks is a full vec+scalar ALU, but I'm not sure how they distribute the instructions.
Jawed said: So in my diagrams I'm guessing that R500 can dual-issue texture-address + Vec4 + scalar. Completely symmetrically, unlike NV4x. So I'm guessing that R500's peak capability is to execute 16 "lines" of code per clock, as each of the eight Unified Shader Pipelines dual-issues 2 "lines" per clock.
The biggest doubt in my mind is to do with the texture address ALU. I'm not sure if such an ALU is redundant in SM3 (I suspect it might be...). If it is then plainly my guess needs re-organising, e.g. to two Vec + scalar co-issue (3 ALUs), in a symmetrical dual-issue configuration, i.e. 6 ALUs total.
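As a quick sanity check, the counts being guessed at above can be written out. This is only back-of-envelope arithmetic over the posters' figures, not a confirmed description of the hardware; the constant names are made up for illustration, and reading "(3 ALUs)" as three ALUs per issue slot is an assumption.

```python
# Back-of-envelope check of the ALU counts guessed at in the thread.
# Every figure is a poster's guess, not a confirmed hardware spec.
TOTAL_ALUS = 48        # "48 ALUs"
BLOCKS = 3             # "grouped in 3 blocks"
PIPELINES = 8          # "eight Unified Shader Pipelines"
ISSUE_WIDTH = 2        # dual-issue: 2 "lines" of code per pipeline per clock
ALUS_PER_ISSUE = 3     # assumption: "(3 ALUs)" means 3 ALUs per issue slot

alus_per_block = TOTAL_ALUS // BLOCKS                # ALUs in each of the 3 blocks
lines_per_clock = PIPELINES * ISSUE_WIDTH            # peak "lines" per clock
alus_per_pipeline = ISSUE_WIDTH * ALUS_PER_ISSUE     # "6 ALUs total" per pipeline
total_from_pipelines = PIPELINES * alus_per_pipeline

print(alus_per_block, lines_per_clock, alus_per_pipeline, total_from_pipelines)
```

Under that reading, the "6 ALUs total" per pipeline guess is at least consistent with the 48-ALU figure.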
Wrong. Listen to Richard Huddy at GDC05.
Xmas said: NV4x can do either vec4 or co-issue vec3+scalar or vec2+vec2 in each pixel ALU. R3xx/4xx can only do vec4 or co-issue vec3+scalar, it has no vec2+vec2 capability.
Jawed said: NV4x can dual-issue various combinations of Vec3 or Vec2 + scalar. R420 can do Vec3+scalar or two Vec2. R420 is limited because it can only dual-issue PS1.4 ops. R420's maximum capacity is 5 scalars, or Vec3 + 2 scalars. So R420 lags NV4x.
No, this ALU is unusable whilst texturing is being performed by the TMU. The compiler can only issue Vec2 (or scalars?) whilst texturing. R420 can continue executing any code whilst waiting for texture results, until a dependent operation arises. NV4x is comparatively limited in what code it can execute during texturing latency.
Xmas said: The first ALU isn't tied up by texturing, it's just that there is no other datapath for texture coordinates than through this ALU. Which means you can still multiply the texcoords in the same cycle.
Jawed said: I'm suggesting that R500 has a symmetrical dual-issue capability. As opposed to NV40's asymmetric ALUs, with one ALU further tied up by texturing.
Sigh, you really should go read the R500 patent threads. The whole point of unified shaders is that the compiler knows that a texture operation will cause a stall. You don't need to know how long a stall is to switch the thread out of ALU context (and into TMU context) and bring in another thread (or 2 or 5, doesn't matter) for the ALU Engine to execute, thus filling the texture latency.
Xmas said: Stalls caused by texturing aren't predictable at all. Think cache misses, AF.
Jawed said: As I see it, R500 is focussed on ensuring that stalls in the ALU Engine are short-lived and can be filled rapidly by an alternative command thread (if the stall is caused by branch prediction failure). Similarly stalls caused by texturing (which are entirely predictable by the compiler) can be filled-out with a context switch to a different command thread, hence no stall ever takes place due to texturing.
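The thread-switching idea described above can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions: one ALU engine, one ALU op per cycle, free context switches, and a fixed `tex_latency` that only the simulated TMU knows about (the scheduler never looks at it, it simply parks a thread on every "tex" op). The function name `simulate` and the op strings are hypothetical, and nothing here reflects how R500 actually schedules work.

```python
from collections import deque


def simulate(threads, tex_latency):
    """Toy model: return (total_cycles, idle_alu_cycles) for one ALU engine.

    threads: list of non-empty op lists, each op being "alu" or "tex".
    The scheduler parks a thread whenever its next op is "tex" and runs
    another ready thread instead; it never needs to know tex_latency.
    """
    ready = deque(deque(ops) for ops in threads)
    waiting = []  # (wake_cycle, thread) pairs sitting in "TMU context"
    cycle = 0
    idle = 0
    while ready or waiting:
        # Threads whose texture data has arrived become ready again
        # (a thread with no ops left after its fetch is simply finished).
        ready.extend(t for wake, t in waiting if wake <= cycle and t)
        waiting = [(wake, t) for wake, t in waiting if wake > cycle]

        # Park every thread at the head of the queue that wants a fetch.
        while ready and ready[0][0] == "tex":
            thread = ready.popleft()
            thread.popleft()
            waiting.append((cycle + tex_latency, thread))

        if ready:
            thread = ready[0]
            thread.popleft()        # execute one ALU op this cycle
            if not thread:
                ready.popleft()     # thread finished
        else:
            idle += 1               # nothing to run: the ALU engine stalls
        cycle += 1
    return cycle, idle


program = ["alu", "alu", "tex", "alu", "alu"]
one_thread = simulate([program], tex_latency=4)
three_threads = simulate([program] * 3, tex_latency=4)
print(one_thread, three_threads)
```

In this toy model a single thread leaves the ALU idle for the whole fetch (8 cycles, 4 of them idle), while three copies of the same program finish in 12 cycles with no idle ALU cycles. Whether real hardware has enough threads in flight to cover cache misses or AF is exactly what is being argued about here.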
I drew those diagrams.
Xmas said: Each of these green blocks is a full vec+scalar ALU, but I'm not sure how they distribute the instructions.
Jawed said: So in my diagrams I'm guessing that R500 can dual-issue texture-address + Vec4 + scalar. Completely symmetrically, unlike NV4x. So I'm guessing that R500's peak capability is to execute 16 "lines" of code per clock, as each of the eight Unified Shader Pipelines dual-issues 2 "lines" per clock.
The biggest doubt in my mind is to do with the texture address ALU. I'm not sure if such an ALU is redundant in SM3 (I suspect it might be...). If it is then plainly my guess needs re-organising, e.g. to two Vec + scalar co-issue (3 ALUs), in a symmetrical dual-issue configuration, i.e. 6 ALUs total.
I'll wait till Demirug says what he means.
Demirug is hinting at the decoupling of pipelines and ROPs in NV4x. NV43 can output 4 pixels but has 8 pipelines. R500 is constrained by the bandwidth to the eDRAM (which the ROPs are part of), it only allows 8 pixels with color+z/stencil. But that doesn't mean it can't work on more than 8 pixels in parallel.
I haven't found any evidence of a vec2+vec2 split in the PDFs. Is it only in the video? I don't have the bw for that, sorry.
Jawed said: Wrong. Listen to Richard Huddy at GDC05.
I might be wrong on the mul of texcoords, though I don't see why there would be such a restriction. R420 is different in that it still uses the phase concept and TMU and ALU are different parts of the pipeline.
Jawed said: No, this ALU is unusable whilst texturing is being performed by the TMU. The compiler can only issue Vec2 (or scalars?) whilst texturing. R420 can continue executing any code whilst waiting for texture results, until a dependent operation arises. NV4x is comparatively limited in what code it can execute during texturing latency.
Ah, ok. If by stall you mean the usual texture latency this makes sense. I was looking at it from a more traditional POV where latency for a usual bilinear texture fetch is hidden by the pipeline architecture.
Jawed said: Sigh, you really should go read the R500 patent threads. The whole point of unified shaders is that the compiler knows that a texture operation will cause a stall. You don't need to know how long a stall is to switch the thread out of ALU context (and into TMU context) and bring in another thread (or 2 or 5, doesn't matter) for the ALU Engine to execute, thus filling the texture latency.
Xmas said: Stalls caused by texturing aren't predictable at all. Think cache misses, AF.
Well, I thought they at least were built on other, less specific/non-labelled "official" diagrams. You said the second one was just a guess, but it made sense (apart from the ALUs) in the context of the first one.
Jawed said: I drew those diagrams.
Yes, just the video. He talks about making sure you explicitly mask operations (e.g. ".xy") if you know you're only using two channels. This enables the compiler to combine two .xy masked ops into one cycle.
Xmas said: I haven't found any evidence of a vec2+vec2 split in the PDFs. Is it only in the video? I don't have the bw for that, sorry.
Jawed said: Wrong. Listen to Richard Huddy at GDC05.
Yeah.
http://www.beyond3d.com/reviews/ati/r420_x800/index.php?p=8
I might be wrong on the mul of texcoords, though I don't see why there would be such a restriction. R420 is different in that it still uses the phase concept and TMU and ALU are different parts of the pipeline.
Jawed said: No, this ALU is unusable whilst texturing is being performed by the TMU. The compiler can only issue Vec2 (or scalars?) whilst texturing. R420 can continue executing any code whilst waiting for texture results, until a dependent operation arises. NV4x is comparatively limited in what code it can execute during texturing latency.
What do you mean by "during texturing latency"? Trilinear/AF/FP multi-cycle reads?
I think you meant that's the point of a shader unit pool, not unified shaders (which are about load balancing and identical capabilities).
There are several documents from ATI stating there is a 3:1 split only. The reason why it is still important to use write mask is that the compiler can often combine multiple identical ops. Say you have three scalar MULs, there is a chance that the compiler can pack those three MULs into a single vec3 mul. Same thing applies for two vec2 MULs. However there are limitations on register usage that sometimes block this. NV40 doesn't have those limitations and can combine two different vec2 ops, e.g. DP2 and MAD.
Jawed said: Yes, just the video. He talks about making sure you explicitly mask operations (e.g. ".xy") if you know you're only using two channels. This enables the compiler to combine two .xy masked ops into one cycle.
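To make the disagreement concrete, here is a toy model of co-issue under the two claims in this exchange: a 3:1 split only (as claimed for R3xx/4xx) versus 3:1 or 2:2 (as claimed for NV4x). It is a sketch under stated assumptions, not a description of either compiler: ops are treated as independent, an op is described only by how many components it writes, a narrower op may occupy a wider slot, and pairing is greedy. The names `issue_cycles`, `SPLIT_3_1` and `SPLIT_3_1_OR_2_2` are made up, and the model deliberately ignores the merging of identical ops (three scalar MULs into one vec3 MUL) and the register limits mentioned above.

```python
def can_pair(x, y, splits):
    """True if ops of widths x and y fit together in one of the allowed splits."""
    return any((x <= a and y <= b) or (y <= a and x <= b) for a, b in splits)


def issue_cycles(widths, splits):
    """Greedy count of issue cycles for independent ops of the given widths.

    widths: number of components each op writes (1 to 4).
    splits: allowed co-issue splits of the 4-wide ALU, e.g. {(3, 1)}.
    """
    remaining = sorted(widths, reverse=True)
    cycles = 0
    while remaining:
        first = remaining.pop(0)
        # Try to find a partner that can be co-issued in the same cycle.
        for i, other in enumerate(remaining):
            if can_pair(first, other, splits):
                remaining.pop(i)
                break
        cycles += 1
    return cycles


SPLIT_3_1 = {(3, 1)}                   # the "3:1 split only" claim
SPLIT_3_1_OR_2_2 = {(3, 1), (2, 2)}    # the vec3+scalar or vec2+vec2 claim

four_vec2_ops = [2, 2, 2, 2]           # e.g. four different .xy-masked ops
print(issue_cycles(four_vec2_ops, SPLIT_3_1))
print(issue_cycles(four_vec2_ops, SPLIT_3_1_OR_2_2))
```

In this model four different `.xy` ops take 4 cycles with a 3:1 split only and 2 cycles when a 2:2 split is allowed, which is the gap being argued over. If the compiler can merge identical vec2 ops into one wider op, as described above, the 3:1-only case improves for that special case.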
So, R420 is as flexible as NV4x in terms of combining vector and scalar ops, but R420 has a more limited issue capacity because the second vector ALU is PS1.4.
There's an interesting thread on this already. Tridam pointed out that NV40 can execute independent math instructions while waiting for a texture fetch, but obviously stalls with dependent math. R3xx/4xx does not stall on dependent math, only on dependent texture reads.
These results from GPUBench:
http://graphics.stanford.edu/projects/gpubench/results/6800Ultra-7189/
http://graphics.stanford.edu/projects/gpubench/results/X800XT-4955-PCIe/
Specifically the Fetch Costs, e.g.:
6800 Ultra: [fetch cost graph]
X800XTPE: [fetch cost graph]
Note how the X800's graph would pass through the origin, whereas 6800 Ultra's doesn't.