G94 vs RV670 - Which one is more future-proof?

It's worth noting that the 16-wide SIMD is only true for R600 (and RV670), not for the lower-end parts (which are 12- or 8-wide there). G80 doesn't look scalable that way.
Yes. The CUDA angle, where the SIMD width of a GPU has very direct performance ramifications, prolly acts as the determinant here. I don't know if it'd be possible to detect any particular effect in graphics shaders, because the way memory and registers are used in graphics is somewhat different (less precisely controllable). I don't know how sensitive the ATI GPUs are to this kind of scaling in GPGPU.

NVidia doesn't have the "high" ALU:TEX ratio to "cut-back" that ATI does (though NVidia's ratio is higher than it appears for a variety of reasons). I'm intrigued to see whether NVidia actually does go for an explicitly higher ALU:TEX ratio in the next generation (based on functional unit counts). They could just double the number of clusters (but, ahem, what's the use of 128 TMUs in a single GPU constrained by ~120-150GB/s?).

I thought G80 is 16-wide too, though. I don't think the half-clusters can run different instructions - or can they? I never saw what the purpose of the 8x2 internal cluster arrangement was.
They are indeed separate. The width may well be an artefact of G80 being the initial GPU of the architecture. Or it could be that making a 16-wide ALU was more than NVidia wanted to do, given the very wide datapaths required. It prolly increases the effective ALU:TEX ratio (compared with a single ALU that's twice as wide): for a given ratio of register file:threads, the split arrangement will increase the chance that a thread requires a TEX operation, maximising TMU utilisation, which in turn reduces the count of threads needed to hide texturing latency. And a stall in one ALU won't incur a stall in the other - or if you prefer, the split arrangement is just more finely-grained.
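
To put rough numbers on that latency-hiding argument, here's a minimal Python sketch - all the figures (fetch latency, ALU:TEX ratio, cycles per instruction) are illustrative assumptions, not measured G80 values:

Code:
# Latency hiding, back-of-envelope. All values are illustrative assumptions.
TEX_LATENCY_CYCLES = 200   # assumed texture fetch round-trip latency
ALU_INSTRS_PER_TEX = 8     # assumed ALU:TEX instruction ratio in the shader
CYCLES_PER_INSTR   = 4     # an 8-wide ALU takes 4 clocks per instruction on a 32-pixel thread

# How long one thread keeps the ALU busy between consecutive TEX fetches.
busy_cycles = ALU_INSTRS_PER_TEX * CYCLES_PER_INSTR

# Threads needed so the ALU stays busy while one thread waits on its fetch:
# ceil(latency / busy_cycles) runnable threads, plus the one that's waiting.
threads_needed = -(-TEX_LATENCY_CYCLES // busy_cycles) + 1

print(f"{busy_cycles} busy cycles/thread -> ~{threads_needed} threads to hide "
      f"{TEX_LATENCY_CYCLES} cycles of TEX latency")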

But future NVidia GPUs appear destined to work on the basis of 32-object threads (based on CUDA noises). So, 16-wide MAD ALUs or make the pipeline run each instruction for 4 clocks instead of 2?

And, I thought the SF ALU has the same width as the normal shader unit (it pretty much has to, otherwise you couldn't co-issue mad+mul every clock), just needs 4 cycles for most functions (sin, cos etc.) - not sure what the interpolation rate is.
Interpolation rate is 1 per clock per lane, but it actually produces 4 results in parallel, so a total of 8 interpolations per clock. This still means a 2D texture fetch for a "thread" of 32 pixels takes 8 interpolation-instruction issues.
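
For reference, the arithmetic behind that issue count (using the rates as stated above):

Code:
# Interpolation issue count for a 2D fetch, using the rates stated above.
PIXELS_PER_THREAD   = 32   # batch size
COORDS_PER_2D_FETCH = 2    # u and v both need interpolating
INTERP_PER_CLOCK    = 8    # total interpolation results per clock

total = PIXELS_PER_THREAD * COORDS_PER_2D_FETCH   # 64 interpolations
print(f"{total} interpolations -> {total // INTERP_PER_CLOCK} instruction issues")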

I presume this ALU is also involved in the generation (interpolation) of Z for MSAA samples. I don't remember any explicit official discussion of this, but the original paper on the multifunction interpolator described this functionality.

I'm still not aware of any decent evidence that points to MAD+MUL co-issue in G80 or later GPUs. A math-heavy shader like the 3DMk06 Perlin noise test runs at identical throughput per clock/per unit, yet the MAD+MUL theorists maintain that this functionality is available in all GPUs since G80. Well, G80 and G92 show no difference in per-clock/per-unit performance at all...

Jawed
 
(based on functional unit counts). They could just double the number of clusters (but, ahem, what's the use of 128 TMUs in a single GPU constrained by ~120-150GB/s?).
128*600*0.5 = 38.4GB/s. What's the problem? :) Now, of course, if we used any format other than DXT1/ATI1N (or DXT5/3DC, for which you should double that figure), then with INT8 or FP10 you've got 128*600*4 = 307.2GB/s. Which isn't quite as viable...
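
The same arithmetic as a reusable snippet, in Python for convenience (the 600MHz clock and 128 TMUs are the hypothetical figures from above):

Code:
# Texel bandwidth = TMUs * clock * bytes fetched per texel.
TMUS     = 128     # hypothetical doubled-cluster part, as above
CLOCK_HZ = 600e6   # assumed core clock

def texel_bandwidth_gbps(bytes_per_texel):
    return TMUS * CLOCK_HZ * bytes_per_texel / 1e9

print(f"DXT1/ATI1N (0.5 B/texel): {texel_bandwidth_gbps(0.5):.1f} GB/s")  # 38.4
print(f"DXT5/3DC   (1.0 B/texel): {texel_bandwidth_gbps(1.0):.1f} GB/s")  # 76.8
print(f"INT8/FP10  (4.0 B/texel): {texel_bandwidth_gbps(4.0):.1f} GB/s")  # 307.2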

I really think it makes sense to keep a fast path for compressed textures, and seemingly not taking them into consideration is the major design flaw in R6xx's TMUs. What I would indeed argue, however, is that the linear scaling from INT8 to FP16/FP32 is fairly massive overkill in G8x/G9x. My opinion is that the best short/near-term design is as follows:
- Shared INT16 units for addressing & filtering (ala G92 afaict); only have as many as needed for full-speed INT8 addressing, or even less if we want free or semi-free trilinear.
- Special INT8 path (useful primarily for compressed textures) that goes at more than 2x the speed of INT16/FP16/FP10. Doesn't reuse the addressing logic.
- FP32 filtering done in the shader core; doesn't have to be the fastest thing ever so incremental bus cost should be OK.

But future NVidia GPUs appear destined to work on the basis of 32-object threads (based on CUDA noises). So, 16-wide MAD ALUs or make the pipeline run each instruction for 4 clocks instead of 2?
Probably worth making this a bit clearer: you've got multiprocessors with two execution units and one scheduler that can only run one instruction/cycle. In PS mode where interpolation is frequent, they need to be able to co-issue so they make the batch size longer (i.e. scheduler is free to alternate between the two ALUs because each ALU is filled up for two of its own cycles). In VS mode, you only need special function which isn't that frequent so you might as well let the MADD ALU idle when you need it to achieve a smaller batch size.
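
A toy Python model of that scheduling idea - this is just my reading of the mechanism, not a hardware-accurate simulator; "occupancy" is how many scheduler cycles one issued instruction keeps its unit busy:

Code:
# Toy model: one scheduler issuing 1 instruction/cycle to two units.
def simulate(cycles, occupancy):
    busy = {"MAD": 0, "SF": 0}   # cycles each unit remains occupied
    work = {"MAD": 0, "SF": 0}   # total busy cycles accumulated
    turn = 0
    for _ in range(cycles):
        unit = ("MAD", "SF")[turn % 2]
        if busy[unit] == 0:      # unit free this cycle: issue to it
            busy[unit] = occupancy
            turn += 1
        for u in busy:           # advance both units one cycle
            if busy[u]:
                busy[u] -= 1
                work[u] += 1
    return {u: round(work[u] / cycles, 2) for u in work}

print("long batch (2 cycles/issue): ", simulate(1000, 2))  # both units ~fully busy
print("short batch (1 cycle/issue):", simulate(1000, 1))   # each unit ~50% busy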

I presume this ALU is also involved in the generation (interpolation) of Z for MSAA samples. I don't remember any explicit discussion of this concept officially, though, but the original paper on the multifunction interpolator described this functionality.
Hmm, maybe, although I'd be worried about pathing issues there. Reusing the same unit for Z even though it doesn't need special-function processing would also seem weird to me; I'll have to think about it.

I'm still not aware of any decent evidence that points to MAD+MUL co-issue in G80 or later GPUs.
G80/G84/G92 can't co-issue, G86 can. As for G94, I *might* know if NV stopped acting like drunken monkeys and Rys actually got a board.

A math-heavy shader like the 3DMk06 Perlin noise test runs at identical throughput per clock/per unit, yet the MAD+MUL theorists maintain that this functionality is available in all GPUs since G80.
It *is* available, as clearly indicated by one of the early 100.xx driver releases we tested, where they were presumably desperate to get a slight performance boost in Vista. Fact remains that G80 never could issue two MULs per clock though, only somewhat more than one, and the rates I got clearly implied a register file and/or register-to-ALU transfer limitation. As for why it's gone, who knows - the perf boost probably was quite low because of that limitation, and perhaps there were scheduling constraints that made it not worth the trouble.

Well, G80 and G92 show no difference in per-clock/per-unit performance at all...
Indeed, they don't. If G94 does, then I'd be hopeful that a 55nm shrink of G92 might be based on that synthesis rather than G92's. Who knows if that's the case though, bah!
 
128*600*0.5 = 38.4GB/s. What's the problem? :) Now, of course, if we used any format other than DXT1/ATI1N (or DXT5/3DC, for which you should double that figure), then with INT8 or FP10 you've got 128*600*4 = 307.2GB/s. Which isn't quite as viable...
Actually, I was working on G94 scaled up! Why have 4x G94's TMUs, with twice the bandwidth?

I really think it makes sense to keep a fast path for compressed textures, and seemingly not taking them into consideration is the major design flaw in R6xx's TMUs.
R6xx actually treats all textures as fp16 - it converts int8 texels to fp16!

What I would indeed argue, however, is that the linear scaling from INT8 to FP16/FP32 is fairly massive overkill in G8x/G9x.
I think G94 shows the path forwards, its realworld performance "per TMU" is great in comparison with G92. Though you could argue that 6 clusters would be the perfect configuration for ~64GB/s, since 8800GTS-512 is up to 50% faster. If G92 had been 6 clusters it would have been way more impressive.
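
For what it's worth, a quick per-TMU bandwidth comparison using commonly quoted board specs (these numbers are my assumptions, not from this thread):

Code:
# Commonly quoted board specs (my assumptions, not from this thread).
boards = {
    "9600GT (G94)":      {"tmus": 32, "bw_gbps": 57.6},
    "8800GTS-512 (G92)": {"tmus": 64, "bw_gbps": 62.1},
}
for name, b in boards.items():
    print(f"{name}: {b['bw_gbps'] / b['tmus']:.2f} GB/s per TMU")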

My opinion is that the best short/near-term design is as follows:
- Shared INT16 units for addressing & filtering (ala G92 afaict); only have as many as needed for full-speed INT8 addressing, or even less if we want free or semi-free trilinear.
Are you suggesting changes for R6xx?

- Special INT8 path (useful primarily for compressed textures) that goes at more than 2x the speed of INT16/FP16/FP10. Doesn't reuse the addressing logic.
If you're talking about G9x onwards, I suspect merely increasing ALU:TEX by a significant factor will make any issues here disappear.

Hmm, maybe, although I'd be worried about pathing issues there. Reusing the same unit for Z even though it doesn't need special-function processing would also seem weird to me; I'll have to think about it.
It's a shame NVidia's been coy about this. This is yet another step on the road to generalisation of the graphics pipeline, rather than using fixed-function units.

It *is* available, as clearly indicated by one of the early 100.xx driver releases we tested, where they were presumably desperate to get a slight performance boost in Vista. Fact remains that G80 never could issue two MULs per clock though, only somewhat more than one, and the rates I got clearly implied a register file and/or register-to-ALU transfer limitation.
I have to admit I was wondering the same this morning as I woke up.

Jawed
 
Actually, I was working on G94 scaled up! Why have 4x G94's TMUs, with twice the bandwidth?
Because the TMUs aren't necessarily bandwidth-limited; the ROPs nearly always are, though. 4x the TMUs and 2x the ROPs, compared to G94 (or 2x the TMUs/2x the ROPs compared to G92) seems reasonable to me.
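
To illustrate why the ROPs are nearly always the bandwidth-limited part, a back-of-envelope demand figure for a hypothetical 16-ROP part (all numbers illustrative):

Code:
# Peak ROP write demand: ROPs * clock * bytes per pixel. Illustrative only.
ROPS, CLOCK_HZ = 16, 650e6
BYTES_PER_PIXEL = 4 + 4   # 4B colour + 4B Z, before blending/MSAA multipliers

demand_gbps = ROPS * CLOCK_HZ * BYTES_PER_PIXEL / 1e9
print(f"Peak ROP demand: {demand_gbps:.1f} GB/s")   # ~83 GB/s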

I think G94 shows the path forwards, its realworld performance "per TMU" is great in comparison with G92.
I think it's more of a function of having too many TMUs and ALUs compared to the input assembly/triangle setup parts of the pipeline, and potentially compared to the ROPs. Certainly bandwidth is playing a role here also, but I'm not sure it's the biggest contributor to what's happening here. Who knows though...

Though you could argue that 6 clusters would be the perfect configuration for ~64GB/s, since 8800GTS-512 is up to 50% faster. If G92 had been 6 clusters it would have been way more impressive.
Uhm, remember that every part of a frame can be limited by something else. While you may very well be right, you can't really be certain of that without much more complex data and analysis.

Are you suggesting changes for R6xx?
No, just an 'ideal TMU' for the short/mid-term in my mind. Arguably what I'm describing is slightly nearer G8x/G9x than RV6xx, though.

If you're talking about G9x onwards, I suspect merely increasing ALU:TEX by a significant factor will make any issues here disappear.
Are we talking about the same thing? I'm referring to interpolation, not addressing. You could increase the number of ALUs by a billion orders of magnitude and it wouldn't change anything here (although it'd certainly result in a performance boost!)

It's a shame NVidia's been coy about this. This is yet another step on the road to generalisation of the graphics pipeline, rather than using fixed-function units.
Uhm, what? Are you implying that AMD has a patent on this? Because the only patent I know about this is from Erik Lindholm (and someone else who, I think, is one of NV's top TMU guys), who was leading the G80 shader core team.

I have to admit I was wondering the same this morning as I woke up.
Hah! :)
 
Because the TMUs aren't necessarily bandwidth-limited; the ROPs nearly always are, though. 4x the TMUs and 2x the ROPs, compared to G94 (or 2x the TMUs/2x the ROPs compared to G92) seems reasonable to me.
Clearly I have little hope against this obvious configuration, given G92's seeming superfluity of TMUs for the bandwidth available.

I think it's more of a function of having too many TMUs and ALUs compared to the input assembly/triangle setup parts of the pipeline, and potentially compared to the ROPs. Certainly bandwidth is playing a role here also, but I'm not sure it's the biggest contributor to what's happening here. Who knows though...
Well, radically changing IA/setup rates per clock seems pretty unlikely. Famous last words, hahaha.

Uhm, remember that every part of a frame can be limited by something else. While you may very well be right, you can't really be certain of that without much more complex data and analysis.
I hope someone does this, scaling up from 8600GTS through 9600GT to 8800GTS-512...

Are we talking about the same thing? I'm referring to interpolation, not addressing. You could increase the number of ALUs by a billion orders of magnitude and it wouldn't change anything here (although it'd certainly result in a performance boost!)
I'm assuming interpolation rate would scale with ALU:TEX ratio if the ALUs are multiplied. You want SF to scale with ALU, so if ALU:TEX increases, then you automatically have increased interpolation rate per TEX.

Uhm, what? Are you implying that AMD has a patent on this? Because the only patent I know about this is from Erik Lindholm (and someone else who, I think, is one of NV's top TMU guys), who was leading the G80 shader core team.
:???: I can't work out why you think I'm referring to AMD here. The SF/MI slides were pretty clear about Z. I'm just puzzled why it's been left as a "footnote", a single sentence "Multi-bit fractions provide a large grid used for multi-sampling based antialiasing"...

Jawed
 
Well, radically changing IA/setup rates per clock seems pretty unlikely. Famous last words, hahaha.
For any future G9x, sure. For GT200, why not? Ideally you'd just run triangle setup on the ALUs (ala Intel) and make sure the rasterization & compression hardware can handle relatively high numbers of triangles/clock.
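
For the curious, triangle setup really is just ALU-friendly math - a minimal sketch of computing edge-equation coefficients from three screen-space vertices (standard rasterization math, not any vendor's actual implementation):

Code:
# Edge-equation setup: coefficients (A, B, C) per edge such that
# A*x + B*y + C >= 0 for points inside a counter-clockwise triangle.
def setup_edges(v0, v1, v2):
    edges = []
    for (x0, y0), (x1, y1) in ((v0, v1), (v1, v2), (v2, v0)):
        edges.append((y0 - y1, x1 - x0, x0 * y1 - x1 * y0))
    return edges

print(setup_edges((0, 0), (4, 0), (0, 4)))
# Rasterization then just evaluates these three linear equations per sample.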

I hope someone does this, scaling up from 8600GTS through 9600GT to 8800GTS-512...
That'd certainly be interesting, and probably wouldn't be too awfully time-intensive for automated benchmarks. Hmm.

I'm assuming interpolation rate would scale with ALU:TEX ratio if the ALUs are multiplied. You want SF to scale with ALU, so if ALU:TEX increases, then you automatically have increased interpolation rate per TEX.
Yes, but I was talking about *addressing*, not interpolation. You know, the kind of unit G80 has 32 of and G92 has 64 of?

:???: I can't work out why you think I'm referring to AMD here. The SF/MI slides were pretty clear about Z. I'm just puzzled why it's been left as a "footnote", a single sentence "Multi-bit fractions provide a large grid used for multi-sampling based antialiasing"...
Oops, sorry! I misread and thought you were responding to using the shader core for texture filtering, rather than Z. So obviously my response made no sense... :)

Regarding that footnote, AFAICT it doesn't confirm that the same units are used for Z interpolation. The reason why you need multi-bit fractions for non-Z attributes when multisampling is that the sample position for color (on R300/NV40+) may be the centroid (average of the used samples' locations), rather than the center of the pixel. I believe this implies that you need as much (and probably more) precision for non-Z attribute interpolation.
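
A minimal sketch of the centroid rule described here, with made-up 4x sample positions (actual hardware sample grids differ):

Code:
# Centroid sampling: evaluate attributes at the average of covered samples.
SAMPLE_POS = [(-0.25, -0.25), (0.25, -0.25), (-0.25, 0.25), (0.25, 0.25)]

def attribute_sample_point(coverage):
    """coverage: one bool per sample; assumes at least one sample is covered."""
    covered = [p for p, hit in zip(SAMPLE_POS, coverage) if hit]
    if len(covered) == len(SAMPLE_POS):
        return (0.0, 0.0)        # fully covered: pixel centre suffices
    # Partial coverage: centroid of the covered samples, which generally
    # lands on finer fractions than the pixel centre.
    xs, ys = zip(*covered)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

print(attribute_sample_point([True, True, True, True]))    # (0.0, 0.0)
print(attribute_sample_point([True, False, True, False]))  # (-0.25, 0.0)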
 
For any future G9x, sure. For GT200, why not? Ideally you'd just run triangle setup on the ALUs (ala Intel) and make sure the rasterization & compression hardware can handle relatively high numbers of triangles/clock.
It's a question of priorities I suppose - an assessment that's still pretty woolly I think. Stuff for the GT200 thread.

Yes, but I was talking about *addressing*, not interpolation. You know, the kind of unit G80 has 32 of and G92 has 64 of?
Er, you just said "Are we talking about the same thing? I'm referring to interpolation, not addressing. You could increase the number of ALUs by a billion orders of magnitude and it wouldn't change anything here (although it'd certainly result in a performance boost!)" - and as far as I can tell a G92-derivative is not going to get any faster at int8 texturing by doing anything within the TMUs, when it appears to be some mix of a) generally over-specified with an excess of int8 capability and b) seemingly interpolation-bound.

Regarding that footnote, AFAICT it doesn't confirm that the same units are used for Z interpolation. The reason why you need multi-bit fractions for non-Z attributes when multisampling is that the sample position for color (on R300/NV40+) may be the centroid (average of the used samples' locations), rather than the center of the pixel. I believe this implies that you need as much (and probably more) precision for non-Z attribute interpolation.
Even with MSAA turned off you still need the centroid for each pixel. Also, bear in mind the diagrams give pixel offsets (dXi,dYi) relative to the centre of the quad (Xc,Yc), so the centroid adjustment is implicit.

So, I don't see how "centroids" explains the mention of MSAA, as the precision required for centroid-based pixel shading isn't affected by MSAA. Indeed the whole basis of MSAA is that the pixel's precise coordinates for colour calculations are unaffected by the state of MSAA (off or on).

Jawed
 
Even with MSAA turned off you still need the centroid for each pixel. Also, bear in mind the diagrams give pixel offsets (dXi,dYi) relative to the centre of the quad (Xc,Yc), so the centroid adjustment is implicit.

So, I don't see how "centroids" explains the mention of MSAA, as the precision required for centroid-based pixel shading isn't affected by MSAA. Indeed the whole basis of MSAA is that the pixel's precise coordinates for colour calculations are unaffected by the state of MSAA (off or on).
:oops: Hmm, no, that's completely wrong.

OK, so MSAA requires increased precision for interpolation, and yes that could be the end of it. Oh well.

Jawed
 
I think it's more of a function of having too many TMUs and ALUs compared to the input assembly/triangle setup parts of the pipeline, and potentially compared to the ROPs. Certainly bandwidth is playing a role here also, but I'm not sure it's the biggest contributor to what's happening here. Who knows though...

Well, well, looks like I'm not the only person on the planet who thinks that the G80/G92 cards (leaving out the GTS-640 here) are mostly bottlenecked by the assembly/setup part, or possibly the thread scheduler - it was starting to get lonely :smile:
I got that idea from tests with more recent games where the four cards performed almost identically at sub-19x12 resolutions - I see no other explanation than the front of the pipeline bottlenecking, with some balancing effect from memory bandwidth. Then the 9600GT's performance added more fuel to this idea.

If this is really the case, then the G94 chip is anything but future-proof, as you'd probably need to replace that part of the chip completely to get better performance without a serious clock speed boost.
The RV670 looks much prettier - although I'd really like to know whether it's mostly bottlenecked by its low texture filtering power, or whether there are some other dark secrets...
 
Lower serial performance for an ALU-only shader -> fewer registers per cycle of ALU latency -> likely a new, higher-latency ALU design & synthesis rather than reusing G92's, IMO... ;)
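
That reasoning chain as arithmetic, with illustrative numbers (the 8192-register file matches what NVidia documents for G80 multiprocessors in CUDA; the latencies are guesses):

Code:
# Registers per pixel as a function of ALU register-to-register latency.
REGISTER_FILE = 8192   # 32-bit registers per multiprocessor (per CUDA docs)
THREAD_SIZE   = 32     # pixels per thread
CYCLES_PER_INSTR = 4   # issue interval of one thread's serial instructions

for latency in (8, 12):  # hypothetical old vs new ALU latencies
    threads = -(-latency // CYCLES_PER_INSTR)        # threads to cover the gap
    regs = REGISTER_FILE // (threads * THREAD_SIZE)
    print(f"latency {latency}: {threads} threads in flight, "
          f"{regs} registers per pixel")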
 
Code:
       PixelShader = asm {
            //
            // Generated by Microsoft (R) HLSL Shader Compiler
            //
            //
            //
            // Input signature:
            //
            // Name             Index   Mask Register SysValue Format   Used
            // ---------------- ----- ------ -------- -------- ------ ------
            // val                  0   xyzw        0     NONE  float   xyzw
            // val                  1   xyzw        1     NONE  float   xyzw
            //
            //
            // Output signature:
            //
            // Name             Index   Mask Register SysValue Format   Used
            // ---------------- ----- ------ -------- -------- ------ ------
            // SV_Target            0   xyzw        0     NONE  float   xyzw
            //
            ps_4_0
            dcl_input linear v0.xyzw
            dcl_input linear v1.xyzw
            dcl_output o0.xyzw
            dcl_temps 2
            mul r0.x, v0.x, v0.x
            mul r0.x, r0.x, r0.x
            // ... 62 more identical dependent muls (64 mul instructions in total)
            mad r1.x, v0.y, v0.y, v0.y
            mad r1.x, r1.x, r1.x, r1.x
            // ... 61 more identical dependent mads, then (64 mad instructions in total):
            mad r0.y, r1.x, r1.x, r1.x
            min r1.x, v0.z, v1.y
            min r1.x, r1.x, v1.y
            // ... 61 more identical dependent mins, then (64 min instructions in total):
            min r0.z, r1.x, v1.y
            max r1.x, v0.w, v1.y
            max r1.x, r1.x, v1.y
            // ... 61 more identical dependent maxes, then (64 max instructions in total):
            max r0.w, r1.x, v1.y
            sqrt r1.x, v1.x
            sqrt r1.x, r1.x
            // ... 62 more identical dependent sqrts (64 sqrt instructions in total)
            mov r1.yzw, v1.yyzw
            dp4 r0.x, r0.xyzw, r1.xyzw
            mov o0.xyzw, r0.xxxx
            ret 
            // Approximately 324 instruction slots used
                    
        };
 
Where did you get that program from? :) Anyhow, for the MUL to be faster than 1/cycle on G86, it needs to be parallel to another instruction rather than serial (the multiprocessor is dual-issue...) - is there a test in there that exercises that?
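
One way to generate both variants for such a test - a hypothetical helper emitting asm text like the listing above, with a serial dependency chain versus independent MULs that leave room for dual-issue:

Code:
# Emit ps_4_0-style asm text: a dependent MUL chain vs independent MULs.
def mul_chain(n, serial=True):
    lines = ["mul r0.x, v0.x, v0.x"]
    for i in range(1, n):
        if serial:   # each MUL reads the previous result: no room to dual-issue
            lines.append("mul r0.x, r0.x, r0.x")
        else:        # independent MULs across components: dual-issue candidates
            c = "xyzw"[i % 4]
            lines.append(f"mul r0.{c}, v0.{c}, v0.{c}")
    return "\n".join(lines)

print(mul_chain(4, serial=True))
print(mul_chain(4, serial=False))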
 
I've normalised for ALU clock rate:

Code:
G94 as % of G92             G94PS  G94VS
Float MAD serial               87    126
Float4 MAD parallel           100    126
SQRT serial                   103    126
float 5-instruction issue      93    126
int MAD serial                 97    126
int4 MAD parallel             100     92

Code:
VS as % of PS                 G94    G92
Float MAD serial               11      7
Float4 MAD parallel            57     46
SQRT serial                    60     50 
float 5-instruction issue      22     16
int MAD serial                 50     39
int4 MAD parallel              69     75

Jawed
 