When on Tuesday does the G70 NDA expire?

Pete said:
No, nothing wrong with an extra ADD, as the benchmarks show. People keep saying "MADD + MUL," though. Tridam was pointing out that they're misusing the terms.

MADD + MUL = MUL + ADD + MUL.

G70 pipe = 2 * (MUL + ADD) = 2 * MADD. There's no extra "+ ADD" after MADD. MADD stands for MUL + ADD, not, say, matrix addition (what else are people guessing? :)).

Just FYI, there is a big difference between a MUL + ADD and a MADD/MAC. In an architecture with a separate MUL and ADD you have a lot more flexibility in the use of the actual hardware, whereas with a MAC, if you don't need the ADD or MUL portion for a particular operation, it goes to waste.

There are plenty of examples of architectures with 2 MACs that weren't any faster than architectures with 1 Mul and 1 Add.

There is research showing that, in general, for the same register read port requirements, you are better off with 2 MULs and 1 ADD than with 2 MACs.
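To make the read-port comparison concrete, here is a rough sketch. It assumes each unit reads all of its source operands from the register file every cycle (2 for a MUL or an ADD, 3 for a MAC) and ignores operand sharing, constants and write ports. The names `READS`, `read_ports` and the two configurations are made up for illustration and don't describe any specific chip.

```python
# Source operands read per issued operation (a*b, a+b, a*b+c).
READS = {"MUL": 2, "ADD": 2, "MAC": 3}

def read_ports(units):
    """Total register read ports needed if every unit issues each cycle."""
    return sum(READS[u] for u in units)

two_macs = ["MAC", "MAC"]
two_muls_one_add = ["MUL", "MUL", "ADD"]

for name, units in [("2 MACs", two_macs), ("2 MULs + 1 ADD", two_muls_one_add)]:
    print(f"{name}: {read_ports(units)} read ports, "
          f"{len(units)} independent ops per cycle")
```

Under those assumptions both layouts need six read ports, but the second can start three independent operations per cycle instead of two.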


Aaron Spink
speaking for myself.
 
Is it arguable that a dual-MAC pipeline is easier to create a compiler for than a MUL and MAC pipeline? There are, in effect, more combinations of dual-issuable instructions to pick from.

NVidia appears to have dropped specific shader code replacements in its drivers for 7800GTX and I'm wondering if that's a side-effect of the new pipeline being somewhat less restrictive in terms of dual-issue capabilities.

Jawed
 
aaronspink said:
Pete said:
No, nothing wrong with an extra ADD, as the benchmarks show. People keep saying "MADD + MUL," though. Tridam was pointing out that they're misusing the terms.

MADD + MUL = MUL + ADD + MUL.

G70 pipe = 2 * (MUL + ADD) = 2 * MADD. There's no extra "+ ADD" after MADD. MADD stands for MUL + ADD, not, say, matrix addition (what else are people guessing? :)).

Just FYI, there is a big difference between a MUL + ADD and a MADD/MAC. In an architecture with a separate MUL and ADD you have a lot more flexibility in the use of the actual hardware, whereas with a MAC, if you don't need the ADD or MUL portion for a particular operation, it goes to waste.

There are plenty of examples of architectures with 2 MACs that weren't any faster than architectures with 1 Mul and 1 Add.

There is research showing that, in general, for the same register read port requirements, you are better off with 2 MULs and 1 ADD than with 2 MACs.


Aaron Spink
speaking for myself.

GeForce 6/7's shader units can't output more than 4 components. So when doing a MUL, the ADD can't work on another instruction (when it can, it becomes a MAD of course). MUL and ADD in a GeForce 6/7 shader unit are not independent math units. But they can do more than ADD, MUL and MAD: DP, for example.

The extra adder is a small change but it brings more than it costs. Instruction scheduling now seems more efficient (I saw it in many tests) and I think that's the reason for most of the shader performance improvement per pipeline.
 
You learn something new every day hanging around here. Well, except for the days when we're debating which IHV's more conniving. Thanks, Aaron and Damien.
 
Pete said:
You learn something new every day hanging around here. Well, except for the days when we're debating which IHV's more conniving. Thanks, Aaron and Damien.

Hmm, I thought we usually debated which IHV is least conniving.
 
Tridam said:
GeForce 6/7's shader units can't output more than 4 components. So when doing a MUL, the ADD can't work on another instruction (when it can, it becomes a MAD of course). MUL and ADD in a GeForce 6/7 shader unit are not independent math units. But they can do more than ADD, MUL and MAD: DP, for example.

The GeForce 6 contains 1 MAC and 1 MUL if I remember correctly. The GeForce 7 contains 2 MACs. I assume that they are independently connected to the register file, but this may or may not be correct. Has anyone done any testing to see if they are independent or cascaded?

You are correct that they have limited support for tree operations (specifically DP), but this actually requires only minor additional hardware.

FYI, I know the difference between a MAC and independent MULs and ADDs; I do design uP in my copious spare time. I was responding to the post that was a little unclear in talking about a MAC as a MUL + ADD. While a MAC performs a MUL operation and an ADD operation, it is cascaded, and generally a round step isn't performed after the MUL portion (there are some architectures where the MAC is truly a MADD, but they are rare).
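The missing round step can be shown numerically with a small sketch. It uses Python doubles and exact rationals as a stand-in for hardware, so it only illustrates the idea and says nothing about what any GPU actually does; `mul_then_add` and `mac_single_round` are made-up names.

```python
from fractions import Fraction

def mul_then_add(a, b, c):
    """Separate MUL and ADD: the product is rounded, then the sum is rounded."""
    return a * b + c

def mac_single_round(a, b, c):
    """Cascaded MAC: a*b + c is computed exactly and rounded only once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30   # exact product a*b is 1 - 2**-60
c = -1.0

print(mul_then_add(a, b, c))      # product rounds to 1.0 first, so the tiny term is lost
print(mac_single_round(a, b, c))  # the 2**-60 term survives the single rounding
```

With these inputs the two-step version returns 0.0 while the single-round version returns -2**-60, so the two are not interchangeable bit for bit.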


The extra adder is a small change but it brings more than it costs. Instruction scheduling now seems more efficient (I saw it in many tests) and I think that's the reason for most of the shader performance improvement per pipeline.

Instruction scheduling should have improved because having 2 MACs instead of 1 MAC and 1 MUL means that you can schedule them as 2 ADDs, 2 MULs, or 2 MACs. The real question is how much of a benefit there is in having 2 MACs instead of, say, 2 MULs and 1 ADD, or even 3 ALUs that can each do either 1 MUL or 1 ADD. The reason this is an interesting question is that register file ports can become a critical issue, and for the same number of ports it is possible to have a more flexible design without using MACs.
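A toy issue model makes the scheduling point easier to see. This is only a sketch under strong assumptions: every op is independent, each unit issues one op per cycle, and register ports, co-issue rules and special functions are ignored. The function `cycles`, the unit sets and the op stream are all hypothetical.

```python
from collections import Counter

MAC_UNIT = frozenset({"MUL", "ADD", "MAD"})  # a MAC can also do a lone MUL or ADD
MUL_UNIT = frozenset({"MUL"})

def cycles(ops, units):
    """Greedily issue independent ops, one per unit per cycle; return the cycle count."""
    remaining = Counter(ops)
    if any(not any(op in u for u in units) for op in remaining):
        raise ValueError("some op is not supported by any unit")
    n = 0
    while sum(remaining.values()) > 0:
        # Restricted units pick first so flexible units keep their options open.
        for supported in sorted(units, key=len):
            candidates = [op for op in supported if remaining[op] > 0]
            if candidates:
                # Prefer the op kind that the fewest units can execute.
                op = min(candidates, key=lambda o: sum(o in u for u in units))
                remaining[op] -= 1
        n += 1
    return n

stream = ["MUL", "ADD", "ADD", "MAD", "MUL", "ADD"]
print("1 MAC + 1 MUL:", cycles(stream, [MAC_UNIT, MUL_UNIT]))
print("2 MACs:       ", cycles(stream, [MAC_UNIT, MAC_UNIT]))
```

For this particular stream the 1 MAC + 1 MUL layout takes 4 cycles because the ADDs and the MAD all queue on the single MAC, while 2 MACs finish in 3. A stream made only of MULs would show no difference.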


Aaron Spink
speaking for myself inc.
 
aaronspink said:
Instruction scheduling should have improved because having 2 MACs instead of 1 MAC and 1 MUL means that you can schedule them as 2 ADDs, 2 MULs, or 2 MACs. The real question is how much of a benefit there is in having 2 MACs instead of, say, 2 MULs and 1 ADD, or even 3 ALUs that can each do either 1 MUL or 1 ADD. The reason this is an interesting question is that register file ports can become a critical issue, and for the same number of ports it is possible to have a more flexible design without using MACs.

Scheduling has also improved because they can now schedule many special functions better. Scheduling and compiler efficiency are now way better. 3 ALUs that can each do 1 MUL or 1 ADD would have been even more efficient, but it would have required more transistors and more register accesses to reach that efficiency.

There are many limitations on issuing two 4D MADs on the 7800. The main one is of course register ports (in practice there are only a few cases where it is possible to issue two 4D MADs), but there are other limitations.



FYI, I know the difference between a MAC and independent MULs and ADDs; I do design uP in my copious spare time. I was responding to the post that was a little unclear in talking about a MAC as a MUL + ADD. While a MAC performs a MUL operation and an ADD operation, it is cascaded, and generally a round step isn't performed after the MUL portion (there are some architectures where the MAC is truly a MADD, but they are rare).

I know that you know ;) But you were unclear about what exactly was in the 6800/7800. Just wanted to clarify that.


The GeForce 6 contains 1 MAC and 1 MUL if I remember correctly. The GeForce 7 contains 2 MACs. I assume that they are independently connected to the register file, but this may or may not be correct. Has anyone done any testing to see if they are independent or cascaded?

Independent.
 