What is the chance that the NV36 has improvements?

Sxotty

Just wondering, what is the chance that they provided enough resources for it to actually function at a reasonable rate in FP32 mode? Or is there some other way it can function with DX9 compatibility?

Probably little, but does anyone know better?
 
Well...
If the problem regarding registers is latency, and that latency has been created by the lack of Low-K, and that SOI could have the same effect as Low-K ( this is assuming NV36 *does* work on SOI )...

Then, it might have better FP32 performance.

And then, it would also mean I'm really the master of the world in disguise :devilish: 8) :p

( read: Practically impossible it got better FP32 performance )


Uttar
 
Uttar said:
( read: Practically impossible it got better FP32 performance )

I thought the problem was lack of registers. If that was the case, adding some registers would help.

Not sure whether their ISA is amenable to that, however.
 
Wouldn't it be a waste at this point to put more money into the FX line? Better to just put as much time into the NV40 as possible. I doubt the NV36 will be much more than a speed bump to stay close to the Radeons in legacy games and give a little better performance in DX9 games.
 
I don't see any major architectural improvement (I could be wrong, of course), but a speed bump (hopefully larger than ATI's), and tweaks to get better yields and be able to attack ATI on the pricing front.
 
OK, the NV36 is a mid-range part, so it is not even supposed to compete with the NV35, right? That means if they did make it actually function at acceptable speed in FP32 (somewhere close to a 9600 Pro), it would be faster than the NV35 (at least in DX9), right? In any case, I was just wondering if they could drop the color compression stuff and basically replace it with registers, as bandwidth has not been the limiting factor for the NV35 at least.
 
If the NV3x pipeline uses a register file, they have to store somewhere which register is used by which pixel (or quad). The logical choice would be to store that in the register itself. But it could be 'hard-coded' as well by the instruction scheduler (the microcode that translates the opcodes to actual operations). So, say there are 16 registers. In that case there have to be 4 (or 6) bits somewhere that record which register is used by which pixel.

That hints at two interesting questions:

1) Are the registers used by all the pixels in a quad? If so, are they actually 32*4*4 = 512 bits? Or does each pixel have its own registers?

2) How and where is this information used and stored? In the registers themselves or in the operation masks (the low-level opcodes)?

But all in all it does not matter. The hardware probably has a fixed amount of bits ( 4, 6 or 8 ) in all calculations that specify the register to use. Expanding that is not trivial. It would require reworking all the logic of the pipeline.

The most logical choice would be to use 8 bits. We need 2 to specify the part (x,y,z,c/a), 2 bits for the pixel in the quad and 4 bits for the actual register. They're not likely to expand that to 9.
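
To make that concrete, here is a quick C sketch of how such an 8-bit specifier could be packed. The field order and widths are purely my guess:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical 8-bit register specifier along the lines guessed above:
       2 bits for the part (x,y,z,c/a), 2 bits for the pixel in the quad,
       4 bits for the register index.  Field order is pure assumption. */
    static uint8_t pack_reg(unsigned part, unsigned pixel, unsigned reg)
    {
        return (uint8_t)((part & 0x3) | ((pixel & 0x3) << 2) | ((reg & 0xF) << 4));
    }

    int main(void)
    {
        uint8_t spec = pack_reg(2, 1, 5);   /* the .z part of quad pixel 1, register 5 */
        printf("specifier = 0x%02X (part %u, pixel %u, reg %u)\n",
               spec, (unsigned)(spec & 0x3), (unsigned)((spec >> 2) & 0x3),
               (unsigned)(spec >> 4));
        return 0;
    }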

;)

EDIT: another interesting question:

How and where do they store if 16 or 32 bit precision is used? And do they combine the mantissa and offset or split the register in half?
 
DiGuru said:
EDIT: another interesting question:

How and where do they store if 16 or 32 bit precision is used?
If it's anything like how CPU instruction sets are done, the information whether a register stores a single 32 bit value or two packed 16 bit values is "stored" in the instructions that are used. So, for example, there will be an ADD32 instruction that treats the two source registers as 32-bit values, and an ADD16 instruction that treats them as 16-bit values.

Presumably the extra 16-bit registers are architected into the native machine instructions. If we take the NV_fragment_program spec as a clue, NV3x seems to have 32 architected 32-bit registers, and 64 16-bit registers. So we might assume that 32-bit instructions can only address 32 registers (called R0 - R31 in NV_fragment_program), while with the 16-bit instructions you can address 64 (H0 - H63). The important point is, if that scheme is correct, a 16-bit instruction writing to H32 would also write to the upper half of R0 (if you read it with a 32-bit instruction).

Of course it's not clear this is exactly what's going on--for one thing, it doesn't seem NV3x actually has 32/64 physical registers (even though, again, that's the number architected in NV_fragment_program), so who knows how many are architected into the internal machine instruction set. But it seems pretty likely that something like this is what's going on. Of course the aliasing issue I mentioned above would be very confusing if the internal machine language were programmer visible, but as it isn't it shouldn't be a problem.
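
A little C sketch of the aliasing I'm describing; the H-to-R mapping is just my reading of NV_fragment_program's register names, not anything confirmed about the hardware:

    #include <stdio.h>
    #include <stdint.h>

    /* Speculative model: 32 x 32-bit R registers, with the 64 16-bit H registers
       aliased onto their halves.  Here H0..H31 map to the low halves of R0..R31
       and H32..H63 to the high halves, matching the H32 example above. */
    static uint32_t R[32];

    static void write_h(unsigned h, uint16_t value)       /* 16-bit write to Hn */
    {
        unsigned r     = h % 32;
        unsigned shift = (h / 32) * 16;
        R[r] = (R[r] & ~((uint32_t)0xFFFF << shift)) | ((uint32_t)value << shift);
    }

    int main(void)
    {
        write_h(32, 0xBEEF);                      /* write H32 ...                        */
        printf("R0 = 0x%08X\n", (unsigned)R[0]);  /* ... and the upper half of R0 changes */
        return 0;
    }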

And do they combine the mantissa and offset or split the register in half?
Well, the exponent and mantissa fields in FP32 are not just double the width of the corresponding fields in FP16: FP16 is s10e5, while FP32 is s23e8, so it would seem like something like the latter must be what's going on. In any case, this shouldn't make any programmer-visible difference, unless the aliasing issue I mentioned above can be triggered through a programmer-visible API (i.e. PS 2.0, ARB_f_p, NV_f_p); but I don't think it can, so the question is sort of irrelevant.
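
For reference, pulling the fields apart shows the FP32 fields are not just the FP16 fields doubled (nothing NV3x-specific here, just the s10e5 and s23e8 layouts):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t h = 0x3C00;      /* 1.0 in FP16: 1 sign, 5 exponent, 10 mantissa bits */
        uint32_t f = 0x3F800000;  /* 1.0 in FP32: 1 sign, 8 exponent, 23 mantissa bits */

        printf("FP16: sign %u, exp %u, mantissa %u\n",
               (unsigned)(h >> 15), (unsigned)((h >> 10) & 0x1F), (unsigned)(h & 0x3FF));
        printf("FP32: sign %u, exp %u, mantissa %u\n",
               (unsigned)(f >> 31), (unsigned)((f >> 23) & 0xFF), (unsigned)(f & 0x7FFFFF));
        return 0;
    }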
 
Dave H, if they really use 32 registers, they need 5 bits. That would leave them an odd bit to specify the upper or lower half. As the instruction itself has to be in there as well, they need at least 6 bits for the 43 opcodes supported. And we need a target as well as a source; with the part and quad-pixel bits from before, that makes 10 bits per operand: 10 + 10 + 6 = 26 bits.

That leaves 6 bits for a third item (like an operand for a MAD, an offset or an extension of the instruction) if they use 32 bits, but would pose a problem with constant values. Hm. Don't they store constants in program space?
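
Spelled out, the bit budget I have in mind (all of it guesswork, obviously):

    #include <stdio.h>

    int main(void)
    {
        /* Per operand: 5 bits register + 1 bit upper/lower half
           + 2 bits part + 2 bits pixel in the quad = 10 bits. */
        unsigned operand = 5 + 1 + 2 + 2;
        unsigned opcode  = 6;                       /* enough for the 43 opcodes */
        unsigned used    = operand * 2 + opcode;    /* source + target + opcode  */
        printf("%u of 32 bits used, %u left\n", used, 32 - used);   /* 26 and 6 */
        return 0;
    }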

:D

Btw. That would give a pipeline that is 16 quads deep when the microcode does not issue extra registers for constants and things like MAD instructions, or 8 when it does. Or even just 4 when it uses a 'classic' CPU design.

Edit2: I think that the inherent latency of the design is probably more important than the number of registers used, when they use a 'classic' design or issue extra registers: they need the depth to prevent latency stalls.

Oh, and they probably do have 32 registers, but only expose 16 of them.
 
DiGuru said:
That leaves 6 bits for a third item (like an operand for a MAD, an offset or an extension of the instruction) if they use 32 bits, but would pose a problem with constant values. Hm. Don't they store constants in program space?
They certainly use much more than 32 bits per instruction. Arbitrary swizzle requires 8 bits per input value (you need to choose one out of four for every channel), of which there can be up to three. Write mask is another 4 bits. Then there are the saturation and invert modifiers, so you already need 34 bits for those alone.
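
To show where the 8 bits per source go: four destination channels, each picking one of four source channels, 2 bits apiece. Purely illustrative, not the actual NV3x encoding:

    #include <stdio.h>

    /* Apply an arbitrary swizzle: 4 destination channels x 2 bits each = 8 bits. */
    static void apply_swizzle(const float src[4], float dst[4], unsigned swz)
    {
        for (int i = 0; i < 4; i++)
            dst[i] = src[(swz >> (2 * i)) & 0x3];
    }

    int main(void)
    {
        float v[4] = { 1.0f, 2.0f, 3.0f, 4.0f }, out[4];
        apply_swizzle(v, out, 0x1B);     /* 0b00011011 = .wzyx */
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   /* 4 3 2 1 */
        return 0;
    }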
 
Xmas said:
DiGuru said:
That leaves 6 bits for a third item (like an operand for a MAD, an offset or an extension of the instruction) if they use 32 bits, but would pose a problem with constant values. Hm. Don't they store constants in program space?
They certainly use much more than 32 bits per instruction. Arbitrary swizzle requires 8 bits per input value (you need to choose one out of four for every channel), of which there can be up to three. Write mask is another 4 bits. Then there are the saturation and invert modifiers, so you already need 34 bits for those alone.

Yes, but the instruction scheduler would split that into chunks (the actual instruction masks) that perform operations. The pipeline has multiple stages that execute such sub-instructions in turn. At the end, all the units together would execute the specified number of 'high-level' instructions per clock pulse, but actually it would be a larger number of smaller instructions, executed in different parts of the pipeline.

EDIT: the compiler does the same thing.
 
On second thought: if they reserve 10 bits for the source and have 6 bits 'left', they could make a few instructions that use 16 bit constants and use the source as the target. For example: x=x+10. Unless they store immediate results in (not exposed) registers or use a 'classic' architecture, where operations can only execute on registers (source + modifier -> ( temp storage + modifier -> ) destination).
 
Do you guys think my analysis is reasonably accurate? Or is it way off the mark?
 
DiGuru said:
On second thought: if they reserve 10 bits for the source and have 6 bits 'left', they could make a few instructions that use 16 bit constants and use the source as the target. For example: x=x+10. Unless they store immediate results in (not exposed) registers or use a 'classic' architecture, where operations can only execute on registers (source + modifier -> ( temp storage + modifier -> ) destination).
Hm, why do you think those chunks for the different pipeline stages would be powers of two bits wide? So that 6 bits or so can be 'left'? For example, the source selection stage certainly needs a lot of information, while the adder stage can't do much more than either adding component-wise or add components for dot product.

I'd expect constants to occur as 64-bit (16x4 or maybe 32x2) or 128-bit (32x4); having single-component constants put into instructions might be too much of a hassle.
 
Xmas said:
DiGuru said:
On second thought: if they reserve 10 bits for the source and have 6 bits 'left', they could make a few instructions that use 16 bit constants and use the source as the target. For example: x=x+10. Unless they store immediate results in (not exposed) registers or use a 'classic' architecture, where operations can only execute on registers (source + modifier -> ( temp storage + modifier -> ) destination).
Hm, why do you think those chunks for the different pipeline stages would be powers of two bits wide? So that 6 bits or so can be 'left'? For example, the source selection stage certainly needs a lot of information, while the adder stage can't do much more than either adding component-wise or add components for dot product.

I'd expect constants to occur as 64-bit (16x4 or maybe 32x2) or 128-bit (32x4); having single-component constants put into instructions might be too much of a hassle.

Agreed. But most CPUs use immediate (small) values in some instructions. For example, an ADD instruction could have two opcodes: one to use a (temp) register and the accumulator, another would use an immediate value in the instruction and the accumulator.

That is much faster than first reading the instruction, copying the constant to a register, reading the next instruction and then executing the ADD instruction on the registers.
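
Something like this, in made-up C; the encoding and opcode names are invented purely to show the two ADD flavours:

    #include <stdio.h>
    #include <stdint.h>

    /* Invented encoding, just to show the idea:
       bits 31..26 opcode, bits 19..16 destination register (simplified to 4 bits),
       bits 15..0 either a source register index or a 16-bit immediate. */
    enum { OP_ADD_REG = 1, OP_ADD_IMM = 2 };

    static float regs[16];

    static void execute(uint32_t instr)
    {
        unsigned op  = instr >> 26;
        unsigned dst = (instr >> 16) & 0xF;
        unsigned low = instr & 0xFFFF;

        switch (op) {
        case OP_ADD_REG: regs[dst] += regs[low & 0xF]; break;   /* x = x + r     */
        case OP_ADD_IMM: regs[dst] += (int16_t)low;    break;   /* x = x + const */
        }
    }

    int main(void)
    {
        regs[0] = 5.0f;
        execute(((uint32_t)OP_ADD_IMM << 26) | (0u << 16) | 10u);   /* r0 = r0 + 10 */
        printf("r0 = %g\n", regs[0]);                               /* 15 */
        return 0;
    }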
 
Sorry for resurrecting an old thread, but...
The performance hit of register usage is about 5% *bigger* on the NV36 ( transistor saving for a mainstream part, I assume ). I'll let maxpower of nV News publish the full numbers on the 23rd :)


Uttar
 
I'm coming into this thread kind of late, but shouldn't the NV36 have better PS performance than NV31 simply because the former is based on the NV35, while the latter is based on the black sheep NV30?

Or did the 5800 improve PS performance along with the 5900 with the Det 52.xx's? I'd love to see benchmarks including a 5800/U in upcoming articles. Surely some sites have a 5800 around? :)
 
Yep, of course, compared to NV31, dramatically improved, but that was to be expected :)
What I meant is that it's not faster clock-for-clock, for PS at least, than half an NV35, on which the chip is based.


Uttar
 
I would have expected a 4x1 to have 50% shader parity compared to a 4x2, ceteris paribus, given both can process one full quad per cycle. Perhaps it's offset by the NV35 being able to act as 8x0, and by fillrate/bandwidth differences?
 