weird result concerning R300's shader unit

991060

Well, this might not be as interesting as the NV40/R420 discussion, but I want an answer. :oops:

I ran some quick tests on R300; here are the results:
Code:
mov r1, c0
texld r0, t0, s0
mul r0, r0, c2
add r0, r0, r1
takes 1 clock

Code:
mov r1, c0
texld r0, t0, s0
add r0, r0, r1
mul r0, r0, c2
takes 2 clocks

These results suggest that R300's mini ALU can only do add (I know it can also apply register modifiers). And I have more:
Code:
mov r1, c0
texld r0, t0, s0
add r0, r0, r1
add r0, r0, c2
takes 2 clocks, which leads me to think that maybe the full shader core cannot do add, a little like NV40. So I changed the shader to:
Code:
def c5, 2.0f, 4.0f, 8.0f, 1.0f
mov r1, c0
texld r0, t0, s0
add r0, r0, r1
mul r0, r0, c5.r
takes 1 clock. As you can see, the last instruction is obviously handled by the mini ALU, so the full ALU can do add for sure.

Now the question is: what prevents the full and mini ALUs from doing adds simultaneously?
 
It works this way:

1:
Code:
texld r0,t0,s0 (TEX:Pass1)
mad r0,r0,c2,c0 (ALU:Pass1)

2:
Code:
texld r0,t0,s0 (TEX:Pass1)
add r0,r0,c0 (ALU:Pass1)
mul r0,r0,c2 (ALU:Pass2)

3:
Code:
texld r0,t0,s0 (TEX:Pass1)
add r0,r0,c0 (ALU:Pass1)
add r0,r0,c2 (ALU:Pass2)

4:
Code:
texld r0,t0,s0 (TEX:Pass1)
add r0,r0,c0 (ALU:Pass1)
r0=r0*2 (Mini-ALU:Pass1)
 
Actually, the last example as it is doesn't even prove the presence of a mini-ALU. MOVing c0 to r1 doesn't change the fact that c0 is a constant that can be premultiplied by 2, so you can replace the add and mul with a mad.
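To make the folding argument concrete, here is a small numeric check (Python, not shader code). It only demonstrates the algebra (tex + c0) * 2 == tex * 2 + (2 * c0); the sample values and helper names are made up for illustration, and it says nothing about what the driver's compiler actually does.

```python
# Hypothetical helpers: 4-component vectors modelled as plain Python lists.
def add_then_mul(tex, c0, scale):
    """add r0, r0, c0 followed by mul r0, r0, scale (two instructions)."""
    return [(t + c) * scale for t, c in zip(tex, c0)]

def mad(a, b, c):
    """Single mad: a * b + c, component-wise."""
    return [x * y + z for x, y, z in zip(a, b, c)]

# Made-up sample data, chosen to be exactly representable in binary floats.
tex = [0.25, 0.5, 0.75, 1.0]
c0 = [0.125, 0.25, 0.5, 0.0]
scale = 2.0

# The constant can be premultiplied once, outside the shader.
folded_c0 = [c * scale for c in c0]

two_step = add_then_mul(tex, c0, scale)
one_mad = mad(tex, [scale] * 4, folded_c0)
print(two_step, one_mad)
```

If both lists match, the add/mul pair is expressible as one mad, so a 1-clock result for that shader would not by itself require a mini-ALU.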
 
You guys are running under the false assumption that both ALUs have 0 latency. They don't. Consequently, given a data dependency in a series of adds, 2 adds will take 2 cycles of latency. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.
 
sireric said:
You guys are running under the false assumption that both ALUs have 0 latency. They don't. Consequently, given a data dependency in a series of adds, 2 adds will take 2 cycles of latency. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.
Hmm, this is interesting...
And another thing: it seems R300 only has one interpolator for v0 and v1 in the pixel shader. Is that true?
 
Hmm. . . Their documentation says that ten vec4s can be interpolated (two being reserved for colours and being clamped to [0-1] at 12 bit precision).
 
Ostsol said:
Hmm. . . Their documentation says that ten vec4s can be interpolated (two being reserved for colours and being clamped to [0-1] at 12 bit precision).
It's still possible that the pipeline can interpolate only one or two vec4s per clock cycle. The interpolated data does not need to be present before the pixel shader instructions that actually use it, and it takes quite a few instructions to access ten vec4s anyway.
 
sireric said:
You guys are running under the false assumption that both ALUs have 0 latency. They don't. Consequently, given a data dependency in a series of adds, 2 adds will take 2 cycles of latency. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.
So the ALU and mini-ALU are actually running in parallel and not as a serial pipeline?
 
Xmas said:
sireric said:
You guys are running under the false assumption that both ALUs have 0 latency. They don't. Consequently, given a data dependency in a series of adds, 2 adds will take 2 cycles of latency. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.
So the ALU and mini-ALU are actually running in parallel and not as a serial pipeline?

They run in parallel and have a serial data dependency.
 
Ostsol said:
Hmm. . . Their documentation says that ten vec4s can be interpolated (two being reserved for colours and being clamped to [0-1] at 12 bit precision).

DX9 requires that at least 8 vec4 texture coordinates and 2 vec4 colors be interpolatable.

Not sure that we've put out the iterator rate, so I'll leave it as an exercise for the reader :)
 
I'm not sure I understand. The scalar and vector units (both full and mini) run in parallel. You can issue to both sets every cycle, and you can use the output of one as input to the other every cycle too, but, again, everything takes time to compute. There's always going to be a serial data dependency between operations: R0 = (A op B) followed by R1 = (R0 op C) followed by R2 = (R1 op D) has a serial data dependency. Assuming you can't do an operation such as (X op Y op Z op W) in 1 cycle (op being some sort of operation), then there's a latency you need to wait for, regardless of the number of parallel units.
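For anyone who wants to play with the idea, here is a toy issue model in Python. It is only a sketch under loud assumptions: every op has a latency of exactly 1 cycle, `units` ops can be issued per cycle, and ops issue in program order when their sources are ready. None of these numbers are R300 specifications, and the function and register names are invented for illustration.

```python
LATENCY = 1  # assumed cycles per op; NOT a published R300 figure

def cycles_needed(ops, units):
    """ops: list of (dest, [sources]) in program order, each dest unique.
    units: how many ops may issue per cycle (assumed >= 1).
    Returns the cycle at which the last result becomes available."""
    produced = {dest for dest, _ in ops}
    done_at = {}            # dest -> cycle when its result is available
    pending = list(ops)
    cycle = 0
    while pending:
        issued = 0
        remaining = []
        for dest, srcs in pending:
            # Sources not produced by any op are external inputs (always ready).
            ready = all(
                s not in produced or done_at.get(s, float("inf")) <= cycle
                for s in srcs
            )
            if ready and issued < units:
                done_at[dest] = cycle + LATENCY
                issued += 1
            else:
                remaining.append((dest, srcs))
        if issued == 0:
            raise ValueError("no op can issue: cyclic or bad dependency")
        pending = remaining
        cycle += 1
    return max(done_at.values())

# The dependent chain from the post: extra units do not help.
chain = [("R0", ["A", "B"]), ("R1", ["R0", "C"]), ("R2", ["R1", "D"])]
print(cycles_needed(chain, 1), cycles_needed(chain, 4))

# Two independent 2-op chains interleaved: the second unit hides the latency.
interleaved = [("P0", ["A", "B"]), ("Q0", ["C", "D"]),
               ("P1", ["P0", "E"]), ("Q1", ["Q0", "F"])]
print(cycles_needed(interleaved, 1), cycles_needed(interleaved, 2))
```

In this model the dependent chain costs the same number of cycles no matter how many units there are, while the interleaved program finishes four ops in the time a single chain of two would take. That is the latency-hiding point, nothing more.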
 
Maybe he means feeding the result of a special function (like rsq) into a vec? I think he's asking if the serialization is a crossbar.
 
In that case, yes :)

You can take any scalar output and send it to any of the component inputs of the vector unit on the next instruction. It's very component. In the same way, you can take the vec output and send it to the scalar unit.

Edit: Of course, that also means that any of these types of sequences will take 2 cycles.
 
That is exactly what I meant, DemoCoder.

Sireric, you mean to say that a Vec/Scalar unit could offer its output as input to any of the other ALUs, mini or large?
 