weird result concerning R300's shader unit

991060 · May 12, 2004

Well, this might not be as interesting as the NV40/R420 discussion, but I want an answer.

I ran some quick tests on R300; here are the results:

Code:

mov r1, c0
texld r0, t0, s0
mul r0, r0, c2
add r0, r0, r1

takes 1 clock

Code:

mov r1, c0
texld r0, t0, s0
add r0, r0, r1
mul r0, r0, c2

takes 2 clocks

These results suggest R300's mini ALU can only do add (I know it can also apply register modifiers). And I have more:

Code:

mov r1, c0
texld r0, t0, s0
add r0, r0, r1
add r0, r0, c2

takes 2 clocks, which leads me to think maybe the full shader core cannot do add, a little similar to NV40. So I changed the shader to:

Code:

def c5, 2.0f, 4.0f, 8.0f, 1.0f
mov r1, c0
texld r0, t0, s0
add r0, r0, r1
mul r0, r0, c5.r

takes 1 clock. As you can see, the last instruction is obviously handled by the mini ALU, so the full ALU can certainly do add.

Now the question is: what prevents the full and mini ALU from doing an add simultaneously?

Demirug · May 12, 2004

It works this way:

1:

Code:

texld r0,t0,s0 (TEX:Pass1)
mad r0,r0,c2,c0 (ALU:Pass1)

2:

Code:

texld r0,t0,s0 (TEX:Pass1)
add r0,r0,c0 (ALU:Pass1)
mul r0,r0,c2 (ALU:Pass2)

3:

Code:

texld r0,t0,s0 (TEX:Pass1)
add r0,r0,c0 (ALU:Pass1)
add r0,r0,c2 (ALU:Pass2)

4:

Code:

texld r0,t0,s0 (TEX:Pass1)
add r0,r0,c0 (ALU:Pass1)
r0=r0*2 (Mini-ALU:Pass1)

991060 · May 12, 2004

Thanks Demirug, I think you're right, the mini alu can not do add at all.

Xmas · May 12, 2004

Actually, the last example as it is doesn't even prove the presence of a mini-ALU. MOVing c0 to r1 doesn't change the fact that c0 is a constant that can be premultiplied by 2, so you can replace the add and mul with a mad.
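To make the folding concrete, here is a small Python sketch (not shader code). It assumes one float stands in for a single shader component, and the helper names add_then_mul and single_mad are made up for illustration. It only checks the arithmetic identity (x + c0) * 2 = x * 2 + c0 * 2; it says nothing about what the driver actually does.

```python
# Illustrative only: plain Python floats stand in for one shader component.
# add_then_mul and single_mad are hypothetical helper names.

def add_then_mul(texel, c0, scale=2.0):
    """Models: add r0, r0, c0  followed by  mul r0, r0, scale."""
    return (texel + c0) * scale


def single_mad(texel, c0, scale=2.0):
    """Models: mad r0, r0, scale, folded  with folded = c0 * scale."""
    folded = c0 * scale  # a constant, so it could be computed ahead of time
    return texel * scale + folded


c0 = 0.125
for texel in [0.0, 0.25, 0.5, 1.0]:
    # Both forms give the same value for these sample inputs.
    assert add_then_mul(texel, c0) == single_mad(texel, c0)
```

If the two forms are interchangeable, a 1-clock result for the add/mul pair could come from a single mad on the full ALU, so that test alone would not isolate the mini-ALU.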

sireric · May 12, 2004

You guys are running under the false assumption that both the ALUs have 0 latency. They don't. Consequently, assuming a data dependency on a series of adds, it will take 2 cycles of latency for 2 adds. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.
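A toy model of this point, as a rough Python sketch. Everything in it is an assumption made for illustration: two issue slots per cycle (standing in for the full and mini ALU, ignoring which operations each one supports), a latency of one cycle per operation, and uniquely named destination registers. It is not a description of the real R300 scheduler.

```python
def cycles_needed(instrs, slots_per_cycle=2, latency=1):
    """Greedy toy scheduler.

    instrs: list of (dest, [sources]) in program order, with unique dests.
    A source that no instruction produces is treated as an external input
    (texture result or constant) that is available from the first cycle.
    Returns the cycle in which the last result becomes available.
    """
    produced = {dest for dest, _ in instrs}

    # Reject programs that read a produced value before it is written.
    seen = set()
    for dest, srcs in instrs:
        for s in srcs:
            if s in produced and s not in seen:
                raise ValueError(f"{s} is read before it is written")
        seen.add(dest)

    ready_at = {}          # register -> first cycle its value can be read
    pending = list(instrs)
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for dest, srcs in pending:
            if len(issued) == slots_per_cycle:
                break
            if all(s not in produced or ready_at.get(s, float("inf")) <= cycle
                   for s in srcs):
                issued.append((dest, srcs))
        for instr in issued:
            ready_at[instr[0]] = cycle + latency
            pending.remove(instr)
    return cycle + latency - 1


# Two dependent adds: the second has to wait for the first.
dependent = [("t1", ["tex", "c0"]), ("t2", ["t1", "c2"])]

# Two independent chains of two adds each: they can be interleaved.
interleaved = [
    ("a1", ["texA", "c0"]), ("a2", ["a1", "c2"]),
    ("b1", ["texB", "c0"]), ("b2", ["b1", "c2"]),
]

print(cycles_needed(dependent))    # 2 cycles for 2 operations
print(cycles_needed(interleaved))  # 2 cycles for 4 operations
```

Under these assumptions the dependent pair cannot finish in fewer than 2 cycles however many units are free, while four operations from two independent chains also fit in 2 cycles.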

Reverend · May 13, 2004

Man, and I thought all the DevRel guys do at dev houses is play games...

991060 · May 13, 2004

sireric said:
You guys are running under the false assumption that both the ALUs have 0 latency. They don't. Consequently, assuming a data dependency on a series of adds, it will take 2 cycles of latency for 2 adds. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.

Hmm, this is interesting...
And another thing: it seems R300 only has one interpolator for v0 and v1 in the pixel shader. Is that true?

Ostsol · May 13, 2004

Hmm. . . Their documentation says that ten vec4s can be interpolated (two being reserved for colours and being clamped to [0-1] at 12 bit precision).

Simon F · May 13, 2004

Reverend said:
Man, and I thought all the DevRel guys do at dev houses is play games...

I believe it's called "testing"

Ailuros · May 13, 2004

Simon F said:
Reverend said:

Man, and I thought all the DevRel guys do at dev houses is play games...

I believe it's called "testing"

I always had the suspicion that I'm working in the wrong branch

arjan de lumens · May 13, 2004

Ostsol said:
Hmm. . . Their documentation says that ten vec4s can be interpolated (two being reserved for colours and being clamped to [0-1] at 12 bit precision).

It's still possible that the pipeline can interpolate only one or two vec4s per clock cycle. The interpolated data do not need to be present until pixel shader instructions actually use them, and it takes quite a few instructions to access ten vec4s anyway.

Xmas · May 13, 2004

sireric said:
You guys are running under the false assumption that both the ALUs have 0 latency. They don't. Consequently, assuming a data dependency on a series of adds, it will take 2 cycles of latency for 2 adds. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.

So the ALU and mini-ALU are actually running in parallel and not as a serial pipeline?

sireric · May 13, 2004

Xmas said:
sireric said:

You guys are running under the false assumption that both the ALUs have 0 latency. They don't. Consequently, assuming a data dependency on a series of adds, it will take 2 cycles of latency for 2 adds. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.

So the ALU and mini-ALU are actually running in parallel and not as a serial pipeline?

They run in parallel and have a serial data dependency.

sireric · May 13, 2004

Ostsol said:
Hmm. . . Their documentation says that ten vec4s can be interpolated (two being reserved for colours and being clamped to [0-1] at 12 bit precision).

DX9 requires at least 8 vec4 texture coordinates and 2 vec4 colors to be interpolatable.

Not sure that we've put out the iterator rate, so I'll leave it as an exercise for the reader.

Luminescent · May 13, 2004

Do the Vec units ever use the scalar units for input on an instruction? Would this be possible?

sireric · May 13, 2004

I'm not sure I understand. The scalar and vector units (both full and mini) run in parallel. You can issue to both sets every cycle, and you can use the output of one as input to the other every cycle too, but, again, everything takes time to compute. There's always going to be a serial data dependency between operations -- R0=(A op B) followed by R1=(R0 op C) followed by R2=(R1 op D) has a serial data dependency. Assuming you can't do an operation such as (X op Y op Z op W) in 1 cycle (op being some sort of operation), then there's a latency you need to wait for, regardless of the number of parallel units.

DemoCoder · May 13, 2004

Maybe he means feeding the result of a special function (like rsq) into a vec? I think he's asking if the serialization is a crossbar.

sireric · May 13, 2004

In that case, yes

You can take any scalar output and send it to any of the component inputs of the vector unit on the next instruction. It's very component. In the same way, you can take the vector output and send it to the scalar unit.

Edit: Of course, that also means that any of these types of sequences will take 2 cycles.

Luminescent · May 13, 2004

That is exactly what I meant DemoCoder.

Sireric, you mean to say that a Vec/Scalar unit could offer its output as input to any of the other ALUs, mini or large?
