Old 12-May-2004, 14:20   #1
991060
Member
 
Join Date: Jul 2003
Location: Beijing
Posts: 640
weird result concerning R300's shader unit

Well, this might not be as interesting as the NV40/R420 discussion, but I want an answer.

I ran some quick tests on R300; here are the results:
Code:
mov r1, c0
texld r0, t0, s0
mul r0, r0, c2
add r0, r0, r1
takes 1 clock

Code:
mov r1, c0
texld r0, t0, s0
add r0, r0, r1
mul r0, r0, c2
takes 2 clocks

These results suggest that R300's mini ALU can only do add (I know it can also act as a register modifier). And I have more:
Code:
mov r1, c0
texld r0, t0, s0
add r0, r0, r1
add r0, r0, c2
takes 2 clocks, which leads me to think maybe the full shader core cannot do add, a little similar to NV40. So I changed the shader to:
Code:
def c5, 2.0f, 4.0f, 8.0f, 1.0f
mov r1, c0
texld r0, t0, s0
add r0, r0, r1
mul r0, r0, c5.r
takes 1 clock. As you can see, the last instruction is obviously taken care of by the mini ALU, so the full ALU can do add for sure.

Now the question is: what prevents the full and mini ALUs from doing adds simultaneously?
Old 12-May-2004, 14:41   #2
Demirug
Senior Member
 
Join Date: Dec 2002
Posts: 1,326

It works this way:

1:
Code:
texld r0,t0,s0 (TEX:Pass1)
mad r0,r0,c2,c0 (ALU:Pass1)
2:
Code:
texld r0,t0,s0 (TEX:Pass1)
add r0,r0,c0 (ALU:Pass1)
mul r0,r0,c2 (ALU:Pass2)
3:
Code:
texld r0,t0,s0 (TEX:Pass1)
add r0,r0,c0 (ALU:Pass1)
add r0,r0,c2 (ALU:Pass2)
4:
Code:
texld r0,t0,s0 (TEX:Pass1)
add r0,r0,c0 (ALU:Pass1)
r0=r0*2 (Mini-ALU:Pass1)
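The four cases above can be read as a small set of pairing rules. The sketch below is only a toy model of those rules, not ATI's actual scheduler: the function name `count_alu_passes`, the op encoding, and the assumption that the mini ALU handles exactly the x2/x4/x8 scale factors are all made up for illustration.

```python
# Toy model of the pass assignment listed above (hypothetical, not the real scheduler).
MINI_ALU_SCALES = {2.0, 4.0, 8.0}  # assumption: scale factors the mini ALU can apply

def count_alu_passes(ops):
    """Count ALU passes for a chain of ops applied to the texld result.

    ops is a list of (opcode, operand); operand is a float for a literal
    scale factor or a register name (str) for an arbitrary constant.
    """
    passes = 0
    i = 0
    while i < len(ops):
        op, _ = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if op == "mul" and nxt and nxt[0] == "add":
            i += 2  # mul followed by a dependent add folds into one mad
        elif op == "add" and nxt and nxt[0] == "mul" and nxt[1] in MINI_ALU_SCALES:
            i += 2  # add in the full ALU, scale applied by the mini ALU
        else:
            i += 1  # anything else occupies a pass on its own
        passes += 1
    return passes

shaders = {
    "mul, add": [("mul", "c2"), ("add", "c0")],
    "add, mul": [("add", "c0"), ("mul", "c2")],
    "add, add": [("add", "c0"), ("add", "c2")],
    "add, mul by 2": [("add", "c0"), ("mul", 2.0)],
}
for name, ops in shaders.items():
    print(name, "->", count_alu_passes(ops), "pass(es)")
```

Under these assumptions the model reproduces the measured 1, 2, 2 and 1 clocks from the first post.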
__________________
GPU blog
Old 12-May-2004, 14:54   #3
991060
Member
 
Join Date: Jul 2003
Location: Beijing
Posts: 640

Thanks Demirug, I think you're right: the mini ALU cannot do add at all.
Old 12-May-2004, 16:32   #4
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019

Actually, the last example as it is doesn't even prove the presence of a mini-ALU. MOVing c0 to r1 doesn't change the fact that c0 is a constant that can be premultiplied by 2, so you can replace the add and mul with a mad.
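To make the algebra explicit: (r0 + c0) * 2 equals r0 * 2 + (2 * c0), and 2 * c0 can be computed once outside the shader. A minimal numeric check follows; the function names and sample values are made up for illustration, and whether the driver actually performs this folding is not something the thread confirms.

```python
def add_then_scale(texel, c0, scale=2.0):
    # Original sequence: add r0, r0, c0 followed by mul r0, r0, scale
    return (texel + c0) * scale

def folded_mad(texel, c0, scale=2.0):
    # The constant is pre-multiplied once, outside the per-pixel work,
    # so the shader only needs a single mad: r0 * scale + c0_scaled
    c0_scaled = c0 * scale
    return texel * scale + c0_scaled

samples = [0.0, 0.25, 0.5, 1.0]  # values chosen to be exact in binary floating point
c0 = 0.125
for texel in samples:
    print(texel, add_then_scale(texel, c0), folded_mad(texel, c0))
```

Both forms give the same result, so a 1-clock measurement for that shader is also consistent with a single mad in the full ALU.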
Old 12-May-2004, 21:27   #5
sireric
Member
 
Join Date: Jul 2002
Location: Santa Clara, CA
Posts: 348

You guys are running under the false assumption that both the ALUs have 0 latency. They don't. Consequently, assuming a data dependency on a series of adds, it will take 2 cycles of latency for 2 adds. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.
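A minimal sketch of that point, assuming a 1-cycle latency, two issue slots per cycle, and each register written only once. The function `cycles_to_finish` and the register names are hypothetical, and the model is far simpler than the real hardware.

```python
def cycles_to_finish(instrs, units=2, latency=1):
    """Cycles until every result is available.

    instrs is a list of (dest, sources). Up to `units` instructions whose
    sources are ready may issue per cycle; a result becomes usable
    `latency` cycles after issue. Assumes each dest is written once and
    there are no circular dependencies.
    """
    ready_at = {dest: float("inf") for dest, _ in instrs}
    pending = list(instrs)
    cycle = 0
    while pending:
        # Names that no instruction writes (texels, constants) count as ready at cycle 0.
        ready = [ins for ins in pending
                 if all(ready_at.get(src, 0) <= cycle for src in ins[1])]
        issued = ready[:units]
        for dest, _ in issued:
            ready_at[dest] = cycle + latency
        pending = [ins for ins in pending if ins not in issued]
        cycle += 1
    return max(ready_at.values(), default=0)

# Two dependent adds: the second must wait for the first.
one_chain = [("a1", ("tex_a", "c0")), ("a2", ("a1", "c2"))]
# The same chain for two independent pixels, interleaved.
two_chains = one_chain + [("b1", ("tex_b", "c0")), ("b2", ("b1", "c2"))]

print(cycles_to_finish(one_chain))   # 2 adds
print(cycles_to_finish(two_chains))  # 4 adds
```

In this model both cases finish in 2 cycles: the single chain leaves one unit idle each cycle, while the interleaved chains keep both units busy.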
Old 13-May-2004, 01:50   #6
Reverend
Naughty Boy!
 
Join Date: Jan 2002
Posts: 3,266

Man, and I thought all the DevRel guys do at dev houses is play games...
__________________
Reverend
Dev Anon : Best game ever? Hmm... you mean other than anything from us? (2005)
Old 13-May-2004, 02:38   #7
991060
Member
 
Join Date: Jul 2003
Location: Beijing
Posts: 640

Quote:
Originally Posted by sireric
You guys are running under the false assumption that both the ALUs have 0 latency. They don't. Consequently, assuming a data dependency on a series of adds, it will take 2 cycles of latency for 2 adds. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.
Hmm, this is interesting...
And another thing: it seems R300 has only one interpolator for v0 and v1 in the pixel shader. Is that true?
Old 13-May-2004, 13:28   #8
Ostsol
Senior Member
 
Join Date: Nov 2002
Location: Edmonton, Alberta, Canada
Posts: 1,765

Hmm. . . Their documentation says that ten vec4s can be interpolated (two being reserved for colours and being clamped to [0-1] at 12 bit precision).
__________________
"Extremism is so easy. You've got your position, and that's it. It doesn't take much thought. And when you go far enough to the right, you meet the same idiots coming around from the left." -- Clint Eastwood

-Ostsol
Old 13-May-2004, 13:31   #9
Simon F
Tea maker
 
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,382

Quote:
Originally Posted by Reverend
Man, and I thought all the DevRel guys do at dev houses is play games...
I believe it's called "testing"
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson

"I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay
Old 13-May-2004, 13:57   #10
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,768

Quote:
Originally Posted by Simon F
Quote:
Originally Posted by Reverend
Man, and I thought all the DevRel guys do at dev houses is play games...
I believe it's called "testing"
I always had the suspicion that I'm working in the wrong branch
Old 13-May-2004, 14:06   #11
arjan de lumens
Senior Member
 
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256

Quote:
Originally Posted by Ostsol
Hmm. . . Their documentation says that ten vec4s can be interpolated (two being reserved for colours and being clamped to [0-1] at 12 bit precision).
It's still possible that the pipeline can interpolate only one or two vec4s per clock cycle. The interpolated data are not required to be present before pixel shader instructions actually use them, and it takes quite a few instructions to access ten vec4s anyway.
Old 13-May-2004, 16:30   #12
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019

Quote:
Originally Posted by sireric
You guys are running under the false assumption that both the ALUs have 0 latency. They don't. Consequently, assuming a data dependency on a series of adds, it will take 2 cycles of latency for 2 adds. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.
So ALU and mini-ALU are actually running parallel and not as a serial pipeline?
Old 13-May-2004, 18:25   #13
sireric
Member
 
Join Date: Jul 2002
Location: Santa Clara, CA
Posts: 348

Quote:
Originally Posted by Xmas
Quote:
Originally Posted by sireric
You guys are running under the false assumption that both the ALUs have 0 latency. They don't. Consequently, assuming a data dependency on a series of adds, it will take 2 cycles of latency for 2 adds. However, we can issue to all units every cycle, so you'll need to interleave multiple operations to hide the latencies.
So ALU and mini-ALU are actually running parallel and not as a serial pipeline?
They run in parallel and have serial data dependency.
Old 13-May-2004, 18:30   #15
sireric
Member
 
Join Date: Jul 2002
Location: Santa Clara, CA
Posts: 348

Quote:
Originally Posted by Ostsol
Hmm. . . Their documentation says that ten vec4s can be interpolated (two being reserved for colours and being clamped to [0-1] at 12 bit precision).
DX9 requires at least 8 vec4 texture coordinates and 2 vec4 colors to be interpolatable.

Not sure that we've put out the iterator rate, so I'll leave it as an exercise for the reader
Old 13-May-2004, 19:34   #16
Luminescent
Senior Member
 
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036

Do the Vec units ever use the scalar units for input on an instruction? Would this be possible?
__________________
"Friendship is unnecessary, like philosophy, like art... It has no survival value; rather it is one of those things that give value to survival."
-C.S. Lewis
Old 13-May-2004, 19:40   #17
sireric
Member
 
Join Date: Jul 2002
Location: Santa Clara, CA
Posts: 348

I'm not sure I understand. The scalar and vector (both full and mini) run in parallel. You can issue to both sets every cycle, and you can use the output of one into the other every cycle too, but, again, everything takes time to compute. There's always going to be serial data dependency on operations -- R0=(A op B) followed by R1 = (R0 op C) followed by R2=(R1 op D) has a serial data dependency. Assuming you can't do an operation such as (X op Y op Z op W) in 1 cycle (op being some sort of operation), then there's a latency you need to wait for, regardless of the number of parallel units.
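The R0/R1/R2 example can be checked by measuring the longest dependency chain, which bounds the latency no matter how many units are available. This is only an illustrative sketch assuming one cycle per operation; `min_latency` and the operand names in the second list are hypothetical.

```python
def min_latency(instrs, op_latency=1):
    """Length in cycles of the longest dependency chain.

    instrs is a list of (dest, sources) in program order. Names that no
    earlier instruction has written are treated as inputs available at cycle 0.
    """
    depth = {}
    for dest, srcs in instrs:
        # An op can start only after its slowest source has been produced.
        depth[dest] = op_latency + max((depth.get(s, 0) for s in srcs), default=0)
    return max(depth.values(), default=0)

# R0 = (A op B); R1 = (R0 op C); R2 = (R1 op D): each op waits on the previous one.
serial = [("R0", ("A", "B")), ("R1", ("R0", "C")), ("R2", ("R1", "D"))]
# Three ops with no dependencies between them.
independent = [("R0", ("A", "B")), ("R1", ("C", "D")), ("R2", ("E", "F"))]

print(min_latency(serial))
print(min_latency(independent))
```

The serial chain needs three op latencies while the independent ops need only one, which is why extra parallel units do not help a dependent sequence.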
Old 13-May-2004, 19:54   #18
DemoCoder
Regular
 
Join Date: Feb 2002
Location: California
Posts: 4,732

Maybe he means feeding the result of a special function (like rsq) into a vec? I think he's asking if the serialization is a crossbar.
Old 13-May-2004, 19:56   #19
sireric
Member
 
Join Date: Jul 2002
Location: Santa Clara, CA
Posts: 348

In that case, yes

You can take any scalar output and send it to any of the component inputs of the vector on the next instruction. It's very component. In the same way, you can take the vec output and send it to the scalar.

Edit: Of course, that also means that any of these types of sequences will take 2 cycles.
Old 13-May-2004, 20:12   #20
Luminescent
Senior Member
 
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036

That is exactly what I meant, DemoCoder.

Sireric, you mean to say that a Vec/Scalar unit could offer its output as input to any of the other ALUs, mini or large?
__________________
"Friendship is unnecessary, like philosophy, like art... It has no survival value; rather it is one of those things that give value to survival."
-C.S. Lewis
