Old 25-Aug-2004, 10:56   #1
tcchiu
Junior Member
 
Join Date: Jun 2004
Location: Taiwan
Posts: 22
Maximum number of instructions per clock of an NV40 pipeline

NV40 Technology explained.
http://www.3dcenter.org/artikel/nv40...e/index2_e.php

(To clarify first: the term "pipeline" in this post does not mean a hardware quad-pipeline.)

Quote:
Dual- and co-issue combined, an NV40 pipe can execute up to four instructions – while having a single all-purpose arithmetic unit only.
I don't understand. I think the number should be eight.

With co-issue, each single shader unit can execute four instructions per clock. For example, shader unit 1 should be able to execute the following four instructions in a cycle:
Code:
rsq r0.r, v0.r
mul r1.r, r0.r, v1.r
rsq r0.b, v1.b
mul r1.b, r0.b, v2.b
The r and b channels could be co-issued. Each mul depends on the previous rsq, but the pair could be dual-issued. That is, shader unit 1 can execute these four instructions per clock.

Since there are two shader units, and supposing NV40 can dual-issue across the two shader units, so that the destination of shader unit 1 can be the source of shader unit 2 (if not, why this 'complementary' design?), there can be another four instructions, and the maximum number of instructions per clock should be eight.

Because NV40 has 16 pipes in total, overall chip performance is 16T + 128M at maximum.
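
The chip-wide figure is plain multiplication. A minimal Python sketch of that arithmetic, assuming "T" means texture operations and "M" means math instructions (the post does not define them), and treating the eight-per-pipe figure as the hypothesis under discussion rather than an established number:

```python
# Peak per-clock throughput implied by the post's hypothesis.
# Assumptions: "T" = texture ops, "M" = math instructions, one texture
# op per pipe per clock. All names here are illustrative only.
PIPES = 16

def chip_peak(math_per_pipe, tex_per_pipe=1, pipes=PIPES):
    """Return (texture ops, math instructions) per clock for the whole chip."""
    return pipes * tex_per_pipe, pipes * math_per_pipe

hypothesis = chip_peak(8)   # the eight-per-pipe figure proposed in this post
article = chip_peak(4)      # the four-per-pipe figure from the quoted article
print("hypothesis: %dT + %dM" % hypothesis)
print("article:    %dT + %dM" % article)
```

Under the article's four-per-pipe limit the same multiplication gives half the math figure, which is the gap the question is about.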

What did I miss?
Old 25-Aug-2004, 16:54   #2
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678

Three things:
1. I'm not sure that you can co-issue special functions.
2. The second shader unit is not identical to the first.
3. There may be a hard limit due to bandwidth or cache constraints that prevent more than four instructions from being executed each clock.
Old 25-Aug-2004, 18:24   #3
Luminescent
Senior Member
 
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036

According to 3DGPU, the limit of 4 instructions per clock is a result of the number of data paths available to the SUs (4 data paths).

I'm sure that two RCPs cannot be co-issued in parallel, as there is only one RCP unit, but I'm not sure about RCP and RSQ. According to the article, these are separate units, although I believe they are arranged in series, which would prevent them from functioning independently. I guess one unit could modify the output of the other, but this would come to no real benefit.
__________________
"Friendship is unnecessary, like philosophy, like art... It has no survival value; rather it is one of those things that give value to survival."
-C.S. Lewis
Old 25-Aug-2004, 19:06   #4
tcchiu
Junior Member
 
Join Date: Jun 2004
Location: Taiwan
Posts: 22

Quote:
Originally Posted by Chalnoth
Three things:
1. I'm not sure that you can co-issue special functions.
2. The second shader unit is not identical to the first.
3. There may be a hard limit due to bandwidth or cache constraints that prevent more than four instructions from being executed each clock.
I am not sure about 1 either, but it seems okay according to the figures given in "NV40 Technology explained". I know that may be just a guess, which is why I am looking for more (solid) information here.

About 2: I know that shader unit 2 is not identical to shader unit 1. The four-line example code in my original post cannot be executed in shader unit 2.

Maybe I didn't make myself clear in the original post. The four-instruction sample code should be executable in one cycle by shader unit 1 (I am not sure, which is why I posted this question). What I meant by "there can be another four instructions" (executed by shader unit 2) is not the same four-instruction sample I gave for shader unit 1 but, for example, another sequence of four instructions consisting of MAD and DOT on entirely different channels (so that they can be co-issued and dual-issued just like the four-instruction sample for shader unit 1).

About 3: if there were such a "hard limit" (e.g. the read ports of the instruction buffer), the linked article should have mentioned it, shouldn't it?
Old 25-Aug-2004, 19:18   #5
tcchiu
Junior Member
 
Join Date: Jun 2004
Location: Taiwan
Posts: 22

Quote:
Originally Posted by Luminescent
I guess one unit could modify the output of the other, but this would come to no real benefit.
The benefit is being able to execute four "dependent" instructions in a clock cycle.

Code:
rcp r0.r, v0.r
mul r1.r, r0.r, v1.r
mul r2.r, r1.r, v2.r
add r3.r, r2.r, v3.r
The first two instructions could be dual-issued in shader unit 1, and the latter two in shader unit 2. If shader unit 2 could modify the result of shader unit 1, then even though the second mul depends on the destination register (r1.r) of the first mul, these four instructions could be executed in one clock cycle, under the assumption that dual-issue happens across the two shader units (I cannot find any statement in "NV40 Technology explained" supporting this).
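
To make the dependency chain in the snippet above explicit, here is a small Python sketch that parses the assembly text and reports which instructions read the destination of the instruction immediately before them. The parser and the names `parse` and `chain_links` are purely illustrative; matching on the exact register-plus-mask string is a simplification, and the result says nothing about what the hardware can actually issue in one clock.

```python
import re

# Parse "op dst, src1, src2" lines and report read-after-write links to the
# immediately preceding instruction. A reading aid only, not a model of
# NV40 scheduling.
LINE = re.compile(r"(\w+)\s+([\w.]+)\s*,\s*(.+)")

def parse(line):
    op, dst, srcs = LINE.match(line.strip()).groups()
    return op, dst, [s.strip() for s in srcs.split(",")]

def chain_links(program):
    """Return indices of instructions that read the previous one's destination."""
    instrs = [parse(l) for l in program.strip().splitlines()]
    return [i for i in range(1, len(instrs))
            if instrs[i - 1][1] in instrs[i][2]]

program = """
rcp r0.r, v0.r
mul r1.r, r0.r, v1.r
mul r2.r, r1.r, v2.r
add r3.r, r2.r, v3.r
"""
print(chain_links(program))
```

Every instruction after the first depends on its predecessor, so the whole sequence is one serial chain. That is why the one-clock claim rests entirely on shader unit 2 being fed directly from shader unit 1.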
Old 25-Aug-2004, 19:19   #6
Xmas
Off-season
 
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019

Dual-issue means having SU1 and SU2 perform different operations. Co-issue means performing two ops in one SU, effectively splitting it. You can only co-issue two MAD/MUL/ADD/DPx, and only use up to four channels. I.e. SU2 can only do 2 instructions per clock.

You can't co-issue special functions. I think the maximum you can reach is 6 instructions/clock, e.g.
RCP + 2 MUL2 in SU1
2 MAD2 in SU2
NRM_PP (two cycles latency)
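
One possible reading of the co-issue rule stated in this post, written as a small Python check for the vector ALU of a single shader unit. The constants and the function `can_co_issue` are illustrative assumptions, not a description of real NV40 scheduling logic. Special functions such as RCP are simply treated as not co-issuable here, although the example above lists an RCP alongside the two MULs in SU1.

```python
# A sketch of the stated rule: at most two co-issued ops, drawn from
# MAD/MUL/ADD/DPx, using no more than four channels in total.
CO_ISSUABLE = {"MAD", "MUL", "ADD", "DP"}   # "DPx" collapsed to "DP" here
MAX_OPS = 2        # "you can only co-issue two ..."
MAX_CHANNELS = 4   # "... and only use up to four channels"

def can_co_issue(ops):
    """ops: list of (opcode, write_mask) pairs, e.g. ("MAD", "xy")."""
    if len(ops) > MAX_OPS:
        return False
    if any(opcode not in CO_ISSUABLE for opcode, _ in ops):
        return False
    return sum(len(mask) for _, mask in ops) <= MAX_CHANNELS

print(can_co_issue([("MAD", "xy"), ("MAD", "zw")]))   # the "2 MAD2" case
print(can_co_issue([("MUL", "xyz"), ("ADD", "xy")]))  # five channels in total
print(can_co_issue([("RCP", "w"), ("RCP", "x")]))     # special functions
```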
Old 25-Aug-2004, 22:55   #7
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678

Quote:
Originally Posted by Xmas
You can't co-issue special functions. I think the maximum you can reach is 6 instructions/clock, e.g.
RCP + 2 MUL2 in SU1
2 MAD2 in SU2
NRM_PP (two cycles latency)
Perhaps, unless there's an absolute hard limit on the number of instructions.
Old 26-Aug-2004, 07:00   #8
tcchiu
Junior Member
 
Join Date: Jun 2004
Location: Taiwan
Posts: 22

Quote:
Originally Posted by Xmas
You can only co-issue two MAD/MUL/ADD/DPx, and only use up to four channels. I.e. SU2 can only do 2 instructions per clock.
What does "only use up to four channels" mean? Doesn't the following code snippet use up to four channels? (I expect SU2 can execute all of them in a clock cycle.)

Code:
mul r0.rg, v1.rg, v2.rg
add r1.rg, r0.rg, v3.rg
mul r0.ba, v4.ba, v5.ba
add r1.ba, r0.ba, v6.ba
Old 26-Aug-2004, 11:21   #9
pat777
Member
 
Join Date: May 2004
Posts: 230

http://techreport.com/etc/2004q2/tamasi/index.x?pg=4

Quote:
TR: Inside of the pixel pipeline, you've got two of the FP32 pixel shaders in each pixel pipe. Can both of them do parallel vector operations per clock?

Tamasi: Yep. The way to think about it is that you can dual (or more) issue instructions per shader unit, and then you can co-issue between them as well, so, in fact, you can have four, or in some cases more than four, instructions being issued on a single pixel pipeline—two in shader unit one and two in shader unit two—two independent instructions in shader unit one and another two independent instructions in shader unit two. We also have mini-ALUs in each of those shader units, as well, which also can have instructions issued to them. We gave a shader example that actually had up to seven instructions being executed in parallel in one pass.
It appears one NV40 pipe can do 7 instructions per clock under a special circumstance.
Old 26-Aug-2004, 15:44   #10
Tridam
Regular
 
Join Date: Apr 2003
Location: Louvain-la-Neuve, Belgium
Posts: 523

Quote:
Originally Posted by pat777
http://techreport.com/etc/2004q2/tamasi/index.x?pg=4

Quote:
TR: Inside of the pixel pipeline, you've got two of the FP32 pixel shaders in each pixel pipe. Can both of them do parallel vector operations per clock?

Tamasi: Yep. The way to think about it is that you can dual (or more) issue instructions per shader unit, and then you can co-issue between them as well, so, in fact, you can have four, or in some cases more than four, instructions being issued on a single pixel pipeline—two in shader unit one and two in shader unit two—two independent instructions in shader unit one and another two independent instructions in shader unit two. We also have mini-ALUs in each of those shader units, as well, which also can have instructions issued to them. We gave a shader example that actually had up to seven instructions being executed in parallel in one pass.
It appears one NV40 pipe can do 7 instructions per clock under a special circumstance.
The number 7 comes from a shader example used by NVIDIA for the press briefings. IIRC some useless instructions were added to this shader to inflate the maximum number of instructions shown to the press. A lot of the instructions were modifiers, and there were a lot of syntax errors (a mismatch of OpenGL and Direct3D syntax). It always makes me smile when I read this number 7.


I think that the maximum is what Xmas said + modifiers

rcp.w
mul.w
mul.xyz
bx2.xyzw
nrm_pp
add.xy
add.zw
bx2.xyzw

-> 8 instructions with modifiers
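
As a quick tally of the list above, the following Python sketch separates arithmetic operations from modifiers. Treating bx2 as a modifier follows the wording of this post ("what Xmas said + modifiers"). The helper `split_counts` and the `MODIFIERS` set are illustrative names, not anything taken from NVIDIA documentation.

```python
# Count the listed entries, separating arithmetic ops from modifiers.
listed = ["rcp.w", "mul.w", "mul.xyz", "bx2.xyzw",
          "nrm_pp", "add.xy", "add.zw", "bx2.xyzw"]
MODIFIERS = {"bx2"}   # assumption: only bx2 counts as a modifier here

def split_counts(entries):
    """Return (arithmetic op count, modifier count)."""
    mods = sum(1 for e in entries if e.split(".")[0] in MODIFIERS)
    return len(entries) - mods, mods

ops, mods = split_counts(listed)
print(ops, mods, ops + mods)
```

The six arithmetic operations match the count Xmas gave, and the two modifiers bring the total to eight.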

However, many things can prevent this from happening: register usage (1 interpolated, 2 constants, ?2 temporary write / 4 temporary read?) or scheduling difficulties.
__________________
Damien Triolet - HardWare.fr
Sorry for my bad English. Maybe one day it'll be better :D
Old 26-Aug-2004, 16:19   #11
Luminescent
Senior Member
 
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036

In what unit is nrm_pp executed? Is there a specialized unit for normalization?

All times are GMT +1.