#1 | |
|
Junior Member
Join Date: Jun 2004
Location: Taiwan
Posts: 22
|
NV40 Technology explained.
http://www.3dcenter.org/artikel/nv40...e/index2_e.php
(To clarify first: the term "pipeline" in this post does not mean a hardware quad pipeline.)
Quote:
With co-issue, each single shader unit can execute four instructions per clock. For example, shader unit 1 should be able to execute the following four instructions in one cycle:
Code:
    rsq r0.r, v0.r
    mul r1.r, r0.r, v1.r
    rsq r0.b, v1.b
    mul r1.b, r0.b, v2.b
Since there are two shader units, suppose NV40 can also dual-issue across them, so that the destination of shader unit 1 can be the source of shader unit 2 (if not, why this 'complement' design?). Then shader unit 2 can execute another four instructions, and the maximum number of instructions per clock per pipe should be eight. Because NV40 has 16 pipes in total, overall chip performance would be 16T + 128M at maximum (16 pipes × 8 instructions = 128). What did I miss?
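The arithmetic behind the 128 figure can be sketched in a few lines of Python. This only restates the assumptions made in this post (two shader units per pipe, four instructions each); the constant and function names are hypothetical, and none of the numbers are confirmed NV40 specifications.

```python
# Sketch of the peak-throughput estimate from post #1.
# All constants restate the poster's assumptions, not confirmed NV40 specs.
PIPES = 16                 # "NV40 has 16 pipes in total"
SHADER_UNITS_PER_PIPE = 2  # shader unit 1 and shader unit 2
INSTR_PER_UNIT = 4         # assumed co-issue limit per shader unit


def peak_instructions_per_clock(pipes, units_per_pipe, instr_per_unit):
    """Upper bound on arithmetic instructions per clock for the whole chip."""
    return pipes * units_per_pipe * instr_per_unit


peak = peak_instructions_per_clock(PIPES, SHADER_UNITS_PER_PIPE, INSTR_PER_UNIT)
print(peak)  # 128
```

Lowering `INSTR_PER_UNIT` or treating the two shader units as sharing one issue limit changes the result directly, which is why the per-unit limit is the crux of the question.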
#2 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
Three things:
1. I'm not sure that you can co-issue special functions.
2. The second shader unit is not identical to the first.
3. There may be a hard limit due to bandwidth or cache constraints that prevents more than four instructions from being executed each clock.
__________________
April 20, 1979 - America must never forget. |
#3 |
|
Senior Member
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036
|
According to 3DGPU, the limit of 4 instructions per clock is a result of the number of data paths available to the SUs (4 data paths).
I'm sure that 2 RCPs cannot be co-issued in parallel, as there is only one RCP unit, but I'm not sure about RCP and RSQ. According to the article, these are separate units, although I believe they are arranged in series, which would prevent them from functioning independently. I guess one unit could modify the output of the other, but this would bring no real benefit.
__________________
"Friendship is unnecessary, like philosophy, like art... It has no survival value; rather it is one of those things that give value to survival." -C.S. Lewis |
#4 | |
|
Junior Member
Join Date: Jun 2004
Location: Taiwan
Posts: 22
|
Quote:
About 2: I know that shader unit 2 is not identical to shader unit 1. The four-line example code in my original post cannot be executed in shader unit 2. Maybe I didn't make myself clear in the original post. The four-instruction sample should be executed in one cycle by shader unit 1 (I am not sure, which is why I posted this question). The "other four instructions" that shader unit 2 could execute are not the same four-instruction sample I gave for shader unit 1. They could be, for example, a sequence of four instructions consisting of MAD and DOT operating on entirely different channels (so that they can be co-issued and dual-issued like the four-instruction sample for shader unit 1).
About 3: if there were such a "hard limit" (e.g. the read ports of the instruction buffer), the linked article should have mentioned it, shouldn't it?
#5 | |
|
Junior Member
Join Date: Jun 2004
Location: Taiwan
Posts: 22
|
Quote:
Code:
    rcp r0.r, v0.r
    mul r1.r, r0.r, v1.r
    mul r2.r, r1.r, v2.r
    add r3.r, r2.r, v3.r
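Whether these four instructions can share a clock depends on how the hardware handles a result that feeds the very next instruction. A minimal Python sketch, assuming the plain `op dst, src, ...` syntax used in the post, makes the read-after-write chain explicit. The helper names are hypothetical, and the sketch says nothing about what NV40 can actually forward within a cycle.

```python
# Read-after-write (RAW) dependency check for the four instructions above.
# Simplification: any earlier writer of a channel counts as a producer;
# intervening overwrites are not tracked (none occur in this program).
PROGRAM = [
    "rcp r0.r, v0.r",
    "mul r1.r, r0.r, v1.r",
    "mul r2.r, r1.r, v2.r",
    "add r3.r, r2.r, v3.r",
]


def parse(line):
    """Split 'op dst, src, ...' into (op, dst, [srcs])."""
    op, rest = line.split(None, 1)
    operands = [tok.strip() for tok in rest.split(",")]
    return op, operands[0], operands[1:]


def channels(operand):
    """'r0.rg' -> {('r0', 'r'), ('r0', 'g')}; no mask means all of rgba."""
    reg, _, mask = operand.partition(".")
    return {(reg, ch) for ch in (mask or "rgba")}


def raw_dependencies(program):
    """Return (consumer, producer) index pairs, ordered by consumer."""
    parsed = [parse(line) for line in program]
    deps = []
    for i, (_, _, srcs) in enumerate(parsed):
        reads = set().union(*(channels(s) for s in srcs))
        for j in range(i):
            if channels(parsed[j][1]) & reads:
                deps.append((i, j))
    return deps


def chain_depth(program):
    """Length of the longest dependent chain (1 = fully independent)."""
    depth = [1] * len(program)
    for consumer, producer in raw_dependencies(program):
        depth[consumer] = max(depth[consumer], depth[producer] + 1)
    return max(depth, default=0)


print(raw_dependencies(PROGRAM))  # [(1, 0), (2, 1), (3, 2)]
print(chain_depth(PROGRAM))       # 4
```

Each instruction reads the register written by the one before it, so the chain is four deep. Running all four in one clock would require every result to be available to its consumer within that same clock.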
#6 |
|
Off-season
Join Date: Feb 2002
Location: On the pursuit of happiness
Posts: 3,019
|
Dual-issue means having SU1 and SU2 perform different operations. Co-issue means performing two ops in one SU, effectively splitting it. You can only co-issue two MAD/MUL/ADD/DPx, and only use up to four channels. I.e. SU2 can only do 2 instructions per clock.
You can't co-issue special functions. I think the maximum you can reach is 6 instructions/clock, e.g.
RCP + 2 MUL2 in SU1
2 MAD2 in SU2
NRM_PP (two cycles latency)
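The co-issue rule as stated in this post (at most two MAD/MUL/ADD/DPx operations, at most four channels in total) can be written down as a small check. This is a sketch of that one sentence only, with hypothetical names. It does not model SU1's special-function path, and it is not verified against the hardware.

```python
# Sketch of the co-issue rule from this post: at most two arithmetic ops,
# together covering at most four channels. Names are illustrative only.
def is_arithmetic(op):
    """MAD/MUL/ADD/DPx, as listed in the post."""
    return op in {"MAD", "MUL", "ADD"} or op.startswith("DP")


def can_coissue(slots):
    """slots is a list of (op, channel_count) pairs proposed for one clock."""
    return (
        len(slots) <= 2
        and all(is_arithmetic(op) for op, _ in slots)
        and sum(width for _, width in slots) <= 4
    )


# "2 MAD2 in SU2": two MADs, two channels each.
print(can_coissue([("MAD", 2), ("MAD", 2)]))  # True
# A 3-channel plus a 2-channel op would need five channels.
print(can_coissue([("MUL", 3), ("MAD", 2)]))  # False

# Tally of the example: RCP + 2 MUL2, 2 MAD2, NRM_PP.
example = {"SU1": 3, "SU2": 2, "NRM_PP": 1}
print(sum(example.values()))  # 6
```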
__________________
Binary prefixes for bits and bytes |
#7 | |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
Quote:
#8 | |
|
Junior Member
Join Date: Jun 2004
Location: Taiwan
Posts: 22
|
Quote:
Code:
    mul r0.rg, v1.rg, v2.rg
    add r1.rg, r0.rg, v3.rg
    mul r0.ba, v4.ba, v5.ba
    add r1.ba, r0.ba, v6.ba
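One way to read this example is as two pairs of instructions that write disjoint channel halves (`.rg` and `.ba`). A short Python sketch, assuming the write mask is the suffix of the destination operand, lists which instruction pairs have non-overlapping write masks. Disjoint masks are treated here as a necessary condition for co-issue, which is an assumption taken from the thread and not a documented NV40 rule. The function names are hypothetical.

```python
from itertools import combinations

# The four instructions from the post above.
PROGRAM = [
    "mul r0.rg, v1.rg, v2.rg",
    "add r1.rg, r0.rg, v3.rg",
    "mul r0.ba, v4.ba, v5.ba",
    "add r1.ba, r0.ba, v6.ba",
]


def write_mask(line):
    """Channels written by an instruction, e.g. 'mul r0.rg, ...' -> {'r', 'g'}."""
    dst = line.split(None, 1)[1].split(",")[0].strip()
    _, _, mask = dst.partition(".")
    return set(mask or "rgba")  # no mask means all four channels


def disjoint_pairs(program):
    """Index pairs whose write masks share no channel."""
    masks = [write_mask(line) for line in program]
    return [
        (i, j)
        for i, j in combinations(range(len(program)), 2)
        if not masks[i] & masks[j]
    ]


print(disjoint_pairs(PROGRAM))  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```

The two `mul` instructions (0 and 2) and the two `add` instructions (1 and 3) each cover all four channels between them without overlap. Each `add` still reads the result of its matching `mul`, so the mask check alone does not settle whether all four fit in one clock.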
#9 | |
|
Member
Join Date: May 2004
Posts: 230
|
http://techreport.com/etc/2004q2/tamasi/index.x?pg=4
Quote:
#10 | ||
|
Regular
Join Date: Apr 2003
Location: Louvain-la-Neuve, Belgium
Posts: 523
|
Quote:
I think that the maximum is what Xmas said plus modifiers: rcp.w, mul.w, mul.xyz, bx2.xyzw, nrm_pp, add.xy, add.zw, bx2.xyzw -> 8 instructions with modifiers.
However, many things can prevent this from happening: register usage (1 interpolated, 2 constants, ?2 temporary write / 4 temporary read?) or scheduling difficulties.
__________________
Damien Triolet - HardWare.fr Sorry for my bad English. Maybe one day it'll be better :D |
#11 |
|
Senior Member
Join Date: Aug 2002
Location: Miami, Fl
Posts: 1,036
|
In what unit is nrm_pp executed? Is there a specialized unit for normalization?