Is the NV3x influenced by ILDP?

Hello everyone,

This post is a sort of follow-up to my pipeline article, and the resulting insistence by many people, including Dave, that it simply HAD to have pipelines.

Well, I've been doing some additional research. And it sounds like the idea of moving away from pipelines and using pools of units instead has been considered by people for, well, quite a while.

I've stumbled upon an article by J. E. Smith, professor at the Dept. of Elect. and Comp. Engr. at the University of Wisconsin, whose work mostly relates to computer architectures.

His personal website is: http://www.engr.wisc.edu/ece/faculty/smith_james.html

And the article I'm talking about is: http://www.ece.wisc.edu/~jes/papers/hipc.00.pdf ( Instruction Level Distributed Processing )

The paper of course is more aimed at CPU architectures, although it is quite general. Let me quote parts of the article...

A microarchitecture paradigm which deals effectively with technology and application trends is Instruction Level Distributed Processing ( ILDP ). A processor following the ILDP paradigm consists of distributed functional units, each fairly simple with a very high frequency clock cycle ( for example, Fig. 1 )
( Figure 1 is on page 5 )

With high intra-processor communication delays, the number of instructions executed per cycle may level off or decrease when compared to today, but overall performance can be increased by running the smaller distributed processing elements at a much higher clock rate. The structure of the system and clock speeds have implications for global clock distribution. There will likely be multiple clock domains, possibly asynchronous from one another.
( it is fairly obvious the NV3x does not use such techniques, or at least not to any noticeable extent, which is one of the causes of its high heat and power consumption. Other parts of the paper explain some ways to remedy this )

Clustered dependence-based architectures are one important class of ILDP processors. The 21264 is a fairly recent example. In these microarchitectures, processing units are organized into clusters and dependent instructions are steered to the same cluster for processing
( the 21264 is an Alpha processor )
Before anyone asks, I *did* read all of the article.
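
( To make the "dependent instructions are steered to the same cluster" idea concrete, here's a minimal Python sketch of one plausible steering heuristic - follow a source operand's cluster, else pick the least-loaded one. The heuristic and all names are my own illustration, not a description of the 21264 or the NV3x. )

```python
# Minimal sketch of dependence-based cluster steering (illustrative only).
NUM_CLUSTERS = 2

def steer(program):
    """program: list of (dest_reg, src_regs) tuples, in program order."""
    producer = {}                # register name -> cluster that produced it
    load = [0] * NUM_CLUSTERS    # instructions assigned to each cluster
    assignment = []
    for dest, srcs in program:
        # prefer a cluster that already holds a source operand,
        # so dependent instructions avoid slow inter-cluster forwarding
        candidates = [producer[s] for s in srcs if s in producer]
        cluster = candidates[0] if candidates else load.index(min(load))
        producer[dest] = cluster
        load[cluster] += 1
        assignment.append(cluster)
    return assignment

# two independent dependence chains end up on separate clusters
prog = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])]
print(steer(prog))  # [0, 0, 1, 1]
```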


Sound familiar? Remember nVidia insisting that the VS is made of many small units? All the hype about the super high clock speeds, at least compared to traditional GPUs? The communication delays with registers ( although a lot got to be left to the imagination in that case ) ?

A lot of things COULD be explained if the NV3x was heavily influenced by ILDP. I'm not saying it *is*. I'm just saying it is, in my opinion, a likely explanation.

So, you might say "Yes, but since it's distributed, why aren't the fragment and vertex units shared?"
Well, to share, you need to have similar characteristics. You need the same ISA, mostly. And the NV3x does not have that: no texture lookup in the VS, no branching in the PS, ...

So, how to define the NV3x based on this? Well, it would be a pipeline, with stages of the pipeline using distribution and sharing, but without the possibility of sharing with other parts of the pipeline. Just an idea, really - let me stress again that it's all speculation.

I'm going to close with a quote from CMKRNL...
This part will also contain a completely revamped unified shading model. This means that both vertex and pixel shaders will share the exact same ISA and constructs.

I don't want this to become a NV40 speculation thread though. I want this to be related to the technical characteristics of the NV3x. So should this become too focused on the NV4x, I'd suggest opening a new thread maybe or something...


Feedback, comments, flames related to this?


Uttar

EDIT: Fixed title ( was ILPD, should have been ILDP ) and typos
 
While reading through the doc I started to think that it sounded a lot like what Transmeta were doing, and sure enough, the doc referred to them :).

It's pure speculation that nvidia is using what is described in the document, of course. I'm sure that the hardware guys at nvidia and ati have come up with a similar sort of thing in their architectures, although this does seem to fit nv3x pretty well. ATI seems to have just duplicated their units to get greater instruction level parallelism.

If they had something similar to this in the NV3x, it would explain the need for a high clock speed, and ATI's apparent "better than nvidia at the same clock speed" kinda thing...

NV40 will be an interesting chip if they can ramp clock speeds... :devilish:
 
I found this Anandtech article quite interesting in regards to Uttar's proposition/speculation. Here it is:
The number of pixel pipes has always been related to the number of pixels you could spit out every clock cycle; now, with the introduction of fully programmable pipelines that number can vary rather significantly depending on what sort of operations are carried out through the pipe.

Think of these pixel rendering pipelines, not as independent pathways, but as a farm of execution resources that can be used in any way. There are a number of adders that can be used either in parallel, or in series, with the result of one being fed to the input of another. If we were to characterize the number of pipelines by the number of pixels we could send through there in parallel then we could end up with numbers as low as 2 or as high as 16.

As you can see, there is sufficient hardware in the NV35 to guarantee a throughput of 8 pixels per clock in most scenarios, but in older games (e.g. single textured games) the GPU is only capable of delivering 4 pixels per clock. If you correctly pack the instructions that are dispatched to the execution units in this stage you can yield significantly more than 8 pixel shader operations per clock. For example, in NVIDIA's architecture a multiply/add can be done extremely fast and efficiently in these units, which would be one scenario in which you'd yield more than 8 pixel shader ops per clock.

It all depends on what sort of parallelism can be extracted from the instructions and data coming into this stage of the pipeline. Although not as extreme of a case (there isn't as much parallelism in desktop applications), CPUs enjoy the same difficulty of characterization. For example, the AMD Athlon XP has a total of 9 execution units, but on average the processor yields around 1 instruction per clock; the overall yield of the processor can vary so much depending on available memory bandwidth and the type of data it's working on among other things.
It most definitely seems that the pixel units of NV35 are arranged in an array as are the vertex units.
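
To put a toy number on Anand's point, assume a hypothetical shared pool of 16 execution units, with each pixel consuming as many units as it has ops that clock ( whole pixels only ). The pool size and the model are just an illustration, not NV35 specs:

```python
POOL = 16  # hypothetical number of shared execution units

def pixels_per_clock(ops_per_pixel):
    # each pixel needs ops_per_pixel units this clock; leftovers sit idle
    return POOL // ops_per_pixel

for ops in (1, 2, 4, 8):
    print(ops, "ops/pixel ->", pixels_per_clock(ops), "pixels/clock")
# prints 16, 8, 4 and 2 pixels/clock: the "as low as 2 or as high as 16" range
```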
 
TTA

Think of these pixel rendering pipelines, not as independent pathways, but as a farm of execution resources that can be used in any way. There are a number of adders that can be used either in parallel, or in series, with the result of one being fed to the input of another.

This sounds a lot like TTA (Transport-Triggered Architecture). In this architecture there is just one instruction - MOV. You move the data between the units, the output of one is fed as input into another.
For more info:
http://www.byte.com/art/9502/sec13/art1.htm
http://homepage.ntlworld.com/hansydelm/move/move.html

One of the advantages of TTA in GPUs is that many pixels can be processed simultaneously in different stages of the TTA pipeline - more pixels for simple shaders, or, with a complex shader, the whole TTA unit can work on just one pixel at a time. Even in the latter case, different parts of the shader can be processed simultaneously - if these parts are independent of each other. TTA should also be simpler ( implementation- and transistor-count-wise ) than the current register-based architectures.

I don't think that NVidia uses TTA though. TTA does not fit very well with the current ASM-style shaders. For instance, you don't necessarily need registers ( though a register for holding temporary data can be implemented as a separate unit - in order to avoid stalling a functional unit ). HLSL or a very TTA-specific ASM shader language would be required. And NVidia engineers have opposed the "HLSL in the gl2 driver" approach, saying that the drivers should only handle ASM shaders and the HLSL should be a separate layer ( e.g. Cg ). This suggests that they are not even considering TTA.
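
For those unfamiliar with TTA, here's a tiny Python sketch of the MOV-only idea, assuming two functional units with an operand port and a trigger port ( everything here is illustrative - it says nothing about actual NV3x hardware ):

```python
# Transport-triggered execution: the only "instruction" is a move;
# writing a unit's trigger port is what fires the computation.
class FunctionalUnit:
    def __init__(self, op):
        self.op = op          # applied when the trigger port is written
        self.operand = None   # latched first input
        self.result = None    # output port, readable by later moves

    def mov_operand(self, value):   # MOV value -> operand port
        self.operand = value

    def mov_trigger(self, value):   # MOV value -> trigger port (fires unit)
        self.result = self.op(self.operand, value)

mul = FunctionalUnit(lambda a, b: a * b)
add = FunctionalUnit(lambda a, b: a + b)

# a*b + c expressed purely as data transports, no general register file
a, b, c = 2.0, 3.0, 4.0
mul.mov_operand(a)             # MOV a          -> mul.operand
mul.mov_trigger(b)             # MOV b          -> mul.trigger
add.mov_operand(mul.result)    # MOV mul.result -> add.operand
add.mov_trigger(c)             # MOV c          -> add.trigger
print(add.result)              # 10.0
```

Note how the multiplier's output is fed straight to the adder's input - exactly the "result of one being fed to the input of another" arrangement from the Anandtech quote.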
 
GeLeTo: Didn't have the time to read the articles, but it seems interesting.
As you say, however, it would make no sense for nVidia to refuse GL2's HLSL if they really did use that.

I'd like to refer again to Figure 1 on page 5 of the original paper.
When you think about it, it fits the "pool of units" thingy perfectly well too, but you'd still have storage.

My bet would be more on something like that, really. Actually, David Kirk once mentioned "32 [PS] shading units" - and using such a figure, I can easily get to that number ( 12 + 4 + 12 + 4 = 32 ) based on the NV3x's known performance:

- 12 FX12 ( Add & Multiply ) Processors
- 4 FP32 ( Add & Multiply ) Processors, dependent on the Add & Multiply FX ones
- 12 Cache Processors
- 4 FX12 ( Multiply only ) Processors, with no dedicated cache processors, needing to share them with the other FX12 units - that'd explain why the NV3x can only use them for independent instructions ( no idea if that makes sense ) - note that those units are useful for LRP.


Does that make sense, technically speaking? Hmm...


Uttar
 
The types of units are different in NV35, however. According to the Dawn benchmark, the difference in fps between mixed mode and fp16 on the NV30 (10 fps) is far greater than on NV35 (2 fps). This difference goes to show NV35 is, most probably, composed of only fp units. MDolenc specifically stated that integer hardware was removed from NV35, which pretty much rules out fx12 units. FP16 & 32 units are one and the same, with identical execution rates but varying memory access penalties.
 
Yes, but I don't think it's safe to talk about the NV35 yet.
There are a lot of things we know NOTHING about. For example, are those MUL-only units still there? Are the register usage problems smaller, or bigger? We know they're still there, but nothing else.

Is having some type of a mix between TTA and "traditional" ILDP possible?


Uttar
 
It seems the NV35 contains only 8 general fp processors (4 fp + 4 fp/tex), but is sometimes capable of a mov alongside the 2 general instructions per pipeline (4 parallel pipelines with 2 fp units each) on shaders only. Read the following from the FiringSquad 5900 Ultra preview for a reason why I stated the above (hopefully FiringSquad is reliable; they seem to be directly referencing an Nvidia comment):
FiringSquad.com said:
Besides the addition of UltraShadow technology, NVIDIA has improved upon its CineFX architecture first introduced with GeForce FX 5800. NVIDIA has optimized all stages of the pixel pipeline in CineFX 2.0. According to NVIDIA, CineFX 2.0 doubles the floating-point pixel shading power of GeForce FX 5800. This should result in a dramatic performance boost, as pixel shader programs can be executed more quickly.

NVIDIA claims this improvement gives them an edge over RADEON 9800 PRO, 3,600 floating-point shader ops in GeForce FX 5900 Ultra versus 3,040 in RADEON 9800 PRO.
They attain this number by multiplying the core clock by the number of fp pixel processors in the vpu.
9800pro=380MHz*8=3.040*10^9
5900ultra=450MHz*8=3.600*10^9

This is one reason I believe Nvidia states NV35 contains 32 processors. It consists of 8 sets of 4 fp32/16 units.
 
DaveBaumann said:
They forgot about texturing when calculating that though.
NVIDIA seems to *love* the idea of "peak IPC". They love it so much, in fact, that they always seem to forget to mention the IPC in general situations ;) Not like any other company ever did this, anyway :(

Oh, and NVIDIA forgot about ATI's 8 Vec + 8 Scalar architecture.
They sure wouldn't like to compare 3600 to 6080 ( 380MHz * ( 8 vector + 8 scalar ) co-issued ops = 6080 million ops/s ) :LOL:

Also, regarding the idea that they "forgot texturing", the figure might still be realistic, because the free MOVs can come in handy sometimes, I suppose.

But with ATI's scalar co-issue and NVIDIA's register usage performance hit, I must say that even though the NV35 is heavily superior to the NV30, it's still more on par with an R300 than an R350...


Uttar
 
I say the same, Uttar. With a possible ILDP architecture, it seems NV35 needs the 450 MHz speed for equal footing with R300 (not even R350), but it is a step more flexible and closer to the unified shading model. As a matter of fact, there should be no reason why the pixel units cannot be used as vertex units with the render-to-texture and render-to-vertex-array OpenGL extensions. Wouldn't such a move be possible?

NV35 can compute 8-12 shader ops/clock, but I doubt it can actually write 8 separate fp pixel color results to the framebuffer from the resulting values. Since its predecessors were composed of pipelined execution units, in series, it seems likely that NV35 maintains this tradition (confirmed by an Nvidia representative). One unit must serve as the input for the other (within each pipeline). For NV35, this translates into 4 writes per clock, regardless of the data type. I made a speculative analogy concerning how the approaches differ in a previous thread, which I modified in light of more recent information. Here it is:
Here is the didactic, mock scenario (in a world of conventional/stereotypical pipelines):

Let us say two vpu's (A & B) have equal clockspeeds. Vpu A has a fillrate of 1.5 while B has a fillrate of 3. Assuming a simple world, we can conclude that if vpu A has n pipelines, vpu B has 2n pipelines. Vpu A, however, has 2 texture units per pipeline while vpu B has 1 texture unit per pipeline (both are capable of loopback). Vpu A and B attain the same Gtexel rate: vpu A by pipelining its two units/pipe and vpu B through more extensive superscalar (8-way) execution with one texture unit per pipeline.

Now the benchmarks come rolling in. On benchmarks with extensive use of multitexturing (in multiples of 2), vpu A is put to use most efficiently: its texture units can be used simultaneously, and its resources are well employed. Vpu B will have to loop back to multitexture, although it will produce twice as many pixels when it outputs its final results, so the performance of vpu's A and B is relatively equal. When the heavy single-texturing benchmarks begin to arrive, vpu A struggles because it can only use one of its texture units per pipe and the extra logic is put to no use. Vpu B excels compared to A, because all its resources are employed efficiently and it has twice the amount of crucial resources that A does.

By equating textures to instructions and texture units to fp shader units:

We may observe NV35 suffering from the same fate endured by vpu A. NV35 may contain 2 fp shaders per pipeline, but if the benchmark's pixel shaders are of low instruction counts (and the instructions are independent), the vpu will incur a disadvantage. Each pixel will require less shading (fewer instructions), and unless those instructions come in multiples of two, NV35 will not effectively use its 2 fp units per pipeline. Each pipeline's resources will have a greater likelihood of being unemployed, as opposed to the other leading brand vpu, which can address more pixels simultaneously with a single fp shader per pixel (although it can handle a 3-component vector alongside a scalar) in shaders with instruction counts in multiples of 1 per pixel (assuming 4-component vectors).
If NV35 is indeed arranged in a 4*2 group of fp units (and I believe it is; MDolenc's contact mentioned it was capable of 3 fp ops per pipeline, for a total of 12 ops/clock), and per-pixel instructions come in multiples of two, the architecture will behave as if each pixel shader contains only 1 instruction; its output will not vary, because its throughput is 8 fragment instructions per clock. In such a case, one shader can output to the other while receiving the next set of instructions. I believe Anand put it the following way:
Anandtech said:
There are a number of adders that can be used either in parallel, or in series, with the result of one being fed to the input of another.
The adders would be arranged solely in parallel if only independent pixel instructions are present. If dependent instructions are present, the adders are arranged serially, per pipeline, and in parallel (a total of 4 "virtual" pipelines).

We may conclude, then, that shaders composed of short instruction counts, or those which break up into a small number of architectural ops, will receive little to no benefit from the increase in fp units in NV35, while dependent instructions should receive a significant boost (this is all clock for clock, with respect to NV30). Remember, R300/350 can use 8 texture units in conjunction with its 8 fp shaders; NV35 cannot - it either employs 8 fp shader units, or 4 fp shader units alongside 8 texture units (an fp shader in disguise).
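
As a toy model of the packing argument, assume each pipeline can issue up to its unit count in chained fp instructions per clock, and a pixel must finish all its instructions before being written out. The 4x2 and 8x1 layouts stand for the NV35-style and R300-style arrangements discussed above; the model is my own simplification, not vendor data:

```python
import math

def pixels_per_clock(pipelines, units_per_pipe, instrs_per_pixel):
    # clocks to shade one pixel in a single pipeline, then scale by pipe count
    cycles = math.ceil(instrs_per_pixel / units_per_pipe)
    return pipelines / cycles

for n in (1, 2, 3, 4):
    a = pixels_per_clock(4, 2, n)   # "vpu A": 4 pipelines, 2 fp units each
    b = pixels_per_clock(8, 1, n)   # "vpu B": 8 pipelines, 1 fp unit each
    print(f"{n} instr/pixel: A={a:.2f}  B={b:.2f} pixels/clock")
# 1 instr/pixel: A=4.00 B=8.00 -- odd counts leave A's second unit idle
# 2 instr/pixel: A=4.00 B=4.00 -- even counts put both of A's units to work
```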

As previously noted, register usage is bound to increase with two units per pipeline, as opposed to one. NV35 will encounter register penalties more often if the on-chip register counts are equal to those of NV30.
 