A bunch of questions.

CI

Newcomer
Read the NV40 preview, pondered it for a few days and came up with the following questions. Some are NV40-specific, some more general:

1) On the NV40, is it still a 4x64-bit crossbar memory interface?

2) Being a 4-quad architecture, must all 4 quads be working on the same triangle?

3) On a more general note, is it possible, instead of activating/deactivating whole quad blocks, to bind pipelines to quads more dynamically after determining which ones are defective, thereby increasing yield? So instead of, say, rejecting 2 out of 4 blocks in a 16-pipe architecture because of a defect in one pipe in each block, could the pipes be re-organised into 3 working blocks, grouping the 2 defective pipes together into a fourth, non-working block?

In the context of the NV40, I think this would mean foregoing the L1 texture cache and using only the shared L2 cache. Any other performance implications?

4) On geometry instancing, the examples so far (RTS units and asteroids) appear to be relatively low in polygon count (on the order of hundreds of vertices, I'd guess). Is it practical to do geometry instancing in, say, first-person shooters, so that there's a horde of 3000-polygon monsters running around? Is there any primary limitation, e.g. the vertex buffer size?

5) Last question: compared to VS 2.0 in software, is it possible/feasible to do VS 3.0 on the CPU, including texture lookups?
:?:

TIA!
 
To answer question #2, here is an excerpt from the relevant portion of DaveBaumann's NV40 preview:
With 16 pixel pipelines, NV40 can be rendering 4 quads at any one time. NV40 will dispatch quads from a triangle to each available quad pipeline in order until the triangle is fully dispatched to rendering quads, then quads from the next triangles will be dispatched as rendering quads become available.
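
In pseudo-C++ terms, that dispatch policy might look roughly like this (an illustrative sketch only; the types and names are hypothetical, not anything from NVIDIA):

#include <queue>
#include <vector>

struct Quad { int triangleId; int x, y; };          // a 2x2 block of pixels
struct QuadPipe { bool busy = false; Quad current{}; };

void dispatch(std::queue<Quad>& pendingQuads, std::vector<QuadPipe>& pipes)
{
    // NV40 has 4 quad pipes; each accepts one quad at a time. Quads are
    // handed to whichever pipe is free, so one triangle can occupy all four
    // pipes, and quads from the next triangle follow as pipes become available.
    for (QuadPipe& pipe : pipes) {
        if (!pipe.busy && !pendingQuads.empty()) {
            pipe.current = pendingQuads.front();    // quads arrive in triangle order
            pendingQuads.pop();
            pipe.busy = true;
        }
    }
}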
 
OK, I can answer 4 and 5 for you:
4. No, you are not limited to low-poly objects. You could just as well render a bunch of 3000-poly monsters, or even 30000-poly monsters. However, instancing is purely about doing something with one Draw(Indexed)Primitive call instead of 100 or more, and you only save CPU time. So it is best used for rendering thousands of small objects, since that case would generate a lot of CPU overhead with 1000 separate Draw(Indexed)Primitive calls. If you are rendering, say, ten 3000-poly objects, you are only making 1 call instead of 10, so there won't be much of an advantage: the CPU overhead of 10 Draw(Indexed)Primitive calls is still quite low.
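
For the curious, here is roughly what that looks like with the D3D9 instancing API (a hedged sketch; the buffer names and struct layouts are made up for illustration, and error checking is omitted):

#include <d3d9.h>

struct MeshVertex   { float pos[3]; float normal[3]; float uv[2]; };
struct InstanceData { float world[4][4]; };   // e.g. one world matrix per monster

void DrawMonsterHorde(IDirect3DDevice9* device,
                      IDirect3DVertexBuffer9* monsterVB,   // the 3000-poly mesh
                      IDirect3DVertexBuffer9* instanceVB,  // per-monster data
                      IDirect3DIndexBuffer9*  monsterIB,
                      UINT vertexCount, UINT triCount, UINT instanceCount)
{
    // Stream 0: the mesh, replayed once per instance.
    device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | instanceCount);
    device->SetStreamSource(0, monsterVB, 0, sizeof(MeshVertex));

    // Stream 1: advances by one element per instance.
    device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);
    device->SetStreamSource(1, instanceVB, 0, sizeof(InstanceData));

    device->SetIndices(monsterIB);

    // One call draws the whole horde -- this is where the CPU time is saved.
    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                 vertexCount, 0, triCount);

    // Restore the default (non-instanced) stream frequencies.
    device->SetStreamSourceFreq(0, 1);
    device->SetStreamSourceFreq(1, 1);
}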

5. It is possible to do VS 3.0 via CPU emulation, including texture lookups. However! ;) Textures that you want to use in your emulated VS 3.0 shaders have to be placed in system memory, which means you can't set them as render targets or use them as ordinary textures for the pixel shader. So you could just as well pack your texture data into the vertex buffer itself. You cannot ping-pong data between the pixel shader and the vertex shader (unless you do slow video-memory-to-system-memory copies).
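
To make the system-memory constraint concrete, a minimal D3D9 sketch (assuming software vertex processing; names are illustrative and error handling is omitted):

#include <d3d9.h>

void BindVertexTextureForEmulation(IDirect3DDevice9* device)
{
    IDirect3DTexture9* vsTexture = nullptr;

    // D3DPOOL_SYSTEMMEM so the CPU-side shader emulation can read it. A
    // system-memory texture can never be a render target, hence no
    // ping-ponging between pixel and vertex shaders without a slow copy.
    device->CreateTexture(256, 256, 1, 0, D3DFMT_A32B32G32R32F,
                          D3DPOOL_SYSTEMMEM, &vsTexture, nullptr);

    // Vertex textures bind to their own sampler slots, separate from the
    // pixel shader samplers.
    device->SetTexture(D3DVERTEXTEXTURESAMPLER0, vsTexture);
}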
 
CI said:
Read the NV40 preview, pondered it for a few days and came up with the following questions. Some are NV40-specific, some more general:

1) On the NV40, is it still a 4x64-bit crossbar memory interface?

TIA!

Yep, the crossbar has four partitions again :)

Rys
 
Can anyone make a comparison between the GPU and the CPU when fetching textures, e.g. efficiency, latency, pipeline stalls, etc.?
 
Ok. I'll answer 3:

CI said:
3) On a more general note, is it possible, instead of activating/deactivating whole quad blocks, to bind pipelines to quads more dynamically after determining which ones are defective, thereby increasing yield? So instead of, say, rejecting 2 out of 4 blocks in a 16-pipe architecture because of a defect in one pipe in each block, could the pipes be re-organised into 3 working blocks, grouping the 2 defective pipes together into a fourth, non-working block?

In the context of the NV40, I think this would mean foregoing the L1 texture cache and using only the shared L2 cache. Any other performance implications?

The main reason to use quads is that you can save a bunch of transistors for each quad pipe by doing the bookkeeping (like instruction fetching, decoding and dispatching) only once. Also, you save three quarters of the transistors of the datapaths between all the parts (and can use unified caches). The beauty of SIMD (Single Instruction, Multiple Data) is that you can essentially expand one full ALU into several for a much smaller number of transistors than using multiple independent ALUs, as long as they all do the same thing.

If you want to be able to link each pipe to a quad dynamically, you need each pipe to do all that bookkeeping by itself. At that point it would be much more efficient to forego quads altogether. But that would limit the number of pipes you can build with your transistors, thereby reducing throughput.

So, no, it cannot be done.
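
A toy software model of what I mean, with entirely hypothetical types, just to show where the saving comes from:

struct Instruction { int opcode; int src, dst; };   // made-up encoding
struct PixelState  { float reg[8]; };

void executeOnPixel(const Instruction& inst, PixelState& pixel)
{
    // ...per-pixel ALU work would go here...
}

void runQuad(const Instruction* shader, int length, PixelState quad[4])
{
    for (int i = 0; i < length; ++i) {           // fetched/decoded ONCE per quad...
        for (int p = 0; p < 4; ++p) {            // ...executed on all 4 pixels
            executeOnPixel(shader[i], quad[p]);  // (in parallel ALUs in hardware)
        }
    }
}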
 
DiGuru said:
The main reason to use quads is that you can save a bunch of transistors for each quad pipe by doing the bookkeeping (like instruction fetching, decoding and dispatching) only once. Also, you save three quarters of the transistors of the datapaths between all the parts (and can use unified caches)....

Interestingly then, there is a case to make for "octs", correct? An NV40 / R420 with "dual octs" would save significant transistors compared to the current 4-quad setup.

I suspect the primary reason for not going with octs, given the transistor budget advantage they have over quads, is:

* Less granularity with respect to disabling pipelines. No 12-pipe boards, giving fewer options for product lines and less flexibility in yield management.

Having said that, would you estimate that an "oct" (given the average polygon size of today's apps) would also take a significant performance hit vs. dual quads? In other words, do you think "octs" are out of the question in the future, and we'll just see more "quads"... or even a reversion to "duets" as polygon sizes decrease? (For a 24-pipe board of the future, would you predict 3 octs is as likely an approach as 6 quads?)
 
I think the only useful place you're going to go from quads is down to single pixels.

Given that even single-pixel polygons will in general lie on more than one screen pixel (subpixel accuracy and all that), quads are probably optimal until you get average polygon sizes much smaller than a pixel.

Someone's going to have to come up with much better ways to solve the aliasing problem before we start seeing source art with polygons that small, even if the cards are fast enough to do it.
 
ERP:
Mesh mending. (My term, it probably sucks. :))

I think that even when (if) single-pixel triangles become the norm, they'll mostly be used to model more detailed bumps and curvature. So you'll still have rather large surfaces (in pixels) with the same shader. If the setup can detect internal edges between two triangles with the same shader, then it could be possible to "mend" the mesh and run quads that straddle the triangle edge.

Doesn't 3Dlabs' P10 do something similar? I think it's capable of setting up as many as four triangles for its 8x8 pixel shaders. And I assume that the shading is perfectly efficient at internal edges between those triangles, even doing partial derivatives over the edges.

BTW, there might be an edge for TBDRs here, making it easier to do quads over triangle edges.
 
ERP said:
I think the only useful place you're going to go from quads is down to single pixels.

Given that even single-pixel polygons will in general lie on more than one screen pixel (subpixel accuracy and all that), quads are probably optimal until you get average polygon sizes much smaller than a pixel.

Someone's going to have to come up with much better ways to solve the aliasing problem before we start seeing source art with polygons that small, even if the cards are fast enough to do it.

Granularity is becoming a problem, yes. As z-culling becomes very efficient and polygons get smaller, we're approaching the point where a quad will become less than optimal. But can you make two single-pixel pipelines with the same number of transistors that make up a quad? That will be a close call, and you will lose almost HALF your throughput doing it.

Remember, most of the background of a scene uses either a very simple, fast shader or a small number of large polygons (or both). So, all in all, you need pretty complex scenes to make halving your maximum throughput viable.

Joe DeFuria said:
DiGuru said:
The main reason to use quads is that you can save a bunch of transistors for each quad pipe by doing the bookkeeping (like instruction fetching, decoding and dispatching) only once. Also, you save three quarters of the transistors of the datapaths between all the parts (and can use unified caches)....

Interestingly then, there is a case to make for "octs", correct? An NV40 / R420 with "dual octs" would save significant transistors compared to the current 4-quad setup.

I suspect the primary reason for not going with octs, given the transistor budget advantage they have over quads, is:

* Less granularity with respect to disabling pipelines. No 12-pipe boards, giving fewer options for product lines and less flexibility in yield management.

Having said that, would you estimate that an "oct" (given the average polygon size of today's apps) would also take a significant performance hit vs. dual quads? In other words, do you think "octs" are out of the question in the future, and we'll just see more "quads"... or even a reversion to "duets" as polygon sizes decrease? (For a 24-pipe board of the future, would you predict 3 octs is as likely an approach as 6 quads?)

Good question. I think that quads mostly just evolved: first the single pipeline was doubled, and later it was doubled again. And when doubling it another time, the chip makers opted for two separate quad pipes, even with the large increase in transistors.

There is a separate argument to make against very large SIMD pipelines: flexibility. When you take the step to programmable shaders, you want your pixel shader to be able to run a program for each individual pixel. That's what we see now, with the discussion about conditionals and branches in shader programs. But running multiple shaders that could be variable-length and/or need separate instructions executed per pixel within a quad pipe is very inefficient. That's why flow control is so hard to implement.
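
In software terms, the per-pixel branching problem looks like this (the path functions are hypothetical stand-ins for the two sides of a branch):

float pathA(float v) { return v * 2.0f; }   // stand-in shader work, side A
float pathB(float v) { return v + 1.0f; }   // stand-in shader work, side B

void branchOverQuad(float value[4], const bool takesA[4])
{
    // The quad executes in lockstep, so it must walk through BOTH sides of
    // the branch, masking pixels in and out; no ALU slot is ever reclaimed.
    for (int p = 0; p < 4; ++p)
        if (takesA[p])  value[p] = pathA(value[p]);   // pass 1: "then" side
    for (int p = 0; p < 4; ++p)
        if (!takesA[p]) value[p] = pathB(value[p]);   // pass 2: "else" side
}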

So, I don't think quads will grow to octs or larger. But at the same time I don't think they will be reduced to individual pixel pipes very soon.
 
ERP - that's true, but I think it's likely that quads are assigned to pixels statically, e.g. pixels (0,0), (1,0), (0,1), (1,1) always go to the same quad.

If this assumption is true, triangles smaller than a pixel probably occupy two quad pipelines quite often.

Basic -
That seems like the way to go. Also, I think there is a good case for operating on a triangle of fragments as opposed to a quad, since that is all you need for partial derivatives. It also means you can skip interpolation for triangles where each of the 3 fragments corresponds directly to a vertex.
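
The derivative argument in a quick sketch (the layout and names are mine):

// Fragment layout within a quad:   [0]=(x,y)    [1]=(x+1,y)
//                                  [2]=(x,y+1)  [3]=(x+1,y+1)
float ddx(const float v[4]) { return v[1] - v[0]; }  // finite difference along x
float ddy(const float v[4]) { return v[2] - v[0]; }  // finite difference along y
// Fragment [3] is never referenced: an L-shaped triangle of fragments
// (0, 1, 2) really is all you need for the partial derivatives.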

Also, I think that at that point triangles are mostly going to be generated via some form of subdivision in conjunction with RenderMan-style displacement mapping, and that worrying about triangles much smaller than a pixel is not very useful...
 
Also, on a related note... does anyone know of an open source RenderMan renderer? I'm asking just because it would be interesting to see how geometric subdivision is handled...

Regards,
Serge
 
There are multiple ways to get flow control working within a quad pipe. The easiest is to examine the shader to see if there is a finite solution, then just calculate all possible results and pick the correct one for each pixel. If a loop/branch can produce an infinite (or very large) number of instructions when you have to calculate everything (for example a DO..UNTIL or a repeating CASE statement), then you drop the quad down to a single pixel pipe.

If we can believe the sparse hints that have been dropped, this is what NV4x does. This method can be very costly for throughput if used wrongly. The best use for it is static branching or breaking out of loops, with shaders that take the same path for each pixel in a quad where possible.
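
The "calculate everything, then select" idea in a sketch (C++ here, with hypothetical branch bodies; in shader terms it is roughly what the cmp instruction does):

float pathA(float x) { return x * 2.0f; }   // hypothetical branch bodies
float pathB(float x) { return x + 1.0f; }

float flattenedBranch(float x, bool cond)
{
    float ifTrue  = pathA(x);        // both results are always computed...
    float ifFalse = pathB(x);
    return cond ? ifTrue : ifFalse;  // ...then a per-pixel select picks one
}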

Another way of doing flow control depends on having two full-fledged ALUs per quad. When you encounter a branch, you set two pixels aside and calculate the remaining two, each within its own ALU. That gives the best throughput for simple models while allowing very complex flow control.

The throughput will be reduced to half the pipes and half the ALUs per pipe when you use branching, but it will be consistent. That allows developers to use branching freely without having to worry about hard-to-trace performance loss in specific cases, as with the first method.

Therefore, I think it is a fairly safe bet that the next generation (R5xx/NV5x) will have two full ALUs per quad and will split into two individual pipes with one full ALU each when per-pixel flow control is used.
 
So does that mean it's up to the game devs whether it boosts or hampers performance?
Or is it all up to Nvidia's hardware implementation and/or drivers whether it speeds things up or slows them down?
 
jolle said:
So does that mean it's up to the game devs whether it boosts or hampers performance?
Or is it all up to Nvidia's hardware implementation and/or drivers whether it speeds things up or slows them down?

Hard to say right now, but I think the drivers will play a major role for the NV4x. It will almost certainly work best with GLSL, which allows the drivers to seek the optimal path. For developers I think it will mean: use with care.

But it does provide a way to do things that cannot be done otherwise right now. Breaking out of loops, and shaders that can calculate a different surface depending on some conditionals, are very nice to have.
 
It does sound slightly troubling, what with the history of the NV3x: devs had to spend a lot of extra time to make it work as it should, for totally different reasons of course...
But if this ends up being up to the game developers, it sounds as if it could potentially end up the same way... or be skipped entirely perhaps, if the time spent can't be justified by the result.

Well, hopefully Nvidia will have taken measures not to run into the same sort of situation they did with the NV3x...
 