Performance hit for 64bit and 128bit rendering?


BRiT
26-Sep-2002, 08:09
Does anyone know the performance hit the ATI-9700 takes for rendering in 64bit or 128bit color, in comparison to 32bit modes? Or is this something that's impossible to see or test without DX9 or a higher version of OpenGL? Is there no option in the control panel to "force" 64bit or 128bit rendering?

Not that this will matter with games being a year or two away, but just curious for now.

--|BRiT|

Kristof
26-Sep-2002, 08:42
Well, you need to explain what exactly you mean by 64 bit and 128 bit rendering.

First of all there is internal accuracy in floats. There are 4 components (RGBA) and these can be 16bit to 32bit floats. I believe that the Radeon uses a fixed format of 24 bits, so the internal accuracy always sits at 4x24=96 bits. This is not something you can turn on or off, and even if you could there wouldn't be any difference, since the whole internal structure is based on these bit counts.

If, on the other hand, you are talking about framebuffer (external memory) bit depth then there is nothing you can force. The framebuffer only supports 16 and 32 bit formats. If you want to go to 64 or 128 bit you end up in the Multiple Render Target (MRT) case, which you cannot "force" on an old game since these games are designed to render to only one buffer. This is also only available in DX9 (probably also in OGL2.0 and maybe 1.4 using a new extension)... The only thing you could maybe force is the 10-10-10-2 external format (rather than 8-8-8-8), which I believe is what Matrox allows you to do.

Does this make sense or did I misread your actual question?

On the other hand you do have an RGBA-32-32-32-32f format, but you cannot display that AFAIK; it's only to be used for render to texture and then read back in to do some funky effects. Again this would be difficult/impossible to force...

Dave Baumann
26-Sep-2002, 08:44
The ideal would be nothing.

Because DX9 chips can handle so many internal passes, any game that actually uses the high precision should also be coded to utilise the internal passes, so there will be no performance lost since, ultimately, it will still be outputting to a 32bit buffer.

However, if the application requires more passes than the hardware can handle internally (i.e. for R300, > 16 textures, > 1024 vertex instructions, > 160 pixel shader instructions etc.) then the intermediate results should be stored in offscreen floating point buffers, which is what will cause the performance drop because of the extra bandwidth needed.

LeStoffer
26-Sep-2002, 09:54
I presume that your question is whether the pixel pipelines take a dive in performance when rendering in 64/128 bit instead of 32 bit?

Let's take the R9700. Its 8 pixel pipelines can apply 3 instructions per clock (one texture look-up, one texture address operation, and one colour operation). Notice that there is no mention of colour precision! (It's not 1 colour operation per clock in 32 bit but 1 operation per 2 clocks in 128 bit, etc.) The compute power to handle the higher colour precision is already in place (and one pixel is still just one pixel to the pixel pipeline).

Okay, let's say that you are not going to fetch textures that are different (higher res/bit) than if you were using 32 bit and you're not using the framebuffer during rendering. Then the only difference should be that you get higher internal precision without a performance dive. (Remember the KyroII and 32 bit internal rendering :wink: )

Simon Templar
26-Sep-2002, 14:57
What do you make of this?

http://www.beyond3d.com/forum/viewtopic.php?topic=2562&forum=9

Specifically this sentence:
"Where the NV30 differs from the 9700 is that instead of a single 128bit colour call, it can perform two 64bit operations in the same time, providing what Adam called a 'sweet spot' between performance and colour"

I could have posted it in that thread but I think it may apply to this.

I could be totally wrong but could you in hardware have a pipeline that could pack multiple pixels in a single pipeline instruction?

If this is the case you could theoretically have a 128bit 4 Pipeline card producing 8 64-bit pixels? Have I gone off the deep end?

If it scales this far, does it scale to 32 bit pixels as well? Something like 16 pixels on a 4 pipeline card?

SIMP? Single Instruction Multiple Pixel? This approach sure would maximize hardware efficiency but is it even possible?

sireric
26-Sep-2002, 16:25
A general answer would be that 64b or 128b writes would take 2x or 4x more bandwidth than 32b writes. And for an application that's bandwidth limited, then you could expect a comparable performance hit.

However, the question is, when would an application write 64b or 128b pixels? Generally, that will be for a multi-pass algorithm. Given that a multi-pass algorithm will have substantial shader operations per pass, the VPU will generally not be bandwidth limited but actually shader limited. In those cases, the 64b or 128b writes will have no performance impact. You could imagine that with something like 3x4=12 instructions (assuming a balanced scalar, rgb vector and texture blend), you could completely hide the 128b writes.

Later
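sireric's argument can be put as a back-of-the-envelope model: a pass is limited by whichever takes longer per pixel, the shader instructions or the framebuffer write. The sketch below assumes a hypothetical pipe that issues 3 instructions per clock and can write 8 bytes per clock. Both rates and both function names are illustrative assumptions, not R300 specifications.

```python
def clocks_per_pixel(instructions, bytes_written,
                     instr_per_clock=3.0, bytes_per_clock=8.0):
    """Return (shader_clocks, write_clocks) for one pixel in one pass.

    instr_per_clock and bytes_per_clock are hypothetical per-pipe rates.
    """
    shader_clocks = instructions / instr_per_clock
    write_clocks = bytes_written / bytes_per_clock
    return shader_clocks, write_clocks


def limiting_factor(instructions, bits_per_pixel, **rates):
    """Name the bottleneck for one pass: 'shader' or 'bandwidth'."""
    shader, write = clocks_per_pixel(instructions, bits_per_pixel / 8, **rates)
    return "shader" if shader >= write else "bandwidth"


if __name__ == "__main__":
    for bits in (32, 64, 128):
        for instructions in (3, 12):
            print(bits, "bit write,", instructions, "instructions ->",
                  limiting_factor(instructions, bits))
```

Under these made-up rates a 3-instruction pass becomes bandwidth limited at 128 bit, while a 12-instruction pass stays shader limited, which is the shape of the "3x4=12 instructions" remark.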

antlers
26-Sep-2002, 17:08
I think the interesting question is:

Will it be able to do twice the instructions per cycle at 64 bit?

Or will it be limited to half the instructions per cycle at 128 bit?

What distinguishes this from a glass half-full or half-empty proposition is that we have something to compare it to in the 9700. If it doubles the 9700's shader performance (say clock for clock) when running at 64 bit, nVidia will certainly have something to market with (even if the impact on existing or near future games is negligible).

LeStoffer
26-Sep-2002, 18:23
I think the interesting question is:

Will it be able to do twice the instructions per cycle at 64 bit?

Or will it be limited to half the instructions per cycle at 128 bit?

What distinguishes this from a glass half-full or half-empty proposition is that we have something to compare it to in the 9700. If it doubles the 9700's shader performance (say clock for clock) when running at 64 bit, nVidia will certainly have something to market with (even if the impact on existing or near future games is negligible).

Hmmm yes, but then I have to ask how the NV30 compares to the R9700's ability to do the 3 instructions per clock (texture look-up, texture address operation and colour operation)? It's an unfair question of course :wink: but if they use extra transistors to handle 2 colour ops in one clock [in 64 bit] then they might have cut a corner elsewhere....

What if the NV30 can do 2 colour operations (in 64 bit) but cannot do both a texture look-up and address operation per clock? Dunno....

psurge
26-Sep-2002, 19:46
LeStoffer, CineFX does not make a distinction between texture address and color operations.

Now honestly I'm not entirely sure what a texture address instruction is, but I'm guessing it's the LOD calculations and texture coordinate interpolation? Or...?

fresh
26-Sep-2002, 19:51
The ideal would be nothing.

Because DX9 chips can handle so many internal passes, any game that actually uses the high precision should also be coded to utilise the internal passes, so there will be no performance lost since, ultimately, it will still be outputting to a 32bit buffer.

However, if the application requires more passes than the hardware can handle internally (i.e. for R300, > 16 textures, > 1024 vertex instructions, > 160 pixel shader instructions etc.) then the intermediate results should be stored in offscreen floating point buffers, which is what will cause the performance drop because of the extra bandwidth needed.

What about reading from a 128bit image? That would still cost performance, no matter how much you try to avoid multiple passes. Especially if you want to do bi/trilinear filtering.

antlers
26-Sep-2002, 20:20
What about reading from a 128bit image? That would still cost performance, no matter how much you try to avoid multiple passes. Especially if you want to do bi/trilinear filtering.

Are you talking about for things like cube mapping? Or do you think source art is ever going to be 128 bits?

DemoCoder
26-Sep-2002, 20:23
High dynamic range light maps will be floating point as will rendertargets from previous passes.

pcchen
26-Sep-2002, 20:24
I am not sure, but I remember that DX9 does not allow filtering for FP readings. I don't know if this has been changed. Of course, with PS 2.0 you can still do that manually.

Xmas
26-Sep-2002, 21:16
I am not sure, but I remember that DX9 does not allow filtering for FP readings. I don't know if this has been changed. Of course, with PS 2.0 you can still do that manually.
It's a hardware limitation of both R300 and NV30. And I doubt this will change soon, considering the amount of transistors required, and the connection to the cache would have to be four times as wide. And you often want more than linear filtering for high precision data.

alexsok
26-Sep-2002, 21:18
It's a hardware limitation of both R300 and NV30. And I doubt this will change soon, considering the amount of transistors required, and the connection to the cache would have to be four times as wide. And you often want more than linear filtering for high precision data.

Are you sure about that?

I recall reading Humus's post where he said that R300 allows that... not sure though and I could be very much mistaken here...

Xmas
26-Sep-2002, 21:27
It's a hardware limitation of both R300 and NV30. And I doubt this will change soon, considering the amount of transistors required, and the connection to the cache would have to be four times as wide. And you often want more than linear filtering for high precision data.

Are you sure about that?

I recall reading Humus's post where he said that R300 allows that... not sure though and I could be very much mistaken here...
Yes, I am sure. The only difference between them IIRC is that R300 supports mipmaps and cube maps with FP. But no filtering.

alexsok
26-Sep-2002, 21:30
Yes, I am sure. The only difference between them IIRC is that R300 supports mipmaps and cube maps with FP. But no filtering.

O.k then, thx a lot for the clarification! :)

One more thing, the following is taken from NV30 OpenGL specs (take note of the things I bolded):


• Floating-point texture limitations
• Only GL_NEAREST filtering
• Try fragment programs for filtering
• Hint: summed area tables
• Corollary: No mipmap filtering
• Only GL_TEXTURE_RECTANGLE_NV texture target
• No 1D, 2D, 3D, or cube map floating-point texture targets
• Requires fragment programs to use
• Conventional texture environment & register combiners cannot use float textures

Think it's a good idea?

LeStoffer
26-Sep-2002, 22:08
LeStoffer, CineFX does not make a distinction between texture address and color operations.

Now honestly I'm not entirely sure what a texture address instruction is, but I'm guessing it's the LOD calculations and texture coordinate interpolation? Or...?

I think Mr. Baumann has to help us out here. If his NDA is a limitation then at least give us a hint! :wink:

Edit: Or maybe sireric could clear this one up?

Basic
26-Sep-2002, 22:19
I believe the 64bit and 128bit formats will be used mainly for pixel aligned buffers or unfiltered textures. It would be useful for HDR textures, but doing the filtering by hand is just too messy.

Summed area tables are an efficient way to flush all the benefits of floating point textures down the toilet. Cancellation galore! They will also give bad texture cache efficiency. And generating the textures dynamically will be rather inefficient. But at least NV30 will have the DDX/DDY instructions to make PS-filtering possible (though inefficiently).

R300 has mipmapping, which is good. But I don't see any direct way to know which mipmap level is currently used, so finding the offset to nearby texels to do bi-/trilinear by hand is difficult.
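The cancellation complaint about summed area tables can be made concrete with a small experiment. The sketch below builds a table over a constant 256x256 texture twice, once in double precision and once with every addition rounded to 32-bit floats, then recovers a single texel from four table lookups. The texture size, texel value and helper names are assumptions chosen for illustration. The subtraction in `box_sum` is done in double precision, so any error shown comes from the stored table alone.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def build_sat(texture, rnd=lambda v: v):
    """Build a summed area table with one row/column of zero padding.

    rnd is applied after every addition, so passing f32 emulates a table
    accumulated and stored in 32-bit floats.
    """
    height, width = len(texture), len(texture[0])
    sat = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0.0
        for x in range(width):
            row_sum = rnd(row_sum + texture[y][x])
            sat[y + 1][x + 1] = rnd(sat[y][x + 1] + row_sum)
    return sat


def box_sum(sat, x0, y0, x1, y1):
    """Sum of the texels in the inclusive rectangle (x0, y0)-(x1, y1)."""
    return (sat[y1 + 1][x1 + 1] - sat[y0][x1 + 1]
            - sat[y1 + 1][x0] + sat[y0][x0])


if __name__ == "__main__":
    size, texel = 256, 1000.25
    texture = [[texel] * size for _ in range(size)]
    exact = build_sat(texture)          # double precision reference
    single = build_sat(texture, f32)    # emulated 32-bit float table
    corner = size - 1
    print("double :", box_sum(exact, corner, corner, corner, corner))
    print("float32:", box_sum(single, corner, corner, corner, corner))
```

Near the far corner the table entries are around 65 million, where adjacent 32-bit floats are 4 apart, so the recovered texel cannot land on 1000.25 even though the texture itself stores that value exactly.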

DemoCoder
26-Sep-2002, 22:38
• No 1D, 2D, 3D, or cube map floating-point texture targets


Think it's a good idea?

Think what's a good idea??

Anyway, I'm pretty sure there is a typo in the above quote since that would rule out every FP rendertarget. I'm pretty sure at least 2D FP texture targets are allowed.

In any case, fixed-function filtering of floating point textures doesn't really make sense, since in most cases, these aren't going to be image data, and linear filtering is just going to be wrong.

Hyp-X
26-Sep-2002, 22:39
CineFX does not make a distinction between texture address and color operations.

No, it doesn't.
But it does have 32bit and 16bit float support, with the clear hint that, while texture address calculations require 32bit, 16bit is sufficient for color.
It's also clear that 16bit is faster on the nv30, but by how much remains to be seen.

I'm certain that they'll use the 16bit path for executing pre2.0 pixel shaders, for performance.

Hyp-X
26-Sep-2002, 22:53
• No 1D, 2D, 3D, or cube map floating-point texture targets


Anyway, I'm pretty sure there is a typo in the above quote since that would rule out every FP rendertarget. I'm pretty sure at least 2D FP texture targets are allowed.

No it's not a typo.

They define a new type of rendertarget: GL_TEXTURE_RECTANGLE_NV
It's the only one that's supported.
And it is not compatible with GL_TEXTURE_2D.

IIRC, texture rectangle was defined on GF3 for its (rather crippled) non-pow2 texture support.
Microsoft crippled the DX8 definition for not allowing mip-mapping, even though there are other cards that support it.
In DX8.1 they further specified that the feature doesn't work with DXTC after discovering that the GF3 doesn't support it...

IMHO cubemap support would be the most important thing.
Ok, it could be done with texture address calculation in the PS, but I doubt it will have comparable speed with a hardware implementation.

Basic
26-Sep-2002, 22:55
While GL_TEXTURE_RECTANGLE_NV is a two-dimensional texture, it's still considered a different texture target than the usual 2D textures.

Edit:
Hyp-X beat me to it.

DemoCoder
26-Sep-2002, 23:07
When he said "No... 2d...texture target" I thought he was saying none at all.

As for 1D/3D, you could always render a 2D texture and use pixel shaders to treat it as 1D or 3D.

fresh
27-Sep-2002, 01:06
As for 1D/3D, you could always render a 2D texture and use pixel shaders to treat it as 1D or 3D.

That's gonna be such a pain in the ass. Especially the lack of cubemap support. Thank God the 9700 supports it.

DemoCoder
27-Sep-2002, 01:23
Going from R1 -> R2 or R3 -> R2 is trivial, not really a pain. I don't see why people are fretting over this. Besides pre-computed lookups, how many games are going to render-to-texture into a floating point cube map? Just doing that once would kill your fillrate and bandwidth.


It's not really a good thing that the R300 supports it if the NV30 doesn't, since if people can code around it, they will, rather than supporting an obscure R300 feature directly. Developers are going to support the lowest common DX9 denominator.

You've got to come up with a really compelling use case before you can start worrying about it.

Grall
27-Sep-2002, 02:38
This is so weird... Suddenly it's a GOOD thing that NV30 lacks a feature. :o You'd expect it would be better to HAVE a feature than NOT have it, no matter how unlikely you think it is people will actually use it.

I'm pretty sure people wouldn't mount such a completely compelling case if the roles were reversed. Not accusing Democoder of Nvidia bias, but it strikes me as a fairly twisted line of reasoning...

*G*

Nagorak
27-Sep-2002, 02:45
It's not really a good thing that the R300 supports it if the NV30 doesn't, since if people can code around it, they will, rather than supporting an obscure R300 feature directly. Developers are going to support the lowest common DX9 denominator.

You've got to come up with a really compelling use case before you can start worrying about it.

Hmmm...last time I checked the R9700 was first to market. Although any feature may not receive widespread support unless it works across all DX9 hardware, I think you worded the above in reverse.

It's not a good thing that NV30 doesn't support this feature (rather than it being bad that ATI supported it??? :o)

Anyway it probably won't make too much difference.

Kristof
27-Sep-2002, 08:02
LeStoffer, CineFX does not make a distinction between texture address and color operations.

Now honestly I'm not entirely sure what a texture address instruction is, but I'm guessing it's the LOD calculations and texture coordinate interpolation? Or...?

The API does make a distinction between texture addressing and arithmetic operations (don't call them color ops since you might not be working with colors). The API contains pure maths instructions like add, mul, dp3, pow, etc. These are arithmetic and have no relation to textures. You also have specific texture sampling instructions that say: go and take this texture coordinate and this LOD and this texture and deliver me the filtered result in register X. Such an instruction is a texture addressing instruction. I assume what R300 has is an Arithmetic Unit and a Texture Sampling Unit that work in parallel, so you can issue a texture sample fetch and a maths instruction in the same clock. Examples of texture addressing instructions would be tex, texld, texbem, texcrd, etc.

The issue with NVIDIA's 16 and 32 bit modes is that 16bit might not be enough and 32bit might be too much. The question is: can we compare benchmarks run in 16 or 32 bit mode with the results of the R300, which runs everything at 24 bit? If we compare 16 with 24 then NV lacks accuracy (but probably has plenty of speed); if we compare 32 with 24 then ATI lacks accuracy and NV probably lacks speed. I can see another 16 versus 32 war and both parties will partially be right... it will be apples versus oranges from the start. Those who think 16 bit or even 24 bit is good enough will probably be disappointed... it's just so easy to blow up accuracy :(
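The benefit of issuing a texture fetch and a maths instruction in the same clock can be sketched with an idealised count. The function below is a hypothetical illustration that ignores dependencies between instructions and texture latency; it only contrasts a unit that overlaps the two instruction streams with one that runs them back to back.

```python
def shader_clocks(texture_ops, arithmetic_ops, co_issue=True):
    """Idealised clocks per pixel for a shader with the given op counts.

    With co-issue the texture and arithmetic units overlap, so the longer
    stream sets the time; without it the two streams are serialised.
    Dependencies and fetch latency are deliberately ignored.
    """
    if co_issue:
        return max(texture_ops, arithmetic_ops)
    return texture_ops + arithmetic_ops


if __name__ == "__main__":
    print("co-issue:", shader_clocks(4, 8))
    print("serial  :", shader_clocks(4, 8, co_issue=False))
```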

K-

Nagorak
27-Sep-2002, 08:37
The issue with NVIDIA's 16 and 32 bit modes is that 16bit might not be enough and 32bit might be too much. The question is: can we compare benchmarks run in 16 or 32 bit mode with the results of the R300, which runs everything at 24 bit? If we compare 16 with 24 then NV lacks accuracy (but probably has plenty of speed); if we compare 32 with 24 then ATI lacks accuracy and NV probably lacks speed. I can see another 16 versus 32 war and both parties will partially be right... it will be apples versus oranges from the start. Those who think 16 bit or even 24 bit is good enough will probably be disappointed... it's just so easy to blow up accuracy :(

K-

I doubt it will be as extreme as 16 bit vs 32 bit. After all 32 bit is a hell of a lot more accurate than 16 bit. The difference between 24 and 32 bit is much smaller.

Kristof
27-Sep-2002, 09:00
I doubt it will be as extreme as 16 bit vs 32 bit. After all 32 bit is a hell of a lot more accurate than 16 bit. The difference between 24 and 32 bit is much smaller.

The difference will be 8 bits both ways. Say B3D benchmarks NV30 in 16bit mode versus R9700 using 24 bits and publishes the results. ATI will be unhappy, pointing out that their results have 8 bits more accuracy and that the comparison is invalid. If B3D benchmarks NV30 in 32bit mode versus R9700 using 24 bits and publishes the results, then NVIDIA will be unhappy, calling the results unfair since they have 8 bits more accuracy... Do you see the huge problem that's going to appear? Actually this will only be an issue if R9700 is faster than NV30 in 32 bit mode but slower than NV30 in 16 bit mode. If R9700 matches NV30 in 16 bit mode then ATI has won this battle; if R9700 matches NV30 in 32 bit mode then NVIDIA has won, since 16 bit is expected to be "faster". The problem is if they are roughly the same speed...

Remember the old days when it was 22bit equivalent versus 24 bit? :) I can see the discussion: "I can't see the difference between 24 bit and 16 bit floats in game X...".

K-

DemoCoder
27-Sep-2002, 10:13
Yep, it's coming. Put on your flame suits.

psurge
27-Sep-2002, 20:48
Kristof,

Thanks for the info. Since this is the case, I would be extremely surprised if you couldn't co-issue arithmetic and texture ops on NV30.

A texture unit should probably be able to support a fairly large number of "in-flight" loads to hide latency, so not being able to run arithmetic ops while a texture op is executing seems like it would lead to a serious performance handicap...

Regards,
Serge

LeStoffer
27-Sep-2002, 21:28
Question can we compare benchmarks run at 16 or 32 bit mode with the results of the R300 which runs all at 24 bit ? If we compare 16 with 24 then NV lacks accuracy (but probably has plenty of speed), if we compare 32 with 24 then ATI lacks accuracy and NV probably lacks speed.

Okay then, I need some help here:

If R9700 is built to do one arithmetic op per cycle, I would assume that this op would be done in one cycle regardless of whether we're rendering in 16 or 24 bit, if textures etc. are the same and we only need to write the final pixel to the framebuffer. Right?

So unless NV30 can do two arithmetic ops per cycle in 16 bit rendering, I don't understand the difference in performance.

Likewise: if NV30 can do one arithmetic op per cycle in 32 bit (with R9700 doing the same in 24 bit) the performance should be the same, everything else being equal of course.

Sorry for being confused, but I don't understand how a spec of one arithmetic (color) op per cycle would change inside the pixel pipeline just because you shift rendering depth (but keep texture/framebuffer bandwidth requirements the same).

DemoCoder
27-Sep-2002, 22:00
It's not rendering depth you're shifting, it's pipeline precision.

e.g.

float x = 1.0f + 2.0f;

vs

double x = 1.0 + 2.0;

Like a CPU, the NV30 can supposedly run the lower precision ops faster.
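To see what "pipeline precision" means in numbers, the sketch below rounds one value to floats with different mantissa widths. The 10-bit and 23-bit widths are the standard half and single precision layouts; the 16-bit mantissa used for the R300's 24-bit format is an assumption, and `quantize` is a hypothetical helper that models only the mantissa (exponent range, denormals and infinities are ignored).

```python
import math


def quantize(value, mantissa_bits):
    """Round value to a float with the given number of stored mantissa bits.

    Only the mantissa is modelled; exponent range, denormals and
    infinities are ignored.
    """
    if value == 0.0:
        return 0.0
    fraction, exponent = math.frexp(value)   # value == fraction * 2**exponent
    significant = mantissa_bits + 1          # stored bits plus the implicit one
    scaled = round(fraction * (1 << significant))
    return math.ldexp(scaled, exponent - significant)


# Stored mantissa bits per format; the fp24 entry is an assumption.
FORMATS = {"fp16": 10, "fp24": 16, "fp32": 23}

if __name__ == "__main__":
    coordinate = 1000.3  # e.g. a texel position inside a 2048-wide texture
    for name, bits in FORMATS.items():
        stored = quantize(coordinate, bits)
        print(name, stored, "error", abs(stored - coordinate))
```

With 10 mantissa bits the value lands on the nearest half texel, which is why the thread treats 16-bit floats as adequate for colour but not for texture address calculations.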

fresh
28-Sep-2002, 00:37
Going from R1 -> R2 or R3 -> R2 is trivial, not really a pain. I don't see why people are fretting over this. Besides pre-computed lookups, how many games are going to render-to-texture into a floating point cube map? Just doing that once would kill your fillrate and bandwidth.

Sure I'm not going to render into an fp cube map, but I'd like to be able to use a precomputed HDR cube map without jumping through hoops. It's annoying, that's all.

Just like the PS2 doesn't have backface culling. Writing the code for it is easy, but it's a hell of a lot more convenient to have it in hardware.

multigl2
28-Sep-2002, 07:04
Yep, it's coming. Put on your flame suits.

not if DX9 games are as fast coming as DX8 games have been :lol: :lol: :lol:

Hyp-X
28-Sep-2002, 11:41
Yep, it's coming. Put on your flame suits.

not if DX9 games are as fast coming as DX8 games have been :lol: :lol: :lol:

Well, don't expect DX9 games until cheap (<= $100) DX9 cards are available from all major manufacturers...
I leave it up to you to guess when that will come.
But it won't be next year.

On the other hand I'm pretty sure nv30 will use the 16bit path to run <=DX8.1 programs (or <PS2.0 shaders to be more specific).

I think the only game where it'll be an advantage is Doom3.
Other games don't use shader programs long enough to make a difference.

demalion
28-Sep-2002, 15:30
? I thought it was pretty clear that by this time next year there would be sub $100 DX 9 parts...?

nVidia has made claims hinting at this, and ATi has stated it pretty strongly with their aggressive roadmap. I'm even pretty sure that that exact aim has been put forth by ATi as a goal to accomplish... I mean a 9500 non pro is likely to be selling for < $180 within the year, isn't it?

BRiT
07-Dec-2002, 19:30
Just wanted to revisit this topic and make note that in under 3 months, there are indeed 9500 non-pro cards selling for under $160 with 9500 pro cards for under $180. Hats off to ATI for this achievement!

Demirug
08-Dec-2002, 01:06
Over the last few days I have taken a closer look at the new PS API from DX9.

IMO the R300 PS (2.0) is only an improved version of the R200 PS (1.4).

We all remember that PS 1.4 works this way:

1. Texturereads (6)
2. Arithmetic (8)
phase
3. Texturereads (6)
4. Arithmetic (8)

PS 2.0 does not have a phase command, but I believe it is still there, only in the driver. In PS 2.0 there is a limit of 4 dependent reads. That's the reason why I believe that R300 works in this manner:

1. Texturereads
2. Arithmetic
phase
3. Texturereads
4. Arithmetic
phase
5. Texturereads
6. Arithmetic
phase
7. Texturereads
8. Arithmetic
phase
9. Texturereads
10. Arithmetic

If you use fewer than 4 dependent reads the number of steps is reduced.

Steps 1, 3, 5, 7 and 9 run in the texture unit
Steps 2, 4, 6 and 8 run in the address processor
Step 10 runs in the color processor

The only thing I am not sure about is the number of pixels that run at the same time in one pixel pipeline. IMO 3 is the best way to make sure that the PS is used as much as possible. But even in this case each of these pixels can block another one.

At the moment I am doing some theoretical tests with this configuration.

SA
08-Dec-2002, 01:10
By the way, just thought it was a good time to mention that deferred rendering tilers have a big advantage over IMRs for high precision floating point pipelines (as most here already know).

The major problem with floating point textures and pixel pipelines at the moment is that they do not implement what have become the essential features of the integer pipeline (texture filtering, antialiasing, etc.). This substantially limits their usefulness as a general pipeline for realtime 3d graphics. They are, however, very useful for offline work where filtering can be done in (shader) software and performance is not an issue. They are also currently useful for some specialized realtime pixel shaders.

With integers it is easy to implement a large amount of computation directly in hardware in parallel. Floating point requires far more transistors to implement. As a result, it makes no sense to dedicate all those transistors to fixed functions. Allocating all those transistors to floating point only makes sense if you make the pipeline programmable.

The problem is that while you might do dozens of operations in parallel in a single clock per pipe for integer vectors, you are generally limited to as little as one (or a few) for floating point.

This was a lesson learned long ago for other types of hardware. As soon as you increase the sophistication of the data types and the computations, direct hardware implementation is no longer feasible. You must rely on software. As soon as you do this the entire hardware design picture changes dramatically. It becomes extremely important to maximize frequency, data availability, and software computation parallelism to squeeze the most out of all those transistors in each pipe. Since you can perform far fewer operations per pipe per cycle, you must increase the number of cycles and the number of pipes dramatically.

In the future, I expect to see much more emphasis on frequency and the number of pipelines than in the past.

Mintmaster
08-Dec-2002, 01:46
I believe the 64bit and 128bit formats will be used mainly for pixel aligned buffers or unfiltered textures. It would be useful for HDR textures, but doing the filtering by hand is just too messy.

There are also high precision integer formats up to 64-bit (maybe even 128-bit). 2 or 4 16-bit integers per texture sample can be very useful, and they can be filtered just fine. Because they are higher precision, you can still do a scale & bias in the pixel shader to get HDR, especially if you store HDR values non-linearly in the texture and linearize them in the pixel shader. I remember hearing (maybe at Siggraph?) about storing all values as s = 1/(h+1) in a texture, then expanding them as h = 1/s - 1 in the pixel shader (h is the HDR value, s is the texture sample). Another method is the RGBScale method, mentioned in ATI's paper, although I like the first method better.
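The s = 1/(h+1) mapping is easy to try out. The sketch below stores an HDR value in an unsigned 16-bit integer channel and expands it again; the function names and the choice of 16 bits per channel are assumptions for illustration, and a real shader would do the decode step per sample after filtering.

```python
def encode_hdr(h, bits=16):
    """Map an HDR value h >= 0 to an unsigned integer texel via s = 1/(h+1)."""
    max_code = (1 << bits) - 1
    return round(max_code / (h + 1.0))


def decode_hdr(code, bits=16):
    """Invert the mapping as the pixel shader would: h = 1/s - 1."""
    if code == 0:
        return float("inf")   # s = 0 corresponds to an unbounded value
    max_code = (1 << bits) - 1
    return max_code / code - 1.0


if __name__ == "__main__":
    for h in (0.0, 0.5, 3.0, 50.0, 1000.0):
        restored = decode_hdr(encode_hdr(h))
        print(h, "->", encode_hdr(h), "->", restored)
```

The encoding spends most of its codes on small values, so the round trip is tight near zero and gets coarser as h grows, which is the usual trade for covering a large range with a fixed-point channel.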

Mintmaster
08-Dec-2002, 02:16
Welcome to the forum Demirug!

IMO the R300 PS (2.0) is only an improved version of the R200 PS (1.4).
Agreed. PS 1.4 was quite a big step from PS 1.3, although technically you could pretty much convert anything from PS 1.4 to multiple PS 1.3 passes. ATI was saying in their PS 1.4 papers that this is where we're headed with PS 2.0, and they weren't unjustly promoting R200 by doing so.


Steps 1, 3, 5, 7 and 9 run in the texture unit
Steps 2, 4, 6 and 8 run in the address processor
Step 10 runs in the color processor

I don't really think there is a separate address and colour processor, not even in R200. I think the GF3/4 worked the way you described, though, which is why they had arithmetic operations like texdp3 or texm3x3. I believe there is only one shader processor that does arithmetic, and a texture unit that fetches texture samples based on values from either texture coordinates or the shader.

An interesting point here is that it seems R300 is relatively evolutionary from R200, as compared to NV30 being a significant architectural overhaul from NV25/NV20. Maybe this is one of the reasons that R300 came so much sooner.


The only thing I am not sure about is the number of pixels that run at the same time in one pixel pipeline. IMO 3 is the best way to make sure that the PS is used as much as possible. But even in this case each of these pixels can block another one.

I'm not sure what you mean by this, but I'm pretty sure only one pixel "runs at the same time" per pipeline. Between "phases", a group of pixels is buffered while the dependent texture lookup is performed, and then those pixels are injected back into the pipeline in a cycle. If that's what you mean, then the number of pixels is probably something like 10 or 20 per pipe, but this is a pure guess. You need enough time to ride out the latency of a texture fetch.

Demirug
08-Dec-2002, 03:22
Steps 1, 3, 5, 7 and 9 run in the texture unit
Steps 2, 4, 6 and 8 run in the address processor
Step 10 runs in the color processor

I don't really think there is a separate address and colour processor, not even in R200. I think the GF3/4 worked the way you described, though, which is why they had arithmetic operations like texdp3 or texm3x3. I believe there is only one shader processor that does arithmetic, and a texture unit that fetches texture samples based on values from either texture coordinates or the shader.

http://www.digit-life.com/articles2/radeon/ati-r300.html

If you look at the picture of the pixel shader you can see that there are 3 parts:

-Texture Unit
-Address Processor
-Color Processor

OK, maybe the "Address Processor" is doing something different, but why would they call it "Color Processor" if it calculates texture coordinates for dependent reads?



The only thing I am not sure about is the number of pixels that run at the same time in one pixel pipeline. IMO 3 is the best way to make sure that the PS is used as much as possible. But even in this case each of these pixels can block another one.

I'm not sure what you mean by this, but I'm pretty sure only one pixel "runs at the same time" per pipeline. Between "phases", a group of pixels is buffered while the dependent texture lookup is performed, and then those pixels are injected back into the pipeline in a cycle. If that's what you mean, then the number of pixels is probably something like 10 or 20 per pipe, but this is a pure guess. You need enough time to ride out the latency of a texture fetch.

Yes, I mean something like this, but I think that "run" is the right term if I count the TMU as part of the pixel shader. I do not think that a pixel shader buffers many pixels, because the register set is not small at all. And we know that the NV30 can work on 32 pixels with 8 pipes, which means that every pipe works on 4 pixels at the same time.

IMO while the TMUs fetch the samples for one pixel, the processor works on another.

Something like this:

T A C
1
2 1
3 2 1
4 3 2

T = TMU
A = Address Processor
C = Color Processor
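
The staggered diagram above can be reproduced with a tiny toy model. This is only a sketch: it assumes every pixel spends exactly one cycle in each of the three units, which the post itself says is not true of real hardware, and the function name `occupancy` is made up for illustration.

```python
# Toy model of the T / A / C diagram: one cycle per unit per pixel (assumption).
STAGES = ["T", "A", "C"]  # TMU, Address Processor, Color Processor


def occupancy(num_pixels, num_cycles):
    """Return, for each cycle, the pixel number sitting in each stage (None if empty)."""
    table = []
    for cycle in range(num_cycles):
        row = []
        for depth in range(len(STAGES)):
            # Pixel n enters the first stage on cycle n-1 and moves one stage per cycle.
            pixel = cycle - depth + 1
            row.append(pixel if 1 <= pixel <= num_pixels else None)
        table.append(row)
    return table


if __name__ == "__main__":
    print(" ".join(STAGES))
    for row in occupancy(num_pixels=4, num_cycles=4):
        print(" ".join(str(p) for p in row if p is not None))
```

With unequal stage times the rows would no longer line up this neatly, which is the complication mentioned right after the diagram.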

In reality it is trickier, because the time a pixel spends in each part of the pixel shader may differ. As I said before, I have played a little with this model, and with a model of the NV30 pipeline too.

arjan de lumens
08-Dec-2002, 03:47
Yes, I mean something like this, but I think "run" is the right term if I count the TMU as part of the pixel shader. I do not think a pixel shader buffers many pixels, because the register set is not small at all. And we know that the NV30 can work on 32 pixels with 8 pipes, which means every pipe works on 4 pixels at the same time.

Where does this information come from?

NV30's pixel shaders use 32 registers per pixel, each 128 bits wide. Such a register file need not take more than ~33K transistors, so I would find it perfectly feasible to implement 30+ such register files in each pixel pipe (which would still amount to only ~6% of the total transistor count in NV30). If you want an efficient pipeline, you pretty much need a pipeline that can actually absorb texture cache misses (several clock cycles + 1 full DRAM latency, in sum ~50-80 ns = 25-40 cycles @ 500 MHz) without stalling.
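
A quick back-of-the-envelope check of the numbers in this post, as a sketch only. The per-file transistor count, the "30+" files per pipe and the 500 MHz clock come from the post; the 8 pipes and the 125 million transistor total are taken from claims made elsewhere in the thread and are assumptions here, not confirmed specs.

```python
REGS_PER_PIXEL = 32              # from the post
BITS_PER_REG = 128               # from the post
TRANSISTORS_PER_FILE = 33_000    # "~33K" from the post
FILES_PER_PIPE = 30              # "30+" from the post
PIPES = 8                        # assumption: the "8 Pipes" claim in the thread
CHIP_TRANSISTORS = 125_000_000   # assumption: NV30 total quoted in the thread
CLOCK_MHZ = 500                  # from the post


def bits_per_register_file():
    """Storage bits in one per-pixel register file."""
    return REGS_PER_PIXEL * BITS_PER_REG


def register_file_share():
    """Fraction of the chip spent on all register files in all pipes."""
    total = TRANSISTORS_PER_FILE * FILES_PER_PIPE * PIPES
    return total / CHIP_TRANSISTORS


def ns_to_cycles(latency_ns):
    """Convert a latency in nanoseconds to clock cycles at CLOCK_MHZ."""
    return latency_ns * CLOCK_MHZ / 1000


if __name__ == "__main__":
    print(bits_per_register_file(), "bits per register file")
    print(f"{register_file_share():.1%} of the chip")
    print(ns_to_cycles(50), "to", ns_to_cycles(80), "cycles of miss latency")
```

Under these assumptions the register files come to a bit over 6% of the chip and the 50-80 ns miss latency to 25-40 cycles, which lines up with the figures in the post.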

Demirug
08-Dec-2002, 04:37
Yes, I mean something like this, but I think "run" is the right term if I count the TMU as part of the pixel shader. I do not think a pixel shader buffers many pixels, because the register set is not small at all. And we know that the NV30 can work on 32 pixels with 8 pipes, which means every pipe works on 4 pixels at the same time.

Where does this information come from?

NV30's pixel shaders use 32 registers per pixel, each 128 bits wide. Such a register file need not take more than ~33K transistors, so I would find it perfectly feasible to implement 30+ such register files in each pixel pipe (which would still amount to only ~6% of the total transistor count in NV30). If you want an efficient pipeline, you pretty much need a pipeline that can actually absorb texture cache misses (several clock cycles + 1 full DRAM latency, in sum ~50-80 ns = 25-40 cycles @ 500 MHz) without stalling.

The information is from an interview with David Kirk.
http://www.extremetech.com/article2/0,3973,710384,00.asp

This is the same technique that was used before in the vertex shader. But in that case NVIDIA uses only 3 stages. This is done to make sure that every vertex operation can be handled in virtually one cycle per vertex. Some ops need up to 3 cycles, but the vertex shader works on 3 vertices at the same time, so one operation finishes every cycle.

In the new pixel shader it looks like they use 4 stages and a decoupled TMU to reduce the dependence on latency.

Your calculation misses something: the input and output registers are part of the pixel register set too. The only thing I ask myself is: "Is it a good idea to use such a huge part of the die only to reduce the effect of texture-fetch latency?"

arjan de lumens
08-Dec-2002, 06:10
The information is from an interview with David Kirk.
http://www.extremetech.com/article2/0,3973,710384,00.asp

OK, although I still think that 32 pixels in flight sounds awfully little for an 8-pipeline part, given that it would allow for only 4 cycles latency per TMU lookup before fillrate takes a nosedive - TMUs that are *that* fast are hard to make, even if you don't count the texture cache miss latency.
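
The "only 4 cycles" figure follows from dividing the pixels in flight by the number of pipes. A minimal sketch of that reasoning, assuming each pipe starts one new pixel per cycle (an assumption, as are the function names):

```python
def covered_latency_cycles(pixels_in_flight, pipes):
    """Cycles of lookup latency that can be hidden before a pipe runs out of pixels."""
    return pixels_in_flight / pipes


def pixels_needed(latency_cycles, pipes):
    """Pixels that must be in flight to hide a given latency on every pipe."""
    return latency_cycles * pipes


if __name__ == "__main__":
    print(covered_latency_cycles(32, 8), "cycles covered by 32 pixels on 8 pipes")
    for latency in (25, 40):
        print(latency, "cycles of latency need", pixels_needed(latency, 8), "pixels in flight")
```

Covering the 25-40 cycle miss latency estimated earlier in the thread would take a few hundred pixels in flight under this model, not 32.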

This is the same technique that was used before in the vertex shader. But in that case NVIDIA uses only 3 stages. This is done to make sure that every vertex operation can be handled in virtually one cycle per vertex. Some ops need up to 3 cycles, but the vertex shader works on 3 vertices at the same time, so one operation finishes every cycle.

In the new pixel shader it looks like they use 4 stages and a decoupled TMU to reduce the dependence on latency.

Sounds reasonable.

Your calculation misses something: the input and output registers are part of the pixel register set too. The only thing I ask myself is: "Is it a good idea to use such a huge part of the die only to reduce the effect of texture-fetch latency?"

Input and output registers add about 50% to that register count. 12 or so million transistors out of 125 doesn't sound *that* large to me, given that it can be done in rather compact SRAM. And for what happens when the pipeline is not optimized like that, it looks like the SiS Xabre chip is a rather good example: it needs a huge positive LOD bias (=> blurry textures) to reach competitive performance levels, and its dependent texturing/EMBM performance is *horrible*.

Chalnoth
08-Dec-2002, 08:30
OK, although I still think that 32 pixels in flight sounds awfully little for an 8-pipeline part, given that it would allow for only 4 cycles latency per TMU lookup before fillrate takes a nosedive - TMUs that are *that* fast are hard to make, even if you don't count the texture cache miss latency.

Yes, I believe I remember reading not too long ago that modern processors have hundreds of pipeline stages (300-400 was what I thought I remember reading). This makes perfect sense to me, as the static nature of 3D graphics just makes such huge pipelines an obvious optimization.

Demirug
08-Dec-2002, 12:23
OK, although I still think that 32 pixels in flight sounds awfully little for an 8-pipeline part, given that it would allow for only 4 cycles latency per TMU lookup before fillrate takes a nosedive - TMUs that are *that* fast are hard to make, even if you don't count the texture cache miss latency.


If you can build a TMU that can handle a lookup with a 4-cycle latency, it will be fast enough for any case, because the pixel needs 4 cycles in the pixel pipeline too. The other thing I believe is that the NV30 driver tries to optimize a shader program. "Fetch early, use late" is the right way to do this. The number of cycles between the start of the fetch and the use of the sample is the key. If you can do something else after starting the lookup, before you need the sample, you will win some cycles. Each operation gives you 4 additional cycles. OK, if you are limited by the TMUs because you use a high AF mode, the pixel shader processor will not run at maximum performance, but this is an old story and nothing new.



Input and output registers add about 50% to that register count. 12 or so million transistors out of 125 doesn't sound *that* large to me, given that it can be done in rather compact SRAM. And for what happens when the pipeline is not optimized like that, it looks like the SiS Xabre chip is a rather good example: it needs a huge positive LOD bias (=> blurry textures) to reach competitive performance levels, and its dependent texturing/EMBM performance is *horrible*.

Sure, but we still need the same number of transistors for the 32 pixels that are active in the pipelines. IMO it is better to use these transistors, if you have them left over, as texture cache to improve the hit rate.

Demirug
08-Dec-2002, 12:25
OK, although I still think that 32 pixels in flight sounds awfully little for an 8-pipeline part, given that it would allow for only 4 cycles latency per TMU lookup before fillrate takes a nosedive - TMUs that are *that* fast are hard to make, even if you don't count the texture cache miss latency.

Yes, I believe I remember reading not too long ago that modern processors have hundreds of pipeline stages (300-400 was what I thought I remember reading). This makes perfect sense to me, as the static nature of 3D graphics just makes such huge pipelines an obvious optimization.

The number of stages seems a little bit high. AFAIK it is more something in the range of 20-40.

Althornin
08-Dec-2002, 18:14
Yeah, from what i remember in reading my computer architecture book, modern processors have more like 7-20 pipeline stages. Else, a branch mispredict would cost even more.
Quick googling seems to say that AthlonXP has 10 pipeline stages, and Pentium4 has 20. IIRC, pentium2 has 7.

RoOoBo
08-Dec-2002, 18:16
OK, although I still think that 32 pixels in flight sounds awfully little for an 8-pipeline part, given that it would allow for only 4 cycles latency per TMU lookup before fillrate takes a nosedive - TMUs that are *that* fast are hard to make, even if you don't count the texture cache miss latency.

Yes, I believe I remember reading not too long ago that modern processors have hundreds of pipeline stages (300-400 was what I thought I remember reading). This makes perfect sense to me, as the static nature of 3D graphics just makes such huge pipelines an obvious optimization.

The number of stages seems a little bit high. AFAIK it is more something in the range of 20-40.

I found long ago a reference for 600 - 800 pipeline stages here:

http://www.ce.chalmers.se/undergraduate/D/EDA425/lectures2002/gfxhw.pdf

Not that I think that is a fully credible source, because I don't remember any NVidia document talking about hundreds of stages in GeForce 3. Maybe it is counting all the stages in all the pipelines (for example 20 for each pixel pipe: 4x20 = 80).

I think around 50 - 100 is a more reasonable number ;).

Now we could start a poll about the number of stages in a GPU :lol: .

RoOoBo
08-Dec-2002, 18:21
Yeah, from what i remember in reading my computer architecture book, modern processors have more like 7-20 pipeline stages. Else, a branch mispredict would cost even more.
Quick googling seems to say that AthlonXP has 10 pipeline stages, and Pentium4 has 20. IIRC, pentium2 has 7.

Well, but a GPU is a lot different from a CPU. Shader units (the ones that are CPU-like) should have around 5 - 8 stages, taking into account short latency instructions (integer, branches) and long latency instructions (FP). But that doesn't take into account the stages in the fixed function hardware, which can be a lot more.

Perhaps by 600 - 800 the guy in the pdf I posted before means the number of cycles it takes a vertex or a triangle to get from the start to the end, not the real number of stages in the hardware pipeline. That could include the fact that a single triangle can produce multiple pixels.

BTW, the 20 stages in the P4 are just from Trace Cache to Write Back (from what I recall). That doesn't include instruction fetch and decode from L2, or commit, so the actual number is perhaps a bit larger (though Intel doesn't seem to like to talk about it). It is a hyperpipelined, superscalar, out-of-order architecture, and two-way multithreaded now too ;). I still recall the last PACT with three papers discussing how many stages a CPU could have before becoming inefficient (around 40 in some of the papers).

Demirug
08-Dec-2002, 19:07
RoOoBo, yes, I think that the 600 - 800 is the number of cycles needed from start to end.

Basic
08-Dec-2002, 19:09
A GPU isn't like one CPU; it's instead several CPUs connected in series with FIFOs between them (VS, triangle setup, possibly some rasterization stuff, PS). If the 600-800 stages number is right, then I'd guess that it refers to the sum of all pipes, plus the depth of the FIFOs. This does not by any means mean that the pipeline depth for just the PS is anywhere near these numbers.

I assume that most GPUs are heavily dependent on texture coherency, even in dependent reads, so that precaching of textures is possible. The PS should get its data directly from the caches in most cases, and cache misses should be rare enough that the high cost doesn't matter that much.

It would be interesting to test dependent texture reads where the accesses to the second texture are rather random, and with mipmapping turned off on that texture. If it doesn't take a deep performance plunge, I'd be impressed.

psurge
08-Dec-2002, 19:22
Demirug - from the extreme-tech article, it doesn't sound like there are necessarily 8 decoupled pipelines at all. Kirk simply states that there are 32 functional units, and that 8 pixels can be written per cycle.

http://www.extremetech.com/article2/0,3973,710352,00.asp

Pipes don't mean as much as they used to. In the [dual-pipeline] TNT2 days you used to be able to do two pixels in one clock if they were single textured, or one dual-textured pixel per pipe in every two clocks, it could operate in either of those two modes. We've now taken that to an extreme. Some things happen at sixteen pixels per clock. Some things happen at eight. Some things happen at four, and a lot of things happen in a bunch of clock cycles four pixels at a time. For instance, if you're doing sixteen textures, it's four pixels per clock, but it takes more than one clock. There are really 32 functional units that can do things in various multiples. We don't have the ability in NV30 to actually draw more than eight pixels per cycle. It's going to be a less meaningful question as we move forward...[GeForceFX] isn't really a texture lookup and blending pipeline with stages and maybe loop back anymore. It's a processor, and texture lookups are decoupled from this hard-wired pipe.



Something else that's interesting (from the same article):

nVidia declined to provide more details about the actual sample patterns. Tamasi and Kirk would only say that the sample positions and patterns have changed, and that they are non-rectangular.

Demirug
08-Dec-2002, 20:09
psurge, the information about the decoupled TMUs was in another article, but I cannot find it at the moment.

RoOoBo
08-Dec-2002, 20:46
RoOoBo, yes, I think that the 600 - 800 is the number of cycles needed from start to end.

In fact, instructions in a CPU can also remain in the pipe for hundreds of cycles: a load that misses in L2, for example. And now think about a chain of dependent loads that remain in the instruction queue ;). So you could say that a CPU has *cough* hundreds *cough* of stages. But it doesn't make any sense ...

Althornin
08-Dec-2002, 21:52
Well, but a GPU is a lot different from a CPU. Shader units (the ones that are CPU-like) should have around 5 - 8 stages, taking into account short latency instructions (integer, branches) and long latency instructions (FP). But that doesn't take into account the stages in the fixed function hardware, which can be a lot more.

Perhaps by 600 - 800 the guy in the pdf I posted before means the number of cycles it takes a vertex or a triangle to get from the start to the end, not the real number of stages in the hardware pipeline. That could include the fact that a single triangle can produce multiple pixels.

BTW, the 20 stages in the P4 are just from Trace Cache to Write Back (from what I recall). That doesn't include instruction fetch and decode from L2, or commit, so the actual number is perhaps a bit larger (though Intel doesn't seem to like to talk about it). It is a hyperpipelined, superscalar, out-of-order architecture, and two-way multithreaded now too ;). I still recall the last PACT with three papers discussing how many stages a CPU could have before becoming inefficient (around 40 in some of the papers).

I understand that; my post was in relation to Chalnoth's comment on modern processors having hundreds of stages.

Chalnoth
09-Dec-2002, 06:22
Yeah, from what i remember in reading my computer architecture book, modern processors have more like 7-20 pipeline stages. Else, a branch mispredict would cost even more.
Quick googling seems to say that AthlonXP has 10 pipeline stages, and Pentium4 has 20. IIRC, pentium2 has 7.

Sorry, I meant modern graphics processors. Of course you're right, CPUs have far fewer.

JohnH
09-Dec-2002, 19:02
To maintain peak fillrate, you want to be able to absorb the latency of your pipeline plus the latency of a cache miss, and this can add up to a lot of clocks. In a pre-DX8 processor this latency could be absorbed by adding a large FIFO between the texture fetch unit and the colour processing unit. However, as soon as you allow a texture to be looked up using the result of another lookup and/or some arithmetic, that solution no longer works: you have to either duplicate execution units (with FIFOs in between, i.e. think phase marker) or make your temp regs as deep as the latency you want to absorb. The latency of a typical GPU texturing pipeline (e.g. issue->cache->filter->PS) is probably of the order of 30-50 clocks, but a cache miss can add 100s of cycles to this, which potentially makes temp registers very expensive!

Incidentally, modern CPUs generally grind to a halt on a 'seen' cache miss, e.g. try reading from uncached memory (such as the FB; this typically yields around 5MB/s!). This is one of the things hyperthreading helps with, i.e. try to use my processing resource for something else while I'm stalled waiting for external resources to come back...

Hmm, seem to be rambling.

John.

RoOoBo
10-Dec-2002, 08:13
The latency of a typical GPU texturing pipeline (e.g. issue->cache->filter->PS) is probably of the order of 30-50 clocks, but a cache miss can add 100s of cycles to this, which potentially makes temp registers very expensive!

I wonder how many cycles the max latency for a texture load is. I don't think it is as big as a hundred cycles. You have to remember that video memory is way faster than main memory and that the GPU is a lot slower than the CPU. The CPU to main memory relative clock is around 10-20 CPU cycles per bus cycle (the address clock is still 133 MHz even if the data can be read/written at double speed). The GPU to video memory ratio is 1:1. That is without counting AGP texture reads, but I don't think anyone takes those into account while designing the optimum path in the pipeline (the latency would be way too large).

Of course a texture read can be more than a 32-bit data read when using filtering (or textures with more precision), but the data burst (reading a whole line) is faster than the address setup, and if the texture is stored correctly all the data is fetched in a few cycles with just one or two accesses.

The fixed cost of the filtering algorithm shouldn't be that large: bilinear is 3 adds (32-bit, as I think FP filtering isn't supported yet) and one division/multiplication (probably something way more efficient than a real division), trilinear is 9 adds and 3 div/muls. AF could be more expensive, but that is expected in any case (it isn't supposed to be single cycle). I don't think it is that much larger than the FMAD latency.
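
For reference, here is the usual textbook way to write bilinear and trilinear filtering as a chain of linear interpolations. This is only a sketch of the math on single-channel values; it makes no claim about how any real texture unit implements it or how many adders it uses, and the function names are made up for illustration.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b with weight t in [0, 1]."""
    return a + (b - a) * t


def bilinear(t00, t10, t01, t11, fx, fy):
    """Blend four neighbouring texels: two lerps along x, then one along y."""
    top = lerp(t00, t10, fx)
    bottom = lerp(t01, t11, fx)
    return lerp(top, bottom, fy)


def trilinear(near, far, fx, fy, fz):
    """Blend bilinear results from two adjacent mip levels (4 texels each)."""
    return lerp(bilinear(*near, fx, fy), bilinear(*far, fx, fy), fz)


if __name__ == "__main__":
    print(bilinear(0.0, 4.0, 8.0, 12.0, 0.5, 0.5))
    print(trilinear((0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0), 0.3, 0.7, 0.25))
```

Written this way, bilinear is three lerps and trilinear is seven, so the arithmetic itself is small; the open question in the thread is how much pipeline latency surrounds it.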

Chalnoth
10-Dec-2002, 08:24
Well, one thing that you do need to consider is that the memory controller will need to cache reasonably large blocks of data for there to be any kind of reasonable efficiency in using the memory bandwidth. So while the memory may only take a few clocks to access, the memory controller may require more so that it can make the best possible use of the total memory bandwidth.

Still, I don't really know how this can translate to 100's of clocks...

Kristof
10-Dec-2002, 09:32
RoOoBo, it's all about pipelining and latency. When you do a texture read you are going through the following steps:

PS -> Texture Address Generator -> Cache <=> External Video Memory
Cache -> Texture Filter -> PS

You need to remember that all these stages will be pipelined so they can execute a full operation per clock. Even though they deliver a result each clock, this does not mean that an operation is being worked on for only one clock. Each of these blocks is a set of pipeline stages, and each stage adds another clock of latency. So even though filtering might not sound expensive, because it's pipelined it's going to add quite a few stages. Also don't forget that there are other things involved in all of this, for example decompression: DXT1 is not going to magically decompress itself, and lower bit depth formats will get upsampled to the internal format (conversion to float, for example). All of this takes time and stages and adds latency.

If you have a true miss then you'll need to access external memory, and you don't just read the texels you need; you'll read a burst, which is a whole block of texture data. Just reading this will take quite a few cycles. But the real issue is that your memory interface is doing a lot of things (not just texture fetches), so all of this goes into a buffer with several stages (you need to wait your turn; you can't break, for example, a vertex buffer burst just because you need a texture). So just because you have a miss does not mean you can stop all the other data that needs to get onto or off the chip, and then you'll most likely have a page break (another wad of cycles lost) or some read/write turnaround. Another thing to remember is that you have a lot of data that needs to flow into and out of the cache: think 8 pipelines all supporting trilinear, that's 8x8=64 texels flowing at 32 bits each, so that's 64*4 bytes or 256 bytes of data flowing out of your cache per clock. That's not a trivial thing to do, so the cache itself is quite a complex beast. Anyway, all in all you might be quite surprised how quickly latency adds up in a complex design.
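
The cache traffic figure in the last paragraph is straightforward to reproduce. A small sketch: the 8 pipelines, 8 texels per trilinear pixel and 32-bit texels come from the post, while the 325 MHz clock used for the GB/s conversion is purely an illustrative assumption.

```python
def cache_bytes_per_clock(pipes=8, texels_per_pixel=8, bytes_per_texel=4):
    """Bytes the texture cache must deliver each clock at full trilinear rate."""
    return pipes * texels_per_pixel * bytes_per_texel


def cache_gb_per_second(bytes_per_clock, clock_mhz):
    """Convert a per-clock byte count to GB/s for a given core clock."""
    return bytes_per_clock * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    per_clock = cache_bytes_per_clock()
    print(per_clock, "bytes per clock")
    # 325 MHz is a hypothetical clock chosen only to give a feel for the scale.
    print(f"{cache_gb_per_second(per_clock, 325):.1f} GB/s at an assumed 325 MHz")
```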

JohnH
10-Dec-2002, 10:11
The other thing to bear in mind is that the cache won't necessarily get the bus the instant it requests it as other ports may be using it e.g. FB R/W, DAC, Vertex fetch etc all of which will be trying to supply as large a burst as possible in order to maintain efficiency, add in the time to burst fill the cache itself and then page breaks and you start having to allow for an awful lot of clocks.

There's also a pile of latency sitting behind the cache in the form of arbitration logic, synchronisers (for async mem) and the memory interface. This all adds up very quickly...

John.

PSarge
10-Dec-2002, 17:57
Lots of very true statements
Exactly!!!

Still, I don't really know how this can translate to 100's of clocks...
Consider that there might be 10-20 different areas of the chip all wanting access to memory bandwidth (Video Displays, Writes from host loading geometry/textures, DMA engines, texture reads, z reads, z writes, framebuffer reads, framebuffer writes, and then add in all the clever stuff :) ) Then think that each time a request for data goes out of a cache it has be arbitrated and then wait for all the other higher priority requests to complete before it gets hold of the memory bus.

Time just flies while you're having fun 8)

JohnH
10-Dec-2002, 18:18
Actually, going back to the original subject of this thread: neither the 9700 nor the NV30 will have sufficient memory BW to support 128-bit surfaces at their max fillrate. E.g. at the 9700's 2.6GPix/s fill rate, 128BPP requires ~40GB/s just for data write-out alone, i.e. >2x peak memory BW. Add texturing to this, and well...
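
The write-out figure can be checked with one multiplication, and the same function answers the original 32 vs 64 vs 128 bit question for pure framebuffer writes. A sketch only: the 2.6 GPix/s fill rate is from the post, while the ~19.8 GB/s peak memory bandwidth is an assumption not stated in the thread (a 256-bit bus at 310 MHz DDR).

```python
FILL_RATE_GPIX_S = 2.6        # 9700 fill rate quoted in the post
ASSUMED_PEAK_BW_GB_S = 19.8   # assumption: not stated anywhere in the thread


def write_bandwidth_gb_s(fill_rate_gpix_s, bits_per_pixel):
    """GB/s needed if every pixel is written out exactly once at peak fill rate."""
    return fill_rate_gpix_s * bits_per_pixel / 8


if __name__ == "__main__":
    for bpp in (32, 64, 128):
        need = write_bandwidth_gb_s(FILL_RATE_GPIX_S, bpp)
        ratio = need / ASSUMED_PEAK_BW_GB_S
        print(f"{bpp:3d} bpp: {need:5.1f} GB/s ({ratio:.2f}x the assumed peak)")
```

This ignores Z traffic, texture reads and any compression, so it is a floor on the write traffic at peak fill, not a performance prediction.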

John.

Chalnoth
10-Dec-2002, 18:50
Actually, going back to the original subject of this thread: neither the 9700 nor the NV30 will have sufficient memory BW to support 128-bit surfaces at their max fillrate. E.g. at the 9700's 2.6GPix/s fill rate, 128BPP requires ~40GB/s just for data write-out alone, i.e. >2x peak memory BW. Add texturing to this, and well...

Yeah, if the cards can actually calculate a 128-bit pixel in a single clock per pipeline. Remember what these things are going to be used for? Primarily for intermediate passes of multipass algorithms.

Think about that for a moment. If you have to resort to multiple passes on DX9 hardware, how long is the shader going to be? I don't think memory bandwidth is going to be much of a concern for 128-bit PS, usually (Might be a concern for people attempting to use nVidia's packed pixel format...not sure...).

ERP
10-Dec-2002, 19:08
Actually, going back to the original subject of this thread: neither the 9700 nor the NV30 will have sufficient memory BW to support 128-bit surfaces at their max fillrate. E.g. at the 9700's 2.6GPix/s fill rate, 128BPP requires ~40GB/s just for data write-out alone, i.e. >2x peak memory BW. Add texturing to this, and well...


I would have thought that in most of the cases where you're writing out a 128BPP buffer it's going to be an intermediate result, and by and large you're going to be spending a lot more than one clock computing the results.

So I don't think you have to approach the necessary bandwidth required for peak fill.

Actually I think that any time you're stressing R300 or NV30 feature-wise, you're probably going to end up more worried about pixel computations than raw framebuffer bandwidth.

Simon F
11-Dec-2002, 09:27
I would have thought that most of the cases where writing out a 128BPP buffer it's going to be an intermediate result and by and large you're going to be spending a lot more than one clock computing the results.

So I don't think you have to approach the necessary bandwidth required for peak fill.
You might still get bottlenecks. The "average" rate may be within the limits but if the peak rate is too high and the FIFOs too small then you won't get near that average.
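
The peak-versus-average point can be illustrated with a toy FIFO model. Everything here is hypothetical: a producer that wants to emit a repeating burst pattern (4 writes, then three idle steps, so 1 write per cycle on average), a memory side that drains 1 write per cycle, and a producer that stalls whenever the FIFO is full.

```python
def achieved_rate(pattern, drain_per_cycle, fifo_depth, cycles):
    """Average writes per cycle reaching memory when the producer stalls on a full FIFO."""
    fifo = 0
    step = 0
    pending = pattern[0]       # writes the producer still has to push for this step
    drained_total = 0
    for _ in range(cycles):
        # Push as much of the current step's output as fits in the FIFO.
        pushed = min(pending, fifo_depth - fifo)
        fifo += pushed
        pending -= pushed
        # The producer only advances to its next step once everything was accepted.
        if pending == 0:
            step += 1
            pending = pattern[step % len(pattern)]
        # Memory drains at a fixed rate.
        drained = min(fifo, drain_per_cycle)
        fifo -= drained
        drained_total += drained
    return drained_total / cycles


if __name__ == "__main__":
    burst = [4, 0, 0, 0]
    for depth in (2, 4, 8):
        rate = achieved_rate(burst, drain_per_cycle=1, fifo_depth=depth, cycles=600)
        print(f"FIFO depth {depth}: {rate:.2f} writes/cycle")
```

In this model a FIFO deep enough to hold a whole burst sustains the average rate, while a shallower one stalls the producer and falls short of it, even though the average demand never exceeds what memory can drain.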

JohnH
11-Dec-2002, 10:07
Yep, the typical algorithms we've come up with that use these do tend to have a reasonable number of instructions, so, as you say, the average rate tends to be quite a bit slower, although I can probably come up with scenarios in which passes with relatively few instructions are required.

The other way of looking at this is how much memory are these surfaces going to take ? e.g many algorithms might want a 1:1 ratio with the "real" render target, so, say at 1024x768, with 4xAA @ 128BPP, requires 48MB.
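
The 48MB figure works out exactly if each of the 4 AA samples stores a full 128-bit value. A short sketch of that calculation (the function names are made up for illustration):

```python
def surface_bytes(width, height, samples, bits_per_pixel):
    """Memory for a render target that stores every AA sample at full bit depth."""
    return width * height * samples * bits_per_pixel // 8


def to_mib(num_bytes):
    """Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    size = surface_bytes(1024, 768, samples=4, bits_per_pixel=128)
    print(size, "bytes =", to_mib(size), "MB")
```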

John.