What does the floating point address processor do?

Luminescent · Mar 14, 2003

What does this middle processor do in the R3XX's pixel pipeline? Can anyone give examples of tasks it would carry out? Is it programmable, like the fragment color processor?

Why does NV3X seem to be lacking it?

Humus · Mar 14, 2003

It processes floating point addresses ... 8)

KimB · Mar 15, 2003

It sounds like that's the part that calculates texture coordinates.

Hyp-X · Mar 15, 2003

I'd think that's the unit that calculates partial derivatives of texture coordinates, and determines the LOD and the amount of anisotropy required.
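To make the guess above concrete, here is a minimal sketch of how a LOD and an anisotropy ratio can be derived from the partial derivatives of the texture coordinates. It follows the commonly published approximation (the one outlined in the OpenGL anisotropic filtering extension), not anything known about R300 internals. The function name, parameters, and the `max_aniso` default are all assumptions made for illustration.

```python
import math

def lod_and_anisotropy(dudx, dvdx, dudy, dvdy, max_aniso=16):
    """Return (lod, aniso_ratio) from texel-space derivatives of (u, v).

    dudx, dvdx: change in texel coordinates for a one-pixel step in screen x.
    dudy, dvdy: the same for a one-pixel step in screen y.
    Hypothetical helper; real hardware may approximate this differently.
    """
    # Lengths of the pixel footprint's two axes in texel space.
    len_x = math.hypot(dudx, dvdx)
    len_y = math.hypot(dudy, dvdy)
    major = max(len_x, len_y)
    minor = min(len_x, len_y)

    if major <= 1.0:
        # Simplifying assumption: treat this as pure magnification,
        # so the base level is used and no extra probes are taken.
        return 0.0, 1

    # Number of probes along the major axis, clamped to the assumed limit.
    aniso = min(math.ceil(major / max(minor, 1e-8)), max_aniso)
    # Each probe covers a shorter footprint, so a sharper mip level can be used.
    lod = max(0.0, math.log2(major / aniso))
    return lod, aniso
```

With a footprint of 8 texels by 2 texels, for example, this sketch selects 4 probes at LOD 1 instead of one isotropic sample at LOD 3.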

This is usually part of the definition of the TMU, but I think ATI marketeers felt that it wouldn't sound impressive if there was only a TMU and an arithmetic unit, so they split the TMU into the address processing and sampling part.

Dave H · Mar 15, 2003

Hyp-X said:
This is usually part of the definition of the TMU

In an Nx2 design, doesn't the 2nd TMU in the pipe share most of these calculations with the 1st? Are there really 2 address units or just 1 shared between the 2 TMUs?

If not, what transistors are saved by going Nx2 instead of 2Nx1? (Well, I can come up with some answers to this question, but not too satisfying.)

OpenGL guy · Mar 15, 2003

Dave H said:
This is usually part of the definition of the TMU

In an Nx2 design, doesn't the 2nd TMU in the pipe share most of these calculations with the 1st? Are there really 2 address units or just 1 shared between the 2 TMUs?

There may be some sharing, but each texture may have different LODs, texture coords, etc. so many (all?) of the calculations will have to be done separately.

If not, what transistors are saved by going Nx2 instead of 2Nx1? (Well, I can come up with some answers to this question, but not too satisfying.)

Nx2 has fewer pixels to be written out, so you can save on a lot of post-pixel-shader stuff (fog, blending, Z, etc.).
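The trade-off described above can be put into numbers with a small sketch. It assumes one set of post-shader hardware (fog, blend, Z, write) per pipeline, which is a simplification, and the `PipeConfig` class and its fields are hypothetical names, not anything from a vendor document.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipeConfig:
    pipelines: int       # pixel pipelines (the "N" in NxM)
    tmus_per_pipe: int   # texture units per pipeline (the "M")

    @property
    def texture_samples_per_clock(self) -> int:
        return self.pipelines * self.tmus_per_pipe

    @property
    def pixels_written_per_clock(self) -> int:
        # Assumption: one fog/blend/Z/write back end per pipeline.
        return self.pipelines

four_by_two = PipeConfig(pipelines=4, tmus_per_pipe=2)
eight_by_one = PipeConfig(pipelines=8, tmus_per_pipe=1)
```

Both configurations fetch 8 texture samples per clock under this model, but the 4x2 layout needs only half as many back ends as the 8x1 layout.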

darkblu · Mar 15, 2003

Dave H said:
In an Nx2 design, doesn't the 2nd TMU in the pipe share most of these calculations with the 1st? Are there really 2 address units or just 1 shared between the 2 TMUs?

There are 2 address units, as each texture 'layer' may have its own coordinate set.

Dave H · Mar 16, 2003

Thanks for the corrections.

demalion · Mar 16, 2003

Hyp-X said:
I'd think that's the unit that calculates partial derivatives of texture coordinates, and determines the LOD and the amount of anisotropy required.

I thought the nv30 had the two being the same thing as you described, but the R300 had separate functionality for filtering calculations for some texture formats (performed in the first, more limited/specialized "floating point texture unit")? This was my understanding of the indication that the nv30 fp32 ALU was tied up for texture processing use always (I thought it was a unique aspect of the nv30) with regards to its performance behavior, and seems necessary considering the ATI anisotropic filtering implementation is a bit customized.

I thought floating point texture addressing allowed floating point calculation results to be applied to determining texture addresses (i.e., formula-based texture lookups, etc.), which could then be used to do what you describe but was not required for that in typical usage (at least, not necessarily so, and specifically not so in the R300).

It would be nice if someone corrected me where I go wrong, in as detailed an explanation, or reference to one, as they have time for.

This is usually part of the definition of the TMU, but I think ATI marketeers felt that it wouldn't sound impressive if there was only a TMU and an arithmetic unit, so they split the TMU into the address processing and sampling part.

Maybe I'm just confused as to what's going on with the nv30. I'll see how things shake out in my head after some sleep. Sorry if I'm missing anything obvious.

Luminescent · Mar 16, 2003

I believe there is no differentiation between texture processors (which both filter and address, according to Nvidia) and pixel program processors in NV3X. They both seem to use the same ALUs. This may be why the units support ddx/ddy instructions. Anybody agree/disagree?

Hyp-X · Mar 16, 2003

Dave H said:
This is usually part of the definition of the TMU

In an Nx2 design, doesn't the 2nd TMU in the pipe share most of these calculations with the 1st? Are there really 2 address units or just 1 shared between the 2 TMUs?

1. Maybe there are operations that are supported by only 1 of the two address units. For example, they can choose to implement "rare" things like cube maps or volumetric textures in only one of them.

2. Maybe there are operations that are supported by combining the two units together, although it doesn't sound likely.

KimB · Mar 16, 2003

Luminescent said:
I believe there is no differentiation between texture processors (which both filter and address, according to Nvidia) and pixel program processors in NV3X. They both seem to use the same ALUs. This may be why the units support ddx/ddy instructions. Anybody agree/disagree?

I doubt it. Otherwise we'd see support for the standard filtering techniques on other storage formats, such as floating-point textures. I'm pretty confident that all modern GPUs have hardware dedicated to texture filtering alone.

Blending, however, could possibly be done in the fragment processor.

demalion · Mar 16, 2003

OK, a bit more awake now (sorry in advance if that isn't awake enough, I'm still under the weather).

I'm not clear on the full functionality limitations, but it has been stated that the R300 can do 3 operations in one clock:

-Texture lookup (Floating Point Texture Unit): I'd think clock cycle delays would depend on cache size/organization in relation to what is being referenced, so excessive latency for this type of operation should happen in fewer instances on the R350 than on the R300. Looking things up, the R300 is said to be limited to one 32-bit fetch per clock per pipeline, with hardwired bilinear filtering at that time, if desired.
-Texture address operation (Floating Point Address Processor): I'm not sure how many components it handles (3? 4?), and the full scope of the calculations allowed is also unclear to me, though I'd think a look at the PS 2.0 specs would give a good indication.
-Color operation (Floating Point Color Processor): This seems capable of calculating 4 component operations at the same time (3 of which must be part of the same vector operation, with the 4th able to be an independent scalar operation), usually in one clock, depending on the op.
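As an illustration of the "3-component vector op plus independent scalar op" split described for the color processor, here is a toy model of one issue slot. It only shows the data flow and says nothing about timing or about how the R300 actually schedules work. The `co_issue` function and its parameters are hypothetical.

```python
import operator

def co_issue(vec_op, a_rgb, b_rgb, scalar_op, a_alpha, b_alpha):
    """Model one issue slot: a 3-component vector op plus an independent scalar op."""
    # The same operation is applied to all three vector components.
    rgb = tuple(vec_op(x, y) for x, y in zip(a_rgb, b_rgb))
    # The fourth component may run a different operation in the same slot.
    alpha = scalar_op(a_alpha, b_alpha)
    return rgb + (alpha,)

# Multiply the RGB components while adding in the alpha channel.
result = co_issue(operator.mul, (0.5, 0.25, 1.0), (2.0, 2.0, 2.0),
                  operator.add, 0.25, 0.5)
```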

Barring the errors in my understanding above, this seems to show the reasoning behind the part of the image in question.

Also, Hyp-X seems to be saying calculations necessary for anisotropic filtering are performed in the floating point address processor in such a way that other operations are precluded in the same cycle a texture is filtered anisotropically. Is this true? One way to check (I think) would be to contrast the impact of anisotropic filtering for a small texture using a simple texture look up/address op/color op shader...I didn't have time to double check the benchmarks we have right now to see if one could serve.

In any case, since it is presumably programmable and can be used to do other calculations, the reason for including it is not just marketing even if this is true, though its performance characteristics in this regard would be nice to know. It has probably been specified somewhere in these forums and I just missed it, or I need to find the right bit of publicly available documentation.

Contrasting the nv30 as a means to clearer understanding of programmable pipelines in general:

My current understanding is that the nv30 has 8 fp32 processing units, but that they do not have as capable an "assistant" unit to handle either filtering and sampling or texture address operations, so must be used to assist in such (hey, I wonder how it would perform in point sampled texturing with floating point shader ops?). I am also under the impression that the scalar/vector execution flexibility is not present, and that it relates to the lack of cubemap support for the nv30 for floating point textures (assuming that is correct).

As far as this description's correctness or incorrectness relates to the topic at hand, I would appreciate it if someone would share their understanding. As far as it relates to the nv35 and possible improvements it could offer with the (rumored) relatively small transistor count increase (which I think it does), perhaps we could discuss that in other thread(s) relating to that, after things are clarified here?

LeStoffer · Mar 16, 2003

One of the new features of PS 2.0 is that texture data and sampling are decoupled (according to the DX9 specs, because there are now sampler registers in addition to the texture registers).

This might be why ATI made this distinction.

A few more facts about PS 2.0: There are 4 texture instructions, which are used to load and sample texture data and to modify texture coordinates.

I don't know whether all four are done on the FP address processor in question, but since you can use arithmetic instructions on texture address information, it is a bit fuzzy which processor does what on the R300. Maybe the dev guys can clear this up?

Hyp-X · Mar 17, 2003

LeStoffer said:
One of the new features of PS 2.0 is that texture data and sampling are decoupled (according to the DX9 specs, because there are now sampler registers in addition to the texture registers).

You mean texture coordinate registers and sampling are decoupled.
That's not new; it was already done in PS1.4, and it's just been made more flexible.

In PS1.4

Code:

texld r3, t2

This reads from sampler unit 3, not unit 2.
There are no sampler registers; the source is indicated by the destination register.

In PS2.0 the same instruction would be:

Code:

texld r3, t2, s3

In this model it is possible to indicate the sampler source separately.
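The difference between the two forms of texld can be modeled in a few lines. This is only a sketch of the register semantics described above: samplers are stand-in Python callables, and the function names and dictionaries are hypothetical, not part of any Direct3D API.

```python
def texld_ps14(dst_index, coord_reg, samplers, tex_coords):
    """ps_1_4 style 'texld rN, tM': the sampler is implied by the destination N."""
    return samplers[dst_index](tex_coords[coord_reg])

def texld_ps20(coord_reg, sampler_index, samplers, tex_coords):
    """ps_2_0 style 'texld rN, tM, sK': the sampler K is named explicitly."""
    return samplers[sampler_index](tex_coords[coord_reg])

# Stand-in samplers that just report which unit was used and with what coordinate.
samplers = {i: (lambda uv, i=i: (f"s{i}", uv)) for i in range(4)}
tex_coords = {"t2": (0.25, 0.75)}

r3_ps14 = texld_ps14(3, "t2", samplers, tex_coords)   # texld r3, t2
r3_ps20 = texld_ps20("t2", 3, samplers, tex_coords)   # texld r3, t2, s3
r0_ps20 = texld_ps20("t2", 0, samplers, tex_coords)   # texld r0, t2, s0
```

In the PS2.0-style call the same coordinate register can feed any sampler, which is what the explicit third operand buys.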

Hyp-X · Mar 17, 2003

demalion said:
Also, Hyp-X seems to be saying calculations necessary for anisotropic filtering are performed in the floating point address processor in such a way that other operations are precluded in the same cycle a texture is filtered anisotropically. Is this true?

It was only a guess.

One way to check (I think) would be to contrast the impact of anisotropic filtering for a small texture using a simple texture look up/address op/color op shader...I didn't have time to double check the benchmarks we have right now to see if one could serve.

You mean a texture so small that there is only magnification, and therefore no actual anisotropic filtering will occur, only bilinear?

Suppose there is
a.) no speed drop
b.) some speed drop
Which one would prove that anisotropy is performed in the texture addressing unit?

Luminescent · Mar 17, 2003

Here are a few questions bugging me (feel free to answer; speculation is welcome):

-What units is the floating point address processor in R300 composed of (FMADs?)?

-Is the address processor specialized to perform specific register ops, or could it arbitrarily fetch/read/write, as dictated by the application?

-Is the address processor user-programmable, or is it invisible to the developer?

demalion · Mar 17, 2003

Hyp-X said:
...

One way to check (I think) would be to contrast the impact of anisotropic filtering for a small texture using a simple texture look up/address op/color op shader...I didn't have time to double check the benchmarks we have right now to see if one could serve.

You mean a texture so small that there is only magnification, and therefore no actual anisotropic filtering will occur, only bilinear?

No, that sounds like a "2 dimensional" cache interpretation of what I said...I'd expect that "3 dimensional" (mip map) organizational priorities would allow what I am referring to. Anyways, why couldn't the same texture be repeated instead of being magnified?

DemoCoder · Mar 17, 2003

There is a difference between the way 1.4 and 2.0 do the decoupling. 1.4 is only partially decoupled. In 1.4, the coordinate register is implicitly tied to the sampler. Register t2 always samples from texture#2. You can't really do texture register math, but must wait until phase 2 and use Rn registers. So, you'll still set up the texture regs in the vertex shader, and you will need 1 register for each texture, unless you want to delay until phase 2.

You can't use the same coordinate register to sample from other textures, and you can't even copy that register to another, because the texture coordinate registers are read only. The best you can do, is copy to R_n registers and use those in phase 2.

In contrast, PS2.0 offers complete freedom. If I put an address into T1, I can sample from as many textures as I want, pulling in from a normal map, color map, and whatever else I need.

Of course, PS2.0 still requires a specialized instruction, texld, instead of allowing texture reads directly as part of a color op, but I'm sure the VLIW architecture optimizes a separate texld/move into one slot anyway.

Hyp-X · Mar 18, 2003

DemoCoder said:
In 1.4, the coordinate register is implicitly tied to the sampler. Register t2 always samples from texture#2.
...
You can't use the same coordinate register to sample from other textures, and you can't even copy that register to another, because the texture coordinate registers are read only. The best you can do, is copy to R_n registers and use those in phase 2.

http://msdn.microsoft.com/library/d...Shaders/PixelShader1_X/instructions/texld.asp
texld (Pixel Shader)

Loads the destination register with color data (RGBA) sampled using the contents of the source register as texture coordinates. The sampled texture is the texture associated with the destination register number.
