Are texture addressing units scalar

Nick

Given that textures can be 1D, 2D, 3D, or cube maps, I was wondering if the texture addressing units in today's GPUs are scalar, or if they handle each case in a single cycle?
 
The whole texture addressing and fetching is certainly more than just a tiny bit of math (mip-mapping, anisotropic filtering, multiple fetches, etc.).
It's not really meaningful to ask if one tiny term in the math is scalar or not. The whole calculation is certainly not possible to do in a single cycle (because of dependent terms).
 
Yes, the whole texture sampling operation consists of a lot more steps, but I was merely wondering about the first few, which operate on the 1-3 coordinates independently.

Anyway, I think I got confused/tricked by marketing. GF104 is advertised as having twice as many texture filter units as texture address units. This gave me the impression that they are decoupled, which would make some sense in light of anisotropic filtering. I then started wondering if there are any GPUs which perform the independent part of the calculations per coordinate in a serial fashion instead of in parallel (which would leave hardware unused in the 1D and 2D cases).

It appears that the addressing and filter units are not very decoupled at all, and GF104 can simply do FP16 filtering at the same speed as 8-bit filtering, whereas it was half-speed on GF100. They've advertised it as having twice the filter units.

Please let me know if there's more to it. I'm also still wondering if there are GPUs which perform any part of the texture addressing operations in the shader cores. Where is the multiplication for perspective correction performed these days?
 
Yes, the whole texture sampling operation consists of a lot more steps, but I was merely wondering about the first few, which operate on the 1-3 coordinates independently.
How is that supposed to work? Addressing the samples of a texture involves a (convoluted) address generation procedure (which also involves translating virtual to physical addresses). Additionally, textures are traditionally laid out non-linearly in memory; more often the layout resembles a space-filling curve to improve the cache hit rate through spatial locality (which of course makes the address generation a bit more complicated). Radeons, for instance, use such a layout (which enables a very simple address generation scheme). One needs all coordinates (at least for 2D textures) to do anything meaningful.
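As a rough illustration only (the details of the actual Radeon tiling differ, and the helper names here are made up), a Morton/Z-order style layout interleaves the bits of the two texel coordinates, so the memory offset can only be formed once both coordinates are known:

/* Sketch of a Morton (Z-order) texture layout: the bits of x and y are
   interleaved so that spatially close texels end up close in memory.
   Illustrative only; real GPU tilings are more involved. */
static unsigned spread_bits(unsigned v)      /* abcd -> 0a0b0c0d */
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

unsigned morton_offset(unsigned x, unsigned y)
{
    return spread_bits(x) | (spread_bits(y) << 1);   /* needs both coordinates */
}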
Anyway, I think I got confused/tricked by marketing. GF104 is advertised as having twice as many texture filter units as texture address units. This gave me the impression that they are decoupled, which would make some sense in light of anisotropic filtering. I then started wondering if there are any GPUs which perform the independent part of the calculations per coordinate in a serial fashion instead of in parallel (which would leave hardware unused in the 1D and 2D cases).

It appears that the addressing and filter units are not very decoupled at all, and GF104 can simply do FP16 filtering at the same speed as 8-bit filtering, whereas it was half-speed on GF100. They've advertised it as having twice the filter units.
Traditionally, a filter unit has the ability to generate one bilinearly filtered texel (per clock), while the address unit limits the number of individually addressed texture coordinates (and a trilinearly or anisotropically filtered texel still needs only a single texture address). That means for a trilinearly filtered sample, one needs just one address unit (getting the two physical addresses for the different mipmaps is just a small extension) and two filter units (actually slightly more) to balance the throughput and get full-speed trilinear filtering (not slowed down by the filtering; it can still be slower because of the higher bandwidth needs). And as all anisotropic filters with DX10/11 are based on trilinearly filtered samples, that is the optimal ratio if one doesn't shoot for the highest possible bilinear filtering speed (which gets less and less relevant).
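To make the 1:2 ratio concrete, here is a minimal sketch in C, where bilinear_sample() is a made-up stand-in for what one filter unit delivers per clock; one set of coordinates plus one LOD feeds two bilinear lookups:

/* Sketch: trilinear filtering as a blend of two bilinear lookups.
   bilinear_sample() is a hypothetical helper representing one filter
   unit producing one bilinearly filtered texel per clock. */
typedef struct { float r, g, b, a; } color;

color bilinear_sample(float s, float t, int level);   /* hypothetical helper */

color trilinear_sample(float s, float t, float lod)
{
    int   level = (int)lod;          /* integer part: mipmap selection */
    float frac  = lod - level;       /* fractional part: blend weight  */

    color lo = bilinear_sample(s, t, level);      /* filter unit #1 */
    color hi = bilinear_sample(s, t, level + 1);  /* filter unit #2 */

    color out = { lo.r + frac * (hi.r - lo.r),
                  lo.g + frac * (hi.g - lo.g),
                  lo.b + frac * (hi.b - lo.b),
                  lo.a + frac * (hi.a - lo.a) };
    return out;   /* one texture address, two bilinear filter results */
}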
Please let me know if there's more to it. I'm also still wondering if there are GPUs which perform any part of the texture addressing operations in the shader cores. Where is the multiplication for perspective correction performed these days?
Perspective correction is part of the texture coordinate (attribute) interpolation. It is not part of the texture address generation in a narrower sense. It basically runs between rasterization and the pixel shader (one may even consider it a part of the rasterization). All DX11 GPUs do that in the shader core (I think it is even mandated that way).
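For reference, a minimal sketch of what such an interpolation step computes per pixel (names are illustrative, not any particular ISA): the attribute divided by w and 1/w are interpolated linearly in screen space, and a per-pixel division recovers the perspective-correct value.

/* Sketch: perspective-correct attribute interpolation for one pixel.
   a0..a2 are the attribute values at the triangle vertices, w0..w2 their
   clip-space w, and b0..b2 the screen-space barycentric weights. */
float interpolate_perspective(float a0, float a1, float a2,
                              float w0, float w1, float w2,
                              float b0, float b1, float b2)
{
    /* a/w and 1/w interpolate linearly in screen space ... */
    float num = b0 * (a0 / w0) + b1 * (a1 / w1) + b2 * (a2 / w2);
    float den = b0 * (1.0f / w0) + b1 * (1.0f / w1) + b2 * (1.0f / w2);
    /* ... the division is the actual perspective correction. */
    return num / den;
}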
 
How is that supposed to work? Addressing the samples of a texture involves a (convoluted) address generation procedure (which also involves translating virtual to physical addresses). Additionally, textures are traditionally laid out non-linearly in memory; more often the layout resembles a space-filling curve to improve the cache hit rate through spatial locality (which of course makes the address generation a bit more complicated). Radeons, for instance, use such a layout (which enables a very simple address generation scheme). One needs all coordinates (at least for 2D textures) to do anything meaningful.
As far as I know those are the last steps before texel fetch, and they even require LOD determination before that (which also requires all coordinates). So I was only referring to anything done before that (which I now realize might not be a whole lot).
Traditionally, a filter unit has the ability to generate one bilinearly filtered texel (per clock), while the address unit limits the number of individually addressed texture coordinates (and a trilinearly or anisotropically filtered texel still needs only a single texture address). That means for a trilinearly filtered sample, one needs just one address unit (getting the two physical addresses for the different mipmaps is just a small extension) and two filter units (actually slightly more) to balance the throughput and get full-speed trilinear filtering (not slowed down by the filtering; it can still be slower because of the higher bandwidth needs). And as all anisotropic filters with DX10/11 are based on trilinearly filtered samples, that is the optimal ratio if one doesn't shoot for the highest possible bilinear filtering speed (which gets less and less relevant).
Are you saying that GF104 has twice the number of bilinear filtering units over GF100, making the marketing diagrams fairly accurate? Or does all modern hardware support single cycle (pipelined) trilinear filtering and is GF104's only novelty full speed FP16 filtering?
Perspective correction is part of the texture coordinate (attribute) interpolation. It is not part of the texture address generation in a narrower sense. It basically runs between rasterization and the pixel shader (one may even consider it a part of the rasterization). All DX11 GPUs do that in the shader core (I think it is even mandated that way).
NVIDIA uses the SFUs for interpolation, right? Does AMD use the generic arithmetic units?
 
Didn't Cypress give up the dedicated attribute interpolation to accommodate this function per the D3D11 specs? NV has been using the SFUs for interpolation since G80 (the case of the "lost" MUL co-issue).
 
As far as I know those are the last steps before texel fetch, and they even require LOD determination before that (which also requires all coordinates). So I was only referring to anything done before that (which I now realize might not be a whole lot).
LOD plays the combined role of a variable address offset selecting the correct mipmap(s) (the integer part), while also specifying a filtering weight in case of trilinear sampling (the fractional part). So using the LOD is part of the address generation and also feeds into the filtering. The determination of the LOD is usually done using the differences of the texture coordinates within a quad, i.e. the gradients (also a reason why we have quad TMUs), which are also used for determining the line of anisotropy and the degree of anisotropy in case of AF. It usually happens before the actual texture instruction (it can be done during interpolation for simple use cases, but can also be individually calculated and set in the pixel shader).
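As a sketch of the isotropic case (ignoring the anisotropy handling, and assuming the gradients have already been scaled to texel units), the LOD calculation boils down to something like:

/* Sketch: LOD from the texture-coordinate differences within a quad.
   (ds_dx, dt_dx) and (ds_dy, dt_dy) are the horizontal and vertical
   gradients in texel units. Isotropic approximation only. */
#include <math.h>

float compute_lod(float ds_dx, float dt_dx, float ds_dy, float dt_dy)
{
    float len_x = sqrtf(ds_dx * ds_dx + dt_dx * dt_dx);
    float len_y = sqrtf(ds_dy * ds_dy + dt_dy * dt_dy);
    float rho   = fmaxf(len_x, len_y);
    return log2f(rho);   /* integer part: mipmap pair, fraction: trilinear weight */
}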
Are you saying that GF104 has twice the number of bilinear filtering units over GF100, making the marketing diagrams fairly accurate?
I thought it was common knowledge that GF104 has twice the number of texture filters per SM.
Or does all modern hardware support single cycle (pipelined) trilinear filtering and is GF104's only novelty full speed FP16 filtering?
The full-speed bilinear/trilinear question and the data format a filter unit can handle at full speed are in principle orthogonal. It really depends on the implementation.

NVIDIA uses the SFUs for interpolation, right? Does AMD use the generic arithmetic units?
In newer GPUs it's basically some kind of "interpolation shader" run before (or within) the pixel shader. In principle, it could use the other units, too. AMD specifically added interpolation instructions for this purpose to the ALUs in the Cypress generation.
Didn't Cypress give up the dedicated attribute interpolation to accommodate this function per the D3D11 specs? NV has been using the SFUs for interpolation since G80 (the case of the "lost" MUL co-issue).
Yes, that's why I wrote that all DX11 GPUs do it that way: it is basically mandated by DX11. ;)
 
LOD plays the combined role of a variable address offset selecting the correct mipmap(s) (the integer part), while also specifying a filtering weight in case of trilinear sampling (the fractional part). So using the LOD is part of the address generation and also feeds into the filtering. The determination of the LOD is usually done using the differences of the texture coordinates within a quad, i.e. the gradients (also a reason why we have quad TMUs), which are also used for determining the line of anisotropy and the degree of anisotropy in case of AF. It usually happens before the actual texture instruction (it can be done during interpolation for simple use cases, but can also be individually calculated and set in the pixel shader).
I've implemented it in software, so I know how it works. But what do you mean by it usually happening before the actual texture instruction (on a GPU)? Do today's GPUs perform the ddx/ddy operations in the shader cores and then send those along with the texture coordinates to the texture units? In other words, at the hardware level there's no tex2D(s, t) instruction, only a tex2D(s, t, ddx, ddy) instruction?
I thought it was common knowledge that GF104 has twice the number of texture filters per SM.
The full-speed bilinear/trilinear question and the data format a filter unit can handle at full speed are in principle orthogonal. It really depends on the implementation.
Well it's the different implementations I'm interested in. :) It's indeed common knowledge that GF104 was advertised to have twice the number of texture filters, but it's still not clear to me what that means and where the FP16 filtering fits in. Is the doubling of the filter units the cause of doubling the FP16 filtering performance, or is there really a 4x improvement from having twice the filter units which perform FP16 filtering at full speed instead of half speed?

Sorry for the many questions; I'd just like to better understand how texture sampling is implemented in hardware these days. There appears to be very little documentation on it. Any pointers would be much appreciated.
 
Well it's the different implementations I'm interested in. :) It's indeed common knowledge that GF104 was advertised to have twice the number of texture filters, but it's still not clear to me what that means and where the FP16 filtering fits in. Is the doubling of the filter units the cause of doubling the FP16 filtering performance, or is there really a 4x improvement from having twice the filter units which perform FP16 filtering at full speed instead of half speed?
GF104 has twice the number of filter units and they are twice as fast with FP16 as those in GF100 (but NOT GF110). This is per SM of course, so in total both chips have the same number of texture units (and GF104 is twice as fast as GF100 with FP16).
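(Assuming the full configurations and the commonly published unit counts, that works out to 16 SMs × 4 filter units = 64 on GF100 versus 8 SMs × 8 filter units = 64 on GF104.)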
 
But what do you mean by it usually happening before the actual texture instruction (on a GPU)? Do today's GPUs perform the ddx/ddy operations in the shader cores and then send those along with the texture coordinates to the texture units? In other words, at the hardware level there's no tex2D(s, t) instruction, only a tex2D(s, t, ddx, ddy) instruction?
My wording above wasn't exact. And frankly, I don't know exactly how it is done on nV hardware; the following is specific to Radeons (but it could be somewhat similar on nV, as they also have limits on the number of source parameters for each instruction).

When using a normal sample instruction, the gradients are calculated on the fly from the texture coordinates within a quad (only in the quite distant past were they predetermined by the texture coordinates from the interpolation). But you actually can't supply the gradients directly within a sample instruction; there is no 1:1 mapping of the tex2D(s, t, ddx, ddy) instruction in the ISA. Instead, the compiler generates multiple instructions for it. First, it writes the user-supplied gradients to the TMU (2 instructions are needed, as 3 horizontal and 3 vertical gradients are written), and after that it uses a special sample instruction which does not use the automatically calculated gradients, but the ones set up before. Alternatively, one can supply texture coordinates to the TMU from which the gradients are calculated and kept there (no texture access is executed). Later, the same special sample instruction as above can use these stored gradients with other texture coordinates. The HLSL ddx and ddy instructions are actually implemented using the same TMU functionality: the values are written to the TMU, the gradients are calculated there and then read back to the ALU registers. There is no specific functionality in the ALUs to calculate derivatives.

So to sum it up, the automatic gradient calculation can be overridden when the TMU is set up accordingly with manually supplied gradients before the sample instruction. The only thing one can directly supply with a sample instruction is an override for the LOD value itself (and a LOD bias, but the bias only as a hard-coded immediate, not as a register value), but not for the derivatives.
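To illustrate the implicit case described above, a simplified sketch of how gradients fall out of the 2x2 quad's texture coordinates (forward differences only; the pixel layout and names are assumptions, and real hardware details differ):

/* Sketch: implicit gradients from the four texture coordinates of a
   2x2 pixel quad, roughly what the TMU derives when no explicit
   gradients were written beforehand. Assumed quad layout:
       quad[0] quad[1]
       quad[2] quad[3]                                            */
typedef struct { float s, t; } coord;

void quad_gradients(const coord quad[4], coord *ddx, coord *ddy)
{
    ddx->s = quad[1].s - quad[0].s;   /* horizontal neighbour difference */
    ddx->t = quad[1].t - quad[0].t;
    ddy->s = quad[2].s - quad[0].s;   /* vertical neighbour difference   */
    ddy->t = quad[2].t - quad[0].t;
}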
 
GF104 has twice the number of filter units and they are twice as fast with FP16 as those in GF100 (but NOT GF110). This is per SM of course, so in total both chips have the same number of texture units (and GF104 is twice as fast as GF100 with FP16).
Thanks, that clarifies a lot!
 
My wording above wasn't exact. And frankly, I don't know exactly how it is done on nV hardware; the following is specific to Radeons (but it could be somewhat similar on nV, as they also have limits on the number of source parameters for each instruction).

When using a normal sample instruction, the gradients are calculated on the fly from the texture coordinates within a quad (only in the quite distant past were they predetermined by the texture coordinates from the interpolation). But you actually can't supply the gradients directly within a sample instruction; there is no 1:1 mapping of the tex2D(s, t, ddx, ddy) instruction in the ISA. Instead, the compiler generates multiple instructions for it. First, it writes the user-supplied gradients to the TMU (2 instructions are needed, as 3 horizontal and 3 vertical gradients are written), and after that it uses a special sample instruction which does not use the automatically calculated gradients, but the ones set up before. Alternatively, one can supply texture coordinates to the TMU from which the gradients are calculated and kept there (no texture access is executed). Later, the same special sample instruction as above can use these stored gradients with other texture coordinates. The HLSL ddx and ddy instructions are actually implemented using the same TMU functionality: the values are written to the TMU, the gradients are calculated there and then read back to the ALU registers. There is no specific functionality in the ALUs to calculate derivatives.
Thanks, that's largely how I imagined it was done. The only surprise is that the shader units can't compute gradients. Doesn't that mean that ddx and ddy are much higher latency than other arithmetic instructions? Or does the TMU have different latencies depending on the operation, in turn meaning it has to deal with writeback hazards?
 
Doesn't that mean that ddx and ddy are much higher latency than other arithmetic instructions?
Quite probably, yes. But maybe not that much higher. I can't recall having seen benchmarks of the ddx/ddy instructions.
Or does the TMU have different latencies depending on the operation, in turn meaning it has to deal with writeback hazards?
All usual TMU instructions have a variable latency (the derivative/gradient instruction could be fixed latency) and have to deal with writeback hazards, as they have to cope with possible bank conflicts, cache misses, going through the whole cache hierarchy, and finally memory accesses (which may necessitate an access over PCI Express to host memory) before they filter the samples and write the results back to the ALU registers. That can easily take several hundred clock cycles.
 