> Yes, the whole texture sampling operation consists of a lot more steps, but I was merely wondering about the first few, which operate on the 1-3 coordinates independently.

How is that supposed to work? Addressing the samples of a texture involves a (convoluted) address generation procedure (which also involves translating virtual to physical addresses). Additionally, textures are traditionally laid out non-linearly in memory; more often than not the layout resembles a space-filling curve to improve the cache hit rate through spatial locality (which of course makes the address generation a bit more complicated). Radeons, for instance, use such a layout (which enables a very simple address generation scheme). One needs all coordinates (at least for 2D textures) to do anything meaningful.
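To make the "needs all coordinates" point concrete, here is a minimal sketch of a Morton (Z-order) style layout, one common family of space-filling curves. It is a generic illustration, not the exact swizzle any particular GPU (Radeon or otherwise) uses, and it ignores tiling, pitch and bank/channel bits:

```cpp
#include <cstdint>

// Spread the lower 16 bits of v so there is a zero bit between each of them.
static uint32_t part1by1(uint32_t v) {
    v &= 0x0000ffffu;
    v = (v | (v << 8)) & 0x00ff00ffu;
    v = (v | (v << 4)) & 0x0f0f0f0fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Byte offset of texel (x, y) in a Morton (Z-order) layout: the bits of x and
// y are interleaved, so every address bit depends on both coordinates and no
// useful partial address can be formed from a single coordinate.
uint32_t mortonTexelOffset(uint32_t x, uint32_t y, uint32_t bytesPerTexel) {
    return (part1by1(x) | (part1by1(y) << 1)) * bytesPerTexel;
}
```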
> Anyway, I think I got confused/tricked by marketing. GF104 is advertised as having twice as many texture filter units as texture address units. This gave me the impression that they are decoupled, which would make some sense in light of anisotropic filtering. I then started wondering if there are any GPUs which perform the independent part of the calculations per coordinate in a serial fashion instead of in parallel, thereby leaving hardware unused in the 1D and 2D cases.

Traditionally, a filter unit can generate one bilinearly filtered texel per clock, while the address unit limits the number of individually addressed texture coordinates (a trilinear or anisotropic filtered texel still needs only a single texture address). That means that for a trilinearly filtered sample one needs just one address unit (getting the two physical addresses for the two mipmaps is only a small extension) and two filter units (actually slightly more) to balance the throughput and get full-speed trilinear filtering (not slowed down by the filtering itself; it can still be slower because of the higher bandwidth needs). And since with DX10/11 all anisotropic filters are based on trilinearly filtered samples, that is the optimal ratio if one doesn't aim for the highest possible bilinear filtering speed (which is getting less and less relevant).
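A toy model of that balance, just to make the ratio argument explicit. The unit counts below are hypothetical; the 1:2 address:filter ratio is the GF104-style configuration being discussed:

```cpp
#include <algorithm>
#include <cstdio>

// Toy throughput model (a sketch, not vendor data): a bilinear sample needs
// 1 texture address and 1 filter operation, a trilinear sample needs
// 1 texture address but 2 filter operations (one per mip level).
struct TexBlock { int addressUnits; int filterUnits; };

int bilinearPerClock(const TexBlock& t)  { return std::min(t.addressUnits, t.filterUnits); }
int trilinearPerClock(const TexBlock& t) { return std::min(t.addressUnits, t.filterUnits / 2); }

int main() {
    TexBlock oneToOne{4, 4};  // hypothetical 1:1 address:filter configuration
    TexBlock oneToTwo{4, 8};  // hypothetical 1:2 configuration (GF104-style ratio)
    std::printf("1:1 -> %d bilinear, %d trilinear samples per clock\n",
                bilinearPerClock(oneToOne), trilinearPerClock(oneToOne));
    std::printf("1:2 -> %d bilinear, %d trilinear samples per clock\n",
                bilinearPerClock(oneToTwo), trilinearPerClock(oneToTwo));
    return 0;
}
```

With a 1:1 ratio the filter units halve the trilinear rate; with 1:2 the address and filter throughput match for trilinear, which is the balance described above.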
It appears that the addressing and filter units are not very decoupled at all, and GF104 can simply do FP16 filtering at the same speed as 8-bit filtering, whereas it was half-speed on GF100. They've advertised it as having twice the filter units.
> Please let me know if there's more to it. I'm also still wondering if there are GPUs which perform any part of the texture addressing operations in the shader cores. Where is the multiplication for perspective correction performed these days?

Perspective correction is part of the texture coordinate (attribute) interpolation; it is not part of the texture address generation in the narrower sense. It basically runs between rasterization and the pixel shader (one may even consider it part of the rasterization). All DX11 GPUs do it in the shader core (I think it is even mandated that way).
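For reference, a minimal sketch of where that multiplication lives, assuming standard perspective-correct barycentric interpolation (a generic formulation, not any specific GPU's instruction sequence):

```cpp
struct Vertex { float u, v, w; };  // one texture coordinate pair plus clip-space w

// Perspective-correct interpolation at barycentric weights (b0, b1, b2):
// u/w, v/w and 1/w interpolate linearly in screen space, and one reciprocal
// (the "perspective division") per pixel recovers the true coordinates.
void interpolateUV(const Vertex& A, const Vertex& B, const Vertex& C,
                   float b0, float b1, float b2, float& u, float& v) {
    float invW   = b0 / A.w       + b1 / B.w       + b2 / C.w;
    float uOverW = b0 * A.u / A.w + b1 * B.u / B.w + b2 * C.u / C.w;
    float vOverW = b0 * A.v / A.w + b1 * B.v / B.w + b2 * C.v / C.w;
    u = uOverW / invW;             // this divide (or multiply by 1/invW) is the
    v = vOverW / invW;             // "multiplication for perspective correction"
}
```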
> How is that supposed to work? Addressing the samples of a texture involves a (convoluted) address generation procedure [...] One needs all coordinates (at least for 2D textures) to do anything meaningful.

As far as I know those are the last steps before the texel fetch, and they even require LOD determination before that (which also requires all coordinates). So I was only referring to anything done before that (which I now realize might not be a whole lot).
> Traditionally, a filter unit can generate one bilinearly filtered texel per clock, while the address unit limits the number of individually addressed texture coordinates [...]

Are you saying that GF104 has twice the number of bilinear filtering units compared to GF100, making the marketing diagrams fairly accurate? Or does all modern hardware support single-cycle (pipelined) trilinear filtering, and is GF104's only novelty full-speed FP16 filtering?
> Perspective correction is part of the texture coordinate (attribute) interpolation. [...] All DX11 GPUs do it in the shader core (I think it is even mandated that way).

NVIDIA uses the SFUs for interpolation, right? Does AMD use the generic arithmetic units?
> As far as I know those are the last steps before the texel fetch, and they even require LOD determination before that (which also requires all coordinates). So I was only referring to anything done before that (which I now realize might not be a whole lot).

LOD plays a combined role: on the one hand it is a variable address offset (the integer part) selecting the correct mipmap(s), and on the other hand it specifies a filtering weight in the case of trilinear sampling (the fractional part). So using the LOD is part of the address generation and also feeds into the filtering. The determination of the LOD is usually done using the differences of the texture coordinates within a quad (one reason why we have quad TMUs), i.e. the gradients (which are also used for determining the line of anisotropy and the degree of anisotropy in the case of AF). It usually happens before the actual texture instruction (it can be done during interpolation for simple use cases, but it can also be individually calculated and set in the pixel shader).
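A sketch of how those two roles fall out of one LOD value, using the usual log2-of-footprint formulation (isotropic case only; LOD bias and clamping are omitted, and this is not any specific GPU's exact formula):

```cpp
#include <algorithm>
#include <cmath>

struct Gradients { float dudx, dvdx, dudy, dvdy; };  // texcoord differences within a quad

// Isotropic LOD selection from the quad gradients (the anisotropic path would
// additionally derive the line and degree of anisotropy from the same values).
void selectMips(const Gradients& g, float texWidth, float texHeight,
                int& mipLo, int& mipHi, float& triWeight) {
    // Pixel footprint in texel space along screen x and y.
    float lenX = std::hypot(g.dudx * texWidth, g.dvdx * texHeight);
    float lenY = std::hypot(g.dudy * texWidth, g.dvdy * texHeight);
    float lod  = std::max(std::log2(std::max(lenX, lenY)), 0.0f);

    mipLo     = static_cast<int>(std::floor(lod)); // integer part: which mipmap(s) to address
    mipHi     = mipLo + 1;                         // clamping to the last level omitted
    triWeight = lod - static_cast<float>(mipLo);   // fractional part: trilinear blend weight
}
```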
> Are you saying that GF104 has twice the number of bilinear filtering units compared to GF100, making the marketing diagrams fairly accurate?

I thought it was common knowledge that GF104 has twice the number of texture filter units per SM.
> Or does all modern hardware support single-cycle (pipelined) trilinear filtering, and is GF104's only novelty full-speed FP16 filtering?

The full-speed bilinear/trilinear question and the data format a filter unit can handle at full speed are in principle orthogonal. It really depends on the implementation.
> NVIDIA uses the SFUs for interpolation, right? Does AMD use the generic arithmetic units?

In newer GPUs it's basically some kind of "interpolation shader" run before (or within) the pixel shader. In principle, it could use the other units, too. AMD specifically added interpolation instructions for this purpose to the ALUs in the Cypress generation.
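A rough picture of what such an "interpolation shader" boils down to, expressed as plain ALU math. The plane-equation setup and the (i, j) weights are generic stand-ins, not the actual Cypress instruction semantics:

```cpp
// Per-primitive setup: attribute value at a reference vertex plus two deltas
// (a plane equation). Per-pixel input: barycentric-style weights (i, j).
struct AttribPlane { float p0, dp1, dp2; };

// One attribute channel costs two multiply-adds per pixel -- regular ALU (or
// SFU) work, which is why it can simply run at the start of the pixel shader.
float interpolate(const AttribPlane& a, float i, float j) {
    return a.p0 + i * a.dp1 + j * a.dp2;
}
// Perspective correction enters through how (i, j) are produced (compare the
// perspective-correct interpolation sketch earlier in the thread).
```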
> Didn't Cypress give up the dedicated attribute interpolation to accommodate this function for the D3D11 specs? NV has been using the SFUs for interpolation since G80 (the case of the "lost" MUL co-issue).

Yes, that's why I wrote that all DX11 GPUs do it that way: it is basically mandated by DX11.
> LOD plays a combined role: on the one hand it is a variable address offset (the integer part) selecting the correct mipmap(s), and on the other hand it specifies a filtering weight in the case of trilinear sampling (the fractional part). [...] It usually happens before the actual texture instruction [...]

I've implemented it in software, so I know how it works. But what do you mean by "it usually happens before the actual texture instruction" (on a GPU)? Do today's GPUs perform ddx/ddy operations in the shader cores and then send the result along with the texture coordinates to the texture units? In other words, at the hardware level there's no tex2D(s, t) instruction, only a tex2D(s, t, ddx, ddy) instruction?
> I thought it was common knowledge that GF104 has twice the number of texture filter units per SM.

Well, it's the different implementations I'm interested in. It's indeed common knowledge that GF104 was advertised as having twice the number of texture filter units, but it's still not clear to me what that means and where the FP16 filtering fits in. Is the doubling of the filter units the cause of the doubling of the FP16 filtering performance, or is there really a 4x improvement from having twice the filter units, each performing FP16 filtering at full speed instead of half speed?
> [...] Is the doubling of the filter units the cause of the doubling of the FP16 filtering performance, or is there really a 4x improvement from having twice the filter units, each performing FP16 filtering at full speed instead of half speed?

GF104 has twice the number of filter units per SM, and they are twice as fast with FP16 as those in GF100 (but NOT GF110). This is per SM, of course; in total both chips have the same number of texture units (and GF104 is twice as fast as GF100 with FP16 overall).
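Putting rough numbers on that, assuming the full-chip configurations (16 SMs x 4 texture units for GF100, 8 SMs x 8 for GF104 -- an assumption taken from the public specs, not stated above):

```cpp
// Illustrative arithmetic only.
constexpr int    gf100Sms = 16, gf100TmusPerSm = 4;    // FP16 filtering at half rate
constexpr int    gf104Sms = 8,  gf104TmusPerSm = 8;    // FP16 filtering at full rate

constexpr double gf100Fp16PerSm = gf100TmusPerSm * 0.5;  // 2 FP16 bilerps/clock per SM
constexpr double gf104Fp16PerSm = gf104TmusPerSm * 1.0;  // 8 FP16 bilerps/clock per SM (4x per SM)

constexpr double gf100Fp16Chip = gf100Sms * gf100Fp16PerSm;  // 32 per clock
constexpr double gf104Fp16Chip = gf104Sms * gf104Fp16PerSm;  // 64 per clock (2x chip-wide)

static_assert(gf100Sms * gf100TmusPerSm == gf104Sms * gf104TmusPerSm,
              "same total number of texture units");
```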
> But what do you mean by "it usually happens before the actual texture instruction" (on a GPU)? Do today's GPUs perform ddx/ddy operations in the shader cores and then send the result along with the texture coordinates to the texture units? In other words, at the hardware level there's no tex2D(s, t) instruction, only a tex2D(s, t, ddx, ddy) instruction?

My wording above wasn't exact. And frankly, I don't know exactly how it is done on NV hardware; the following is specific to Radeons (but could be somewhat similar on NV, as they also have limits on the number of source parameters for each instruction).
> GF104 has twice the number of filter units per SM, and they are twice as fast with FP16 as those in GF100 [...]

Thanks, that clarifies a lot!
> My wording above wasn't exact. And frankly, I don't know exactly how it is done on NV hardware; the following is specific to Radeons [...]

Thanks, that's largely how I imagined it was done. The only surprise is that the shader units can't compute gradients. Doesn't that mean that ddx and ddy have much higher latency than other arithmetic instructions? Or does the TMU have different latencies depending on the operation, in turn meaning it has to deal with writeback hazards?
When using a normal sample instruction, the gradients are calculated on the fly from the texture coordinates within a quad (only in the quite distant past were they predetermined by the texture coordinates coming from the interpolation). But you actually can't supply the gradients directly within a sample instruction; there is no 1:1 mapping of the tex2D(s, t, ddx, ddy) instruction in the ISA. Instead, the compiler generates multiple instructions for it. First, it writes the user-supplied gradients to the TMU (2 instructions are needed; 3 horizontal and 3 vertical gradients are written), and after that it uses a special sample instruction which does not use the automatically calculated gradients but the ones set up before. Alternatively, one can supply texture coordinates to the TMU from which the gradients are calculated and kept there (no texture access is executed); later, the same special sample instruction as above can use these stored gradients with other texture coordinates.

The HLSL ddx and ddy instructions are actually implemented using the same TMU functionality: the values are written to the TMU, the gradients are calculated there and then read back into the ALU registers. There is no specific functionality in the ALUs to calculate derivatives.
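A conceptual C++ sketch of the on-the-fly path (not the actual Radeon ISA); the explicit-gradient path described above would simply overwrite these differences with user-supplied values before the sample executes:

```cpp
// Texture coordinates of one 2x2 pixel quad:
// index 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
struct QuadCoords { float u[4], v[4]; };

struct Derivatives { float dudx, dvdx, dudy, dvdy; };

// Coarse per-quad gradients as the TMU can derive them on the fly: one
// horizontal and one vertical difference of the incoming coordinates.
Derivatives quadGradients(const QuadCoords& q) {
    Derivatives g;
    g.dudx = q.u[1] - q.u[0];  // right minus left
    g.dvdx = q.v[1] - q.v[0];
    g.dudy = q.u[2] - q.u[0];  // bottom minus top
    g.dvdy = q.v[2] - q.v[0];
    return g;
}
// An HLSL-level tex2D(s, t, ddx, ddy) has no single ISA counterpart here: the
// compiler first writes the supplied gradients to the TMU, then issues a
// sample instruction that uses those stored values instead of this calculation.
```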
> Doesn't that mean that ddx and ddy have much higher latency than other arithmetic instructions?

Quite probably, yes. But maybe not that much higher. I can't recall having seen benchmarks of the ddx/ddy instructions.
> Or does the TMU have different latencies depending on the operation, in turn meaning it has to deal with writeback hazards?

All the usual TMU instructions have variable latency (the derivative/gradient instruction could be fixed latency) and have to deal with writeback hazards, as they have to cope with possible bank conflicts, cache misses, trips through the whole cache hierarchy, and finally memory accesses (which may necessitate an access over PCI Express to host memory) before they filter the samples and write the results back to the ALU registers. That can easily take several hundred clock cycles.