Originally Posted by dnavas
:lol: I noticed this divergence awhile ago, but it struck me as being exactly the kind of architecture you might expect if you purged your shader team of everyone who didn't buy the unified approach and put them on the texture unit team. It doesn't seem unified at all. The shader units are unified, but not the texture units.
I was referring to the shader architecture being unified, not the texturing architecture.
I think you're referring to the architecture of the texture units, anyway, but I can't work out what you're saying.
You have single-channel dedicated addressing units, single-channel dedicated samplers, multi-channel dedicated addressing units, and multi-channel samplers, which are effectively just four-wide single-channel samplers, but, err, they're "dedicated".
When you address texels you have to account for LOD and bias and the kind of filtering algorithm you intend to perform (merely bilinear or something more interesting). With higher quality filtering, the texels to be fetched for one pixel in a screen-space quad don't necessarily overlap with all the other texels for the other pixels in the quad. So each set of texels needs to be addressed.
So that's why you need a fair amount of TA capability for filtered texels. Addressing formulae are more involved than I can ever be bothered to remember (or work out), so, ahem, just think of loads of interpolations in each of the three dimensions of screenspace.
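To make that concrete, here's a toy sketch of the four addresses a bilinear filter needs for a single pixel. This is my own simplification, not the hardware formula: it ignores wrap modes, mipmap/LOD selection and bias entirely.

```python
import math

def bilinear_texel_addresses(u, v, width, height):
    """Toy sketch: which four texels (and weights) a bilinear filter
    needs for one pixel. Ignores wrap modes, LOD selection and bias;
    real hardware addressing is far more involved."""
    # Map normalised (u, v) into texel space (half-texel centre convention)
    x = u * width - 0.5
    y = v * height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0  # fractional parts become the filter weights
    texels = [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]
    weights = [(1 - fx) * (1 - fy), fx * (1 - fy), (1 - fx) * fy, fx * fy]
    return texels, weights
```

Each pixel in a screen-space quad repeats this with its own (u, v), which is why neighbouring pixels' footprints only partly overlap and each set of texels needs its own addressing.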
Now, for vertex fetches, addressing should be much simpler, because fetches are from a stream. Each element in the stream is the same size as its neighbours and there's usually not much reason to flit around, a serial read is fine. Addressing consists of base address + position-in-stream * size-of-element. Much easier to compute than texel addresses for filtering. Having said that, you may want to have a stride factor (for LOD), e.g. reading 1 in 10 vertices in a 3:1 LOD reduction.
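In code form, stream addressing is about as simple as it gets. A sketch (the stride parameter is my own illustration of the LOD-style decimation idea, not a specific hardware feature):

```python
def vertex_fetch_address(base, index, element_size, stride=1):
    """Toy vertex-fetch addressing: base + position-in-stream * size.
    `stride` sketches reading only every Nth element for LOD reduction.
    Compare this single multiply-add with the pile of interpolations
    needed to address filtered texels."""
    return base + (index * stride) * element_size
```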
In texture filtering, with multi-texturing, each layer of textures has effectively the same address. Well partly, anyway, because the mipmap chain might be different for the extra layers (they can be lower-detail). But anyway, multi-texturing should be able to (at least partly) re-use the texture addresses from level to level. And don't forget multi-texturing usually requires less texturing quality for these extra layers (e.g. only bilinear).
In vertex fetching you may want to sample from multiple streams in parallel. This is where you can pile on the attributes and do instancing. D3D10 allows for 8 streams to be used in parallel.
So, I'm guessing that the TAs for vertex fetch are used less densely than for texel fetch. The VF-TAs can each address one stream. So four of them allow four streams to be fetched in parallel.
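A sketch of what parallel multi-stream fetch looks like, with each stream modelled as a plain list of per-vertex elements (the stream contents below are made up for illustration):

```python
def fetch_vertex_attributes(streams, index):
    """Toy multi-stream vertex fetch: each stream is addressed
    independently, so four VF-TAs could serve four streams at once.
    D3D10 caps the number of simultaneously bound streams at 8."""
    assert len(streams) <= 8
    return [stream[index] for stream in streams]
```

For example, with a position stream and a normal stream bound together, fetching vertex 1 pulls one element from each stream in parallel.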
Separate from VF and texture filtering, you've got unfiltered texture fetches. In D3D10 these are from texture buffers, 1D, 2D or 3D. These could be something like big blobs of constant data (e.g. for morphing vertices) or they can be for post-processing of render targets (e.g. performing tone-mapping), etc.
When you address a single texel in a texture buffer, the shader will probably have performed some calculations to identify which texel is required. The TA then fetches the texel based on base address and offset (taking care of the 1D, 2D or 3D organisation of the texture). Each of the other objects executing the same shader (vertices, primitives or pixels) will compute its own address for the texel fetch. So that'll keep four TAs occupied. These TAs, I'm guessing, are VF-TAs. I guess that because without filtering, texture buffer fetches shirk most of the complexity of TA-ing (no interpolations are needed to generate these addresses).
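A sketch of why these addresses are cheap: linearising a 1D, 2D or 3D coordinate is just a few multiply-adds, with no interpolation. (This is a toy model assuming a simple row-linear layout; real GPUs usually tile textures, but the arithmetic stays similarly cheap.)

```python
def texel_address(base, coords, dims, texel_size):
    """Toy unfiltered texture-buffer fetch: linearise a 1D/2D/3D texel
    coordinate into base + offset. Assumes row-linear layout, which is
    a simplification; the point is there's no interpolation, unlike
    filtered texture addressing."""
    offset, scale = 0, 1
    for c, d in zip(coords, dims):
        offset += c * scale
        scale *= d
    return base + offset * texel_size
```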
As far as I can tell it's probably best to think of VF-TAs as much less complex than filtered-texture TAs. The throughput of both kinds of TAs needs to be high. At the same time there are overlapping and disparate kinds of fetches that need to be performed within a shader program, so you want to maximise the potential throughput per clock.
It's also worth remembering that the L2 cache in R600 is shared by both vertex L1 and texture L1 caches. In R600 the L1 texture cache is specifically for the filtering pipelines, as far as I can tell (based on patent documents). That would mean that all vertex fetches and texture buffer fetches come through the vertex L1.
In classical DX9 pixel shader code, some texel fetches are unfiltered. Typically these are for things like BRDFs (providing a short cut to the behaviour of light on a material) or for things like the infamous D3 specular lighting lookup. These texel fetches on R5xx and G7x have to be performed by the filtering pipelines, with the filtering turned off.
In theory a D3D10 GPU can perform these fetches using the texture buffer (vertex fetch) pipelines. This would then free up the filtering pipelines for their normal duty, instead of wasting them on unfiltered texels, which is onerous when your shader is trying to apply four or more textures per pixel.
I'm still interpreting here, nothing's set in stone...