Centralized Texturing

trinibwoy
I'll be honest, I haven't done much research or put much thought into this but figured I'd post the question anyway and hopefully spark some discussion and learn something in the process :)

Anyway, I was thinking about the nature of future workloads, and it seems less and less practical to have fixed-function units like TMUs dedicated to a limited number of processors, given the greater diversity of workloads that a given GPU might find itself undertaking in parallel at any point in time. Is it feasible to break out TMUs and L1 caches into a separate array with its own global scheduler that receives and schedules requests from any and all processor cores/clusters? This would ideally maximize the utilization of these fixed-function units, at some latency cost. Or would that totally destroy the spatial cache locality afforded by cluster-dedicated units?

This concept could be extended to other fixed-function stuff like setup/tessellation/rasterization, etc., as a way to relieve those bottlenecks. So instead of the traditional pipeline, where all this stuff sits "above" the shader core, the processors would just execute programs and hand off work to fixed-function blocks as required.

Apologies in advance if none of this makes sense :LOL:
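
To make that handoff idea concrete, here's a toy C++ sketch of the centralized variant: every cluster pushes its texture requests into a single shared pool of TMUs behind a global scheduler, instead of owning its own units. Everything here (the names, the request fields, the trivial dispatch loop) is a made-up assumption for illustration, not any real GPU's design:

```cpp
// Toy model: shader clusters submit texture requests to one shared TMU
// pool. Idle TMUs are rare since any cluster's work can fill them, but
// every request now crosses a chip-wide interconnect (extra latency).
#include <cstddef>
#include <cstdint>
#include <queue>

struct TexRequest {
    uint32_t cluster_id;  // which shader cluster issued the request
    uint32_t thread_id;   // which thread to wake when the result returns
    uint32_t texture_id;  // descriptor index (hypothetical)
    float    u, v;        // normalized texture coordinates
};

class SharedTmuPool {
    std::queue<TexRequest> global_queue;  // the "global scheduler" input
    std::size_t num_tmus;                 // fixed-function units in the pool
public:
    explicit SharedTmuPool(std::size_t n) : num_tmus(n) {}

    // Any cluster can submit; with per-cluster TMUs this queue would be
    // private to one cluster and could sit idle while others are swamped.
    void submit(const TexRequest& r) { global_queue.push(r); }

    // One scheduling cycle: hand out up to num_tmus requests regardless
    // of which cluster they came from, maximizing unit utilization.
    std::size_t dispatch_cycle() {
        std::size_t issued = 0;
        while (issued < num_tmus && !global_queue.empty()) {
            TexRequest r = global_queue.front();
            global_queue.pop();
            (void)r;  // a real model would start the fetch/filter op here
            ++issued;
        }
        return issued;
    }
};
```

The tradeoff the question asks about falls straight out of this picture: utilization goes up because the queue is shared, latency goes up because requests and results have to travel across the chip.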
 
Anyway, I was thinking about the nature of future workloads, and it seems less and less practical to have fixed-function units like TMUs dedicated to a limited number of processors, given the greater diversity of workloads that a given GPU might find itself undertaking in parallel at any point in time. Is it feasible to break out TMUs and L1 caches into a separate array with its own global scheduler that receives and schedules requests from any and all processor cores/clusters? This would ideally maximize the utilization of these fixed-function units, at some latency cost. Or would that totally destroy the spatial cache locality afforded by cluster-dedicated units?
Well, R6xx did something like that to some extent (it was not fully centralized, since the TMUs were bound to array pieces representing pixel quads, but they were shared across clusters). I'm not sure exactly what the problems with this approach are. Maybe it's problematic to move that much data around, or the scheduling overhead isn't worth it.
 
Shouldn't this rather be called distributed texturing? Central anything = serial bottleneck.

Isn't one of the primary advantages of having localized texture units that you don't need a huge on-chip interconnect to transfer texture request packets around? Texture requests have a large overhead (texel offset, etc.) to data (output texel) ratio, and this is something you don't have with global memory access (one segment address and mask per block of data).
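
A rough way to see that ratio in numbers; the field widths below are pure guesses (and real hardware would amortize the header across a quad of samples), but the proportions are the point:

```cpp
// Back-of-the-envelope: header bytes vs. payload bytes for a texture
// request crossing an interconnect, compared with a plain memory read.
// All field widths are illustrative assumptions, not a real packet format.
#include <cstdint>
#include <cstdio>

struct TexRequestPacket {   // per-sample request
    uint16_t u, v;          // fixed-point texel coordinates
    uint16_t lod;           // mip level / LOD fraction
    uint16_t texture_id;    // descriptor index
    uint16_t return_tag;    // where to route the result
};

struct TexResponse {
    uint32_t rgba;          // one filtered texel: 4 bytes of payload
};

struct MemRequestPacket {   // contrast: a global-memory transaction
    uint64_t segment_address;  // one address...
    uint32_t byte_mask;        // ...and mask cover a whole 64-byte line
};

int main() {
    std::printf("texture: %zu header bytes per %zu payload bytes\n",
                sizeof(TexRequestPacket), sizeof(TexResponse));
    std::printf("memory:  %zu header bytes per 64 payload bytes\n",
                sizeof(MemRequestPacket));
    return 0;
}
```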

If one was going to distribute texture requests across the chip, you'd have to distribute tiles of individual textures around the chip. This is a problem for filtering, since you need neighboring texels to filter. So you end up either with borders (a nightmare for texture storage, and not anisotropic-friendly unless you have a large border) or with each distributed TEX unit holding duplicated tiles. You could perhaps play with the granularity here, so that each distributed TEX unit has much larger tiles associated with it than the physical tiles transferred to/from global memory (i.e., a larger border, which might solve the anisotropic filtering problem).
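
A minimal sketch of that boundary problem, assuming a hypothetical 64-texel tile edge and texel-space coordinates: a bilinear lookup reads a 2x2 footprint, and whenever the footprint straddles a tile edge, the owning TEX unit needs texels it doesn't have, which is exactly what the borders or duplicated tiles pay for:

```cpp
// Why distributed tiles need borders: a bilinear sample touches a 2x2
// block of texels that may not all live in the same tile. Tile size and
// coordinate convention are arbitrary assumptions for illustration.
#include <cmath>

constexpr int TILE = 64;  // hypothetical tile edge length in texels

// Which tile (and thus which distributed TEX unit) owns texel (x, y)?
// Assumes non-negative texel coordinates.
inline int tile_of(int x, int y, int tiles_per_row) {
    return (y / TILE) * tiles_per_row + (x / TILE);
}

// (u, v) in texel units, footprint assumed inside the texture. Bilinear
// filtering reads texels (x0, y0) .. (x0+1, y0+1); if those four texels
// span more than one tile, the sample needs border texels or duplicates.
inline bool footprint_crosses_tiles(float u, float v, int tiles_per_row) {
    int x0 = static_cast<int>(std::floor(u - 0.5f));
    int y0 = static_cast<int>(std::floor(v - 0.5f));
    return tile_of(x0, y0, tiles_per_row) !=
           tile_of(x0 + 1, y0 + 1, tiles_per_row);
}
```

Widening the per-unit tile relative to the memory tile, as suggested above, just changes how often that predicate fires.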

Because of filtering, having a dedicated TU per ALU cluster makes a lot of sense.
 
If one was going to distribute texture requests across the chip, you'd have to distribute tiles of individual textures around the chip. This is a problem for filtering, since you need neighboring texels to filter. So you end up either with borders (a nightmare for texture storage, and not anisotropic-friendly unless you have a large border) or with each distributed TEX unit holding duplicated tiles.
This happens anyway on existing GPUs that have more than one quad-TU.

Jawed
 
This happens anyway on existing GPUs that have more than one quad-TU.

Jawed

Yes, memory is duplicated in the TUs in both cases. The difference is the interconnect network needed in the distributed case.

Jawed, I know you want to extend this topic into joint distributed TEX/RBE (ROP) units... which might make it easier to realize an advantage in the interconnect network (assuming one didn't have a fixed mapping of clusters to ROPs for render-target tiles).
 
I'm no engineer, but the fact that the only FF units featured in LRB are the TMUs speaks pretty loudly to me. I'm not saying texture ops can't be done in software, but I think we'd want a lot more ALUs than are currently available to pull off software texturing effectively.
 
Centralized anything doesn't make sense. Even if you were to make it possible for TUs to be used by remote ALUs, it would still make sense to distribute them to minimize traffic for typical workloads.
 
Isn't one of the primary advantages of having localized texture units that you don't need a huge on-chip interconnect to transfer texture request packets around? Texture requests have a large overhead (texel offset, etc.) to data (output texel) ratio, and this is something you don't have with global memory access (one segment address and mask per block of data).

Centralized anything doesn't make sense. Even if you were to make it possible for TUs to be used by remote ALUs, it would still make sense to distribute them to minimize traffic for typical workloads.

Well, Nvidia is already making noises about focusing heavily on on-chip networks in future architectures (hence Dally's ascension). I don't see how the current approach is sustainable in the long run, as the texturing workload is bound to vary considerably across the array as algorithms become more complex and dynamic. But maybe by that time FF texture units will have gone away, so it'll be a moot point.

On a slightly different note, does anyone know of a diagram, article, patent, etc. that clearly describes the relationship between TMUs, ROPs and load/store units?
 
Hopefully I'm not way off with this, but doesn't the 360 GPU do something a little like what the OP is suggesting? The texture units are not tied to any particular shader block...
 
Hopefully I'm not way off with this, but doesn't the 360 GPU do something a little like what the OP is suggesting? The texture units are not tied to any particular shader block...
Is what Xenos is doing there actually different from R600? The diagrams I've seen seem to suggest so, but Xenos has four quad-TMUs, which would fit perfectly with the 16-wide SIMD arrays for R600-style shared texture units.
 
When will the time come when we dump the whole concept of textures, and instead convert textures into a mathematical representation, for example in the form of a polygon mesh or other suitable data structure?

Performance continues to rise, but image quality is still hamstrung by the fact that everything turns into a blur whenever you closely approach an object (or else you're forced to supply ridiculously huge and memory-intensive texture maps for almost every object).
 
When will the time come when we dump the whole concept of textures, and instead convert textures into a mathematical representation, for example in the form of a polygon mesh or other suitable data structure?

Why would a polygon mesh take up less space than a compressed texture map? You would need sub-pixel-sized polygons to represent unique detail. Procedural textures can never do that.
 
It doesn't really matter; there will always be a crossover where storage is cheaper than computation... and synthesized textures are generally seeded by normal textures.
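
As a toy illustration of that crossover, here's a sketch of a synthesized texture seeded by a small stored map: a few stored bits supply the low-frequency structure, and computation fills in high-frequency detail. The hash constants and blend weights are arbitrary choices for illustration, not anyone's production technique:

```cpp
// Storage vs. computation: a tiny stored seed texture plus a cheap
// procedural noise term stands in for a huge unique texture map.
#include <cstdint>

// Hash-based value noise: pure computation, zero storage.
static float noise(int x, int y) {
    uint32_t h = static_cast<uint32_t>(x) * 374761393u +
                 static_cast<uint32_t>(y) * 668265263u;
    h = (h ^ (h >> 13)) * 1274126177u;
    return static_cast<float>(h & 0xFFFFu) / 65535.0f;
}

// u, v assumed in [0, 1). The seed texture is the "normal texture" that
// seeds the synthesis; noise adds detail that would otherwise blur out
// (or require an enormous map) under close-up magnification.
float sample_synthesized(const float* seed_tex, int seed_w, int seed_h,
                         float u, float v) {
    int sx = static_cast<int>(u * seed_w) % seed_w;
    int sy = static_cast<int>(v * seed_h) % seed_h;
    float base   = seed_tex[sy * seed_w + sx];                // stored bits
    float detail = noise(static_cast<int>(u * 4096.0f),
                         static_cast<int>(v * 4096.0f));      // computed bits
    return 0.75f * base + 0.25f * detail;
}
```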
 