I'll be honest: I haven't done much research or put much thought into this, but I figured I'd post the question anyway to hopefully spark some discussion and learn something in the process.
Anyway, I was thinking about the nature of future workloads, and it seems less and less practical to dedicate fixed-function units like TMUs to a limited number of processors, given the growing diversity of workloads a GPU might be running in parallel at any point in time. Is it feasible to break out the TMUs and L1 caches into a separate array with its own global scheduler that receives and schedules requests from any and all processor cores/clusters? Ideally this would maximize utilization of these fixed-function units, at some latency cost. Or would it totally destroy the spatial cache locality afforded by cluster-dedicated units?
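To make the utilization-versus-latency question a bit more concrete, here is a toy back-of-the-envelope model in Python. Everything in it is an assumption for illustration, not a description of any real GPU: the names `drain_cycles` and `utilization`, the `hop_latency` parameter (a made-up fixed cost for reaching a shared array), and the rule that each unit retires one request per cycle. It deliberately ignores cache locality, so it only speaks to the first half of the question.

```python
import math


def drain_cycles(demand, units_per_cluster, pooled, hop_latency=0):
    """Cycles needed to finish all texture requests.

    demand[c] is the number of requests issued by cluster c (hypothetical
    workload, all issued at cycle 0). Each unit retires one request per cycle.
    hop_latency is an assumed fixed cost for reaching a shared array; it is
    only applied in the pooled case.
    """
    if pooled:
        # Any unit can serve any cluster, so only the total demand matters.
        total_units = units_per_cluster * len(demand)
        return math.ceil(sum(demand) / total_units) + hop_latency
    # Dedicated units: the busiest cluster determines the finish time.
    return max(math.ceil(d / units_per_cluster) for d in demand)


def utilization(demand, units_per_cluster, cycles):
    """Fraction of unit-cycles that did useful work."""
    return sum(demand) / (units_per_cluster * len(demand) * cycles)


# Two made-up workloads over four clusters with four units each.
for name, demand in [("skewed", [64, 0, 8, 0]), ("balanced", [18, 18, 18, 18])]:
    dedicated = drain_cycles(demand, 4, pooled=False)
    shared = drain_cycles(demand, 4, pooled=True, hop_latency=2)
    print(name, "dedicated:", dedicated, "pooled:", shared)
```

With these made-up numbers, the skewed workload takes 16 cycles with dedicated units and 7 with the pooled array, while the balanced workload takes 5 cycles dedicated and 7 pooled. So in this sketch pooling only pays off when demand is uneven across clusters, and the extra hop is pure overhead otherwise. The locality question would need an actual cache model on top of this.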
This concept could be extended to other fixed-function stuff like setup, tessellation, rasterization, etc. as a way to relieve those bottlenecks. So instead of the traditional pipeline where there's all this stuff "above" the shader core, the processors would just execute programs and hand off work to fixed-function blocks as required.
Apologies in advance if none of this makes sense.