And even then you wouldn't have the right instructions and paths to make it great.
SIMT/Compute is a little limiting here. But if you scheduled your vector lanes/threads by hand, you could maybe achieve much higher utilization and efficiency.
With the increasing complexity of shader programs, it will become useful at some point to split them into 'passes', simply because a shader is limited by its worst-case scenario: maybe 90% of your shader could actually run with half the resources (e.g. register count), but you have to allocate for the worst case. If you split the work into passes, you could use a completely different setup for gathering shadowing information than for doing your PBR shading, applying detail textures, or gathering light-probe information.
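To make the register-pressure argument concrete, here is a toy scalar C++ sketch (all names and the stand-in logic are hypothetical; on a real GPU each pass would be a separate kernel compiled against its own, smaller worst case). The point is that the intermediate record handed between passes can be tiny compared to the live state inside each pass:

```cpp
#include <cstdint>
#include <vector>

struct GBufferTexel { float nx, ny, nz, depth; uint32_t albedo; };
struct ShadowSample { uint8_t visibility; };      // compact intermediate record

// Pass 1 only keeps the state a shadow lookup needs alive.
ShadowSample gather_shadow(const GBufferTexel& t) {
    return { t.depth < 0.5f ? uint8_t(255) : uint8_t(0) };      // stand-in logic
}

// Pass 2 does the register-hungry PBR work, reading the small intermediate.
uint32_t shade_pbr(const GBufferTexel& t, ShadowSample s) {
    return s.visibility ? t.albedo : (t.albedo & 0xFF000000u);  // stand-in logic
}

void shade_split(const std::vector<GBufferTexel>& texels,
                 std::vector<uint32_t>& out)
{
    std::vector<ShadowSample> shadow(texels.size());
    for (size_t i = 0; i < texels.size(); ++i)    // pass 1: low register pressure
        shadow[i] = gather_shadow(texels[i]);
    for (size_t i = 0; i < texels.size(); ++i)    // pass 2: different resource setup
        out[i] = shade_pbr(texels[i], shadow[i]);
}
```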
NVIDIA has a paper about this, targeting ray tracing:
http://www.nvidia.com/docs/IO/76976/HPG2009-Trace-Efficiency.pdf In it they manage 'jobs' per lane by hand and also divide the task into two parts, executing whichever part has the higher occupancy. It's a bit cumbersome on SIMT/Compute, but with a CPU-like programming model you could increase the efficiency. I'm referring to:
"It often happens that a few long-running rays keep the entire warp hostage. We modified the persistent threads approach to periodically replace the terminated rays with new ones that start from the root. This approach clearly improves the SIMD efficiency..."
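Here's a rough sketch of that lane-refill idea mapped onto a CPU-like model with AVX-512. Everything concrete is invented for illustration: a 'ray' is reduced to a single remaining-work counter per lane, and the job queue is just an array. The interesting part is the compare-mask plus masked expand-load pair, which refills only the terminated lanes from a compact queue without disturbing the live ones:

```cpp
#include <immintrin.h>
#include <bit>
#include <cstdio>

int main() {
    // 64 queued "rays"; per-lane state is reduced to a remaining-work counter.
    float queue[64];
    for (int i = 0; i < 64; ++i) queue[i] = float(1 + i % 7);

    __m512 work = _mm512_loadu_ps(queue);   // launch the first 16 rays
    int head = 16, steps = 0;

    for (;;) {
        // One traversal "step" on all 16 lanes.
        work = _mm512_sub_ps(work, _mm512_set1_ps(1.0f));
        ++steps;

        // Mask of lanes whose ray just terminated.
        __mmask16 dead = _mm512_cmp_ps_mask(work, _mm512_setzero_ps(), _CMP_LE_OS);
        int n = std::popcount(static_cast<unsigned>(dead));
        if (n == 0) continue;
        if (head + n > 64) break;  // queue drained; a tail loop would finish the rest

        // The key trick: refill only the dead lanes from the compact queue.
        work = _mm512_mask_expandloadu_ps(work, dead, queue + head);
        head += n;
    }
    printf("%d rays started in %d batch steps\n", head, steps);
}
```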
And IF you could increase efficiency, then you could go for wider SIMT, because the sweet spot would move from the current SIMT16 (on NV/ATI, afaik) to maybe 64.
But I have doubts this could be done at the driver level or with just a small extension. You'd need higher-level control over how data flows (which is currently handled by drivers/hardware). E.g. you could bin fragments by light-source count, not only via screen-space carving and depth-bound checks, but by actually using the shadowing information on top (which you gather anyway). BUT you cannot just do that for the whole screen and potentially create hundreds of megabytes of data that are written to main memory and consumed later on; you'd rather keep the buffers at L2 or L3 size and dynamically schedule between the various task types that generate and consume that data.
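A sketch of that 'keep it cache-sized' scheduling, again with invented names and sizes: bins are small fixed-capacity buffers, and a bin is consumed the moment it fills, so the producer/consumer hand-off happens while the records are still resident in L2/L3 instead of streaming hundreds of megabytes through main memory:

```cpp
#include <array>
#include <bit>
#include <cstddef>
#include <cstdint>

struct Fragment { uint32_t x, y; uint32_t light_mask; };  // assumes <= 8 lights

constexpr size_t kBinCapacity = 4096;  // ~48 KB per bin: sized for L2, not DRAM
constexpr int    kMaxLights   = 8;

// Stand-in consumer: a real one would run a shading kernel specialized for n lights.
void shade_n_lights(int n, const Fragment* frags, size_t count) {
    (void)n; (void)frags; (void)count;
}

class FragmentBinner {
    struct Bin { std::array<Fragment, kBinCapacity> items; size_t count = 0; };
    std::array<Bin, kMaxLights + 1> bins_;
public:
    // Producer side: bin by light count (after shadow/depth-bound rejection).
    void push(const Fragment& f) {
        int n = std::popcount(f.light_mask);
        Bin& b = bins_[n];
        b.items[b.count++] = f;
        if (b.count == kBinCapacity) flush(n);  // consume while still hot in cache
    }
    void flush(int n) {
        shade_n_lights(n, bins_[n].items.data(), bins_[n].count);
        bins_[n].count = 0;
    }
    void flush_all() { for (int n = 0; n <= kMaxLights; ++n) flush(n); }
};
```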
AVX-512 should have all the parts needed to research that.
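For instance (a minimal sketch, nothing beyond plain AVX-512F intrinsics; the 0.8 depth bound and the data are made up): a vector compare yields a per-lane mask, and a compress-store appends only the selected lanes to a compact bin, which is the basic building block for the kind of dynamic binning sketched above:

```cpp
#include <immintrin.h>
#include <bit>
#include <cstdio>

int main() {
    alignas(64) float depth[16];
    for (int i = 0; i < 16; ++i) depth[i] = i * 0.1f;

    float near_bin[16];
    __m512 d = _mm512_load_ps(depth);

    // Vector compare produces a per-lane mask (a depth-bound check here)...
    __mmask16 pass = _mm512_cmp_ps_mask(d, _mm512_set1_ps(0.8f), _CMP_LT_OS);

    // ...and the compress-store appends just those lanes, densely packed.
    _mm512_mask_compressstoreu_ps(near_bin, pass, d);

    printf("%d of 16 fragments binned\n",
           std::popcount(static_cast<unsigned>(pass)));
}
```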