Really wide SIMT in GPUs

For efficiency, I'd guess you'd want to aim for something on the order of 10-20, but the more the merrier :)

Sure, that's from the consumer PoV; I was thinking about the producer PoV.
To be totally scalable you would have to be able to rasterize 1px triangles as fast as 300px ones (relatively, i.e. at 1/300th of the rate). No current architecture achieves this, I think.
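
To put rough numbers on that: here is a toy cost model (C++, every figure invented) where each triangle pays a fixed setup cost plus a per-pixel cost, which is enough to show why the per-pixel rate collapses for tiny triangles.

```cpp
#include <cstdio>

int main() {
    // Hypothetical costs: a fixed setup charge per triangle plus a per-pixel charge.
    const double setup_cycles = 1.0;
    const double pixel_cycles = 1.0 / 16;

    const int areas[] = {300, 30, 4, 1};  // triangle sizes in pixels
    for (int area : areas) {
        double total = setup_cycles + area * pixel_cycles;
        std::printf("area %3d px: %.3f cycles per pixel\n", area, total / area);
    }
    // "Totally scalable" would mean a constant cycles-per-pixel figure;
    // the fixed per-triangle term makes the 1 px case by far the most
    // expensive per pixel.
    return 0;
}
```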
 
But why would you want to? Rasterising one-pixel triangles is enormously wasteful - and I don't mean just execution unit occupancy, but the density of input data per unit of surface area. At that level of detail you're much better off using alternative representations.
 

Because if you can do that, your hardware architecture is of the kind where you are able to merge all 1px+ triangles into larger vectors and rasterize them without penalty, at full throughput.
Currently you always lose as soon as more than one triangle is involved: either it becomes super complex, with lots of overhead to merge and then split again, or you can't do it at all because one vector needs to serve exactly one z-tile, or for other reasons.
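
Purely as a sketch of what that merging would mean (illustrative types, a hypothetical 16-wide vector, not any real API): fragments from many small triangles get packed into full waves regardless of which triangle produced them, something current rasterizers generally cannot do across triangles.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Illustrative type only: one lane's worth of work is a fragment tagged with
// the triangle it came from.
struct Fragment { uint32_t tri; uint16_t x, y; };

constexpr int kWave = 16;  // hypothetical vector width

// Pack fragments from many small triangles into full 16-wide waves, regardless
// of which triangle produced them, and hand each full wave to the shading stage.
void pack_and_shade(const std::vector<Fragment>& frags,
                    void (*shade_wave)(const std::array<Fragment, kWave>&, int count)) {
    std::array<Fragment, kWave> wave;
    int n = 0;
    for (const Fragment& f : frags) {
        wave[n++] = f;
        if (n == kWave) { shade_wave(wave, n); n = 0; }
    }
    if (n > 0) shade_wave(wave, n);  // partially filled tail wave
}
```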
 
Pixels are generally rendered in quads in order to support the gradient functions of the rendering APIs, so a 1x1 triangle will be as slow as a 2x2, at least AFAICS. Also, the triangle set-up is a relatively non-trivial operation and would still need to be done if you wanted anything other than point-sampling.
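
For reference, a scalar model of the quad arrangement (not any particular hardware's implementation): the API's coarse ddx/ddy come out as differences between adjacent lanes of the 2x2 quad, which is why a 1x1 triangle still occupies all four lanes.

```cpp
#include <cstdio>

// Scalar model of a 2x2 pixel quad as four SIMT lanes: lane bit 0 selects the
// horizontal neighbour, lane bit 1 the vertical one, so coarse ddx/ddy are
// just differences between adjacent lanes of the same quad.
float ddx(const float v[4], int lane) { return v[lane | 1] - v[lane & ~1]; }
float ddy(const float v[4], int lane) { return v[lane | 2] - v[lane & ~2]; }

int main() {
    // Attribute values at the four quad pixels (made-up numbers).
    const float u[4] = {0.10f, 0.30f, 0.15f, 0.35f};
    for (int lane = 0; lane < 4; ++lane)
        std::printf("lane %d: ddx=%.2f ddy=%.2f\n", lane, ddx(u, lane), ddy(u, lane));
    // A 1x1 triangle still has to launch the whole quad: the other three
    // "helper" lanes exist only to provide these cross-lane deltas.
    return 0;
}
```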
 
Nowadays it would probably be more efficient and accurate if the fragment programs calculated the gradients themselves (yes, this implies FPs would also need to interpolate attributes explicitly); then fragment merging would work out better and small triangles would be less of a problem.
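
A sketch of what that could look like, with illustrative names and perspective correction left out for brevity: interpolate the attribute as a screen-space plane, and its gradient is exact and constant over the triangle, with no quad or cross-lane difference needed.

```cpp
// Illustrative sketch, not an existing API: with explicit interpolation the
// attribute is a plane a(x, y) = a0 + dadx*x + dady*y over the triangle in
// screen space (perspective correction omitted for brevity), so its gradient
// is exact and identical for every pixel of the triangle.
struct Vec2 { float x, y; };
struct AttribPlane { float dadx, dady, a0; };

// p[] are the screen-space vertex positions, a[] the attribute values there.
AttribPlane attribute_plane(const Vec2 p[3], const float a[3]) {
    const float det = (p[1].x - p[0].x) * (p[2].y - p[0].y)
                    - (p[2].x - p[0].x) * (p[1].y - p[0].y);  // signed doubled area
    const float dadx = ((a[1] - a[0]) * (p[2].y - p[0].y)
                      - (a[2] - a[0]) * (p[1].y - p[0].y)) / det;
    const float dady = ((a[2] - a[0]) * (p[1].x - p[0].x)
                      - (a[1] - a[0]) * (p[2].x - p[0].x)) / det;
    const float a0 = a[0] - dadx * p[0].x - dady * p[0].y;     // value at (0, 0)
    return {dadx, dady, a0};
}

// In the fragment program: value = a0 + dadx * x + dady * y,
// and the gradient used for e.g. mip selection is simply (dadx, dady).
```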
 
Pixels are generally rendered in quads in order to support the gradient functions of the rendering APIs, so a 1x1 triangle will be as slow as a 2x2, at least AFAICS.

You can always generate gradients, which really are gradients, per pixel, because they are geometrically derived. The other stuff is actually a cross-lane "delta", and less useful than a real cross-lane operation. Cross-lane access can be configured symmetrically with the packing when you have full vector swizzles. I don't see the need for ddx/ddy support.

Also, the triangle set-up is a relatively non-trivial operation and would still need to be done if you wanted anything other than point-sampling.

True. But I think you could load-balance much better. Anyway, it's a fantasy. I don't think something this large could be redesigned radically differently nowadays, and nobody buys mini, simple GPUs which could start from scratch. Honestly I don't know how we could possibly get alternatives, except by going full software compute. And even then you wouldn't have the right instructions and paths to make it great.
 
And even then you wouldn't have the right instructions and paths to make it great.
SIMT/compute is a little bit limiting, but if you schedule your vector lanes/threads by hand, you could maybe achieve much higher utilization and efficiency.
With the increasing complexity of shader programs, it will become useful at some point to split them into 'passes'. This is simply because a shader is limited by its worst-case scenario: maybe 90% of your shader could actually run with half the resources (e.g. register count), but you allocate for the worst case. If you split the work into passes, you could use a completely different setup for gathering shadowing information than for doing your PBR shading, for applying detail textures, or for gathering light probe information.
NVIDIA has a paper about this targeting ray tracing: http://www.nvidia.com/docs/IO/76976/HPG2009-Trace-Efficiency.pdf where they manage 'jobs' per lane by hand and also divide the task into two parts, executing whichever part has the higher occupancy. It's a bit cumbersome on SIMT/compute, but with a CPU-like programming model you could increase the efficiency. I'm referring to:
"It often happens that a few long-running rays keep the entire warp hostage. We modified the persistent threads approach to periodically replace the terminated rays with new ones that start from the root. This approach clearly improves the SIMD efficiency..."
And IF you could increase efficiency, then you could go for wider SIMT, because the sweet spot would move from the current SIMT16 (on NV/ATI, AFAIK) to maybe 64.
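
As a scalar model of that lane-refill idea (C++, with a plain countdown standing in for ray traversal; this is not the paper's actual code): any lane that finishes its ray immediately pulls a new one from a pool instead of idling until the whole wave retires.

```cpp
#include <cstddef>
#include <vector>

// Scalar model of the persistent-threads refill idea; the "job" is just a
// countdown standing in for real traversal work.
constexpr int kWidth = 16;

struct Ray { int steps_left; };

void trace_persistent(const std::vector<Ray>& pool) {
    Ray lane[kWidth];
    for (int i = 0; i < kWidth; ++i) lane[i].steps_left = -1;  // -1 = retired lane
    std::size_t next = 0;
    int active = 0;
    for (int i = 0; i < kWidth && next < pool.size(); ++i) { lane[i] = pool[next++]; ++active; }

    while (active > 0) {
        // One "vector" step: every live lane advances its ray by one iteration.
        for (int i = 0; i < kWidth; ++i)
            if (lane[i].steps_left > 0) --lane[i].steps_left;

        // Refill lanes whose ray just terminated, so a few long-running rays
        // don't keep the whole wave hostage.
        for (int i = 0; i < kWidth; ++i) {
            if (lane[i].steps_left == 0) {
                if (next < pool.size()) lane[i] = pool[next++];
                else { lane[i].steps_left = -1; --active; }
            }
        }
    }
}
```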

But I have doubts this could be done at the driver level or with just a small extension. You'd need to create a higher-level control of how data flows (which is currently handled by drivers/hardware). E.g. you could bin fragments by light source count, not only using some screen-space carving and depth-bound checks, but by really using the shadowing information on top (which you gather anyway). BUT you cannot just do that for the whole screen and potentially create hundreds of megabytes of data that is written to main memory and consumed later on; you'd rather keep the buffers at L2 or L3 size and dynamically schedule between the various task types that generate and consume that data.
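
Roughly the scheduling pattern being described, with made-up types and a made-up buffer size standing in for "fits in L2": produce binned fragments until the buffer fills, drain it, repeat, and never spill the whole screen's worth of intermediate data to main memory.

```cpp
#include <cstddef>
#include <queue>

// Illustrative types and sizes only: binned fragments carry whatever the
// binning key is (here a light count), and kMaxBuffered stands in for
// "roughly L2-sized".
struct BinnedFragment { int x, y, light_count; };
constexpr std::size_t kMaxBuffered = 4096;

// produce(f) fills in the next binned fragment and returns false when the
// input is exhausted; consume(f) shades one fragment. The scheduler alternates
// between the two so the intermediate buffer never grows past cache size.
template <class Produce, class Consume>
void schedule(Produce produce, Consume consume) {
    std::queue<BinnedFragment> buf;
    bool more_input = true;
    while (more_input || !buf.empty()) {
        while (more_input && buf.size() < kMaxBuffered) {
            BinnedFragment f;
            if (!produce(f)) { more_input = false; break; }
            buf.push(f);
        }
        while (!buf.empty()) { consume(buf.front()); buf.pop(); }
    }
}
```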

AVX512 should have all the needed parts to research that.
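
For instance, something along these lines (assuming AVX-512F plus BMI2, with a countdown "job" standing in for real per-lane work): mask registers and masked expand-loads give you a refill-terminated-lanes operation, sixteen lanes wide, in software, essentially the lane-refill sketch above done with hardware masks.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Sketch only (assumes AVX-512F and BMI2): a 16-wide software wave where each
// lane runs one "job" (a plain countdown standing in for real work), and
// finished lanes are refilled from a queue via masks and an expand-load.
struct Pool {
    const int32_t* items;  // remaining iteration count per job
    std::size_t count;
    std::size_t next;
};

inline void run_wave(Pool& pool) {
    __m512i work = _mm512_setzero_si512();
    __mmask16 alive = 0;  // which lanes currently hold a live job

    for (;;) {
        // Refill dead lanes: take as many queued jobs as there are dead lanes,
        // trimming the refill mask if the queue is nearly empty.
        __mmask16 dead = (__mmask16)~alive;
        unsigned want = (unsigned)_mm_popcnt_u32(dead);
        unsigned left = (unsigned)(pool.count - pool.next);
        unsigned take = want < left ? want : left;
        __mmask16 refill = (__mmask16)_pdep_u32((1u << take) - 1u, dead);
        work = _mm512_mask_expandloadu_epi32(work, refill, pool.items + pool.next);
        pool.next += take;
        alive |= refill;
        if (alive == 0) break;  // queue drained and every lane retired

        // One "vector" step: every live lane advances (counts down) its job.
        work = _mm512_mask_sub_epi32(work, alive, work, _mm512_set1_epi32(1));

        // Retire lanes whose job just finished; they are refilled next round,
        // so long-running lanes never hold the whole wave hostage.
        __mmask16 done = _mm512_mask_cmple_epi32_mask(alive, work, _mm512_setzero_si512());
        alive = (__mmask16)(alive & ~done);
    }
}
```

(Compile with something like -mavx512f -mbmi2 on GCC/Clang.)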
 