There are no "helper" lanes. Each active lane processes the entire triangle. It's a traditional SIMD loop over as many distinct triangles as your width. If you need derivatives, each lane calculates its own private derivative _once_ from the three available vertices and has it available as a constant in the loop body.
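A minimal sketch of what "a private derivative from the three vertices" could mean (my interpretation, names mine): an interpolated attribute is linear over a triangle, so its screen-space gradient is one constant per triangle, solvable from the vertices exactly once before the pixel loop.

```python
# Sketch (my interpretation, names mine): an interpolated attribute is
# a plane over the triangle, so (da/dx, da/dy) is constant and can be
# solved once per lane from the three vertices.

def attribute_gradient(v0, v1, v2, a0, a1, a2):
    """v* = (x, y) screen positions, a* = attribute values at them."""
    x0, y0 = v0; x1, y1 = v1; x2, y2 = v2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)  # 2x signed area
    dadx = ((a1 - a0) * (y2 - y0) - (a2 - a0) * (y1 - y0)) / det
    dady = ((a2 - a0) * (x1 - x0) - (a1 - a0) * (x2 - x0)) / det
    return dadx, dady

# Sanity check: the plane a = 2x + 3y reproduces its own gradient.
g = attribute_gradient((0, 0), (4, 0), (0, 4), 0.0, 8.0, 12.0)
# g == (2.0, 3.0)
```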
We just need to give those helper lanes some name.
But I thought derivatives in the PS are not calculated from vertices, but from adjacent lanes working on adjacent pixels of a 2x2 quad.
I imagine each triangle is rasterized coarsely and conservatively to form a 'half-res, double-sized' triangle of 2x2 quads, 'wasting' some lanes that fall outside the full-res triangle.
But because full quads are guaranteed to exist, derivatives / gradients can be calculated using the registers of the other 3 threads in the quad, so the PS stage no longer needs access to vertex or triangle data.
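A toy model of that mechanism (my assumption of the behavior, not vendor documentation): the four lanes of a quad each hold one value, and the 'coarse' derivatives are plain differences between adjacent lanes, with no vertex or triangle data involved.

```python
# Toy model (assumption mine, not vendor documentation): coarse
# derivatives as plain differences between adjacent lanes of a quad.

# Quad lane layout:  0 1
#                    2 3
def ddx_coarse(quad):  # horizontal difference, taken from the top row
    return quad[1] - quad[0]

def ddy_coarse(quad):  # vertical difference, taken from the left column
    return quad[2] - quad[0]

# Four lanes sampling a = 2x + 3y at pixels (0,0), (1,0), (0,1), (1,1):
quad = [0.0, 2.0, 3.0, 5.0]
# ddx_coarse(quad) == 2.0 and ddy_coarse(quad) == 3.0
```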
Those quads also give us some limited but fast options to read registers from adjacent threads in compute shaders, since the introduction of wave intrinsics.
I further assume the coarse allocation of quads to a triangle is tight, using a half-res bounding triangle, not something simple like a bounding rectangle, which would waste many threads and entire quads.
So each quad will have at least one pixel inside the triangle.
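Here's a minimal sketch of that coarse allocation as I imagine it (assumptions and names mine; for simplicity this walks the bounding box rather than a tighter half-res bounding triangle, and it ignores fill rules like top-left). A quad is emitted only if at least one of its four pixel centers passes all three edge tests, so every emitted quad has at least one covered pixel and the rest of its lanes are helpers.

```python
# Sketch of coarse quad allocation (assumptions and names mine): emit a
# 2x2 quad iff at least one of its 4 pixel centers is inside all three
# edges, then count the leftover helper lanes.

def edge(ax, ay, bx, by, px, py):
    # Signed area: > 0 means (px, py) lies on the left of edge a->b.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def emit_quads(v0, v1, v2):
    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    quads = []  # ((quad_x, quad_y), [coverage of the 4 lanes])
    for qy in range(int(min(ys)) // 2 * 2, int(max(ys)) + 1, 2):
        for qx in range(int(min(xs)) // 2 * 2, int(max(xs)) + 1, 2):
            covered = [
                all(edge(*a, *b, px + 0.5, py + 0.5) > 0
                    for a, b in ((v0, v1), (v1, v2), (v2, v0)))
                for py in (qy, qy + 1) for px in (qx, qx + 1)
            ]
            if any(covered):
                quads.append(((qx, qy), covered))
    return quads

# An 8x8 right triangle: 10 quads emitted, 28 of the 40 lanes shade
# real pixels and 12 are helpers.
quads = emit_quads((0, 0), (8, 0), (0, 8))
helpers = sum(c.count(False) for _, c in quads)
```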
However, that's not actually bad, and I wonder why SW raster in Nanite is a win at all.
Generating a visibility buffer means little work happens in the PS, so even wasting half the threads on very small triangles does not sound too bad to me.
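To put a number on the waste (arithmetic mine): with quad shading, a triangle covering p pixels across q touched quads still runs 4·q lanes, so utilization is p / (4·q) and a one-pixel triangle bottoms out at 25% — worse than 'half', though each lane is still cheap for a visibility buffer.

```python
# Back-of-envelope (assumption mine): lane utilization of quad-based
# pixel shading. p covered pixels across q quads still run 4*q lanes.
def utilization(pixels_covered, quads_touched):
    return pixels_covered / (4 * quads_touched)

# A 1-pixel triangle still launches one full quad:
u = utilization(1, 1)
# u == 0.25, i.e. 75% of the lanes are helpers
```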
For a compute rasterizer we need some way to fuse the two nested loops (over triangles, then over each triangle's pixels) into a single one, to keep all threads working in lockstep on multiple triangles per wave, which increases the per-pixel logic.
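One common way to express that fusion (a sketch under my own naming, not any real rasterizer's code) is a prefix sum over per-triangle pixel counts, so each lane turns its flat index back into a (triangle, pixel) pair:

```python
# Sketch of the loop fusion (naming mine): a prefix sum over
# per-triangle pixel counts lets each lane map a flat index to a
# (triangle, pixel) pair, so one 1D loop covers several triangles.
from bisect import bisect_right
from itertools import accumulate

def fuse(pixel_counts):
    """Yield (triangle_index, local_pixel_index) for a flat loop."""
    offsets = list(accumulate(pixel_counts))  # exclusive end offsets
    for flat in range(offsets[-1]):  # on a GPU: the wave-wide lane id
        tri = bisect_right(offsets, flat)  # find owning triangle
        local = flat - (offsets[tri - 1] if tri else 0)
        yield tri, local

# Three triangles covering 3, 1 and 2 pixels -> 6 lockstep iterations:
work = list(fuse([3, 1, 2]))
# work == [(0, 0), (0, 1), (0, 2), (1, 0), (2, 0), (2, 1)]
```

The extra binary search per lane is exactly the "increasing per-pixel logic" mentioned above.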
So I assume it should be hard to beat the HW rasterizer.
Thus I wonder if there are other factors at play as well, besides those 'wasted helper lanes' we always hear cited as the reason.
But what could those reasons be? The guarantee to render triangles in API submission order? That can be turned off these days. So what else?
But the even bigger question to me is: Do we really need subpixel triangles in the days of 4K? Or is it rather a limitation of our LOD solution being unable to keep triangles large enough to be efficient?
For Nanite I speculate the latter may apply, at least in some cases. They have fine-grained and watertight LOD switching over the geometry, which is nice. But afaict they don't have the same for texture UVs.
There can be cracks in the UVs, and to some degree they may need to hide them by using very small triangles.
How much of a problem this is depends a lot on content. High-genus / noisy meshes or many UV charts would be bad, and results might look good only if we crank up the visible geometry resolution, which kind of defeats the primary goal of dynamic LOD.
That said, I'm not really sure about 'micropolygon support' in future GPUs, or an eventual removal of ROPs altogether, beyond some jokes.
It would mean more fixed-function blocks, and the more such stuff we establish, the harder it becomes to maintain and develop new GPU architectures.
As so often, I expect a short-term push and marketing hype, but in the long run we just stack up legacy bloat and complexity.
Worrying that fixed-function blocks might eat up too much area from compute is not the only argument here.