I don't know about the ISAs on the big discrete GPUs (which I need to look over again sometime), but I do know that at least some mobile GPUs have instructions that perform MUL + ADD in more complex arrangements than plain FMA. They also have simple input modifiers like 1 - X and small shifts/multiplications. I don't really know which of these make sense to incorporate into functional units, but they could help. SSE has a lot of pretty specialized instructions as well, though not many focused on graphics.
Indeed we should keep our eyes open for useful instructions of this kind. All I'm saying is that for a unified architecture it probably doesn't make sense to go beyond what the latest GPUs do.
When I was defending my theory that Haswell would have two FMA units (which proved right), I read a research paper on fused instructions that went beyond FMA. I believe the conclusion was that unless all you're doing is processing huge matrices or long polynomials, it's not worth it over having two FMA units. FMA itself is a significant improvement over just MUL and ADD, though. I can't seem to find the paper any more, but the discussion is somewhere on realworldtech if you're interested.
Based on what I've read, when people say dedicated interpolation units have been dropped, they mean the application of barycentric coordinates, which is linear interpolation, not the calculation of the barycentric coordinates themselves, which is non-linear. I don't know what divide latency is like on GPUs, so I don't know whether there's any cost to having a big dependency on one early on. But for the sort of ISA in a unified system that's more CPU-like, I could see anything that avoids having to deal with the divide latency upfront as useful.
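To make the linear vs. non-linear split concrete: once you have the per-vertex 1/w values, perspective-correct interpolation applies the barycentric weights purely linearly, and only one divide per pixel is needed at the end. A minimal scalar sketch (the function name and signature are mine, not from any real API):

```python
def interpolate(attrs, inv_ws, b):
    """Perspective-correct attribute interpolation.
    attrs: per-vertex attribute values; inv_ws: per-vertex 1/w;
    b: barycentric weights summing to 1."""
    # Linearly interpolate attribute/w and 1/w across the triangle...
    num = sum(bi * ai * wi for bi, ai, wi in zip(b, attrs, inv_ws))
    den = sum(bi * wi for bi, wi in zip(b, inv_ws))
    # ...then a single divide per pixel undoes the perspective.
    return num / den
```

Everything above the final divide is MUL/FMA work; only that last line is the non-linear part whose latency the discussion is about.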
NVIDIA's SFU design reveals that the complexity of a division (a reciprocal, really) on the GPU is on the order of a few arithmetic operations. And I believe several architectures match the latency of their FMA units to the SFU latency. However, they have a 6:1 ratio between FMA units and SFU units (the latter also used for interpolation). So their biggest concern isn't latency; it's the bottleneck that could occur from too many operations requiring the SFU units. Anyway, back on the topic of a unified architecture: there is surely a lot of room to improve the performance of a full-precision division, but I don't think the perspective division in graphics has a desperate need for that. There's a 2:1 ratio between FMA and RCP, and one Newton iteration suffices, which is cheap if you have FMA anyway. Most importantly, only one perspective division is needed per pixel. The increase in average shader length made it insignificant a while ago. You can use a low-throughput, high-latency division with no appreciable effect on performance.
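For the "one Newton iteration suffices" point, here's the refinement step spelled out; on hardware it maps to two FMA-class operations (the function name is mine):

```python
def refined_rcp(d, rcp_estimate):
    """One Newton-Raphson step for 1/d: x1 = x0 * (2 - d*x0).
    Error roughly squares each iteration, so a low-precision
    hardware RCP estimate plus one step is already quite accurate."""
    x0 = rcp_estimate
    return x0 * (2.0 - d * x0)  # two dependent FMA-class ops
```

Starting from a crude estimate like 0.33 for 1/3, one step already lands within about 3e-5 of the true reciprocal.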
These are pretty common in DSPs. I don't know where they apply in modern high-end GPUs, but you don't have to dig back terribly far to find more odd multiplication-width pairs in GPUs; they might still lurk in fixed-function stuff. But I don't have applications outside of things like emulating archaic platforms such as the Nintendo DS.
Well if it's a thing of the past, i.e. GPUs are moving away from it, then I don't think the CPU should be moving toward the old stuff since it would become a burden if it's practically never used.
That said, I've started to realize that there are most definitely mixed-width multiplication instructions that could be of great help for texture filtering. So thanks for the suggestion to look for these.
Was LRB TBDR or was it just tiling?
I'm terribly sorry, but I laughed out loud when I read this. You can't categorize Larrabee's hardware one way or the other. Aside from the dedicated texture units, it is highly generic and can be used any way you like. Whether Intel's default driver was intended to be TBDR-based, some form of hybrid, or a configurable renderer is a whole other question.
I think some form of tiling is a must for rendering on something with a CPU-like cache hierarchy/bandwidth. Why is it not suitable w/tessellation?
Again, I wouldn't say anything is a "must". There are cases where forms of tiling help a lot and other cases where not tiling avoids wasting cycles and power. And that's the beauty of a unified architecture: you can do things in much smarter ways tailored to the situation. The hardware doesn't impose one approach on you. GPUs have already given up fixed-function vertex and pixel processing for something much better, despite the cost, and now we're getting closer to the point where the rendering processes themselves will cease to be fixed-function. For the better.
I hope this clarifies why I think it would potentially derail the discussion to respond to why TBDR is not well suited for tessellated geometry. You can find some answers/opinions here:
Early Z IMR vs TBDR - tessellation and everything else.
Further thoughts on fixed function ISA:
What's most useful for accelerating compressed textures? How about table lookups/shuffles over quantities below 8 bits? Or using different index widths vs. access widths? I use the 8-bit lookup instructions in NEON a fair amount, but often (for graphics things) what I really want is to look up 16-bit values with an 8-bit index, requiring two lookups and an interleave. Sometimes what I really want is to look up 16-bit values using only 4-bit indexes. Parallel pext helps a lot with this, but being able to do it directly is even better.
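The two-lookup-plus-interleave workaround described above, modeled in scalar Python rather than NEON intrinsics (the helper name is mine):

```python
def lut16_via_byte_lookups(table16, indices):
    """Emulate a 16-bit table lookup built from two 8-bit lookups
    (as with NEON tbl) followed by an interleave."""
    lo = [t & 0xFF for t in table16]      # low-byte table
    hi = [t >> 8 for t in table16]        # high-byte table
    lo_bytes = [lo[i] for i in indices]   # first byte lookup
    hi_bytes = [hi[i] for i in indices]   # second byte lookup
    # Interleave the low/high bytes back into 16-bit lanes.
    return [l | (h << 8) for l, h in zip(lo_bytes, hi_bytes)]
```

In vector form this is two tbl instructions plus a zip, which is exactly the overhead a native 8-bit-index/16-bit-value lookup would remove.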
Note that GPUs dropped support for palettized textures, despite their being a perfectly useful way to compress textures if you have custom palettes per texture. It's really that per-texture requirement that killed the feature, because it can't be done efficiently in dedicated hardware when you're constantly sampling from different textures. That is, it can't be done more efficiently than a generic gather operation from memory.
So gather and PDEP/PEXT are all you can hope for. AVX2's gather is limited to 32-bit indices, but in a couple more silicon shrinks there might be timing room to lower that without affecting latency. For 4-bit to 16-bit lookup, an in-register permute (vpermd) should already be of great help.
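A sketch of how a vpermd-style 32-bit lane permute could serve the 4-bit to 16-bit lookup: pack the 16 halfword entries into 8 dword lanes, permute by the index's high three bits, then select a half with the low bit. Modeled in scalar Python; the helper name is mine:

```python
def lookup4_to16_via_dword_permute(table16, nibbles):
    """Emulate a 4-bit -> 16-bit lookup using a 32-bit lane permute
    (vpermd-style): 16 halfword entries packed into 8 dword lanes."""
    assert len(table16) == 16
    # Pack pairs of 16-bit entries into one 32-bit lane each.
    dwords = [table16[2 * i] | (table16[2 * i + 1] << 16) for i in range(8)]
    out = []
    for n in nibbles:
        d = dwords[n >> 1]          # the vpermd step: select lane n >> 1
        # Low index bit picks the halfword within the permuted dword.
        out.append((d >> 16) if (n & 1) else (d & 0xFFFF))
    return out
```

In vector form the half-select would be a shift-and-blend after the permute, so it's still a few instructions rather than the single native lookup wished for above.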