Fixed function hardware is a double-edged sword. Yes, it does one thing efficiently, but each added fixed function unit splits the computing resources further. To get the best performance out of the hardware you have to utilize all the fixed function units all the time, and the more fixed function units you have, the harder this becomes. One of them is always going to be a bottleneck, causing the others to idle. Examples: vertex & pixel shader hardware was unified in DX10, and it improved shader unit utilization a lot. New Radeons removed texture coordinate calculation from the texture sampling units and do it in the programmable shader ALUs instead (fixed function arithmetic hardware was removed and more programmable units were added, improving performance in ALU bound shaders).
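The bottleneck effect is easy to quantify: overall throughput is capped by the slowest fixed function unit for the current workload, and every other unit idles in proportion. A minimal sketch (the unit names and throughput numbers are hypothetical, just for illustration):

```python
# Hypothetical per-unit peak throughputs (work items per cycle) for one workload.
# The slowest unit caps the whole pipeline; every other unit partially idles.
units = {
    "vertex_fetch": 4.0,
    "rasterizer":   2.0,   # bottleneck for this workload
    "texture":      8.0,
    "alu":          16.0,
}

bottleneck = min(units.values())

# Utilization of each unit = bottleneck rate / its own peak rate.
utilization = {name: bottleneck / rate for name, rate in units.items()}

for name, u in utilization.items():
    print(f"{name:13s} {u:6.1%}")
# The rasterizer runs at 100% while the ALUs sit at 12.5% -- and a different
# workload (e.g. shadow map rendering) flips which unit is the bottleneck.
```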
Often the best way to utilize all the fixed function units all the time is to run several parts of the algorithm in parallel. This is a popular approach for example in TV sets (fixed function image decompression & processing). You have to apply N processing steps to the source image before you can display it. With programmable hardware you'd use all your processing units to process a single step at a time. All processing units would be working 100% of the time, and you would have a latency of one frame to display the result (just enough hardware to finish the frame just in time to begin processing the next). With fixed function hardware, if you process one step at a time, all the fixed function hardware designed for the other steps idles. To utilize all this hardware you need to process N frames in parallel: the hardware for each step processes a different frame and sends its output to the next step. This approach yields 100% hardware utilization (and likely a lower total power requirement than programmable hardware), but introduces a latency of N frames.
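The latency/throughput trade above can be sketched as a tiny simulation, assuming N hypothetical stages that each take exactly one frame-time: every stage is always busy and one frame completes per tick, but each frame spends N ticks in flight.

```python
# Sketch of an N-stage fixed function pipeline, one frame-time per stage.
# Each tick, every stage hands its frame to the next stage and the first
# stage accepts a new one: all N stages stay busy (100% utilization, one
# frame finished per tick), but each frame takes N ticks to come out.

from collections import deque

def pipeline_latencies(n_stages, n_frames):
    """Return (completion_tick - submission_tick) for each frame."""
    stages = deque([None] * n_stages)   # stage i holds (frame_id, submit_tick)
    latencies = {}
    tick = 0
    next_frame = 0
    while len(latencies) < n_frames:
        done = stages.pop()             # frame leaving the last stage
        if done is not None:
            frame_id, submitted = done
            latencies[frame_id] = tick - submitted
        if next_frame < n_frames:
            stages.appendleft((next_frame, tick))
            next_frame += 1
        else:
            stages.appendleft(None)     # pipeline drains at the end
        tick += 1
    return [latencies[i] for i in range(n_frames)]

print(pipeline_latencies(n_stages=4, n_frames=6))   # every frame: 4 ticks of latency
print(pipeline_latencies(n_stages=1, n_frames=3))   # the "programmable" case: latency 1
```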
The good thing about programmable hardware is that it removes the fixed function hardware bottlenecks. With fixed function hardware you always have to program around the bottlenecks, and it doesn't help that the bottlenecks change several times per frame (shadow map rendering, for example, has drastically different bottlenecks than deferred lighting). In a fully programmable architecture, all execution units can help in solving the current algorithm as fast as possible. Idle time is minimized (and so is latency). It also saves a lot of programming time (inventing workarounds for fixed function bottlenecks is a very time-consuming process).
Just pick a paradigm and add more ff hw to accelerate it. Other paradigms will not be disadvantaged since better compute helps everything.
Of course other paradigms will be disadvantaged. There's a huge piece of unused (fixed function) hardware just sitting there doing nothing. If you are not using the hardware, it is wasted. That's not something (console) developers are willing to accept.
Pick tris for continuity/interop with classic rasterization.
I personally believe that triangles will be one of the reasons why we leave rasterization behind (if it happens in the future). Subpixel sized triangles are a huge waste (of processing power, bandwidth and memory). Pure triangles are very inefficient for modeling high detail meshes (high detail CAD geometry can be several gigabytes in size, and that's just for a single building or airplane). With other methods you can have the same level of detail with a much smaller memory footprint (and smaller bandwidth requirements). For example, patches with displacement maps are considerably better, but they are a much harder problem for ray intersection calculation. Voxels, on the other hand, are a very efficient target for ray casting (while sitting somewhere between the two in storage requirements for super high detail models, depending of course on the scene).
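To illustrate why voxels are such a friendly target for ray casting: traversing a regular grid reduces to a few adds and compares per step (the classic Amanatides & Woo 3D DDA), with no per-primitive intersection tests. A minimal sketch against a hypothetical dense grid (real engines use sparse octrees, but the traversal core is the same idea):

```python
import math

def cast_ray(grid, origin, direction, max_steps=256):
    """Step a ray through a dense voxel grid (Amanatides & Woo 3D DDA).
    grid[z][y][x] is truthy for solid voxels; voxel (i,j,k) spans the unit
    cube [i,i+1)x[j,j+1)x[k,k+1). Returns the first solid voxel hit, or None."""
    pos = [int(math.floor(c)) for c in origin]   # current voxel coordinates
    step, t_max, t_delta = [], [], []
    for axis in range(3):
        d = direction[axis]
        if d > 0:
            step.append(1)
            t_max.append((pos[axis] + 1 - origin[axis]) / d)  # t to next boundary
            t_delta.append(1 / d)                             # t per whole voxel
        elif d < 0:
            step.append(-1)
            t_max.append((pos[axis] - origin[axis]) / d)
            t_delta.append(-1 / d)
        else:
            step.append(0)
            t_max.append(math.inf)
            t_delta.append(math.inf)
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    for _ in range(max_steps):
        i, j, k = pos
        if not (0 <= i < nx and 0 <= j < ny and 0 <= k < nz):
            return None                          # ray left the grid
        if grid[k][j][i]:
            return (i, j, k)                     # hit a solid voxel
        axis = t_max.index(min(t_max))           # cross the nearest voxel boundary
        pos[axis] += step[axis]
        t_max[axis] += t_delta[axis]
    return None

# Usage: an 8^3 grid with one solid voxel at (4, 4, 4).
n = 8
grid = [[[0] * n for _ in range(n)] for _ in range(n)]
grid[4][4][4] = 1
print(cast_ray(grid, (0.5, 4.5, 4.5), (1, 0, 0)))   # ray along +x hits (4, 4, 4)
```

The inner loop is just an index comparison, one add and one compare per step, which is why grid-like structures map so well to hardware ray casting.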