A poor fit for software rendering, perhaps, but having a hardware renderer fork off a hundred threads to process a hundred vertices or a hundred pixels doesn't sound that hard to me, as long as the vertices/pixels don't depend on each other. IIRC, the vertex shader in NV20 is already 6-way multithreaded to mask the 6-cycle latency of most vertex shader instructions, and the pixel shader in NV30 juggles ~170 execution threads, corresponding to 170 pipeline steps (with >100 steps set aside for texturing alone; yes, texturing has MUCH higher latencies than piddly branches). With instruction latencies that long, especially for texturing, NOT doing massive multithreading will hurt your performance so badly it isn't even funny.
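
To put rough numbers on the latency-hiding argument, here is a toy model in Python. It is only a sketch under my own assumptions, not a description of how NV20 or NV30 actually schedule work: one instruction can issue per cycle, every instruction has the same fixed latency, each thread has to wait for its previous instruction to finish before issuing the next, and threads are picked round-robin. The function name `utilization` and its parameters are made up for illustration.

```python
def utilization(num_threads: int, latency: int, cycles: int = 6000) -> float:
    """Fraction of cycles in which the (hypothetical) pipeline issues an instruction."""
    # ready_at[t] is the first cycle at which thread t may issue again.
    ready_at = [0] * num_threads
    issued = 0
    next_thread = 0
    for cycle in range(cycles):
        # Round-robin: scan from next_thread for the first thread that is ready.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                # The thread is blocked until its instruction completes.
                ready_at[t] = cycle + latency
                issued += 1
                next_thread = (t + 1) % num_threads
                break
        # If no thread was ready, this cycle is a stall and nothing issues.
    return issued / cycles


if __name__ == "__main__":
    for threads, latency in [(1, 6), (3, 6), (6, 6), (1, 170), (170, 170)]:
        print(f"{threads:>3} threads, latency {latency:>3}: "
              f"{utilization(threads, latency):.3f} of cycles busy")
```

In this model the busy fraction works out to roughly min(1, threads / latency): a single thread on a 6-cycle pipeline keeps it busy about one cycle in six, six threads keep it full, and a single thread against a 170-step pipeline leaves it idle well over 99% of the time.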