OpenGL guy said:
aaronspink said:
Pixel shaders DO NOT have more than one context. They are NOTHING like SMT at either the architecture or micro-architecture level.
That's entirely implementation specific.
Oh, I agree that as the pixel pipeline becomes more programmable that may eventually change, but I am unaware of any current GPU that has more than one control flow context per pipe. Perhaps you'll educate me, as it is fairly hard to get concrete info: ATI/NV/etc. seem to have an absolute aversion to publishing technical papers on their designs.
OpenGL guy said:
What you see as a "stream processor with SIMD operations" may be far more complex. You can't know the internal workings just based on how it performs.
I agree that figuring out all the internal workings of a given design from how it performs can be difficult, but you can make educated guesses from the combination of the programming model and the observed performance.
OpenGL guy said:
You're wrong. Simple example. Take the following shader:
Code:
mad r0, c0, t0  // ALU op: compute a texture coordinate into r0
tex r1, r0, s0  // texture fetch through sampler s0: potentially hundreds of cycles
mov o1, r1      // write the fetched result to the output register
So when pixel n comes along and hits that tex read, what does the HW do? Wait potentially hundreds of cycles while the texture unit does the look up? No, the shader can work on pixel n+1. When pixel n+1 hits the texture read, the HW can either go back to pixel n (if the data is ready) or start working on pixel n+2... etc. This is very much like a multithreaded environment where each thread is a pixel. Far from losing efficiency and performance, you gain them.
What you describe is what I assume is currently done, which is effectively creating additional virtual pipeline stages via some form of delay buffering. That is a far different thing from multi-threading.
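To make the distinction concrete, here is a rough C++ sketch of the kind of delay buffering I mean. The FIFO depth, the latency, and all the names are invented, but notice that there is a single control flow servicing every pixel, not one per pixel:
Code:
// Rough model of "delay buffering": in-flight pixels sit in a fixed-depth
// FIFO while the texture unit works, and the pipe resumes the pixel at the
// head once its result is back. One control flow context services them all.
#include <cstdio>
#include <queue>

const int TEX_LATENCY = 100;  // assumed fetch latency in cycles (invented)

struct InFlightPixel {
    int id;        // which pixel this is
    int ready_at;  // cycle at which its texture data returns
};

int main() {
    std::queue<InFlightPixel> delay_buffer;  // the "virtual pipeline stages"
    int next_pixel = 0, retired = 0, cycle = 0;

    while (retired < 8) {
        // Issue a new pixel into the buffer each cycle while there is room.
        if (delay_buffer.size() < 16 && next_pixel < 8)
            delay_buffer.push({next_pixel++, cycle + TEX_LATENCY});

        // Resume strictly in order: the head pixel finishes once its data is back.
        if (!delay_buffer.empty() && delay_buffer.front().ready_at <= cycle) {
            printf("cycle %d: pixel %d completes\n", cycle, delay_buffer.front().id);
            delay_buffer.pop();
            ++retired;
        }
        ++cycle;
    }
    return 0;
}
The pipe never holds more than one program counter here; the "threads" are just entries in a buffer.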
The main differentiator is that a multi-threaded architecture has independent control flow for each stream within the architecture. This allows far greater programming flexibility while requiring more hardware resources to implement. Given the great aversion to state changes and branches in both current and future GPU designs from both market leaders, I can only assume that they don't really support independent flow control, and instead act more like stream processors with conditional execution and reconfiguration for different instruction streams/workloads. The pixel pipelines appear even more limited in their designs, working on a quad as the minimum unit of computation rather than an individual pixel.
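To illustrate why I say conditional execution rather than real flow control, here is a toy sketch of a 4-pixel quad evaluating an if/else by predication; the shader math is invented:
Code:
// A 4-wide quad hitting an if/else. No per-pixel program counter: every
// pixel executes BOTH sides and a predicate mask discards the unwanted
// results. The shader math here is invented.
#include <cstdio>

int main() {
    float input[4] = {0.2f, 0.9f, 0.4f, 0.7f};  // one value per pixel in the quad
    float out[4];
    bool pred[4];

    for (int i = 0; i < 4; ++i) pred[i] = input[i] > 0.5f;

    // "Then" side: executed for all four pixels, results masked by pred.
    for (int i = 0; i < 4; ++i) {
        float then_result = input[i] * 2.0f;  // always computed
        if (pred[i]) out[i] = then_result;    // kept only where pred is true
    }
    // "Else" side: also executed for all four pixels, opposite mask.
    for (int i = 0; i < 4; ++i) {
        float else_result = input[i] + 1.0f;  // always computed
        if (!pred[i]) out[i] = else_result;
    }
    for (int i = 0; i < 4; ++i) printf("pixel %d -> %.2f\n", i, out[i]);
    return 0;
}
Either way each pixel pays for both sides of the branch, which is exactly what you'd expect from predication rather than per-pixel program counters.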
I'm not arguing that there aren't similarities between how GPU/stream processors operate and how a multi-threaded architecture operates; I am merely arguing that there are significant differences, and that appropriating the terms thread/multi-threading is incorrect.
OpenGL guy said:
Again, implementations vary, but I see evidence that other vendors are doing this as well. The question is whether you can juggle enough threads to hide all of the latencies in different parts of the chip.
I would assume that with current workloads you can juggle enough pixels to keep the pipeline at near-peak utilization, but I don't believe this will be the case for future workloads that use significant flow control. Already there are significant programming-model limitations in place to keep the hardware from becoming too complex and its performance from degrading.
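A quick back-of-the-envelope on what "enough pixels" means; all numbers are invented for illustration:
Code:
// How many pixels must be in flight to cover a texture fetch? All numbers
// are invented for illustration.
#include <cstdio>

int main() {
    int tex_latency = 200;  // assumed fetch latency in cycles
    int alu_cycles  = 4;    // assumed ALU work per pixel between fetches

    // Each resident pixel contributes alu_cycles of useful work, so covering
    // the fetch takes roughly latency / work pixels in flight.
    int pixels_needed = (tex_latency + alu_cycles - 1) / alu_cycles;
    printf("need ~%d pixels in flight\n", pixels_needed);  // ~50 here
    return 0;
}
Once pixels stop marching in lockstep because of real flow control, keeping that many in flight gets much harder.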
A lot of the reason pixel engines perform so well at what they do is that the programming model is kept deliberately simple (full pre-allocation of resources, limitations on flow control, etc.), which allows a lot of parallel resources to be brought to bear.
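For example (sizes invented), pre-allocation turns residency into a static divide rather than a dynamic allocation problem:
Code:
// With full pre-allocation, residency is a static divide instead of a
// dynamic allocator. Register-file size and per-pixel usage are invented.
#include <cstdio>

int main() {
    int regfile_entries = 256;  // assumed physical registers per pipe
    int regs_per_pixel  = 4;    // known statically from the compiled shader

    // The hardware can slice the register file into fixed pixel contexts at
    // shader-load time: no allocation logic, no spilling, no deadlock.
    int resident_pixels = regfile_entries / regs_per_pixel;
    printf("%d pixel contexts fit\n", resident_pixels);  // 64 here
    return 0;
}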
Aaron Spink
speaking for myself inc.