Jawed said:
I have to say I'm more intrigued to know if the ultra threaded despatch processor (or whatever it's called) is programmable and what effects we might see from that.
And what kind of relationship exists between the UTDP and the MC. There must be a fair degree of symbiosis there.
Jawed
The whole thing is one system: all units depend on each other to work correctly. The MC requires the clients to have lots of latency tolerance so that it can establish a huge number of outstanding requests and pick and choose the best ones to maximize memory bandwidth (massive simplification).
However, texture ends up being an MC client but also has the shader dependent on it. Consequently, if the MC wants high latency, the shader has to be designed to deal with that.
There are two reasonable ways to deal with that. You can have large batches of pixels, in which case you hide the latency of fetches, more or less, just by doing the same thing over and over on many pixels before going to the next thing. This would be an architecture that, say, executes the same pixel shader instruction on thousands of pixels. This works well to hide latency and is somewhat cheap, area-wise. However, it suffers granularity loss, since it has to work in large batches. This would make for a good SM2-type part.
The new way is to make small batches, but have lots of them. So you execute one instruction on a small batch (say 16 pixels), then switch to another instruction and batch until the data for the first one returns. You need lots of live threads in this type of architecture, and you need lots of resources (i.e. area) for it to hide latency properly. But its advantage is that it rules from a granularity standpoint, and branching (a prime feature of SM3) works perfectly. That's what we did for the R5xx. I believe the first architecture is more popular with others.
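To put rough numbers on the "lots of live threads" point, here is a toy cycle model in Python. It is only a sketch under made-up assumptions, not how the R5xx dispatcher works: one ALU, every thread (a small batch) issues one instruction for `alu_cycles` cycles and then waits `fetch_latency` cycles for its fetch to return. The function name and all parameters are hypothetical.

```python
def alu_utilization(num_threads, alu_cycles, fetch_latency, total_cycles=10_000):
    """Fraction of cycles the single ALU is busy in this toy model."""
    # ready_at[t] = cycle at which thread t's outstanding fetch has returned
    ready_at = [0] * num_threads
    cycle = 0
    busy = 0
    while cycle < total_cycles:
        # Choose the thread whose data came back earliest.
        t = min(range(num_threads), key=lambda i: ready_at[i])
        if ready_at[t] <= cycle:
            # Run one instruction on this batch, then it waits on a new fetch.
            cycle += alu_cycles
            busy += alu_cycles
            ready_at[t] = cycle + fetch_latency
        else:
            # Nothing is runnable: the ALU idles until the next fetch returns.
            cycle = ready_at[t]
    return busy / cycle


if __name__ == "__main__":
    for n in (1, 13, 26, 32):
        print(n, "threads:", round(alu_utilization(n, 4, 100), 3))
```

In this model the ALU stays fully busy once the number of live threads reaches about (alu_cycles + fetch_latency) / alu_cycles. That is why the small-batch approach needs many threads in flight, and the area to hold their state.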
At the end, the whole thing works together. To achieve high memory bandwidth, you need an efficient memory controller design with windowed requests, and clients capable of dealing with long latencies. We did all this, and made our control units very programmable on top, since we knew that tuning would be difficult and that we would need the flexibility to achieve high efficiency (we did not trust that we would get it right with the first set of settings/programs). It also allows us to experiment and try new things, so that we'll be more ready for the future.
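As a rough illustration of why a window of outstanding requests helps, here is a small Python sketch. The policy is an assumption picked for illustration (prefer the oldest request that hits the currently open DRAM row, otherwise take the oldest one), not the actual MC arbitration, and `count_row_switches` is a hypothetical name.

```python
from collections import deque


def count_row_switches(requests, window):
    """Count DRAM row changes when serving `requests` (row ids in arrival order).

    `window` is how many outstanding requests the controller may choose among.
    """
    pending = deque(requests)
    outstanding = []
    open_row = None
    switches = 0
    while pending or outstanding:
        # Refill the window with the oldest pending requests.
        while pending and len(outstanding) < window:
            outstanding.append(pending.popleft())
        # Assumed policy: oldest request that hits the open row, else the oldest.
        pick = next((i for i, row in enumerate(outstanding) if row == open_row), 0)
        row = outstanding.pop(pick)
        if row != open_row:
            switches += 1
            open_row = row
    return switches


if __name__ == "__main__":
    stream = [0, 1, 0, 1, 0, 1, 0, 1]
    print("in order:", count_row_switches(stream, window=1))
    print("windowed:", count_row_switches(stream, window=8))
```

With a window of 1 the controller has to serve requests in arrival order and changes rows on every access in this stream. With a larger window it can group accesses to the same row, at the cost of making some requests wait longer, which is exactly the latency the clients have to tolerate.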
Edit: corrected some terrible grammar.