The L2 might be tied more closely to the sampler units than the SIMDs.
Yes. It's hard to know whether L2 is distributed or not.
I think it's worth bearing in mind the "symmetry" of R600. There are four ring stops, each with a 128-bit connection to memory (two 64-bit channels). If you build a ring bus, then it makes sense to distribute the memory clients equally around the ring bus. If you don't, then the ring bus is a seriously bad configuration.
So, it seems to me the logical conclusion is that both texturing and render target operations, which are memory's heavy hitters, are distributed equally: 4 TUs and 4 RBEs, one of each paired with each SIMD/ring stop.
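The symmetry argument above is really just arithmetic on the speculated unit counts, so here is a minimal sketch of it. All figures (512-bit total bus, 4 ring stops, 4 SIMDs/TUs/RBEs) are the speculated ones from this thread, not confirmed specs:

```python
# Hypothetical symmetric R600 layout, as speculated in this thread.
RING_STOPS = 4
MEM_BUS_BITS = 512           # total external memory bus width
CHANNEL_BITS = 64            # width of one memory channel
SIMDS = 4
TEXTURE_UNITS = 4
RBES = 4                     # render back-ends

per_stop_bus = MEM_BUS_BITS // RING_STOPS         # 128 bits per ring stop
channels_per_stop = per_stop_bus // CHANNEL_BITS  # 2x 64-bit channels
tus_per_stop = TEXTURE_UNITS // RING_STOPS        # 1 TU per stop
rbes_per_stop = RBES // RING_STOPS                # 1 RBE per stop

print(per_stop_bus, channels_per_stop, tus_per_stop, rbes_per_stop)
```

If those counts are right, every ring stop sees an identical slice of the machine, which is exactly what makes a ring bus a sensible interconnect here.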
For one thing, I think the TLB in that patent would be tied to the address calculators of the texture processors.
I'm curious what is being defined as a client in that picture.
A client is any unit in the GPU that can address "off-die memory".
Virtual memory fragment aware cache
[0098] A client interface or "client" 602 is an interface between a single memory client and a memory controller 620. A memory client can be any application requiring access to physical memory. For example, a texture unit of a GPU is a memory client that makes only read requests. Clients that are exterior to the coprocessor of FIG. 6 may also access memory via the client interface 602, including any other processes running on any hosts that communicate via a common bus with the coprocessor of FIG. 6.
A client can also be the CPU or another GPU (e.g. in a CrossFire configuration) - these are what are described as "clients that are exterior to the coprocessor of FIG. 6".
Within R600 a TLB is associated with every "L1" cache, so that would be:
- L1 vertex cache
- L1 texture cache
- instruction cache
- constant cache
- memory read/write cache
- hierarchical-Z/stencil cache
- colour cache
- ... erm, what else?
where some clients have an L2 path, whereas others are forced to go direct to the memory controller. It's unclear to me how broadly L2 is used in R600. It appears to serve solely as backing for the L1 vertex and texture caches. Patent documents for these two clients clearly indicate an L2; I haven't seen documents relating to other kinds of clients that show them using L2...
According to what I've read, RV630 will have half the sampler blocks of R600, but less than half its SIMD capability.
Following suit with the texture units, the L2 capacity of RV630 is half that of R600.
You raise a good point there, because RV630 has 2 TUs, 128-bits of memory bus and 3 SIMDs.
I rationalise this as "1 ring stop" (i.e. no ring bus at all, just 2x 64-bit memory channels and a local crossbar), with the pair of TUs and 3 SIMDs all sharing. Each TU (a quad of samplers) has its own VL1 cache and TL1 cache, so the L2 is, seemingly, supporting 4 L1 caches: 2x VL1 and 2x TL1.
In terms of capability, RV630 simply has a lower ALU:TEX ratio than R600 has.
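The ALU:TEX claim can be checked with the unit counts speculated in this thread. The per-SIMD ALU widths below (16 for R600, 8 for RV630) are my assumption for illustration, not stated above:

```python
# Rough ALU:TEX comparison using the speculated unit counts.
# Per-SIMD widths are assumed, purely for illustration.
def alu_tex_ratio(simds, alus_per_simd, tus):
    """ALUs per texture unit: total ALU lanes divided by TU count."""
    return (simds * alus_per_simd) / tus

r600 = alu_tex_ratio(4, 16, 4)    # 4 SIMDs x 16 ALUs, 4 TUs -> 16.0
rv630 = alu_tex_ratio(3, 8, 2)    # 3 SIMDs x 8 ALUs, 2 TUs -> 12.0

print(r600, rv630)
```

Under those assumptions RV630's ratio comes out lower, consistent with the argument above.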
It would allow each L2 bank to serve as an L3 for another texture processor, and each texture processor indirectly links the L2 to 1/4 of a SIMD.
Yes. And the other 3/4 of a SIMD is forced to send its request for a texture around the ring bus. This clearly trades in-GPU latency for increased bandwidth efficiency. The cost is obviously higher internal bus bandwidth and higher register file commitment in order to hide the additional latencies generated both by the "L3" cache routing and the routing of texture requests to non-local TUs.
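The latency-for-bandwidth trade can be made concrete with a toy expected-latency model. The latency figures here are invented placeholders; only the 1/4 local vs 3/4 ring-routed split comes from the discussion above:

```python
# Toy model of local vs ring-routed texture requests.
# Latency values are illustrative placeholders, not measured figures.
LOCAL_LATENCY = 1.0     # request served by the SIMD's directly-linked TU
RING_LATENCY = 1.5      # request routed around the ring to a remote TU

local_fraction = 1 / 4  # 1/4 of a SIMD is directly linked to its TU

expected_latency = (local_fraction * LOCAL_LATENCY
                    + (1 - local_fraction) * RING_LATENCY)
print(expected_latency)
```

Whatever the real numbers, the average sits between the local and ring figures, and it's that gap the register file has to be sized to hide.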
Jawed