3D Technology & Architecture

In this diagram each L2 acts as an L3 for other L2s in the GPU:


I don't know if this is how R600 works, but it seems likely if L2 is distributed 4 ways.

Jawed

The L2 might be tied more closely to the sampler units than the SIMDs.
For one thing, I think the TLB in that patent would be tied to the address calculators of the texture processors.
I'm curious what is being defined as a client in that picture.

According to what I've read, RV630 is going to have half the sampler blocks that R600 has, but less than half when it comes to SIMD capability.
Following suit with the texture units, the L2 capacity of RV630 is half that of R600.

It would allow each L2 bank to serve as an L3 for another texture processor, and each texture processor indirectly links the L2 to 1/4 of a SIMD.
 
The L2 might be tied more closely to the sampler units than the SIMDs.
Yes. It's hard to know whether L2 is distributed or not.

I think it's worth bearing in mind the "symmetry" of R600. There are four ring stops, each with a 128-bit connection to memory (two 64-bit channels). If you build a ring bus, then it makes sense to distribute the memory clients equally around the ring bus. If you don't, then the ring bus is a seriously bad configuration.

So, it seems to me the logical conclusion is that both texturing and render target operations, which are memory's heavy hitters, are distributed equally. This means 4 TUs per SIMD and 4 RBEs per SIMD.

For one thing, I think the TLB in that patent would be tied to the address calculators of the texture processors.
I'm curious what is being defined as a client in that picture.
A client is any unit in the GPU that can address "off-die memory".

Virtual memory fragment aware cache

[0098] A client interface or "client" 602 is an interface between a single memory client and a memory controller 620. A memory client can be any application requiring access to physical memory. For example, a texture unit of a GPU is a memory client that makes only read requests. Clients that are exterior to the coprocessor of FIG. 6 may also access memory via the client interface 602, including any other processes running on any hosts that communicate via a common bus with the coprocessor of FIG. 6.

A client can also be the CPU or another GPU (e.g. in a CrossFire configuration) - these are what are described as "clients that are exterior to the coprocessor of FIG. 6".

Within R600 a TLB is associated with every "L1" cache, so that would be:
  • L1 vertex cache
  • L1 texture cache
  • instruction cache
  • constant cache
  • memory read/write cache
  • hierarchical-Z/stencil cache
  • colour cache
  • ... erm, what else?
where some clients have an L2 path, whereas other clients are forced to go direct to the memory controller. It's unclear to me how broadly L2 is used in R600. It appears to be used solely as backing for the L1 vertex and texture caches. Patent documents for these two clients clearly indicate an L2. I haven't seen documents indicating that other kinds of clients use L2...
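To make my guess concrete, here's how I'd sketch that split (pure speculation on my part; the uses_l2 flags are just my reading of the patents, nothing confirmed):

```python
# Speculative sketch of R600 memory clients, as guessed above.
# Each client has its own TLB + L1; only some appear to be backed by L2.
# The uses_l2 flags are assumptions from patent reading, not confirmed facts.
r600_clients = {
    "vertex_cache":      {"tlb": True, "uses_l2": True},   # VL1
    "texture_cache":     {"tlb": True, "uses_l2": True},   # TL1
    "instruction_cache": {"tlb": True, "uses_l2": False},
    "constant_cache":    {"tlb": True, "uses_l2": False},
    "memory_read_write": {"tlb": True, "uses_l2": False},
    "hier_z_stencil":    {"tlb": True, "uses_l2": False},
    "colour_cache":      {"tlb": True, "uses_l2": False},
}

def miss_path(client):
    """Return the guessed route a miss takes for a given client."""
    if r600_clients[client]["uses_l2"]:
        return "L1 -> L2 -> memory controller"
    return "L1 -> memory controller"

for name in r600_clients:
    print(f"{name:18s}: {miss_path(name)}")
```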

According to what I've read, RV630 is going to have half the sampler blocks that R600 has, but less than half when it comes to SIMD capability.
Following suit with the texture units, the L2 capacity of RV630 is half that of R600.
You raise a good point there, because RV630 has 2 TUs, a 128-bit memory bus and 3 SIMDs.

I rationalise this as "1 ring stop" (i.e. no ring bus at all, just 2x 64-bit memory channels and a local crossbar), with a pair of TUs and 3 SIMDs all sharing. So each quad-TU has its own VL1 cache and TL1 cache. The L2 cache is then, seemingly, supporting 4 caches, 2x VL1 and 2x TL1.

In terms of capability, RV630 simply has a lower ALU:TEX ratio than R600 has.

It would allow each L2 bank to serve as an L3 for another texture processor, and each texture processor indirectly links the L2 to 1/4 of a SIMD.
Yes. And the other 3/4 of a SIMD is forced to send its request for a texture around the ring bus. This clearly trades in-GPU latency for increased bandwidth efficiency. The cost is obviously higher internal bus bandwidth and higher register file commitment in order to hide the additional latencies generated both by the "L3" cache routing and the routing of texture requests to non-local TUs.

Jawed
 
I mean, all ATi/AMD chips with a ring bus were late to market and failed on price/performance/size versus the NV ones.
I disagree on that one. At the time I bought my Radeon X1900XT here in The Netherlands, it was a much better-performing card in reviews than equally priced nVidia cards.
 
In ATI terminology, R600 has "thousands of threads" in flight (sigh, maybe it's just 1001 threads, eh?) where each thread has 64 pixels (or vertices or primitives). So, call it 2048 threads, that's 131,072 pixels in flight. If there are 2 vec4 fp32 registers assigned per pixel, that's 16 bytes * 2 registers * 131,072 pixels = 4MB of register file. I have to guess at 2 registers per pixel, it's not documented anywhere that I know of. It could be 1 or it could be more...
The Tech Report talks about hundreds of threads. So 512 is more likely than 2048, resulting in a 1 MB register file. Besides, a 4 MB register file with 10T per bit would be 336 million transistors (not counting decoders and multiplexers). Not very likely for a 700 million transistor chip. ;)
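Spelling out the arithmetic, for anyone who wants to check it (still using the guessed 2 vec4 fp32 registers per pixel, and the 10T-per-bit figure):

```python
# Back-of-envelope register file sizes, assuming 2 vec4 fp32 registers per
# pixel (a guess from the post above, not a documented figure).
BYTES_PER_REG = 4 * 4          # one vec4 of fp32
REGS_PER_PIXEL = 2             # assumption
PIXELS_PER_THREAD = 64         # R600-style "thread" (batch)

def reg_file_bytes(threads_in_flight):
    return threads_in_flight * PIXELS_PER_THREAD * REGS_PER_PIXEL * BYTES_PER_REG

for threads in (512, 2048):
    size = reg_file_bytes(threads)
    # 10 transistors per bit for a 2R1W register file cell (Wikipedia figure)
    transistors = size * 8 * 10
    print(f"{threads:5d} threads -> {size / 2**20:.1f} MB, "
          f"~{transistors / 1e6:.0f} M transistors at 10T/bit")
```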

"Thousands of threads" seems to apply to G80 when counting each element as an individual thread.
R580 can support 24,576 pixels in flight (512 threads, 48 pixels per thread), so for 2 vec4 fp32s, that's 16 bytes * 2 registers * 24576 = 1.5MB of register file for pixel shading (vertex shading is separate). Again, I'm guessing at there being support for 2 registers.
You might want to change your calculator's batteries. Mine says 16 * 2 * 24576 = 768K.

Anyway, my guess of 64 kB was definitely way off. ;) But this 1 MB register file does support my original thought. Register files are huge, while caches are small. By improving the hit rate, average latency would be lower, fewer threads need to be kept pending, more registers per pixel are available, fewer threads mean less cache contention, etc.
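To illustrate what I mean by average latency (the cycle counts below are invented round numbers, not measurements of any real GPU):

```python
# Illustrative only: how average texture-fetch latency moves with hit ratio.
# The cycle counts are invented round numbers, not measured GPU figures.
HIT_LATENCY = 20      # cycles for an on-chip cache hit (assumed)
MISS_LATENCY = 400    # cycles for an off-chip DRAM access (assumed)

def avg_latency(hit_ratio):
    return hit_ratio * HIT_LATENCY + (1.0 - hit_ratio) * MISS_LATENCY

for h in (0.70, 0.80, 0.90, 0.95):
    print(f"hit ratio {h:.0%}: average latency ~{avg_latency(h):.0f} cycles")
```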
...

so it seems some slow-downs can only be explained by register file "conflicts".
Interesting. Thanks for the pointers!

G80's scalar architecture does start to look like a silver bullet...
 
And no matter what, speculative prefetches are, well, speculative. So when you're wrong, you've just wasted really valuable bandwidth.
On CPUs prefetch loads have the lowest priority. So when the memory manager has nothing else to do, it's better to prefetch than to leave the bandwidth unused.
You're fighting Amdahl's law with a technique that scales exponentially in size. I'm doing it with a linear one.
I'm not sure whether your approach is linear. Say we have a chip like G80, but with only one 32-bit memory interface. Now, you can double the register file size as many times as you like, performance will still be bottlenecked. If instead we would make the cache several times larger and added the best possible prefetching we would get better performance with a normal sized register file.

Now translate this extreme situation back to R600 versus G80, 512-bit versus 384/320-bit. If G80 has better cache hit ratios than R600 then the lower bandwidth isn't a significant bottleneck in practice, answering Julidz's question.

It's just one attempt at explaining the situation. I haven't even seen a G80 nor an R600 in action yet. :rolleyes:
 
How much cache does said CPU need to reduce latency by a factor of 100? Now consider that this number is only for a single thread. Is building a cache 100-1000000x larger to scale it up by the number of threads in flight on a GPU really worthwhile?
You're falsely assuming there's no data locality between the threads, and that all threads are waiting for a texture sample.

Look at it this way: for a 1280x1024 screen and 5 textures per pixel you need 25 MB. So they would get frame-to-frame cache coherency if 25 MB could fit on-die. Really, only a fraction of the loaded textures is actually visible on the screen, and mostly only the smaller mipmap levels.
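The 25 MB is just this arithmetic (assuming 32-bit texels and one unique texel per pixel per texture layer, which is deliberately pessimistic given mipmapping and reuse):

```python
# Rough working-set estimate behind the "25 MB" figure: one unique 32-bit
# texel per pixel per texture layer (pessimistic, ignores mipmap reuse).
width, height = 1280, 1024
textures_per_pixel = 5
bytes_per_texel = 4

working_set = width * height * textures_per_pixel * bytes_per_texel
print(f"~{working_set / 2**20:.0f} MB touched per frame")   # ~25 MB
```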

I'm not saying we actually need a cache that big. There's still plenty of RAM bandwidth we can put to good use. But I do believe it's possible to compensate for slightly lower bandwidth by improving the texture cache hit ratio. But feel free to prove me wrong.
 
The Tech Report talks about hundreds of threads. So 512 is more likely than 2048, resulting in a 1 MB register file. Besides, a 4 MB register file with 10T per bit would be 336 million transistors (not counting decoders and multiplexers). Not very likely for a 700 million transistor chip. ;)
Ha, that's definitely different from what I've seen somewhere else, ah well. It would be interesting to know if R600 has "only" ~512 threads in flight - I question this because of the mix of vertices, primitives and pixels. A 3-way workload split (as opposed to the 2-way workload of Xenos, say) might benefit from extra threads in flight. Dunno.

I was under the impression that it's 6 transistors per bit in building a register file... Also, memory tends to be very dense in terms of die space usage...

"Thousands of threads" seems to apply to G80 when counting each element as an individual thread.

You might want to change your calculator's batteries. Mine says 16 * 2 * 24576 = 768K.
:oops: sigh.

Jawed
 
Nick said:
You're falsely assuming there's no data locality between the threads, and that all threads are waiting for a texture sample.
What's the hit ratio of your cache? If you want a 100x reduction in average latency, you will need 95%+ hit ratio. How likely is that when painting the screen with (given your example) 5 textures, really? Unless all textures fit in the cache (unlikely), nearly any amount of spill will drastically reduce your hit ratio, and correspondingly, your latency will shoot up through the roof. I hope you can deal with all that extra latency :)

Nick said:
Say we have a chip like G80, but with only one 32-bit memory interface. Now, you can double the register file size as many times as you like, performance will still be bottlenecked. If instead we would make the cache several times larger and added the best possible prefetching we would get better performance with a normal sized register file.
Most of the gain will be from your cache size increase, not from prefetching. With the memory interface completely saturated, there would be no opportunity to prefetch unless a large fraction of your dataset fits in cache. Given that that's unlikely (frame-to-frame coherent datasets are more like 1000x typical cache sizes, as opposed to 10x), your cache acts as a bandwidth magnifier, not as a latency reducer.
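The "bandwidth magnifier" view in numbers (assuming uniformly sized requests and a made-up 100 GB/s of external bandwidth):

```python
# Cache as a bandwidth magnifier: with miss rate m, every byte fetched
# off-chip services 1/m bytes of on-chip requests (assuming uniform requests).
def effective_bandwidth(external_gbps, hit_ratio):
    miss_rate = 1.0 - hit_ratio
    return external_gbps / miss_rate

for h in (0.50, 0.75, 0.90):
    print(f"hit ratio {h:.0%}: 100 GB/s off-chip looks like "
          f"~{effective_bandwidth(100.0, h):.0f} GB/s to the samplers")
```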

Now, if your whole dataset somehow fits in the cache (i.e. the FB interface is underutilized), then you're left with several options when you do need to fetch something that isn't in the cache. You can hide the latency in the typical manner that GPUs use, or you can prefetch the data.

Since the area for hiding the latency to memory with an underutilized FB is known and (largely) fixed, the cost of this scheme tends to be reasonable.

The cost (area) of prefetching is also low, if you're right. If you're wrong, then you end up stalling the whole system for a very long time.

If you have a hybrid approach where you can hide latency, then why bother with prefetching? The incremental cost of hiding just a bit more latency isn't that much higher.

Prefetching right is hard.
 
I'm not sure whether your approach is linear. Say we have a chip like G80, but with only one 32-bit memory interface. Now, you can double the register file size as many times as you like, performance will still be bottlenecked. If instead we would make the cache several times larger and added the best possible prefetching we would get better performance with a normal sized register file.
I've been saying all along that a cache is great to reduce bandwidth and that's why they are there, but you'll reach a point very quickly where your incremental gain will drop like a stone (and perf/area even more). After all, when you're already reaching a 95% hit rate, you can increase your cache as much as you want, you'll never be able to gain more than 5%.

At that point, it's way cheaper to just make sure that you can avoid bubbles by using latency hiding.

Does that mean that texture cache sizes won't go up in the future? Probably not: as total memory size goes up, I assume caches will need to scale as well to maintain the same hit rate.

Does it mean that "large caches will make a significant difference in the near future"? I highly doubt it.

Now translate this extreme situation back to R600 versus G80, 512-bit versus 384/320-bit. If G80 has better cache hit ratios than R600 then the lower bandwidth isn't a significant bottleneck in practice, answering Julidz's question.
I completely agree that a higher hit rate can compensate for a difference in external bandwidth. ;)

Amdahl's law specifically addresses the issue where greater parallel throughput results in diminishing returns as inherently serial execution takes up a larger proportion of the time.
As hinted at in this article, I was using the law for the general case of diminishing returns instead of the explicit parallel case. Sorry for the confusion.
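For reference, the law in its usual parallel form, which is where the diminishing-returns shape comes from:

```python
# Amdahl's law: p is the parallelizable fraction of the work, n the number of
# parallel units. Speedup saturates at 1 / (1 - p) no matter how large n gets.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (4, 16, 64, 1024):
    print(f"n={n:5d}: speedup {amdahl_speedup(0.95, n):.1f}x "
          f"(limit {1 / (1 - 0.95):.0f}x for p=0.95)")
```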

Your technique maximizes parallel throughput at the cost of 1/n of serial performance.
You multiply the effect of any serial code by the length of the FIFO, which means in all but embarrassingly parallel problems, the cost of serial code rises to offset any gains made by parallel latency hiding.
Yes, that is definitely the case. On the other hand, if one problem in computing is often cited as an example of embarrassingly parallel, it's computer graphics. But there are definitely boundary cases that make parallelization not always as perfect as the theory would suggest. I don't think there's a lot of inter-thread communication; I guess synchronization after branching etc. must indeed be an issue.

To make up for this, R600's caching must be capable of hitting at least 50% of the time in order to make sustained performance possible in bandwidth-limited situations.
Absolutely!

Lecture 12 here:
http://courses.ece.uiuc.edu/ece498/al/Syllabus.html
by Mike Shebanow covers the topic of texture fetch latency hiding very nicely.
Very nice. Amazing the stuff that's floating on the net. Must read other lectures too. ;)

BTW, I don't care whether the whole world adopts the ATI thread naming convention or the Nvidia convention, but it's clear that one has to go... (and preferably the ATI one: I kinda like the Nvidia thread/warp metaphor.) Too confusing...
 
Very nice. Amazing the stuff that's floating on the net. Must read other lectures too. ;)

Very interesting. So Nvidia does consider G80 a 300+ GFLOPs part. So if they are talking close to 1 TFLOP for G92, then it really is on the order of 2-3x faster, at least in the shader domain.
 
Does that mean that texture cache sizes won't go up in the future? Probably not: as total memory size goes up, I assume caches will need to scale as well to maintain the same hit rate.

Does it mean that "large caches will make a significant difference in the near future"? I highly doubt it.
I think your conclusions here ignore the fact that prior to R600, it would seem ATI's GPUs had 10s of KB of texture cache (low 10s). Now they have hundreds. That's not a trivial difference. Historically, GPUs have always had gob-smackingly small texture caches, it seems.

How L2 is deployed in G80 is an open question, i.e. how much L2 is typically used by texturing and is L2 shared by any other clients (e.g. ROPs)? I don't remember an explicit size for G80's L2 being mentioned anywhere.

BTW, I don't care whether the whole world adopts the ATI thread naming convention or the Nvidia convention, but it's clear that one has to go... (and preferably the ATI one: I kinda like the Nvidia thread/warp metaphor.) Too confusing...
I think ATI's use is justified. All pixels in an ATI thread perpetually share a program counter - there's no pretence at hiding the SIMDness of the architecture. Predication is the only means by which execution "varies", "per pixel".

You realise that a warp in G80 is really two batches (of 16) stitched together. It's funny how this kinda "slipped out" during that course. Certain slides were not supposed to talk about "half-warp" but the cat got out of the bag as it were.

Jawed
 
...I question this because of the mix of vertices, primitives and pixels. A 3-way workload split (as opposed to the 2-way workload of Xenos, say) might benefit from extra threads in flight. Dunno.
What's your reasoning behind that?
I was under the impression that it's 6 transistors per bit in building a register file... Also, memory tends to be very dense in terms of die space usage...
According to Wikipedia, a 2 read port 1 write port register file bit cell takes 10 transistors. So 3 read ports must take even more transistors/size, and R600's VLIW demands possibly make it even more complicated. And we're not looking at decoders and MUXes yet. So 1 MB of register space really seems like the maximum to me.

6 transistors is for an SRAM cache, and it's so common that it should always be size-optimized to the extreme.
 
I think your conclusions here ignore the fact that prior to R600, it would seem ATI's GPUs had 10s of KB of texture cache (low 10s). Now they have hundreds. That's not a trivial difference. Historically, GPUs have always had gob-smackingly small texture caches, it seems.
Maybe there are other candidates that profit more from a cache, where latency hiding is harder? If there were no L2 caches in previous generations (?), are there features in DX10 that make a cache more useful?

I think ATI's use is justified. All pixels in an ATI thread perpetually share a program counter - there's no pretence at hiding the SIMDness of the architecture. Predication is the only means by which execution "varies", "per pixel".
If you look at it from high above, the theoretical machine that runs each 'pixel' (aaarglgl) has its own state, registers, and, yes, PC, now that branching is allowed. Implementation performance may suck, but at least you can reuse terminology that's been around forever. Whatever...
 
What's the hit ratio of your cache? If you want a 100x reduction in average latency, you will need 95%+ hit ratio. How likely is that when painting the screen with (given your example) 5 textures, really?
The 100x was only referring to CPUs. The average latency improvement obviously also depends heavily on the access pattern. So 95% is not likely at all.

However, what I'm saying is that even a small improvement in cache hit ratio has a lot of positive effects. It reduces bandwidth need, reduces the number of threads you have to keep in flight, improves register file usage, improves data access locality, etc.
Prefetching right is hard.
Which isn't a reason not to use it...

Look, this thread tries to find reasons why R600, despite its brute-force numbers, doesn't beat G80. I don't know enough about the differences between their cache architectures; I just think it could be one of the factors that contributes a few percentage points of (in)efficiency here and there. And if I'm right, I believe we'll see GPUs with larger and/or more efficient caches in the future. If RAM bandwidth doesn't follow fast enough I don't see any other option.
 
6 transistors is for an SRAM cache, and it's so common that it should always be size optized to the extreme.
I believe that the 4-transistor base cell of an SRAM is the same for single- or multi-port memories. This is the cell that's optimized to the extreme, usually by the fab itself because it always violates the standard design rules. Then you need 2 additional transistors per port (read or write) to load and extract the content. Density will definitely be lower, of course, but you can still make it highly optimized.
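So the per-cell counts work out like this (just the 4-plus-2-per-port rule of thumb from above; real layouts will differ, and it still ignores decoders, muxes and sense amps):

```python
# Rule of thumb: 4-transistor storage core plus 2 transistors per port
# (read or write). Decoders, muxes and sense amps are ignored entirely.
def transistors_per_cell(ports):
    return 4 + 2 * ports

configs = {"6T SRAM (1 rw port)": 1, "2R1W register file": 3, "3R1W register file": 4}
for name, ports in configs.items():
    print(f"{name:20s}: {transistors_per_cell(ports)}T per bit")
```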

ATI using real 3R1W RAMs for their complete register files would be monumentally inefficient. As Rys wrote in the architecture article, there is a cache in the middle: there must be a reason for that. OTOH, there must also be traffic patterns that prevent maximum throughput, and we haven't seen any thus far.
 
I've been saying all along that a cache is great to reduce bandwidth and that's why they are there, but you'll reach a point very quickly where your incremental gain will drop like a stone (and perf/area even more). After all, when you're already reaching a 95% hit rate, you can increase your cache as much as you want, you'll never be able to gain more than 5%.

At that point, it's way cheaper to just make sure that you can avoid bubbles by using latency hiding.
Are we anywhere near that yet then?
 
Maybe there are other candidates that profit more from a cache, where latency hiding is harder? If there were no L2 caches in previous generations (?), are there features in DX10 that make a cache more useful?

There is a lot more read after write traffic with the latest GPUs. Threads are allowed to feed data into other threads, which is inherently a read after write dependency.

If one thread's results were not kept on chip, the consumer thread would have to take a significant latency hit.

In the case of a FIFO, the question becomes complicated because outcomes can differ depending on where the consuming thread is located in the FIFO relative to the producer.

There's a sweet spot where the producer can immediately pass results back to the consumer, assuming some kind of forwarding structure exists. Depending on how much data can be cached, the window of opportunity varies in size.

With a wide SIMD processing model like R600, failure to cache such results would result in a massive bubble where dozens of threads/objects in the FIFO check for data and find that it has gone off-chip.

The FIFO would have to be larger than the minimum number of entries needed to cover best-case memory latency, in order to keep dependent threads from completely blocking further execution.

On top of that, utilization would drop significantly since the FIFO would have to pass those stalled threads through for at least one stage of execution that evaluates operand fetches. Mechanisms for skipping such threads would be increasingly expensive as FIFO length increases.

After all that, the lack of caching would spawn a large number of off-chip accesses.
 
There is a lot more read after write traffic with the latest GPUs. Threads are allowed to feed data into other threads, which is inherently a read after write dependency. ...
Ok, so basically, that's all related to GS and/or streamout? Yes, that makes perfect sense.
 
What's your reasoning behind that?
In a two-way work-balanced system, there's only one ratio for the GPU to manage: the length of the vertex queue(s) versus the length of the pixel queue(s). If there's any "vertex texturing" in a vertex shader, then you need more vertex threads in flight in order to hide the resulting latency. If the pixel shader code is doing lots of dependent texturing (which will increase average texturing latency per pixel) and not very much ALU code then more pixels need to be in flight to minimise stalls. Not only are you trying to avoid stalling the ALU pipeline, but you'd also like to avoid stalling the texture and render target pipelines.
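Roughly, the number of threads you need in flight scales like this (a pure sketch; every number below is invented for illustration):

```python
# Sketch of how many batches ("threads" in ATI-speak) must be in flight to
# cover a texture fetch. All numbers are invented for illustration.
import math

def batches_needed(fetch_latency_cycles, alu_instr_between_fetches,
                   cycles_per_instr=4):   # assumed cycles per batch per instruction
    work_per_batch = alu_instr_between_fetches * cycles_per_instr
    return math.ceil(fetch_latency_cycles / work_per_batch)

for alu in (1, 4, 10):
    print(f"{alu:2d} ALU instr per fetch -> ~{batches_needed(400, alu)} batches in flight")
```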

In D3D10, with VS, GS and PS all taking their moments in the limelight, you increase the risk of work drying up or a queue being full. There are more resources to manage. A bigger "buffer" enables you to smooth out variations. Obviously the GPU designer has to decide where to draw the line on register file size...

According to Wikipedia, a 2 read port 1 write port register file bit cell takes 10 transistors. So 3 read ports must take even more transistors/size, and R600's VLIW demands possibly make it even more complicated. And we're not looking at decoders and MUXes yet. So 1 MB of register space really seems like the maximum to me.

6 transistors is for an SRAM cache, and it's so common that it should always be size-optimized to the extreme.
Ah thanks, I was just thinking in terms of "bits" and forgetting about the porting.

Interestingly enough, I think it was Silent Guy who pointed out a while back that it can be simpler to implement a register file's multiple read ports by duplicating the entire memory multiple times. That would obviously cost even more...

Us mere mortals aren't meant to reason much about register files :cry:

Jawed
 