Because bus width is only one factor in total bandwidth. A 32-bit bus can be faster than a 512-bit bus if you clock it high enough. AMD has been using relatively slow memory on the 2900 compared to the faster memory on the 8800s.
Since AMD is rather tight-lipped about exactly why R600 falls so far short of peak in so many cases, one can only speculate that if it weren't for the excess bandwidth, it would be doing even worse.
> Why isn't the 512-bit bus on the HD2900XT making the difference against the 320-bit bus on the 8800GTS?

Because 320-bit is enough. It wouldn't help R600 even if it had a 1024-bit interface. The bottleneck is elsewhere.
That being said, I believe large caches will make a significant difference in the near future. Good cache hit ratios not only reduce bandwidth requirements, they also lower texture filtering latency. This allows more efficient use of the register file.
> I don't believe this is going to happen: caches are generally a very inefficient scheme to reduce latency...

Looking at CPUs, caches reduce latency by a factor of 100. Yes, it's another context, and yes, CPU caches take a huge amount of die space, but reducing latency by a factor of 100 sounds very efficient to me.
> It will never help to reduce latency for stuff that's not already in there...

True.
> ...so for fresh data you'll always have bubbles.

Not true (in context). Prefetching significantly reduces the number of cold misses.
> This is not the case with latency hiding FIFOs, where you have a linear trade-off between area and latency reduction.

For every clock cycle you have to wait for a memory access, you need registers to store all temporary results. Yes, it's a linear trade-off, but one that rises very fast. With caches, every decrease in cache miss ratio increases the chances of being able to keep the ALUs busy. Correct me if I'm wrong, but as far as I know caches have a larger density than register files.
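To put numbers on that linear trade-off, here's a back-of-the-envelope sketch; all values are illustrative assumptions, not R600 specifics:

```python
# Illustrative sketch of the linear FIFO trade-off: to keep the ALUs busy
# while a memory access is outstanding, every in-flight thread must keep
# its temporaries in registers for the full latency. All numbers are
# made-up assumptions for illustration.

def fifo_register_bytes(latency_cycles, issues_per_cycle, bytes_per_thread):
    """Register storage needed to cover `latency_cycles` of memory latency
    when `issues_per_cycle` threads enter the FIFO each cycle."""
    threads_in_flight = latency_cycles * issues_per_cycle
    return threads_in_flight * bytes_per_thread

# Hiding 100 cycles of DRAM latency at 1 thread/cycle, 64 bytes of
# temporaries per thread:
print(fifo_register_bytes(100, 1, 64))   # 6400 bytes
# Halving the latency (e.g. via a cache hit) halves the requirement:
print(fifo_register_bytes(50, 1, 64))    # 3200 bytes -- linear, as claimed
```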
> Looking at CPUs, caches reduce latency by a factor of 100. Yes, it's another context, and yes, CPU caches take a huge amount of die space, but reducing latency by a factor of 100 sounds very efficient to me.

Ok, I initially pointed this out, but then removed it because it's too obvious.
> Not true (in context). Prefetching significantly reduces the number of cold misses.

You can not prefetch what you don't know.
> For every clock cycle you have to wait for a memory access, you need registers to store all temporary results. Yes, it's a linear trade-off, but one that rises very fast. With caches, every decrease in cache miss ratio increases the chances of being able to keep the ALUs busy.

Let's guesstimate a bunch of numbers.
> Correct me if I'm wrong, but as far as I know caches have a larger density than register files.

Yes, for multi-megabit caches. Hardly, for the little stuff we're talking about here.
> You can not prefetch what you don't know.

Yes you can. Just guess the next address by looking at the access pattern history.
> That's 64 bytes per thread or 1536 for all threads.

Let's look at the chip as a whole. R600 can probably keep 1024 threads in flight with 16 scalars each? That's a 64 kB register file. Its L2 cache is 256 kB in total if I'm not mistaken.
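For reference, the 64 kB figure follows directly from the assumptions in that post (1024 threads, 16 scalars each, 32-bit scalars):

```python
# Sanity check of the register file estimate above (assuming 32-bit scalars).
threads = 1024
scalars_per_thread = 16
bytes_per_scalar = 4
print(threads * scalars_per_thread * bytes_per_scalar)  # 65536 bytes = 64 kB
```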
> If you double your cache, you can reasonably expect the miss rate to halve...

I'm not sure if that's true for the semi-ordered texture access pattern... unless with prefetching.
> But the general rule about all memory architectures is that caches are a terrible way to solve latency if other solutions are possible.

Why? If your data set fits in the cache then no RAM bandwidth is wasted and you need very few temporary registers. And with a well-predictable access pattern, prefetching allows larger data sets while still requiring only minimal bandwidth and temporary registers.
> Yes, for multi-megabit caches. Hardly, for the little stuff we're talking about here.

The register file has three read ports and one write port, which makes it fairly complex as far as I know. The cache needs just one read port. Furthermore, with static CMOS a register is 10 transistors while an SRAM cell is 6 transistors. And finally, cache cells are typically optimized for size with full custom design.
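For a rough feel for the density argument, count transistors per stored bit under the static-CMOS numbers above; the per-port overhead here is a made-up illustrative figure, and decoders, sense amps and full-custom layout (which favour the cache further) are ignored:

```python
# Rough transistor-per-bit comparison under the assumptions in the post:
# a static CMOS register bit costs ~10 transistors plus extra area per port,
# while a single-ported 6T SRAM cell stays at ~6 transistors.
# The cost-per-port figure is an illustrative assumption, not a real number.
def transistors_per_bit(base, ports, cost_per_port=2):
    return base + ports * cost_per_port

rf_bit = transistors_per_bit(10, ports=4)   # 3 read ports + 1 write port
sram_bit = transistors_per_bit(6, ports=0)  # single-ported 6T cell
print(rf_bit, sram_bit, rf_bit / sram_bit)  # 18 6 3.0
```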
> Yes you can. Just guess the next address by looking at the access pattern history.

Yep. Look at ATI's most recent texturing patent and you'll see it's explicitly designed to work in concert with rasterisation, i.e. "maximising prefetch" whilst also optimising memory access patterns (by organising cache in a 3-dimensional way). Sure that's a generalisation for what all texturing systems do, but they've spent yet more transistors making it work "better".
> Let's look at the chip as a whole. R600 can probably keep 1024 threads in flight with 16 scalars each? That's a 64 kB register file.

Woah! Way off.
> Its L2 cache is 256 kB in total if I'm not mistaken.

That's just L2 for texturing. There's two distinct L1s, one for 2D texturing and one for vertex data (1D), both of which are backed by the 256KB of L2. Note L2 is prolly split 4 ways, one quarter per SIMD (where each SIMD has 1/4 of the 16 TMUs, i.e. 4). But to be honest I don't know of any confirmation of this.
> The register file has three read ports and one write port, which makes it fairly complex as far as I know.

I'm not aware of any specific info for the organisation of R600's register files. Logically it needs the 3+1 organisation you describe because that's what a MAD is. But if you consider that R600 can co-issue a vec4 MAD and a SF, e.g. RCP, then it's a bit more fiddlesome. ARGH.
> That's just L2 for texturing. There's two distinct L1s, one for 2D texturing and one for vertex data (1D), both of which are backed by the 256KB of L2. Note L2 is prolly split 4 ways, one quarter per SIMD (where each SIMD has 1/4 of the 16 TMUs, i.e. 4). But to be honest I don't know of any confirmation of this.

If the L2 is distributed in like manner as the TMUs, then 1/4 of each SIMD can access 1/4 of the L2, since 1/4 of each SIMD is tied to one of the 4 sampler units.
> Yes you can. Just guess the next address by looking at the access pattern history.

A prefetcher after a cache tries to predict the next access based on previous misses. This works well enough for the linear case, where you can basically just overfetch, and in some cases you're even able to detect linked lists, but that's already considered really advanced stuff, and it's still a 1D data structure.
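A minimal model of that kind of history-based prefetcher, covering just the linear/stride case described above (the table size and policy are arbitrary assumptions):

```python
# Minimal stride prefetcher sketch: watch the miss address stream, and when
# two consecutive misses are separated by a constant stride, prefetch the
# next address along that stride. This is the simple linear case described
# above; real prefetchers (linked-list detection etc.) are far hairier.

class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def on_miss(self, addr):
        """Returns an address to prefetch, or None if no pattern is seen."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                prefetch = addr + stride  # pattern confirmed: overfetch
            self.last_stride = stride
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
for a in (100, 164, 228, 292):          # linear walk, stride 64
    print(a, '->', pf.on_miss(a))       # predicts 292 and 356 once locked on
```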
> Let's look at the chip as a whole. R600 can probably keep 1024 threads in flight with 16 scalars each? That's a 64 kB register file. Its L2 cache is 256 kB in total if I'm not mistaken.

Yes, sorry, my calculation above was just a general example with one executing engine, not really GPU tailored. When you have n parallel engines, you have to multiply the register size accordingly. This introduces a one-time scaling of n in favor of a cache, but it also increases the chance of a cache hit when threads are grouped as quads. Good, because the 20% cache miss rate was too charitable for the cache case. Counterintuitively, higher hit rates favor the latency hiding FIFO case, because they reduce the delta in average latency that you get when you increase the cache. (See the difference between avg. latency below.)
                  Case 1   Case 2   Delta
Miss rate         5%       2.5%
Cache latency     5        5
Mem latency       100      100
Cache size        16384    32768    +16384
Thread size       64       64
Avg latency       9.75     7.375
RF per engine     624      472
Nr of engines     64       64
Register file     39936    30208    -9728

(Sizes in bytes, latencies in cycles.)
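The table's numbers can be reproduced directly from the stated assumptions; here's the arithmetic behind it:

```python
# Reproduces the table above: average latency is the hit/miss-weighted mean,
# and the register file per engine scales linearly with that average latency.
def avg_latency(miss_rate, cache_lat=5, mem_lat=100):
    return (1 - miss_rate) * cache_lat + miss_rate * mem_lat

def register_file(miss_rate, thread_size=64, engines=64):
    per_engine = avg_latency(miss_rate) * thread_size
    return per_engine, per_engine * engines

for miss in (0.05, 0.025):
    per_engine, total = register_file(miss)
    print(miss, avg_latency(miss), per_engine, total)
# 0.05  -> avg latency 9.75,  624.0 per engine, 39936.0 total
# 0.025 -> avg latency 7.375, 472.0 per engine, 30208.0 total
# Doubling the cache (+16384 bytes) saves only 9728 bytes of register file.
```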
> I'm not sure if that's true for the semi-ordered texture access pattern... unless with prefetching.

For a CPU it generally is, at least until your core data set basically fits in the cache and your long-term locality is fairly constant. There must be some different characteristics for textures, but it's probably a good enough approximation.
But please correct me if I got some stuff wrong. I'm here to learn.
> A texture pattern is pretty much 3 dimensional: s,t and lod. Maybe you can come up with fancy storage patterns that allow you to fetch multiple directions in one go, but it won't be easy. And then there's tri-linear interpolation and AF.

> If the L2 is distributed in like manner as the TMUs, then 1/4 of each SIMD can access 1/4 of the L2, since 1/4 of each SIMD is tied to one of the 4 sampler units.

In this diagram each L2 acts as an L3 for other L2s in the GPU:

3-D rendering texture caching scheme

> A 3D rendering texture caching scheme that minimizes external bandwidth requirements for texture and increases the rate at which textured pixels are available. The texture caching scheme efficiently pre-fetches data at the main memory access granularity and stores it in cache memory. The data in the main memory and texture cache memory is organized in a manner to achieve large reuse of texels with a minimum of cache memory to minimize cache misses. The texture main memory stores a two dimensional array of texels, each texel having an address and one of N identifiers. The texture cache memory has addresses partitioned into N banks, each bank containing texels transferred from the main memory that have the corresponding identifier. A cache controller determines which texels need to be transferred from the texture main memory to the texture cache memory and which texels are currently in the cache using a least most recently used algorithm. By labeling the texture map blocks (double quad words), a partitioning scheme is developed which allows the cache controller structure to be very modular and easily realized. The texture cache arbiter is used for scheduling and controlling the actual transfer of texels from the texture main memory into the texture cache memory and controlling the outputting of texels for each pixel to an interpolating filter from the cache memory.
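For flavour, one classic "fancy storage pattern" is Morton (Z-order) tiling; this is a well-known technique for 2D texel locality, not a claim about what this particular patent does:

```python
# Morton (Z-order) address sketch: interleave the bits of s and t so that
# texels close in 2D map to addresses close in 1D. A cache line then holds
# a small square block of texels instead of a thin row. Illustrative only;
# this is a classic tiling technique, not the patent's specific scheme.
def morton(s, t, bits=16):
    addr = 0
    for i in range(bits):
        addr |= ((s >> i) & 1) << (2 * i)
        addr |= ((t >> i) & 1) << (2 * i + 1)
    return addr

# A 2x2 quad of texels lands in 4 consecutive addresses:
print([morton(s, t) for t in (0, 1) for s in (0, 1)])  # [0, 1, 2, 3]
```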
> And no matter what, speculative prefetches are, well, speculative. So when you're wrong, you've just wasted really valuable bandwidth.

But in GPUs speculative prefetching should be a big win because texture data is highly regimented. Additionally, it's possible to hide complex memory layouts for texture data, so whatever hare-brained scheme the GPU designers invent for the "tiling" of textures in GPU memory, the programmer doesn't have to mess about with this stuff to optimise cache line usage or whatever.
Ok, so prefetching is doable. Good.
You're fighting Amdahl's law with a technique that scales exponentially in size. I'm doing it with a linear one.
> I just have to increase a FIFO and a register file and I'm guaranteed to solve the problem under all circumstances.

Except in the case of synchronization or inter-thread dependencies.
> But for latency they're no good...

Yeah, but GPUs have never had texture cache for latency reduction - it's always been about bandwidth efficiency. When a texel is fetched its lifetime, per frame, is almost certainly over within a few clock cycles of it appearing in the texture mapping pipeline.
> I'm a bit surprised by your number about the total number of threads in a GPU: this indicates that the number of threads required for texture fetch latency hiding is actually quite low relative to the total number of live threads (and makes the whole thread cost issue even more irrelevant).

Please note that "threads" here may not mean what you think: ATI defines a thread as a group of pixels, e.g. 64 in R600 (a batch, effectively). NVidia defines a thread as a pixel (or vertex or primitive).
> Any idea why the other threads are needed? Trying to cover instruction dependencies would be one thing, of course. Anything else?

2 vec4s, as I described, is not enough for anything but the most trivial DX8 shader (as far as I can tell). Normal shader programs might have 4 or 12 or more vec4 registers defined. When you increase the register payload per pixel, you proportionally decrease the number of 64-pixel batches (threads in ATI terminology) that can occupy the full 4MB, say, of register file. A 12-register shader is 6x more registers than the example I gave, so that means 1/6th of the pixels can exist concurrently in the register file. So, 131,072 pixels in flight gets reduced to ~21,000. That's ~340 64-pixel batches. Split that 4 ways (for each SIMD) and that's ~85 batches. Each batch consumes 4 core clocks per instruction, so that's ~340 clocks of worst-case latency hiding. It's overkill.
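Spelling that arithmetic out, using the post's own assumptions (4 MB register file, 64-pixel batches, 4 SIMDs, 4 core clocks per instruction per batch):

```python
# Reproduces the worst-case latency-hiding estimate above, using the post's
# assumptions: a 4 MB register file, 64-pixel batches, 4 SIMDs, and 4 core
# clocks per instruction per batch.
reg_file_bytes = 4 * 1024 * 1024
bytes_per_vec4 = 16                      # 4 x fp32
for regs_per_pixel in (2, 12):
    pixels = reg_file_bytes // (regs_per_pixel * bytes_per_vec4)
    batches = pixels // 64
    per_simd = batches // 4
    clocks = per_simd * 4                # 4 clocks per instruction per batch
    print(regs_per_pixel, pixels, batches, per_simd, clocks)
# 2 regs  -> 131072 pixels, 2048 batches, 512 per SIMD, 2048 clocks
# 12 regs ->  21845 pixels,  341 batches,  85 per SIMD,  340 clocks of hiding
```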
> [...] this indicates that the number of threads required for texture fetch latency hiding is actually quite low relative to the total number of live threads (and makes the whole thread cost issue even more irrelevant).

Lecture 12 here:
Nick said:
> Looking at CPUs, caches reduce latency by a factor of 100. Yes, it's another context, and yes, CPU caches take a huge amount of die space, but reducing latency by a factor of 100 sounds very efficient to me.