AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

If my calculations are right, a 32-CU chip would have:

— 4 × 64 kB (vec regs) + 64 kB (LDS) + 16 kB (R/W data L1) + 8 kB (scalar regs) = 344 kB per Compute Unit,
— 16 kB (shared L1) + 32 kB (shared iL1) = 48 kB per CU Array,
— Probably 512 kB of L2.

That's a total of 32×344 + 8×48 + 512 = 11,904 kB or 11.6 MB of internal memory, i.e. registers + cache. That's quite a lot, and if I recall correctly, Fermi has about 4 MB.

Edit:

Actually, Fermi has:

— 128 kB (vec regs) + 64 kB (L1) = 192 kB per SM,
— 768 kB of L2.

That's a total of 16×192 + 768 = 3,840 kB or 3.75 MB of internal memory.
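A quick sanity check of both tallies above (all figures as given in the post; the 512 kB of L2 for the 32-CU chip is the post's own guess):

```python
# Totals for the speculated 32-CU Southern Islands chip and for Fermi,
# in kB, using the per-unit figures quoted in the post.

si_per_cu = 4 * 64 + 64 + 16 + 8   # vec regs + LDS + R/W data L1 + scalar regs
si_per_array = 16 + 32             # shared L1 + shared iL1
si_total = 32 * si_per_cu + 8 * si_per_array + 512   # + guessed L2

fermi_per_sm = 128 + 64            # vec regs + L1
fermi_total = 16 * fermi_per_sm + 768                # + L2

print(si_total, fermi_total)  # 11904 3840 (kB), i.e. ~11.6 MB vs 3.75 MB
```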
 
Right (edit: you forgot the Tex-L1 for Fermi, if you include that it's a bit more than 4 MB), and Cayman has:
24 × (256 kB regs + 32 kB LDS + 8 kB L1) = 24 × 296 kB
512 kB L2 (and 64 kB GDS)

Total: 7.5 MB
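The same check for the Cayman tally (figures as quoted above):

```python
# Cayman on-chip storage, in kB: 24 SIMDs of regs + LDS + L1, plus L2 and GDS.
cayman_total = 24 * (256 + 32 + 8) + 512 + 64
print(cayman_total, cayman_total / 1024)  # 7680 7.5 -> 7.5 MB
```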
 
Isn't it per memory controller, each controller driving two 32-bit channels? I was assuming this and a 256-bit interface.
 
Okay, but do we really know whether a channel comprises (at least) two parallel DRAM chips, or whether a memory controller drives two (32-bit) channels?
 
And the 64KB of constant (uniform) cache per SM in Fermi, too! ;)
 
The spec allows the use of up to 64 kB of constant memory, but the cache itself is much smaller, only 8 kB or so, if I remember correctly. And it's not like AMD GPUs don't have a constant buffer as well ;)

Maybe we should start to count the bytes in the write combining buffers. :LOL:
 
I guess Fermi no longer has an instruction cache either :)

We don't know what 28nm Fermi (Kepler) looks like yet and it could have 2MB L2 for all we know. I'm half expecting it. The first iteration of FSA could have 1-2MB of L2 as well.
 
Okay, but do we really know whether a channel comprises (at least) two parallel DRAM chips, or whether a memory controller drives two (32-bit) channels?

Perhaps I'm misunderstanding what you mean, but would that really make any difference?

I guess Fermi no longer has an instruction cache either :)

We don't know what 28nm Fermi (Kepler) looks like yet and it could have 2MB L2 for all we know. I'm half expecting it. The first iteration of FSA could have 1-2MB of L2 as well.

Sounds plausible. So far, GPUs tend to have a buttload of registers, lots of small L1 cache, and a tiny L2. The current trend (visible on NVIDIA's architectures anyway) is that registers don't grow as quickly as SPs or L1 cache. L2 is very new so there's not much we can say about it.

So yeah, maybe the future will see register_size/SP going down a bit with L1 and L2 caches growing significantly. GPUs integrated into APUs are quite likely to have access to a pretty large L3 cache as well (as is already the case in Sandy Bridge), so it's possible that future GPU memory hierarchies will look much more like those of CPUs.
 
Perhaps I'm misunderstanding what you mean, but would that really make any difference?
A memory channel is independent, it can read from a different address than other channels (at least I would define it that way). Using two 32 bit DRAM chips in parallel on a single channel, reads 64 bit/cycle from a single address.
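A toy sketch of that distinction (all widths, names, and addresses here are hypothetical, just to make the point concrete): independent channels can service unrelated addresses in the same cycle, while chips ganged onto one channel always read from a single address.

```python
# Two independent 32-bit channels vs. two 32-bit DRAM chips ganged on one
# 64-bit channel. Peak bandwidth is 64 bits/cycle either way; only the
# independent case can split it across two unrelated addresses.

def fetch_independent(addr_a, addr_b, width_bits=32):
    """Two channels: two transactions, two unrelated addresses."""
    return [("ch0", addr_a, width_bits), ("ch1", addr_b, width_bits)]

def fetch_ganged(addr, chip_width_bits=32, chips=2):
    """Two chips ganged on one channel: one address, double the width."""
    return [("ch0", addr, chip_width_bits * chips)]

print(fetch_independent(0x1000, 0x8FF0))  # two 32-bit reads, two addresses
print(fetch_ganged(0x1000))               # one 64-bit read, one address
```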

Edit:
Register files as used by GPUs are basically cheaper to implement than caches with the same bandwidth and comparable latency.
 
@128k / 64b mem, it only comes to 512k for SI.

Right, assuming a 256-bit bus :)
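The back-of-envelope behind those numbers, assuming the 128 kB figure is per 64-bit memory controller:

```python
# A 256-bit bus split into 64-bit controllers gives four controllers;
# at 128 kB of L2 each, that is 512 kB total.
controllers = 256 // 64
l2_total = controllers * 128
print(l2_total)  # 512 (kB)
```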

Sounds plausible. So far, GPUs tend to have a buttload of registers, lots of small L1 cache, and a tiny L2. The current trend (visible on NVIDIA's architectures anyway) is that registers don't grow as quickly as SPs or L1 cache. L2 is very new so there's not much we can say about it.

So yeah, maybe the future will see register_size/SP going down a bit with L1 and L2 caches growing significantly. GPUs integrated into APUs are quite likely to have access to a pretty large L3 cache as well (as is already the case in Sandy Bridge), so it's possible that future GPU memory hierarchies will look much more like those of CPUs.

Yeah, GPUs have solved the latency hiding problem for the most part but register files and thread counts can't grow indefinitely. There should be a lot more focus on reducing absolute memory latencies going forward.

Does anyone know how Cayman's LDS coalescing works? I believe broadcast is supported but does it also support swizzled reads in a single request?
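A toy model of banked-LDS behaviour may make the question concrete. The 32-bank/4-byte-word layout and the broadcast rule below are assumptions for illustration, not a statement of how Cayman actually coalesces; AMD's docs describe the real rules.

```python
# Toy banked-LDS model: a read request completes in one pass per distinct
# address mapping to the same bank; identical addresses within a bank are
# broadcast in a single pass.

def lds_passes(word_addresses, banks=32):
    """Return how many serial passes a set of per-lane word reads needs."""
    per_bank = {}  # bank -> set of distinct word addresses hitting it
    for addr in word_addresses:
        per_bank.setdefault(addr % banks, set()).add(addr)
    # Broadcast: duplicate addresses in one bank count only once.
    return max((len(addrs) for addrs in per_bank.values()), default=0)

print(lds_passes(range(32)))                   # unit stride: 1 pass
print(lds_passes([0] * 32))                    # broadcast of one word: 1 pass
print(lds_passes([i * 2 for i in range(32)]))  # stride 2: 2-way conflict, 2
```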
 
I don't think there's any real alternative to more registers and threads, although returns will diminish. Unless there's a revolution in package tech.

It would help massively though if they could unify the registers, lds and caches into a single pool. Just registers and lds would be a big step forward as well.
 
Thinking of Larrabee with a swizzle/permute unit between register file and vector ALU?
 
I was expecting AMD to abandon 64-wide SIMD batches and switch to 32 or even 16 as they evolve their architecture to better run non-graphics and less regular workloads, even though that would certainly cost something in terms of power and performance per area.
 

I expected that as well but this is arguably a better overall decision. 32 NOPs on Cayman waste 100% of compute resources for 2 cycles. 32 NOPs on GCN waste only 25% of compute resources for 2 cycles. So you still get a significant benefit.
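The 100%-vs-25% figures can be sketched with a trivial utilization model. The premise (hedged, since it's the post's own reasoning) is that a stalled wavefront on Cayman idles the CU's single shared SIMD, while on a GCN CU it idles only one of four SIMDs and the other three keep running other wavefronts.

```python
# Fraction of a CU's compute throughput lost while one wavefront sits in NOPs.
def wasted_fraction(idle_simds: int, total_simds: int) -> float:
    return idle_simds / total_simds

cayman = wasted_fraction(1, 1)  # single shared SIMD: whole CU stalls
gcn = wasted_fraction(1, 4)     # one of four SIMDs stalls
print(cayman, gcn)  # 1.0 0.25
```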
 
Not really. From the point of view of a divergent wavefront things might be a bit better due to improved occupancy, certainly not 4x better.
 
My understanding was discrete GPUs at the end of the year and integration into the successor of Trinity in H2/2012, probably more towards the end of 2012. Going forward, they aim to reduce the delay between discrete and APU to about 6 months.

Is this updated architecture going to be inside SI, or something after? I thought NI was a hybrid between SI and Evergreen, or was that a smoke screen for the architectural change?
 
Yes, this new CU architecture will be in SI. The full feature list of FSA will be implemented gradually over the next few generations.
VLIW4 was a baby step towards this larger architectural overhaul.
 