AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

If my calculations are right, a 32-CU chip would have:

— 4 × 64 kB (vec regs) + 64 kB (LDS) + 16 kB (R/W data L1) + 8 kB (scalar regs) = 344 kB per Compute Unit,
— 16 kB (shared L1) + 32 kB (shared iL1) = 48 kB per CU Array,
— Probably 512 kB of L2.

That's a total of 32×344 + 8×48 + 512 = 11,904 kB or 11.6 MB of internal memory, i.e. registers + cache. That's quite a lot, and if I recall correctly, Fermi has about 4 MB.

Edit:

Actually, Fermi has:

— 128 kB (vec regs) + 64 kB (L1) = 192 kB per SM,
— 768 kB of L2.

That's a total of 16×192 + 768 = 3,840 kB or 3.75 MB of internal memory.
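A quick sanity check of both tallies above (all figures as given in the post; the 512 kB of L2 for the 32-CU chip is the post's own guess):

```python
# Totals for the speculated 32-CU Southern Islands chip and for Fermi,
# in kB, using the per-unit figures quoted in the post.

si_per_cu = 4 * 64 + 64 + 16 + 8   # vec regs + LDS + R/W data L1 + scalar regs
si_per_array = 16 + 32             # shared L1 + shared iL1
si_total = 32 * si_per_cu + 8 * si_per_array + 512   # + guessed L2

fermi_per_sm = 128 + 64            # vec regs + L1
fermi_total = 16 * fermi_per_sm + 768                # + L2

print(si_total, fermi_total)  # 11904 3840 (kB), i.e. ~11.6 MB vs 3.75 MB
```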
 
Right (edit: you forgot the Tex-L1 for Fermi, if you include that it's a bit more than 4 MB), and Cayman has:
24 × (256 kB regs + 32 kB LDS + 8 kB L1) = 24 × 296 kB
512 kB L2 (and 64 kB GDS)

Total: 7.5 MB
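The same check for the Cayman tally (figures as quoted above):

```python
# Cayman on-chip storage, in kB: 24 SIMDs of regs + LDS + L1, plus L2 and GDS.
cayman_total = 24 * (256 + 32 + 8) + 512 + 64
print(cayman_total, cayman_total / 1024)  # 7680 7.5 -> 7.5 MB
```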
 
Isn't it per memory controller, each controller driving two 32-bit channels? I was assuming this and a 256-bit interface.
 
Okay, but do we really know whether a channel comprises (at least) two parallel DRAM chips, or whether a memory controller drives two (32-bit) channels?
 
And the 64KB of constant (uniform) cache per SM in Fermi, too! ;)
 
The spec allows the use of up to 64 kB of constant memory, but the cache itself is much smaller, only 8 kB or so, if I remember correctly. And it's not like AMD GPUs don't have a constant buffer as well ;)

Maybe we should start to count the bytes in the write combining buffers. :LOL:
 
I guess Fermi no longer has an instruction cache either :)

We don't know what 28nm Fermi (Kepler) looks like yet and it could have 2MB L2 for all we know. I'm half expecting it. The first iteration of FSA could have 1-2MB of L2 as well.
 
Okay, but do we really know whether a channel comprises (at least) two parallel DRAM chips, or whether a memory controller drives two (32-bit) channels?

Perhaps I'm misunderstanding what you mean, but would that really make any difference?

I guess Fermi no longer has an instruction cache either :)

We don't know what 28nm Fermi (Kepler) looks like yet and it could have 2MB L2 for all we know. I'm half expecting it. The first iteration of FSA could have 1-2MB of L2 as well.

Sounds plausible. So far, GPUs tend to have a buttload of registers, lots of small L1 cache, and a tiny L2. The current trend (visible on NVIDIA's architectures anyway) is that registers don't grow as quickly as SPs or L1 cache. L2 is very new so there's not much we can say about it.

So yeah, maybe the future will see register_size/SP going down a bit with L1 and L2 caches growing significantly. GPUs integrated into APUs are quite likely to have access to a pretty large L3 cache as well (as is already the case in Sandy Bridge), so it's possible that future GPU memory hierarchies will look much more like those of CPUs.
 
Perhaps I'm misunderstanding what you mean, but would that really make any difference?
A memory channel is independent, it can read from a different address than other channels (at least I would define it that way). Using two 32 bit DRAM chips in parallel on a single channel, reads 64 bit/cycle from a single address.
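A toy sketch of that distinction (all widths, names, and addresses here are hypothetical, just to make the point concrete): independent channels can service unrelated addresses in the same cycle, while chips ganged onto one channel always read from a single address.

```python
# Two independent 32-bit channels vs. two 32-bit DRAM chips ganged on one
# 64-bit channel. Peak bandwidth is 64 bits/cycle either way; only the
# independent case can split it across two unrelated addresses.

def fetch_independent(addr_a, addr_b, width_bits=32):
    """Two channels: two transactions, two unrelated addresses."""
    return [("ch0", addr_a, width_bits), ("ch1", addr_b, width_bits)]

def fetch_ganged(addr, chip_width_bits=32, chips=2):
    """Two chips ganged on one channel: one address, double the width."""
    return [("ch0", addr, chip_width_bits * chips)]

print(fetch_independent(0x1000, 0x8FF0))  # two 32-bit reads, two addresses
print(fetch_ganged(0x1000))               # one 64-bit read, one address
```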

Edit:
Register files as used by GPUs are basically cheaper to implement than caches with the same bandwidth and comparable latency.
 
@128k / 64b mem, it only comes to 512k for SI.

Right, assuming a 256-bit bus :)
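The back-of-envelope behind those numbers, assuming the 128 kB figure is per 64-bit memory controller:

```python
# A 256-bit bus split into 64-bit controllers gives four controllers;
# at 128 kB of L2 each, that is 512 kB total.
controllers = 256 // 64
l2_total = controllers * 128
print(l2_total)  # 512 (kB)
```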

Sounds plausible. So far, GPUs tend to have a buttload of registers, lots of small L1 cache, and a tiny L2. The current trend (visible on NVIDIA's architectures anyway) is that registers don't grow as quickly as SPs or L1 cache. L2 is very new so there's not much we can say about it.

So yeah, maybe the future will see register_size/SP going down a bit with L1 and L2 caches growing significantly. GPUs integrated into APUs are quite likely to have access to a pretty large L3 cache as well (as is already the case in Sandy Bridge), so it's possible that future GPU memory hierarchies will look much more like those of CPUs.

Yeah, GPUs have solved the latency hiding problem for the most part but register files and thread counts can't grow indefinitely. There should be a lot more focus on reducing absolute memory latencies going forward.

Does anyone know how Cayman's LDS coalescing works? I believe broadcast is supported but does it also support swizzled reads in a single request?
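A toy model of banked-LDS behaviour may make the question concrete. The 32-bank/4-byte-word layout and the broadcast rule below are assumptions for illustration, not a statement of how Cayman actually coalesces; AMD's docs describe the real rules.

```python
# Toy banked-LDS model: a read request completes in one pass per distinct
# address mapping to the same bank; identical addresses within a bank are
# broadcast in a single pass.

def lds_passes(word_addresses, banks=32):
    """Return how many serial passes a set of per-lane word reads needs."""
    per_bank = {}  # bank -> set of distinct word addresses hitting it
    for addr in word_addresses:
        per_bank.setdefault(addr % banks, set()).add(addr)
    # Broadcast: duplicate addresses in one bank count only once.
    return max((len(addrs) for addrs in per_bank.values()), default=0)

print(lds_passes(range(32)))                   # unit stride: 1 pass
print(lds_passes([0] * 32))                    # broadcast of one word: 1 pass
print(lds_passes([i * 2 for i in range(32)]))  # stride 2: 2-way conflict, 2
```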
 
I don't think there's any real alternative to more registers and threads, although returns will diminish. Unless there's a revolution in package tech.

It would help massively though if they could unify the registers, lds and caches into a single pool. Just registers and lds would be a big step forward as well.
 
Thinking of Larrabee with a swizzle/permute unit between register file and vector ALU?
 
I was expecting AMD to abandon 64-wide SIMD batches and switch to 32 or even 16 as they evolve their architecture to better run non-graphics and less regular workloads, even though that would certainly cost something in terms of power and performance per area.
 

I expected that as well but this is arguably a better overall decision. 32 NOPs on Cayman waste 100% of compute resources for 2 cycles. 32 NOPs on GCN waste only 25% of compute resources for 2 cycles. So you still get a significant benefit.
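The 100%-vs-25% figures can be sketched with a trivial utilization model. The premise (hedged, since it's the post's own reasoning) is that a stalled wavefront on Cayman idles the CU's single shared SIMD, while on a GCN CU it idles only one of four SIMDs and the other three keep running other wavefronts.

```python
# Fraction of a CU's compute throughput lost while one wavefront sits in NOPs.
def wasted_fraction(idle_simds: int, total_simds: int) -> float:
    return idle_simds / total_simds

cayman = wasted_fraction(1, 1)  # single shared SIMD: whole CU stalls
gcn = wasted_fraction(1, 4)     # one of four SIMDs stalls
print(cayman, gcn)  # 1.0 0.25
```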
 
Not really. From the point of view of a divergent wavefront things might be a bit better due to improved occupancy, certainly not 4x better.
 
My understanding was discrete GPUs at the end of the year and integration into the successor of Trinity in H2/2012, probably more towards the end of 2012. Going forward, they aim to reduce the delay between discrete and APU to about 6 months.

Is this updated architecture going to be inside SI, or something after? I thought NI was a hybrid between SI and Evergreen, or was that a smoke screen for the architectural change?
 
Yes, this new CU architecture will be in SI. The full feature list of FSA will be implemented gradually over the next few generations.
VLIW4 was a baby step towards this larger architectural overhaul.
 