Your experience correlates well with articles I have read. High-speed I/O usually requires fairly high voltage to maintain signal integrity, which is something of a vicious circle with regard to power consumption: the higher the frequency, the higher the voltage. I do think costs can be significantly lower for an APU than for discrete components on a mature process.
Yes, yields will always be better for two smaller chips than for one big one, but at some point on a mature process they'll be good enough, and eventually the costs of packaging, testing, shipping, and installing two chips will exceed those of one bigger chip.
I don't know where the break-even point is, but AMD seems to think it's in their business interest to sell a combined CPU+GPU chip instead of two lower-cost chips.
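The break-even argument above can be sketched with a back-of-the-envelope model: a classic Poisson yield curve, comparing one big die against two half-size dice as defect density falls. All the numbers here (wafer cost, usable area, packaging cost, defect densities) are made-up illustrative assumptions, not real process data.

```python
import math

WAFER_COST = 5000.0        # $ per wafer (assumed)
WAFER_AREA = 70000.0       # mm^2 usable on a 300 mm wafer (approx.)
PACKAGE_TEST_COST = 5.0    # $ per packaged, tested chip (assumed)

def die_cost(area_mm2, defect_density_per_mm2):
    """Cost per good die: wafer cost spread over good dice, plus packaging/test."""
    yield_frac = math.exp(-defect_density_per_mm2 * area_mm2)  # Poisson yield model
    dice_per_wafer = WAFER_AREA / area_mm2
    good_dice = dice_per_wafer * yield_frac
    return WAFER_COST / good_dice + PACKAGE_TEST_COST

# One 400 mm^2 SoC vs. two 200 mm^2 discrete chips, at falling defect densities.
for d0 in (0.005, 0.002, 0.0005):  # defects per mm^2 (immature -> mature)
    one_big = die_cost(400, d0)
    two_small = 2 * die_cost(200, d0)
    print(f"D0={d0}: one big ${one_big:.2f} vs two small ${two_small:.2f}")
```

With these numbers, the two small dice win easily on the immature process, but once defect density drops far enough the single chip's saved packaging and test cost tips the balance the other way, which is the crossover being debated here.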
From my experience in the embedded world, I've seen that I/O can be a significant contributor to power. I've seen very high-speed I/O (at least for our application, ~1 GB/s) consume around 15-20% of the power budget (~10 W). Granted, some of that could have been our own implementation's fault, but I imagine that keeping CPU/GPU I/O on chip would be very beneficial, especially if there's a large on-chip eDRAM.
Using a silicon interposer for the two chips would actually mitigate some of this problem, thanks to shorter traces and better signal conditions.
Regarding a break-even point (die-size-wise) for when to stay discrete and when to go SoC: that of course depends very much on the maturity of the process in question and on the chip design. A design with fine-grained redundant elements will help yields significantly. The GPU part may be better than the CPU in that regard, but this is where you can tweak your design to better suit an SoC, or to stay discrete.
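To first order, the redundancy effect can be sketched by assuming defects that land in repairable (redundant) area don't kill the die, so only the non-redundant "critical" area counts toward yield. The areas, defect density, and repairable fraction below are illustrative assumptions only.

```python
import math

def die_yield(area_mm2, defect_density, repairable_frac=0.0):
    """Poisson yield counting only defects in the non-repairable area."""
    critical_area = area_mm2 * (1.0 - repairable_frac)
    return math.exp(-defect_density * critical_area)

# A hypothetical 400 mm^2 die at 0.002 defects/mm^2:
print(die_yield(400, 0.002))        # no redundancy
print(die_yield(400, 0.002, 0.5))   # half the area repairable (e.g. GPU shader arrays)
```

This is why the GPU half of an APU is more forgiving: its highly regular shader and memory arrays give a large repairable fraction, while CPU control logic mostly does not.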
Packaging and testing one chip design instead of two is certainly a significant cost reduction as well, as you point out.