It's weird though: they say it can do 4 multiplies and 4 adds, so it looks like those would happen at the same time.
They can do multiplies and adds at the same time -- as separate instructions on separate registers, issued to separate execution units. This is different from FMA, where the add is applied immediately to the result of the multiply and a third register. Which is better? It's complicated. Separate units often have lower latency for FADD, which helps in some situations, but a longer total latency when you're doing multiply-then-add sequences (the FMA pattern), which is what a lot of the vector math you do will be all about.
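To make the distinction concrete, here's a minimal sketch in C intrinsics (my own example, not anything from the slides). Jaguar has no FMA instruction, so the fused version below is what an FMA-capable core (e.g. Piledriver, compiled with -mfma) would run:

```c
#include <immintrin.h>

/* Separate multiply and add: two instructions on two execution units.
   The add depends on the multiply's result, so the dependency chain
   pays FMUL latency plus FADD latency. */
__m128 mul_then_add(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

/* Fused multiply-add: one instruction, one rounding step, one issue
   slot -- but only on cores that actually have an FMA unit. */
__m128 fused(__m128 a, __m128 b, __m128 c) {
    return _mm_fmadd_ps(a, b, c);
}
```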
Throughput should be as good as having a single FMA unit, with the caveat that you need two instructions, and that the chip frontend can only decode two instructions per clock. It's not as bad as it sounds, because in x86 an ALU op can also contain a memory op (and you effectively need at least one of those for every ALU op). However, since there is always some overhead, I'd expect that the chip can't quite reach its theoretical max due to issue throughput, unless it can decode two 256-bit AVX ops per clock. That would be awesome. (I'm not too optimistic about that, though -- BD can't do that.)
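To make the memory-op point concrete, here's a hedged sketch (my naming; the exact codegen depends on the compiler):

```c
#include <immintrin.h>

/* With -O2 -mavx, compilers typically fold the load into the multiply,
   emitting something like
       vmulps xmm0, xmm1, [rdi]
       vaddps xmm2, xmm2, xmm0
   so one mul+load and one add fill the two decode slots of a clock. */
void madd4(const float *x, __m128 coeff, __m128 *acc) {
    __m128 v = _mm_loadu_ps(x);   /* load, foldable into vmulps */
    *acc = _mm_add_ps(*acc, _mm_mul_ps(coeff, v));
}
```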
That would be nice if doable; I honestly don't know enough to say one way or the other.
FYI, hardware.fr (so in French) states that the L2 interface can handle 24 read/write operations simultaneously. I don't know whether that would be a limitation when adding cores (or at what point it would start to become a bottleneck).
The hard part of putting more cores in a system is cache coherency, or snooping. Every time you write to a cache line for the first time, or read in a new cache line, you effectively need to ask every cache in the system if they have that line. In older cache systems, that's literally what happened. As you add more cores, that doesn't scale.
So instead you make your LLC (Last Level Cache, the cache that will pass requests to memory if they are not found there -- L3 for Intel, L2 for Jaguar) inclusive, so that every cached line must also be present in it, and then deal with coherency there. Now you only have to ask one place. Having two L2 caches means you have to ask both your own and the foreign one, which is worse than the cleaner single-LLC system, but not yet horrible (and it's probably much easier to do than building a single LLC that can support the request volume of 8 cores).
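Here's a toy model of the two lookup schemes, just to show the shape of the scaling argument (all names and sizes are made up, and a real protocol tracks line state, not mere presence):

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_CORES       8
#define LINES_PER_CACHE 4

typedef struct {
    unsigned long tags[LINES_PER_CACHE];  /* addresses of lines held */
    int used;
} Cache;

/* Old-style broadcast snoop: ask every private cache in the system.
   The work grows linearly with core count, which is why it stops scaling. */
static bool snoop_broadcast(const Cache caches[], int n, unsigned long line) {
    for (int c = 0; c < n; c++)
        for (int i = 0; i < caches[c].used; i++)
            if (caches[c].tags[i] == line)
                return true;
    return false;
}

/* Inclusive LLC: any line cached anywhere must also sit in the LLC,
   so a single lookup answers the question no matter how many cores. */
static bool snoop_inclusive(const Cache *llc, unsigned long line) {
    for (int i = 0; i < llc->used; i++)
        if (llc->tags[i] == line)
            return true;
    return false;
}

int main(void) {
    Cache cores[NUM_CORES] = {0};
    Cache llc = {0};

    /* Core 3 pulls in line 0x1000; inclusion forces a copy into the LLC. */
    cores[3].tags[cores[3].used++] = 0x1000;
    llc.tags[llc.used++] = 0x1000;

    printf("broadcast: %d caches asked, hit=%d\n", NUM_CORES,
           snoop_broadcast(cores, NUM_CORES, 0x1000));
    printf("inclusive: 1 cache asked,  hit=%d\n",
           snoop_inclusive(&llc, 0x1000));
    return 0;
}
```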
That, by the way, is exactly how ppc470 cores are laid out. They are coupled with L2 in "clusters" of 4, and then these clusters can be combined on the bus.
The slides released are completely silent on the Jaguar system architecture, so all this is just speculation. However, I really do think that the very last point on the L2 slide points to this -- the 16 additional snoop queue spots would be very useful if you wanted to put more Jaguar "clusters" in a system.
As a side note, from your post I can tell that you have a positive view of this architecture.
BD disappointed me; Bobcat was a very positive surprise. Frankly, given the design restrictions, I did not expect it to be anywhere near as good as it is. It has a few shortcomings (integer divide, wtf), but all in all it's a very efficient, simple architecture. It's basically what Atom should have been.
A lot of people seem disappointed at the apparent lack of oomph in Jaguar. Don't be. It won't reach numbers as high as the best of them, but it is freakishly efficient compared to what console devs are used to. For walking down decision trees and the like it will easily be at least 5x better per clock, and with intelligent optimizing even more than that. (I mean, this is a CPU with a 1-cycle latency, 1/2-cycle reciprocal throughput conditional move...) And in the areas where it is at its weakest, a lot of the work can be offloaded to the GCN CUs.
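Since the conditional move is the whole trick for tree walking, here's a minimal branchless sketch (my own toy structure; whether the ternary actually becomes a cmov is up to the compiler, though at -O2 it typically does):

```c
/* Toy binary decision tree: leaves are encoded as negative indices
   (~leaf_id), so the walk needs no branches at all inside the loop. */
typedef struct {
    int   feature;     /* which input feature this node tests */
    float threshold;
    int   left, right; /* child indices; negative means leaf */
} Node;

int walk(const Node *nodes, const float *features, int root) {
    int idx = root;
    while (idx >= 0) {
        const Node *n = &nodes[idx];
        /* Compiles to a 1-cycle cmov instead of an unpredictable
           branch -- per level you pay latency, not mispredictions. */
        idx = (features[n->feature] < n->threshold) ? n->left : n->right;
    }
    return ~idx;  /* decode leaf id */
}
```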
So hypothetically:
200mm² SoC:
8 Jaguars @ 2GHz, ~64 GFLOPS of AVX throughput
128 GFLOPS: 8 cores * 2 instructions per clock (FMUL + FADD) * 4 elements per vector * 2 GHz. And that includes doing the operand loads.
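Spelled out as a sanity check (all figures are the hypothetical ones above):

```c
#include <stdio.h>

int main(void) {
    const double cores = 8;   /* Jaguar cores in the hypothetical SoC */
    const double ipc   = 2;   /* FMUL + FADD issued per clock */
    const double lanes = 4;   /* 4 floats per 128-bit vector op */
    const double ghz   = 2;   /* clock frequency in GHz */
    printf("%.0f GFLOPS\n", cores * ipc * lanes * ghz);  /* prints 128 */
    return 0;
}
```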