New AMD low power X86 core, enter the Jaguar

Another thing to discuss is how they have implemented 256-bit vector (AVX) support for the chip, and how Bobcat handled 128-bit SSE instructions in its half-width 64-bit vector pipelines. Did Bobcat split 128-bit vector ops in the decoders into two 64-bit uops? Does Jaguar do the same for 256-bit ops (split them into 128-bit uops)?
Not much different from how the P3 and K8 internally handled 128-bit SSE ops. AMD even had special recommendations in its programming manual for the K8 line regarding this particular issue, which for some reason were left in the manual for the K10 family and certainly caused some confusion.
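The cracking idea being discussed can be sketched in a few lines. This is purely illustrative (the function and op names are mine, not AMD's actual decoder behaviour): a decoder turns each vector op into as many pipe-width uops as needed.

```python
def decode(ops, pipe_bits):
    """Split each (name, width_bits) vector op into pipe-width uops,
    the way Bobcat is described cracking 128-bit SSE for its 64-bit
    pipes, and Jaguar presumably cracks 256-bit AVX for 128-bit pipes."""
    uops = []
    for name, bits in ops:
        parts = max(1, bits // pipe_bits)  # how many pipe-width slices
        uops += [(name, half) for half in range(parts)]
    return uops

# One 128-bit SSE add on Bobcat-style 64-bit pipes -> two uops:
print(decode([("addps", 128)], 64))    # [('addps', 0), ('addps', 1)]
# One 256-bit AVX add on Jaguar-style 128-bit pipes -> two uops:
print(decode([("vaddps", 256)], 128))  # [('vaddps', 0), ('vaddps', 1)]
```

The point of cracking in the decoders rather than the schedulers is that the rest of the machine only ever sees native-width uops.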
 
I wonder if this means the vector ALU lanes in Jaguar are a hand-tuned design. The entire integer ALU pipeline in the P4 was one of the most carefully designed layouts due to its double-pumped operation. Well, Jaguar certainly won't have to work at such high clock rates.
 
It is true: http://www.anandtech.com/bench/Product/328?vs=116

The K10-based Deneb is, I think, about 15-20% faster, so compared to Bobcat the difference is probably 25-35%. That's not small.

It isn't that massive when you consider the differences between Bobcat and Jaguar versus the differences between K8 and K10.

Bobcat does quite well in integer performance despite both K8 and K10 having 3 integer ALUs that can execute any instruction, while Bobcat has 2 integer ALUs and only one has a multiplier. Obviously, feeding the ALUs is a much bigger bottleneck. Jaguar gets an integer DIV unit as well.

On the FPU side, where Bobcat can really suffer, it's basically 3 128-bit FPUs (Barcelona/Deneb) vs 2 64-bit (Bobcat). So with Jaguar it becomes 2 128-bit FPUs vs 3 128-bit FPUs (Jaguar vs Barcelona), or 4 128-bit FPUs vs 2 128-bit FMAs (2x Jaguar cores vs 1x BD module). I'm gonna bet that with the better predictors / decoupled fetch / LSU / schedulers in Jaguar compared to K10, Jaguar will on average sustain higher FPU performance at the same clock.
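Those pipe counts can be turned into rough peak numbers. A back-of-envelope sketch (my own arithmetic from the figures in this post, counting an FMA as two ops; real sustained rates depend entirely on the op mix):

```python
def peak_128bit_ops(pipes, pipe_bits, fma=False):
    """Equivalent 128-bit FP ops per clock, treating an FMA as 2 ops."""
    return pipes * pipe_bits / 128 * (2 if fma else 1)

print(peak_128bit_ops(2, 64))             # Bobcat:      1.0
print(peak_128bit_ops(3, 128))            # K10/Deneb:   3.0
print(peak_128bit_ops(2, 128))            # Jaguar:      2.0
print(peak_128bit_ops(2, 128, fma=True))  # 1 BD module: 4.0
```

Peak is one thing; the bet above is about how much of that peak each front end can actually sustain.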

I guess in a year's time we will see if I'm right :LOL:.

I wonder when the Steamroller APUs come out? At that time there could be some very interesting comparisons: Bobcat vs Jaguar vs Trinity vs Steamroller vs K10, clock for clock :runaway:
 
I'm curious what those more familiar with what goes into making processors think about how web sites are latching onto AMD's use of automated layout tools for its CPU cores. What is actually new here, other than the fact that AMD of all companies is using the tools so extensively?

That "amoeba-like" arrangement of logic is something some may still recall from Intel's Prescott.
The tools have doubtless advanced quite a bit since then.
 
Bobcat does quite well in integer performance despite both K8 and K10 having 3 integer ALUs that can execute any instruction, while Bobcat has 2 integer ALUs and only one has a multiplier.
K8 and K10 couldn't quite execute all instructions in all 3 pipes (but most of them, something Bobcat mostly retained; the integer pipes are still quite symmetric, just one fewer). Most notably, they too could only handle MUL in one pipe (in fact there's not a single x86 CPU out there with more than one multiplier in the integer domain).

As for SIMD, I highly doubt Jaguar will achieve the performance of K10 clock for clock (even if K10's distribution of ops to pipes was very dumb). That is, if you run the same binary code at least; the new instructions Jaguar supports could definitely help quite a bit in some cases.
 
I'm curious what those more familiar with what goes into making processors think about how web sites are latching onto AMD's use of automated layout tools for its CPU cores. What is actually new here, other than the fact that AMD of all companies is using the tools so extensively?

That "amoeba-like" arrangement of logic is something some may still recall from Intel's Prescott.
The tools have doubtless advanced quite a bit since then.

This is anecdotal, but just the other day somebody was telling me about a competition between hand-laid and automated layout run at a rather big shop in the business, with the automated flow showing favourable results by a significant margin. Now, these guys weren't Intel, albeit quite huge in their own right, and the measurements were for individual blocks; it's an open question how things end up when you have to do global as opposed to local optimization.

The Intel mention is relevant because, in my opinion, when people whine about the use of auto tools they miss the fact that auto tools are only likely to be worse than hand layout done by very capable, large, well-funded teams, which is not necessarily the case for anybody but Intel these days. So yeah, Intel's teams will probably do better overall with hand layout, but that does not automatically mean that handwork is good for everyone, IMHO. AMD using automated layout seems very reasonable given their context/state.
 
Hand-laid designs do incorporate a certain level of regularity that in some instances only facilitates human-level organization and understanding, and isn't necessary for achieving the smallest size or best performance; I could easily see it being detrimental in some cases. I'm sure that mindlessly following some rules and brute-force permuting within the space of all permissible layouts would net you something alien-looking but better than what a human team could design. (Of course, that's never going to finish before the heat death of the Universe on any realistic transistor count...)

A much simplified but similar kind of problem is circle packing in a square: http://en.wikipedia.org/wiki/Circle_packing_in_a_square. The best known packings aren't necessarily what people would come up with in a reasonable amount of time, even after a lot of head scratching:

http://hydra.nat.uni-magdeburg.de/packing/csq/d5.html
http://hydra.nat.uni-magdeburg.de/packing/csq/d64.html

Circuit designers also operate on the level of logical blocks rather than spatial packings, so you can imagine there's a lot of efficiency to be gained from automation when a lot of transistors are involved. Automation has some drawbacks now, but it's entirely possible that one day extra computing resources and refined heuristics will make automated layout better in every way than hand-drawn designs.
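As a toy illustration of the packing analogy, here is the crudest possible "automated layout" for the circle problem: pure random search over placements, keeping the best common radius found. Even this brute approach lands on irregular, alien-looking arrangements rather than tidy grids (the code is mine and illustrative only):

```python
import random

def pack_circles(n, trials=3000, seed=1):
    """Random search: drop n centers into the unit square and score the
    placement by the largest equal radius it allows (limited by the
    walls and by half the closest pair distance)."""
    rng = random.Random(seed)
    best_r, best_pts = 0.0, None
    for _ in range(trials):
        pts = [(rng.random(), rng.random()) for _ in range(n)]
        # wall constraint: each center must be at least r from every edge
        r = min(min(x, 1 - x, y, 1 - y) for x, y in pts)
        # overlap constraint: centers must be at least 2r apart
        for i in range(n):
            for j in range(i + 1, n):
                dx, dy = pts[i][0] - pts[j][0], pts[i][1] - pts[j][1]
                r = min(r, (dx * dx + dy * dy) ** 0.5 / 2)
        if r > best_r:
            best_r, best_pts = r, pts
    return best_r, best_pts

r, layout = pack_circles(5)
print(round(r, 3))  # well below the known optimum of ~0.207 for n=5
```

A real tool would of course use smarter moves (annealing, force-directed refinement) instead of blind restarts, but the shape of the problem, a huge irregular search space where human-friendly regularity buys nothing, is the same.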
 
Bobcat does quite well in integer performance despite both K8 and K10 having 3 integer ALUs that can execute any instruction, while Bobcat has 2 integer ALUs
In their Bulldozer slides AMD revealed that 3 integer ALUs in their previous architectures were not a good way to spend transistors. They couldn't extract enough ILP from the majority of code to keep all 3 integer pipelines filled. The performance gain from the third integer ALU was marginal, so it was removed in Bulldozer. Bobcat is just applying the same principle here (remove underutilized hardware).

On the other hand, Intel has Hyper-Threading, so they can better fill the execution pipelines of their CPUs even with code that doesn't have a sufficient amount of ILP.
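The ILP argument can be made concrete with a toy scheduler: on dependency-chain-heavy code, a third ALU issues nothing the first two couldn't already handle. This is a deliberately simplified model of my own (unit latency, perfect scheduling), not a simulation of any real core:

```python
def cycles_to_issue(deps, width):
    """deps[i] lists the instructions that i depends on (1-cycle latency).
    Greedily issue up to `width` ready instructions per cycle and
    return the total cycle count."""
    done = [None] * len(deps)
    cycle = 0
    while any(d is None for d in done):
        issued = 0
        for i, reqs in enumerate(deps):
            if done[i] is None and issued < width and \
               all(done[r] is not None and done[r] < cycle for r in reqs):
                done[i] = cycle
                issued += 1
        cycle += 1
    return cycle

# Mostly serial code: two dependent chains feeding a final op.
chained = [[], [0], [1], [2], [], [4], [3, 5]]
print(cycles_to_issue(chained, 2), cycles_to_issue(chained, 3))  # 5 5
# Fully independent ops are where a third ALU actually pays off:
parallel = [[] for _ in range(6)]
print(cycles_to_issue(parallel, 2), cycles_to_issue(parallel, 3))  # 3 2
```

With the chained workload the third pipe buys exactly nothing, which is the underutilization argument in miniature; SMT attacks the same problem by mixing in a second thread's independent instructions.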
 
Are you sure that was about the ALU? I'm pretty sure that was in relation to the 3rd AGU, and they kept it there just to keep everything symmetrical.
 
I think so too. The three symmetric units could execute a macro-op each, but if you actually threw three instructions with memory operands at them, it would stall on D$ accesses.

Cheers
 
I think there were 2 issues at hand:
a) in the K8/K10 design, ALUs and AGUs are paired, hence you need 3 AGUs if you have 3 ALUs, even though you can only ever perform 2 loads per clock, so the third AGU is a bit pointless (not quite 100%, though, since it can perform a LEA, which doesn't require a memory access).
b) it is quite difficult to actually find 3 independent instructions to execute simultaneously. To increase the probability of this you need larger ROBs etc., so overall power efficiency decreases. And for the cases where you actually could extract 3 independent instructions, you'd need a fatter decoder for BD's third ALU to be really useful, I guess.
 
The third AGU was present because it cost a tiny amount of area and transistors while simplifying the job of the scheduler. While only a few scenarios could make use of a third AGU, keeping the pipelines mostly symmetrical meant picking the right lane for a macro-op was simpler.

An ALU or AGU in isolation would not add significant bloat to Bulldozer in terms of area or transistor count.
The high-clock philosophy, and the need to provide all the register accesses and forwarding for an additional ALU and AGU, sound like the big motivators for lopping off the extra pair.
 
One core of Jaguar is only 3.1 mm²? A quad core with SRAM and very good Radeon graphics would be what, less than 40 mm²?

To me the only good thing about Atom was that it sped up the NAS market transition. It performs poorly on netbooks and light-usage desktops (although it sold quite well).

Bobcat was great, but it was late. And if Jaguar allows that kind of improvement, while bringing in quad cores and better graphics, why would one spend money on the desktop Trinity APU? To me both are aimed at markets suited for lightweight workloads, video viewing and internet browsing.

And which one will be more profitable?
 
And if Jaguar allows that kind of improvement, while bringing in quad cores and better graphics, why would one spend money on the desktop Trinity APU? To me both are aimed at markets suited for lightweight workloads, video viewing and internet browsing.

And which one will be more profitable?

Trinity has MUCH better single-thread performance than Jaguar. Everything that is not heavily threaded will work much faster on Trinity.
 
One core of Jaguar is only 3.1 mm²? A quad core with SRAM and very good Radeon graphics would be what, less than 40 mm²?
Not sure what you mean by "very good Radeon graphics", but your number is way too small.
4 cores may be only ~12mm²; double that to include the L2. You could then probably fit it into 40mm² with the required I/O (64-bit DDR3 and some more), but then you'd have no graphics at all.
A scaled-down Cape Verde (let's say 4 CUs) most likely adds another 50mm² on its own.
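Adding up the estimates in this exchange (all of them the rough guesses above, not official figures) shows why a sub-40mm² APU with graphics doesn't work out:

```python
core = 3.1                  # one Jaguar core in mm^2 (figure from the thread)
cores = 4 * core            # quad core: 12.4 mm^2
cores_l2 = 2 * cores        # "double it to include L2": ~24.8 mm^2
io = 40 - cores_l2          # i/o budget left inside a 40 mm^2 target: ~15.2
gpu = 50                    # scaled-down 4-CU Cape Verde estimate
print(cores_l2, cores_l2 + io + gpu)  # ~24.8 and ~90 mm^2 total
```

So the CPU complex alone roughly fits the 40mm² guess, but a usable GPU more than doubles it.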
 
Trinity has MUCH better single-thread performance than Jaguar. Everything that is not heavily threaded will work much faster on Trinity.

Yes, if you want a cheap desktop there's the single-module Trinity (or the Celeron, which is quite a problem for AMD given how fast it actually is, and it also has the industry's only credible open-source Linux drivers).

The A6-5400K variant is even unlocked, so you can flip a BIOS option and clock it to 4.5GHz or something.
You may sadly benefit from this for your "internet browsing", because web pages are pigs (maybe you have to clock to 5GHz for a "turning pages" HTML5 reader to be smooth).

Don't forget 4GB of memory if you commit the folly of running Firefox and Chrome at the same time (or even Chrome alone); with 2GB I got so much swapping that the USB mouse cursor would freeze for five seconds.
 
Really, I miss the ugly '90s web sites that ran fast on a Pentium 90. It has been nice having the modern browser war, though, with all of the clamoring for improved performance being beneficial (and free!) for everyone.
 
Ah, the days when most sites had "Under-construction" and some gif of a guy working w/ a jack hammer. Plus a "you are visitor #: XXX" counter.
 
A scaled-down Cape Verde (let's say 4 CUs) most likely adds another 50mm² on its own.

4 CUs would be really nice. I remember a single-channel memory configuration like Brazos being mentioned for the Jaguar-derived APUs. I wonder if that's enough to feed all the cores. Also, don't forget to add die area for the integrated southbridge.
 