New AMD low power X86 core, enter the Jaguar

sebbbi · Sep 21, 2012

Found the original Bobcat technical whitepaper (from IEEE Micro). Some good stuff inside it, including the confirmation that Bobcat was aimed to be 90% performance of K8 (not 90% IPC of K10). I am making a comparison table based on the information

hkultala said:
... Totally different design.

AMD's own Bobcat technical whitepaper states on many occasions that they reused, improved and fine-tuned old concepts and hardware blocks (from K8 and K10/Barcelona). For example, they state that the Bobcat floating point "coprocessor" is very similar to the K8 floating point "coprocessor", except that they dropped one of the 3 pipelines.

Exophase said:
You left out that Bobcat can only decode/issue 2 instructions per cycle, while PD can decode 4 and issue 4 to an integer core and/or 4 to the FPU. And Bobcat can only do 1 load + 1 store per cycle, while PD can do 2 loads (I don't think it can actually do 2 stores though). These are not small differences - mainly, being able to support a load/store or two in conjunction with two ALU/branch/multiply/etc is a big deal, especially for x86. Even in FPU heavy code it's nice to be able to issue at least one integer instruction in addition to two FP ops for flow control/pointer arithmetic/etc.

Incorrect.

Bobcat/Jaguar core can issue up to 6 instructions per clock (it has a dual port integer scheduler, a dual port AGU scheduler and a float "coprocessor" scheduler that can issue 2 instructions per clock). It however cannot decode/retire more than 2 instructions per clock.

In comparison a BD/PD module (2 cores) can issue 2x4+4 instructions = 12 instructions per clock, and decode/retire 4 instructions. So a BD/PD module (two cores) provides exactly the same peak and sustained rates as two Bobcat/Jaguar cores.

Of course shared decode and especially shared retire are a boon for BD/PD average IPC, since the processor can temporarily boost the sustained decode/retire rate of a core when the other core stalls (pipeline bubbles, cache stalls, branch misprediction, etc). The peak (all cores running at full steam) instruction throughput of Jaguar and BD/PD cores is however exactly the same.
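
A quick sketch of that arithmetic, taking the per-clock numbers above at face value. The dictionary and function names are made up for illustration, and "issue" follows the wording used in this post:

```python
# Per-clock limits as quoted in the post above (illustrative names).
JAGUAR_CORE = {"issue": 2 + 2 + 2, "decode_retire": 2}  # int + AGU + FP schedulers
BD_MODULE = {"issue": 2 * 4 + 4, "decode_retire": 4}    # two int cores + shared FPU


def scale(core, count):
    """Aggregate per-clock limits for `count` identical, independent cores."""
    return {name: value * count for name, value in core.items()}


two_jaguar_cores = scale(JAGUAR_CORE, 2)
print("two Jaguar cores:", two_jaguar_cores)
print("one BD/PD module:", BD_MODULE)
print("same peak rates: ", two_jaguar_cores == BD_MODULE)
```

This only compares peak numbers; it says nothing about how often either design gets close to them.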

Exophase said:
As far as L1D Is concerned, Jaguar does have the bigger cache but loses in associativity (2-way vs 4-way) which is a liability on some workloads.

Incorrect. Bobcat/Jaguar L1D is 8-way. It's twice the size and twice the associativity compared to BD/PD L1D. And Bobcat/Jaguar L1D latency is 3 cycles, while BD/PD L1D latency is 4 cycles.

--> Bobcat/Jaguar L1D is better than BD/PD L1D in every way.

Exophase said:
And from test numbers I've seen its L2 is not just lower bandwidth but at least as high latency.

That's not right. Bobcat/Jaguar have 17 cycle L2 latency and Bulldozer L2 latency is 20-22 cycles.

hkultala · Sep 21, 2012

sebbbi said:
Found the original Bobcat technical whitepaper (from IEEE Micro). Some good stuff inside it, including the confirmation that Bobcat was aimed to be 90% performance of K8 (not 90% IPC of K10). I am making a comparison table based on the information

AMD's own Bobcat technical whitepaper states on many occasions that they reused, improved and fine-tuned old concepts and hardware blocks (from K8 and K10/Barcelona). For example, they state that the Bobcat floating point "coprocessor" is very similar to the K8 floating point "coprocessor", except that they dropped one of the 3 pipelines.

Do you mean this whitepaper?

http://home.dei.polimi.it/sami/architetture_avanzate/AMDbobcat.pdf

Read the whole chapter, not just the first sentence. They changed practically everything. They list a zillion things that are different.

For example: "The floating-point multiplier (FPM) was redesigned entirely to use a smaller multiplier tree (76 bits x 27 bits) to save area and power."

Incorrect.

Bobcat/Jaguar core can issue up to 6 instructions per clock (it has a dual port integer scheduler, a dual port AGU scheduler and a float "coprocessor" scheduler that can issue 2 instructions per clock). It however cannot decode/retire more than 2 instructions per clock.

In comparison a BD/PD module (2 cores) can issue 2x4+4 instructions = 12 instructions per clock, and decode/retire 4 instructions. So a BD/PD module (two cores) provides exactly the same peak and sustained rates as two Bobcat/Jaguar cores.

Of course shared decode and especially shared retire are a boon for BD/PD average IPC, since the processor can temporarily boost the sustained decode/retire rate of a core when the other core stalls (pipeline bubbles, cache stalls, branch misprediction, etc). The peak (all cores running at full steam) instruction throughput of Jaguar and BD/PD cores is however exactly the same.

These are micro-ops, not instructions. And often FP code on Bobcat needs 4 times more micro-ops than on Bulldozer.

But calculating/comparing "peak instruction throughput" is fairly meaningless anyway. It's a situation that happens very rarely and has very little correlation with real world performance.

And even when comparing these stupid peak execution rates:
If there is one integer-only and one fp-heavy thread active at same time, 2 bobcat cores can issue 10 micro-ops/cycle, one bd-module can issue 12 micro-ops/cycle.

Incorrect. Bobcat/Jaguar L1D is 8-way. It's twice the size and twice the associativity compared to BD/PD L1D. And Bobcat/Jaguar L1D latency is 3 cycles, while BD/PD L1D latency is 4 cycles.

--> Bobcat/Jaguar L1D is better than BD/PD L1D in every way.

Bobcat L1D runs at much lower clock speed than Bulldozer's L1D (and would run at lower clock speed even if made at same mfg tech).

Bobcat L1D also has HALF of the bandwidth of Bulldozer's L1D (half the data path width).

hkultala · Sep 21, 2012

mczak said:
First time I heard of that. Source?

http://home.dei.polimi.it/sami/architetture_avanzate/AMDbobcat.pdf

mczak · Sep 22, 2012

hkultala said:
These are micro-ops, not instructions. And often FP code on Bobcat needs 4 times more micro-ops than on Bulldozer.

??? All the normal simd xmm instructions take 2 uops instead of 1, which is a direct result of the units being only 64bit wide. There are a couple exceptions (some shuffles mostly) which are more complex but again this is probably due to the odd "lane exchanging" going on if you've got the 128bit regs implemented as 2x64bit. This will go away with Jaguar. Ok due to the multiplier being a simpler design mulpd will still have more uops but that's about it.
Otherwise I don't see much difference in uops, even the complex stuff which runs like forever has fairly similar uop count (and actually execution time of these often seems much smaller on Bobcat - probably a result of the whole simd unit being simpler).

Bobcat L1D also has HALF of the bandwidth of Bulldozer's L1D (half the data path width).

That is true but it would make no sense if it would have more bandwidth (since all units are 64bits wide only anyway). Jaguar is going to double the bandwidth alongside the simd unit width accordingly.

http://home.dei.polimi.it/sami/archi.../AMDbobcat.pdf

Yes for the _multiplier_ (as I said). Does not apply generally.

sebbbi said:
Of course shared decode and especially shared retire are a boon for BD/PD average IPC, since the processor can temporarily boost the sustained decode/retire rate of a core when the other core stalls (pipeline bubbles, cache stalls, branch misprediction, etc).

Instruction retirement isn't shared (happens in int cores) hence even if one core is doing nothing it is still limited to two/clock in the other core.

Exophase · Sep 22, 2012

sebbbi said:
Incorrect.

Bobcat/Jaguar core can issue up to 6 instructions per clock (it has a dual port integer scheduler, a dual port AGU scheduler and a float "coprocessor" scheduler that can issue 2 instructions per clock). It however cannot decode/retire more than 2 instructions per clock.

The whitepaper is kind of vague about this. It says that 4 ops are input to the scheduler each cycle (it calls it dispatch instead of issue, to really confuse things). This is weird, because if the COPs are anything like the macro-ops used on all the other AMD processors or the fused uops used on Intel processors, splitting them before the scheduler instead of after is a terrible design move - by virtue of sharing an implicit source and destination operand, it's less work to track them together than independently. Not to mention it takes up less space. And it'd be a real problem to try to re-fuse them so they can retire together. So I can't imagine why AMD would do this, and if they aren't doing that, it means there's no more than two COPs that they can issue per cycle. Oh, they might be able to issue two to the integer part and two to the FP part, but it doesn't matter if the decoder can't actually produce more than two - without some post-decode space (loop buffer, uop cache, etc) there's nowhere that can deliver those. So no matter how you look at it, issue is effectively 2/cycle.

sebbbi said:
In comparison a BD/PD module (2 cores) can issue 2x4+4 instructions = 12 instructions per clock, and decode/retire 4 instructions. So a BD/PD module (two cores) provides exactly the same peak and sustained rates as two Bobcat/Jaguar cores.

Of course shared decode and especially shared retire are a boon for BD/PD average IPC, since the processor can temporarily boost the sustained decode/retire rate of a core when the other core stalls (pipeline bubbles, cache stalls, branch misprediction, etc). The peak (all cores running at full steam) instruction throughput of Jaguar and BD/PD cores is however exactly the same.

Yes, peak is the same. You can say this all day long. I'm sure an octo-core Cortex-A9 would have an even higher peak and use even less power, maybe they should use that instead?

Fact is, 4-core peak is nowhere close to the sweet spot. Not for the markets that either Jaguar or Trinity will be playing in. If you were comparing a dual-core Jaguar with a single-module Trinity you'd have a stronger case, although it'd still be nowhere close to airtight.

AMD didn't add sharing of resources to give some slight advantage. Rather, they took away independent resources where they could afford to, for the type of stuff they wanted to run. The shared decoder was an overstep (as can be seen in Steamroller splitting it), but the case where a module isn't close to fully loaded is at least as important, probably more important, than the case where it is.

sebbbi said:
Incorrect. Bobcat/Jaguar L1D is 8-way. It's twice the size and twice the associativity compared to BD/PD L1D. And Bobcat/Jaguar L1D latency is 3 cycles, while BD/PD L1D latency is 4 cycles.

Yeah, I was confusing it with the icache. It's uncommon these days to see an icache and dcache with the same size but different associativity (not counting BD's case, where the two are pretty asymmetric for other reasons). Especially VERY different associativity. Usually they'll use the same set sizes for both.

sebbbi said:
--> Bobcat/Jaguar L1D is better than BD/PD L1D in every way.

That's not right. Bobcat/Jaguar have 17 cycle L2 latency and Bulldozer L2 latency is 20-22 cycles.

I think you quoted the wrong thing.

Yeah, AMD says that. Actually, they say 17 cycles in the BEST case. They don't say what that means. Here's a real world measurement:

http://www.realworldtech.com/forum/?threadid=114178&curpostid=114188

(this would be 21 cycles if it's inclusive of an L1 miss)

That measures worse than Bulldozer - which, btw, AMD claims 18-20 cycles, not 20-22 (but the measurements are, again, more like your number)

hkultala · Sep 22, 2012

mczak said:
??? All the normal simd xmm instructions take 2 uops instead of 1, which is a direct result of the units being only 64bit wide. There are a couple exceptions (some shuffles mostly) which are more complex but again this is probably due to the odd "lane exchanging" going on if you've got the 128bit regs implemented as 2x64bit. This will go away with Jaguar. Ok due to the multiplier being a simpler design mulpd will still have more uops but that's about it.
Otherwise I don't see much difference in uops, even the complex stuff which runs like forever has fairly similar uop count (and actually execution time of these often seems much smaller on Bobcat - probably a result of the whole simd unit being simpler).

Most fp kernels do mul operation followed by add operation. For 128-bit data width (4 32-bit or 2 64-bit numbers), this is..

1 instruction, 1 micro-op in bulldozer (fma)

2 instructions, 4 micro-ops in bobcat

2 instructions, 2 micro-ops in jaguar.
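
Those counts can be reproduced with a small sketch. It assumes, as explained earlier in the thread, that a 128-bit vector op is split into one micro-op per pass through the execution unit (64-bit units on Bobcat, 128-bit on Jaguar and Bulldozer). The function and table names are hypothetical:

```python
import math


def mul_add_cost(vector_bits, unit_bits, has_fma):
    """Return (instructions, micro-ops) for one multiply followed by one add."""
    instructions = 1 if has_fma else 2
    uops_per_instruction = math.ceil(vector_bits / unit_bits)
    return instructions, instructions * uops_per_instruction


# Unit widths and FMA support as described in this thread.
designs = {
    "bulldozer": {"unit_bits": 128, "has_fma": True},
    "bobcat": {"unit_bits": 64, "has_fma": False},
    "jaguar": {"unit_bits": 128, "has_fma": False},
}

costs = {name: mul_add_cost(128, **cfg) for name, cfg in designs.items()}
for name, (instructions, uops) in costs.items():
    print(f"{name}: {instructions} instruction(s), {uops} micro-op(s)")
```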

hkultala · Sep 22, 2012

Exophase said:
That measures worse than Bulldozer - which, btw, AMD claims 18-20 cycles, not 20-22 (but the measurements are, again, more like your number)

AFAIK the 18 cycles was for 1 MiB L2 cache size. And there are no bulldozer-based chips with 1MiB L2 cache.

mczak · Sep 22, 2012

hkultala said:
Most fp kernels do mul operation followed by add operation. For 128-bit data width (4 32-bit or 2 64-bit numbers), this is..
1 instruction, 1 micro-op in bulldozer (fma)
2 instructions, 4 micro-ops in bobcat
2 instructions, 2 micro-ops in jaguar.

Ah ok. Well sure, yes, without fma it is twice as many instructions if you only do multiply-add (and then twice that in uops for bobcat - I was very wrong btw about assuming DP muls need more uops). That is obvious, but with just saying "often 4 times as many uops" without mentioning fma that was really difficult to make sense of...

sebbbi · Sep 22, 2012

hkultala said:
Most fp kernels do mul operation followed by add operation. For 128-bit data width (4 32-bit or 2 64-bit numbers), this is..

1 instruction, 1 micro-op in bulldozer (fma)
2 instructions, 4 micro-ops in bobcat
2 instructions, 2 micro-ops in jaguar.

I am glad you opened up the FMA topic, because it's both a boon and a bane to the BD architecture. It was a risky decision by AMD.

No current processors can fuse mul followed by add (x86 instructions), and create FMAs. If your code isn't brand new or doesn't support the right kind of FMA set (FMA4 and FMA3 sets are incompatible) you will not have any FMA instructions in the program.

Bulldozer module (2 int cores + FP coprocessor) can only execute two 128 bit floating point vector operations per cycle. These can be FMAs, muls, adds or simple float ops. But if your x86 code doesn't have FMAs, BD can only execute two 128 bit floating point adds OR muls per cycle (and simple float operations also use the same FMA pipelines, reducing the cycles available for FMAs/muls/adds). This results in a 4*2 = 8 flops sustained throughput (for a two core module).

In comparison, each Jaguar core can execute two 128 bit floating point vector operations per cycle. Both vector pipelines support most of the simple vector operations. Addition and multiplication are split along the two pipes, so it can co-issue an add and mul (even separate ones) per cycle. This results in 4*2*2 = 16 flops sustained throughput (for two cores).

So Jaguar should be faster in float/vector math heavy legacy programs/games that do not support and extensively utilize FMA3/FMA4 (AVX support isn't enough, since it's a separate instruction set).
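
The flops arithmetic above, written out as a rough sketch. It assumes single precision (four 32-bit lanes per 128-bit vector), counts an FMA as two flops per lane, and assumes a perfectly balanced add/mul mix on Jaguar. The function name is made up for illustration:

```python
LANES = 4  # 32-bit floats in one 128-bit vector


def peak_flops_per_cycle(pipes, cores=1, flops_per_lane=1):
    """Peak single-precision flops per cycle for `cores` cores with `pipes` FP pipes each."""
    return cores * pipes * LANES * flops_per_lane


bd_module_without_fma = peak_flops_per_cycle(pipes=2)                # two shared FMA pipes
bd_module_with_fma = peak_flops_per_cycle(pipes=2, flops_per_lane=2)
two_jaguar_cores = peak_flops_per_cycle(pipes=2, cores=2)            # 1 add + 1 mul pipe per core

print(bd_module_without_fma, bd_module_with_fma, two_jaguar_cores)
```

With pure FMA code the BD module reaches the same peak as two Jaguar cores; without FMA it gets half.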

Bulldozer FMA pipelines (compared to twice as many non-FMA float pipelines) are a boon only if the software is heavily using FMA, and only if the software is decode/retire/fetch bound (*). This fact is clearly visible in Bulldozer floating point benchmarks. It fares well in heavily FMA/XOP optimized programs, but suffers badly in others.

(*) I have to admit, this is a common case for BD

Exophase said:
Yeah, AMD says that. Actually, they say 17 cycles in the BEST case. They don't say what that means. Here's a real world measurement:

http://www.realworldtech.com/forum/?threadid=114178&curpostid=114188
(this would be 21 cycles if it's inclusive of an L1 miss)

That measures worse than Bulldozer - which, btw, AMD claims 18-20 cycles, not 20-22 (but the measurements are, again, more like your number)

According to http://semiaccurate.com/2011/10/17/why-did-bulldozer-underwhelm/, the Bulldozer measured L2 average latency is also significantly worse than its reported minimum latency. 20 cycles minimum vs 25-27 cycles measured average. This seems to be a common feature for all AMD L2 caches. They have significantly higher average latencies than minimum latencies. This is also one of the big things AMD is going to focus on in Steamroller. AMD Steamroller slides: "minimum latency is only part of the story", "design to decrease average load latency". This pretty much confirms that BD average latency is significantly worse than the minimum latency (and I would not be at all surprised if the same is true for Bobcat/Jaguar). Jaguar L2 speed cannot be measured yet, so we have to continue this discussion at a later time.

Bulldozer however is much more reliant on its L2 cache, because it has a tiny 16 kB 4-way L1D cache. Bobcat/Jaguar cores have significantly better 32 kB 8-way caches (that are also one cycle faster), so they have considerably fewer L1D misses (and thus L2 requests). That's one of the key advantages Jaguar has over BD/PD. Especially considering AMD's slow L2 caches (compared to Intel).
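
To show why the L1D miss rate matters so much here, a minimal sketch of the usual average-latency estimate. The L1D and minimum L2 latencies are the ones quoted in this thread; the miss rates are purely hypothetical placeholders (nobody in this thread has measured them), and every L1D miss is assumed to hit in L2 with the quoted latency as the total load-to-use time:

```python
def average_load_latency(l1_cycles, l2_cycles, l1_miss_rate):
    """Average load latency in cycles, assuming every L1D miss hits in L2."""
    return (1.0 - l1_miss_rate) * l1_cycles + l1_miss_rate * l2_cycles


# Latencies from the thread; miss rates are made-up illustration values.
jaguar = average_load_latency(l1_cycles=3, l2_cycles=17, l1_miss_rate=0.03)
bulldozer = average_load_latency(l1_cycles=4, l2_cycles=20, l1_miss_rate=0.05)

print(f"Jaguar:    {jaguar:.2f} cycles (hypothetical 3% L1D miss rate)")
print(f"Bulldozer: {bulldozer:.2f} cycles (hypothetical 5% L1D miss rate)")
```

Plug in measured miss rates and average (not minimum) L2 latencies to get numbers that mean something for a real workload.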

sebbbi · Sep 22, 2012

BD vector pipelines (one module = two cores):
2 x FMA/FADD/FMUL/OtherFloat
2 x MMX

Bobcat/Jaguar vector pipelines (two cores):
2 x FMUL/OtherFloat/MMX
2 x FADD/OtherFloat/MMX

MMX = integer / ALU / comparisons / logic ops / permute / insert / etc (not floating point processing)

If you do pure floating point vector crunching, two Bobcat/Jaguar cores can issue four instructions per clock, while BD module can only issue two (as the MMX pipelines are unused). Both architectures can decode/retire four. Of course in most algorithms you do some MMX ops: logic ops (*-1 = xor highest bit, abs = mask highest bit, etc), permute/insert (to combine/separate lanes and to SOA pack/unpack, etc). However it requires you to have exact 1:1 mix of MMX/FP operations for Bulldozer to catch up (reach 4 uops per cycle), or alternatively some other integer/ALU code to use the remaining of the four fetch/retire slots (loop body, counters, address calculation, etc).

If the algorithm can be represented by series of multiplies followed by dependent additions, the FMA will help BD a lot. But no algorithm is pure FMA, and all float instructions (even the simplest ones) are using the same FMA pipelines. These pipelines can be a bottleneck in many algorithms. Sometimes it's better to have four small pipelines (2 x Jaguar cores) than two heavy hitters (one BD module).
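
A rough throughput model of the pipe layout listed above. It is only a sketch: it ignores latencies, dependencies and the "OtherFloat" category, assumes the work splits evenly across both Jaguar cores, and uses the 4 uops/cycle decode/retire limit for both configurations. The function names are hypothetical:

```python
def bd_module_uops_per_cycle(fp_ops, mmx_ops, other_ops=0):
    """One BD module: 2 FP pipes, 2 MMX pipes, 4 uops/cycle decode/retire."""
    total = fp_ops + mmx_ops + other_ops
    cycles = max(fp_ops / 2, mmx_ops / 2, total / 4)
    return total / cycles


def two_jaguar_uops_per_cycle(fmul_ops, fadd_ops, mmx_ops=0, other_ops=0):
    """Two Jaguar cores: 2 FMUL pipes, 2 FADD pipes (all take MMX), 4 uops/cycle."""
    total = fmul_ops + fadd_ops + mmx_ops + other_ops
    cycles = max(fmul_ops / 2, fadd_ops / 2, total / 4)
    return total / cycles


pure_fp_bd = bd_module_uops_per_cycle(fp_ops=8, mmx_ops=0)
pure_fp_jaguar = two_jaguar_uops_per_cycle(fmul_ops=4, fadd_ops=4)
mixed_bd = bd_module_uops_per_cycle(fp_ops=4, mmx_ops=4)

print("pure FP, BD module:       ", pure_fp_bd)
print("pure FP, two Jaguar cores:", pure_fp_jaguar)
print("1:1 MMX/FP, BD module:    ", mixed_bd)
```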

Update: [forgot sources]
- Brad Burgess, et al.: Bobcat: AMD’s Low-Power x86 Processor. IEEE Micro, March/April 2011
- AMD Jaguar official slides (for example here: http://www.amdzone.com/phpbb3/viewtopic.php?f=532&t=139363)
- http://semiaccurate.com/2011/10/17/why-did-bulldozer-underwhelm/

3dilettante · Sep 23, 2012

I'm seeing a lot of per-clock measures in these comparisons, which is usually forgivable if comparing architectures that operate in the same general clock envelope. That isn't the case with BD and the low-power x86 cores.
Every place where BD is supposedly at parity per-clock should be followed up with the modifier "and it's done 2x faster". This comes at significant area and power cost, but it is a necessity to be even considered for certain workloads that would give Bobcat or Jaguar little notice.

BD's L2 latency in cycles is a liability relative to a core like Sandy Bridge or Haswell. In wall-clock terms it is not slower than Bobcat's, and BD supports a much higher number of outstanding misses from the L2 than its low-power brethren. On a per-clock basis, it would probably take a whole 4-core Jaguar chip to match a single module L2's ability to sustain MLP, and that's before realizing the module is 2x faster.
The prefetch methods employed by Bulldozer are a whole other league compared to Bobcat.
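
To put the wall-clock point in numbers, a rough sketch. The minimum L2 latencies (17 cycles for Bobcat, 20 for BD) are the figures quoted earlier in the thread; the 1.65 GHz Bobcat clock is an assumed example, and the BD clock is simply taken as 2x that to match the "2x faster" modifier:

```python
def latency_ns(cycles, clock_ghz):
    """Convert a latency in core cycles to nanoseconds at the given clock."""
    return cycles / clock_ghz


bobcat_clock_ghz = 1.65               # assumed example clock
bd_clock_ghz = 2 * bobcat_clock_ghz   # assumed "2x faster"

bobcat_l2_ns = latency_ns(17, bobcat_clock_ghz)
bd_l2_ns = latency_ns(20, bd_clock_ghz)

print(f"Bobcat L2: {bobcat_l2_ns:.1f} ns")
print(f"BD L2:     {bd_l2_ns:.1f} ns")
```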

BD gets depressing when the write numbers come into play, but even then it's still running double time compared to Jaguar.

sebbbi · Sep 23, 2012

3dilettante said:
I'm seeing a lot of per-clock measures in these comparisons, which is usually forgivable if comparing architectures that operate in the same general clock envelope. That isn't the case with BD and the low-power x86 cores.
Every place where BD is supposedly at parity per-clock should be followed up with the modifier "and it's done 2x faster". This comes at significant area and power cost, but it is a necessity to be even considered for certain workloads that would give Bobcat or Jaguar little notice.

Of course. There's no denying that BD/PD have 2x-3x higher clock ceilings (but at cost of almost 100W TDP). For desktop PCs and high performance servers Jaguar can never challenge BD/PD.

However AMD did show a 17W ULV Trinity (PD) prototype at a trade show. They show clear interest towards entering the highly profitable Ultrabook market. Apple even considered AMD chips for their Macbook Airs, but chose Intel instead. Laptop and ultraportable markets are growing fast, and BD/PD core isn't that well suited for that market.

Basically my Jaguar vs Trinity/PD investigation began, because I wanted to get deeper insight how a 17W (2 module, 4 core) low clocked Trinity would compare to the new Jaguar core based APU (both have four cores, similar clock rate, similar TDP and can sustain 2 uops/cycle/core). After the trade show event (in January), there has been zero news about the 17W Trinity, and it has been eight months since. I am just wondering if AMD are going to replace it with a Jaguar based APU. A 1.815 GHz (=1.65*1.1) Jaguar based APU should be very close in performance compared to the 17W Trinity running at the rumored 1.5-1.6 GHz clocks. That's why IPC comparisons make sense. I am just trying to figure out how they would compare in a TDP constrained setting.
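
The clock arithmetic, as a small sketch. It assumes performance scales as per-clock throughput times clock, which is a simplification, and the function name is made up for illustration:

```python
def break_even_ipc_ratio(jaguar_ghz, trinity_ghz):
    """Per-clock advantage Trinity would need to match Jaguar at these clocks."""
    return jaguar_ghz / trinity_ghz


jaguar_ghz = 1.65 * 1.1  # the estimate used above

for trinity_ghz in (1.5, 1.6):  # rumored 17W Trinity clocks
    ratio = break_even_ipc_ratio(jaguar_ghz, trinity_ghz)
    print(f"Trinity at {trinity_ghz} GHz needs {ratio:.2f}x the per-clock throughput to tie")
```

So under these assumptions the 17W Trinity would need roughly 13-21% more work per clock to break even.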

3dilettante said:
BD supports a much higher number of outstanding misses from the L2 than its low-power brethren.

Agner Fog didn't find any L2 bottlenecks in his Bobcat analysis. However this is a valid concern for Jaguar. It doubles the peak vector throughput per core and has twice as many cores sharing the same L2 cache. L2 can be a real bottleneck for it.

AMD has noticed that the L2 can be a bottleneck in Jaguar, and improved things:
"All of the supporting functionality for the L2 has been enlarged, enhanced, and widened as well. There are now L2 prefetchers per core, and there can be up to 24 simultaneous read and write transactions in flight at once. This is made a lot saner by the addition of more L2 snoop queues, 16 more in Jaguar. It is all new, all better, and should support many more active cores than Bobcat without choking on itself."
[source: http://semiaccurate.com/2012/08/28/amd-let-the-new-cat-out-of-the-bag-with-the-jaguar-core/]

We will have to wait and see how the shared L2 behaves in benchmarks. The good thing is that Jaguar keeps the excellent 32 kB 8-way L1D caches from Bobcat, and those will help to fight against the possible L2 bottlenecks.

3dilettante said:
The prefetch methods employed by Bulldozer are a whole other league compared to Bobcat.

AMD's slides state that the prefetchers have been improved for Jaguar. However, the information isn't detailed enough to draw any conclusions yet.

sebbbi · Sep 23, 2012

17W ULV Trinity models and clock speeds:
[source: http://www.pclaunches.com/notebooks/lenovo-ideapad-s405-with-upcoming-17w-amd-trinity-apu.php]

1 module, 2 cores:
AMD A4-4355M is a dual-core processor with 1.9GHz clock speed, 2.4GHz Turbo frequency, AMD Radeon HD 7400G graphics, and 1MB L2 cache.

2 modules, 4 cores:
AMD A8-4555M is a quad-core processor with 1.6GHz clock speed, 2.4GHz Turbo frequency, AMD Radeon HD 7600G graphics, and 4MB L2 cache.

AMD A8-4555M (2 modules, 4 cores) at 1.6 GHz vs Jaguar based APU (4 cores) at 1.815 GHz would be an interesting battle. I would expect them to trade blows when all four cores are taxed, but the Trinity should win easily in single core situations (because of its 2.4 GHz turbo clock).

RudeCurve · Sep 23, 2012

sebbbi said:
17W ULV Trinity models and clock speeds:
[source: http://www.pclaunches.com/notebooks/lenovo-ideapad-s405-with-upcoming-17w-amd-trinity-apu.php]

Fixed the link for you.

hkultala · Sep 23, 2012

sebbbi said:
BD vector pipelines (one module = two cores):
2 x FMA/FADD/FMUL/OtherFloat
2 x MMX

Bobcat/Jaguar vector pipelines (two cores):
2 x FMUL/OtherFloat/MMX
2 x FADD/OtherFloat/MMX

MMX = integer / ALU / comparisons / logic ops / permute / insert / etc (not floating point processing)

If you do pure floating point vector crunching, two Bobcat/Jaguar cores can issue four instructions per clock, while BD module can only issue two (as the MMX pipelines are unused). Both architectures can decode/retire four. Of course in most algorithms you do some MMX ops: logic ops (*-1 = xor highest bit, abs = mask highest bit, etc), permute/insert (to combine/separate lanes and to SOA pack/unpack, etc). However it requires you to have exact 1:1 mix of MMX/FP operations for Bulldozer to catch up (reach 4 uops per cycle), or alternatively some other integer/ALU code to use the remaining of the four fetch/retire slots (loop body, counters, address calculation, etc).

If the algorithm can be represented by series of multiplies followed by dependent additions, the FMA will help BD a lot. But no algorithm is pure FMA, and all float instructions (even the simplest ones) are using the same FMA pipelines. These pipelines can be a bottleneck in many algorithms. Sometimes it's better to have four small pipelines (2 x Jaguar cores) than two heavy hitters (one BD module).

What do you mean by "the simplest float instructions" ?

And the pipelines are usually not the worst bottleneck. The bottlenecks are on:
1) Getting data to the FPU.
2) Waiting for dependent instructions to execute.
3) Other slowdowns, like branch prediction misses.

Now go and find some real-world fpu code and look at the flop numbers you get from the real world code.
Then compare this to the theoretical flops of your processor.

But actually, the situation is often that many operations need the same value. When that value becomes ready, it's nice to have more than one FU that can then execute those operations as quickly as possible, and then wait a couple of cycles again until another value arrives and another operation or operations can execute.

But while all operations of one thread are sitting in their reservation stations waiting for their data, the fpu could be used to execute operations from another thread, so bulldozer's shared fpu and intel's hyperthreading both work very well for this.

And no algorithm is pure fma, but actually, in most algorithms like >90% of the executed FP operations really are fma operations, so fma can really give a huge boost.

hkultala · Sep 23, 2012

sebbbi said:
Basically my Jaguar vs Trinity/PD investigation began, because I wanted to get deeper insight how a 17W (2 module, 4 core) low clocked Trinity would compare to the new Jaguar core based APU (both have four cores, similar clock rate, similar TDP and can sustain 2 uops/cycle/core). After the trade show event (in January), there has been zero news about the 17W Trinity, and it has been eight months since. I am just wondering if AMD are going to replace it with a Jaguar based APU. A 1.815 GHz (=1.65*1.1) Jaguar based APU should be very close in performance compared to the 17W Trinity running at the rumored 1.5-1.6 GHz clocks. That's why IPC comparisons make sense. I am just trying to figure out how they would compare in a TDP constrained setting.

Then you should start doing IPC comparisons that make sense instead of senseless "peak execution throughput" comparisons.

itsmydamnation · Sep 23, 2012

The other thing about Bulldozer is that the L2 isn't that slow if you consider what it handles: its function is closer to SB/IB's L3 than their L2, and the latencies are closer as well. The issue is the small L1 and/or lack of an intermediate cache rather than the L2 itself.

sebbbi · Sep 24, 2012

itsmydamnation said:
The other thing about Bulldozer is that the L2 isn't that slow if you consider what it handles: its function is closer to SB/IB's L3 than their L2, and the latencies are closer as well. The issue is the small L1 and/or lack of an intermediate cache rather than the L2 itself.

Exactly. AMD's L2 caches (in BD/PD/Jaguar/Bobcat/etc) are larger and slower than Intel's L2 caches. They are somewhere in between Intel's L2 and L3 caches (in both size and latency). Because there's nothing in between L1D<->L2, the size and associativity of L1D is very important for AMD architectures. This is unfortunately an area where BD/PD are lacking (tiny 16 kB 4-way L1) compared to other AMD designs (Bobcat/Jaguar/Stars/Phenom all have bigger L1D caches with better associativity).

hkultala said:
But while all operations of one thread are sitting in their reservation stations waiting for their data, the fpu could be used to execute operations from another thread, so bulldozer's shared fpu and intel's hyperthreading both work very well for this.

Agreed. This is the biggest advantage BD/PD has over all other AMD architectures (including Stars, Bobcat and Jaguar). When running generic object oriented code (with unpredictable access patterns) BD/PD should have a nice advantage (because of all the pipeline and cache stalls that can be filled).

How much this advantage affects your code base is another debate. As a console programmer, I naturally have a completely different view of this issue than many PC (or server) software programmers. In games there's a huge amount of (mainly vectorized) batch processing happening every frame (viewport culling, matrix multiplies, particle processing, etc). All this batch processing can take 50%-80% of your frame time (depending on game type of course). This kind of code is often highly optimized and uses (often manual) data prefetching heavily, and thus doesn't hit cache (or pipeline) stalls that much. This is also something that is visible in BD benchmarks. It fares well in a lot of PC software, but not so well in games ported from consoles.

hkultala said:
in most algorithms like >90% of the executed FP operations really are fma operations, so fma can really give a huge boost.

I have to disagree with this one. I have never seen heavy inner loops with more than 75% of MAD/FMA instructions. Even the simplest pure SOA-style dot product loop has 3xFMA+1xMUL per four dot products (75% FMA). Our inner view culling loop (mainly 5 dot products per viewport) has some vector float compares and splats in addition to those (around 50% FMA). 4x4 matrix multiply is another FMA heavy operation, but that is 16 splat, 4 mul, 12 FMA (37.5% FMA). If you consider the fact that splats go to MMX pipeline on BD, the percentage of FMA entering float vector pipeline is again 75% (common for pure dot product based operations). It's very hard to create a function with more than 75% FMA.
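
The percentages above can be checked with a few lines. The instruction counts are the ones given in this post; the function name is just for illustration:

```python
def fma_share(fma, **other_ops):
    """Fraction of FMA instructions among all counted instructions."""
    total = fma + sum(other_ops.values())
    return fma / total


dot_products = fma_share(3, mul=1)               # SOA dot product loop
matrix_all = fma_share(12, mul=4, splat=16)      # 4x4 matrix multiply, all ops
matrix_fp_pipes = fma_share(12, mul=4)           # splats go to the MMX pipes on BD

print(f"dot products:          {dot_products:.1%}")
print(f"4x4 matrix (all ops):  {matrix_all:.1%}")
print(f"4x4 matrix (FP pipes): {matrix_fp_pipes:.1%}")
```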

fellix · Sep 24, 2012

sebbbi said:
Because there's nothing in between L1D<->L2, the size and associativity of L1D is very important for AMD architectures. This is unfortunately an area where BD/PD are lacking (tiny 16 kB 4-way L1) compared to other AMD designs (Bobcat/Jaguar/Stars/Phenom have all bigger L1D caches with better associativity).

Well, for BD actually there's a write coalescing cache in between L1D and L2, but it's too small and strictly profiled to cover all the issues.

mczak · Sep 24, 2012

sebbbi said:
Because there's nothing in between L1D<->L2, the size and associativity of L1D is very important for AMD architectures. This is unfortunately an area where BD/PD are lacking (tiny 16 kB 4-way L1) compared to other AMD designs (Bobcat/Jaguar/Stars/Phenom have all bigger L1D caches with better associativity).

Stars/Phenom l1d do not have better associativity - quite the contrary it's only 2-way (which does seem low indeed) but of course they are much bigger (64kB). And they were exclusive and write-back, of course.
