New AMD low-power x86 core, enter the Jaguar

Another thing I noticed is the L2 latency is pretty damn bad.. no less than 25 cycles load to use (I guess that includes the L1 miss cycles as well).. so at best no better than Bobcat
That is typical of AMD CPUs (the Piledriver cache hierarchy is similar). AMD CPUs usually have only two cache levels: L1 and L2, where the L1 is similar to Intel's L1, and the L2 is similar to Intel's L3 (several megabytes in size, shared between many cores). Intel's L2 (starting from Nehalem) is pretty much an extended L1 cache ("L1.5"). It's not that much larger than their L1 (64 KB vs 256 KB) and is private to a single core just like the L1.

The eight-cycle L2 latency of recent Intel CPUs is awesome, but it's not fair to compare Intel's small (non-shared) 256 KB L2 to an eight-times-bigger (2 MB) AMD L2 cache that is shared between many cores. A much bigger shared cache is of course going to have much longer latency. It's a good question why AMD doesn't add another cache level in between their L1 and L2 caches just like Intel did years ago. Nehalem introduced the fast 256 KB L2 cache between the L1 and the bigger shared cache (now called L3). Core 2 was the last Intel CPU to have a large (2 MB) and slow shared L2 (similar to AMD's current designs).
 
Ugh, wrote a long reply a while ago and apparently it disappeared somewhere.

It looks like AVX-256 is indeed split into two macro-ops and that there's a bottleneck that allows only two macro-ops to be dispatched per cycle. Because of this I could see AVX-256 usage being detrimental vs 128-bit instructions, but AVX-128 could still be advantageous.

Especially given how ridiculously overprovisioned the fetch and scan are: 32 bytes per clock, scanning instructions up to a worst-case 17 bytes... that's better than any Intel frontend ever. Using fewer, longer instructions (like 3-op AVX) instead of more shorter ones should basically always be a win. Long immediates should also never hurt.
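
To make the 3-op point concrete, here's a minimal sketch in C with intrinsics (the function name is mine, just for illustration). With the 2-operand SSE encoding the compiler has to emit a movaps copy whenever a source register must survive the operation; the 3-operand VEX (AVX-128) encoding writes to a separate destination, so the copy disappears:

#include <immintrin.h>

/* Compile once with -msse2 and once with -mavx and compare the output.
 * SSE: addps overwrites its first source, so keeping 'a' alive costs
 * an extra movaps. AVX-128: a single vaddps xmm2, xmm0, xmm1 suffices. */
__m128 sum_and_keep(__m128 a, __m128 b, __m128 *keep)
{
    *keep = a;               /* 'a' must survive the addition */
    return _mm_add_ps(a, b);
}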

Another thing I noticed is the L2 latency is pretty damn bad.. no less than 25 cycles load to use (I guess that includes the L1 miss cycles as well).. so at best no better than Bobcat, but maybe slightly worse?

Yep, although I think the L2 can sustain more misses at a time, and there is more BW.

I'm not sure if two macro-ops per cycle retirement is much of an issue. The whole design pretty much looks like it's optimized for two of them per cycle. The requirement that the fast-path double macro-ops retire "simultaneously" shouldn't hurt much either, I think (after all, they have the same dependency and latency, so they should be able to execute right after each other). But yeah, given that the only advantage of AVX-256 is lower instruction count / lower instruction size in the first place, there's probably not much point to AVX-256.

What I was interested in was whether it would be possible to do all your housekeeping ops for free without having to unroll much. Seems that getting all the throughput out of the FPU takes all of the frontend and retire BW. At least we get free load ops per instruction.

Some instructions also got notably faster. OK, we knew about the integer divider already (roughly twice as fast as previous int-domain divides), but omg, 2 popcnt/lzcnt per clock with 1-cycle latency?

I'm pretty hyped about that one. popcnt can be used to implement a fast, space-efficient, immutable, structurally-sharing hash trie, which is a great complex data structure for multithreaded work.
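
For anyone curious, the popcnt trick in a hash array mapped trie (HAMT) node looks roughly like this; a minimal C sketch with made-up names, not from any particular library. A 32-bit bitmap records which of the 32 logical slots are occupied, children are stored densely, and popcnt of the bits below the wanted slot yields the child's index:

#include <stdint.h>

typedef struct hamt_node {
    uint32_t bitmap;             /* bit i set => logical slot i occupied */
    struct hamt_node *child[];   /* only occupied slots are stored       */
} hamt_node;

static hamt_node *hamt_child(const hamt_node *n, unsigned slot)
{
    uint32_t bit = 1u << slot;
    if (!(n->bitmap & bit))
        return 0;                /* slot empty, no child */
    /* number of occupied slots before this one = dense index */
    return n->child[__builtin_popcount(n->bitmap & (bit - 1))];
}

With a 1-cycle, 2-per-clock popcnt, that index computation is essentially free.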

palignr ... now executes at 2 per clock

Is it used for anything other than avoiding cacheline-crossing load penalties on Intel CPUs? Boundary-crossing unaligned loads have always been comparatively very cheap on AMD, so I don't think there's that much use for it anyway.
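
For reference, the Intel-era trick in question looks something like this (C with SSSE3 intrinsics, names mine): two aligned loads stitched together with palignr instead of one potentially line-crossing unaligned load. Note the shift count must be a compile-time constant:

#include <tmmintrin.h>

/* Return the 16 bytes starting at byte offset 4 from an aligned base,
 * without ever issuing an unaligned (possibly line-crossing) load. */
static __m128i load_offset4(const __m128i *aligned_base)
{
    __m128i lo = _mm_load_si128(aligned_base);     /* bytes  0..15 */
    __m128i hi = _mm_load_si128(aligned_base + 1); /* bytes 16..31 */
    return _mm_alignr_epi8(hi, lo, 4);             /* bytes  4..19 */
}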

It's also got _very_ good horizontal operations like haddps (worlds better than Bulldozer, and in fact better than even Ivy Bridge).

Yeah, what is it, 16 horizontal sums of PS in 12 ops over 15 clocks? With half the loads baked in? Leaving half the frontend and FPU free for other work? No reason ever to not use horizontal ops.

Incidentally, this more or less proves that the FPU is new, and not a derivative of any of AMD's previous ones. Their other FPUs implement 128-bit as two physically separate lanes of 80-bit and 64-bit. There's no way you can do horizontals this fast unless the lanes are close together.

even DPPS which is quite a mess on Bulldozer looks quite decent
If I understand the "no local forwarding" hit right, it's exactly the same cost in latency as doing it by hand. However, it issues two more macro-ops, and that probably completely occupies the frontend for 3 clocks, meaning the handcrafted version is probably better unless you want the masking/result-placement properties.

And it keeps the incredible 2-cycle SIMD multiply latency - still wondering how AMD does this when they only manage 3 cycles for an add...

Isn't FP single multiply an easier operation than FP single add? A multiply is a 24-bit significand multiply plus rounding, while an add needs exponent matching (an alignment shift), the add itself, renormalization, and rounding.
 
I also found the denormal penalties description fairly insightful. That you get a denormal input penalty when the value was loaded with a mov instruction, but not when the load was issued directly from a float arithmetic instruction, is interesting. You also get similar penalties when using denormals in logic ops - obviously there's some internal conversion going on.
I personally don't like denormals at all (and always disable denormal support in our programs). All CPUs have huge penalties when processing them, but must still have extra denormal-handling hardware to be IEEE compliant. If someone wants extra precision, it's much cheaper (and better) to just switch to 64-bit floats.
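
For completeness, "disabling denormal support" on x86 usually means setting the FTZ and DAZ bits in MXCSR, so SSE/AVX math flushes denormal results to zero and treats denormal inputs as zero. A minimal sketch, assuming strict IEEE denormal behaviour isn't required:

#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE (FTZ)     */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE (DAZ) */

void disable_denormals(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* results -> 0 */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* inputs  -> 0 */
}

MXCSR is per-thread state, so this has to be set on every thread doing FP work.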
I'm pretty hyped about that one. popcnt can be used to implement a fast, space-efficient, immutable, structurally-sharing hash trie, which is a great complex data structure for multithreaded work.
I love popcnt and lzcnt as well :). We also use them in hashing structures. I always seem to be finding new ways to abuse them. The latest thing being a distance field generator :)
Yeah, what is it, 16 horizontal sums of PS in 12 ops over 15 clocks? With half the loads baked in? Leaving half the frontend and FPU free for other work? No reason ever to not use horizontal ops.

Incidentally, this more or less proves that the FPU is new, and not a derivative of any of AMD's previous ones. Their other FPUs implement 128-bit as two physically separate lanes of 80-bit and 64-bit. There's no way you can do horizontals this fast unless the lanes are close together.

If I understand the "no local forwarding" hit right, it's exactly the same cost in latency as doing it by hand. However, it issues two more macro-ops, and that probably completely occupies the frontend for 3 clocks, meaning the handcrafted version is probably better unless you want the masking/result-placement properties.
Yes. Assuming I understand the Excel table correctly, the SSE3-style sequence (mulps + 2x haddps) is the fastest way (by both total latency and uop count) to perform an (AOS) dot product on Jaguar. SSE4.1 DPPS is slower in the general case (unless you need the masking).
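
In code, the two alternatives for a single 4-component AOS dot product look like this (C intrinsics sketch; function names are mine):

#include <pmmintrin.h>   /* SSE3 haddps */
#include <smmintrin.h>   /* SSE4.1 dpps */

static float dot4_sse3(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);   /* a0*b0, a1*b1, a2*b2, a3*b3 */
    p = _mm_hadd_ps(p, p);         /* pairwise sums              */
    p = _mm_hadd_ps(p, p);         /* full horizontal sum        */
    return _mm_cvtss_f32(p);
}

static float dot4_dpps(__m128 a, __m128 b)
{
    /* immediate 0xF1: multiply all four lanes, put the sum in lane 0 */
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}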
 
I personally don't like denormals at all (and always disable denormal support in our programs). All CPUs have huge penalties when processing them, but must still have extra denormal-handling hardware to be IEEE compliant. If someone wants extra precision, it's much cheaper (and better) to just switch to 64-bit floats.
IIRC some Intel CPUs no longer have a denormal penalty, at least for some ops. But useful or not, it's refreshing to see the penalties detailed a bit more (even though the docs don't say how big the pre-calculation vs. post-calculation hit is).

Yes. Assuming I understand the Excel table correctly, the SSE3-style sequence (mulps + 2x haddps) is the fastest way (by both total latency and uop count) to perform an (AOS) dot product on Jaguar. SSE4.1 DPPS is slower in the general case (unless you need the masking).
Yes, I think it's pretty much only useful if you need the masking. But that's pretty much the same even on Intel CPUs. On Bulldozer, though, it looks like even if you need the masking (any of the possible ones), you can always do much better by hand.
 
The eight-cycle L2 latency of recent Intel CPUs is awesome, but it's not fair to compare Intel's small (non-shared) 256 KB L2 to an eight-times-bigger (2 MB) AMD L2 cache that is shared between many cores. A much bigger shared cache is of course going to have much longer latency. It's a good question why AMD doesn't add another cache level in between their L1 and L2 caches just like Intel did years ago. Nehalem introduced the fast 256 KB L2 cache between the L1 and the bigger shared cache (now called L3). Core 2 was the last Intel CPU to have a large (2 MB) and slow shared L2 (similar to AMD's current designs).

Just speaking in terms of absolute time, the L2 latency on Jaguar is much worse than it was on Core 2, for similar shared L2 sizes. AFAIK it was around 15 cycles (let's say 18 cycles if that doesn't include a presumably 3-cycle L1 miss time), and Core 2 went up to 3.33 GHz on 45nm (not counting a scrapped 3.5 GHz part). That's about 5.4 ns vs 12.5 ns (at 2 GHz for Jaguar; it's possible there's a little more headroom, but I doubt it's a lot). This doesn't factor in that Jaguar shares with twice as many cores, much less the difference in power and area goals, but it's still a lot slower.

You could say Phenom (then Phenom II) had something more like the Nehalem hierarchy before Nehalem existed: dedicated smallish (512KB) caches per core and a larger shared L3, and the L2 latency wasn't nearly as bad as we see on Jaguar now. I'd call Bulldozer another level in between, since it only shares the L2 within a module, not across all cores.
 
Is it? And if so, whose clock is the 16x4 bytes-per-cycle throughput measured against? And I wonder how much of the latency is driven by a minimum number of clocks needed by the L2 state machine rather than the absolute time needed for signal propagation. In other words, how much should we expect a half-clock L2 to impact latency to begin with?
 
I don't think the L2 has half the clock rate. Unless they are being simplistic in the Bobcat/Jaguar Hot Chips presentations, I'm pretty sure it's powered down every other cycle. So it still runs at the same clock speed; it's just only accessible every other cycle. So my simple thinking is that the most that would do is add a one-cycle delay, assuming you're not throughput limited.

edit: I guess it could add more than one cycle depending on exactly how it works. There could be multiple points within the cache where a single piece of data is waiting.
 
The Hot Chips slides do indicate that the L2 runs at half clock and that it's clocked only when required, and further contrast it with the L2 interface, which does run at full speed.

http://www.extremetech.com/gaming/1...-notebooks-and-tablets-if-it-launches-on-time

I guess there is nothing preventing the L2 array from being clocked at half the CPU core frequency while still supporting the L1<->L2 interface running at full tilt. It just means the internal data paths in the L2 array are twice as wide (256 bits/bank). Banking is controlled by address bits 7:6, which means the same bank serves a full cache line at a time (critical word first).
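
In other words, the bank selection described above boils down to something like this (illustrative C):

#include <stdint.h>

/* 64-byte lines: bits 5:0 are the offset within a line, bits 7:6 pick
 * one of four banks, so a whole line always lives in a single bank.  */
static unsigned l2_bank(uint64_t addr)
{
    return (addr >> 6) & 0x3;
}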

Cheers
 
I imagine they put some time into making this doc readable because it will be in two next-gen consoles, not to mention the usual netbooks and tablets. There certainly isn't a doc like this out of AMD for Bulldozer, though Agner Fog does a nice job.
 
Yes, there is. The AMD Family 15h optimization guide covers Bulldozer:
http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf

362 pages of very detailed information. AMD's optimization guides have always been top quality.

That's actually an extensive copy-and-paste exercise from their older manual, so it is inaccurate in quite a few (sometimes non-obvious) places. Moreover, the Jaguar one seems far more polished (as is to be expected from a narrow, focused document, as opposed to an everything-and-the-kitchen-sink one).
 
A Kabini die shot has surfaced:

[Kabini die shot image]


Have fun!
 
And some extra info about AMD G-Series:

http://www.youtube.com/watch?feature=player_embedded&v=LCFLqBgX4Ac

http://www.amd.com/us/Documents/AMDGSeriesSOCProductBrief.pdf

In short:

FIRST GENERATION SOC DESIGN
> Delivers up to 70% overall improvement over AMD G-Series APU
> Integrates Controller Hub functional block as well as CPU+GPU+NB
> 28nm process technology, 24.5mm x 24.5mm BGA package

JAGUAR CPU CORE WITH PERFORMANCE INCREASES
> Dual-core and quad-core, up to 2MB shared L2
> 113% CPU performance improvement over AMD G-Series APU

NEXT GENERATION GRAPHICS CORE WITH PERFORMANCE INCREASE OVER PREVIOUS GENERATIONS
> 20% compute performance improvement over AMD G-Series APU when running multiple industry-standard graphics-intensive benchmarks
> DirectX 11.1 graphics support

IMPROVED POWER SAVING FEATURES
> Power gating added to Multimedia Engine, Display Controller & NB
> DDR P-states for reduced power consumption

MEMORY SUPPORT: SINGLE-CHANNEL DDR3
> Up to DDR3-1600 at 1.35V and 1.25V voltage levels supported
> Up to 2 UDIMMs or 2 SO-DIMMs
> ECC support

INTEGRATED DISPLAY OUTPUTS
> Supports two simultaneous displays
> Supports 4-lane DisplayPort 1.2, DVI, HDMI 1.4a
> Integrated VGA
> Integrated eDP or 18bpp single channel LVDS

UPDATED I/O (FEATURES MAY BE SKU DEPENDENT)
> Four x1 links of PCIe Gen 2 for GPPs
> One x4 link of PCIe Gen 2 for discrete GPU
> 8 USB 2.0 + 2 USB 3.0
> 2 SATA 2.x/3.x (up to 6Gb/s)
> SD Card Reader v3.0 or SDIO controller
And the top embedded model's spec:

Model: GX-420CA
OPN: GE420CIAJ44HM
Cores: 4
TDP: 25W
L2 cache: 2MB
CPU clock: 2.0GHz
GPU clock: 600MHz (HD 8400E)
Memory: DDR3-1600
ECC: Yes
Max operating temp: 90°C
 
The Kyoto is actually not very interesting. It seems to be just the same chip as the normal 4-core Jaguar, just marketed towards servers.

I'd be much more interested if they bundled ~16 Jaguar cores without the GPU on a single chip, set up as independent systems, each with its own single-channel memory controller and 1 Gbit/s Ethernet, plus a shared I/O system. That could actually be pretty good value as a web server, and would fit the SeaMicro stuff well.

As for the Jaguar die shot: they have embedded almost all the I/O a tablet or small laptop needs on the chip. Looks to me like all it needs is a network interface and it's good.
 