New AMD low power X86 core, enter the Jaguar

It's already planned for Haswell (Crystalwell).

and it's 128MB, which is quite a bit more comfortable than 32MB or lower.
To be pedantic, liolio addressed eDRAM and eSRAM; those would be right there on the CPU+GPU die taking up a lot of space. I think we can rule that out; it still only makes sense on consoles (or for such memory to be used as L2 or L3 eventually).

Haswell really brings a kind of silicon-on-interposer prototype memory to market; it's a significant first. They're doing it because they can, and it'll be expensive and high margin.
 
From the Orbis thread (BSN alert though):

http://www.brightsideofnews.com/news/2013/3/5/amd-kaveri-unveiled-pc-architecture-gets-gddr5.aspx

Their concern about the number of chips needed is a little ridiculous; they regularly sell a bigger die with 3-6 GB of GDDR5 on the 7950/7970.

Edit: On second thought, given that the bus width on Kaveri is only 4x32-bit, the density per chip required for 4 GB even in clamshell would be higher than most common parts today, so they have a point. I imagine whatever solution they're using for the PS4 could be shoehorned onto Kabini, though it might be overkill for the netbook/tablet market to have a 128-bit interface.
 
Kaveri isn't an ultra-low-end part, and even smartphones have 2GB of memory these days, so even talking about some Kaveri implementation with less than 4GB of memory is imho ridiculous. Now you can get 4GB of GDDR5 on a 128-bit bus with those newfangled 4Gb chips, which should be there in time, but this is still quite low. However, it looks like GDDR5M is slated to appear later this year, and those chips only have a 16-bit interface width - hence, if it works similarly to GDDR5, maybe you can run them in clamshell mode as well, so you could get 8GB max (with 16 4Gb GDDR5M chips running in 2x8-bit clamshell mode).
I don't know though, that all sounds pretty unbelievable.
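
Just to make the chip-count arithmetic explicit, a quick back-of-the-envelope sketch in C (the figures are only the ones mentioned above, nothing official about actual Kaveri or GDDR5M configurations):

[code]
/* Back-of-the-envelope capacity math: figures come from the post above.
 * chips = (bus width / per-chip I/O width) x 2 in clamshell mode;
 * capacity = chips x per-chip density. */
#include <stdio.h>

static unsigned capacity_gbytes(unsigned bus_bits, unsigned chip_io_bits,
                                unsigned chip_gbits, int clamshell)
{
    unsigned chips = (bus_bits / chip_io_bits) * (clamshell ? 2 : 1);
    return chips * chip_gbits / 8;            /* gigabits -> gigabytes */
}

int main(void)
{
    /* 128-bit bus, 4Gb GDDR5 chips (x32), clamshell: 8 chips -> 4GB */
    printf("GDDR5  4Gb x32 clamshell: %u GB\n", capacity_gbytes(128, 32, 4, 1));
    /* 128-bit bus, 4Gb GDDR5M chips (x16), clamshell: 16 chips -> 8GB */
    printf("GDDR5M 4Gb x16 clamshell: %u GB\n", capacity_gbytes(128, 16, 4, 1));
    return 0;
}
[/code]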
 
I think a move away from commodity DIMMs is perfectly reasonable. Manufacturers can just leverage the experience they have manufacturing graphics cards to embed memory onto APU motherboards.

The result: much better bandwidth with well-tuned configurations. I had mismatched pairs of DIMMs on my DM1Z when I first got it, with each DIMM having different timings and sizes. In the end, soldering DRAM chips yields much simpler manufacturing than cumbersome DIMM slots, with the separate costs of the DIMMs themselves (manufacturing DIMMs at a third party + shipping to the assembly facility + hand installation + reliability costs of physically removable DIMMs, mismatching, etc.) passed on to the consumer. Most consumers never upgrade their memory, and they'll appreciate the massive upgrade in bandwidth for APU graphics since most of their usage isn't so latency sensitive.
 
[Image: d2s4VH1.png]


A whole DIMM in a single small package here.
Supports all the available vertical chip/die stacking options, simplifies wiring, lowers the cost and the area of the PCB, etc. Sounds perfect for the PS4 case, with high memory capacity and limited bus width.
 

I hope not. I don't want to be stuck at an arbitrary 4GB or something, especially with hardware that seemingly competes with my current desktop - built around an Athlon II X2 and a full ATX mobo.

Some users will get by with 1GB of memory; others need 6 or 8, or don't really have a ceiling. I know I'd like to run an OS plus two web browsers plus a few hundred megs of disk cache without touching swap yet, and that pretty much fills 4GB already. Yet CPU usage may be flat and I haven't started fooling around with anything else.

Soldered-in memory can be used, but on a single-channel 64-bit design such as Kabini (if it's still like Bobcat), I'd rather have the base amount soldered on board plus an available SO-DIMM or DIMM slot for upgrades.
I used a 20-year-old computer that was like that (it had 4MB on-board plus four 9-bit SIMM slots, and so was equipped with 20MB total).
A similar computer is the low-end Intel Celeron based Chromebook: it comes with 2GB, but you can drop in an 8GB stick if you wish. Take that away and you will only be able to buy the low-memory version; a high-memory version won't be provided because it would mean a second SKU and needless inventories in the retail supply chains.
 

I'm not personally in love with the lack of customization looming in the future either, but I do like the added performance, which addresses an obvious bottleneck in APU-based architectures. It's definitely coming because it makes business sense:

- Most consumers probably can't even be bothered to open a laptop up for a memory upgrade, or to check whether they're supposed to buy DDR1, 2, or 3 for it, or know that they're supposed to pair up DIMMs for dual-channel operation, or get identical pairs for better timing reliability, etc.

- Even if you care enough to, we're also at a point where replacing the system completely is more compelling than upgrading; a pair of new DIMMs costs around $80-$100, but in many cases a better new laptop is hovering around $200-$300.

- The huge success of tablets and phones with fixed, stripped-down specs is clear, and those form factors aren't suited to upgradeable memory slots. Manufacturers don't want two different boards for laptops and tablets; they will want to consolidate their designs even further by using tweaked versions of the same board in tablets, convertibles, and laptops.

For now, it seems like DDR3 will still be included in Kaveri and GDDR5 is on a back-side bus according to the article. I still anticipate that systems of the future will look a lot like graphics cards do today, and we'll get used to it. Not many enthusiasts bemoan the fact that their graphics cards don't have DIMM slots to upgrade when a whole new board is about $200-300.

Edit: Looks like DDR3 and GDDR5 are mutually exclusive; GDDR5 isn't on some back-side bus, so the chip can be yoked to one or the other.
 
I'm open to such cost reductions (and the other benefits you speak of); after all, I guess I'd like something like a $99 desktop PC to be available eventually for the third world and the first world's underclass.

One option would be to put the APU + memory on a small board akin to an MXM graphics card; it has been done already, years ago: a small board that contains a 1GHz 486-based SoC (with VGA, sound and stuff) and 512MB of DDR2 (later a board with an upgraded chip was released, with somewhat better graphics and memory).
http://www.windowsfordevices.com/c/a/News/NorhTec-Gecko-Info-Pad/

but then you need a design with the right I/O mix (between PCIe, USB, sufficient display bandwidth, etc.).
 
Some benchmark results for our little hero:

http://semiaccurate.com/2013/03/08/opencl-performance-of-next-gen-low-power-apus/#.UTpVDxzkvS0

There's even a nice boost over a Radeon HD 6520G (AMD A6-3400M APU) in some benchmarks. Sign me up for a Surface tablet based on one of these things.

It's OpenCL performance, so not necessarily indicative of gaming or general CPU performance, but it's damn impressive in GPGPU. I wonder how much of that is due to the CUs being GCN and how much is due to HSA?
 
Nothing due to HSA: the software would have to be rewritten.
Or maybe an extraordinarily complex OpenCL implementation would automatically HSA-ify it, with such and such parts run on the CPU or GPU, but I don't know how likely that is.
 
AMD has promised an HSA runtime that sneaks under the OpenCL one, where it will try to optimize certain things, like skipping the copy between CPU and GPU memory spaces. It's not clear when that should come about.
 
[offtopic]

A unified address space between the CPU and GPU sides seems to favour using clEnqueueMapBuffer instead of clEnqueueWriteBuffer/clEnqueueReadBuffer.

But some clEnqueueReadBuffer / clEnqueueWriteBuffer calls could also be converted to not actually copy anything.
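
For illustration, a minimal sketch of the two paths being compared (assumes a GPU device is present; buffer contents, flags and error handling are stripped down, and whether the mapped path really avoids a copy is entirely up to the runtime):

[code]
/* Path 1: explicit copy via clEnqueueWriteBuffer.
 * Path 2: host-visible allocation filled in place via map/unmap,
 * which a unified-address-space runtime may turn into zero-copy. */
#include <CL/cl.h>
#include <string.h>

int main(void)
{
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    float src[256] = {0};                 /* stand-in for real host data */
    const size_t bytes = sizeof src;

    /* Path 1: explicit copy of host data into a device buffer. */
    cl_mem buf1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    clEnqueueWriteBuffer(q, buf1, CL_TRUE, 0, bytes, src, 0, NULL, NULL);

    /* Path 2: map a host-visible buffer and write it in place. */
    cl_mem buf2 = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
    void *p = clEnqueueMapBuffer(q, buf2, CL_TRUE, CL_MAP_WRITE, 0, bytes,
                                 0, NULL, NULL, &err);
    memcpy(p, src, bytes);
    clEnqueueUnmapMemObject(q, buf2, p, 0, NULL, NULL);

    clReleaseMemObject(buf1); clReleaseMemObject(buf2);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
[/code]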

[/offtopic]
 
http://citavia.blog.de/2013/04/17/2-ghz-amd-jaguar-benchmarks-15761535/

https://www.osadl.org/CPU-benchmarks.qa-farm-cpu-benchmarks.0.html


It's something; at least we know the 4-core 25 W part is likely to be running at 2.0 GHz.


The numbers look interesting; the lower DP FPU performance seems apparent. It's hard to tell, but IPC seems very good - like Core 2 level kind of good. Its numbers look pretty comparable to r6s1, which is a Core 2 @ 2GHz, though the Jaguar does have a newer kernel. If it really is Core 2 level IPC, that would very much explain why both Sony and MS would pick an "anaemic" core.


Multithreading looks memory bound, as Dhrystone sees good scaling from single to multi but other workloads not so much. If you look at the Jaguar config, it only has 1 DIMM.
 
Dhrystone and Whetstone are useless for estimating IPC of real world programs.


I know; I didn't really pay much attention to them, other than that devices with the same Dhrystone score have around double the DP Whetstone score, which aligns with what we know. I was looking at the other tests, but it's still all over the place.

But the one deduction you can make across the 10 tests is that performance doesn't seem to be any better or worse than the other modern x86 processors listed. We also had the leaked/rumoured Cinebench numbers that put performance in and around the same category.
 

Thanks. This is very well written. I'd recommend it to anyone who wants a good overview of modern x86 design or really CPU design in general. I've never seen an x86 CPU reference manual give this much information on the branch predictor, although some stuff is still missing.

It looks like AVX256 is indeed split into two macro-ops and that there's a bottleneck that allows only two macro-ops to be dispatched per cycle. Because of this I could see AVX256 usage being detrimental vs 128-bit instructions, but AVX128 could still be advantageous.

Another thing I noticed is the L2 latency is pretty damn bad.. no less than 25 cycles load to use (I guess that includes the L1 miss cycles as well).. so at best no better than Bobcat, but maybe slightly worse?
 
Thanks. This is very well written. I'd recommend it to anyone who wants a good overview of modern x86 design or really CPU design in general. I've never seen an x86 CPU reference manual give this much information on the branch predictor, although some stuff is still missing.
Indeed, that's quite nice.
It looks like AVX256 is indeed split into two macro-ops and that there's a bottleneck that allows only two macro-ops to be dispatched per cycle. Because of this I could see AVX256 usage being detrimental vs 128-bit instructions, but AVX128 could still be advantageous.
I'm not sure if retiring two macro-ops per cycle is much of an issue; the whole design pretty much looks like it's optimized for 2 of them per cycle. The requirement that the fast-path double macro-ops retire "simultaneously" shouldn't hurt much either, I think (after all they have the same dependency and latency, so they should be able to execute right after each other), but yeah, given that the only advantage of AVX-256 is lower instruction count / lower instruction size in the first place, there's probably not much point to AVX-256.
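
For what it's worth, here's a tiny intrinsics sketch of the trade-off (hypothetical loop, nothing Jaguar-specific in the code itself): the 256-bit version halves the instruction count, but on a core that cracks each 256-bit op into two macro-ops it shouldn't execute any faster than the 128-bit one.

[code]
/* Same vector add with AVX-128 and AVX-256 intrinsics.
 * Assumes n is a multiple of 8; purely illustrative. */
#include <immintrin.h>

void add_avx128(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
    }
}

void add_avx256(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
    }
}
[/code]

(iirc gcc even has a -mprefer-avx128 tuning option for exactly this kind of target.)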

Another thing I noticed is the L2 latency is pretty damn bad.. no less than 25 cycles load to use (I guess that includes the L1 miss cycles as well).. so at best no better than Bobcat, but maybe slightly worse?
Bobcat was usually listed as 17 cycles, though afaik not in official documents. But still, this looks worse. Bandwidth, though, is increased a lot, as it can deliver 4x16 bytes per cycle vs. the 8 bytes per clock of Bobcat (but, because it is shared, those 4x16 bytes are required to hit all 4 banks - I guess this will hurt latency a bit more in practice).

I also found the description of denormal penalties fairly insightful; it's interesting that you get a denormal input penalty when the value was loaded with a mov instruction but not when the load was issued directly by a float arithmetic instruction. And you also get similar penalties when using denormals in logic ops - obviously there's some internal conversion going on.
Some instructions also got notably faster. OK, we knew about the integer divider already (roughly twice as fast as previous int-domain divs), but omg, 2 popcnt/lzcnt per clock with 1-cycle latency? palignr was also very slow before (essentially too slow to be useful except, of course, for compatibility reasons) and now executes at 2 per clock (I guess it being so slow before was a result of the SIMD units being 64 bits wide, so ops which need both halves of a 128-bit value in a dependent way are not trivial). There are only very few instructions where you'd think they could have done better (e.g. pshufb and pblendvb/blendps aren't quite as fast as ideal, but not terrible either; otherwise it's got crazy good shuffle-like stuff). It's also got _very_ good horizontal operations like haddps (worlds better than Bulldozer, and in fact better than even Ivy Bridge - even DPPS, which is quite a mess on Bulldozer, looks quite decent).
And it keeps the incredible 2-cycle SIMD mul latency - still wondering how AMD does this when they only manage 3 cycles for an add... The SIMD divide unit got 2 cycles higher latency than on Bobcat (for the scalar case; for 128-bit it's twice as fast), but overall, for just 2 execution pipes, the SIMD unit looks very good (granted, FMA would have been nice, but you can't have everything).
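
If anyone wants to poke at the denormal penalties themselves, here's a rough sketch of the usual approach - time a loop whose operands stay denormal, then flip FTZ/DAZ in MXCSR and compare. The values and iteration count are arbitrary, and it doesn't capture the mov-vs-arithmetic-load distinction from the manual.

[code]
/* Rough denormal-penalty probe: time denormal arithmetic with and
 * without FTZ/DAZ. Assumes scalar SSE math (the x86-64 default). */
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */
#include <stdio.h>
#include <time.h>

static double timed_loop(float seed)
{
    volatile float x = seed;             /* denormal when seed ~ 1e-42f */
    clock_t t0 = clock();
    for (long i = 0; i < 100000000L; i++)
        x = x * 0.999f + seed;           /* converges to ~1e-39, still denormal */
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    float denorm = 1e-42f;               /* well below FLT_MIN */
    printf("denormals honoured: %.2f s\n", timed_loop(denorm));

    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    printf("FTZ/DAZ enabled:    %.2f s\n", timed_loop(denorm));
    return 0;
}
[/code]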
 