New AMD low power X86 core, enter the Jaguar

Hm in firefox I just ran it at 2000 fish and it's 55-60fps, solid 60fps at 1750. That's a 2500K at stock. My 6850 is at 62% activity.
 
probably more browser related?

Here, with a Sandy Bridge i3 and a Radeon 5k, it stays locked at 60FPS with 2000, using IE10.
CPU usage is 33% (of 4 threads)

I was able to get 2500 @ 60fps(occasional dips to mid 50s) using:

AMD Phenom II 955 @ 3.8Ghz
AMD 6850 @ 930 core and 4700mhz memory
Waterfox 18.0.1 as my browser.

I think this test is pretty shit for testing CPUs.
 
It's not a CPU test. It's an HTML5/GPU test. What was interesting is that I am CPU limited on both Firefox and IE.

GPU load here is 80%, but the strange thing is that the GPU clock stays at a lower power state (400MHz, while the regular clock is 755), at around 2200 I start seeing drops to mid 50's
 
My blackberry playbook runs it at 22fps with 1 fish. My 360 manages 55fps with 50 fish. My i7+7970 is locked at 60 with 2000 and every browser.
 
How likely is it that a 15W Kabini would come close to the performance of the 19W Trinity?

I'd guess the cpu would be around 10% slower in single thread and 10% higher in multi thread. Graphics might be ~75% of the performance of the Trinity.
 
Since Kabini has 4 cores at a similar TDP as the 2 core Zacate, does that mean power consumption (and heat output) would decrease significantly if only one core is at load? Or am I not understanding how multiple cores work?
 
It could, but some things may skew it: what if your load is an old game or a single-threaded 3D or GPGPU app, and the GPU uses up the TDP budget? (I don't know if the power management actually works like this!)

One more thing: the lone core has the 2MB L2 all to itself. So all of the L2 is still lit up, and the core gets higher performance and so uses a bit more power (but this sounds efficient and not bad at all, and if fewer memory accesses are made there's a power saving to factor in).

Otherwise your intuition still seems really correct to me.
Can cores be fully disabled? If so, there are savings from not running power through them.
Or do the cores eat so little power that the memory controller/northbridge, GPU, and the rest are very significant, so that using only one core while doing a fair amount of memory and peripheral I/O still uses quite some power?

One would need to do the precise bean accounting to answer your question.
 
Thanks, I'd never really considered the implications of 4 cores beyond increased multithreading performance. I'm interested to see the power consumption results in real world usage, and I own an E-450 laptop so significant savings would be pretty exciting.
 
How likely is it that a 15W Kabini would come close to the performance of the 19W Trinity?
I'd guess the cpu would be around 10% slower in single thread and 10% higher in multi thread. Graphics might be ~75% of the performance of the Trinity.
The Piledriver core (the CPU used in Trinity) is designed for high clock speeds (turbo up to 4.3 GHz, overclock up to 8 GHz). In order to reach such high clocks, several sacrifices had to be made. The CPU pipelines had to be made longer, because there's less time to finish each pipeline stage (as clock cycles are shorter). The cache latencies are longer, because there's less time to move data around the chip (during a single clock cycle). The L1 caches are also simpler (less associativity) compared to Jaguar (and Intel designs). In order to combat the IPC loss from these sacrifices, some parts of the chip needed to be beefed up: the ROBs must be larger (more ILP is required to fill longer pipelines, and more ILP is required to hide longer cache latencies and the more frequent misses caused by lower L1 cache associativity) and the branch predictor must be better (since a long pipeline causes more severe branch mispredict penalties). All these extra transistors (and extra power) are needed just to negate the IPC loss caused by the high clock headroom.
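To make the branch predictor point concrete, here is a minimal sketch using the standard textbook CPI model (base CPI plus mispredict stall cycles). All the numbers (branch fraction, mispredict rate, the 14 and 20 cycle penalties) are hypothetical values picked for illustration, not measured Jaguar or Piledriver figures.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction including branch mispredict stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload and pipeline figures (illustrative, not measured).
BASE_CPI = 1.0          # CPI with perfect branch prediction
BRANCH_FRACTION = 0.20  # one instruction in five is a branch
MISPREDICT_RATE = 0.05  # same predictor quality on both pipelines
SHORT_PENALTY = 14      # mispredict penalty of the shorter pipeline, in cycles
LONG_PENALTY = 20       # mispredict penalty of the longer pipeline, in cycles

short_cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, SHORT_PENALTY)
long_cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, LONG_PENALTY)

# Mispredict rate the long pipeline needs just to match the short one.
break_even_rate = MISPREDICT_RATE * SHORT_PENALTY / LONG_PENALTY

print(f"short pipeline CPI: {short_cpi:.3f}")
print(f"long pipeline CPI:  {long_cpi:.3f}")
print(f"break-even mispredict rate for the long pipeline: {break_even_rate:.3%}")
```

Under these assumed numbers the longer pipeline has to mispredict noticeably less often only to break even per clock, which is the "extra transistors just to negate the loss" argument in miniature.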

The Jaguar compute unit (4 cores) has the same theoretical peak performance per clock as a two-module (4 core) Piledriver. Jaguar has shorter pipelines and better caches (less latency, more associativity). Piledriver has slightly larger ROBs and a slightly better branch predictor, but these are required to negate the disadvantages in the cache design and the pipeline length. The per-module shared floating point pipeline in Piledriver is very good for single threaded tasks, but for multithreaded workloads the module design is a hindrance because of various bottlenecks (shared 2-way L1 instruction cache and shared instruction decode). Steamroller will solve some of these bottlenecks (before the end of this year?), but it's still too early to discuss it (with the limited information available).

Jaguar and Piledriver IPC will be in the same ballpark. However, when running these chips at low clocks (<19W), all the transistors spent in the Piledriver design to allow the high clock ceiling are wasted, but all the disadvantages are still present. Thus Piledriver needs more power and more chip area to reach performance similar to Jaguar's. There's no way around this. The Jaguar core has better performance per watt.

My guess:
- Multi threaded workload (GPU stressed): Jaguar wins by +10% (because of clock difference: 1.6 GHz vs 1.815 GHz)
- Multi threaded workload (GPU idle): Tie (depends highly on how much PD can turbo clock in this case)
- Single threaded workload (GPU idle): PD wins by up to +50% (single core in PD module runs up to 20% faster when the other core is idling + PD can turbo clock to 2.4 GHz when only a single core is active).
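The guesses above are mostly clock ratio arithmetic, which can be sketched as follows. This assumes equal IPC (the "same ballpark" claim), assumes that 1.815 GHz is the Jaguar clock and 1.6 GHz the Piledriver base clock (the post does not say which is which, but Jaguar is the one said to win on clocks), and assumes Jaguar does not turbo.

```python
# Clock figures quoted in the post; the assignment to each chip is an assumption.
JAGUAR_CLOCK_GHZ = 1.815
PD_BASE_CLOCK_GHZ = 1.6
PD_TURBO_CLOCK_GHZ = 2.4
PD_SINGLE_CORE_UPLIFT = 1.20  # "up to 20% faster when the other core is idling"

def relative_throughput(clock_a_ghz, clock_b_ghz, ipc_ratio=1.0):
    """Throughput of chip A relative to chip B for a given IPC ratio (A over B)."""
    return (clock_a_ghz / clock_b_ghz) * ipc_ratio

# Multi threaded, GPU stressed: Piledriver is assumed to sit at its base clock.
jaguar_vs_pd_multi = relative_throughput(JAGUAR_CLOCK_GHZ, PD_BASE_CLOCK_GHZ)

# Single threaded, GPU idle: Piledriver at full turbo with the whole module to itself.
pd_vs_jaguar_single = relative_throughput(
    PD_TURBO_CLOCK_GHZ, JAGUAR_CLOCK_GHZ, ipc_ratio=PD_SINGLE_CORE_UPLIFT
)

print(f"Jaguar vs PD, multi threaded: {jaguar_vs_pd_multi - 1:+.1%}")
print(f"PD vs Jaguar, single threaded (upper bound): {pd_vs_jaguar_single - 1:+.1%}")
```

Under these assumptions the multi threaded gap lands a little above the +10% guess, and the single threaded figure is a best case ceiling rather than a prediction, since it assumes full turbo and the full 20% uplift at the same time.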

GPU performance difference is impossible to estimate at this point, because we do not yet know exact details about the Temash/Kabini APU configurations. However Jaguar core takes (considerably) less die space and consumes slightly less power than a Piledriver core (of similar performance). This means that AMD could equip the Jaguar based APU with a more powerful GPU than Trinity at the same TDP limit. It's all about market segmentation. If AMD plans to place the high end Kabini APUs against Intel's Ultrabook chips, we might see a more powerful GPU (AMD has technology available for this already, just look at the PS4 design). However if AMD sees Kabini as a low end laptop / netbook platform, they are likely going to focus on making a small chip that is cheap to produce, and limit the GPU options to low performance models.
 
That is the whole issue with the entire AMD APU approach. It is OK if Kabini is 4 cores + 2 CUs, as it is likely to end up in pretty cheap laptops and in some tablets (lower power versions).
What is not OK, and it has been that way since Llano (Trinity didn't change that), is that AMD packs quite some GPU power into their APUs and doesn't give the chip the bandwidth to feed it properly.
Usually the APUs end up with less than half the bandwidth the GPU has to play with in its discrete incarnation. (I have a Llano + Redwood laptop; the Redwood has more than twice the bandwidth the IGP has, and the difference in performance is impressive. I actually pass on dual graphics and mostly use the discrete GPU for games, so the integrated Redwood is useless...)
I think the same would hold true for a setup based on Trinity + Turks.

AMD has done nothing to maximize the performance of otherwise really impressive chips.
I don't know if they could blend DDR3 and GDDR5 (so having 2 different memory controllers on the same chip); it would mean that the GDDR5 would need to be soldered onto the motherboard in both laptops and desktops. Most likely a bit of a headache.
They had other options though; imo AMD should have moved to a 256-bit bus using DDR3 as soon as they shipped their first APUs, so with Llano.
It was not intended to be low end but ended up being so for the reason I stated above: the silicon they invest in the GPU is 'useless' (or close to it) for gamers, as you still need a discrete GPU most of the time. I think that either Llano or Trinity would have been fine with a tiny GPU as in Zacate or Kabini (1 or 2 SIMDs / CUs).
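For a rough sense of the bandwidth gap being described, peak memory bandwidth is just bus width times transfer rate. The sketch below uses assumed configurations (128-bit DDR3-1600 for an APU, 128-bit GDDR5 at 4000 MT/s for a discrete card, and a hypothetical 256-bit DDR3-1600 APU); none of these exact figures come from the thread.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfer_rate_mt_s):
    """Theoretical peak bandwidth in GB/s: bytes per transfer times MT/s."""
    return bus_width_bits / 8 * transfer_rate_mt_s / 1000

# Assumed example configurations (not taken from the posts above).
configs = {
    "APU, 128-bit DDR3-1600": (128, 1600),
    "discrete GPU, 128-bit GDDR5 @ 4000 MT/s": (128, 4000),
    "hypothetical APU, 256-bit DDR3-1600": (256, 1600),
}

for name, (width_bits, rate_mt_s) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(width_bits, rate_mt_s):.1f} GB/s")
```

With these assumed numbers the discrete card has 2.5x the bandwidth of the IGP, and on an APU that bandwidth is also shared with the CPU cores. Doubling the DDR3 bus width closes most of that gap without needing GDDR5 on the motherboard.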

I hope that they will do something better based on Jaguar cores and their latest GPU architecture. Jaguar cores are tiny and cool, so they could pack quite some GPU power into a reasonably sized chip. What they need is bandwidth: either they invest in a 256-bit bus, or they should pass on investing a lot of silicon in the GPU for not that great a return (vs the same silicon backed with proper bandwidth).
I would not have bought my laptop for Llano; I bought it because of the discrete Redwood + GDDR5.
Imo there is no market for their APUs as they are balanced right now: they are not good enough for most gamers, and your average customer just doesn't care (and rightly so).

I'm not alone, it seems: they had this huge inventory write-off on Llano-based parts, and I'm not sure Trinity parts are popular either (or they are sold at a bargain to people who don't really care about GPU performance).

Other than that I'm happy that Jaguar cores turned out well, they need it, though I'm not sure they will sell the part with the margins they need to move forward.

I hope that Kaveri will come with a 256-bit bus too. To some extent I wonder if AMD is scared of competing with themselves (i.e. their discrete GPU sales); imo at this point they should not be, as they are getting nowhere fast :(
 
AMD packs quite some GPU power in their APUs and doesn't give the chip the bandwidth so it can be fed properly.

Perhaps they could use a small amount of eDRAM or something similar to increase bandwidth? even if it's just for the premium version i think it would be interesting.
 
I think the issue with that approach, eDRAM or SRAM as in the next Xbox, is that PCs are PCs.
A given amount of scratchpad memory may not work with a given game, or only at some resolutions, etc.
In the desktop realm, resolutions have been pretty high for quite some years, which implies a hefty amount of scratchpad, an amount that the silicon budgets of the time did not allow.

In the laptop realm it may have been doable, as a lot of laptops are ~720p. Though even at that resolution I would expect some games to be problematic; the G-buffer, for example, might not fit into the scratchpad.
The other issue is "is there a market" for gaming laptops? Maybe, but right now the market has to be pretty tiny, and the overall high prices are not helping.
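The scratchpad sizing concern can be sketched with simple render target arithmetic. The 32 MiB scratchpad size and the G-buffer layout (four 32-bit colour targets plus a 32-bit depth buffer, no MSAA) are assumptions chosen for illustration; real games vary widely.

```python
MIB = 1024 * 1024
SCRATCHPAD_BYTES = 32 * MIB  # hypothetical on-chip scratchpad size

def render_targets_bytes(width, height, bytes_per_pixel, target_count):
    """Total footprint of same-sized render targets, without MSAA or padding."""
    return width * height * bytes_per_pixel * target_count

# Assumed deferred-rendering layout: 4 colour targets + 1 depth target, 4 bytes each.
GBUFFER_TARGETS = 5
BYTES_PER_PIXEL = 4

for label, (width, height) in {"720p": (1280, 720), "1080p": (1920, 1080)}.items():
    size = render_targets_bytes(width, height, BYTES_PER_PIXEL, GBUFFER_TARGETS)
    verdict = "fits" if size <= SCRATCHPAD_BYTES else "does not fit"
    print(f"{label}: {size / MIB:.1f} MiB G-buffer, {verdict} in {SCRATCHPAD_BYTES // MIB} MiB")
```

With this layout the 720p G-buffer fits and the 1080p one does not, and a fatter G-buffer or MSAA would push even 720p over the limit, which is why a fixed scratchpad is awkward when the game and resolution are not known in advance.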

A move to something like that (scratchpad) would have required AMD to work really closely with game developers so that their games work well within the constraints the chips would have.
If they had managed that, such chips might have worked on the desktop, replacing their low-end gaming parts (à la HD x6xx).
Still, it would need a lot of developer effort, which would only come if AMD could assure them that it is a significant target (looking at the state of piracy in the PC realm).

Wider buses would have come at a cost, but since they work with software as it is written in the PC realm, they would imo have made the parts more worthwhile for the intended target (budget gamers, occasional gamers; I mean, who else cares for Redwood-and-above type of performance?).
As it turned out, you buy a Llano or Trinity, but even if you are on a tight budget you are likely to invest in a discrete GPU; that investment is imo what drives down the value of AMD APUs as perceived by their intended target. Ultimately AMD produced 200+ mm^2 chips (not cheap to produce) with half that silicon left untouched even by the intended users of the product.

The proposition looks like this (I haven't checked the prices for PC parts in quite a while, I hope I'm in the ballpark):
You buy a llano/trinity part: 120$ < x < 150$
You buy a redwood/Turk class of GPU (or just above): 80/90$
Total cost: 200$ - 240$

I'm not sure that AMD makes its best margins on that kind of GPU (Redwood/Turks class); quite the contrary, I would think margins are higher on higher-end parts.
For the APUs by themselves, looking at the size of the chip, I would think AMD does not make much money either (and that is if they sell what they produce, which did not happen with Llano at least).

Widening the bus would not add that much cost (for both chip and motherboard) but could have ensured that AMD sells the part at 180$ and above. So with mostly the same silicon spent, 30-60$ goes into margins, way more than the margins they might be making by selling a discrete GPU on top of the APU.
In the laptop realm, maybe AMD could have started something with relatively budget gaming laptops (imo Llano and Trinity fall short; average FPS is close to a useless metric, since in most PC games if you barely pull over 30fps on average you are pretty much guaranteed that the game experience is as smooth as a train wreck).
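The pricing argument above boils down to two subtractions, sketched here with the ballpark prices from the post. It deliberately ignores the added chip and motherboard cost of a wider bus, which the post assumes is small.

```python
# Ballpark prices in USD, taken from the post above.
APU_PRICE_RANGE = (120, 150)
DISCRETE_GPU_PRICE_RANGE = (80, 90)
WIDE_BUS_APU_PRICE = 180  # price the post suggests a wider-bus APU could command

def bundle_cost_range(apu_range, gpu_range):
    """Low and high total for buying an APU plus a discrete GPU."""
    return (apu_range[0] + gpu_range[0], apu_range[1] + gpu_range[1])

def extra_revenue_range(new_price, old_range):
    """Low and high extra revenue per chip from selling at new_price instead."""
    return (new_price - old_range[1], new_price - old_range[0])

bundle = bundle_cost_range(APU_PRICE_RANGE, DISCRETE_GPU_PRICE_RANGE)
extra = extra_revenue_range(WIDE_BUS_APU_PRICE, APU_PRICE_RANGE)

print(f"APU + discrete GPU: {bundle[0]}$ - {bundle[1]}$")
print(f"Extra revenue per wider-bus APU: {extra[0]}$ - {extra[1]}$")
```

The buyer would still pay less than for the APU plus discrete GPU bundle, while the whole difference goes to the APU rather than to a low margin discrete part.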

Now that they have Jaguar, the picture could be even better, as density seems way better on the TSMC process (in Llano the IGP is as big as Redwood, for example, but Llano is produced on a 32nm SOI process whereas Redwood uses TSMC's 40nm process). They need bandwidth either way: the edge they have over the competition with their "high" performance IGPs is moot, as those pieces of hardware are bandwidth starved.
 
Is this a 3.5W Temash?

http://www.cpu-world.com/news_2013/2013030301_AMD_A4-1200_Temash_APU_sighted.html

http://www.amd.com/us/products/notebook/tablets/Pages/tablets.aspx#/4

Power projections based on calculations carried out by AMD Performance Labs measuring total system and individual component power at Windows Idle and under various system loads while web browsing and/or viewing a 9:57 minute online video in h.264 format, viewed at 1080P setting at 100 nits.

The AMD “Larne” reference platform is projected to measure APU power at 1.2 W at idle, 1.40 W during web browsing, 2.35 W during video playback and .02 W during a system S3 “sleep” state.

The power projections are based on the “Larne” reference system with a configuration including the A4-1200 Dual Core 1.0GHz APU, AMD Radeon™ HD 8180 series graphics, 2GB DDR3-1066 system memory and Microsoft Windows 8.
 