The AMD Execution Thread [2007 - 2017]

Another thing is, a lot of even older scene demos are not loading due to some OpenGL or DX functions not being properly supported by the Intel driver.
In the case of DirectX the vast majority of these are app issues (and you're overstating how often it comes up). You would not believe how many apps I've seen that basically do "if (NV or AMD) { working code; } else { broken code we never test; }". Simply spoofing the device ID often causes it to work just fine on Intel.
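To make that concrete, the anti-pattern usually looks something like the hypothetical D3D11/DXGI startup check below. The vendor IDs are the standard PCI ones; everything else is made up for illustration, not lifted from any particular title:

Code:
#include <windows.h>
#include <dxgi.h>

// Well-known PCI vendor IDs.
static const UINT VENDOR_NVIDIA = 0x10DE;
static const UINT VENDOR_AMD    = 0x1002;
static const UINT VENDOR_INTEL  = 0x8086;

// Hypothetical sketch of the vendor check some apps bake in. Spoofing the
// reported VendorId pushes an Intel GPU onto the well-tested NV/AMD path,
// which is why that often "fixes" things.
bool UseWellTestedPath(IDXGIAdapter* adapter)
{
    DXGI_ADAPTER_DESC desc = {};
    if (FAILED(adapter->GetDesc(&desc)))
        return false;

    if (desc.VendorId == VENDOR_NVIDIA || desc.VendorId == VENDOR_AMD)
        return true;   // the path the developers actually test

    return false;      // everyone else gets the neglected fallback
}

The point being that nothing in the D3D feature set forces this; it's purely a QA-coverage decision on the app side.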

Obviously baytrail's GPU isn't anything to write home about but the sole game you mention (Trials Evo) isn't exactly a great PC port to start with. Popping textures and crap like that is all app-side... you're not going to get anything different at the same settings elsewhere. I'm guessing that the newer Trials Fusion - PC version done by Redlynx this time as well - runs better than Evo.

Anyways, I agree that baytrail isn't particularly capable of non-casual gaming (and that AMD should be notably better there), but let's not confuse that with architecture or driver situations. Baytrail is just a tiny, tiny GPU (4 EUs!) and there's not much more to it than that.

In terms of stuff that is playable though, give Broken Age and Defense Grid a try; both have touch interfaces and run well. Civ is playable if you're willing to tolerate a low frame rate. And obviously anything in the Windows store will run great even on baytrail... and there's the rub really. While it's easy for you and me to complain about how bad baytrail's GPU is compared to laptops/desktops, sadly it's still middle of the road compared to most tablets. The reality is that tablets and tablet games are still pretty low-end in terms of requirements and even baytrail will run that level of stuff just fine.

Obviously this will all get better in the next few years and Mullins is a step in the right direction there. Interested to see power/battery numbers for it in any case.
 

I do know how demo scene productions are done; I was actively involved in the '90s on both the Amiga and PC side, where some code wouldn't even run on, let's say, an A500 even though it was developed on another A500 with a just slightly different motherboard revision. So I take your point that in many cases this is down to the engine code itself. :)
But the point stands that quite a few demos from my collection will not run on Intel hardware, which is a shame :(

Now, regarding the Trials game, I think you misunderstood me. I don't see texture pop-in due to megatexturing; I see a horrible lack of precision when rendering them (Z-buffer?), causing parts of them to constantly fade in and out.

Here is a link to a YouTube video showing the problem:

https://www.youtube.com/watch?v=nfuaXQz6WFw

(Thanks to the game's built-in YouTube exporter, which seems to use only one CPU core to encode video, making the process take forever [7 minutes] on the Z3770 for the video posted above!)

When playing on low settings these artefacts are a lot more common, but rendering to YouTube happens in HQ, so it's not exactly what I see. Besides, it looks like it renders in slow motion, but that is the actual game speed at that setting!
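For what it's worth, here's a rough sketch of the Z-buffer theory. The near/far planes are made-up values (I have no idea what the game actually uses); it just shows how the world-space resolution of a 24-bit perspective depth buffer falls off with the square of the distance:

Code:
#include <cstdio>

// World-space depth resolution of a 24-bit depth buffer under a standard
// perspective projection. Near/far planes are assumed, illustrative values.
// One LSB of stored depth at distance z corresponds to roughly
// dz ~= z^2 * (far - near) / (far * near * 2^24) world units.
int main()
{
    const double nearZ = 0.1, farZ = 1000.0;   // assumed planes
    const double steps = 16777216.0;           // 2^24 depth values

    for (double z = 1.0; z <= 1000.0; z *= 10.0) {
        double dz = z * z * (farZ - nearZ) / (farZ * nearZ * steps);
        printf("distance %8.1f -> resolution %.6f units\n", z, dz);
    }
    return 0;
}

If the game pulls the near plane in very close or uses an unusual depth setup, that precision gets much worse at range.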

BTW thanks for reminding me about Defense Grid! My favourite tower defence game ever! I must try it on a tablet :)
 
I think there's a decent argument drawing from the results of the revisions to the architecture and AMD's spreading out full design validation across multiple phases of the same basic chip (Trinity->Richland, Jaguar->Puma) that AMD can't really expend the necessary resources to "fix" Bulldozer, or that if it could that it wouldn't have the best ROI.

Bulldozer may have been a poor call at 32nm, but the fab spinoff and the lack of high-performance, MPU-focused processes from AMD's fab partners at the upcoming nodes make me think Bulldozer is now fundamentally the wrong architecture.
32nm SOI is the high-water mark for AMD's CPU circuit performance, on a process that basically evolved with the CPU, which is something Bulldozer seriously needs in order to justify much of its design.
The future is cell-phone processes that give jack-all consideration to AMD's CPU needs, especially when the GPU is the only really compelling element.

AMD isn't done degrading the starting point for its CPUs either, not with the likely move to even more bog-standard bulk than the semi-specialized process Kaveri uses, and a lowered TDP for Carrizo.

The resources required for building the infrastructure for, and validating, an x86 server/desktop/laptop design are also incredibly high for a situation where AMD's fortunes in most of those markets are so poor, and potentially sapped by declining growth in some of them.
There's simply so much engineering that AMD has fallen behind on, and a number of partners it has burned severely over the years, that even a return to competitive form would be hampered by slow uptake.

If AMD is to fix Bulldozer to conform to the new realities going forward, I'm not sure it's a fix as much as a replacement or abandonment of its line. BD isn't good for any of the markets it was supposed to address, and I don't think AMD has the chops to produce a single line that can straddle this many segments.

Bulldozer was pretty bad, but is there anything in Steamroller that strikes you as ill-suited to the needs of modern PCs? It seems like a fairly sound design to me. It's hard to make an accurate assessment of the core itself due to the fact that there's only one implementation of it so far, i.e. Kaveri, but given the latter's atrocious memory latency, I'm inclined to believe that Steamroller itself is a decent core.

It's not as wide as Haswell and can't compete with it on a per thread basis, but that's not absolutely necessary to design good PC chips. As for Carrizo's lowered TDP, that's only a potential issue for high-end desktop APUs, which aren't a very big part of AMD's business anyway. Besides, in practice 95W Kaveris seem to draw a good bit less, and aren't much faster than 65W ones, so it might not even be an issue at all.
 
But the point stands that quite a few demos from my collection will not run on Intel hardware, which is a shame :(
Yeah, fair enough, although I'd narrow that to "don't run well on *baytrail*" (unless you've tried them on other Intel hardware). Obviously artifacts of any kind suck; I'm just saying that properly attributing the issue (pointing fingers) would take some debugging :) Ivy bridge drivers (roughly what baytrail uses) are fairly solid at this point.

Definitely something weird going on with the texture streaming, etc. Like I said though, I don't have a lot of faith in that PC port. I had no end of issues on my NVIDIA card with it too. Fusion is much better that way.

Besides it looks like it renders in slow motion, but that is actual game speed at that setting!
Yeah Trials currently doesn't frame skip at all... it renders everything as if it was running at 60fps. If the machine can't handle it, you just get slowed down time :) Frankly this isn't a game that is going to be fast enough on that tablet to start with so there being bugs is a bit academic.
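To illustrate what that looks like (a hypothetical loop, not RedLynx's actual code): if the simulation always advances by one fixed 60Hz tick per rendered frame, with no catch-up, game time simply runs slower than wall-clock time whenever a frame takes longer than ~16.7 ms:

Code:
#include <chrono>
#include <thread>
#include <cstdio>

// Hypothetical game loop with no frame skipping: exactly 1/60s of simulation
// per rendered frame, no matter how long the frame actually took. On a GPU
// that can't hold 60fps you get slow motion instead of dropped frames.
int main()
{
    using clock = std::chrono::steady_clock;
    const double dt = 1.0 / 60.0;   // fixed simulation step per frame
    double gameTime = 0.0;
    const auto start = clock::now();

    for (int frame = 0; frame < 300; ++frame) {
        gameTime += dt;                                              // always one tick
        std::this_thread::sleep_for(std::chrono::milliseconds(33));  // stand-in for a ~30fps "render"
    }

    const double wall = std::chrono::duration<double>(clock::now() - start).count();
    printf("game time %.1fs vs wall time %.1fs -> everything runs %.1fx too slow\n",
           gameTime, wall, wall / gameTime);
    return 0;
}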

BTW thanks for reminding me about DefenseGrid! My favourite tower defence game ever! I must try it on a tablet :)
Yeah it's actually pretty nice - they added a full touch interface/UI and everything! Give it a shot.

Android's got a big library of apps that are optimized for touch interfaces and tablet form factors. This library on Windows pales in comparison.
Indeed, but as I've personally discovered, the vast majority of stuff that I do is on the web, and some in apps like twitter, facebook, etc. (all of which have Windows apps now too). As strange as it is to say this, IE11 on my Venue 8 Pro puts other tablets/browsers to shame... Indeed my 2nd gen Nexus 7 mostly sits on the shelf these days. Beyond a slightly nicer screen and Qi charging, everything else about the V8P is better.

Certainly for the folks that like to play tons of touch games and such on their tablets, Windows isn't comparable to Android/iOS, but all the basics are solid these days.
 
What can you do on an Android tablet which cannot be done on Windows?

Doesn't really matter if Windows tablets are superior to Android tablets or not. One is selling boatloads, the other isn't. AMD's failure, for whatever reason, to get in that market will cost them in the future.
 

Look, they need something. If it's not a fixed Bulldozer scaled down, then it needs to be a Puma scaled up. They have nothing competitive in the market from 2 watts up to 100 watts at this point.

I'm looking at Anand's preview of Puma+, and at 4.5 watts it's faster than Silvermont but still uses more power, and we get Merrifield/Moorefield in a few months, which will push the lead further towards Intel.

They just aren't competitive CPU-wise anywhere. They have no high-end CPU; heck, they really have nothing interesting past $150. They have no tablet wins out there either.

They really need to pair up with Vizio or someone and create a line of AMD tablets from 4.5 watts up to 15 watts.
 
Doesn't really matter if Windows tablets are superior to Android tablets or not. One is selling boatloads, the other isn't. AMD's failure, for whatever reason, to get in that market will cost them in the future.

I think that comes down to pricing. Most people I know have $200-or-under Android tablets. The only Windows one I know of in that price range is the Venue Pro 7.


I know I can run BlueStacks on my Surface Pro and get to use any Android program on it.
 
Bulldozer was pretty bad, but is there anything in Steamroller that strikes you as ill-suited to the needs of modern PCs?
In a vacuum, for the restricted subset of the Bulldozer core's target workloads, there's nothing that is unacceptable for most standard PCs. Steamroller isn't targeting servers that much, so a good chunk of Bulldozer's design choices, like the hefty L1, become questionable (and the limited associativity and aliasing issues of that L1 make it somewhat suspect even for servers).

In context, Trinity or Richland are equally acceptable, and an i3 or i5 is equally acceptable or superior for roughly the same price or less, while being more efficient and more profitable.
Implementation changes to the processor broke socket backwards compatibility, so a significant swath of the market that thinks Steamroller-level performance is acceptable would not find Kaveri acceptable.

For gaming rigs, even with Mantle, AMD's frame times are obviously more variable than an i5's in games like BF4 that have sized their CPU load to run acceptably with DX11. I don't trust a chip that looks fit to flop over once games decide to actually do something with the CPU power freed up by Mantle.

It seems like a fairly sound design to me. It's hard to make an accurate assessment of the core itself due to the fact that there's only one implementation of it so far, i.e. Kaveri, but given the latter's atrocious memory latency, I'm inclined to believe that Steamroller itself is a decent core.
Kaveri's biggest improvements are where it can reverse the CMT underpinnings of Bulldozer, and its weaknesses are where it cannot.
The cache subsystem is not that great, although it's not subject to some very bad corner cases like BD's very constrained write throughput or Trinity's unexplained terrible 256-bit load throughput.
The FPU was only ever competitive with Sandy Bridge when in an octo-core model, and with AVX2 the promotion of integer SIMD to full-width leaves that little sliver behind.


It's not as wide as Haswell and can't compete with it on a per thread basis, but that's not absolutely necessary to design good PC chips.
It competes with Nehalem, and cannot justify itself against its predecessors in the desktop.
It is currently a null offering for mobile. It's still a speed racer design with coarse module-level power gating.
Bulldozer-line cores still seem to have problems reaching the market, and they generally have plenty of not-so-good salvage bins.
That may come down to process problems, and Llano is a sign that the alternative was worse, but I think part of it is that Bulldozer cores at a fundamental level require circuit performance in a range that is outside the comfort zone of any process AMD will see for a very long time.

As for Carrizo's lowered TDP, that's only a potential issue for high-end desktop APUs, which aren't a very big part of AMD's business anyway. Besides, in practice 95W Kaveris seem to draw a good bit less, and aren't much faster than 65W ones, so it might not even be an issue at all.
The lack of performance scaling from the ostensible middle-range to high end of the design is a problem itself.
There are designs that do not have this problem, even from AMD itself.

AMD eked out better power efficiency at more modest clocks, and the 28nm process it uses purposefully sacrifices high-frequency scaling. The Steamroller pipeline is still a pipeline that was a disappointment when it could hit 4.2 GHz, much less one that turbos to ~3 GHz.
(Correction: that's for one of the A8 SKUs; the A10 turbos up to 4 GHz.)

I'm really not sure where AMD can go from there with Bulldozer.
Apple's A7 is able to get good performance for its niche because it doesn't pretend that it could someday scale to 4.5 to 5 GHz.
On the other hand, I'm not sure what AMD can do to justify building a replacement with the same set of costly requirements and cross-purposes.
Its biggest growth areas are semicustom consoles, emerging market small systems, and maybe physicalized dense servers, markets that are cost-conscious and noteworthy for having lower standards in terms of engineering, performance, reliability, and validation.
 
In a vacuum, for the restricted subset of the Bulldozer core's target workloads, there's nothing that is unacceptable for most standard PCs. Steamroller isn't targeting servers that much, so a good chunk of Bulldozer's design choices, like the hefty L1, become questionable (and the limited associativity and aliasing issues of that L1 make it somewhat suspect even for servers).

In context, Trinity or Richland are equally acceptable, and an i3 or i5 is equally acceptable or superior for roughly the same price or less, while being more efficient and more profitable.
Implementation changes to the processor broke socket backwards compatibility, so a significant swath of the market that thinks Steamroller-level performance is acceptable would not find Kaveri acceptable.

For gaming rigs, even with Mantle, AMD's frame times are obviously more variable than an i5's in games like BF4 that have sized their CPU load to run acceptably with DX11. I don't trust a chip that looks fit to flop over once games decide to actually do something with the CPU power freed up by Mantle.

Kaveri's biggest improvements are where it can reverse the CMT underpinnings of Bulldozer, and its weaknesses are where it cannot.
The cache subsystem is not that great, although it's not subject to some very bad corner cases like BD's very constrained write throughput or Trinity's unexplained terrible 256-bit load throughput.
The FPU was only ever competitive with Sandy Bridge when in an octo-core model, and with AVX2 the promotion of integer SIMD to full-width leaves that little sliver behind.

No argument there.

It competes with Nehalem, and cannot justify itself against its predecessors in the desktop.
It is currently a null offering for mobile. It's still a speed racer design with coarse module-level power gating.
Bulldozer-line cores still seem to have problems reaching the market, and they generally have plenty of not-so-good salvage bins.
That may come down to process problems, and Llano is a sign that the alternative was worse, but I think part of it is that Bulldozer cores at a fundamental level require circuit performance in a range that is outside the comfort zone of any process AMD will see for a very long time.

Is it really a speed-racer, though? It's still somewhat narrow but the OoO window is now quite deep, pipeline length is moderate, and clock speeds aren't very high anymore. I'd expect future iterations of the design to further maintain that trend. If the various rumors about Excavator being (much) wider are true, there may not be much of Bulldozer left in it, apart from the shared FPUs and fetch stage.

As for the granularity of power gating, I think we should bear in mind that in terms of silicon area, if not transistor count, a Bulldozer module is comparable to a Sandy Bridge core. Still, given what we've seen of Mullins, there's reason to believe that Kaveri is leaving a good bit of performance on the table, for the same reasons that Temash was, or similar ones. Perhaps AMD can pull off a "big Mullins" with Carrizo, if to a lesser extent (Kaveri has acceptable power management, unlike Temash).

The lack of performance scaling from the ostensible middle-range to high end of the design is a problem itself.
There are designs that do not have this problem, even from AMD itself.

Is it really that unusual, though? Once again a Bulldozer module is comparable to a Sandy Bridge core in terms of silicon area. Dual-module BDish designs don't scale from 15W to 100W, but then again neither do dual-core Sandy Bridge (or Ivy/Haswell) ones. Obviously, the latter do much better at just about any envelope, but it takes a quad-core (or quad-module) to get good scaling above 50W or so.

AMD eked out better power efficiency at more modest clocks, and the 28nm process it uses purposefully sacrifices high-frequency scaling. The Steamroller pipeline is still a pipeline that was a disappointment when it could hit 4.2 GHz, much less one that turbos to ~3 GHz.
(Correction: that's for one of the A8 SKUs; the A10 turbos up to 4 GHz.)

I'm really not sure where AMD can go from there with Bulldozer.
Apple's A7 is able to get good performance for its niche because it doesn't pretend that it could someday scale to 4.5 to 5 GHz.
On the other hand, I'm not sure what AMD can do to justify building a replacement with the same set of costly requirements and cross-purposes.
Its biggest growth areas are semicustom consoles, emerging market small systems, and maybe physicalized dense servers, markets that are cost-conscious and noteworthy for having lower standards in terms of engineering, performance, reliability, and validation.

As you point out, Kaveri does relatively better at lower clock speeds (and hence lower power envelopes). And while it's basically a wash compared to Richland at 95~100W, it does much better at 45W. One would expect that difference to be even more pronounced at 35W, 25W and 15W. In fact, Richland never got below 17W, and never below 19W as a quad-core. Since most PCs sold today are laptops, I'd say that's a pretty good trade-off, even if it's less than exciting to desktop consumers looking for a high-end CPU.

To me, the path for the future is pretty clear: the target is now 15~35W and shifting downwards, so it's all about improving power-efficiency by any means necessary, and managing or lowering costs. This appears to be precisely what Carrizo is meant to do, and while very little is known about it at the moment, its integrated southbridge is a clear sign, I think, that AMD is heading that way.

We'll see what they can actually release, but I'd expect:
  • +0~10% performance/clock in integer,
  • +~50% performance/clock in SIMD FP if they do go 256-bit wide as is rumored, but that's not a given,
  • +5~10% CPU clock speed at 25W, much less at 65W (if anything),
  • +~30% in graphics performance, perhaps (from µ-architectural and design/process improvements alike, plus DDR4),
  • moderate power-efficiency improvements all-around, notably thanks to the integrated southbridge.

Then there's this:
[attached image: IMG0043802.png, showing the memory latency issue referenced below]

If they manage to fix this so far inexplicably horrible latency, they should get a nice boost all-around.

There are a number of rather sizable "ifs" above, but even if none of them turn out well, I think we're still looking at an attractive offer for many OEMs looking to build decent, affordable laptops. Adequate CPU performance + good graphics + good battery life + low platform cost = good deal, methinks.
 
Is it really a speed-racer, though? It's still somewhat narrow but the OoO window is now quite deep, pipeline length is moderate, and clock speeds aren't very high anymore.

It's a 16 FO4 design; that is, each stage has a maximum of 16 fan-out-of-four delays. Not as aggressive as the 14 FO4 used for IBM's POWER6, but still pretty bonkers. AMD can clock their CPUs as high as Intel's even though they have a massive process handicap; the price they pay is an (effectively) narrower design and simplified decision making in each stage compared to Intel, and even compared to the old K8-based designs.
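As a back-of-the-envelope illustration of what that budget means for the clock ceiling: cycle time is roughly (logic FO4 per stage + latch/clocking overhead) x the process's FO4 delay, so relative clocks can be compared without knowing any absolute process numbers. The 3 FO4 overhead and the deeper-logic comparison points below are assumed, illustrative figures; only the 14/16 FO4 values come from the discussion above.

Code:
#include <cstdio>

// Relative clock ceiling vs. logic depth per stage.
// cycle ~= (FO4_logic + FO4_overhead) * t_FO4, and t_FO4 cancels in a ratio.
int main()
{
    const double overheadFO4 = 3.0;                        // assumed latch + skew cost
    const double designs[]   = { 14.0, 16.0, 22.0, 28.0 }; // logic FO4 per stage

    const double refCycle = 16.0 + overheadFO4;            // normalize to the 16 FO4 case
    for (double fo4 : designs) {
        printf("%4.0f FO4/stage -> ~%.2fx the clock of a 16 FO4 design on the same process\n",
               fo4, refCycle / (fo4 + overheadFO4));
    }
    return 0;
}

The flip side, as noted, is that a 16 FO4 budget leaves very few gate delays per stage for actual decision making, hence the narrower, simpler stages.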

Cheers
 
Android's got a big library of apps that are optimized for touch interfaces and tablet form factors. This library on Windows pales in comparison.

Which isn't a factor for me. I actually went into the Windows Store and there are just too many applications there. And it was even worse when I was looking for apps on my Android phone. Too many applications. Too little quality assurance.

I spend more time researching whether an app is good and well implemented than I do actually using the Gdamn app.

I'm about to give up on using apps on mobile unless it's an extremely well-known and established app.

It's just not worth my time to bother figuring out which of the hundreds of thousands of apps are actually worth using.

Regards,
SB
 
I think what sells Android is mostly the endless torrent of casual games and social apps, and the peer pressure lock-in aspect connected to all of that. I'm not sure what MS can do to change this other than get more devices out there, including full x86 phones with real 8.1.
 
Look, they need something. If it's not a fixed Bulldozer scaled down, then it needs to be a Puma scaled up. They have nothing competitive in the market from 2 watts up to 100 watts at this point.
I agree that AMD would be in a better position with something more competitive.
Whether something founded on the base architecture of Bulldozer can be that something, whether AMD can make something better, or whether there is enough commercial upside to justify doing more than "leveraging" the IP and treading water: those are separate questions from whether it sucks to have a Bulldozer design as the best it can offer.

Is it really a speed-racer, though? It's still somewhat narrow but the OoO window is now quite deep, pipeline length is moderate, and clock speeds aren't very high anymore.
It's quite narrow, and its misprediction penalty is on the order of Northwood. It's not the most extreme speed demon, which I have sort of waffled on by calling a speed racer (not really a consensus term).

Failing to hit high clocks is not the same thing as not being a clock-optimized architecture. The pipeline is physically long, the caches small and miss-prone, and latencies out of whack because the design assumed it could physically reach the clocks that would compensate for the stripped down stages.


As for the granularity of power gating, I think we should bear in mind that in terms of silicon area, if not transistor count, a Bulldozer module is comparable to a Sandy Bridge core.
However, in terms of actual gating effectiveness and architectural power efficiency, Bulldozer clearly burns more power for the same area and fewer transistors and for worse performance.
Even if nominal area is equivalent, Intel made up for it with superior power management, process quality, and higher performance within that area.
The other side of that situation is that if you cannot manage equal quality in the same amount of active area, your design probably shouldn't create fundamental architectural reasons for keeping twice as much of it active.


In fact, Richland never got below 17W, and never below 19W as a quad-core. Since most PCs sold today are laptops, I'd say that's a pretty good trade-off, even if it's less than exciting to desktop consumers looking for a high-end CPU.
But Kaveri is a laggard for its mobile launch.
The supposedly more efficient A8 SKUs were paper-launched.
Should I blame the GCN architecture that has been able to hit mobile products or the latest iteration of a CPU architecture that has proven hostile to power-efficient implementation?

This appears to be precisely what Carrizo is meant to do, and while very little is known about it at the moment, its integrated southbridge is a clear sign, I think, that AMD is heading that way.
This is the only direction AMD can go, with the modification that adding the south bridge to a non-SOC process chip is more impactful than it is to add it to the already highly synthesized and SOC-oriented Jaguar implementations.
That AMD is polishing things again doesn't mean they aren't fighting their architecture for every step they're taking it in the opposite direction of what was intended years ago.
Cost-wise, the chip is going to gain area due to the southbridge, and it's a decent chunk of frequently poorer-density transistors. One possibility is mediocre per-die cost improvement, but incrementally higher ASP (relative to what it could command otherwise) due to platform-level cost savings.
Area could be saved, if the teased high density libraries get used, but that would also have profound impacts on the CPU once again.

I also saw rumors of integrated power regulation circuitry, which has an area cost and extra power consumption on-die but lower platform power. A well-implemented version can use the more responsive circuitry to increase the effectiveness of DVFS and gating. But that would be a complex thing to master and AMD has taken two silicon launches of the same chip for two different designs to validate less complicated things.
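As a toy model of why regulator responsiveness matters (every number below is invented, this is just the shape of the argument): if a voltage transition takes longer than the idle gaps in the workload, the core has to ride them out at the high operating point, and most of the dynamic-power savings DVFS could have delivered evaporate.

Code:
#include <cstdio>

// Toy DVFS model: 1 ms busy, 1 ms idle, repeated. A regulator that can't
// finish a transition within the idle gap leaves the core at the high
// voltage; a fast one drops it for nearly the whole gap.
// Power is modeled as ~V^2 in arbitrary units, which is crude but enough
// to show the trend.
int main()
{
    const double vHigh = 1.2, vLow = 0.8;     // assumed operating points
    const double busyMs = 1.0, idleMs = 1.0;  // assumed duty cycle
    const int    periods = 500;               // ~1 second of this pattern

    auto power = [](double v) { return v * v; };

    const double eFast = periods * (busyMs * power(vHigh) + idleMs * power(vLow));
    const double eSlow = periods * (busyMs * power(vHigh) + idleMs * power(vHigh));

    printf("fast regulator: %.0f, slow regulator: %.0f (%.0f%% saved)\n",
           eFast, eSlow, 100.0 * (1.0 - eFast / eSlow));
    return 0;
}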

Making Carrizo a true SOC, potentially porting it to completely standard 28nm, and possibly using high-density libraries (HDL) on the CPUs are all very different from what Bulldozer was expected to be implemented on.
There's probably going to be another clock hit, and we'll have to see how well the design can be manufactured.
It doesn't seem out of the question that clocks could be lower even at the lower TDP ranges, with the low end problematic because that is close to the upper range of architectures that are smaller, cheaper, and never had bespoke manufacturing processes to lose.
 
Cost-wise, the chip is going to gain area due to the southbridge, and it's a decent chunk of frequently poorer-density transistors.

That will be compensated to some degree by the removal of one third of the on-die PCI-E controller, and it would make sense for the unused half of Kaveri's memory controller to be removed as well.
 
It's quite narrow, and its misprediction penalty is on the order of Northwood. It's not the most extreme speed demon, which I have sort of waffled on by calling a speed racer (not really a consensus term).

Failing to hit high clocks is not the same thing as not being a clock-optimized architecture. The pipeline is physically long, the caches small and miss-prone, and latencies out of whack because the design assumed it could physically reach the clocks that would compensate for the stripped down stages.

Fair points, but small caches? I guess the L1D is quite small (16KB, per core) but the L1I and L2 are respectively 96KB and 2MB in Steamroller, both shared by the entire module. I haven't seen any miss rate data, but it could be that I've missed something.

However, in terms of actual gating effectiveness and architectural power efficiency, Bulldozer clearly burns more power for the same area and fewer transistors and for worse performance.
Even if nominal area is equivalent, Intel made up for it with superior power management, process quality, and higher performance within that area.
The other side of that situation is that if you cannot manage equal quality in the same amount of active area, your design probably shouldn't create fundamental architectural reasons for keeping twice as much of it active.

I wonder if those fundamental reasons are still as problematic now as they were with Bulldozer/Piledriver. There's not all that much hardware shared anymore, so I wonder how difficult it would be to have per-core power gating. Obviously, anything shared would have to remain on a per-module basis.

But Kaveri is a laggard for its mobile launch.
The supposedly more efficient A8 SKUs were paper-launched.
Should I blame the GCN architecture that has been able to hit mobile products or the latest iteration of a CPU architecture that has proven hostile to power-efficient implementation?

Kaveri is kind of a laggard everywhere. AMD claims it's because of strong demand for current SKUs in China. I'm not sure I believe that. I think the currently available SKUs were launched because they were the crappiest bins so they were easy to get out the door, and AMD wanted to ship something in 2013.

Anyway, HP has just leaked a mobile Kaveri SKU, which I compare here with the closest Richland equivalent:

A10-7300: 4 cores. 2.00/3.20 GHz (Base/Turbo) — 19W
A8-5545M: 4 cores. 1.70/2.70 GHz (Base/Turbo) — 19W
http://www.cpu-world.com/news_2014/...0_mobile_Kaveri_CPU_spotted_in_HP_laptop.html
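That works out to roughly +18% base clock and +19% Turbo within the same 19W envelope.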

No word on graphics yet, but that bodes pretty well for Steamroller at low TDPs. Combined with the increase in IPC, it should net a substantial performance boost.

It also raises questions about 15W SKUs, since at that TDP, Beema manages 2.00/2.40GHz. If AMD means to launch 15W quad-core Kaveris, they'll probably have lower base clocks and higher Turbo clocks. Steamroller's higher IPC should make up for the clock deficit in most cases but Beema might occasionally be faster. So yeah, that's kind of awkward. Or maybe this A10-7300 is not the best 19W bin, and AMD can actually do a little better, but I doubt it.

This is the only direction AMD can go, with the modification that adding the south bridge to a non-SOC process chip is more impactful than it is to add it to the already highly synthesized and SOC-oriented Jaguar implementations.
That AMD is polishing things again doesn't mean they aren't fighting their architecture for every step they're taking it in the opposite direction of what was intended years ago.
Cost-wise, the chip is going to gain area due to the southbridge, and it's a decent chunk of frequently poorer-density transistors. One possibility is mediocre per-die cost improvement, but incrementally higher ASP (relative to what it could command otherwise) due to platform-level cost savings.
Area could be saved, if the teased high density libraries get used, but that would also have profound impacts on the CPU once again.

I also saw rumors of integrated power regulation circuitry, which has an area cost and extra power consumption on-die but lower platform power. A well-implemented version can use the more responsive circuitry to increase the effectiveness of DVFS and gating. But that would be a complex thing to master and AMD has taken two silicon launches of the same chip for two different designs to validate less complicated things.

Making Carrizo a true SOC, potentially porting it to completely standard 28nm, and possibly using high-density libraries (HDL) on the CPUs are all very different from what Bulldozer was expected to be implemented on.
There's probably going to be another clock hit, and we'll have to see how well the design can be manufactured.
It doesn't seem out of the question that clocks could be lower even at the lower TDP ranges, with the low end problematic because that is close to the upper range of architectures that are smaller, cheaper, and never had bespoke manufacturing processes to lose.

I agree that AMD is probably fighting against the original BD concept just about every step of the way, and that's certainly a less pleasant situation than starting from a solid design, but as long as they can pull it off they should be OK. And so far, they're making progress (PD was much better than BD, SR is much better than PD up to 3~3.5GHz or so).

As far as HDLs are concerned, AMD claims 15-30% lower energy per operation for power-constrained designs, which I guess means at iso-frequency when said frequency isn't too high. There may well be a top clock speed hit once again, possibly compensated by IPC, but a 15~30% energy reduction in power-constrained designs can actually mean a clock speed gain, so that sounds like a win to me.

About Carrizo, I should point out that (to my knowledge) AMD has yet to specify whether the southbridge is integrated on die or just on package. Since the southbridge is apparently quite limited, I would guess the former, but it's not known yet.

I also wonder about the collision with the small core family at low TDPs (even with Kaveri, cf. above). But Kabini went all the way up to 25W (sort of) whereas Beema seems limited to 15W, at least for now. I think we'll probably see the Excavator's TDP range shift down and Puma v2's shift along with it. The dividing line below which big cores don't make sense anymore should just follow the same course.
 
Anyway, HP has just leaked a mobile Kaveri SKU, which I compare here with the closest Richland equivalent:

A10-7300: 4 cores. 2.00/3.20 GHz (Base/Turbo) — 19W
A8-5545M: 4 cores. 1.70/2.70 GHz (Base/Turbo) — 19W
http://www.cpu-world.com/news_2014/...0_mobile_Kaveri_CPU_spotted_in_HP_laptop.html

No word on graphics yet, but that bodes pretty well for Steamroller at low TDPs. Combined with the increase in IPC, it should net a substantial performance boost.

It also raises questions about 15W SKUs, since at that TDP, Beema manages 2.00/2.40GHz. If AMD means to launch 15W quad-core Kaveris, they'll probably have lower base clocks and higher Turbo clocks. Steamroller's higher IPC should make up for the clock deficit in most cases but Beema might occasionally be faster. So yeah, that's kind of awkward. Or maybe this A10-7300 is not the best 19W bin, and AMD can actually do a little better, but I doubt it.
I think that will be the slowest quad-core mobile Kaveri; its GPU is likely to have only 128 cores.
 
Fair points, but small caches? I guess the L1D is quite small (16KB, per core) but the L1I and L2 are respectively 96KB and 2MB in Steamroller, both shared by the entire module. I haven't seen any miss rate data, but it could be that I've missed something.
The L1 data caches are tiny and prone to bank conflicts.
The L1 instruction cache is large, but prone to aliasing problems because of its size and low associativity (some quick arithmetic on that below). The low associativity has been a source of curiosity all on its own.
The write-combining cache is small, and a bottleneck.
The L2 is better in terms of size and associativity, but its latency borders that of an Intel L3.
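To put numbers on the aliasing point (cache sizes per the figures above; the 3-way associativity and 64-byte lines are from AMD's Steamroller disclosures as I recall them, so treat them as assumptions), the basic set-associative arithmetic looks like this:

Code:
#include <cstdio>

// Set-associative geometry of the Steamroller L1I and why large size plus
// low associativity invites aliasing conflicts.
int main()
{
    const int sizeBytes = 96 * 1024;  // shared L1I per module
    const int ways      = 3;          // assumed associativity
    const int lineBytes = 64;         // assumed line size

    const int waySize = sizeBytes / ways;    // 32 KB per way
    const int sets    = waySize / lineBytes; // 512 sets

    printf("way size: %d KB, sets: %d\n", waySize / 1024, sets);
    printf("instruction addresses spaced a multiple of %d KB apart index the same set,\n"
           "so at most %d such lines can be resident at once.\n",
           waySize / 1024, ways);
    return 0;
}

With two threads' worth of hot code sharing those three ways, it doesn't take much for conflict misses to show up.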

Agner Fog's optimization guide also outlines some kind of multithreading penalty for cache throughput where the L1 throughput is reduced when multiple threads are running, even though they are considered non-shared resources. He was not sure what the reason was for that.


I wonder if those fundamental reasons are still as problematic now as they were with Bulldozer/Piledriver. There's not all that much hardware shared anymore, so I wonder how difficult it would be to have per-core power gating. Obviously, anything shared would have to remain on a per-module basis.
Power gating is supposed to electrically isolate the gated core from the rest of the system.
As long as something is shared, it keeps the cores from being isolated.

About Carrizo, I should point out that (to my knowledge) AMD has yet to specify whether the southbridge is integrated on die or just on package. Since the southbridge is apparently quite limited, I would guess the former, but it's not known yet.
My non-authoritative interpretation is that it is on-die, hence the discussion of the southbridge being disabled if Carrizo is in an FM2+ socket and the processor relies on the motherboard's chipset.
No need to inactivate it if it isn't there, and no reason to put it on a package destined for a socket that won't use it.
 
Anyway, HP has just leaked a mobile Kaveri SKU, which I compare here with the closest Richland equivalent:

A10-7300: 4 cores. 2.00/3.20 GHz (Base/Turbo) — 19W
A8-5545M: 4 cores. 1.70/2.70 GHz (Base/Turbo) — 19W
http://www.cpu-world.com/news_2014/...0_mobile_Kaveri_CPU_spotted_in_HP_laptop.html

No word on graphics yet, but that bodes pretty well for Steamroller at low TDPs. Combined with the increase in IPC, it should net a substantial performance boost.

Agreed, it seems like it should be a good improvement over Richland.
However, like Beema & Mullins, this is what AMD needed to have out a year ago.


I think that will be the slowest quad-core mobile Kaveri; its GPU is likely to have only 128 cores.
I highly doubt that. The A8-5545M it succeeds has 384 shaders, and an A10 chip with 3/4 of the die's shaders fused off would simply be bizarre.

I'd expect it to have 384 shaders clocked similarly to the A8-5545M, or possibly all 512 shaders clocked a bit lower if it is the top of the range Kaveri 19W part.
 
The L1 instruction cache is large, but prone to aliasing problems because of its size and low associativity. The low associativity has been a source of curiosity all on its own.
Isn't that effectively fixed in Steamroller?
The write-combining cache is small, and a bottleneck.
Isn't that also fixed in Steamroller?
The L2 is better in terms of size and associativity, the latency borders that of an Intel L3.
It's also the same size, so that kind of makes sense.
 