AMD Carrizo / Toronto

If the claims pan out, things like cutting half of Kaveri's oversized memory IOs would be a decent area saver. The IOs for HBM should be very small, if they are there.
HDL might be a factor, but so could other things, such as whether the 28nm process differs from the semi-custom one Kaveri is on. A full transition to a density-optimized node could allow the GPU to shrink further, and would also encourage a cap on CPU clocks.

Should HBM pan out, however, AMD could be fighting for area savings, because the dimensions of the chip and the HBM modules have an impact on the interposer and the package footprint. Even if the cheapest products can't justify the cost-adder, unless AMD has two different implementations, the chip has to be suitable for both.

Good point, I hadn't thought of that.

It could still be a corner case for the architecture, if it takes serious binning to get there.

Well, I wouldn't know about that, but my guess is that the target for Carrizo is 15-35W. Anything below or above that is probably taking the design out of its optimal range.

It will be interesting to see whether Carrizo or Nolan/Amur performs better at 15W. I'm leaning towards the latter, because Beema already has the advantage (albeit a small one) over Kaveri with both on 28nm, while Nolan/Amur should be made on 20nm.
 
Excavator is supposed to be getting 256-bit FMA units. If it does, then there should be heaps of cases (assuming load/store bandwidth is sufficient) where it could exceed a 30% performance improvement; maybe not perf per watt, but absolute perf.
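A quick back-of-envelope on what doubling FMA width means for peak throughput. The core count and clock below are my own illustrative assumptions, not confirmed Excavator specs, and real gains depend entirely on the load/store bandwidth caveat above:

```python
# Rough peak-FLOPS arithmetic: doubling FMA width doubles the ceiling.
# Core count and clock are illustrative assumptions, not real specs.
def peak_gflops(cores, ghz, fma_width_bits, fp_bits=32):
    lanes = fma_width_bits // fp_bits   # SIMD lanes per FMA unit
    flops_per_cycle = lanes * 2         # FMA counts as multiply + add
    return cores * ghz * flops_per_cycle

before = peak_gflops(cores=4, ghz=3.5, fma_width_bits=128)  # 128-bit FMA
after  = peak_gflops(cores=4, ghz=3.5, fma_width_bits=256)  # 256-bit FMA
print(before, after)  # 112.0 224.0
```

That is only the theoretical ceiling, which is why a 30% real-world gain in friendly cases seems plausible while perf/watt is a separate question.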

Also, in the low-power ~19W area, Kaveri isn't that bad at all. I have an X1 Carbon with a 15W i5, and I have to use ThrottleStop to get non-spastic performance, given the massive roll-off in clocks as you load/use threads.
 
Excavator is a Bulldozer/Piledriver follow-on, not Jaguar, right?
Do we know much about the architecture?

If they can pull off a 30% performance improvement with reduced power and a smaller die area, could we be looking at something potentially able to foot it with Intel for desktop CPUs again?
 

It's a Bulldozer descendant, and no, it can't.
 
Yeah, looks like it's not until the one after Excavator that we might see something much better :(
 
If they can pull off a 30% performance improvement with reduced power and a smaller die area, could we be looking at something potentially able to foot it with Intel for desktop CPUs again?
No matter the architecture, Excavator is still on 28nm while its competitors will be manufactured on 14nm. 2015 won't be a very nice year for AMD's CPU business unit.
 
Excavator is a Bulldozer/Piledriver follow-on, not Jaguar, right?
Do we know much about the architecture?

If they can pull off a 30% performance improvement with reduced power and a smaller die area, could we be looking at something potentially able to foot it with Intel for desktop CPUs again?

Yes, it's an evolution of Steamroller, but +30% is at 15W. We're unlikely to see anything close to that on 65W desktop chips, except for graphics if HBM is indeed included.
 
Given that GCN in Kaveri would not cache any coherent lines, I wouldn't put much hope in this beyond pushing the bar for bandwidth efficiency up. It would be interesting to know how the system architecture compares with the SkyBridge APUs, as both fully support HSA as claimed.

Filters and directories would be more useful in the case where there are accesses to coherent memory that aren't actually cached. There was a paper concerning region-based tracking to reduce unnecessary snooping and arbitration.
For example, if an Onion+ access were able to reference a table of memory regions the CPU doesn't have in cache, it could be shunted to Garlic, which would be nice if any high-bandwidth memory solutions present themselves.
Even without high bandwidth, it saves spamming the CPU caches with snoop broadcasts.

For the current solutions, a major change in the other direction is less likely. Theoretically, CPU accesses could check which regions might have in-flight accesses over Onion or Onion+, or possibly which mappings the GPU can currently cache. If in the clear, the APU could provide a CPU fastpath that claws back some of the massive additional memory latency Kaveri heaped on everything. That could make up for a smaller L2.

Actual cache conflicts would still be slow, but that's a case that's not as interesting because getting a lot of hits in a scrawny L2 with a GPU's bandwidth means you're thrashing the CPU cache hierarchy completely. The exception might be HSA signal variables, but there could be mitigating circumstances for that.
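To make the region-filter idea above concrete, here is a toy sketch. The 4KB region size, table shape, and routing strings are my own assumptions for illustration, not AMD's actual design; the point is just that a GPU access to a region the CPU provably doesn't cache can skip the snoop broadcast:

```python
# Toy region-based snoop filter: track which regions have lines in the
# CPU caches, and route GPU accesses around the snoop path when safe.
# Region size and table layout are illustrative assumptions.
REGION_BITS = 12  # 4 KB regions (assumed)

class RegionFilter:
    def __init__(self):
        self.cpu_cached_regions = set()  # regions with CPU-cached lines

    def cpu_fill(self, addr):
        # CPU brings a line into its cache: mark the enclosing region.
        self.cpu_cached_regions.add(addr >> REGION_BITS)

    def route_gpu_access(self, addr):
        # If the CPU may cache this region, take the coherent (Onion+)
        # path and snoop; otherwise use the fast Garlic path, no snoop.
        if (addr >> REGION_BITS) in self.cpu_cached_regions:
            return "onion+ (snoop CPU caches)"
        return "garlic (no snoop needed)"

rf = RegionFilter()
rf.cpu_fill(0x1000)
print(rf.route_gpu_access(0x1abc))  # same region -> must snoop
print(rf.route_gpu_access(0x9000))  # untouched region -> fast path
```

A real implementation would also need to clear entries on eviction, which is where most of the hardware cost hides.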
 
I thought you meant GCN's caches themselves, but it makes sense if those structures oversee the entire system. I guess you have read AMD's papers on region coherence and its integration with the GPU, right? Those are clearly possible and likely the future direction for AMD, but I would still hold a little doubt about this happening on Excavator, as we are talking about huge changes in the cache hierarchy, which are usually tied to the CPU core family. Though speaking of cost, implementing the region directory should be acceptable on a chip like Carrizo, I assume.

P.S. There is a lightweight notification protocol in the PCIe standard that fits the description of the doorbell mechanism in HSA.

P.S.2. HSA has Component Local Memory, which lets applications bypass the coherence mechanism and potentially achieve higher performance. This would be the second exception.
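For what it's worth, the doorbell shape can be illustrated with a software-only analogy. Real HSA signals and PCIe lightweight notifications are hardware/protocol mechanisms; this sketch only shows the producer/consumer pattern they enable, with all names my own:

```python
# Software-only doorbell analogy: a producer updates a signal value and
# wakes waiters; a consumer blocks until the value reaches a target.
import threading

class Doorbell:
    def __init__(self):
        self._cond = threading.Condition()
        self._value = 0

    def ring(self, value):
        # Producer side: publish the new value and notify waiters.
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def wait_ge(self, target):
        # Consumer side: sleep until the signal value reaches target.
        with self._cond:
            while self._value < target:
                self._cond.wait()
            return self._value

db = Doorbell()
threading.Thread(target=lambda: db.ring(1)).start()
print(db.wait_ge(1))  # prints 1 once the producer has rung
```

The hardware versions exist precisely so the consumer doesn't have to burn a core polling for that value.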
 
The integrated southbridge probably means we can kiss FM2+ compatibility goodbye, but on the upside, it should save a bit of power and enable more compact form factors. The chip might even make it into a few tablets, if that's your cup of tea. I wonder how well all the southbridge I/O and ~4GHz logic are going to play together on the same die.

No, there was some news a few months ago regarding Carrizo and how it has fewer PCIe lanes (for graphics at least: x8 to a GPU card). It was said that when plugged into an FM2+ socket, the on-board southbridge is bypassed and the motherboard's southbridge is used instead. That sounds wasteful, but this way a single die covers all use cases.
 
The interesting part is that there's only 2MB of L2, vs. 4MB on Kaveri. One wouldn't expect this unless there were some way of compensating for the loss of cache, namely HBM.

Maybe the 2MB L2 per module from Bulldozer was crappy and slow anyway; it made some sense when paired with the L3 and used in server workloads. Bulldozer and Piledriver were server chips, sometimes used to build 64-thread servers (with the L3 assisting communication between the CPUs/dies).

But if you're not even trying to build a CPU with lots of cores and an L3, maybe it's time to junk that big but slow L2. I think AMD got their act together and made a decently lower-latency L2. If the L2 is half as big but much better, performance might well increase in the situations you care about?!
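A back-of-envelope AMAT (average memory access time) calculation shows how a smaller but faster L2 could break even or even win. Every hit rate and latency below is a made-up illustrative number, not a measured Kaveri or Carrizo figure:

```python
# AMAT in cycles for a two-level hierarchy in front of DRAM.
# All hit rates and latencies are made-up illustrative numbers.
def amat(l1_hit, l1_lat, l2_hit, l2_lat, mem_lat):
    # Miss traffic from L1 pays the L2 latency, and L2 misses pay DRAM.
    return l1_lat + (1 - l1_hit) * (l2_lat + (1 - l2_hit) * mem_lat)

# Hypothetical 4MB, high-latency L2 vs. a 2MB L2 that is faster but
# misses slightly more often.
big_slow   = amat(l1_hit=0.95, l1_lat=4, l2_hit=0.90, l2_lat=25, mem_lat=200)
small_fast = amat(l1_hit=0.95, l1_lat=4, l2_hit=0.87, l2_lat=15, mem_lat=200)
print(big_slow, small_fast)  # 6.25 6.05
```

With these (invented) numbers the halved L2 comes out slightly ahead, which is the whole argument: capacity is only one axis.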
 
It is true that losing the L3 for the AMD chips wasn't as devastating, in part because the L3 wasn't that good for the desktop and in part because the L2's size helped reduce trips to the L3 or memory.

The per-core L2 capacity has not been this small since 2003, and it would take a significant change in L2 architecture to make up for that.
The last time I can recall AMD halving the L2 and getting an overall improvement was moving the L2 on-die with Thunderbird.
Aside from that, the evolution of the L2s has been more modest, with big changes waiting on larger architectural refreshes to take place.

Absent that, the L2's design usually has knock-on effects on the memory pipeline it is integrated with, so I'm not sure it's worth the effort of retooling other portions of a design they're abandoning anyway.

I suppose AMD could be trying out a few new tweaks, but they could just as easily (ok, more easily) shrug and eat the CPU performance hit. Being a few percent worse on average really can't be a deal-breaker at this point if you're buying AMD.
 
Athlon II X4 would like to have a word with you.
Dual-core (single-module) APUs do ship with 1MB of L2 currently.

Anyway the performance will speak.
Having a CPU slightly slower than or equal to Steamroller would suck, though; the latter was not that much faster than what came before it (like what, 5%, counting the lower clocks?).

You can have an upgrade path if you buy a dual-core Richland now (cheap, and 3.9GHz to 4.1GHz) and then switch to a quad-core Carrizo later... but if CPU performance completely stagnates, that sucks a bit.
 
I still have my Regor. It was great value, but when I tried high-end gaming I felt CPU-limited. Can't have smooth firefights in Crysis at 800x600? Damn. Maybe something was wrong with my system (DDR2, a motherboard that fails at any overclocking...).

I became disheartened. So I sure hope AMD can get as high single-thread performance as they can manage.
 
A significant L2 rework would be a good thing if they have done that.
If I recall correctly, the current L2 is still basically the same as in the Athlon 64 generation?
 
Could the 1 MB cache be a common feature with an upcoming server based part?

I haven't heard of anything coming up, but if you had an L3 then a smaller L2 might make sense.
 
Since Carrizo is rumored to be smaller than Kaveri, I doubt it. It might happen at 14nm (Intel sure seems to think it's a great idea) but at 28nm it seems impractical.
 
AMD will use Carrizo as-is for servers? Together with their upcoming ARM stuff, they will in effect only have SoCs as server parts (and the G34 socket for "legacy").

It will be a niche product; it depends on specific HSA server applications (e.g. media or speech processing tasks). Otherwise, if you just want a low-footprint generic box, there's always ARMv8 server SoCs, the GPU-less 8-core Atom, or just VMs on a dual-Xeon box.
I'm thinking of DSP appliances: if you're constantly doing very large-scale H.264 encoding, for instance, high-grade dedicated encoders might be what you need (despite the high upfront cost).

So we could see vendors making "appliances" from HSA hardware: you drop a low-footprint, perhaps ITX-based machine into your datacenter or cabinet, and there it is, doing your particular application. Software is critical; it needs a few killer apps.
 

Yes, they will use it as-is, but at least one variant (Toronto) will make use of the integrated southbridge, unlike the desktop version. The target is dense, embedded machines.

There may also be a version based on the FM2+ variant with an external southbridge, but apart from OpenCL/HSA applications I'm not sure what the point of it would be compared to a Xeon, except maybe cost.
 