AMD: Speculation, Rumors, and Discussion (Archive)

The HPC APU scheduled for 2017 may have a TDP in the 200-300 W range.

Is it too much to hope for a consumer variant of that APU?

Would they actually need to do anything here? I would hope the minimum needed is a mobo with the appropriate form factor, and perhaps some modification to existing aftermarket cooler designs. They could charge a pretty penny and still come out cheaper than the price of a high-end discrete GPU + DDR4 + high-end Intel desktop CPU. And they'd presumably offer ECC, which isn't available from Intel in any consumer product as far as I know.

Going further - this would also likely be very attractive for the high performance in a small form factor crowd. Personally, I would love a system with support for one or two NVMe SSDs (M.2 or 2.5"), a high performance/wattage APU with 16 or 32GB HBM, the usual outputs, an external power brick, ~250-300W, stuffed in a box not much larger than the double-height PCIe cards we see today.

Anyway whatever :). I think they'd be idiots not to have some (relatively but not outrageously expensive) way for the high-end consumer market to get into one of these. Make it happen AMD!
 
Would they actually need to do anything here? I would hope the minimum needed is a mobo with the appropriate form factor, and perhaps some modification to existing aftermarket cooler designs. They could charge a pretty penny and still come out cheaper than the price of a high-end discrete GPU + DDR4 + high-end Intel desktop CPU. And they'd presumably offer ECC, which isn't available from Intel in any consumer product as far as I know.

Going further - this would also likely be very attractive for the high performance in a small form factor crowd. Personally, I would love a system with support for one or two NVMe SSDs (M.2 or 2.5"), a high performance/wattage APU with 16 or 32GB HBM, the usual outputs, an external power brick, ~250-300W, stuffed in a box not much larger than the double-height PCIe cards we see today.

Anyway whatever :). I think they'd be idiots not to have some (relatively but not outrageously expensive) way for the high-end consumer market to get into one of these. Make it happen AMD!

It would be pretty great as a SteamBox as well.
 
HBM plus quad-channel DDR4 controller? What's the point?
 
It says that it supports 256 GB of DDR4 per channel and 4 channels total, so I guess to achieve capacity parity between DDR4 and HBM you would need something like 5nm production for the memory chips and an HBM5 generation ;) On the other hand, at least from the schematic, it seems that the L3$ is shared only between the CPU cores, so I am wondering: is it possible that the HBM would act both as GPU memory and as a form of L4$ (since, again from the schematic, the HBM seems to be considered part of the APU die) shared between the CPU and the GPU? Perhaps a part of it would be reserved for caching purposes?
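As a rough back-of-the-envelope check (the 4 x 256 GB DDR4 figure is from the slide; the HBM stack count and per-stack capacity are my own assumptions), the capacity gap the HBM would have to close looks something like this:

# Rough capacity-gap estimate; only the 4 x 256 GB DDR4 figure is from the slide,
# the HBM configuration is an assumption (4 stacks of 8 GB HBM2-class memory).
ddr4_gb = 4 * 256          # 4 channels x 256 GB per channel = 1024 GB
hbm_gb = 4 * 8             # assumed 4 stacks x 8 GB per stack = 32 GB
print(ddr4_gb / hbm_gb)    # -> 32.0, i.e. a ~32x capacity gap between HBM and the DDR4 pool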
 
The blocks are generic enough on that slide, so nothing sticks out as being egregiously wrong at first glance.
I am not sure how to reconcile the specifics of the GPU architecture's name with the other data points. The next-gen GPU in the prior slides that feeds into the HPC GPU would seem to be Fiji, with the box after that being the generation that might fit the Arctic Islands theme.
The bandwidth numbers for the HBM look better if the APU is not put off until 2017 or later. Knights Landing has that bandwidth and Volta is set to double it by the time Volcanic Islands is replaced. The PCIe IO may be outmoded by then as well.

The rest of the diagram's details are pretty non-specific. I may go into more of a comparison with Knights Landing in that thread, but the design wins for Intel and IBM+Nvidia for upcoming supercomputers show how broad a set of technologies is important in that space. Intel is pushing a more complete interconnect, software-defined networking, photonics, a file system, and so on. Nvidia and IBM have a number of techs outside of the chip as well.

If this set of details is all AMD is talking about (or, at best, leaking), the HPC APU may have a harder time of things.
 
It says that it supports 256 GB of DDR4 per channel and 4 channels total, so I guess to achieve capacity parity between DDR4 and HBM you would need something like 5nm production for the memory chips and an HBM5 generation ;) On the other hand, at least from the schematic, it seems that the L3$ is shared only between the CPU cores, so I am wondering: is it possible that the HBM would act both as GPU memory and as a form of L4$ (since, again from the schematic, the HBM seems to be considered part of the APU die) shared between the CPU and the GPU? Perhaps a part of it would be reserved for caching purposes?
My guess is that any HBM pool coming with an APU would be a software-managed, independently addressable pool, as today. There is plenty of system memory bandwidth, and the CPU gets loads of cache. So a DRAM cache would be only marginally better for the CPU in some niche cases, and it comes at a cost in DRAM bandwidth (in-memory tags, locality, latencies, etc.), which is the opposite of what you would want for a GPU.
 
The blocks are generic enough on that slide, so nothing sticks out as being egregiously wrong at first glance.
I am not sure how to reconcile the specifics of the GPU architecture's name with the other data points. The next-gen GPU in the prior slides that feeds into the HPC GPU would seem to be Fiji, with the box after that being the generation that might fit the Arctic Islands theme.
The bandwidth numbers for the HBM look better if the APU is not put off until 2017 or later. Knights Landing has that bandwidth and Volta is set to double it by the time Volcanic Islands is replaced. The PCIe IO may be outmoded by then as well.

The rest of the diagram's details are pretty non-specific. I may go into more of a comparison with Knights Landing in that thread, but the design wins for Intel and IBM+Nvidia for upcoming supercomputers show how broad a set of technologies is important in that space. Intel is pushing a more complete interconnect, software-defined networking, photonics, a file system, and so on. Nvidia and IBM have a number of techs outside of the chip as well.

If this set of details is all AMD is talking about (or, at best, leaking), the HPC APU may have a harder time of things.
This is an integrated solution, after all. All the names you brought up here are discrete parts (KNL too, if one treats it as an accelerator) of systems that likely burn far more power than this integrated one. Moreover, AMD would still release discrete parts to compete with these. That said, the selling point of an APU should be the integration. If there is nothing so different about it (and we already have tons of not-so-different parts today), I would agree with your cloudy forecast.

Perhaps AMD could do something with its, ahem, SeaMicro IP.
 
The PCIe IO may be a little out of date, but I think 1GbE in 2017 (?) is pretty egregious. I was expecting to see at least 10GbE, maybe even support for one or more channels of the upcoming 25GbE standard.

The CPU core count seems high. An 18-core Haswell Xeon EP is pretty huge and power hungry: 662 mm² for the 18-core variant on 22nm, at 145 W according to AnandTech (granted, the Xeon also has almost 6x the L3 of this rumored APU). Anyway, the less-than-amazing HBM2 bandwidth (given the time frame) may simply be what makes sense given the amount of streaming compute they have space for after laying down 16 cores. Still, for a higher-end gaming or HPC focused APU, I would have expected more emphasis on streaming compute and HBM bandwidth than on so much general-purpose CPU.
 
This is an integrated solution, after all. All the names you brought up here are discrete parts (KNL too, if one treats it as an accelerator) of systems that likely burn far more power than this integrated one.
We do not know the power of this solution, nor where that would go in the continuum of perf/W to know if it is a win.
The high level of integration could help at a node level, but inter-node connectivity and platform organization is a big factor in maintaining scaling and power-efficiency for large HPC systems, which KNL and POWER9+Volta are going into.
Intel and Cray, and IBM and Nvidia have various forms of interconnect and high-end infrastructure between compute cards and between blades/racks.
Utilization and power consumption can suffer if scaling to large numbers of nodes is poor.

Perhaps AMD is counting on third-party solutions to leverage its PCIe, but this does make AMD less of a one-stop shop compared to the other solutions. I would be curious to see if AMD expects a vector into that space, since oft-burned former partner Cray is hitched to KNL currently.

Moreover, AMD would still release discrete parts to compete with these.
Those discretes will need something corresponding to the expanded interconnects the competing discrete solutions have.

Perhaps AMD could do something with its, ahem, SeaMicro IP.
SeaMicro's shared-nothing nodes do not quite fit the echelon of systems the others can scale to, and AMD has apparently flubbed the integration of it once already. I have not seen much on what the future versions would entail. If it's integrated in that diagram, there doesn't seem to be as much network capability.
 
We do not know the power of this solution, nor where that would go in the continuum of perf/W to know if it is a win.
While the exact number is not known yet, AMD said their target for the HPC APU is 200 to 300 watts of TDP at their Japan update event. I assume a top-of-the-line Volta or a KNL would consume a similar level of power (or at least the lower bound of that range).

The high level of integration could help at a node level, but inter-node connectivity and platform organization is a big factor in maintaining scaling and power-efficiency for large HPC systems, which KNL and POWER9+Volta are going into. Intel and Cray, and IBM and Nvidia have various forms of interconnect and high-end infrastructure between compute cards and between blades/racks. Utilization and power consumption can suffer if scaling to large numbers of nodes is poor.
Agreed. That said, it would be interesting to know whether the APUs can scale vertically (MP) as a node, particularly since the alleged APU has a large number of PCIe lanes. Those could serve double duty as ccNUMA interfaces, in the same way as the cancelled G2012 platform. Volta has NVLink to accomplish this together with the MP-scalable POWER9, while KNL as an accelerator can scale alongside Xeon MP systems.

Perhaps AMD is counting on third-party solutions to leverage its PCIe, but this does make AMD less of a one-stop shop compared to the other solutions. I would be curious to see if AMD expects a vector into that space, since oft-burned former partner Cray is hitched to KNL currently.

Those discretes will need something corresponding to the expanded interconnects the competing discrete solutions have.

SeaMicro's shared-nothing nodes do not quite fit the echelon of systems the others can scale to, and AMD has apparently flubbed the integration of it once already. I have not seen much on what the future versions would entail. If it's integrated in that diagram, there doesn't seem to be as much network capability.
That was an ageing piece of IP even at the time of acquisition, and it is fair to assume there is something new in the pipeline; recently filed patents suggest this, too. Moreover, SeaMicro has an integrated network, so MPI does not sound like an alien client there. That said, as far as I understand, Intel's OmniScale fabric is similarly shared-nothing.
 
Hopefully this is the right place to post this, on AMD's "high performance server APU":

http://techreport.com/r.x/amd-fad-2015/slide-datacenter.jpg

It says multi-TFLOPS, as in >=2. Kaveri (7850K) was 856 GFLOPS total, about 736 GFLOPS GPU and 120 GFLOPS CPU (512 shaders at 720 MHz):

http://images.anandtech.com/doci/7507/KaveriPerf_575px.jpg

I'm going to say triple the CPU part: Zen is supposed to be ~1.4x Excavator, assume Excavator is 5-10% more than Steamroller, then double the cores, which again I assume could happen for a big server part. 2x cores and 1.5x perf per core (assuming the same clocks) would make about 360 GFLOPS on the CPU side. That would leave 1640 GFLOPS for the GPU, so either 768 shaders at roughly 1070 MHz, which could be possible, or more likely 896 shaders at ~925 MHz.

The only problem with that is that 2 TFLOPS at 200 W (an assumption) is a lot more power than it should need given the "double perf/W" claim they've made for GPUs, and that's a lot of die gone for a probably large server part. For example, the 270 is ~2.3 TFLOPS, 212 mm², and consumes about 150 W. Assume ~80 W with a shrink (a little less than the claimed doubling of perf/W) and maybe 150 mm², and you're left with a lot of power and die budget unaccounted for in a big, 200 W server part. What would be a reasonable assumption for die size, 400 mm² for the APU? I don't know, maybe I'm underestimating the CPU core count; it could be 16 cores clocked lower. Server parts usually go for more, slower cores rather than fewer, faster ones (perf/W and all), don't they?
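Sketching that estimate out (all inputs are the assumptions above, nothing official):

# Back-of-the-envelope estimate following the assumptions in the post above;
# none of these numbers are confirmed specs.
kaveri_cpu_gflops = 120                    # A10-7850K CPU portion, per the AnandTech slide
cpu_gflops = kaveri_cpu_gflops * 2 * 1.5   # 2x cores, ~1.5x per-core throughput -> 360
target_gflops = 2000                       # "multi-TFLOPS" read as >= 2 TFLOPS
gpu_gflops = target_gflops - cpu_gflops    # ~1640 GFLOPS left for the GPU

def required_clock_mhz(shaders, gflops):
    # FP32 GFLOPS = shaders * 2 ops/clock * clock (GHz); solve for the clock in MHz
    return gflops / (shaders * 2) * 1000

print(required_clock_mhz(768, gpu_gflops))   # ~1068 MHz
print(required_clock_mhz(896, gpu_gflops))   # ~915 MHz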
 
There's way too little information to really conclude anything. Given the vague timing (2016–2017) it might just be a plain old regular APU with an Opteron sticker. Think 4 Zen cores plus 16 CUs (1024 SPs) running at ~1 GHz. You'd get over 2 TFLOPS from the GPU alone.

But if it's an APU designed specifically for the HPC market, I'd expect something much bigger. Probably 4~8 Zen cores and 32~64 CUs (2048~4096 SPs).
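For reference, a quick check of that figure using the standard GCN FP32 math (64 SPs per CU, 2 FLOPs per SP per clock):

# FP32 throughput of the hypothetical 16-CU configuration above
cus, sps_per_cu, clock_ghz = 16, 64, 1.0
tflops = cus * sps_per_cu * 2 * clock_ghz / 1000
print(tflops)   # -> 2.048, i.e. just over 2 TFLOPS from the GPU alone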
 
There's way too little information to really conclude anything. Given the vague timing (2016–2017) it might just be a plain old regular APU with an Opteron sticker. Think 4 Zen cores plus 16 CUs (1024 SPs) running at ~1 GHz. You'd get over 2 TFLOPS from the GPU alone.

But if it's an APU designed specifically for the HPC market, I'd expect something much bigger. Probably 4~8 Zen cores and 32~64 CUs (2048~4096 SPs).
Frankly, AMD might not need to roll out a new die design specifically for HPC. If AMD is willing to bring 2.5D packaging elsewhere (and they have hinted at it), they may go the multi-die route, where a few dies can form multiple SKUs across the CPU, APU and GPU segments.

Say they introduce a couple of high-end HBM GPUs and are okay with the overhead of a multi-die interface; bringing in a CPU die then already gives you two APU combinations without building a new monolithic SoC from scratch. The CPU die or GPU die (Multiadapter, yay!) could also have multi-die variants of itself, if the interface supports it and the spec is meaningful.

Moreover, as mentioned earlier in the thread, AMD's target is 200 W to 300 W TDP... per the report of the Japan HPC presentation. I am a bit doubtful that such a TDP could come from just one monolithic die on 14nm.

Having said that, it all depends on the economies of 2.5D packaging...
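To illustrate the combinatorics (hypothetical die names and pairings, not anything AMD has announced):

# Hypothetical building blocks: one CPU die and one GPU die with HBM on an interposer.
# Each SKU below reuses the same dies rather than requiring a new monolithic SoC.
skus = {
    "CPU-only":    ["CPU die"],
    "HBM GPU":     ["GPU die", "HBM"],
    "Big HBM GPU": ["GPU die", "GPU die", "HBM"],             # multi-die GPU variant
    "APU #1":      ["CPU die", "GPU die", "HBM"],
    "APU #2":      ["CPU die", "GPU die", "GPU die", "HBM"],  # the second APU combination
}
for name, dies in skus.items():
    print(f"{name}: {' + '.join(dies)}")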
 
Frankly, AMD might not need to roll out a new die design specifically for HPC. If AMD is willing to bring 2.5D packaging elsewhere (and they have hinted at it), they may go the multi-die route, where a few dies can form multiple SKUs across the CPU, APU and GPU segments.
A confirmed interconnect adjusted to this reality would need to be fleshed out. The connection speeds and power consumption, while better than PCB traces, would not match going over the same die.

Another unanswered question is where the cost is going to be eaten, besides the interposer.
Will there be two variants of each die, an interposer CPU and a normal CPU, with the same split for the GPU? Or will each die take on extra complexity and area so that it can do both?
Or is AMD going 2.5D almost universally (seems far off)?

Moreover, as mentioned earlier in the thread, AMD's target is 200 W to 300 W TDP... per the report of the Japan HPC presentation. I am a bit doubtful that such a TDP could come from just one monolithic die on 14nm.
I don't see 200-300W from one die as a challenge. Large GPUs can do that readily, and the largest CPUs like the upper bins of Intel's EX Xeons have TDPs that an extra bin above could probably hit.
Modern devices do not have a problem drawing power if allowed.
 
A confirmed interconnect adjusted to this reality would need to be fleshed out. The connection speeds and power consumption, while better than PCB traces, would not match going over the same die.
Given that they are introducing a new SoC interconnect, and how long their die-stacking program has been running, I would guess they are aware of the extensibility question.

Another unanswered question is where the cost is going to be eaten, besides the interposer.
Will there be two variants of each die, an interposer CPU and a normal CPU, with the same split for the GPU? Or will each die take on extra complexity and area so that it can do both?
Or is AMD going 2.5D almost universally (seems far off)?
For GPUs, if the GPU can make use of HBM, the cost is always there and a "normal" version is off the table. For CPUs, that's my question too: whether they would still give the single-die part an interposer, make a separate non-interposer version, or use flip-chip bumps directly (but still 2.5D, so mixed bump sizes on the interposer...).

I guess AMD might be fine with the fusing-off approach, and the overhead might not be too high in die area, since TSVs enable smaller bump sizes, which in turn shrink the PHYs. Moreover, weighing some redundancy in the single-die variants against the ability to scale out more product SKUs (which may eat into higher-margin markets) sounds like a nice investment IMO. At least it sounds more solid than designing multiple monolithic SoCs and hoping for profit "the old way".

By the way, I bet AMD would still make (though not as a high priority) low-end GPUs and APUs with low-end graphics that use external memory, since that market is still there, and what AMD lacks in competitiveness in the first place is not the graphics but the CPU piece.


I don't see 200-300W from one die as a challenge. Large GPUs can do that readily, and the largest CPUs like the upper bins of Intel's EX Xeons have TDPs that an extra bin above could probably hit.
Modern devices do not have a problem drawing power if allowed.
A GPU in that range is often a really huge die... Anyway, one selling point of 2.5D is to break up monolithic SoCs, and since the GPU is likely getting HBM, a broken-up design seems a fairly natural move.
 