AMD Exascale Heterogeneous Processor

Wouldn't this part, "the EHP is coupled to a second level of off package memory" hint at more than one APU per node?

I think it most definitely means the APU can access more memory than the on-package HBM.
 
As an example, the Xbox One has one APU with two pools of memory: 32MB of ESRAM + 8GB of DDR3.

Next-gen Xeon Phi (Knights Landing) does use an HBM equivalent (MCDRAM) plus six channels of DDR4, and explicitly goes into multi-socket motherboards.
 
Like I said, this just looks like an advertisement for AMD to be a part of the US Government's new exascale computer project, rather than any actual project. "Hey look what we could do if we had money!" and the title of the paper says EXASCALE COMPUTING in nice big letters so the government wonks can understand it. Considering the amount of press this got, I'd say it's a pretty cheap marketing success for AMD so far.
It is highly unlikely that this is in any way a valid way to "advertise" yourself for a governmental project (or any non-hobbyist work, really). I know it is commonly accepted wisdom in the great Internet echo chamber that governmental purchases are made by barely literate navel-gazers who need NICE BIG LETTERS TO UNDERSTAND ANYTHING, but the purchasing/financing process is a bit more involved. You'd better hope that AMD "advertised" itself actively for a pretty long time to even be in the running, and not by way of fluffy papers. In no small part because Bill Dally and his guys have been talking about pretty similar things for many years now. And Intel's Xeon Phi is narrowing in on those ideals while being an in-the-silicon product, as opposed to a vague "this is awesome" theorization.
 

Another possible target is AMD's creditors and sources of investment/debt.
In the face of being rated a default risk, one has to at least paint the picture of having a future when asking for more money, which was indicated as a possible next step in the last financial call.


The vision put forward in the EHP paper does show some of AMD's assumptions.
On-interposer memory is the assumed source of high bandwidth, and the memory device count shows AMD is hoping for roughly a quadrupling of per-stack bandwidth. If Fiji and the cited Network-on-Chip paper are any guide, AMD is also counting on significant improvements in interposer complexity and bump pitch: Fiji's interposer lacked the area for even a naive packing of 2 additional stacks, and the NOC paper calls for tighter pitches and discusses a potentially active interposer. Without such improvements, it would be interesting to know where AMD would fit 8 stacks, given the ASIC size and interposer area Fiji demonstrated.
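A quick back-of-the-envelope on those assumptions. The Fiji figures are the public HBM1 numbers; the quadrupled per-stack rate is just the hope attributed to AMD above, not a confirmed spec:

```python
# Fiji: 4 stacks of HBM1 at 128 GB/s per stack (public figures).
FIJI_STACKS = 4
FIJI_GBPS_PER_STACK = 128
fiji_total = FIJI_STACKS * FIJI_GBPS_PER_STACK  # 512 GB/s aggregate

# Hypothetical EHP: 8 stacks, with the quadrupled per-stack
# bandwidth the discussion above attributes to AMD's hopes.
EHP_STACKS = 8
ehp_per_stack = FIJI_GBPS_PER_STACK * 4
ehp_total = EHP_STACKS * ehp_per_stack

print(f"Fiji: {fiji_total} GB/s, hypothetical EHP: {ehp_total} GB/s")
# -> Fiji: 512 GB/s, hypothetical EHP: 4096 GB/s
```

So the sketch implies an 8x jump in aggregate bandwidth over Fiji, which is why the interposer and bump-pitch improvements matter so much.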

AMD's use of an 8-stack interposer does go against the PIM proposal, for a number of reasons. Part of the PIM's benefit was that it adopted an HMC-like interface with the GPU and negated the need for an interposer.
Having an interposer wouldn't necessarily rule out PIM, but it takes away a benefit that might have pushed PIM over the top if other problems, like software complexity and implementation cost, had otherwise outweighed its performance benefit in the workloads it handled well.
Additionally, the PIM's projected power ceiling, coupled with there being 8 stacks, would devote a huge chunk of the node's power budget to the PIM stacks.

Power-wise, AMD's proposed node count is high enough that it probably could not meet the DOE's original 20 MW ceiling, although the Obama administration's 30 MW ceiling might still work. If each node and the hardware attached to it drew 200 W, the proposed node count would consume the entire 20 MW budget, leaving 0 W for anything else, like the top-of-rack network and the undefined storage nodes.
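Rough arithmetic behind that claim, assuming the ~100,000 nodes implied by 10 TFLOPS per node against an exaflop target (the node count is my inference from those two figures, not a number stated in the paper):

```python
# Node count implied by an exaflop target at 10 TFLOPS per node.
TARGET_FLOPS = 1e18        # 1 exaflop
NODE_FLOPS = 10e12         # 10 TFLOPS per node (from the EHP paper)
nodes = TARGET_FLOPS / NODE_FLOPS  # ~100,000 nodes

# Assumed draw per node plus its attached hardware.
WATTS_PER_NODE = 200
compute_mw = nodes * WATTS_PER_NODE / 1e6

print(f"{nodes:.0f} nodes at {WATTS_PER_NODE} W -> {compute_mw:.0f} MW")
print(f"Left under 20 MW ceiling: {20 - compute_mw:.0f} MW")
print(f"Left under 30 MW ceiling: {30 - compute_mw:.0f} MW")
```

At 200 W per node the compute alone lands exactly on the 20 MW ceiling, leaving nothing for the network or storage; the 30 MW ceiling would leave about 10 MW of headroom.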

AMD provided barely anything to describe what it would do about the networking side, although it said it might be neat to make the NIC an HSA device.
 
As far as I have read, PIM in this concept was meant for the high-capacity, off-package layer of memory in the first place.
 
Clarification: the EHP paper itself didn't explicitly say so, but as far as I recall the prior work goes in this direction.
 
The TOP-PIM paper placed PIM in opposition to HBM and WideIO, and it was originally evaluated as the primary memory pool.
It was also evaluated in terms of being placed under a stack of DRAM, which was a significant constraint on its power ceiling.
A non-volatile standard might be able to accept a higher power budget, although that would pull even more of the power budget away from the central APU.

The original proposal is years old at this point, so it might predate AMD's current belief that it needs a tiered memory pool. It does seem to predate the idea that the off-package memory would be non-volatile.
 
https://asc.llnl.gov/fastforward/AMD-FF.pdf
This is also worth a read.
 
That could be an evolution of the concept, although the paper cites the original proposal.
The EHP paper that gives each node 10 TFLOPS also gives the GPU in the node 10 TFLOPS as a baseline, which does not leave much room for processing in memory. The fastforward PDF diagram has enough PIM stacks that, if they drew what AMD originally proposed, they would significantly constrain the GPU.
Possibly, the programmable-logic patent posted in some of the other AMD threads is more consistent with the NVRAM stacks, which might get by with much lower power consumption since it deals with data movement and basic manipulation rather than computation.
This may well be an area where AMD's vision is not fully nailed down.

I'm also unsure about how the fastforward diagram puts an arrow with optics and high-speed IO into the APU (is the APU stacked?). Optical in particular seems like it would have trouble fitting: optical waveguides are often implemented separately from the digital logic, via some kind of mounting of heterogeneous materials, e.g. an interposer.
 