AMD: Speculation, Rumors, and Discussion (Archive)

Status
Not open for further replies.
480X = Polaris 11x2 4GB HBM1 2560 cores 246mm2. This should best a 390X and trade blows with Fiji. "BAFFIN XT G5 4GB CHANNEL P/N 102-C98101-00 (FOC)"
G5 tells pretty straight out it's GDDR5, not HBM. Also, fitting pads for both GDDR5 and HBM in 246mm2 chip simply isn't happening, there isn't enough room
 
Navi may be the point at which we see multiple processors on an interposer. As I wrote in the Navi thread, that might use compute logic in the base dies of the memory stacks.
I assumed the scalability meant putting the memory directly on top of the dies or literally stacking the dies. Alternatively, each CU being its own die on an interposer.

G5 tells pretty straight out it's GDDR5, not HBM. Also, fitting pads for both GDDR5 and HBM in 246mm2 chip simply isn't happening, there isn't enough room
That was my initial thinking as well, but that part is listed at 4-5x the cost of every other Polaris 11 variant on the Zauba manifest, with a part number more in line with the Fury line. Also the S3, S3, X3, G5? Those look more like product tiers than memory configurations. Cost-wise it's also 50% higher than Antigua on the manifest. Yeah, the numbers can't be relied upon, but it's an interesting pattern, and the likely performance lines up really well.
C98101 BAFFIN XT G5 4GB CHANNEL
C72851 BANKS PRO S3 2GB 4.5GBPS GDDR5
C72951 WESTON PRO S3 2GB 4.5GBPS GDDR5
C72951 WESTON X3 2GB 4.5GBPS GDDR5
C88202 FIJI NANO
C76010 ANTIGUA PRO G5 4GB OEM
As for the pads, they either all use an interposer to expand things out, granted a small one for GDDR5, or there are a bunch of tiny pads that only get used with an interposer. Might just be a packaging issue.
 
How likely is it though, considering Raja Koduri was talking about developers working with mGPU and the need for new and better mGPU rendering algorithms.

This to me does not sound like many GPUs on interposer working as one.
 
That could be disparate GPUs though. Take advantage of an APU in addition to a discrete card. The limit to what I suggested would likely be 2 GPUs, maybe a third, before you simply run out of space. They are developing GMI (I think that's the name) which could be used to connect everything, at roughly 100GB/s per link. Two of those routed through the interposer would likely be sufficient to pull this off. Kind of like putting 9-10 memory channels on the chip and routing some of them to other GPUs: 8 for HBM, 1 for PCIe, 1 for the other chip? Not sure what the upper limit of IO on an interposer would be exactly. It's definitely pushing the technology, but considering some of the technology Zen appears to have, it doesn't seem unreasonable.
 
@Anarchist4000 There won't be a HBM2 Polaris chip. GDDR5(X) only. HBM2 doesn't fit the price range Polaris is targeting, that stuff is expensive. It looks like there WAS a big Polaris with HBM(2?) planned in the lineup originally, but it's gone. I'm talking about that missing 4096 cores model, call it "Greenland" if you want. Polaris 10 is already a "small" chip.

I wouldn't assume all too many architectural improvements for Polaris either. It's still a mostly GCN based architecture, with focus on efficiency and low cost. Replacement for all sub-Fiji chips. And most likely the last of the GCN family.

Vega should be the one to receive the new architecture, and will power all the new high end models. Fiji will stay high-end until Vega release.

Not sure if Vega will be scaled down, or if GCN (Polaris) and Vega's arch will co-exist until Navi replaces GCN for good. My guess? Navi will introduce a new low cost, integrated, planar(!) memory type. Mostly Vega's arch, but optimized for low cost and efficiency again. Until then, Polaris is to stay.
 
Hmm, Polaris 10 and 11 having the same core counts, but different die sizes? Polaris 10 having more than 3k shader units? Nope.
 
@Anarchist4000 There won't be a HBM2 Polaris chip. GDDR5(X) only. [...]
The intent was only HBM1 on higher SKUs of the Polaris chips, with HBM2 coming with Vega. Parts that should be readily available. Which is exactly what their roadmap shows, it just omits HBM1. The 490 level stuff would be 2 chips, each with 4GB and some sort of bridge. Maybe only 4/6GB, reducing the number of stacks. Even if the dual chip thing didn't work, there would still be Vega for the high end later on. Maybe a HBM1 based Polaris 10 at a low level. AMD did work out that priority deal on HBM1, but the impression was it only provided chips for Fiji. What if it was to source enough chips for Polaris and consoles instead?

A Case for a Flexible Scalar Unit in SIMT Architecture said:
The most related solution is the
AMD GCN architecture, in which the scalar unit and the
SIMT unit share the same instruction stream and the
compiler is responsible to identify the scalar operations.
Although compiler algorithms [27] have been proposed to
find more scalar operations, the capability of the scalar unit
is still limited because of the shared instruction stream with
the SIMT unit. Furthermore, our proposed architecture also
has the flexibility to be configured as a GCN processor.

In this paper, we propose to extend the scalar unit in the
recent GCN architecture such that it can either share the
instruction stream with the SIMT unit or have its own
instruction stream.
Here's the thing: all they had to do was get an independent instruction stream for the scalar processors and maybe 4 bytes/thread for an ID. That alone would yield a significant performance boost, as they demonstrated in the paper using a Kepler-like architecture. One of the authors was promoted by AMD to Corporate Fellow a year later. GCN would likely benefit more than Kepler from the change. The change would arguably remove waves entirely, as the hardware would now operate on giant thread blocks with the scalar units grouping them. It's possible they didn't get the reorganization part working and just opted to disable ALUs instead.
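To make concrete what the paper means by "scalar operations": in a 64-lane GCN wavefront, any value that is identical across all lanes ("wave-uniform") can be computed once on the scalar unit instead of 64 times on the SIMD unit. A toy Python sketch of that uniformity test follows; this is not AMD's actual compiler pass, and all names are made up for illustration:

```python
# Toy illustration (not AMD's actual compiler analysis): a value that is
# identical across every lane of a wavefront is eligible for the scalar
# unit; a value that differs per lane must stay on the SIMD (vector) unit.

WAVE_SIZE = 64  # GCN executes 64-lane wavefronts

def is_wave_uniform(values):
    """True if every lane of the wavefront holds the same value."""
    return all(v == values[0] for v in values)

# Per-lane inputs for one hypothetical address calculation:
#   addr = base_addr + lane_id * stride
lane_id   = list(range(WAVE_SIZE))   # divergent: differs per lane
base_addr = [0x1000] * WAVE_SIZE     # uniform: same in every lane
stride    = [16] * WAVE_SIZE         # uniform

# The compiler can hoist the uniform operands to the scalar unit and leave
# only the lane-dependent multiply/add on the SIMD unit.
for name, vals in [("lane_id", lane_id), ("base_addr", base_addr), ("stride", stride)]:
    unit = "scalar" if is_wave_uniform(vals) else "vector"
    print(f"{name}: {unit} unit")
```

The paper's point is that with a shared instruction stream the scalar unit can only execute what the compiler proves uniform ahead of time; giving it its own stream would let it do more than this kind of hoisting.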

The interconnect/bridge would be the biggest catch to what I proposed. It doesn't seem that different than what they appear to be doing with Zen though. I just don't think it would take any revolutionary changes to the architecture to pull it off. It's entirely possible their interconnect is compatible with a HBM channel as well. So design a chip with 10 channels and route 2-4 to the other chip. Limits you to 3 stacks of HBM, but that's likely enough.
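Rough numbers for that channel-routing idea, using public HBM1 figures (8 x 128-bit channels per stack at ~1 Gbit/s per pin) and the ~100 GB/s GMI estimate from earlier in the thread. This is back-of-the-envelope speculation, not anything AMD has disclosed:

```python
# Back-of-the-envelope check (speculative): one HBM1 stack is 8 channels
# of 128 bits each, at an effective 1 Gbit/s per pin.
PIN_RATE_GBPS      = 1.0   # HBM1 per-pin data rate (Gbit/s)
CHANNEL_WIDTH      = 128   # bits per HBM channel
CHANNELS_PER_STACK = 8

channel_bw = CHANNEL_WIDTH * PIN_RATE_GBPS / 8   # GB/s per channel
stack_bw   = channel_bw * CHANNELS_PER_STACK     # GB/s per stack
print(f"per channel: {channel_bw:.0f} GB/s, per stack: {stack_bw:.0f} GB/s")

# If 2-4 channels of a hypothetical 10-channel PHY were repurposed as a
# chip-to-chip link running at HBM1 pin rates:
for n in (2, 3, 4):
    print(f"{n} channels rerouted -> {n * channel_bw:.0f} GB/s inter-GPU")
# Matching the ~100 GB/s GMI figure quoted in the thread would take 6-7
# HBM1-speed channels, or fewer if the link ran at a higher pin rate.
```

So the idea only closes the gap to the quoted GMI number if the rerouted channels clock well above HBM1 rates, which is part of why it's pushing the technology.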
 
@Anarchist4000 There won't be a HBM2 Polaris chip. GDDR5(X) only. [...]
Greenland isn't missing, it's Vega.
Polaris is 4th gen GCN, aka a "new architecture" (this has been stated on several occasions). Even if Vega were again a "new architecture", I don't see any reason to think AMD would be moving away from GCN; it would just be 5th gen GCN, further refining the changes Polaris brings. More likely it's just Polaris with HBM2 (and bigger chips).
The only thing suggesting Vega is anything other than Polaris+HBM is that one claimed LinkedIn profile, which suggested the "4096 stream processor chip" is the first with gfx ip 9.
 
Just don't use the Zauba price as an argument for anything. I already explained why.
It's not like I was taking the price at face value, but even the description was interesting. Maybe it's nothing, or maybe key details were omitted. The naming alone would suggest it's analogous to Antigua. An added 2GB of memory bumping it from what looks like low to upper-mid tier seems significant.
 
And any solution that requires exotic configurations (that includes HBM) is out as well: it defeats the whole purpose of being price efficient for that market. It's time for AMD to start making money for a change.

Yes, similar to how DDR traditionally served the budget GPU boards, GDDR will likely remain dominant in the midrange for at least the near future, if not longer, unless there's a compelling reason to use HBM on midrange parts. One such case might be mobile, where midrange desktop parts are often sold as enthusiast mobile parts. Combined with the space and power savings, HBM paired with Polaris could make sense there, while still being highly unlikely for desktop Polaris.

Not saying it will happen even in mobile, but it's a use case where it could make sense as enthusiast level gaming laptops command a price premium that would potentially give it a niche. All depends on how difficult/costly it would be to have an HBM variant just for mobile.

A custom console APU might also be a compelling use case. Highly dependent on whether the whole console would still fit into a console price bracket with HBM, however. Not sure it's feasible in the near future (1-2 years).

Regards,
SB
 
AMD has been using many IP blocks supplied by external parties for a few years now, a trend that seems in line with their really thin R&D budget.

For instance, their recent APUs use memory controllers that weren't designed in-house. Why wouldn't they outsource the GPU PHY too?

It's also possible that, for even longer, some of the GPU's internal controllers have been customizable processors that AMD did not fully develop in-house. There may not be much of a win in reinventing the wheel for controllers that run custom microcode, and licensable or modifiable cores are readily available. The command processor in earlier generations was described as multiple custom processors.

The limit to what I suggested would likely be 2 GPUs, maybe a third, before you simply run out of space. They are developing GMI (I think that's the name) which could be used to connect everything. Roughly 100GB/s per link.
https://forum.beyond3d.com/posts/1893627/
From the image in this post, it's 100GB/s over 4 links. That calls into question whether the "package" in this case is an MCM or a CPU and GPU sharing a PCB--further weakening the claim that this is an "HPC APU", as some are interpreting it.
If it were a shared package, the question would be why it seems inferior to NVLink despite having a massively shorter transmission range.
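The per-link arithmetic behind that comparison, using the figures as quoted here plus the commonly published NVLink 1.0 numbers. Both are approximate, and it isn't clear from the slide whether the 100GB/s is per direction or aggregate:

```python
# Figures as quoted in the thread (GMI) and from public NVLink 1.0 /
# Tesla P100 materials; treat both as approximate.
gmi_total_gbs, gmi_links = 100, 4
nvlink_per_link_bidir = 40   # GB/s per NVLink 1.0 link, both directions
nvlink_links = 4             # links on a P100

print(f"GMI: {gmi_total_gbs / gmi_links:.0f} GB/s per link")
print(f"NVLink: {nvlink_per_link_bidir * nvlink_links} GB/s aggregate, "
      f"over full PCB-length traces")
# An on-package link should have an easier signalling environment than a
# board-level one, which is what makes the lower figure surprising.
```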
 
https://forum.beyond3d.com/posts/1893627/
From the image in this post, it's 100GB/s over 4 links. [...]
The fact that it would first be used on an "HPC APU" (just the most likely candidate for it, even if it's an MCM and not a true APU) doesn't mean it couldn't scale to longer transmission ranges.
 
Yes, similar to how DDR was traditionally used to serve the budget GPU boards, ...
I thought DDR was still used on budget GPU boards.

That is unless there's a compelling reason to use HBM on midrange parts. For example, there might be a compelling case made for mobile parts where midrange desktop parts are often used as enthusiast mobile parts. Combined with space and power savings, it might be compelling to use HBM in combination with Polaris in that case, while still being highly unlikely for desktop Polaris.
I don't think this market exists.

Except for ultra-high-end gaming laptops (not worth designing a chip for), pretty much all laptops with a discrete GPU use a hybrid configuration like Optimus. Since laptops are primarily used for non-graphics work, they run on the iGPU and the discrete GPU stays disabled.

For area, it's a similar argument: it's already possible to design good-looking laptops with discrete GPUs and GDDR5, so you're talking about a segment that wants MacBook Air 12 thinness with a discrete GPU. Does that segment exist?

A custom console APU might also be a compelling use case. Highly dependent on whether the whole console would still fit into a console price bracket with HBM, however.
Current consoles are not size limited, not power limited, and aren't ambitious wrt performance. I don't see that changing.

HBM on a very high margin HPC APU, yes. That's the same market as P100. Other than that, it doesn't make a lot of sense, especially if GDDR5X can deliver rates up to 16Gbps.
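For scale, here's what those per-pin rates mean on a conventional bus. This is illustrative arithmetic only; the 16Gbps figure is the quoted GDDR5X ceiling, not a shipping part:

```python
# Peak bandwidth of a conventional GDDR bus at various per-pin rates
# (GDDR5 around 8 Gbps at the time; GDDR5X quoted up to 16 Gbps).
def bus_bandwidth_gbs(bus_width_bits, pin_rate_gbps):
    """Peak bandwidth in GB/s for a given bus width and per-pin rate."""
    return bus_width_bits * pin_rate_gbps / 8

for width in (256, 384):
    for rate in (8, 10, 16):
        print(f"{width}-bit @ {rate} Gbps -> "
              f"{bus_bandwidth_gbs(width, rate):.0f} GB/s")
# A 256-bit bus at 16 Gbps reaches 512 GB/s -- Fiji/HBM1 territory --
# without an interposer, which is the cost argument being made here.
```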
 
The fact that it would first be used on an "HPC APU" (just the most likely candidate for it, even if it's an MCM and not a true APU) doesn't mean it couldn't scale to longer transmission ranges.

The possibility should exist, even if the thing stretches the APU definition past the breaking point by being an MCM. Something else can beat it numerically while crossing a much wider PCB, and will most likely be first to market doing so.
Bulldozer was capable of that kind of aggregate bandwidth going through a socket, so it's curious why a truly short-reach connection needs that many links to match 2010 performance.

The capabilities look like those of a density-optimized server card composed of a CPU and GPU on a PCB.
 
https://forum.beyond3d.com/posts/1893627/
From the image in this post, it's 100GB/s over 4 links. [...]
Going to take a really wild stab at that: perhaps the switching speed of programmable logic applied to a 128-bit bus? It may also go along with why overclocking Fiji is so difficult. The ACEs are programmable to route work around; it wouldn't be surprising if they did the same thing with their memory controller, or the rest of the control logic for that matter. That would let them adapt the network topology to whatever environment the chip was placed in.
 
Colour me impressed: more than twice the added GPU power. I assume Microsoft will also follow suit?
 