AMD: Speculation, Rumors, and Discussion (Archive)

Status
Not open for further replies.
Baffin is most likely confirmed as the smaller Polaris chip and on the shipping manifests as well.
Might just be coincidence, but on Dec 28, 2015 it looks like they ordered some parts.
PRINTED CIRCUIT BOARD ASSEMBLY (VIDEO GRAPHIC CARD)C88202-00 FIJI NANO P/N:102-C88202-00 (FOC)
PRINTED CIRCUIT BOARD ASSEMBLY FOR PERSONAL COMPUTER(VIDEO/ GRAPHICS CARD) P/N .102-C98101-00 (FOC)
Feb 16 they put a name to one of those.
BAFFIN XT G5 4GB CHANNEL P/N 102-C98101-00
So in December, on the same day, they decided they needed 2 Baffin XT (Polaris 11) cards and 10 Fiji Nano boards, likely without chips. The reference Nano is a 175W part. In September last year they also received 4 Nano boards. Nano boards are priced very similarly to that D00001 part you pointed out, so it's likely they have similar features.

They also have the "G5" on Antigua which was the 380X. It's only shown up twice as Baffin XT(Polaris 11) and Antigua Pro(380X).

For whatever reason, they seem to think they need a lot of interposer based boards capable of 175W output. So either Vega is coming soon or there are a lot of Polaris variants using it.
 
A question regarding HBM: does its ultra-wide bus at slower clocks require an architectural rethink of some sort to get the same performance as higher-clocked memory? I don't know, more CUs or something?
Remember that HBM is only ultra-wide on paper; from what I recall, one HBM device is actually made up of like 8 separate memory channels (that can be accessed individually.) Now, I don't have the first clue what burst length those 128-bit wide memory channels have - if such a workaround is needed for HBM at all that is - but assuming it's 1 or 2, it compares well with GDDR5/X having a minimum transfer length of 16/32 bytes. :)
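
As a rough sketch of the arithmetic in this reply: the smallest transfer a channel can make is its width times its burst length. The 128-bit channel width comes from the post; the burst lengths of 1 and 2 are the poster's guess rather than confirmed HBM figures, and `min_transfer_bytes` is a hypothetical helper used only for illustration.

```python
def min_transfer_bytes(channel_width_bits: int, burst_length: int) -> int:
    """Smallest amount of data one access moves: width x burst length, in bytes."""
    return channel_width_bits * burst_length // 8


# 128-bit HBM channel (from the post); burst lengths 1 and 2 are assumed.
for burst_length in (1, 2):
    size = min_transfer_bytes(128, burst_length)
    print(f"burst length {burst_length}: {size} bytes per access")
```

Under those assumed burst lengths, a single HBM channel would move 16 or 32 bytes per access, which is the range the post compares against GDDR5/X.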
 
Might just be coincidence, but on Dec 28, 2015 it looks like they ordered some parts.


Feb 16 they put a name to one of those.

So in December, on the same day, they decided they needed 2 Baffin XT (Polaris 11) cards and 10 Fiji Nano boards, likely without chips. The reference Nano is a 175W part. In September last year they also received 4 Nano boards. Nano boards are priced very similarly to that D00001 part you pointed out, so it's likely they have similar features.

They also have the "G5" on Antigua which was the 380X. It's only shown up twice as Baffin XT(Polaris 11) and Antigua Pro(380X).

For whatever reason, they seem to think they need a lot of interposer based boards capable of 175W output. So either Vega is coming soon or there are a lot of Polaris variants using it.
PS4K and Mike Mantor's presentation 2016 new Zen APU.
 
PS4K and Mike Mantor's presentation 2016 new Zen APU.
I'll need to go read up on him a bit more. I've seen the Zen stuff including HBM and that looks really interesting. Console refreshes using Polaris wouldn't be a surprise either. However I doubt these boards are related, as a console would integrate the interposer on the board and not have a separate card. For a console, a Zen APU and small form factor might be all they require.
 
Strange, I thought the rest of the Fiji line was faster than the Nano. Didn't realize they were also using Fiji for mid-range parts... Go figure, the parts with the lowest clocks and voltages win the perf/watt metric. That mark would almost always be the minimum performance spec for a chip without a power constraint. I don't see why Polaris would be any different. Every chip, except maybe some of the mobile parts, has worked that way.

It also depends on resolution and settings. If they are targeting 4K systems, I would think they would design chips that have the most perf/watt at that resolution. It makes no sense to make something outside of that resolution, would it?
 
It also depends on resolution and settings. If they are targeting 4K systems, I would think they would design chips that have the most perf/watt at that resolution. It makes no sense to make something outside of that resolution, would it?
Size, speed, or power; pick two. Mobile will likely be constrained by power, but reusing that part for other applications would be ideal. So the mobile part would be designed to maximize perf/watt while the desktop part is whatever performance gets extracted without melting it.
 
Few things I dug up likely related to that patent. Very interesting read that likely explains a lot of what Polaris is doing for efficiency. Especially coupled with the power gating.

A Case for a Flexible Scalar Unit in SIMT Architecture
  • Use scalar unit to prefetch data for SIMT units
  • Use scalar unit to eliminate/reduce control divergence in SIMT units by reorganizing/remapping threads
  • Use scalar unit to execute scalar sections of SIMT programs
  • Thread blocks dispatched in groups of 64+1. Programs consist of SIMT+Scalar kernel
  • Significantly reduce cache miss rates
  • No bank conflicts from remapping
  • Combine partial warps
 
Remember that HBM is only ultra-wide on paper; from what I recall, one HBM device is actually made up of like 8 separate memory channels (that can be accessed individually.) Now, I don't have the first clue what burst length those 128-bit wide memory channels have - if such a workaround is needed for HBM at all that is - but assuming it's 1 or 2, it compares well with GDDR5/X having a minimum transfer length of 16/32 bytes.
I know that HBM is split up into multiple channels, but the question still stands. If you have all those extra channels, you have to keep them busy. Given that GPU "cores" are in-order, wouldn't you need more "width" in terms of processors to keep the memory requests flowing? For example, if a GPU has only 16 SIMDs and 32 channels, can it saturate the bandwidth provided by such a wide (channel-wise) bus?

edit - I guess not, though, if the accesses in a SIMD aren't too concentrated in a single locality.
 
From AnandTech forum.

That 14nm chip most likely Vega?

6pA79OG.jpg
 
Size, speed, or power; pick two. Mobile will likely be constrained by power, but reusing that part for other applications would be ideal. So the mobile part would be designed to maximize perf/watt while the desktop part is whatever performance gets extracted without melting it.
So what do you make of the Fury vs. the Nano vs. the Fury X? You do know the Nano is the same as the Fury X without water cooling and with the same clocks, just that its boost clocks don't get as high.

Why did the Fury, with its cut-down parts, have such horrible perf/watt, as did the Fury X? Do you think the people making these GPUs are sitting there thinking, oh, we just have to downclock it or cut out parts to make it fit whatever power envelopes are available per system? I think that would be a disastrous way of making a suitable product for a market. Why don't they take a top GPU like Fury, downclock it, and replace the R9 390 entirely? That is what you are saying, right? Just take Polaris and up the clocks?

There is a lot more to it than just "it's a smaller chip, it has a lower frequency, so it will have higher perf/watt."

Yeah, and taking a mobile part and upclocking it has really worked well to get the best desktop components, or taking a desktop card and just downclocking it gave the best mobile perf/watt too. You are generalizing something that is much trickier to do than that. We have seen disabled parts with higher clocks on mobile, or vice versa, too. It's a combination of many things that gets the perf/watt to specific needs.

You want to compare the perf/watt of two different chips on the same node? Take the FX and 6800 lines, a perfect example.
 
If you have all those extra channels you have to keep them busy. Given that GPU "cores" are in order wouldn't you need more "width" in terms of processors in order to keep the memory requests flowing?
If you have fewer but faster channels, or slower, yet compensatingly more channels, the amount of work needed to saturate either would be the same... You need more work to saturate more total bandwidth of course, but the specifics of the layout of the memory should not matter methinks. :)
 
If you have fewer but faster channels, or slower, yet compensatingly more channels, the amount of work needed to saturate either would be the same... You need more work to saturate more total bandwidth of course, but the specifics of the layout of the memory should not matter methinks. :)
It matters a great deal actually: the more channels, the lower chance that a channel is blocked due to some DRAM thing like refresh, the higher the achievable bandwidth.
 
I know that HBM is split up into multiple channels, but the question still stands. If you have all those extra channels, you have to keep them busy. Given that GPU "cores" are in-order, wouldn't you need more "width" in terms of processors to keep the memory requests flowing? For example, if a GPU has only 16 SIMDs and 32 channels, can it saturate the bandwidth provided by such a wide (channel-wise) bus?
Keeping a DRAM busy depends on the master's need for data.
It doesn't matter if you have one channel with bandwidth X or 8 channels with bandwidth X/8 each. You just spread the requests evenly across the different channels.
The big advantage of multiple channels is that you have more freedom to schedule requests. If channel 1 is blocked due to refresh, you can give priority to requests that will go to channel 2 and still keep the compute units busy. If you have only 1 channel, you are blocked.
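
The scheduling argument above can be sketched as a toy model. Everything in it is an assumption made for illustration: the refresh period and length are made-up numbers, refresh windows are staggered evenly across channels, and the master is assumed to always have a pending request for every channel. `stall_cycles` is a hypothetical helper, not a model of any real memory controller.

```python
def stall_cycles(num_channels, total_cycles=1000, refresh_period=100, refresh_len=10):
    """Count cycles in which every channel is refreshing, so nothing can be issued.

    Channel c refreshes for `refresh_len` cycles out of every `refresh_period`,
    with its window shifted by an even stagger relative to channel 0.
    """
    stagger = refresh_period // num_channels
    stalls = 0
    for t in range(total_cycles):
        blocked = [
            (t + c * stagger) % refresh_period < refresh_len
            for c in range(num_channels)
        ]
        if all(blocked):  # no channel can accept a request this cycle
            stalls += 1
    return stalls


print("1 channel :", stall_cycles(1), "stalled cycles")
print("8 channels:", stall_cycles(8), "stalled cycles")
```

In this toy setup the single channel leaves the master with nothing to issue during every refresh window, while with eight staggered channels there is always at least one channel free to take a request.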
 
No clue, stumbled upon it, most likely wrong.

Taken from here I think:

eSilicon is a full service ASIC design house that takes a design from RTL (or earlier) all the way to production.
Their typical customers are those who need specialty solutions but don't have all the expertise in-house. (Similar to old school LSI Logic.)
Northwest Logic is a pure RTL IP provider (memory controllers, high speed interfaces etc.)
It makes sense for them to offer a solution for companies that otherwise don't have the expertise to do it themselves. Think networking companies etc. I don't think AMD would outsource this kind of key knowledge, though AMD transferred some IP over to Synopsys recently...
 
Why would AMD use someone else's PHY?
AMD has been using many IP blocks supplied by external parties for a few years. This trend seems to be in line with their really thin RnD budget.

For instance, their recent APUs use non-inhouse mem. controllers. Why wouldn't they outsource the GPU PHY too?
 
AMD has been using many IP blocks supplied by external parties for a few years. This trend seems to be in line with their really thin RnD budget.

For instance, their recent APUs use non-inhouse mem. controllers. Why wouldn't they outsource the GPU PHY too?
^^THIS
Another example is their TrueAudio thing. It's Tensilica HiFi 3 DSP IP
 
Been pondering some of the patent concepts and that paper on the flexible scalar processor. Using the scalar processor for prefetching and a coherent interconnect, it shouldn't be difficult (relatively speaking) to make 2 chips on a single MCM act as one. If that's the case, Polaris 10/11 can represent a replacement for AMD's entire lineup. The big sticking point is the scalar processors being able to prefetch data into L3 cache (HBM attached to each die) from system memory. The alternative to that is a bridge approaching half the chip's bandwidth on the interposer.

Anyways, the dual chip part should help with yields. If they're already using an interposer for HBM, sticking another die on there isn't too much more to ask. Efficiency wise, they will extensively be using one or more scalar processors within each CU for both scalar compute and prefetching. These will rearrange waves to reduce/eliminate divergence issues and execute scalar code sections. SIMDs are variable sized during runtime with ALUs being disabled to save power. Scalar unit helps line things up there as well. Likely an independent boost clock for SIMDs as well providing a large potential compute increase.

I've been assuming 1.3x for efficiency improvements (it may in fact exceed this slightly, based on that SIMT research paper on the scalar processors) and 1.5x for boost clocks (Nvidia currently manages 50%, so it doesn't seem unreasonable). That would make each core nearly double a comparable core by today's standards (1.3 x 1.5 = 1.95). Throw in 60% less power from FinFET and you get the 2.5x perf/watt (1 / 0.4 = 2.5). It won't be any higher because the efficiency and boost numbers likely increase power consumption. They'd increase capacity, but lower perf/watt while in use.
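
A quick sketch of that estimate, using only the guessed factors from this post (1.3x efficiency, 1.5x boost, 60% less power). The variable names are made up for illustration, and none of the inputs are measured figures.

```python
# All three inputs are the post's assumptions, not measurements.
efficiency_gain = 1.3   # assumed architectural efficiency improvement
boost_gain = 1.5        # assumed boost-clock headroom
power_reduction = 0.6   # assumed "60% less power" from FinFET

# Per-core throughput relative to a comparable core today.
per_core_throughput = efficiency_gain * boost_gain

# Same work at 40% of the power gives 1 / 0.4 times the perf/watt.
perf_per_watt_from_process = 1 / (1 - power_reduction)

print(f"per-core throughput: {per_core_throughput:.2f}x")
print(f"perf/watt from process alone: {perf_per_watt_from_process:.2f}x")
```

Note that the 2.5x figure falls out of the power reduction alone; the post's reasoning is that the efficiency and boost gains raise throughput but also raise power draw, so they do not stack on top of it.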

Async would also be hugely important, as the compute capacity could ramp up when a load was encountered.

460 = Cut Polaris 11 2GB GDDR5 1152cores 123mm2
460X = Cut Polaris 11 4GB GDDR5 1152cores 123mm2

470 = Polaris 11 4GB GDDR5 1280cores 123mm2
470X = Polaris 11 2GB HBM 1280cores 123mm2

480 = Cut Polaris 10 8GB GDDR5 2304cores 232mm2 Should be ~$200 and meet the VR spec
480X = Polaris 11x2 4GB HBM1 2560cores 246mm2 This should best a 390X and trade blows with Fiji "BAFFIN XT G5 4GB CHANNEL P/N 102-C98101-00 (FOC)"
480M = Polaris 10 4GB HBM1 2560cores 232mm2 Likely a mobile variant, although could be a discrete card.

490 = Cut Polaris 10 x2 8GB HBM1 4608cores 464mm2
490X = Polaris 10 x2 8GB HBM1 5120cores 464mm2
490M = Small Vega 8GB HBM2 3840cores ~350mm2

Fury = Big Vega 8GB HBM2 5760cores ~525mm2
FuryX = Dual Small Vega 16GB HBM2 7680cores ~700mm2 Might be like the Pro Duo since HBM2 should be larger
FuryDuo = Dual Big Vega 32GB HBM2 11520cores ~1050mm2 This would be similar to Fury Pro Duo, can you say enthusiast part?
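
One way to sanity-check the dual-die entries in the list above is to confirm that each is exactly twice its single-die counterpart in cores and area. All figures are the speculative numbers from this post (the Vega areas are the approximate "~" values), and `mismatches` is a hypothetical helper.

```python
# (cores, area in mm^2) per single die, as guessed in the post.
single_die = {
    "Polaris 11": (1280, 123),
    "Cut Polaris 10": (2304, 232),
    "Polaris 10": (2560, 232),
    "Small Vega": (3840, 350),
    "Big Vega": (5760, 525),
}

# Dual-die SKUs: (die used, total cores, total area in mm^2), as listed.
dual_die = {
    "480X": ("Polaris 11", 2560, 246),
    "490": ("Cut Polaris 10", 4608, 464),
    "490X": ("Polaris 10", 5120, 464),
    "FuryX": ("Small Vega", 7680, 700),
    "FuryDuo": ("Big Vega", 11520, 1050),
}


def mismatches():
    """Return the SKUs whose totals are not exactly 2x the single-die figures."""
    bad = []
    for sku, (die, cores, area) in dual_die.items():
        die_cores, die_area = single_die[die]
        if cores != 2 * die_cores or area != 2 * die_area:
            bad.append(sku)
    return bad


print("inconsistent SKUs:", mismatches())
```

This only checks that the list is internally consistent; it says nothing about whether the guessed die sizes or core counts are right.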

Anyone see any issues with this?
 
Navi may be the point at which we see multiple processors on an interposer. As I wrote in the Navi thread, that might use compute logic in the base dies of the memory stacks.
 