AMD: Speculation, Rumors, and Discussion (Archive)

The Fury example showed possibly half a watt or more per degree C above 40C, on a Fury X with a water cooler and a temperature ceiling that had dropped 30C below Hawaii's. Extrapolating from the 18W increase over the 25C swing of the 40C-to-65C test up to the 290X's 95C suggests an appreciable amount of power budget to play with, although HBM may constrain that upper range anyway.
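Back-of-envelope, assuming the slope stays linear (it probably doesn't, since leakage grows super-linearly with temperature): 18W over 25C is roughly 0.72 W/C, so the extra 30C from 65C up to the 290X's 95C would be on the order of 0.72 x 30 ≈ 22W more, or ~40W total relative to the 40C baseline.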
Nano demonstrates that Fiji's power is driven substantially more by clocks (hence voltage) than by leakage (though we can't isolate the leakage component of the differences seen).

Additionally, a 290X built to Fiji "design rules" would suffer substantially less, if for no other reason than that it is smaller (74% of the area, and probably substantially less than that being active high-leakage circuitry, given the difference in memory PHY area usage).

Finally, ~300W GPUs, in an era where boost clock margins over base run as high as 20%, would seem to imply that liquid cooling is here to stay.

If anything, Nano is proof that chips should grow to 600mm² before being pushed 10%+ higher in clocks.

Without more complete power gating, the "off" transistors still leak, and that leakage is worsened by the temperature and voltage demanded by the most in-demand logic.
Yes, sadly it's not possible to isolate that from Anandtech's data, since there was no idle power consumption measurement at 65C versus 40C. And Anandtech didn't repeat the test for Nano.

Besides that, FinFET has much better control at lower voltages, which benefits both static and active power. The foundries are comparing 14/16nm against 28nm, with the largest benefit being the roughly 2X improvement in power efficiency, since 20nm planar got most of the density improvement but very little power scaling.
This article is pretty interesting:

http://www.techdesignforums.com/practice/technique/arm-tsmc-16nm-finfet-big-little-design/

The finFET process gives you a very good performance gain compared to planar. However, it imposes a number of challenges. One of the key challenges will be dynamic power, which does not scale as well as the leakage power. This created a lot of implementation challenges for the team.

Also, allowing lower-Vt cells for synthesizing dynamic hotspots leads to great reductions in dynamic power for a small increase in leakage power.

The increasing power density of 16FF designs relative to the 40nm and 28nm generations cause problems for die utilization in two ways. One is simply the increased competition for metal from the power rails and signal tracks [...] "The effective power-metal area that can be packed into the design is much less than for 28nm" [...] "You need to find the right metal length and spacing in order to not waste routing tracks."

With the caveat:

[...] expected to double performance from 1GHz on 40nm to 2GHz or more on 16nm

So the article is from the perspective of substantially higher clocks (let's say 50% higher 28nm -> 16nm?), which is not what GPUs are trying to do (though clearly NVidia pivots on a much higher base clock than AMD on the same node currently).

Maxwell seems to be more responsive to voltage tweaks or just overclocking in general. The simplified scheduling probably means the pipeline has lower complexity in multiple stages, but I am not sure which element of the physical design could be different.
I think a neat example is that NVidia has much longer ALU pipelines than AMD's 4-cycles, though I don't know what that number is in Maxwell 2.

The transistor portion of the hybrid nodes is going to be significantly better than 28nm, whereas the wires are less so.
The GPU designs at the hybrid nodes should adapt to that reality, although it would be an interesting exercise to know how well the current architectures would fare if transplanted as-is. Maxwell seems generally more comfortable at 28nm than GCN, and sporadic tests by some sites seem to show less variation in power draw based on temp for some reason.
My suspicion is that NVidia uses longer pipelines throughout the chip for all functional areas, which implies lower interconnectedness per cycle of unit processing. Which I'm guessing allows NVidia to use either a lower-power routing library or to bias their design more towards lower power consumption cells.

I dunno, after all these years we still know practically nothing about the micro-architectural power-v-density and power-v-performance trade offs in GPUs. Almost everything we know about the progression through nodes and node technologies comes from CPU designs where the count of ALUs is substantially unchanged over the last decade (compared with GPUs) and idle power consumption has become more important, since custom blocks have been deployed for high-performance features (such as video decompression).
 
Maxwell has 6 cycle latency for most instructions.
Conversion instructions are 8 cycles.
Setting a predicate (compare instruction) has 13 cycle latency.
S2R and POPC are 25 cycles (what are these?)
Shared memory is typically around 30 cycles
Global memory is typically around 200 cycles.
Double precision is 128 cycles, though this may be a maximum latency if all warps are using the DP resources.
Transcendental latencies are unknown.

Some instructions (memory, double precision) have operand read latency meaning the instruction can be issued sooner than would be expected - for instance, global memory has read latency of 4 cycles, so you could have an ALU instruction followed by a dependent global memory access 2 cycles later rather than the normal 6.

Control flow instructions flush the pipeline and take a full 6 cycles to complete (thus always requiring a stall of 5).

Certain instructions can be dual issued with ALU instructions - memory and conversion instructions.

There is also a yield hint flag, which, when set, tells the scheduler to try to issue an instruction from a different warp next. This has additional cost.

Finally, there are write and read dependency barriers, which are used to deal with variable latency instructions (memory, probably double precision and transcendentals as well since these are shared across the SMX).

It's important to note that scheduling is explicit in the ISA - each instruction specifies how long it needs to stall before the next instruction can be executed. It's the programmer's (compiler's) responsibility to ensure that hazards are properly avoided.


All this information comes from the MaxAs assembler. https://github.com/NervanaSystems/maxas
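
For anyone wanting to sanity-check numbers like these without digging into SASS: below is a minimal sketch (my own code and naming, not taken from MaxAs) of the usual clock64()-based dependent-chain microbenchmark. On Maxwell you'd expect a result near the 6-cycle figure above, give or take compiler scheduling and the timing overhead.

Code:
// Dependent-chain FMA latency microbenchmark (sketch).
// One warp only, so other warps can't hide the latency being measured.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_latency(float *out, long long *cycles, float seed)
{
    float x = seed;
    long long start = clock64();
    // 256 back-to-back dependent FMAs: each must wait for the previous
    // result, so (elapsed cycles / 256) approximates the per-instruction
    // latency, plus a small constant timing overhead.
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        x = fmaf(x, 1.000001f, 0.5f);
    long long stop = clock64();

    if (threadIdx.x == 0) {
        *cycles = stop - start;
        *out = x;   // keep the chain live so it isn't optimized away
    }
}

int main()
{
    float *d_out;
    long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    fma_latency<<<1, 32>>>(d_out, d_cycles, 1.0f);   // a single warp
    cudaDeviceSynchronize();

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("~%.1f cycles per dependent FMA\n", cycles / 256.0);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}

The same skeleton with a pointer-chasing loop instead of the FMA chain gives the shared/global memory numbers, though those vary a lot more with access pattern.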
 
There are 6 read dependency barriers and 6 write dependency ones, so you can have up to 6 (or up to 12 if you use instructions of different types) variable latency instructions in flight at any given time. These add 1 cycle additional latency.
 
I'm pretty sure Pitcairn can't die at this point.

Well, it died a long time ago. It's just that the replacement had the same specifications as Pitcairn and apparently didn't have the features that have come to be associated with GCN 1.1 on desktop, viz. FreeSync (programmable display controller) and TrueAudio.

As for die sizes, AMD don't really have a choice if Nvidia retain a 400-500MHz clockspeed advantage over them. The one rumor, from a KitGuru article, had the ISA for Arctic Islands changing dramatically, so take that fwiw.
 
Well, it died a long time ago. It's just that the replacement had the same specifications as Pitcairn and apparently didn't have the features that have come to be associated with GCN 1.1 on desktop, viz. FreeSync (programmable display controller) and TrueAudio.

As for die sizes, AMD don't really have a choice if Nvidia retain a 400-500MHz clockspeed advantage over them. The one rumor, from a KitGuru article, had the ISA for Arctic Islands changing dramatically, so take that fwiw.


Actually, if we speak about GCN as 1.1, 1.2, or even 1.3 (as some sites like to say for Fiji), you could count Arctic Islands as something like 2.0...

Anyway, have you read that Jim Keller seems to have been hired by Samsung?
 
Actually, if we speak about GCN as 1.1, 1.2, or even 1.3 (as some sites like to say for Fiji), you could count Arctic Islands as something like 2.0...
Fiji is definitely the same gen as the other GCN 1.2 parts. All the relevant blocks share the same revision number (only the memory controller block is newer, big surprise there...).
FWIW here's the relevant IP rev table (extracted from open source driver), starting with Sea Islands.
Code:
       kaveri bonaire kabini hawaii topaz  tonga carrizo fiji
gmc      7.0    7.0    7.0    7.0    8.0    8.0    8.0    8.5
ih       2.0    2.0    2.0    2.0    2.4    3.0    3.0    3.0
smc      7.0    7.0    7.0    7.0    7.1    7.1    8.0    7.1
dce      8.1    8.2    8.3    8.5    N/A   10.0   11.0   10.1
gfx      7.1    7.2    7.2    7.3    8.0    8.0    8.0    8.0
sdma     2.0    2.0    2.0    2.0    2.4    3.0    3.0    3.0
uvd      4.2    4.2    4.2    4.2    N/A    5.0    6.0    6.0
vce      2.0    2.0    2.0    2.0    N/A    3.0    3.0    3.0

No idea though how different Arctic Island is going to be...
 
I'm really hoping for some good performance gains with the 16nm drop. I hope we get twice the speed of the Fury X.
 
Nvidia technically has more than 3 Maxwell chips. Even Maxwell gen 1 is still pretty modern compared to cards in the AMD lineup. What's AMD gonna do below the 470? Rebrand Pitcairn again?

Except Nvidia's current desktop lineup is exactly 3 chips. ALL of them. Similar to what AMD are planning.

If AMD only makes 2 SKUs out of each chip, they would only have 6 cards from 3 chips. If they want to sell their top chip for $650, they will have a hard time filling the gaps, especially since they desperately need competitive mobile SKUs if they want to compete at all in that space.

Again, that is absolutely no different from Nvidia currently who only have 6 products and 3 chips for their desktop SKUs.

950 - GM206 - ~160 USD
960 - GM206
970 - GM204
980 - GM204
980ti - GM200 - ~650 USD
Titan X - GM200 - ~1000+ USD

That covers the desktop discrete lineup from top to bottom for the current generation, IE - GTX 9xx series. If you want to say that GTX 8xx series fills in the gaps then that would be exactly the same as AMD using the R3xx series to fill in gaps.

Really, I don't see a difference here, and it certainly isn't hurting Nvidia.

Hell, if as you suggest, AMD prices their top card at ~650 USD, that makes 6 products between 150-650 USD compared to Nvidia's 5 products between 150-600 USD.

And with iGPUs becoming better and better, is there even a point to a desktop GPU below the 950? So again, where's the problem? Nvidia is doing just fine.

I miss the days when only 2-3 cards covered the entire range. Now people are thinking you need more than 6? Yeesh.

Regards,
SB
 
That IP rev table is quite interesting. I am guessing gmc stands for the memory controller and smc for the shader engines? Not sure what ih stands for or how that gfx block differs between Tonga and Hawaii.

Topaz then looks like a chip with the delta compression of Tonga and thus not a rebrand. I think this could settle the debate about AMD's perennial rebrands of GCN 1.0 chips, as reported here.

http://wccftech.com/amd-radeon-r5-r7-r9-300-rebrand-mobility-driver-update/

I wonder what gmc rev. would M370x show? ;)

GCN1.3 was explicitly stated for Fiji by HardOCP, and the biggest change was of course the memory controller and the uvd block, as the table shows. Though I don't think AMD are calling it a new GCN version.

Carrizo features new shaders then, and we'd see them in the next GCN iteration unless of course AMD are doing another revision for Arctic Islands.
 
That IP rev table is quite interesting. I am guessing gmc stands for the memory controller and smc for the shader engines? Not sure what ih stands for or how that gfx block differs between Tonga and Hawaii.

Carrizo features new shaders then, and we'd see them in the next GCN iteration unless of course AMD are doing another revision for Arctic Islands.
SMC is presumably the clock and voltage microcontroller.
 
SMC is presumably the clock and voltage microcontroller.
Yes indeed, power management is handled there (I don't know what the name exactly stands for, but not surprising that the APU is leading the pack there). CUs and such should all be gfx block. (gmc is graphics memory controller, ih is interrupt handler, the rest of the names should be fairly well known.)
 
I don't remember reading about a specific change to the CUs for Tonga, though. There was that HWS block in the new Fiji GPU block diagram from Nano's release.

It doesn't seem to be making much difference to performance anyway. So perhaps it's better to keep up the appearance of perpetual rebrands rather than claim a new generation of GCN and then disappoint the raised expectations.

The full list of these IP blocks for all the codenames AMD have released would be interesting to see, as to just how many of the rebrands were not rebrands, or at the very least a little different. Perhaps Grenada has the same smc as Tonga/Fiji?
 
Topaz then looks like a chip with the delta compression of Tonga and thus not a rebrand. I think this could settle the debate about AMD's perennial rebrands of GCN 1.0 chips, as reported here.
Topaz is the replacement for Oland/Sun. It's a bare-bones chip mostly meant for use in Dual Graphics mode with Kaveri/Carrizo (which is why it doesn't have a bunch of video hardware).
 
Nano demonstrates that Fiji's power is driven substantially more by clocks (hence voltage) than by leakage (though we can't isolate the leakage component of the differences seen).
The lack of isolation is problematic, since we know that both elements get worse with voltage.

Finally, ~300W GPUs, in an era where boost clock margins over base run as high as 20%, would seem to imply that liquid cooling is here to stay.
There should be an ongoing market for that kind of cooling, but whichever architecture requires it first to maintain parity with the competition is unlikely to be a winner.

If anything, Nano is proof that chips should grow to 600mm² before being pushed 10%+ higher in clocks.
Given the cost of the new processes, their earlier point in the maturation curve, and the likelihood that there is going to be a refresh (or two, or three) before 10nm, they may need to push for more performance per mm2 and consider Amdahl's Law for the portions of the GPU pipeline that do not reside in the CUs.

So the article is from the perspective of substantially higher clocks (let's say 50% higher 28nm -> 16nm?), which is not what GPUs are trying to do (though clearly NVidia pivots on a much higher base clock than AMD on the same node currently).
That's a notable difference, since FinFETs are very good at modest clocks, where their increased channel control and greater amount of channel per area footprint yield better leakage control and better transistor performance at lower voltage points. It significantly helped Intel apply its large cores to such a wide range of power bands. The large percentage of power budget coming from static leakage has been a long-time CPU rule of thumb, which Intel's process choices appear to have done quite a bit to rectify for the current and near-term geometries.
CPUs start to make tradeoffs for the higher range, where FinFET's benefits are more modest, but like you said it's not a range GPUs really dwell in. The trade-offs that yield higher leakage for better switching speed in a design with a 2 GHz clock ceiling don't look like a promising choice for a GPU operating at half that.
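To put a crude first-order model behind that intuition (textbook approximation, not anything measured): dynamic power scales roughly as P_dyn ≈ α·C·V²·f, while static power goes as P_static ≈ V·I_leak, with I_leak rising steeply as Vt is lowered or temperature climbs. A design chasing a 2 GHz ceiling needs the switching speed that low-Vt cells buy and can amortize their leakage against a large V²·f term; at roughly half that clock and a lower voltage point, the same leakage buys comparatively little extra headroom.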

My suspicion is that NVidia uses longer pipelines throughout the chip for all functional areas, which implies lower interconnectedness per cycle of unit processing. Which I'm guessing allows NVidia to use either a lower-power routing library or to bias their design more towards lower power consumption cells.
It's potentially lower-power, but seemingly capable of hitting higher voltages and clocks. Though it does lose a large amount of its power-efficiency when that happens, it doesn't need a liquid cooler and a VRM setup that has to handle a significant number of >300W transients, like Fury X apparently does.

Why Fury X seems on average less responsive to upping the voltage, while Maxwell seems to enjoy very nice overclocks, ties into a question asked in the article: whether a design is limited by device switching speed or by wire delay.
GPUs seem less likely to be hurt by switching delay, and a reduced level of interconnectedness can mean something like fewer long wires and reduced load on them relative to what the transistors can drive.

If transistors are sized for density and have heavier interconnect demands (more complexity, more distance to cover), they might top out at higher voltages before they can do enough to drive signals across units or the chip.
Nvidia has done things to reduce both complexity and distance with its latest GPUs. Some of AMD's choices like the amount of work done within 4 cycles might have traded complexity for more uniform compute performance.

I dunno, after all these years we still know practically nothing about the micro-architectural power-v-density and power-v-performance trade offs in GPUs. Almost everything we know about the progression through nodes and node technologies comes from CPU designs where the count of ALUs is substantially unchanged over the last decade (compared with GPUs) and idle power consumption has become more important, since custom blocks have been deployed for high-performance features (such as video decompression).
One of the data points for FinFET has been Skylake's CPU and GPU power management, where duty cycling is used to keep things at a silicon-optimal clock and voltage point: the cost of switching is balanced against the proportionally larger leakage at too-low voltages and clocks the design is not tuned for, particularly in a not-idle-enough range where the power cost of waking from full power-gating exceeds its savings.
That wasn't a tradeoff that would have paid off back when the static component of power consumption was a significant fraction of power budget, at least for complex CPUs with high peak clocks and a very wide dynamic range.
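A rough way to frame that duty-cycling tradeoff (illustrative symbols only, not Skylake's actual figures): running at the silicon-optimal point (V*, f*) for a fraction d of the time costs about d·P(V*, f*) plus whatever leakage survives gating and the wakeup energy, while running continuously at d·f* can't take the voltage below V_min, so its dynamic power only falls about linearly with frequency and its leakage is paid 100% of the time. Which side wins comes down to how cheap gating and wakeup are, which is exactly the not-idle-enough range being described.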
 
As a datapoint on the subject, my HD7770 runs at 1000mV and my HD7970 at 1006mV at stock clocks. The stock configurations at purchase were: 7970 at 1GHz with 1175mV (not a GHz Edition, which launched a little later) and 7770 at 1GHz with 1200mV. I think the 7970 would under-volt to 900mV and the 7770 to 800mV a few years ago; they're just a little less tolerant of being under-volted now.

As far as I can tell AMD is over-generous with voltages, which would tend to put a ceiling on "increased voltages".
 
It's potentially lower-power, but seemingly capable of hitting higher voltages and clocks. Though it does lose a large amount of its power-efficiency when that happens, it doesn't need a liquid cooler and a VRM setup that has to handle a significant number of >300W transients, like Fury X apparently does.

The same node, with roughly the same board size, roughly the same number of transistors, and roughly the same clock speed, should put off roughly the same amount of heat regardless of architecture, assuming both boards are efficiently utilized under load and nothing is power-gated for want of use. In fact, the Fury and Fury X read a few degrees lower than the 980 Ti/Titan X under load. http://www.anandtech.com/show/9421/the-amd-radeon-r9-fury-review-feat-sapphire-asus/17 Which, again, points to the Fury/X having an inefficient architecture, showing that the Fury X especially isn't fully utilized, as some portion of the pipeline is bottlenecking the rest.

Regardless, water coolers seem to be for size and noise reasons, and not heat dissipation. The Fury nominally draws as much power as a Fury X yet gets by with an air cooler, and even under the max load of FurMark it doesn't get as hot as a 980 Ti.
 
The same node, with roughly the same board size, roughly the same number of transistors, and roughly the same clock speed, should put off roughly the same amount of heat regardless of architecture, assuming both boards are efficiently utilized under load and nothing is power-gated for want of use. In fact, the Fury and Fury X read a few degrees lower than the 980 Ti/Titan X under load. http://www.anandtech.com/show/9421/the-amd-radeon-r9-fury-review-feat-sapphire-asus/17 Which, again, points to the Fury/X having an inefficient architecture, showing that the Fury X especially isn't fully utilized, as some portion of the pipeline is bottlenecking the rest.
Temperature can be handled in various ways that can make the correlation with power consumption weak. It's particularly weak when comparing an air-cooled 980 Ti with a water-cooled Fury X, especially given the very large measured wattage difference.
The overclocked Ti cards can reach Fury X power draw, but their clocks are measurably higher. It doesn't help that we know those cards have a memory interface that draws an appreciable fraction of the total board power.

Regardless, water coolers seem to be for size and noise reasons, and not heat dissipation.
The Fury nominally draws as much power as a Fury X yet gets by with an air cooler, and even under the max load of FurMark it doesn't get as hot as a 980 Ti.
Nominally per whose definition of nominal? Someone's "typical" board power?

If going by FurMark, the following show nominally the same cards differing by about 100W:
The already mentioned: http://www.anandtech.com/show/9421/the-amd-radeon-r9-fury-review-feat-sapphire-asus/17
http://www.pcworld.com/article/2946...er-is-amds-geforce-gtx-980-slayer.html?page=2

Techreport's Crysis 3 numbers have them differ by about 40W.
http://techreport.com/review/28612/asus-strix-radeon-r9-fury-graphics-card-reviewed/11

So, when we have "nominally" 275W boards with 40-100W differential, where the lower-power board seems to fall in line with other cards with 250-300 W power ranges, is there an extra set of numbers between 275 and 300?
 
Which is to say, while the Fury X certainly gets the lowest temp under load of all the cards, 80C seems a perfectly acceptable temp, and both the air-cooled Fury and the 980 Ti land roughly there. So the use of the water cooler on the Fury X appears to be done as much for size and noise reasons, since it could presumably have been allowed to run much hotter without being thermally limited. And certainly, due to setup and test differences, people will report different power draws for the same or similar card, which is why I said "nominally"; that's a debate that can go on for a long time without getting anywhere useful.

Regardless, some sort of bottleneck in the Fury/X, the most relevant datapoint we have to compare with Maxwell, is certain. What this bottleneck is, however, is debatable. Wiring? ROPs? Perhaps some combination of how the underlying architecture is built and the balance between CUs/ROPs/etc. is to blame. And since neither is, presumably, going to be the same for the 4xx series, we are only left to guess at its performance in comparison to Pascal.
 