AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

The only thing I don't understand is: why would AMD rebrand P10 cards if Vega 11 is aimed at that market and will be better?

Because either Vega 11 will become the RX 590, or it is a laptop-exclusive chip. There is no point in releasing a desktop SKU if it isn't faster than the RX 580.
 
I used to think Vega 11 would be half a Vega 10 (thinking 32 CUs / 2048sp) together with a single HBM2 4-Hi stack, consuming 100W or less.
This would make it an excellent "premium midrange" card for premium laptops, AiOs and SFFs.
In fact, a 32 CU Vega GPU clocked at ~1.4GHz could actually deliver performance very similar to the R9 Nano, so this would be a worthy successor (a Vega Nano?) capable of going substantially lower in TDP.

Then Fudzilla came up with news saying some Vega cards would be coming with GDDR5/X. I thought AMD would want the big Vega 10 chip to go into high-margin graphics cards, so if such a GDDR5 Vega existed, it would be Vega 11 based. I eventually put aside the "premium midrange" idea, and thought Vega 11 would either replace Polaris 10 with 32 CUs or sit midway between Polaris 10 and Vega 10 (e.g. 2/3rds of a Vega 10, like the GP104 -> GP102 difference).

But then during the Vega tease presentation to the press, AMD not only spoke exclusively about HBM2 for Vega, but also seemed to emphasize its importance by calling it a "High Bandwidth Cache" and presenting the new memory controller that is so dependent on it.
Furthermore, AMD also doubled down on the "we don't really need a lot of VRAM" narrative for Vega. For an 8GB card, this narrative wouldn't be needed... so I guess there are definitely some 4GB HBM2 cards coming.

So I'm back to the idea of Vega 11 being a "premium midrange". Half a Vega 10 (~260 mm^2?) with a single HBM2 stack.



However, a sub-100W Vega 11 for laptops wouldn't exclude a lower-clocked Vega 10 making its way into top-end gaming laptops with a 150W TDP.


The only thing I don't understand is: why would AMD rebrand P10 cards if Vega 11 is aimed at that market and will be better?
Because Vega 11 should be more expensive to make, but its lower power demands would make it very appealing for laptops and SFFs.
Few desktop users will care whether Polaris 10 cards consume 120W or 180W, as long as they don't reach the 250-300W levels of Hawaii and Fiji. Any regular PSU rated over 400W will handle that, and heat isn't a concern at those levels.
OTOH, premium laptop and AiO makers (e.g. Apple with MacBooks and iMacs) depend a lot on the power consumption and heat output of their discrete graphics solutions.
 
Timothy Lottes's GDC slides mention Vega as GCN5, with packed math and FP16 optimisations:

http://gpuopen.com/gdc2017-advanced-shader-programming-on-gcn/

Because either Vega 11 will become the RX 590, or it is a laptop-exclusive chip. There is no point in releasing a desktop SKU if it isn't faster than the RX 580.

I don't think AMD would bother naming a small chip, just as with Polaris 12; they mentioned two Polaris chips, and they mentioned two Vega chips as well.
Vega 11 will likely slot into the Fury X to 1080 bracket, while Vega 10 would sit above it.
 
The only thing I don't understand is: why would AMD rebrand P10 cards if Vega 11 is aimed at that market and will be better?

They'll rebrand Polaris 10 for the 500 series for the same reason they briefly rebranded Tahiti for the 200 series despite Tonga seeing its introduction in the 200 series as well. If Polaris 10 XT is the 580, then Vega 11 could be the 590. If Vega 11 has any performance bump over Polaris 10 (and it probably has at least a small bump), then that would work decently on the desktop.

Honestly, I could see Vega 11 relating to Polaris 10 much the way Tonga related to Tahiti.

Just like in Tonga's case, Apple could make great use of Vega 11, and we know that Apple has AMD pretty whipped. Polaris 10 doesn't find itself in Apple's lineup, but Vega 11 could have a better chance of doing so (iMac, MacBook Pro, etc.).

So while a Polaris 10-like Vega 11 would be redundant on the desktop, AMD's world is bigger than the desktop.
 
The only thing I don't understand is: why would AMD rebrand P10 cards if Vega 11 is aimed at that market and will be better?
Cost, if all Vega models depend on HBM2. APUs get the low-end Vegas, and we've seen the enthusiast Vega. The middle could be P10/580 at the performance tier and Vega 11 at high-end/590, and/or Snowy Owl APUs where the HBM2 could serve as a cache to offset cost on premium APUs.
 
Vega 10 - GeForce GP102 + x%
Vega 11 - GeForce GP104 + x%

AMD has a huge gap above the RX 580 that one chip won't be able to fill.
 
Vega 10 - GeForce GP102 + x%
Vega 11 - GeForce GP104 + x%

AMD has a huge gap above the RX 580 that one chip won't be able to fill.

It's likely the other way round: Vega 11 will be slower than the full GP104 of the 1080, and Vega 10 will be slower than the GP102 of the 1080 Ti.

Fury X is around 38% faster than the 480 at 4K. The 580 could be 10% faster than the 480, and Vega 11 slots in at around 30% faster than that, i.e. around 1070 level plus a few percent, maybe halfway to the 1080 at most, while Vega 10 is about midway between the 1080 and 1080 Ti, closer to the latter.
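Spelling that chain out with the 480 as the baseline (all the percentages are the rough estimates above, not measurements):
Code:
# Back-of-the-envelope chain using the rough estimates above (RX 480 at 4K = 100).
rx480  = 100.0
fury_x = rx480 * 1.38          # ~38% faster than the 480
rx580  = rx480 * 1.10          # assumed ~10% bump over the 480
vega11 = rx580 * 1.30          # ~30% on top of the 580
print(fury_x, rx580, vega11)   # ~138, ~110, ~143 -> a bit above Fury X, roughly 1070 + a few percent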
 
Just like in Tonga's case, Apple could make great use of Vega 11, and we know that Apple has AMD pretty whipped. Polaris 10 doesn't find itself in Apple's lineup, but Vega 11 could have a better chance of doing so (iMac, MacBook Pro, etc.).
References to Polaris 10, Polaris 10 XT2, Polaris 12, and Vega 10 (in addition to the existing Polaris 11) were found in the macOS 10.12.2 beta, although that doesn't mean there won't be a Vega 11 in some Mac down the road.
 
A bit old, but some hints, from the Linux drivers:
Code:
def FeatureVolcanicIslands : SubtargetFeatureGeneration<"VOLCANIC_ISLANDS",
  [FeatureFP64, FeatureLocalMemorySize65536,
   FeatureWavefrontSize64, FeatureFlatAddressSpace, FeatureGCN,
   FeatureGCN3Encoding, FeatureCIInsts, Feature16BitInsts,
   FeatureSMemRealTime, FeatureVGPRIndexMode, FeatureMovrel,
   FeatureScalarStores, FeatureInv2PiInlineImm, FeatureSDWA,
   FeatureDPP
  ]
>;

def FeatureGFX9 : SubtargetFeatureGeneration<"GFX9",
  [FeatureFP64, FeatureLocalMemorySize65536,
   FeatureWavefrontSize64, FeatureFlatAddressSpace, FeatureGCN,
   FeatureGCN3Encoding, FeatureCIInsts, Feature16BitInsts,
   FeatureSMemRealTime, FeatureScalarStores, FeatureInv2PiInlineImm,
   FeatureApertureRegs
  ]
>;

Added:
  • FeatureApertureRegs
Removed:
  • FeatureVGPRIndexMode
  • FeatureMovrel
  • FeatureSDWA
  • FeatureDPP
Interestingly there was no mention of packed FP16, just "Feature16BitInsts" like Volcanic Islands.
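For reference, the Added/Removed lists above are just the set difference of the two feature lists; a quick way to reproduce them (Python, lists copied from the definitions above):
Code:
# Derive the Added/Removed lists as a set difference of the two definitions.
vi = {"FeatureFP64", "FeatureLocalMemorySize65536", "FeatureWavefrontSize64",
      "FeatureFlatAddressSpace", "FeatureGCN", "FeatureGCN3Encoding",
      "FeatureCIInsts", "Feature16BitInsts", "FeatureSMemRealTime",
      "FeatureVGPRIndexMode", "FeatureMovrel", "FeatureScalarStores",
      "FeatureInv2PiInlineImm", "FeatureSDWA", "FeatureDPP"}
gfx9 = {"FeatureFP64", "FeatureLocalMemorySize65536", "FeatureWavefrontSize64",
        "FeatureFlatAddressSpace", "FeatureGCN", "FeatureGCN3Encoding",
        "FeatureCIInsts", "Feature16BitInsts", "FeatureSMemRealTime",
        "FeatureScalarStores", "FeatureInv2PiInlineImm", "FeatureApertureRegs"}
print("Added:  ", sorted(gfx9 - vi))  # ['FeatureApertureRegs']
print("Removed:", sorted(vi - gfx9))  # ['FeatureDPP', 'FeatureMovrel', 'FeatureSDWA', 'FeatureVGPRIndexMode']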
 
Less than one week after the AMDGPU DRM Vega support patches arrived for the Linux driver stack, more Direct Rendering Manager patches are being shot out today.

A few more patch series by AMD developers have been sent out so far today to provide fixes to the initial Radeon RX Vega driver support code. Among other things, there are SR-IOV fixes for Vega 10 (a powerful GPU virtualization feature), clock-gating functions, multi-level VMPT support (four levels with Vega), and other fixes.
http://www.phoronix.com/scan.php?page=news_item&px=More-AMDGPU-Vega
 
That sounds like it would be one item, but I was thinking also of the microcode loaded by the GPU for internal load balancing parameters like the CU reservation and high-priority queue functionality that was retroactively applied to some of the prior generation GCN GPUs.
That is also relatively easy to accommodate for as basically every different GPU has its own version of it anyway.
To clarify, I am not saying the external bits are determining the fate of the internals, but that they can be reflective of peculiarities below the surface, such as evidenced by APIC IDs and Zen.
AFAIK the APIC IDs of CPU cores are set by the BIOS/UEFI (a different BIOS version can map them differently)*. That's not indicative of any underlying peculiarities. The same is very likely true for the shader engine and CU IDs. After all, a salvaged die with deactivated CUs doesn't show gaps in the IDs, does it? ;)
[edit]
*): Check a random BIOS and Kernel Developer Guide from AMD. They detail how the BIOS has to set the APIC IDs. It's definitely done by the BIOS. And I would say it works along the same lines with at least some of the IDs in GPUs.
[/edit]
There's a desire for economy in the external encoding, so it would tend to use as few bits as necessary for a design in the ISA, while having fewer than the internal usage would compromise the purpose of the instructions that return the external bits.
So when items deviate from the most straightforward use case, I wonder if there's a reason.
As the content of some read only registers are not exactly a core part of the ISA, it is extremely easy to return something different with a different chip. I don't see the problem. The internal distribution of stuff is most likely not relying on them. And even if it would, the work distribution is anyway specific to a certain size of the chip (combination of #engines, #CUs, #RBEs). That means for a differently sized chip this part has to be reworked anyway.
That runs counter to some of the goals for the GPU architecture for parallel scaling through copy-pasting more of the same resource. I wouldn't want to re-roll a CU sequencer based on whether it is going into a chip with 1 or 4 shader engines, or if a specific chip can have 1:4 DP in consumer and 1:2 in HPC. AMD marketed GCN's tunable DP rate, which should have been a trivial point if following the "any hardware can be redesigned to hit X rate", unless it's something they explicitly made tunable without that amount of effort.
What you duplicate are the CUs, or groups of CUs, RBEs, external interfaces and such. The work distribution and the crossbars between the different parts of the chips definitely have to be designed for the specific chip at hand (AMD claimed that the infinity fabric will make the crossbar part easier in the future as this gets a bit more modular than it used to be). And they include some flexibility to allow for different configuration in one design (to be able to salvage parts). But the work distribution for a single engine GPU looks of course different than the one for a dual engine GPU or a quad engine GPU. And the command processor(s) need to keep track of way more wavefronts/workgroups on larger chips too. There need to be differences between chips in these areas.
The different DP rates for the same chip in different markets are handled either by the driver limiting the throughput or by programming eFuses for a specific rate. Only the CUs need "to know" about this; it's completely irrelevant for the rest of the chip. There are simply no side effects (the same way an individual CU can be ignorant about the number of shader engines or CUs in the chip, it doesn't change its behaviour). And the "configurable" rate between different chips works by AMD having completed the principal design (of course not the physical design [that is started after the decision about the configuration of a chip], but just the hardware description in Verilog or an equivalent) for multiple DP rates and simply choosing which of these to implement in a specific chip. The different units and building blocks have specified interfaces, meaning that one can swap them out somewhat independently from each other (or at least more easily). But it doesn't mean one doesn't have to do some design work for each individual chip to make the building blocks work together.
I think I found what you are referencing. Is it the following from February?
https://github.com/llvm-mirror/llvm/commit/83c857cd3ae73d71958ccee8c43d55dc40ba3cc1
Yes.
My scenario isn't that it is impossible, just that it is past a threshold where the investment became non-trivial and in AMD's eyes not worth the effort.
AMD left GCN in something of a holding pattern for several cycles, so some economies of effort were made.
So we aren't that far off. That was my first post on this topic:
I'm still convinced there never was a hard limit of 4 shader engines. AMD just never implemented a design with more than four. But fundamentally, their approach would of course work with a higher number of engines. Obviously they assessed it wouldn't be worth the effort (diminishing returns without the transition to a more scalable setup and such). But there is nothing special about the number 4. Why should it be impossible to distribute stuff over 6 or 8 engines? It doesn't make the slightest sense to me.
;)
I just think that you underestimate the effort needed to design a differently sized chip in general. In my opinion, it wouldn't be that much more for a 6 or 8 engine design compared to the step from 2 to 4 engines, as some crucial parts need to be designed for the specific number of engines/CUs/RBEs/L2 tiles/whatever in the chip anyway. I still don't see a fundamental issue.


A bit old, but some hints, from the Linux drivers:
Code:
def FeatureVolcanicIslands : SubtargetFeatureGeneration<"VOLCANIC_ISLANDS",
  [FeatureFP64, FeatureLocalMemorySize65536,
   FeatureWavefrontSize64, FeatureFlatAddressSpace, FeatureGCN,
   FeatureGCN3Encoding, FeatureCIInsts, Feature16BitInsts,
   FeatureSMemRealTime, FeatureVGPRIndexMode, FeatureMovrel,
   FeatureScalarStores, FeatureInv2PiInlineImm, FeatureSDWA,
   FeatureDPP
  ]
>;

def FeatureGFX9 : SubtargetFeatureGeneration<"GFX9",
  [FeatureFP64, FeatureLocalMemorySize65536,
   FeatureWavefrontSize64, FeatureFlatAddressSpace, FeatureGCN,
   FeatureGCN3Encoding, FeatureCIInsts, Feature16BitInsts,
   FeatureSMemRealTime, FeatureScalarStores, FeatureInv2PiInlineImm,
   FeatureApertureRegs
  ]
>;

Added:
  • FeatureApertureRegs
Removed:
  • FeatureVGPRIndexMode
  • FeatureMovrel
  • FeatureSDWA
  • FeatureDPP
Interestingly there was no mention of packed FP16, just "Feature16BitInsts" like Volcanic Islands.
This list is nowhere near complete/accurate. GFX9 of course supports packed math, and the SDWA feature was of course not removed, as it is crucial for the 16-bit packed math (see the GDC talk linked above). And I seriously doubt AMD removed the DPP instructions either.
 
This list is nowhere near complete/accurate. GFX9 of course supports packed math, and the SDWA feature was of course not removed, as it is crucial for the 16-bit packed math (see the GDC talk linked above). And I seriously doubt AMD removed the DPP instructions either.
My take was that "ApertureRegs" rolls up those capabilities. Full crossbar and/or LDS mechanism in each SIMD. I'm in agreement with your points.
 
As the content of some read only registers are not exactly a core part of the ISA, it is extremely easy to return something different with a different chip.
AMD seems to value consistency for the reported values, and Vega does go through some amount of extra work to split VMCNT the way it does. It even takes care to leave room for the other fields--or doesn't want to take up part of the range for VALUCNT, for some reason.

I don't see the problem. The internal distribution of stuff is most likely not relying on them.
If the internal distribution is only ever going to be sized to a certain limit, why should the CU be able to report any more than that? The formats seem to be sufficient to fit all the sizes AMD has put forward for generations, and apparently two generations more.
If there is reworking necessary per size, Vega10 and Vega20 would apparently not need it.

What you duplicate are the CUs, or groups of CUs, RBEs, external interfaces and such. The work distribution and the crossbars between the different parts of the chips definitely have to be designed for the specific chip at hand (AMD claimed that the infinity fabric will make the crossbar part easier in the future as this gets a bit more modular than it used to be).
The context I got from the discussion on the Infinity Fabric's role in the SoC was that it was focused on the interconnect between the various separate portions of an APU, rather than internally. The internal crossbar used for AMD's CPUs has a long lineage, and that particular element has been cited as a scaling barrier, and the later APUs that tried to fit different IP onto it were described as being more chaotic. There has been a limited range in client counts for that crossbar since K8, although the coherent domain is a more demanding use case.

I'm not sure that the fabric is going to cover the work distribution or arbitration facet I was focused on.
The data fabric does not intrude into the CCXs in Zen, and my interpretation of the limited description of Vega's implementation has it hooked into the memory controllers and L2, outside of the engines. Even without that, the interconnect is supposed to be agnostic to the particulars of the clients it is connecting. The controllers in the various blocks and their hardware or microcode settings would do the work, and those are populated by a number of proprietary cores or dedicated sequencers.

And they include some flexibility to allow for different configuration in one design (to be able to salvage parts). But the work distribution for a single engine GPU looks of course different than the one for a dual engine GPU or a quad engine GPU.
Couldn't the flexibility for salvage be used to provide downward scalability for the same hardware block? If the GPU can transparently disable and remap units, couldn't the same block interface with hardware where there are units "disabled" due to their not existing?

And the command processor(s) need to keep track of way more wavefronts/workgroups on larger chips too. There need to be differences between chips in these areas.
The command processors themselves are proprietary F32 cores, per some linked-in info and the PS4 hack. The graphics command processor is a set of multiple custom cores, which is an arrangement whose lineage goes back to the VLIW days. I think Cypress had a brief blurb about that portion once. I think the actual tracking and arbitration happens in another stage. The command front end versus back-end seems to vary pretty significantly, going by the ratio of front-end to back-end for Kaveri, the consoles, and Hawaii. The flexibility or size of the available microcode store seems to have a significant effect on what features they can offer.

The different DP rates for the same chip in different markets are handled by either the driver limiting the throughput or they are likely programmed using eFuses for a specific rate. Only the CUs need "to know" about this, it's completely irrelevant for the rest of the chip.
How do the CUs for the same implementation "know" what to do, even with a fuse blown, without either having the logic for the different rates or internal sequencing information on hand?
Or, if that is possible between two settings on one implementation, why not design a configurable sequencer with the additional rates and then reuse it, at least within the same generation?

If the base pipeline isn't smart enough for a function or new design option, just add another stall or require the programmer to add a NOP.
Even between generations, the persistence of various wait states or undetected aliasing seems to point to there being a few elements of the CU's architecture that AMD decided were "good enough" for at least 4 revisions of the architecture. The scalar memory pipeline and flat addressing require waitcnts of 0, and haven't merited an interlock for several generations. DPP has forwarding and EXEC delays, rather than the CU being redesigned to handle it.

I just think that you underestimate the needed efforts to design a different sized chip in general.
Perhaps I am overestimating how much AMD is willing to economize effort. I see benefits in having blocks with a certain size or expandability defined once, and then sharing them across the generation and semicustom products with some of the basic questions and interfaces fixed. (edit: And for purposes of further saved effort or continuity, some of those might carry across gens)
If certain stages are to be customized, they can be extended with microcode or internal program stores, perhaps allowing isolation between custom lines (AMD had problems with this kind of firewall, given the console leaks, though).

If the hope is that the replacement architecture comes before the upper bounds of the blocks or spare code storage are reached, then why expand the blocks or context?
 
Then Fudzilla came up with news saying some Vega cards would be coming with GDDR5/X. I thought AMD would want the big Vega 10 chip to go into high-margin graphics cards, so if such a GDDR5 Vega existed, it would be Vega 11 based. I eventually put aside the "premium midrange" idea, and thought Vega 11 would either replace Polaris 10 with 32 CUs or sit midway between Polaris 10 and Vega 10 (e.g. 2/3rds of a Vega 10, like the GP104 -> GP102 difference).

But then during the Vega tease presentation to the press, AMD not only spoke exclusively about HBM2 for Vega, but also seemed to emphasize its importance by calling it a "High Bandwidth Cache" and presenting the new memory controller that is so dependent on it.
Furthermore, AMD also doubled down on the "we don't really need a lot of VRAM" narrative for Vega. For an 8GB card, this narrative wouldn't be needed... so I guess there are definitely some 4GB HBM2 cards coming.

So I'm back to the idea of Vega 11 being a "premium midrange". Half a Vega 10 (~260 mm^2?) with a single HBM2 stack.
Actually, isn't the PS4 Pro GPU a Vega chip without HBM2 (and also lacking the tile-based rendering) in disguise? The shader cores seem the same (capable of FP16 packed math), and the scheduler and geometry engine are also Vega-like...
 
But then during the Vega tease presentation to the press, AMD not only spoke exclusively about HBM2 for Vega, but also seemed to emphasize its importance by calling it a "High Bandwidth Cache" and presenting the new memory controller that is so dependent on it.
Furthermore, AMD also doubled down on the "we don't really need a lot of VRAM" narrative for Vega. For an 8GB card, this narrative wouldn't be needed... so I guess there are definitely some 4GB HBM2 cards coming.
It might also be that AMD sees Vega's main rivals as the 12 GB Titan X and 11 GB GTX 1080 Ti. If their automatic memory paging solution makes an 8 GB Vega equal to a 16 GB card without it, then they definitely need to market it as such. Otherwise people will believe that the competitor's 11 GB card is more future-proof than their 8 GB card, while it might actually be the other way around.

I am sure we will also see 4 GB low/mid-tier Vega cards eventually, especially if Vega's automatic memory paging works well. Nvidia is selling a 3 GB GTX 1060, and that is a recent upper-mid-tier product. There's nothing wrong with 4 GB models below the GTX 1060 performance class. 4 GB is still perfectly fine for 1080p gaming.
 
Actually, isn't the PS4 Pro GPU a Vega chip without HBM2 (and also lacking the tile-based rendering) in disguise? The shader cores seem the same (capable of FP16 packed math), and the scheduler and geometry engine are also Vega-like...

The FP16 instructions, some geometry improvements, and the work distributor were the supposed future features.
Unmentioned: ROPs becoming memory clients, data fabric, more complex memory controller, or specific measures like primitive shaders. (ed: virtualization features, which Sony doesn't seem into anyway)
While I am now having trouble finding the source, I think I remember that the setup pipeline's ability to launch wavefronts was enhanced with Vega.
Also not mentioned are items like the still-vague clock and IPC enhancements of Vega's NCU--the Pro's clocks are a far cry from the rumored clocks of Vega.

There are some elements like the ID buffer that could have similar logic to what feeds into the binning rasterizer, and conversely I am not sure if there's an element disclosed so far in Vega that matches that. Sony also would have kept the custom tweaks it made that I am not sure carried over to the mainline like the volatile flag.

However, I am still not convinced the PS4 Pro's GPU has all of the features that became standard before Vega. GCN's encoding changed significantly in spots, and new items like scalar writes, access to sub-sections of 32-bit registers, and vector cross-lane operations arrived after Sea Islands, plus there are a few instructions the PS4's GPU may have that were since dropped.
 
AMD seems to value consistency for the reported values, and Vega does go through some amount of extra work to split VMCNT the way it does. It even takes care to leave room for the other fields--or doesn't want to take up part of the range for VALUCNT, for some reason.
I don't see a reason not to do exactly the same with the HW_ID bits. And there is plenty of space for just adding a different set of bits for newer versions. Only 7 hardware registers are assigned in that manual (1 to 7, 0 is reserved), but the ISA leaves room for 64. AMD could simply define HW_ID_ext as register #8 with a completely new set of bits. I still see no argument there.
And the extra work is mainly necessary for the software to glue the bits together. A fixed "swizzle" between some bits of a register just requires crossing the data lines. That effort is not worth mentioning.
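As a minimal sketch of what "gluing the bits together" looks like on the software side (assuming the GFX9-style split of vmcnt in the s_waitcnt immediate: low part in bits [3:0], high part in bits [15:14], with expcnt/lgkmcnt in between); in hardware this really is just crossed wires:
Code:
# Illustrative helpers (not driver code): pack/unpack a 6-bit vmcnt that is split
# across simm16[3:0] and simm16[15:14], as in the GFX9 LLVM definitions.
def encode_vmcnt(simm16: int, vmcnt: int) -> int:
    lo = vmcnt & 0xF             # vmcnt[3:0]  -> simm16[3:0]
    hi = (vmcnt >> 4) & 0x3      # vmcnt[5:4]  -> simm16[15:14]
    return (simm16 & ~0xC00F) | lo | (hi << 14)

def decode_vmcnt(simm16: int) -> int:
    return (simm16 & 0xF) | (((simm16 >> 14) & 0x3) << 4)

assert decode_vmcnt(encode_vmcnt(0, 63)) == 63   # full 6-bit range round-trips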
If the internal distribution is only ever going to be sized to a certain limit, why should the CU be able to report any more than that?
Because AMD seems to value consistency across generations? ;)
The formats seem to be sufficient to fit all the sizes AMD has put forward for generations, and apparently two generations more.
If there is reworking necessary per size, Vega10 and Vega20 would apparently not need it.
The main task of the needed rework is not changing a few bits in some registers. It's actually designing the work distribution and all the crossbars in the chip to the needed size. The effort for the bits are irrelevant in comparison.
The context I got from the discussion on the Infinity Fabric's role in the SoC was that it was focused on the interconnect between the various separate portions of an APU, rather than internally. The internal crossbar used for AMD's CPUs has a long lineage, and that particular element has been cited as a scaling barrier, and the later APUs that tried to fit different IP onto it were described as being more chaotic. There has been a limited range in client counts for that crossbar since K8, although the coherent domain is a more demanding use case.
The crossbars in GPUs have way more clients (just look at the connections between all the L1 caches [there are 96 separate ones!] and the L2 tiles [everything has a 512-bit wide connection, btw.]). And isn't AMD on record that it will use some Infinity Fabric derived thingy to connect everything on the chip to the L2 tiles/memory controllers in Vega? AMD wants to go from the massive crossbar switches (which need an expensive redesign for any significant change to the number of clients) to a more easily scalable NoC configuration. That's quite an issue for the simplicity of scaling and one of the main efforts when going to larger GPUs.
I'm not sure that the fabric is going to cover the work distribution or arbitration facet I was focused on.
Not entirely of course. But all of them have to be adapted to the size of the chip (and the work/the associated data has to get somehow to the CUs in order to be distributed, right?).
The data fabric does not intrude into the CCXs in Zen, and my interpretation of the limited description of Vega's implementation has it hooked into the memory controllers and L2, outside of the engines. Even without that, the interconnect is supposed to be agnostic to the the particulars of the clients it is connecting.
But a crossbar (and the effort needed for it) is not agnostic to the number of attached clients, quite the contrary actually. The duplicated "building blocks", i.e. the individual CUs (or the groups sharing L1-sD$ and L1-I$), the RBEs, and even the L2 tiles and memory controllers, are agnostic to the particulars of the connection between the parts (I mentioned that before). And it is exactly this connection (and of course also how to distribute stuff between the duplicated units) which has to be reworked for a differently sized chip. This is supposed to be improved upon with the Infinity Fabric.
The controllers in the various blocks and their hardware or microcode settings would do the work, and those are populated by a number of proprietary cores or dedicated sequencers.
If you scale a chip from 10 CUs (with 16 L1$ and 4 L2-tiles, i.e. you have a 16x4 port crossbar between them [as mentioned, every port is capable of 512bit per clock]) to 64 CUs (96 L1$, 32 L2 tiles, i.e. a 96x32 port crossbar), there is no way the "controllers in the various blocks and their hardware or microcode settings would do the work". You have to put a significant effort in to design that stuff and to make it work.
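To put a crude number on that (just counting crosspoints at the port counts and 512-bit width quoted above; illustrative only, not a statement about the actual implementation):
Code:
# Crude crosspoint count for the L1<->L2 crossbar.
def crosspoints(l1_ports, l2_tiles, width_bits=512):
    points = l1_ports * l2_tiles
    return points, points * width_bits       # (crosspoints, total wire bits)

small = crosspoints(16, 4)     # ~10 CU chip:  (64, 32768)
big   = crosspoints(96, 32)    # 64 CU chip:   (3072, 1572864)
print(big[0] / small[0])       # 48.0 -> the switch grows ~48x while the CU count grows ~6x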
Couldn't the flexibility for salvage be used to provide downward scalability for the same hardware block? If the GPU can transparently disable and remap units, couldn't the same block interface with hardware where there are units "disabled" due to their not existing?
Within reason, of course. As the effort for large crossbars (which is what we have in GPUs up to now) goes up tremendously with the number of clients, there are strong incentives to put smaller versions in smaller chips. This stuff is definitely not left untouched between differently sized chips.
The command processors themselves are proprietary F32 cores, per some linked-in info and the PS4 hack. The graphics command processor is a set of multiple custom cores, which is an arrangement whose lineage goes back to the VLIW days. [..] The command front end versus back-end seems to vary pretty significantly, going by the ratio of front-end to back-end for Kaveri, the consoles, and Hawaii. The flexibility or size of the available microcode store seems to have a significant effect on what features they can offer.
As this was obviously extendable over the range you mentioned (starting with 3/4 SIMD engines all the way up to 64 CUs with 256 SIMD engines in total and from a single to 4 shader engines), I doubt that the number of shader engines is a hard wall on this front.
How the CUs for the same implementation "know" what to do, even with a fuse blown without either having the logic for the different rates or internal sequencing information on hand?
A fuse is basically a bit which can set certain things. And then it is simple. Each vALU instruction has a certain fixed latency, usually identical to the throughput (4 cycles for a full rate instruction, 8 cycles for half rate and so on; exceptions exist). After decoding an instruction, the scheduler in the CU "knows" the throughput number for the instruction and pauses instruction issue on the respective port for the corresponding number of cycles. What happens if the decoder is "told" by a set bit (blown fuse) that the throughput is different from that of the physically implemented hardware? ;) To give a specific example, the consumer parts for Hawaii (which is physically a half rate double precision chip) have a fuse blown (or potentially some µcode update done by the firmware) which sets this to 1/8 rate for the DP instructions. The chip effectively pauses the vALU for some cycles after each DP instruction.
Other chips have units which can do only 1/16 DP rate. The respective throughput number attached to the instructions in the decoder reflect that of course. In other words: Units with different DP rates are physically different, also in the same generation. But to restrict the throughput down is pretty easy.
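A toy model of that "throughput number attached to the instruction, overridden by a fuse" idea (hypothetical Python, purely illustrative):
Code:
# Physical issue intervals of a Hawaii-like chip: FP32 full rate (4 cycles),
# FP64 half rate (8 cycles). A blown fuse tells the decoder to pretend DP is
# slower than the hardware could do, e.g. 1/8 rate -> 4*8 = 32 cycles per DP op.
ISSUE_CYCLES = {"v_add_f32": 4, "v_mul_f64": 8}

def issue_interval(op, dp_rate_fuse=None):
    cycles = ISSUE_CYCLES[op]
    if op.endswith("_f64") and dp_rate_fuse is not None:
        cycles = max(cycles, 4 * dp_rate_fuse)
    return cycles

print(issue_interval("v_mul_f64"))      # 8  -> professional SKU at half rate
print(issue_interval("v_mul_f64", 8))   # 32 -> consumer SKU fused down to 1/8 rate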
Or, if that is possible between two settings on one implementation, why not design a configurable sequencer with the additional rates and then reuse it, at least within the same generation?
What sequencer? The one sequencing instructions from the 10 instruction buffers within a CU? As I said, the CUs themselves are of course reused. If you follow the logic above, no real hardware change is needed to allow for different execution rates of specific instructions (basically a change to some kind of LUT with the throughput numbers is enough, probably at least partially hardwired).
If you talk about something in the command processor/work distribution, these parts don't know anything at all about execution rates.
Perhaps I am overestimating how much AMD is willing to economize effort. I see benefits in having blocks with a certain size or expandability defined once, and then sharing them across the generation and semicustom products with some of the basic questions and interfaces fixed. (edit: And for purposes of further saved effort or continuity, some of those might carry across gens)
It's very much about economizing the effort. But you can't just build a large GPU out of the design of a small CP, some compute pipe, a CU, an RBE, and an L2 tile with a connected 32-bit memory channel. All of these blocks have a somewhat standardized interface to the outside (at least within a generation); one can multiply these blocks, and AMD is indeed doing that. But this stuff needs to get interconnected and made to work together. That is the hard part of scaling to a different chip size. And while AMD wants to make it easier for themselves in future iterations, so far you have to redo quite a bit of it for each different chip.
 