AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

We don't even know if there is a big Vega, or whether it's just something that already existed and got renamed, like Greenland being brought over to the desktop.
 
Honestly, considering Koduri's words a couple of months ago, I think one doesn't take too many risks assuming Vega 10 will be the only big die coming from AMD.
 
Yup. Vega 10 should probably land between GP104 and GP102, and Vega 11 between GP106 and GP104.
 
Honestly, considering Koduri's words a couple of months ago, I think one doesn't take too many risks assuming Vega 10 will be the only big die coming from AMD.

Yeah, I have a sneaking suspicion that GCN is limited to 64 CUs and that the limit is relatively difficult to overcome, so we won't get anything materially bigger than Vega 10 (hence Vega 20). I hope I'm wrong.
 
I remember reading that AMD commented on the naming, saying that the number doesn't reflect the size but the order in which development started - this was in response to people saying Vega 10 would be the big one and Vega 11 the small one (like Polaris).

I don't see why they'd be limited to 64 CUs; what stops them from having six SEs and 96 CUs?

I personally assumed Vega 11 would be big because I assumed Vega and Polaris would both make up the GCN4 product stack, with Vega 10 and 11 being equivalent to GP104 and GP102 respectively.

I was actually going to ask the same thing.


I've heard folks mention Vega 11 as the big Vega, but I can't recall a source.


I'm sorta disappointed that AMD can't design a consistent codename system. Their post-Polaris system is an improvement, but it's not quite where it needs to be. We're really pretty spoiled by Nvidia's codenames.


Nvidia's GPU naming scheme is perfect :p Simple and clear.
 
I remember reading that AMD commented on the naming, saying that the number doesn't reflect the size but the order in which development started - this was in response to people saying Vega 10 would be the big one and Vega 11 the small one (like Polaris).

I recall that as well. I consider it a mistake in AMD's decision making.

Nvidia has a clear structure where you know that a G@##0 (e.g. GM200, GK110) is a 250ish W halo chip. Similarly, G@##4 (e.g. GP104) is a 150ish W performance chip. It goes down the line cleanly.

AMD could've done the same thing with the last digit of their codename numbers. Instead, they publicly said that the numbers are meaningless. There's no link between Polaris 10 and Vega 10; they cover entirely separate market segments.



I don't see why they'd be limited to 64 CUs; what stops them from having six SEs and 96 CUs?

Apparently GCN 1.1 (second-gen GCN) was limited to 4 SEs, and I don't think they have gone past 4 since. There's no reason why they couldn't in the long term, but they haven't for whatever reason. I hope it's a harmless coincidence that Vega 10 and 20 potentially still only have 4 SEs. I may be misunderstanding something, as I'm just a layman and this is starting to get relatively low-level.

http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/4

Looking at the broader picture, what AMD has done relative to Hawaii is to increase the number of CUs per shader engine, but not changing the number of shader engines themselves or the number of other resources available for each shader engine. At the time of the Hawaii launch AMD told us that the GCN 1.1 architecture had a maximum scalability of 4 shader engines, and Fiji’s implementation is consistent with that. While I don’t expect AMD will never go beyond 4 shader engines – there are always changes that can be made to increase scalability – given what we know of GCN 1.1’s limitations, it looks like AMD has not attempted to increase their limits with GCN 1.2. What this means is that Fiji is likely the largest possible implementation of GCN 1.2, with as many resources as the architecture can scale out to without more radical changes under the hood to support more scalability.
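
As a quick sanity check on that scaling argument, the CU ceiling is just SEs times CUs per SE (the 6-SE configuration below is purely the hypothetical from the question above, not anything announced):

```python
# Maximum CU count is shader engines * CUs per shader engine.
def max_cus(shader_engines, cus_per_se):
    return shader_engines * cus_per_se

print(max_cus(4, 11))  # Hawaii (GCN 1.1): 44 CUs
print(max_cus(4, 16))  # Fiji (GCN 1.2): 64 CUs, the apparent ceiling
print(max_cus(6, 16))  # hypothetical 6-SE design: 96 CUs
```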
 

They better

I think that's just referring to the fact that they would need more SEs to have more shaders. Fiji is already held back by its front end by a significant amount; simply adding 25% more CUs to each SE would yield extreme diminishing returns in games.

Speaking of front ends, does anyone know what the 'Beyond3D Test Suite' is? It's used by several hardware websites, and its polygon throughput test shows Nvidia cards pushing more polys than they can in theory - black magic? It must be counting triangles that were discarded or something, but Polaris doesn't exhibit this behavior, and it does have primitive discard logic.
 
Nvidia has a clear structure where you know that a G@##0 (e.g. GM200, GK110) is a 250ish W halo chip. Similarly, G@##4 (e.g. GP104) is a 150ish W performance chip. It goes down the line cleanly.
That decision making is really only relevant to marketing, and those internal chip codes don't really matter for the board names the GPUs are given. That's where the differentiation is made.
 
But it had occurred to me that AMD could do a bait and switch on us: what if a wavefront is really small, say 16 work items, and a workgroup is at least 4 wavefronts?

You mean a multiple of 4; otherwise instruction latency has to be a non-multiple of 4, which is ... a rather fundamental change.
 
GCN executes 64-wide waves on a 16-wide SIMD. You don't need to find independent instructions: lanes (0-15), (16-31), (32-47), and (48-63) of the same register are guaranteed to be independent. That gives you four cycles' worth of independent work to issue.
I get that this is the case. What I was referencing was how the unit seemingly behaves at a pipeline rather than software level, and whether that has physical implications for clock speed. Not pipelining an operation can reduce transistor count, area, and latency at the cost of things like unit throughput or the distribution of complexity among pipeline stages.
At a higher level the SIMD acts in some ways like a partly pipelined unit with fractional throughput, although internally it may be fully pipelined.

If one goal is more clock speed, there's an overall pipeline of undetermined length with a wad of "something" that must complete in the equivalent of 4 stages.
If it is subdivided internally into fewer stages, or skewed so that some of its stages get more timing budget than others, that bigger chunk of work per stage can become a limiter. One knob for increasing the clock, if that unit becomes the critical path, would be to increase the number of stages, but the number dictated by the cadence remains 4 (unless clock-multiplying like Fermi?).

One possible organization for the SIMD pipeline would be a derivation of the VLIW 4-cycle cadence defined in the ISA docs for R600 and Evergreen: three cycles to get three operand registers and one cycle to execute.
If this were an idealized pipeline, each stage would have as much work to do as any other with as little slack as possible, and presumably each operand fetch gets one stage.
This might become relevant to the goal of increasing register capacity by 50%. As storage capacity increases, access time increases, which lengthens the delay of a register access stage. A perfectly balanced pipeline would then see its cycle time increase to match. Larger caches scale delay by roughly the square root of the capacity increase, but register files may behave differently.

That implementation also differs from Skylake, since the CPU's register file is heavily multiported and uses its wide OoO engine to utilize register bandwidth more heavily. The static stage-per-operand method would cause underutilization, depending on operand count, in a single-issue SIMD.
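
To make the quoted cadence concrete, here's a minimal sketch that treats a 64-wide wave as four 16-wide phases issued on consecutive cycles (illustrative only; how the unit is actually pipelined internally is exactly the open question above):

```python
# Issue one 64-wide GCN wave on a 16-wide SIMD: four phases, one per cycle.
WAVE_WIDTH = 64
SIMD_WIDTH = 16

def issue_wave(op, dst, src_a, src_b):
    """Apply a per-lane op over a 64-wide wave, 16 lanes per cycle."""
    for phase in range(WAVE_WIDTH // SIMD_WIDTH):  # 4 phases = 4 cycles
        lo = phase * SIMD_WIDTH
        hi = lo + SIMD_WIDTH
        # Lanes lo..hi-1 are independent of every other phase's lanes,
        # so no dependency checking is needed between phases.
        for lane in range(lo, hi):
            dst[lane] = op(src_a[lane], src_b[lane])

a = list(range(64)); b = [2] * 64; d = [0] * 64
issue_wave(lambda x, y: x * y, d, a, b)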
 
That decision making is really only relevant to marketing, and those internal chip codes don't really matter for the board names the GPUs are given. That's where the differentiation is made.

You're right that it makes no "functional" difference, and in today's world codenames only "matter" for marketing purposes.

I think it's also relevant to those speculating (i.e. this forum). Like when all of the Pascal codenames leaked almost a year ago. Everything was expected except GP102. Just seeing "GP102" made people go, "Hmmm, that's new. I wonder why that's necessary." And I know of at least one friend that predicted that between GP100 and GP102, one would use HBM and the other would use GDDR5X. Meanwhile on the AMD side, we're still not 100% sure whether Vega 11 is bigger than Vega 10 (and now with "Vega 20" rumors, that distinction matters if we assume Vega 20 is an updated Vega 10).

Obviously AMD saw some amount of value in re-codenaming Ellesmere and Baffin under the Polaris banner despite them already having perfectly functional codenames, so there must be some measurable marketing benefit. I'm just disappointed that they didn't use that opportunity to move to a codename structure as descriptive as Nvidia's.
 
Well, not really; I think many of us were expecting a GP102, since GP100 just didn't make sense in the consumer market. If Navi has been pushed back, it makes sense that Vega 11 is a bigger chip than Vega 10; if Navi hasn't been pushed back, most likely Vega 11 is a smaller chip. I don't think Navi has been pushed back as of yet, and outside of rumors I don't think there is anything else to go by.
 
I doubt Vega is limited to 64 CUs; it's just inconvenient to exceed. Currently everything is a nice power of two, and adjusting that would likely require a crossbar for the ROPs to line up with memory channels, especially with HBM bus widths. It would also complicate the hardware scheduling: you'd again exceed a power of two, and the scheduling units would likely need to grow to track everything. All of those options would cost power that may be better spent elsewhere. It would seem more appropriate to stick with 64 CUs and beef up the SIMDs in each CU. Tonga seems to have been the test case for going beyond 64, even though it didn't have 64 CUs. Decoupling some of the clocks so segments of the chip can run faster would seem a likely solution. It's also possible that variable-sized SIMDs and high-powered scalar units could be more efficient at processing geometry, if they go that route.
 
does anyone know what the 'Beyond3D Test Suite' is? It's used by several hardware websites, and its polygon throughput test shows Nvidia cards pushing more polys than they can in theory - black magic? It must be counting triangles that were discarded or something, but Polaris doesn't exhibit this behavior, and it does have primitive discard logic.
All the triangles are culled as if they are back face culled.
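
If that's the trick, the above-theoretical numbers make sense: a back-facing triangle can be rejected from its screen-space winding alone, well before rasterization. A minimal sketch of the standard winding test (generic math, nothing specific to the B3D suite):

```python
def signed_area_2x(v0, v1, v2):
    """Twice the signed area of a screen-space triangle; the sign gives the winding."""
    return (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1])

def is_backfacing(v0, v1, v2, front_ccw=True):
    area = signed_area_2x(v0, v1, v2)
    return area < 0 if front_ccw else area > 0

# A discard test this cheap can retire culled triangles faster than the
# rasterizer could ever emit them, hence above-theoretical poly rates.
print(is_backfacing((0, 0), (1, 0), (0, 1)))  # False: counter-clockwise, front-facing
print(is_backfacing((0, 0), (0, 1), (1, 0)))  # True: clockwise, back-facing
```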
 
One possible organization for the SIMD pipeline would be a derivation of the VLIW 4-cycle cadence defined in the ISA docs for R600 and Evergreen: three cycles to get three operand registers and one cycle to execute. If this were an idealized pipeline, each stage would have as much work to do as any other with as little slack as possible, and presumably each operand fetch gets one stage.
If it really needs 3 cycles to fetch the 3 registers needed for a multiply-add, it only has 1 cycle left to execute. That could be one reason why the clocks cannot be pushed much beyond 1 GHz. But why couldn't it fetch all three registers in parallel? I am not a hardware engineer, so I don't understand the limitations here. Limited register file bandwidth? Bank conflicts?
This might become relevant to the goal of increasing register capacity by 50%. As storage capacity increases, access time increases, which lengthens the delay of a register access stage. A perfectly balanced pipeline would then see its cycle time increase to match. Larger caches scale delay by roughly the square root of the capacity increase, but register files may behave differently.
A 50% increase in registers per SIMD = 96 KB of registers per SIMD. That could be hard if AMD is already spending 3 cycles fetching registers. Could they do something like Nvidia did with Maxwell: introduce a register file that is more heavily banked (more bank conflicts, but higher bandwidth) and at the same time introduce an L0 register cache (reusing recent results directly instead of writing them back and reading them again)? This worked very well for Maxwell. Kepler's register file had fewer bank conflicts, but Maxwell avoided them through register caching. I don't know whether this approach suits a fixed-length (4-cycle) instruction latency, though.

Access time increase (of local memories) is the reason why I don't see simply fattening the CU to hold 6 SIMDs as a good idea. The L1 cache and LDS would both become slower to access (and consume more power). A fat 6-SIMD CU of course wouldn't increase register access time, as each SIMD has its own separate 64 KB of registers. But a fat CU wouldn't bring performance increases either (except in some cases where the LDS/wave resource allocation doesn't split the CU resources evenly). A fatter CU would of course allow larger thread groups, but current APIs restrict the thread group size to 1024. A current GCN CU could already fit 2560-thread groups if the API allowed it (unless some minor hardware limitation prevents it).

But I don't see much use for larger thread groups: at 1024 threads, LDS capacity per thread is already often a limiting factor, and synchronization (barriers) causes too many stalls. Smaller thread groups are more efficient. I am perfectly happy with the thread group sizes we have now, but I would be happy to have more LDS and more registers available for 1024-thread groups to make them more useful. Pascal P100 doubled the available LDS and registers per thread; hopefully AMD can at least bring some increase.
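
To put numbers on those per-thread budgets, a quick sketch using the standard GCN per-CU figures mentioned above (64 KB of LDS per CU, 4 SIMDs with 64 KB of vector registers each):

```python
LDS_BYTES_PER_CU = 64 * 1024        # 64 KB of LDS shared by the whole CU
VGPR_BYTES_PER_CU = 4 * 64 * 1024   # 4 SIMDs x 64 KB of vector registers each

for group_size in (256, 512, 1024):
    lds = LDS_BYTES_PER_CU / group_size
    vgpr = VGPR_BYTES_PER_CU / group_size
    print(f"{group_size:4d} threads: {lds:5.0f} B LDS, {vgpr:5.0f} B VGPRs per thread")
# At 1024 threads a group gets only 64 B of LDS per thread (assuming one
# group owns the whole CU's LDS), which is why smaller groups often win.
```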
 
Vega 10 has 750 GFLOPs of double-precision compute power (that's 1/16 of SP). Interestingly, that will change with Vega 20, which should have a ratio of 1/2 of SP. For that reason, some GPUs (like Hawaii) that offer good double-precision performance will still be around next year (check the roadmap below). So as you can see, Vega 10 will compete with Pascal GP100 in half- and single-precision compute.

AMD is also developing its NVLink alternative, called xGMI, for peer-to-peer GPU communication. It will be available with the Naples architecture based on Zen and with Vega 20, which is currently expected in the second half of 2018.

http://videocardz.com/63715/amd-vega-and-navi-roadmap
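
A quick check that the 1/16 ratio lines up with the 12 TFLOPs SP figure quoted later in the thread (rates taken from the roadmap numbers above):

```python
DP_FLOPS = 750e9        # rumored Vega 10 double-precision rate
DP_TO_SP_RATIO = 16     # 1/16-rate DP per the roadmap (Vega 20 is said to be 1/2)
print(DP_FLOPS * DP_TO_SP_RATIO / 1e12)   # 12.0 TFLOPs single precision
```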
 
If it really needs 3 cycles to fetch the 3 registers needed for a multiply-add, it only has 1 cycle left to execute. That could be one reason why the clocks cannot be pushed much beyond 1 GHz. But why couldn't it fetch all three registers in parallel? I am not a hardware engineer, so I don't understand the limitations here. Limited register file bandwidth? Bank conflicts?
Each cell would need an additional transistor per port, and the array would need the address decoding and sense amplifiers to select and read for each port, plus the complexity of arbitrating accesses to the same location.
Increasing cell size hurts density, and the number of wires and the array overhead rise as well. Delay per access becomes longer too, so it is not an automatic gain. Having the rest of the time slice to do other things could compensate for it.

The relatively smaller capacities, higher power budgets, and performance demands of high-ILP CPUs make this a more acceptable trade-off. Even then, the porting and wiring demands can be so high that the register files are duplicated, which would be a massive capacity loss for a GPU. Stronger transistors for higher clocks also increase area, so a design that goes for the densest SRAM arrays is going to have problems keeping up.

One possible way GCN's register files could be banked that would make sense with the 16-wide SIMDs and the 4-cycle cadence is to subdivide the registers into 4 16-wide banks. That removes conflicts between phases and the need for multiple ports, and it would match what the hardware is doing. It would also be a variation on the R600 register access method, which took 3 cycles while gathering operands from 4 banks.

A 50% increase in registers per SIMD = 96 KB of registers per SIMD.
It's probably banked so that only a fraction of it needs to be accessed in a given cycle, so not as bad as the full amount, though probably with some overhead.
TSMC mentioned that its FinFET process allowed its SRAMs to become denser, since the stronger transistors allow 512 cells to hang off the same bit line. The prior node allowed 256 per line, which limits how many individual rows can be addressed in a single array. Perhaps this could allow a future GPU to increase the number of rows in its register files, caches, and LDS.

http://semiengineering.com/ibm-intel-and-tsmc-roll-out-finfets/

"SRAM speed has been improved by greater than 25%. “This improvement allows the use of a 512 bits per bit-line scheme instead of a 256 bits per bit-line scheme to reduce the periphery circuit size,” according to TSMC."

That could be hard if AMD is already spending 3 cycles fetching registers. Could they do something like Nvidia did with Maxwell: introduce a register file that is more heavily banked (more bank conflicts, but higher bandwidth) and at the same time introduce an L0 register cache (reusing recent results directly instead of writing them back and reading them again)? This worked very well for Maxwell. Kepler's register file had fewer bank conflicts, but Maxwell avoided them through register caching. I don't know whether this approach suits a fixed-length (4-cycle) instruction latency, though.
The register reuse cache is also controlled by flags in the instruction stream, which, along with stall counts, complicates the software side. Still, the need to expose forwarding latency and value reuse does not appear to be prohibitive for Nvidia (or others) in the various market segments.
The operand collectors are probably there to utilize register bandwidth more fully when individual instructions don't need all their operands at the same time.
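
As an illustration of that banking idea, here's a toy model where registers are striped across 4 banks and each bank supports one read per cycle: well-spread operands for a multiply-add fetch in a single cycle, while same-bank operands serialize. This is a sketch of the concept only, not AMD's actual layout:

```python
# Model a register file split into 4 banks; register r lives in bank r % 4.
NUM_BANKS = 4

def fetch_schedule(operand_regs):
    """Return cycle -> operands readable that cycle, one read per bank per cycle."""
    schedule = []
    pending = list(operand_regs)
    while pending:
        used_banks, this_cycle, leftover = set(), [], []
        for reg in pending:
            bank = reg % NUM_BANKS
            if bank in used_banks:
                leftover.append(reg)      # bank conflict: retry next cycle
            else:
                used_banks.add(bank)
                this_cycle.append(reg)
        schedule.append(this_cycle)
        pending = leftover
    return schedule

print(fetch_schedule([0, 1, 2]))  # [[0, 1, 2]] - conflict-free MAD, 1 cycle
print(fetch_schedule([0, 4, 8]))  # [[0], [4], [8]] - all hit bank 0, 3 cycles
```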
 
12 TFLOPs of SP in the professional market, which should translate to around 14 TFLOPs on desktops.

That's too high a clockspeed unless AMD is doing three ops per shader per clock, à la those old Nvidia cards. In theory, at least.
 
12 TFLOPs of SP in the professional market, which should translate to around 14 TFLOPs on desktops.

That's too high a clockspeed unless AMD is doing three ops per shader per clock, à la those old Nvidia cards. In theory, at least.

It's not really the 12 TFLOPs that surprises me, as it was expected to be in this range, but the 64 CUs, which would put the GPU at a shader count similar to Fiji's (4096), and Fiji was at 8.6 TFLOPs (FP32). Of course, this would need a serious increase in clockspeed (to roughly 1.4 GHz or more). I have some doubt about what we're reading as Vega specs; things could be badly mixed up before being published.

Even their "xGMI", who is not new, it was allready presented by AMD ( AMD GMI =global memory Interconnect ). http://www.fudzilla.com/news/processors/38402-amd-s-coherent-data-fabric-enables-100-gb-s
 