I've commented on GPU makers' divergence from physical reality before, which is why I mostly compare these kinds of aspirational data points within a vendor, so there's a better chance of keeping the figures iso-delusional.
Honestly, considering Koduri's words a couple of months ago, I think one doesn't take too many risks assuming Vega 10 will be the only big die coming from AMD.
I was actually going to ask the same thing.
I've heard folks mention Vega 11 as the big Vega, but I can't recall a source.
I'm sorta disappointed that AMD can't design a consistent codename system. Their post-Polaris system is an improvement, but it's not quite where it needs to be. We're really pretty spoiled by Nvidia's codenames.
I remember reading that AMD commented on the naming, and that the number doesn't reflect the size but the order in which development started - this was in response to people saying Vega 10 would be the big one and 11 the small one (like Polaris).
I don't see why they'd be limited to 64 CUs; what stops them from having six SEs and 96 CUs?
I recall that as well. I consider it a mistake in AMD's decision making.
Nvidia has a clear structure where you know that a G@##0 (e.g. GM200, GK110) is a 250ish W halo chip. Similarly, G@##4 (e.g. GP104) is a 150ish W performance chip. It goes down the line cleanly.
AMD could've done the same thing with the last digit in their codename numbers. Instead, they publicly said that the numbers are meaningless. There's no link between Polaris 10 and Vega 10. They cover entirely separate market segments.
Apparently GCN 1.1 (second-gen GCN) was limited to 4 SEs. Since then, I don't think they have gone past 4. There's no reason why they couldn't in the long term, but they haven't for whatever reason. I hope it's a harmless coincidence that Vega 10 and 20 potentially still only have 4 SEs. I may be misunderstanding something, as I'm just a layman and this is starting to get relatively low level.
http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/4
Looking at the broader picture, what AMD has done relative to Hawaii is to increase the number of CUs per shader engine, but not changing the number of shader engines themselves or the number of other resources available for each shader engine. At the time of the Hawaii launch AMD told us that the GCN 1.1 architecture had a maximum scalability of 4 shader engines, and Fiji’s implementation is consistent with that. While I don’t expect AMD will never go beyond 4 shader engines – there are always changes that can be made to increase scalability – given what we know of GCN 1.1’s limitations, it looks like AMD has not attempted to increase their limits with GCN 1.2. What this means is that Fiji is likely the largest possible implementation of GCN 1.2, with as many resources as the architecture can scale out to without more radical changes under the hood to support more scalability.
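For what it's worth, the scaling question upthread is simple multiplication; a quick sketch (16 CUs per SE is Fiji's ratio, and the 6-SE configuration is purely hypothetical):

```python
# CU count = shader engines x CUs per shader engine.
cus_per_se = 16                    # Fiji's ratio
for shader_engines in (4, 6):
    print(shader_engines, "SEs ->", shader_engines * cus_per_se, "CUs")
# 4 SEs -> 64 CUs (the GCN 1.1/1.2 ceiling discussed above);
# a hypothetical 6-SE design -> 96 CUs.
```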
That decision making is really only relevant to marketing, and those internal chip codes aren't really relevant for the board names the GPUs are given. That's where the differentiation is made.
But it had occurred to me that AMD could do a bait-and-switch on us: what if a wavefront is really small, say 16 work items, and a workgroup is at least 4 wavefronts?
GCN is executing 64 wide waves on a 16 wide SIMD. You don't need to find independent instructions. Lanes (0-15) (16-31) (32-47) (48-63) of the same register are guaranteed to be independent. That gives you four cycles worth of independent work to issue.

I get that this is the case. What I was referencing was how the unit seemingly behaves at a pipeline rather than software level, and whether that has physical implications for clock speed. Not pipelining an operation can reduce transistor count, area, and latency at the cost of things like unit throughput or affecting the distribution of complexity among pipeline stages.
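To make that cadence concrete, here's a minimal sketch in plain Python (names and structure are mine, not AMD's) of a 64-wide wavefront issuing across a 16-wide SIMD over four cycles:

```python
# Minimal sketch: issuing one 64-wide GCN-style vector instruction
# on a 16-lane SIMD over four cycles. Purely illustrative.
WAVE_WIDTH = 64
SIMD_WIDTH = 16

def issue_vector_op(op, dst, src_a, src_b):
    """Execute `op` lane by lane, one 16-lane quarter per cycle."""
    for cycle in range(WAVE_WIDTH // SIMD_WIDTH):   # 4 cycles
        lo = cycle * SIMD_WIDTH
        hi = lo + SIMD_WIDTH
        # Lanes lo..hi-1 of the same register are independent of the
        # other quarters, so no hazard checks are needed between them.
        for lane in range(lo, hi):
            dst[lane] = op(src_a[lane], src_b[lane])

# Usage: a 64-wide add takes 4 issue cycles on the 16-wide SIMD.
a = [float(i) for i in range(WAVE_WIDTH)]
b = [2.0] * WAVE_WIDTH
d = [0.0] * WAVE_WIDTH
issue_vector_op(lambda x, y: x + y, d, a, b)
print(d[:4], d[48:52])
```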
Does anyone know what 'Beyond3D Test Suite' is? It's used by several hardware websites, and the polygon throughput test shows Nvidia cards pushing more polys than they can in theory - black magic? It must be counting triangles that were discarded or something, but Polaris doesn't exhibit this behavior and it does have primitive discard logic.

All the triangles are culled, as if they are back-face culled.
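One way to read those numbers (purely a toy model; the test's internals aren't public, and both rates below are assumed figures): if culled triangles only have to pass through the setup/discard stage, measured throughput can exceed the rasterizer's theoretical rate:

```python
# Toy throughput model: culled triangles skip rasterization.
setup_rate = 4     # tris/clock through setup + discard (assumed)
raster_rate = 2    # tris/clock the rasterizers can consume (assumed)

def tris_per_clock(cull_fraction):
    # Culled tris cost only setup; visible tris are bound by the rasterizer.
    visible = 1.0 - cull_fraction
    return min(setup_rate, raster_rate / visible if visible else float("inf"))

for cull in (0.0, 0.5, 1.0):
    print(f"{cull:.0%} culled -> {tris_per_clock(cull):.1f} tris/clock")
# With everything culled, throughput is bound by discard, not raster,
# which would let the reported poly rate exceed the theoretical one.
```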
If it really needs 3 cycles to fetch the 3 registers needed for multiply-add, it only has 1 cycle to execute it. That could be one of the reasons why the clocks cannot be increased much beyond 1 GHz. But why couldn't it fetch all three registers in parallel? I am not a hardware engineer, so I don't understand the limitations here. Limited register file bandwidth? Bank conflicts?

One possible organization for the SIMD pipeline would be a derivation of the VLIW 4-cycle cadence defined in the ISA docs for R600 and Evergreen: three cycles to get three operand registers and one cycle to execute. If this were an idealized pipeline, each stage would have as much work to do as any other with as little slack as possible, and presumably each operand fetch gets one stage.
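A minimal sketch of that hypothesized cadence (the stage breakdown is speculation from the posts above, and all names are mine, not from AMD documentation): one register read per cycle for three cycles, then the FMA executes in the fourth:

```python
# Sketch of the hypothesized 4-cycle FMA cadence: one operand read
# per cycle from a single-ported register file, then execute.
def fma_4_cycle(regfile, dst, a_idx, b_idx, c_idx, trace):
    trace.append(f"cycle 1: read r{a_idx}")   # operand fetch stage 1
    a = regfile[a_idx]
    trace.append(f"cycle 2: read r{b_idx}")   # operand fetch stage 2
    b = regfile[b_idx]
    trace.append(f"cycle 3: read r{c_idx}")   # operand fetch stage 3
    c = regfile[c_idx]
    trace.append("cycle 4: execute a*b+c")    # execute stage
    regfile[dst] = a * b + c

regs = [0.0, 2.0, 3.0, 4.0]
log = []
fma_4_cycle(regs, 0, 1, 2, 3, log)
print(regs[0])            # 10.0
print("\n".join(log))
```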
50% increase of registers per SIMD = 96 KB of registers per SIMD. Could be hard, if AMD is already spending 3 cycles to fetch the registers. Could they do something like Nvidia did with Maxwell? Introduce a register file that is more heavily banked (more bank conflicts, but higher BW) and at the same time introduce L0 register caching (use last results directly without writing & reading them from the cache). This worked very well for Maxwell. Kepler's register file had fewer bank conflicts, but Maxwell avoided this by register caching. But I don't know whether this approach suits fixed length (4 cycle) instruction latency.

This might become relevant to the goal of increasing register capacity by 50%. As storage capacity increases, access time increases, and that lengthens the delay of a register access stage. A perfectly balanced pipeline would then see its cycle time increase to match. Larger caches scale delay by roughly the square root of the capacity increase, but register files may behave differently.
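As a rough worked example of that square-root rule of thumb (the rule and the 64 KB GCN baseline are the only inputs; the resulting number is illustrative, not measured):

```python
import math

# Rule of thumb: array access delay scales ~sqrt(capacity ratio).
base_kb = 64.0            # current registers per SIMD (GCN)
new_kb = base_kb * 1.5    # 50% increase -> 96 KB
delay_scale = math.sqrt(new_kb / base_kb)
print(f"{new_kb:.0f} KB, delay x{delay_scale:.2f}")
# -> 96 KB, delay x1.22: ~22% longer register access, which a
# perfectly balanced pipeline would have to absorb in its cycle time.
```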
Vega 10 has 750 GFLOPS of double-precision computing power (that's 1/16 of SP). Interestingly, that will change with Vega 20, which should run double precision at 1/2 the SP rate. For that reason, some GPUs (like Hawaii) that offer good double-precision performance will still be around next year (check the roadmap below). So as you can see, Vega 10 will compete with Pascal GP100 in half- and single-precision computing.
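Sanity-checking the quoted ratios (the 12 TFLOPS SP figure appears later in the thread; Vega 20's actual SP rate is unknown, so the second line just applies the 1/2 divider to the same number as an assumption):

```python
# Check the quoted ratios: DP rate = SP rate * (DP:SP divider).
sp_tflops = 12.0                   # Vega 10 SP figure quoted below
vega10_dp = sp_tflops * (1 / 16)   # -> 0.75 TFLOPS = 750 GFLOPS
vega20_dp = sp_tflops * (1 / 2)    # assumes SP stays the same
print(f"Vega 10 DP: {vega10_dp * 1000:.0f} GFLOPS")
print(f"Vega 20 DP at 1/2 rate (same SP): {vega20_dp:.1f} TFLOPS")
```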
AMD is also developing its NVLink alternative, called xGMI, for peer-to-peer GPU communication. It will be available with the Zen-based Naples platform and Vega 20, which is currently expected in the second half of 2018.
If it really needs 3 cycles to fetch the 3 registers needed for multiply-add, it only has 1 cycle to execute it. That could be one of the reasons why the clocks cannot be increased much beyond 1 GHz. But why couldn't it fetch all three registers in parallel? I am not a hardware engineer, so I don't understand the limitations here. Limited register file bandwidth? Bank conflicts?

Each cell would add an additional transistor per port, and the array would need the address decoding and sense amplifiers needed to select and read for each port, plus the complexity of arbitrating accesses to the same location.
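For scale, here's a back-of-the-envelope look at why fetching all three FMA operands in one cycle is expensive (the lane width and 32-bit operand size are standard GCN parameters; the rest is arithmetic):

```python
# Back of envelope: register read bandwidth for one FMA issue.
lanes = 16            # SIMD width (GCN)
operand_bits = 32     # one VGPR element per lane
operands = 3          # a, b, c for a*b+c

serial_bw = lanes * operand_bits              # 1 read port, 3 cycles
parallel_bw = lanes * operand_bits * operands # 3 read ports, 1 cycle
print(serial_bw, "bits/cycle with one read port")
print(parallel_bw, "bits/cycle with three read ports")
# Tripling per-cycle read bandwidth means ~3x the ports (or banks),
# and each extra port adds transistors per cell plus decode/sense logic.
```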
50% increase of registers per SIMD = 96 KB of registers per SIMD.

It's probably banked so that a fraction of it needs to be accessed in a given cycle, so not as bad as the full amount, but probably with some overhead.
Could be hard, if AMD is already spending 3 cycles to fetch the registers. Could they do something like Nvidia did with Maxwell? Introduce a register file that is more heavily banked (more bank conflicts, but higher BW) and at the same time introduce L0 register caching. This worked very well for Maxwell.

The register reuse cache is also controlled by flags in the instruction stream, which along with stall counts complicates software. The problem of needing to expose the existence of forwarding latency and value reuse does not appear to be prohibitive for Nvidia (or others) in the various market segments.
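A toy model of that Maxwell-style idea (the bank count, cache policy, and all names below are my assumptions for illustration, not Nvidia's design): a tiny operand reuse cache, steered by compiler flags in the instruction stream, satisfies repeated operands so the banked register file sees fewer reads and fewer potential conflicts:

```python
# Toy model: banked register file + small operand reuse cache.
# All parameters are illustrative assumptions, not hardware values.
NUM_BANKS = 4

class RegFile:
    def __init__(self, n_regs):
        self.regs = [0.0] * n_regs
        self.reuse = {}          # tiny "L0" cache: reg index -> value
        self.bank_reads = 0

    def read(self, idx, reuse_hint=False):
        if idx in self.reuse:    # hit: no register-bank access needed
            return self.reuse[idx]
        self.bank_reads += 1     # miss: costs a read on bank idx % NUM_BANKS
        val = self.regs[idx]
        if reuse_hint:           # compiler flag marks values worth keeping
            self.reuse[idx] = val
        return val

rf = RegFile(16)
rf.regs[3] = 7.0
# r3 is used by back-to-back instructions; the hint avoids a second
# bank read (and any conflict it might have caused).
x = rf.read(3, reuse_hint=True)
y = rf.read(3)
print(rf.bank_reads)  # 1
```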
12 TFLOPS of SP in the professional market, which should translate to around 14 TFLOPS on desktops.
That's too high of a clockspeed unless AMD are doing three ops per shader per clock, a la those old Nvidia cards. In theory, at least.
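Working the arithmetic behind that reaction (the 64 CU / 4096 ALU configuration is the rumored Vega 10 layout and is an assumption here; 2 ops per clock is the usual FMA accounting):

```python
# Implied clocks for the quoted TFLOPS figures, assuming a rumored
# 64 CU / 4096 ALU part and 2 ops (FMA) per ALU per clock.
alus = 64 * 64            # 64 CUs x 64 lanes (assumption)
ops_per_clock = 2         # fused multiply-add

for tflops in (12.0, 14.0):
    clock_ghz = tflops * 1e12 / (alus * ops_per_clock) / 1e9
    print(f"{tflops:.0f} TFLOPS -> {clock_ghz * 1000:.0f} MHz")
# 12 TFLOPS -> ~1465 MHz; 14 TFLOPS -> ~1709 MHz, hence the
# skepticism unless more than 2 ops/clock are being counted.
```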