AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Deleted member 13524 · Oct 24, 2016

CSI PC said:
Which is part of my point, the packed FP16 is possibly only going to be available for Vega and SoCs-consoles.

As referred above, GCN3 and up already pack 2*FP16 load/store for bandwidth and latency savings. This means Tonga, Fiji, Polaris 10 and Polaris 11 already do it.
I think you're mistaking 2*FP16 packing with processing FP16 at twice the rate of FP32 in the same ALU units. That's the feature that is present in the PS4 Pro and probably in Vega (and TX1 + GP100 on nvidia's side).
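
To make the distinction concrete, here is a minimal sketch of the storage side only: two FP16 values sharing one 32-bit register slot. The helper names `pack_half2` and `unpack_half2` are made up for illustration and say nothing about how GCN encodes this in hardware; the point is only that packing saves space and bandwidth, while double-rate FP16 is a separate ALU capability.

```python
import struct

def pack_half2(lo: float, hi: float) -> int:
    """Pack two FP16 values into one 32-bit word (lo in bits 0-15, hi in bits 16-31)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]

def unpack_half2(word: int) -> tuple:
    """Split a 32-bit word back into its two FP16 values."""
    return struct.unpack("<2e", struct.pack("<I", word))

word = pack_half2(1.0, -2.0)
print(hex(word))            # one 32-bit slot now carries both halves
print(unpack_half2(word))   # both values come back unchanged
```

Packing like this halves register and memory traffic for FP16 data, but whether the ALU then processes both halves in a single operation is what separates PS4 Pro style double-rate FP16 from packed load/store alone.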

CSI PC · Oct 24, 2016

ToTTenTranz said:
As referred above, GCN3 and up already pack 2*FP16 load/store for bandwidth and latency savings. This means Tonga, Fiji, Polaris 10 and Polaris 11 already do it.
I think you're mistaking 2*FP16 packing with processing FP16 at twice the rate of FP32 in the same ALU units. That's the feature that is present in the PS4 Pro and probably in Vega (and TX1 + GP100 on nvidia's side).

I may be misunderstanding the slide from the official Polaris presentation: it did not make clear that 2xFP16 load/store is even possible, it just says Polaris supports native int16/FP16. It was Sony rather than AMD that mentioned the full packed FP16x2 operations when they discussed (or leaked) details of their architecture.
I could have made it clearer that it is interesting that the consoles, with the PS4 Pro being Polaris-based, can perform packed FP16x2 operations like Tegra X1 (and yes, that means higher 'performance'), while it looks like this is not going to be there for mainstream Polaris discrete GPUs (a similar situation to Maxwell Tegra X1 versus Maxwell discrete GPUs), along with the implications this can now have when porting.

Basically I was commenting on both but did not make it too clear in the beginning.
Edit:
I went back to the post you quoted: you picked up on the one where I left out the function/operation part of packed FP16, although in context it was a continuation of my previous post and a discussion with another forum member.
Cheers

ieldra · Oct 24, 2016

So PS4 can do double rate fp16?

How likely is it this will even be adopted if there are virtually no desktop (consumer) GPUs supporting it?

Dave Baumann · Oct 24, 2016

Anarchist4000 said:
@sebbbi I believe this was the register packing you were trying to figure out previously. Not sure if you saw this or not. Context missing in my quote, but the register packing should exist on Polaris.

+ Tonga, Fiji, Iceland

Anarchist4000 · Oct 24, 2016

ieldra said:
So PS4 can do double rate fp16?

How likely is it this will even be adopted if there are virtually no desktop (consumer) GPUs supporting it?

PS4 Pro at least according to that PS4 architect.

Not all desktops will support double rate FP16, but what's the alternative? Normal rate FP16/32 or avoid the effect entirely? Practically guaranteed the console devs will be all over it, and that code likely then ported to PC where applicable. For the most part it should be a compiler thing providing a performance boost on applicable hardware. So it shouldn't hurt anyone, just affect relative performance. I don't foresee any downside to it. The only concern relates to the scheduling being one or two instructions.

3dilettante · Oct 24, 2016

A random item that sprang to mind while thinking about the variable-width scheme is where this fits in conjunction with some of the waitcnt and wait state behavior. Some of my earlier questions arose in relation to what happens with the wait states in the presence of a design that can vary its cadence in the presence of the mooted high-performance scalar unit.

That aside: variable latency scenarios could interact with some of the hypothetical implementations that move context, gate off SIMDs, respond to back pressure, or automatically respond to dynamic thread behavior.
For example, one implementation might have issue logic that checks wavefront width per cycle and can opportunistically gate off excess lanes or whole SIMDs if enough lanes are predicated off. That's a particularly fine-grained monitoring of activity based on branching behavior, but is there a convention or architectural feature that avoids a hazard for gating off a SIMD for an earlier load of different width trying to update the register file drawn as being tied to a gated SIMD?
I suppose there's a number of ways like managing the register file and network separately, not adjusting if a waitcnt is outstanding, treating any explicit SIMD migration instructions or events as an overall waitcnt of 0. The complexity of the hardware or the effectiveness of gating can be compromised in some of these, while being conservative in the presence of a waitcnt can leave hundreds of cycles where the hardware is more conservative with its dynamic width behavior.

A convention for code generation that avoids outstanding wait counts in the presence of control flow points that the hardware might try to shift widths might work, but that pulls a low-level detail into the higher level considerations of the software.

Another possibility that synergizes with wait counts could be to maintain vector registers in the memory domain that defer updates to the SIMD register file until such time that the SIMD issue hardware and operand forwarding are in a position to consume the results of the operation as requested. In a port-constrained scenario, perhaps this could be used to overlap an operation's operand read with the update of the vector register, with forwarding supplying the ALU if needed, while any cycles in which the file is not in use soak up writes from the other domain.
The selective gating/SIMD routing might require more complex sequencing to properly drain pending updates before clock-gating a register or divert updates to the new destination.

Perhaps something like this already exists to manage updates to the register file from memory? It seems like something like this could fall out of having a sequencer in the other domain controlling when it updates the wait count seen by the front end, plus the forwarding latencies and lead times defined for various dependences.

(ed: spelling)

Anarchist4000 · Oct 24, 2016

3dilettante said:
A random item that sprang to mind while thinking about the variable-width scheme is where this fits in conjunction with some of the waitcnt and wait state behavior. Some of my earlier questions arose in relation to what happens with the wait states in the presence of a design that can vary its cadence in the presence of the mooted high-performance scalar unit.

Having seen that Eurogamer article it really makes me wonder about that MIMD theory I had with the attached scalars. Issuing 2 instructions to the unit, which then has a 16xSIMD and one or more scalars attached. A scalar being like a SIMD with up to 16(multiple of 4) cycle cadence fed through the same pipeline. Copy the vector operands into temporary registers and you have a scalar pipelined 16 deep with no dependencies. A vector register acting like a scalar memory bank. That's about as optimal as you can get for high clockspeeds. It also avoids a lot of the variable SIMD width issues. Even if there are masked off lanes a scalar would have little difficulty skipping ahead in any configuration. Multiple bonded scalars and you simply increase the stride. Going with MIMD just so the scalar and vector ALUs can share the same registers/banks and avoid a lot of data shuffling and porting. Although it seems likely there will be more ports. 2xFP16 could then also schedule on the SIMD or scalar pair.

Replacing a SIMD unit with some sort of scalar combination could work as well. 4 scalars at 4x clocks should be able to maintain the cadence and throughput of a single SIMD. Not particularly power efficient for full waves, but would improve with divergence.
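
The throughput claim can be checked with a toy model. Everything here is an assumption for illustration: a 64-wide wave on a 16-lane SIMD always takes 4 cycles regardless of masked lanes, while the hypothetical group of 4 scalars at 4x clock is assumed to skip masked lanes for free. The function names are invented, and real hardware would add issue and operand-fetch costs that this ignores.

```python
import math

WAVE_SIZE = 64    # GCN wavefront width
SIMD_LANES = 16   # lanes per GCN SIMD

def simd_cycles(active_lanes: int) -> int:
    """Base-clock cycles for one wave instruction on the SIMD; masked lanes still cost time."""
    return WAVE_SIZE // SIMD_LANES

def scalar_group_cycles(active_lanes: int, scalars: int = 4, clock_mult: int = 4) -> int:
    """Base-clock cycles for a hypothetical scalar group that only visits active lanes."""
    lanes_per_base_cycle = scalars * clock_mult
    return max(1, math.ceil(active_lanes / lanes_per_base_cycle))

for active in (64, 32, 8):
    print(active, simd_cycles(active), scalar_group_cycles(active))
```

Under these assumptions the scalar group matches the SIMD on a full wave and only pulls ahead once lanes are masked off, which is the "would improve with divergence" part of the argument.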

3dilettante · Oct 25, 2016

Anarchist4000 said:
Having seen that Eurogamer article it really makes me wonder about that MIMD theory I had with the attached scalars.
Issuing 2 instructions to the unit, which then has a 16xSIMD and one or more scalars attached.

Do you mean the PS4 Pro's support for 2x FP16? The way double-rate FP16 is implemented in other GPUs doesn't introduce multiple instruction issue.

A scalar being like a SIMD with up to 16(multiple of 4) cycle cadence fed through the same pipeline. Copy the vector operands into temporary registers and you have a scalar pipelined 16 deep with no dependencies. A vector register acting like a scalar memory bank. That's about as optimal as you can get for high clockspeeds.

If they're going through the same pipeline, then that raises questions about high clock speeds. Having no dependences would not change how every other stage would be structured, and the clock period would be limited by everything else.
If the scalar unit is being used on a code stream that was generated for wide wavefronts that have been predicated down to a few lanes, has the hardware found a way to get around how the register IDs were allocated for the purposes of the SIMD code? The register IDs would still be laid out across separate vector registers, which would require changing how the registers are packed or changing how the register file can be accessed.

Remij · Oct 25, 2016

So realistically, how will the ability to perform double rate FP16 improve the GPU's performance, when most algorithms require 32-bit FP to output correctly? Is this actually something that could be significant?

Gipsel · Oct 25, 2016

Remij said:
So realistically, how will the ability to perform double rate FP16 improve the GPU's performance, when most algorithms require 32-bit FP to output correctly? Is this actually something that could be significant?

Actually, quite a few things are fine with FP16 (could be upwards of 50% of all lighting and pixel shader stuff, but I'm no expert on this, others here are). How much one can gain from it varies wildly, of course, and also depends on other bottlenecks. But being able to invest more arithmetic power in certain areas is significant, even when algorithms have to be modified to take advantage of it. Some cross-platform stuff is already written with FP16 calculations as a consideration. So there is at least some experience with it.
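
A quick way to see why some shader math tolerates FP16 and some does not is to round a few values to half precision and look at the error. This is only a sketch using Python's `struct` half-float format; the sample values are arbitrary stand-ins (a colour-like value in [0, 1] versus larger coordinate-like magnitudes), not taken from any real shader.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

for x in (0.5, 0.7312, 100.3, 2048.5):
    h = to_fp16(x)
    print(f"{x:>8} -> {h:<10} rel. error {abs(h - x) / x:.2e}")
```

Half precision has an 11-bit significand, so the relative error stays within about 2^-11 for normal values. That is plenty for colours and many lighting terms, but above 2048 even whole numbers are no longer all representable, which is why positions and other large-range values usually stay in FP32.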

Anarchist4000 · Oct 25, 2016

3dilettante said:
Do you mean the PS4 Pro's support for 2x FP16? The way double-rate FP16 is implemented in other GPUs doesn't introduce multiple instruction issue.

This would be for PS4 Pro and possibly Vega. My theory had two instructions issued for a VALU+scalar, but with some flexibility to map them differently. VALU+VALU, VALU+Scalar, or Scalar+Scalar could be issued within the MIMD unit. Constrained by available ports, all of which are vectors. Different than other GPUs, but similar to having multiple vector ALUs scheduled. It wouldn't exactly be double rate FP16 however as you'd likely lose some throughput to the scalars.

EDIT: The big difference would be the extra operand(s). There would likely be 4 or 5 operands allowing for the separation as opposed to the typical 3. So a=b*c+d, e=f could be a=b*c, d=e+f.

3dilettante said:
If they're going through the same pipeline, then that raises questions about high clock speeds.

The scalar would have its own internal registers to support the higher clocks. For scalarization it would consume vector operands and run up to 16 cycles without needing any data fed to it.

3dilettante said:
If the scalar unit is being used on a code stream that was generated for wide wavefronts that have been predicated down to a few lanes, has the hardware found a way to get around how the register IDs were allocated for the purposes of the SIMD code?

For execution on the integrated scalar it would be irrelevant. You would be reading 16 lanes just as if the VALU were executing the code. To conserve registers a compaction scheme would be required. A task for which the scalars may be well suited. The scalar I had would just be serializing vectors by copying operands into temporary registers it could more appropriately access.

It could run scalar operations if packed into vectors as well as extract uniform code. That would definitely require issuing scalar instructions more than once every 4th cycle.

kalelovil · Oct 25, 2016

ieldra said:
So PS4 can do double rate fp16?

How likely is it this will even be adopted if there are virtually no desktop (consumer) GPUs supporting it?

Aside from the PS4 Pro you will have Project Scorpio (probably), Nintendo Switch (almost certainly), the latest ARM, PowerVR and Nvidia mobile GPUs, and future high-end AMD GPUs all increasing operation throughput when using a mix of FP16 and FP32 data.

It seems to be a direction the industry is pursuing. I suspect it will filter down to future consumer GPUs from Nvidia.

(Developers porting to PC will need to take into account the unique GP10x situation though: http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/5)

CSI PC · Oct 25, 2016

Gipsel said:
Actually, quite a few things are fine with FP16 (could be upwards of 50% of all lighting and pixel shader stuff, but I'm no expert on this, others here are). How much one can gain from it varies wildly, of course, and also depends on other bottlenecks. But being able to invest more arithmetic power in certain areas is significant, even when algorithms have to be modified to take advantage of it. Some cross-platform stuff is already written with FP16 calculations as a consideration. So there is at least some experience with it.

Any idea how easy it would be to take a true DX12 game rendering engine or set of post-processing effects that has parts designed and heavily optimised around FP16, and then port it to the PC, and to FP32 as well?
Developers already have problems managing DX12 well from a post-processing and even a rendering-engine perspective when porting from console to PC; there are some successes but not many, and plenty of examples where hardware differences cause headaches.

This would affect both AMD and Nvidia consumer cards before Vega/Volta, assuming that mainstream and enthusiast consumer cards have it by then.
Another consideration is that some of the large AAA teams use an R&D team to develop those rendering engines and post-processing effects before the games, and then when it comes to porting the game it can be totally outsourced, not even to another team within the software company.
Cheers

pTmdfx · Oct 25, 2016

Anarchist4000 said:
Having seen that Eurogamer article it really makes me wonder about that MIMD theory I had with the attached scalars. Issuing 2 instructions to the unit, which then has a 16xSIMD and one or more scalars attached. A scalar being like a SIMD with up to 16(multiple of 4) cycle cadence fed through the same pipeline. Copy the vector operands into temporary registers and you have a scalar pipelined 16 deep with no dependencies. A vector register acting like a scalar memory bank. That's about as optimal as you can get for high clockspeeds. It also avoids a lot of the variable SIMD width issues. Even if there are masked off lanes a scalar would have little difficulty skipping ahead in any configuration. Multiple bonded scalars and you simply increase the stride. Going with MIMD just so the scalar and vector ALUs can share the same registers/banks and avoid a lot of data shuffling and porting. Although it seems likely there will be more ports. 2xFP16 could then also schedule on the SIMD or scalar pair.

Replacing a SIMD unit with some sort of scalar combination could work as well. 4 scalars at 4x clocks should be able to maintain the cadence and throughput of a single SIMD. Not particularly power efficient for full waves, but would improve with divergence.

What you describe is more or less what Nvidia proposed for Echelon.

pTmdfx · Oct 25, 2016

3dilettante said:
Perhaps something like this already exists to manage updates to the register file from memory? It seems like something like this could fall out of having a sequencer in the other domain controlling when it updates the wait count seen by the front end, plus the forwarding latencies and lead times defined for various dependences.

(ed: spelling)

According to the ISA manual, memory instructions of the same category are guaranteed to be committed in order. But since practically the results would be returned from the memory hierarchy out of order, I'd assume there is some kind of load/store buffer. So deferring update is possible.
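
As a rough illustration of that idea, here is a toy buffer in which loads can return in any order but register updates are committed strictly in issue order. It is a generic reorder-buffer sketch, not a description of AMD's hardware; the class and method names (`LoadBuffer`, `issue`, `data_returned`) and the `outstanding` counter standing in for a wait count are all invented for the example.

```python
from collections import deque

class LoadBuffer:
    """Toy model: out-of-order data return, in-order commit to the register file."""

    def __init__(self):
        self._queue = deque()   # entries kept in issue order
        self.registers = {}     # committed register values

    def issue(self, dest_reg):
        entry = {"reg": dest_reg, "value": None, "done": False}
        self._queue.append(entry)
        return entry

    def data_returned(self, entry, value):
        entry["value"] = value
        entry["done"] = True
        self._commit_ready()

    def _commit_ready(self):
        # Only the oldest entry may commit; younger results wait behind it.
        while self._queue and self._queue[0]["done"]:
            head = self._queue.popleft()
            self.registers[head["reg"]] = head["value"]

    @property
    def outstanding(self):
        """Loads not yet committed (what a front end would see as the wait count)."""
        return len(self._queue)

buf = LoadBuffer()
first = buf.issue("v0")
second = buf.issue("v1")
buf.data_returned(second, 22)            # younger load returns first
print(buf.registers, buf.outstanding)    # nothing committed yet
buf.data_returned(first, 11)             # oldest load arrives, both drain in order
print(buf.registers, buf.outstanding)
```

In a model like this the counter only drops when the head of the queue commits, so a deferred update is invisible to the front end until the buffer chooses to release it.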

Anarchist4000 · Oct 27, 2016

pTmdfx said:
What you describe is more or less what Nvidia proposed for Echelon.

Definitely similar concepts, both coming from the same DARPA project. My proposal was a more hybrid approach between parallel (SIMD) and temporal (scalar/2-wide SIMT) though. Not quite as versatile as Echelon, but more performant for parallel, non-diverged workloads. Keeps the design compatible with GCN, but with the added ability to schedule 2 VALUs. That would provide a lot of flexibility without being overly complex. If scheduling a single VALU per cycle there should still be a performance gain so long as the scalar can't complete the wave within the 4 cycle cadence. For a full wave the SIMD would be idle every 4th (16th clock) cycle of the cadence. Obviously a lot of ways to configure the CU, just a question of how robustly the scheduling hardware is designed.

CSI PC · Oct 27, 2016

pTmdfx said:
What you describe is more or less what Nvidia proposed for Echelon.

Yeah, sort of between Ronny Krashinsky's patent and some work at Nvidia, and also with Berkeley and Yunsup Lee, and that of Jan Lucas, who started with the DART/temporal SIMD architecture and investigated scalarization and Temporal SIMT; that has since evolved quite a bit into Spatiotemporal SIMT (not publicly available or distributed).

Cheers

Deleted member 87499 · Nov 10, 2016

ToTTenTranz said:
- 64 CUs at 1.5 GHz
- 2x FP16 rate per ALU
- 2 stacks of 2GT/s HBM2
- 225W TDP
- H1 2017 (probably Q2..)

I think that will be 2 x Polaris 10 in terms of configuration, but with GCN2.0 instead of GCN1.4. P10 CUs are grouped nine per 128-bit/4-channel group of memory controllers, and P11 CUs are grouped eight per 128-bit group; that gives 576 SPs versus 512 SPs per group. Going with the second takes lower clocks to get to the 12TF performance level, suggesting that the clocks will not change much from GFX8 to GFX9, even with 14LPP maturing a tad.
Clocks (across all use cases) stay around RX 480 to Sapphire 480 Nitro (8GB) levels, going up a bit versus Polaris but not too much. Stock power consumption varies more with the load (e.g. GPGPU vs. 4K gaming) than with clock variation. I think 225W is a ceiling for the worst-case stock power consumption outside of power viruses, with average gaming power consumption looking like that of a Fury non-X card.
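
The numbers being discussed follow from simple peak-throughput arithmetic, sketched below. It assumes 64 stream processors per GCN CU and counts one fused multiply-add as 2 FLOPs per clock; the function names are made up for the example, and the results are theoretical peaks rather than measured performance.

```python
SPS_PER_CU = 64          # stream processors per GCN compute unit
FLOPS_PER_SP_CLOCK = 2   # one fused multiply-add counted as 2 FLOPs

def peak_tflops(cus: int, clock_ghz: float, rate_mult: int = 1) -> float:
    """Theoretical peak in TFLOPS; rate_mult=2 models double-rate FP16."""
    return cus * SPS_PER_CU * FLOPS_PER_SP_CLOCK * clock_ghz * rate_mult / 1000.0

def clock_for_tflops(cus: int, target_tflops: float) -> float:
    """Clock in GHz needed to reach a target FP32 peak."""
    return target_tflops * 1000.0 / (cus * SPS_PER_CU * FLOPS_PER_SP_CLOCK)

print(peak_tflops(64, 1.5))                 # FP32 peak for 64 CUs at 1.5 GHz
print(peak_tflops(64, 1.5, rate_mult=2))    # FP16 peak with 2x rate per ALU
print(clock_for_tflops(64, 12.0))           # clock needed for exactly 12 TF
print(9 * SPS_PER_CU, 8 * SPS_PER_CU)       # SPs per group: P10 style vs P11 style
```

With 64 CUs, 12 TF of FP32 needs a clock just under 1.5 GHz, so the rumoured 1.5 GHz and the "12TF" figure are consistent with each other.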

I'm betting on the maximum GCN configuration going up (versus previous <GCN1.3) from Vega (maybe from Polaris) onward.

Kaotik · Nov 11, 2016

Why does that "GCN 2.0" keep coming up? Vega is not the first big gfx ip jump in GCN-history, we've had 7.x's and now 8.x's already

Deleted member 13524 · Nov 11, 2016

Well, at least AnandTech has been using GCN 1.0 for Southern Islands, GCN1.1 for Hawaii and Bonaire, GCN1.2 for Tonga and Fiji and GCN1.3 for Polaris. If Vega turns out to be the largest departure from Southern Islands, then following that logic calling it GCN 2.0 makes some sense.
Though if we follow AMD's nomenclature of using GCNx for each transition, GCN2 already exists in Bonaire + Hawaii and Vega should be GCN5.

IMHO AnandTech and everyone else should start using the same names as AMD, as it would avoid further confusion.
