PERFORMANCE AND ENERGY EFFICIENT COMPUTE UNIT
So this seems kinda crazy: vary the voltage per lane in a SIMD so that slight variations in completion time between individual lanes are ironed out (effectively clocking the lanes of the SIMD at slightly varying frequencies!).
Also seems to be a way to run a SIMD more efficiently.
The patent mentions being derived from a Department of Energy contract, possibly one of the exascale projects.
This may actually mesh with AMD's most recent HPC GPU chiplet proposal, where asynchronous techniques are applied to the ALUs and crossbars of the SIMD units. It might target a Navi or post-Navi architecture, however.
For reference:
http://www.computermachines.org/joe/publications/pdfs/hpca2017_exascale_apu.pdf
As described, the lanes do still interface with a clock at some point, but the pipeline may have become increasingly decoupled in terms of instruction and operand transfer to the lanes.
This potentially creates physically dynamic inter-wavefront issue behavior, which already has more explicit forms in things like the variable DP rate.
The patent is being cagey on how exactly the lanes wind up running very different instructions. It's more readily true with the current SIMD arrangement if the comparison is between a fast lane in SIMD 0 (an off lane, or an 8-bit add) and a slower op in SIMD 1 (an FMA with some mix of bits and exponents requiring the maximum switching activity and shifting). Potentially, predication or specific corner cases may short-circuit evaluation, or lanes could be set to forward data unmodified or flushed to a fixed value. The SIMDs would still need to care about each other's delays, since the CU's clock applies the longest delay to all.
This, coupled with the claimed use of near-threshold computing, would be a significant source of timing variation.
It may be as if the SIMD in this mode operates in a constant state of Vdroop compensation with dynamic clocking, except that where AMD's existing method extends clock cycles dynamically at a more global level, this one also adjusts in the positive direction, bumping clock and voltage if one lane shows that it is experiencing cumulatively more delay than the others. The patent mentions the possibility of linking voltage to clock, which sounds like an extension of that compensation method.
How the excessive delay would be measured (or predicted, in the other variation) would be interesting to see. This would seemingly allow voltage and timing slack to be exploited, rather than sizing a more rigid voltage level and circuit cadence closer to the worst case. Some scenarios may also be explicitly targeted, such as cross-lane ops that broadcast one lane's value on the next clock, where that one lane is more timing-critical and may incur a longer delay setting up the broadcast in a near-threshold environment.
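As a rough illustration of what that compensation loop might look like (entirely my own sketch; the nominal period, slip budget, and delay-vs-voltage relation are made-up numbers, not the patent's), each lane accumulates slip past a nominal period, and a controller bumps the voltage of any lane that chronically falls behind:

```python
# Toy model of per-lane slack compensation (my sketch, not the patent's
# actual mechanism). Each lane's op delay rises as its voltage falls; the
# controller bumps the voltage of any lane whose cumulative slip past the
# nominal period exceeds a budget.

NOMINAL_PERIOD_NS = 1.0
SLIP_BUDGET_NS = 0.2   # hypothetical: how far behind a lane may drift
V_STEP = 0.01          # hypothetical voltage bump per adjustment

def op_delay_ns(base_delay_ns, voltage):
    # Crude 1/V relation; near threshold the real curve is far steeper.
    return base_delay_ns / voltage

def run_cycle(voltages, slips, base_delays):
    for lane, base in enumerate(base_delays):
        delay = op_delay_ns(base, voltages[lane])
        slips[lane] += max(0.0, delay - NOMINAL_PERIOD_NS)
        if slips[lane] > SLIP_BUDGET_NS:
            voltages[lane] += V_STEP   # boost the chronically slow lane
            slips[lane] = 0.0          # restart its slack accounting

# Lane 2 draws a heavier op mix and gradually earns a voltage bump.
voltages = [0.60, 0.60, 0.60, 0.60]
slips = [0.0] * 4
for _ in range(100):
    run_cycle(voltages, slips, base_delays=[0.55, 0.55, 0.62, 0.55])
print(voltages)
```

The interesting property is that the boost converges: once the slow lane's delay fits inside the nominal period, its slip stops accumulating and the voltage settles.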
The patent's reference to a "few nanoseconds" of delay was initially confusing to me.
I think this may be cumulative over some number of execution cycles. One specific wave may only take a fraction of a nanosecond longer per op, but it can accumulate. As a physical/electrical phenomenon, there may be a level of correlation within a lane--especially if it is riding the edge in terms of voltage and timing. If a lane's switching activity versus its current power delivery results in voltage droop, its next cycle may start at a slightly worse voltage level in addition to its delayed start, causing the next operation to insert even more delay, and so on.
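A toy model of that compounding (again my own illustration, with invented numbers) shows how fraction-of-a-nanosecond slips per op can add up to something on the order of the patent's "few nanoseconds" within a short burst:

```python
# Toy compounding-droop model (my illustration, not the patent's numbers).
# Heavy switching droops the lane's local voltage a little each op; the next
# op starts at the worse voltage, so its delay grows and the slip accumulates.

V_NOMINAL = 0.60         # hypothetical near-threshold operating point (volts)
DROOP_PER_OP = 0.005     # hypothetical droop from heavy switching activity
RECOVERY_PER_OP = 0.003  # power delivery claws some voltage back each op

v = V_NOMINAL
total_slip_ns = 0.0
for op in range(50):
    delay_ns = 0.5 / v                            # crude delay-vs-voltage relation
    total_slip_ns += delay_ns - (0.5 / V_NOMINAL) # slip versus the nominal delay
    v = min(V_NOMINAL, v - DROOP_PER_OP + RECOVERY_PER_OP)

print(f"cumulative slip after 50 ops: {total_slip_ns:.3f} ns")  # ~3 ns
```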
I took this to mean that different paths through a multiplier, for example, could result in fractionally different completion times - but I've not come across such a multiplier.
In modern x86 CPUs, division latency depends on the actual values of the operands, although that may be too complex an operation for this scenario.
The example does have a CU with multiple SIMDs, but talks about relative delay between lanes--possibly not in the same SIMD. That scenario makes differing execution times more plausible, particularly if one SIMD is handling low-precision math and the other heavy DP arithmetic.
They'd still indirectly interact, since the rest of the CU may be synchronous and its arbitration cycles would otherwise be bound by the worst-case time of one of them.
If running things more asynchronously, and if running at a low voltage, there may be delays even within a SIMD.
If AMD introduces a dynamically variable-precision ALU that gates off more sections based on how many bits it really needs, the timing could get stranger.
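For instance, a crude delay model for a precision-gated adder (my illustration, not AMD's design) would have delay track the widest live operand, since gated-off upper sections contribute no carry-propagation time:

```python
# Toy delay model for a precision-gated adder (my illustration, not AMD's
# design): delay tracks the widest live operand, since gated-off upper
# sections add no carry-propagation time.

def live_bits(x):
    return max(1, int(x).bit_length())

def gated_add_delay_ps(a, b, ps_per_bit=5.0):
    width = max(live_bits(a), live_bits(b)) + 1   # +1 for the carry out
    return width * ps_per_bit                     # crude linear carry chain

print(gated_add_delay_ps(7, 5))            # narrow operands: short chain
print(gated_add_delay_ps(2**30, 2**30))    # wide operands: long chain
```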
The prior FIFO would be an operand collector (as seen in NVidia's designs).
It could be a straightforward FIFO of command signals, with reads from the register file using the same logic as before. AMD's HPC proposal did not say that the register file would be asynchronous, and it explicitly stated the SRAM would not be running at near-threshold voltage. Buffering multiple operations ahead of the pipeline, giving them a rough time budget, and buffering any writeback could let the GPU complete work at a lower voltage and without as much wall-clock time wasted in a long cycle--usually.
The idea of having a FIFO at the top and a FIFO at the bottom may allow for monitoring of delay and readiness of the asynchronous (isochronous?) lanes, and may also serve to coalesce feedback into the synchronous and higher-voltage regions of the CU. The discrepancy between the upper and lower FIFOs may also give an idea of which lanes need a boost, or if the CU needs to insert an actual stall.
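A sketch of how that discrepancy check might work (my reading of the idea, not the patent's actual logic; the thresholds are hypothetical): the gap between how many ops a lane has consumed from the upper FIFO and how many results it has delivered to the lower one marks it as a candidate for a boost or, past a larger gap, forces a stall:

```python
# Sketch of using the gap between a lane's input and output FIFOs to spot
# lagging lanes (my reading of the idea, not the patent's actual logic).

class Lane:
    def __init__(self):
        self.issued = 0    # ops popped from the upper (input) FIFO
        self.retired = 0   # results pushed to the lower (output) FIFO

    def in_flight(self):
        return self.issued - self.retired

def classify(lanes, boost_at=3, stall_at=6):   # thresholds are hypothetical
    actions = {}
    for i, lane in enumerate(lanes):
        gap = lane.in_flight()
        if gap >= stall_at:
            actions[i] = "stall"   # CU has to insert an actual stall
        elif gap >= boost_at:
            actions[i] = "boost"   # candidate for a voltage/clock bump
    return actions

lanes = [Lane() for _ in range(4)]
lanes[0].issued, lanes[0].retired = 8, 7   # keeping up
lanes[3].issued, lanes[3].retired = 8, 4   # falling behind
print(classify(lanes))   # {3: 'boost'}
```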
The complexity of the behavior makes me wonder about some of the wait states in current GCN. Parts of the CU are going to be more aware of delays than before, and perhaps some of these paths would now have interlocks. An alternate possibility is power-aware instruction scheduling, with additional wait states exposed so the compiler can actively target filling the FIFO with dependent instructions, possibly allowing results to forward within the lanes and letting the output FIFO elide some writes to SRAM if the same destination shows up (see the sketch below). However, to get the most out of this, GCN's rather switch-happy threading may need to be curtailed in order to give a single wavefront more chances to issue into the FIFO without interruption. Perhaps the VALU count value may find a use here?
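The write-elision part could look something like this (my own sketch): when a newer result in the output FIFO targets the same destination register as an older, not-yet-drained entry, the older write is dead and can be dropped before it ever touches the SRAM:

```python
# Sketch of eliding dead writes in the output FIFO (my illustration): if a
# newer entry targets the same destination register as an older, undrained
# entry, the older write never needs to reach the register-file SRAM.

def push_writeback(fifo, dest_reg, value):
    # fifo is a list of (dest_reg, value) pairs awaiting an SRAM write.
    fifo[:] = [(d, v) for (d, v) in fifo if d != dest_reg]  # drop dead writes
    fifo.append((dest_reg, value))

fifo = []
push_writeback(fifo, "v0", 1)
push_writeback(fifo, "v1", 2)
push_writeback(fifo, "v0", 3)   # elides the earlier v0 write
print(fifo)                     # [('v1', 2), ('v0', 3)]
```

Note this is only safe if dependent reads forward from the FIFO or within the lane rather than from the SRAM, which lines up with the in-lane forwarding mentioned above.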
The current GCN 4-cycle cadence results in a simple and efficient 16-lane split register-file design: no bank conflicts at all, and registers can sit very close to the execution units.
While the exact width may not be known, one way to split the difference, as AMD has indicated for its exascale proposal, is to keep the SRAM at a higher voltage level and possibly on a synchronous clock. The patent may mean things are less tightly linked: the ALU lanes would buffer work and writeback in various FIFOs, so that actual interactions with the register file would only occur after all the variable timing has been resolved.
That may mean some kind of internal operand caching, which might let the lanes go further before having to sync with the outside world.
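Something like a tiny per-lane operand cache (my illustration; nothing this specific is in the patent) could serve that role, letting dependent ops skip a register-file read and keep running ahead of the synchronous SRAM:

```python
# Sketch of a small per-lane operand cache (my illustration, not from the
# patent): recent results live beside the lane so dependent ops can skip a
# trip to the register file.

from collections import OrderedDict

class OperandCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # reg name -> value, oldest first

    def lookup(self, reg):
        return self.entries.get(reg)   # None means an SRAM read is needed

    def fill(self, reg, value):
        if reg in self.entries:
            self.entries.move_to_end(reg)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry
        self.entries[reg] = value

cache = OperandCache()
cache.fill("v4", 42)
print(cache.lookup("v4"))   # hit: no trip to the register file
print(cache.lookup("v9"))   # miss: None, read the SRAM
```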