> It's interesting though that this is not exposed (yet?) to D3D/Vulkan in NVIDIA's case. There doesn't seem to be any mention of pointers from cudaMallocManaged not being allowed when creating texture objects, so what gives?

Not exposed in OpenCL either. Someone asked on their forums and the response was that there are no plans to support OpenCL; it's CUDA only at this point. Maybe they'll start supporting graphics APIs if AMD's technology proves to be a success.
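For context, here's roughly what that question looks like in code: a minimal sketch (untested, sizes made up) that hands a cudaMallocManaged pointer straight to cudaCreateTextureObject. Whether the runtime/driver actually accepts a managed pointer there is exactly the open question.

```cpp
// Sketch: create a CUDA texture object over a cudaMallocManaged() allocation,
// i.e. the case the post is asking about. Only shows the API calls involved;
// whether a given driver accepts it is what's being questioned.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    float* data = nullptr;
    size_t count = 1 << 20;
    cudaMallocManaged(&data, count * sizeof(float));       // managed (pageable) allocation

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = data;                       // managed pointer used directly
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = count * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaError_t err = cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    printf("cudaCreateTextureObject: %s\n", cudaGetErrorString(err));

    cudaDestroyTextureObject(tex);
    cudaFree(data);
    return 0;
}
```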
It is hard to see what will happen in the future. If hardware paging solutions become prevalent, there's significantly less reason to create your own highly complex software paging solution. Data sets are growing all the time, meaning that a smaller and smaller fraction of the data actually gets accessed per frame. My prediction is that the gains from automated paging systems will increase in the future. But obviously we are also going to see games using techniques that require big chunks of physical memory, for example big runtime-generated volume textures used in global illumination systems. The best bet is to have both more memory and an automated paging system.
The reality right now is that Nvidia recently released the GeForce GTX 1060 with 3 GB of memory. Developers need to ensure that their games work perfectly fine on this popular GPU model. Thus I don't see any problem with a 4 GB mid-tier GPU with an automated paging solution. An automated memory paging solution doesn't need to double your available memory in every single game to be useful.
> I am talking about automatic virtual memory paging from a bigger system memory (DDR4) to a smaller GPU memory (HBM(2), MCDRAM, etc). Current consoles do not have big slow main memories. Might change in the future of course. I don't believe automatic paging from hard drive is fast enough right now. Latency is 10+ milliseconds. Paging from DDR4 main memory has several orders of magnitude smaller latency. Significantly easier to get right with minimal developer intervention.

That's just assuming Vega, going by the rumors, is exclusively a high-end GPU costing $500+. It would be interesting to see if this sort of thing becomes the de facto approach for the next generation of consoles as well, though. As you said, better hardware memory management would definitely cut down on any perceived need to spend a lot of time doing it in software. It'd also cut down on whatever amount of memory MS and Sony would consider to be needed, a win as far as they're concerned too.
> smaller GPU memory (HBM(2), MCDRAM, etc).

This topic falls into something I raised in another thread.
What sort of capacity would the HBM need to be to facilitate 4K fat G-buffers and render targets, and anything else that would have to be stored in it?
I'm aware that the industry is slowly moving away from fat G-buffers, but that's not the case as of yet.
Would 4 GB really be enough to allow for that with the automated paging, or would it take developer input? AMD seems to imply that the automated approach would handle it when paging in from larger DDR4 memory, even across PCIe.
Frostbite needs 472 MB for 4K (all render targets + all temporary resources) in DX12. Page 58:
http://www.frostbite.com/2017/03/framegraph-extensible-rendering-architecture-in-frostbite/
Assets (textures, meshes, etc) are of course loaded on top of this, but this kind of data can be easily paged in/out based on demand. I'd say 4 GB is enough for 4K (possibly even 2 GB), but it's too early to say how well Vega's memory paging system works. Let's talk more when Vega has launched. I have only worked with custom software paging solutions that are specially engineered for a single engine's point of view. Obviously a fully generic automatic solution isn't going to be as efficient.
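To put those numbers side by side (this just restates the figure above against some assumed card sizes, it's not a measurement):

```cpp
// Rough headroom estimate: local VRAM minus the fixed render-target/temporary
// footprint quoted above leaves the pool available for demand-paged assets.
#include <cstdio>

int main() {
    const double render_targets_mb = 472.0;             // Frostbite, 4K, DX12 (from the slide)
    const double vram_mb[] = { 2048.0, 4096.0, 8192.0 }; // illustrative card sizes

    for (double v : vram_mb) {
        double assets = v - render_targets_mb;           // everything else can be paged on demand
        printf("%5.0f MB card -> ~%5.0f MB left for paged assets (%.0f%%)\n",
               v, assets, 100.0 * assets / v);
    }
    return 0;
}
```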
Game's data sets tend to change slowly (smooth animation). You only need to load new pages from DDR4 every frame. 95%+ of data in GPU memory stays the same.
I am talking about automatic virtual memory paging from a bigger system memory (DDR4) to a smaller GPU memory (HBM(2), MCDRAM, etc). Current consoles do not have big slow main memories. Might change in the future of course. I don't believe automatic paging from hard drive is fast enough right now. Latency is 10+ milliseconds. Paging from DDR4 main memory has several orders of magnitude smaller latency. Significantly easier to get right with minimal developer intervention.
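A rough back-of-envelope on what that per-frame paging traffic costs, assuming a 4 GB local pool, ~16 GB/s for PCIe 3.0 x16, 60 fps, and purely illustrative turnover rates:

```cpp
// Back-of-envelope: how much PCIe bandwidth does paging N% of a 4 GB
// working set per frame actually need?
#include <cstdio>

int main() {
    const double pool_gb   = 4.0;     // local HBM/GDDR pool (illustrative)
    const double pcie_gbps = 16.0;    // ~PCIe 3.0 x16 peak; real-world is lower
    const double fps       = 60.0;
    const double turnovers[] = { 0.01, 0.02, 0.05 };   // 1%, 2%, 5% of pool changes per frame

    for (double turnover : turnovers) {
        double per_frame_gb = pool_gb * turnover;
        double needed_gbps  = per_frame_gb * fps;
        printf("%2.0f%% turnover: %.0f MB/frame, %.1f GB/s (%.0f%% of PCIe 3.0 x16)\n",
               turnover * 100.0, per_frame_gb * 1024.0, needed_gbps,
               100.0 * needed_gbps / pcie_gbps);
    }
    return 0;
}
```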
> So this seems kinda crazy: vary voltage per lane in a SIMD so that slight variations in completion time by individual lanes are ironed-out (effectively clocking lanes of the SIMD at slightly varying frequencies!).

Clocking probably isn't the right word here. Just adjusting voltage per lane to adjust frequency and lower power. Does seem a bit extreme for that though.
> Clocking probably isn't the right word here. Just adjusting voltage per lane to adjust frequency and lower power. Does seem a bit extreme for that though.

At first, to me, it seemed the primary reason for this arrangement was to turn off individual lanes when predication is known to turn off lanes. That is problematic in AMD's existing SIMD architecture, though, because the SIMDs are 4-clocked per instruction, simulating 4 lanes per lane. So predication would turn off "lane" 3, say, but "lanes" 19, 35 and 51 are required to be turned on. Which makes me hesitant.
> Even in a parallel computing environment, the lanes of a CU may execute operands at different rates. For example, the last lane of a CU may finish execution a few nanoseconds later than the first lane. This is due to the fact that the execution time for a given lane depends on the size of the operand. Smaller numbers take less time to calculate than larger ones. Similarly, some arithmetic calculations take longer than others. While the magnitude of the latency for a given operand may be quite small, over time the lanes will diverge in time. The difficulty is that the slowest lane will determine the performance for all the lanes.

I took this to mean that different paths through a multiplier, for example, could result in fractionally different completion times - but I've not come across such a multiplier. And, frankly, a few nanoseconds is actually a bloody long time (multiple clock cycles).
> Running each lane with asynchronous clocks would imply independent instruction streams for each lane. That only makes sense with temporal SIMT where instructions are repeated many times or executing loops.

There is, perhaps, a giant clue here in the apparently lazy way that the document appears to use compute unit and SIMD interchangeably. SIMD is stated 3 times, twice in the "background" descriptive paragraph.
> Check out Figure 6 in there. Cascaded FIFOs around a single lane. I'm guessing this isn't Vega, but Navi or later going by filing date. It's not far off of that speculated scalar design I had though.

The prior FIFO would be an operand collector (as seen in NVidia's designs).
> For example, at some time t0 lane 0 may be instructed to multiply two four bit numbers, lane 1 may be instructed to calculate the natural log of an eight bit number and lane n may be instructed to calculate the cosine of a twelve bit number. In general, smaller numbers take less time to calculate than larger numbers, and more simple arithmetic operations take less time than more complicated arithmetic operations. Therefore, it may be that the execution time for lane 0 may be less than lane n but the slowest lane will decide the performance of all the lanes 0 . . . n.

There's nothing like SIMD-ness in that description.
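As a toy illustration of that "slowest lane decides" point (and the accumulation a later post brings up), here's a tiny lock-step model; the latencies are invented and it's not meant to model the patent's actual hardware:

```cpp
// Toy lock-step SIMD model: on every issue, all lanes wait for the slowest lane,
// so per-lane variation turns into wasted time that accumulates over a kernel.
#include <cstdio>
#include <algorithm>
#include <random>

int main() {
    const int lanes = 16, issues = 1000;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> lat(1.0, 1.2);  // 1.0-1.2 ns per op (made-up spread)

    double lockstep = 0.0, sum_work = 0.0;
    for (int i = 0; i < issues; ++i) {
        double slowest = 0.0;
        for (int l = 0; l < lanes; ++l) {
            double t = lat(rng);            // this lane's actual completion time
            slowest = std::max(slowest, t);
            sum_work += t;
        }
        lockstep += slowest;                // the whole SIMD waits for the worst lane
    }
    double avg_lane = sum_work / lanes;     // time if each lane only paid for its own ops
    printf("lock-step: %.0f ns, average per-lane work: %.0f ns, overhead: %.1f%%\n",
           lockstep, avg_lane, 100.0 * (lockstep - avg_lane) / avg_lane);
    return 0;
}
```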
Just to continue the new AMD tradition...:
https://www.reddit.com/r/Amd/comments/652dnk/i_got_a_rx570_early/
Well, very interesting! It's a shame the guy couldn't get it working though.
There was no driver CD in the box.
> Yes, in a current GCN CU, amongst the four SIMDs, you would see this behaviour if you took each "lane" in that paragraph and read it as "SIMD". But there is no reason in that case for the relative completion times of these instructions …

It might apply to a variable SIMD design that co-issued instructions. A "simple arithmetic operation" of 8 or 12 bits might fit into the cadence with some added voltage: lanes 1-8 executing an FMA op, 9-16 a shortened 8-12 bit instruction, synchronously. The bigger issue would be that it only applies to two-cycle cadence operations at best, as the nasty, less frequent instructions were relegated to lane 1 at 1/16th rate, to my understanding? Then there's the question of just how much faster those instructions would hit a steady state to make this technique useful.
> Might there be a few cycles of voltage regulator reaction time? I guess that reaction time for a lane's private voltage regulator should be pretty much non-existent, since there's not a significant distance between the origin of the signal and where it's acted upon, or a wide area of regulated transistors (a few million at most).

The problem I see here is that a linear regulator would be fast, but still burn power. Switching might be interesting, but would be a bit slower, require a capacitor, and possibly introduce noise. The timeline of the patent might track with the metal-insulator-metal capacitors introduced with Zen, though. Still, that's a lot of work for a seemingly small range of relatively infrequent instructions. It would work well to tune voltages per lane though.
PERFORMANCE AND ENERGY EFFICIENT COMPUTE UNIT
So this seems kinda crazy: vary voltage per lane in a SIMD so that slight variations in completion time by individual lanes are ironed-out (effectively clocking lanes of the SIMD at slightly varying frequencies!).
Also seems to be a way to run a SIMD more efficiently.
> In this paragraph, reference to a "few nanoseconds" was initially confusing to me:

I think this may be cumulative over some number of execution cycles. One specific wave may only take a fraction of a nanosecond longer, but it can accumulate. As a physical/electrical phenomenon, there may be a level of correlation in a lane, especially if riding the edge in terms of voltage and timing. If a lane's switching activity versus its current power delivery results in voltage droop, its next cycle may start at a slightly worse voltage level in addition to its delayed start, causing the next operation to add even more delay, and so on.
> I took this to mean that different paths through a multiplier, for example, could result in fractionally different completion times - but I've not come across such a multiplier.

In modern x86 CPUs, division latency depends on the actual values of the operands, although that may be too complex an operation for this scenario.
> The prior FIFO would be an operand collector (as seen in NVidia's designs).

It could be a straightforward FIFO of command signals, with reads from the register file using the same logic as before. AMD's HPC proposal did not note that the register file would be asynchronous, and it was explicitly stated the SRAM would not be running at near-threshold voltage. Buffering multiple operations ahead of the pipeline, giving them a rough time budget, and buffering any writeback could usually let the GPU complete work at a lower voltage and without as much wall-clock time wasted in a long cycle.
> The current GCN 4-cycle cadence results in a simple and efficient 16-lane split register file design. No bank conflicts at all. Registers can be very close to execution units.

While the exact width may not be known, one way to split the difference, as AMD has indicated for its exascale proposal, is to keep SRAM at a higher voltage level and possibly keep it on a synchronous clock. The patent may mean things are less tightly linked: the ALU lanes would buffer work and writeback in various FIFOs, so that actual interactions with the register file would only occur after all the variable timing has been resolved.
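If I'm reading that right, the decoupling being described is roughly the shape of this toy sketch: lanes retire results into small per-lane FIFOs at their own pace, and the (synchronous) register file only drains completed entries. Purely an illustration of that interpretation, not of the patent's actual design:

```cpp
// Toy sketch: per-lane result FIFOs decouple variable-latency ALU lanes from a
// synchronous register-file writeback. Lane-to-lane timing variation is absorbed
// by the FIFOs instead of the register file seeing asynchronous timing.
#include <cstdio>
#include <queue>
#include <vector>

struct Result { int reg; double value; };

int main() {
    const int lanes = 4;
    std::vector<std::queue<Result>> fifo(lanes);     // one small writeback FIFO per lane

    // Lanes finish ops at their own pace and just push results into their FIFO.
    for (int l = 0; l < lanes; ++l)
        for (int op = 0; op < 2 + l; ++op)           // uneven completion, on purpose
            fifo[l].push({op, l * 10.0 + op});

    // Synchronous drain: each "register file cycle" takes at most one entry per lane.
    for (int cycle = 0; ; ++cycle) {
        bool drained_any = false;
        for (int l = 0; l < lanes; ++l) {
            if (!fifo[l].empty()) {
                Result r = fifo[l].front(); fifo[l].pop();
                printf("cycle %d: lane %d writes r%d = %.1f\n", cycle, l, r.reg, r.value);
                drained_any = true;
            }
        }
        if (!drained_any) break;
    }
    return 0;
}
```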