AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

https://www.youtube.com/user/amd/videos
A bunch of videos AMD recently uploaded.
  • "Removing resource management from the developer with Vega" Link
Most awesome news in a long time. Someone is finally talking about games and fine-grained automated memory paging from CPU memory. Hopefully Nvidia follows suit. The professional ($5000+) Pascal P100 already supports this in CUDA. Link: http://www.techenablement.com/key-aspects-pascal-commercial-exascale-computing/. Now we just need consumer NV GPU support and graphics API support.

Future looks bright :)
 
Expecting a 530mm² Vega not to clean up against a 1080 is lunacy. That would be the biggest fail AMD has ever conceived.
 
You're not taking drivers into account. As far as we know, the drivers for Vega were in alpha, and given how different Vega is from past GCN, we can be sure AMD has a lot of work to do on the drivers.

What we forget to take into account is that the Vega run had V-sync fixed at 60 fps. We have no idea what fps it actually reached; the average could have been 61 fps or 75 fps, who knows. Honestly, I don't know what its performance is based on this demo.

It would be the same thing for the Titan X if V-sync is enabled (and it is in some of the links posted above).

I just can't tell what its performance in SW was, based on a frame rate artificially capped at 60 fps (V-sync on).
 
In 3840x2400 my 4790 + 1080 delivers 60 FPS, with some dips to 53 FPS during heavy action at the settings used and on Endor. So I think the shown performance is around the 1080 level.
 
Most awesome news in a long time. Someone is finally talking about games and fine-grained automated memory paging from CPU memory. Hopefully Nvidia follows suit. The professional ($5000+) Pascal P100 already supports this in CUDA. Link: http://www.techenablement.com/key-aspects-pascal-commercial-exascale-computing/. Now we just need consumer NV GPU support and graphics API support.

Future looks bright :)

Is this unified memory? Every Pascal GPU supports 49-bit addressing and the "Page Migration Engine" - from the comment section:

Any Pascal GPU supports the new Unified Memory features such as on-demand paging and GPU memory oversubscription, so you can definitely prototype your application on GTX 1080 or GTX Titan X
https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
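For reference, this is roughly what that on-demand paging / oversubscription looks like from the CUDA side today. A minimal sketch, assuming a managed allocation bigger than the card's VRAM (the size and grid configuration are made up for illustration):

Code:
// Minimal sketch of Pascal unified-memory oversubscription: a managed
// allocation larger than VRAM whose pages migrate in on demand as the
// kernel faults on them. Sizes are assumptions, not from the article.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;           // first GPU access faults the page in
}

int main() {
    const size_t n = 1ull << 31;          // 8 GiB of floats: assumed > VRAM
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));   // one pointer, CPU + GPU visible
    for (size_t i = 0; i < n; ++i) data[i] = 0.0f; // pages start out resident on the host

    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();              // driver pages data in/out as needed
    printf("data[0] = %f\n", data[0]);    // CPU access migrates the page back
    cudaFree(data);
    return 0;
}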
 
Moving on, it seems we were mistaken in our interpretation of that "128 32-bit ops per clock" slide.
It's "ops", not "madds", so the ALU count on the "NCU" is the same. Each ALU can do two operations per cycle (multiply + add), so that hasn't changed.

Could be, but this slide indicates both twice the clock (which I doubt should be taken literally) and twice the ops (/units) per CU.
 
Just use a shader to implement all of it. Same thing that occurred when DX9 replaced the fixed-function vertex/pixel stages with programmable ones.
DX8 did that, not DX9. And merging four programmable stages that all run at different cadences into a single programmable stage is a far, far more complex thing to do than what was done in DX8.
 
Is this unified memory? Every Pascal GPU supports 49-bit addressing and the "Page Migration Engine" - from the comment section:


https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/


It takes some work out of the developers' hands, but I'm pretty sure most of the work that's done now for streaming assets still has to be there. Of course, I think it does take load off the CPU with the new implementation methods; just guessing at this point, though.
 
It takes some work out of the developers' hands, but I'm pretty sure most of the work that's done now for streaming assets still has to be there. Of course, I think it does take load off the CPU with the new implementation methods; just guessing at this point, though.
Assuming it is on-demand page migration, it should work differently though (push/copy vs pull/paging). Everything would be like automatic tiled resources, probably except render targets. Subsets of the resources would be swapped in only on demand (upon a page fault when accessed, or when a prefetch hint is given; see the sketch at the end of this post). The current model still requires manual management with assumptions about the size of VRAM.

Although the real question is whether it is that simple... AFAIU it can be done rather easily with the abstractions and coherency guarantees the graphics stack provides, as long as the GPU address-translation hierarchy is architected to handle it. But probably not for compute (HSA/OCL), especially for HSA, which requires agents to share/mirror the process VAS.

Sounds like a fit for Linux's HMM effort, though.
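To make the push/copy vs pull/paging distinction concrete, here is a small sketch against CUDA's existing unified-memory API, where the "prefetch hint" corresponds to cudaMemPrefetchAsync/cudaMemAdvise (buffer size and device index are just illustrative assumptions):

Code:
// Demand paging vs. explicit prefetch hints with CUDA managed memory.
// The API calls are real CUDA runtime functions; the sizes are arbitrary.
#include <cuda_runtime.h>

__global__ void consume(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const size_t n = 1 << 26;              // 64M floats, illustrative size
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) in[i] = 1.0f;

    // Without hints the kernel pulls pages in on fault; with a prefetch the
    // migration is front-loaded instead, like the "prefetch hint" case above.
    int device = 0;
    cudaMemAdvise(in, n * sizeof(float), cudaMemAdviseSetReadMostly, device);
    cudaMemPrefetchAsync(in,  n * sizeof(float), device, 0);
    cudaMemPrefetchAsync(out, n * sizeof(float), device, 0);

    consume<<<(unsigned)((n + 255) / 256), 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}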
 
I would actually prefer variable wavefront sizes realized by a variable amount of looping over a narrower SIMD (like vec4). Okay, it stays a bit more granular (if one keeps latency = throughput = 4 cycles, one gets wavefront sizes of at least 16), but one could keep a lot of the other stuff intact. For the smaller wavefronts one needs relatively more scalar ALUs in the CU (optimally still one per 4 vALUs), but that should be a relatively small investment.
Based on the recent die shot of Polaris, if the scalar portion is next to the shared instruction fetch/scalar cache portion of a CU group and below what appears to be the LDS, quadrupling that band gets around the area of one SIMD partition (2 SIMDs).
However, some of that area might be due to the scalar portion being integrated with part of the overall scheduling pipeline, which ties into the later point.

One needs to increase the scheduling capacity per SP though, as each small vALU needs its own instructions. But it could work out in terms of power consumption, as larger wavefronts should still dominate and one could gate the scheduling logic 75% of the time for old-fashioned 64-element wavefronts. In the case of smaller 16- or 32-element wavefronts, the increased throughput (potentially a factor of 4) justifies the scheduler's increased consumption.
Breaking that 16-wide SIMD into 4 independent quad-width units creates 4x the peak scheduling demand and area, but only for 1/4 of a current CU's ALU capacity.

What happens with the vector register file might be interesting, since the file's depth per lane would have to increase in order to house a 64-wide wavefront's context in a narrower SIMD.
If GCN optimized its register-read process for the current configuration, there's possibly a bit more math, since finding the physical register needs to take the wavefront width into account: a registerID may span 1, 2, or 4 rows before considering 64-bit operands.
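As a toy illustration of that register-addressing point (the numbers are assumptions, not a known GCN/Vega layout): with a 16-wide SIMD and wavefront widths of 16/32/64, one registerID spans 1, 2 or 4 physical rows, so the lookup has to fold the wavefront width into the index:

Code:
// Hypothetical mapping of (registerID, lane) to a physical register-file row
// when the wavefront width varies on a 16-wide SIMD. Purely illustrative.
#include <cstdio>

const int SIMD_WIDTH = 16;

int physical_row(int reg_id, int lane, int wavefront_width) {
    int rows_per_reg = wavefront_width / SIMD_WIDTH;   // 1, 2 or 4 rows
    return reg_id * rows_per_reg + lane / SIMD_WIDTH;
}

int main() {
    printf("64-wide wave, v2 lane 37 -> row %d\n", physical_row(2, 37, 64)); // 10
    printf("16-wide wave, v2 lane 13 -> row %d\n", physical_row(2, 13, 16)); // 2
    return 0;
}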


When I first saw that diagram, I did not think of asymmetrical SIMDs but of lane gating, which can help quite a bit if you expect to run into power limits or have a rather aggressive clock boost in place.
I'm curious whether lane clock gating isn't already being done for inactive lanes even without physically different SIMDs.
Unless a wavefront monopolizes a SIMD for multiple vector cycles, shutting down lanes based on one wavefront's mask is not going to realize much savings if the next issue cycle comes from another wavefront with a contradictory mask, unless there's fine-grained detection of lane use. Once you have that level of detection, it should work fine with or without the diagram's method.
The diagram stating there are space savings doesn't make sense if it's just gating; inactive lanes don't go away in that case, barring some other change in the relationship between lanes and storage.
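That fine-grained lane-use detection essentially boils down to OR-ing the exec masks of everything resident on the SIMD: a lane is only worth gating if no wavefront that will issue soon uses it. A purely illustrative sketch, not a description of real hardware:

Code:
// A lane can be gated only if it is inactive in every resident wavefront's
// exec mask, i.e. in the OR of all of them. Illustrative model only.
#include <cstdint>
#include <cstdio>

uint64_t gateable_lanes(const uint64_t* exec_masks, int num_wavefronts) {
    uint64_t active_anywhere = 0;
    for (int w = 0; w < num_wavefronts; ++w)
        active_anywhere |= exec_masks[w];
    return ~active_anywhere;               // lanes no resident wavefront touches
}

int main() {
    // Two wavefronts with contradictory masks leave nothing to gate.
    uint64_t masks[2] = { 0x00000000FFFFFFFFull, 0xFFFFFFFF00000000ull };
    printf("gateable lanes: %016llx\n",
           (unsigned long long)gateable_lanes(masks, 2));   // prints 0
    return 0;
}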


About time! I have been fixing and improving console GCN2 cache-management code over the past two weeks. I am happy to hear that the L2 now handles the ROPs as well (even if there are still some tiny L1 ROP caches). Much less L2 flushing needed. Should be good for async compute as well :)
Up until execution/export could become reordered with binning, the pipelined import/export of ROP tiles may have posed a risk of thrashing the L2 more than it saved ROP traffic back to memory. Shading and exporting on the basis of a bin's consolidated lifetime, rather than a mix of fragments with conflicting exports, might have helped. It could also help reduce or avoid thrashing of the compression pipeline, and possibly give a point of consistency for an in-frame read-after-write to produce valid data.
That might depend on where the compression pipeline and its own DCC cached data resides.

And we still don't know if it supports ROVs and conservative rasterization.
Part of the ROV process might fall out of binning. ROV mode could make a bin terminate upon detecting an overlap, then start a new bin with the most recent fragment. If the binning process has multiple bins buffered for each tile, the front end might be able to switch over to another tile if the next bin also hits a conflict.
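A very rough software model of that guess (not documented Vega behaviour): close the current bin as soon as an incoming fragment overlaps a pixel already in it, then start a new bin with that fragment:

Code:
// Speculative model of ROV-mode binning: terminate a bin on overlap.
#include <cstdio>
#include <vector>
#include <unordered_set>

struct Fragment { int pixel; int primitive; };

std::vector<std::vector<Fragment>> bin_rov(const std::vector<Fragment>& frags) {
    std::vector<std::vector<Fragment>> bins(1);
    std::unordered_set<int> covered;                  // pixels in the open bin
    for (const Fragment& f : frags) {
        if (covered.count(f.pixel)) {                 // overlap -> ordering hazard
            bins.emplace_back();                      // terminate bin, open a new one
            covered.clear();
        }
        bins.back().push_back(f);
        covered.insert(f.pixel);
    }
    return bins;
}

int main() {
    std::vector<Fragment> frags = {{5,0},{6,0},{5,1},{7,1}};  // pixel 5 hit twice
    printf("bins: %zu\n", bin_rov(frags).size());             // 2
    return 0;
}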

Could be, but this slide indicates both twice the clock (which I doubt should be taken literally) and twice the ops (/units) per CU.
The Vega MI25 is supposedly 25 TF of FP16. Unless there are half as many CUs, it should be higher than 25. 4096 SPs x 2 (FMA) x 2 (FP16) x ~1.5 GHz gives ~25 TF.
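Plugging the rumored numbers in lands just under 25 TF; a throwaway sanity check (every input is an assumed/rumored figure from this thread, not a confirmed spec):

Code:
#include <cstdio>

int main() {
    const double shader_cores = 4096;   // assumed SP count
    const double fma_ops      = 2;      // one FMA counts as two FP ops
    const double fp16_rate    = 2;      // packed FP16 = 2x FP32 rate (rumored)
    const double clock_ghz    = 1.5;    // assumed clock, not confirmed
    printf("%.1f TFLOPS FP16\n",
           shader_cores * fma_ops * fp16_rate * clock_ghz / 1000.0);   // ~24.6
    return 0;
}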
 
Would be strange for them to pull a Fermi (RE: clocks), wouldn't it? o_O

Yes, that's why I say we probably should not take the clock doubling in the diagram literally - but just as a significant increase as mentioned in the text.

The Vega MI25 is supposedly 25 TF of FP16. Unless there are half as many CUs, it should be higher than 25. 4096 SPs x 2 (FMA) x 2 (FP16) x ~1.5 GHz gives ~25 TF.

Yeah, twice-as-wide CUs (and I agree that 128 sounds too wide) would mean half as many CUs. The slide could also indicate better packing of the 'ops' (higher utilization) - for instance wavefront compaction, but I think we concluded that was for Navi only?
 
You're not taking drivers into account. As far as we know, the drivers for Vega were in alpha, and given how different Vega is from past GCN, we can be sure AMD has a lot of work to do on the drivers.
Indeed. That may also explain why the release is likely further away than many were expecting.
AMD would want to avoid a re-run of Tahiti's release, where initial drivers gave a poorer impression of the architecture than was eventually the case.
 
Indeed. That may also explain why the release is likely further away than many were expecting.
AMD would want to avoid a re-run of Tahiti's release, where initial drivers gave a poorer impression of the architecture than was eventually the case.
Talking about that, I saw a recent review where Tahiti came out ahead of the 680.

Sent from my HTC One via Tapatalk
 
Yeah, twice-as-wide CUs (and I agree that 128 sounds too wide) would mean half as many CUs.
The preview footnotes also put the triangle rate at up to 11 polygons per clock over 4 shader engines, which would bring the CUs per engine lower in this performance tier than it has been. There was no mention of texture or load/store throughput, which would presumably see sharply higher demand if the vector path doubled.

Maybe the higher ops count is about removing sources of stalls. Another possibility is that a CU could issue more than one instruction from a wavefront, possibly from more than one category in the absence of a dependence, such as issuing a vector and a scalar op together.

The slide could also indicate better packing of the 'ops' (higher utilization) - for instance wavefront compaction, but I think we concluded that was for Navi only?
I don't recall much being concluded about Navi, or whether any compaction is in the cards at all.
 