AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

In the end, P100 has much larger register files (total) than any other GPU.
This nitpick should be allowed in the AMD Vega thread:
P100 has more registers in total than any other nVidia GPU, but Fiji, and very likely Vega10, both have more registers in total: 64 CUs × 4 SIMDs × 64 KB gives 16 MB of vector registers (16.8 MB including the scalar ones), versus 56 SMs × 256 KB = 14 MB on P100.
Traditionally, AMD builds GPUs with relatively large register files, so improving the energy efficiency of register file accesses could help there too. The question is how efficient or wasteful AMD's current register file design is (or whether the lower-hanging fruit is elsewhere), and how one could improve it without giving up the general simplicity of how it works. GCN appears to have been carefully designed from the beginning to reach this simplicity, with different aspects intertwined to make it work.
 
There were already hints around here that even with Maxwell (and the other Pascal GPUs) it's more like 2× 64 SP, and that P100 just makes it explicit. So there's definitely something going on with the register file; it's just a question of figuring out what. :)
Maybe indirectly, but the register file size for the 3,840 CUDA core P40 is 7,680 KB, keeping the ratio of 256 KB × 30 SMs. That makes the first full-core Pascal's register file nearly half the size of the P100's 14,336 KB, due to the P100's 56 SMs.
Cheers
 
In the common case, the waitcnt instruction is reached (and issued) before the data is ready. Issuing the waitcnt instruction would inform the load buffer to start moving data to the registers (at this point you can be sure nobody is accessing those registers anymore).
This turns s_waitcnt effectively into a vector instruction with multiple write destinations, none of which are encoded in the instruction itself for the instruction buffer or upstream decoding to reference. Actually, the "s_" portion of it may already be a problem, since it puts it in the wrong domain and possibly a different sequencer. That s_waitcnt is a special instruction consumed in the instruction buffer may indicate it is even further upstream than that, and may only do a very simple operation in an early pipe stage that checks the counters and flips a bit to indicate the wavefront is ready in the next cycle.
This change would be significant as it requires the operation to communicate across the IB, SIMD, and VMEM domains in non-trivial ways with information not actually encoded in the instruction.

After waitcnt has informed the load buffer that registers are now available, it would start waiting (as it does now), and obviously some other wave would take over the SIMD.
That implies a certain amount of information moving upstream. There is some number of pipeline stages between where s_waitcnt is able to make the thread wait and where registers become available. Some of the manual wait states that affect instruction issue, like setting VSKIP, may mean a lag of 1 or 2 (×4 for real cycles), and that is following the flow of the pipeline. There may be registers that aren't available yet, which s_waitcnt would not be able to know about for some time.
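For what it's worth, the current scheme is simple precisely because s_waitcnt only has to look at per-wave counters. Here's a toy single-wave model of that vmcnt-style bookkeeping (plain C++, all names invented; this is not how the hardware is actually implemented, just the semantics being discussed):

```cpp
#include <cstdio>

// Toy model of a GCN-style vmcnt counter for one wavefront.
// Real hardware tracks this per-wave in the sequencer/instruction
// buffer; this just illustrates the counter semantics.
struct Wave {
    int vmcnt = 0;   // outstanding vector-memory loads, in issue order

    void issue_load()    { ++vmcnt; }   // load leaves the SIMD
    void data_returned() { --vmcnt; }   // memory pipe has written the VGPR

    // s_waitcnt vmcnt(n): the wave can't issue its next instruction
    // until at most n loads are still outstanding.
    bool s_waitcnt_ready(int n) const { return vmcnt <= n; }
};

int main() {
    Wave w;
    w.issue_load();
    w.issue_load();
    std::printf("ready at vmcnt(0)? %d\n", w.s_waitcnt_ready(0)); // 0: stall
    w.data_returned();
    w.data_returned();
    std::printf("ready at vmcnt(0)? %d\n", w.s_waitcnt_ready(0)); // 1: go
}
```

The proposal above would turn that cheap counter check into something that has to trigger register writes downstream, which is exactly the cross-domain communication being objected to.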

I think the easiest way would be full software management of this buffer, similar to the register file. The load instruction would specify both the target VGPR and a load buffer slot.
Is there a particular reason why an instruction needs to explicitly specify a slot number? Memory operations are ordered, and it doesn't save register context if the same slot is linked to more than one register pending the same waitcnt. The scenario where more than one slot is applied to the same register is equivalent to two loads hitting the same register.

Since this still imposes a requirement that there be physical storage somewhere, why not put it into a special range of the register file? On top of that, add a small bit of register renaming and remove the explicit slot number. The target registers prior to a waitcnt get added to a small list, each entry paired with one of the special registers. A waitcnt would flip a bit or add a small offset, and then the current register and the load would switch roles.
Possibly, you could just define a range at one end of the register allocation for this, although that might be something along the lines of a clause-temporary or scratch register from the VLIW days.
It keeps the shared vector memory pipeline from being modified too heavily, and reduces the need for communication between pipelines.
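As a rough sketch of that renaming idea (purely speculative, nothing GCN actually does; the register counts, names, and structure are all made up): loads land in a reserved shadow range, a small pending list pairs each shadow slot with its architectural target, and the waitcnt retires them by updating the mapping instead of copying data.

```cpp
#include <array>
#include <utility>
#include <vector>

// Speculative sketch of the rename scheme proposed above; this is not
// real GCN hardware. Physical regs 0..kRegs-1 back the architectural
// VGPRs, and kShadow extra physical regs act as the "load buffer".
struct RenameModel {
    static constexpr int kRegs = 256;
    static constexpr int kShadow = 8;

    std::array<int, kRegs> map;                // arch VGPR -> physical reg
    std::vector<int> free_shadow;              // physical regs usable as slots
    std::vector<std::pair<int, int>> pending;  // (arch target, shadow reg)

    RenameModel() {
        for (int i = 0; i < kRegs; ++i) map[i] = i;
        for (int i = 0; i < kShadow; ++i) free_shadow.push_back(kRegs + i);
    }

    // Load issue: data will arrive in a shadow register, so in-flight
    // instructions can keep reading the target VGPR's current value.
    bool issue_load(int target_vgpr) {
        if (free_shadow.empty()) return false;  // would stall the load
        pending.emplace_back(target_vgpr, free_shadow.back());
        free_shadow.pop_back();
        return true;
    }

    // s_waitcnt(0): retire everything by swapping roles -- the shadow
    // reg becomes the architectural target, and the old physical reg
    // returns to the shadow pool. No data moves across the RF.
    void waitcnt_retire_all() {
        for (auto [arch, shadow] : pending) {
            free_shadow.push_back(map[arch]);
            map[arch] = shadow;
        }
        pending.clear();
    }
};
```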
 
Hmm, so he really claimed this HBCC stuff is managing the memory in hardware and the developer doesn't have to take much care of it (doesn't rule out driver intervention, though). Let's see how this will work out.
PS: And he explicitly confirmed the draw stream binning rasterizer just works without the need for code changes, as one could expect (someone here was doubting that).
 
Does the VRAM controller need much software input today? Imho not, as long as you don't try to interfere with what the application does. I cannot see how hardware should be able to correctly guess which textures need to be loaded, and which will not be used soon and can be discarded, without the application handling it.
 
But the driver does know that. It has access to the future rendering commands it will send to the GPU, and as such knows which textures it will need (as I mentioned before, they like it if they can have an entire frame pending; the so-called "render ahead"). Though that does not necessarily mean it also knows which mip level it will need. There really isn't much interesting going on between the VRAM controller and software, but only until the application blows past the available GPU local memory. That's when the fun starts.
DX11 drivers are already very good at this. The only problem with drivers interfering with applications is the case of virtual texturing, when the application overshoots available GPU memory. Because this explicitly means the driver will say "ok, this texture hasn't been used in a while, push it to system memory" while the game goes "ok, this texture hasn't been used in a while, let's reuse it for this new texture which we need next frame". But this should kinda fix itself with DX12/Vulkan and the usage of actual hardware sparse resources.
With HBCC and memory paging, yes, it can all just work. When the GPU accesses something that's not in local memory, it will fault and suspend execution of the wave, and the OS will upload the required page to the GPU. After that is done, the wave will wake up and resume. This will "just work". The question is how well (without someone issuing a prefetch).
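A toy model of that fault-and-resume flow (invented names and page granularity, nothing vendor-specific):

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Toy model of HBCC-style demand paging: a wave touching a non-resident
// page faults and suspends; the host pages the data in; the wave resumes.
constexpr uint64_t kPageBits = 12;           // assume 4 KiB pages

struct GpuMemory {
    std::unordered_set<uint64_t> resident;   // pages in local memory

    bool access(uint64_t addr) {             // true = hit, no fault
        return resident.count(addr >> kPageBits) != 0;
    }
    void page_in(uint64_t addr) {            // host/OS service routine
        resident.insert(addr >> kPageBits);
    }
};

int main() {
    GpuMemory mem;
    uint64_t addr = 0x12345678;

    if (!mem.access(addr)) {                 // wave faults...
        std::puts("fault: suspend wave, ask host for page");
        mem.page_in(addr);                   // ...host uploads the page...
        std::puts("page resident: wake wave, replay access");
    }
    if (mem.access(addr)) std::puts("access succeeds");
}
```

The open question in the posts above is exactly the cost of that fault round-trip versus a prefetch that avoids it.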
 
But there's little difference between a VRAM controller and HBCC when you need to stop the wavefront until the data is loaded from system RAM; both are much faster than PCIe. So for HBCC to make sense, it would need to wait less often than a conventional VRAM controller.
 
At the moment you won't just stop a wavefront; you'll stop feeding rendering commands to the GPU (when the next command needs a texture that's not yet local).
 
Does the VRAM controller need much software input today? Imho not, as long as you don't try to interfere with what the application does. I cannot see how hardware should be able to correctly guess which textures need to be loaded, and which will not be used soon and can be discarded, without the application handling it.
Depends on what you mean by software input. The driver still needs to update the page table and make sure the needed resources are resident before you access an allocation in the GPUVM address space, and this part of memory management is now exposed via modern APIs (D3D12, Vulkan, etc.). Otherwise the GPU would go dead when a page fault occurred (except for sparse textures).
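For the D3D12 side, a minimal sketch of that explicit residency pattern (the device/heap plumbing, budget checks, and error handling are omitted; function names here are made up, but MakeResident/Evict are the real API calls):

```cpp
#include <d3d12.h>

// Make a set of heaps resident before recording commands that reference
// resources placed in them. MakeResident blocks until the allocations
// are in GPU-local memory (or fails if the memory budget is exceeded);
// real code would check IDXGIAdapter3::QueryVideoMemoryInfo first.
HRESULT PrepareHeapsForFrame(ID3D12Device* device,
                             ID3D12Pageable* const* heaps, UINT count)
{
    return device->MakeResident(count, heaps);
}

// Evict marks the allocations as safe to page out to system memory;
// contents are preserved and restored on the next MakeResident.
HRESULT ReleaseHeapsAfterFrame(ID3D12Device* device,
                               ID3D12Pageable* const* heaps, UINT count)
{
    return device->Evict(count, heaps);
}
```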

Accessing pageable memory via the ATC (process VAS) can already be done transparently on the right hardware (e.g. Carrizo).
 
The "AMD Radeon RX 490 8GB" was mentioned as part of the recommended specs for the Fallout 4 High Res Texture Pack. The recommendation from the Nvidia side was a 1080.

It's now been removed and the only recommendation is the 1080, but the folks at r/fallout happened to copy the text.

Original Text:

Next week, we’re rolling out new updates for both Fallout 4 and Skyrim Special Edition across Xbox One, PlayStation 4, and PC. Both games are bringing new features to Mod content (more on that next week), and specifically for PlayStation 4 Pro users, we are thrilled to share details on our official PS4 Pro support.

FALLOUT 4 GOES PRO FOR PS4

Beginning next week, Fallout 4’s Update 1.9 on PlayStation 4 adds support for the power of the PlayStation 4 Pro console. The update provides enhanced lighting and graphic features, including:

  • Native 1440p resolution
  • Enhanced draw distance for trees, grass, objects and NPCs
  • Enhanced Godray effects
To experience the improvements provided with our PS4 Pro update, load up Fallout 4 and download the latest title update when it becomes available.

OFFICIAL HIGH RESOLUTION TEXTURE PACK FOR PC

Also available next week, with so many fans still actively playing Fallout 4 on Steam, we’re excited to announce the release of the game’s High-Resolution Texture Pack. Consider this free download a love letter to our amazing PC fans that have supported us – not just with Fallout 4, but across multiple decades and games.

Note: To utilize the High-Resolution Texture Pack, make sure you have an additional 58 GB of available space and that your system meets/exceeds the recommended specs below.

Recommended PC Specs

Windows 7/8/10 (64-bit OS required)

Intel Core i7-5820K or better

GTX 1080 8GB/AMD Radeon RX 490 8GB

8GB+ Ram

If your system can handle it, the Commonwealth will look better than ever. Give it a shot and if you need to return to the original textures, you can disable them within the game’s launcher menu.
 
Might not even be Vega. It could be P12 (unlikely, too small?) or a P10 GX2 (no idea), assuming it wasn't a typo.

I'm thinking if it was a typo, they would've corrected the typo and been done with it.

It's not unheard of for recommended specs to pick the "best" part from each camp, even if those aren't terribly comparable, e.g. 1080, 480.

But if that was the case, then they would've just fixed the typo rather than remove mention of the "490".
 
The problem with the typo theory is that an RX 480 would never be in the same performance bracket as the GTX 1080. Then again, these requirements are completely messed up a lot more often than they should be.

I'm gonna go with either a typo or the dual Polaris 10 that has also been appearing in the driver and kernel listings.
2017's new product line will probably consist of Polaris cards being rebranded to RX 5xx, big Vega being Fury Z or 2, and small Vega being RX 580 or 590.

Speaking of Vega, Hynix has updated their HBM2 portfolio again, and it seems they now only have 1.6Gbps chips. 2Gbps chips are nowhere to be found.

https://videocardz.com/65649/sk-hynix-updates-memory-product-catalog-hbm2-available-in-q1-2017

So either big Vega is coming with 1.6 Gbps after all (2 stacks × 1024 bits × 1.6 Gbps ÷ 8 ≈ 410 GB/s total), or AMD will be turning to Samsung for the 2 Gbps chips on big Vega, and Hynix's chips will be used on something else like small Vega or APUs.
 