AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

http://www.eurogamer.net/articles/d...tation-4-pro-how-sony-made-a-4k-games-machine
In actual fact, two new AMD roadmap features debut in the Pro, ahead of their release in upcoming Radeon PC products - presumably the Vega GPUs due either late this year or early next year.
Saw this posted by someone else in the PS4 Pro thread. Has some interesting hints in regards to Polaris and Vega.

Finally, there's better support of variables such as half-floats. To date, with the AMD architectures, a half-float would take the same internal space as a full 32-bit float. There hasn't been much advantage to using them. With Polaris though, it's possible to place two half-floats side by side in a register, which means if you're willing to mark which variables in a shader program are fine with 16-bits of storage, you can use twice as many. Annotate your shader program, say which variables are 16-bit, then you'll use fewer vector registers.
...
"One of the features appearing for the first time is the handling of 16-bit variables - it's possible to perform two 16-bit operations at a time instead of one 32-bit operation," he says, confirming what we learned during our visit to VooFoo Studios to check out Mantis Burn Racing. "In other words, at full floats, we have 4.2 teraflops. With half-floats, it's now double that, which is to say, 8.4 teraflops in 16-bit computation. This has the potential to radically increase performance."
@sebbbi I believe this was the register packing you were trying to figure out previously. Not sure if you saw this or not. Context missing in my quote, but the register packing should exist on Polaris.
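To make the register packing idea concrete, here is a rough sketch in plain Python of what "two half-floats side by side in a register" means for storage. The function names and the choice of putting the first value in the low 16 bits are assumptions for illustration only; nothing here is taken from AMD documentation.

```python
import struct
from typing import Tuple


def pack_half2(lo: float, hi: float) -> int:
    """Round two values to FP16 and pack them into one 32-bit word.

    Assumed layout: `lo` in bits 15:0, `hi` in bits 31:16.
    """
    lo_bits, = struct.unpack("<H", struct.pack("<e", lo))
    hi_bits, = struct.unpack("<H", struct.pack("<e", hi))
    return (hi_bits << 16) | lo_bits


def unpack_half2(word: int) -> Tuple[float, float]:
    """Split a 32-bit word back into its two FP16 values."""
    lo, = struct.unpack("<e", struct.pack("<H", word & 0xFFFF))
    hi, = struct.unpack("<e", struct.pack("<H", (word >> 16) & 0xFFFF))
    return lo, hi


def registers_needed(n_values: int, packed: bool) -> int:
    """32-bit registers needed to hold n FP16 values per lane."""
    return (n_values + 1) // 2 if packed else n_values


word = pack_half2(1.0, 2.0)  # 0x40003C00
```

With packing, eight 16-bit variables fit in four registers instead of eight, which is the "use fewer vector registers" benefit the article describes.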

You may later on see something that looks very much like a console GPU as a discrete GPU, but that's then being very familiar with the design and taking inspiration from the console GPU. So the similarity, if you see one, is actually the reverse of what you're thinking,
Shared memory? Or something along the lines of SSG?

"Once a GPU gets to a certain size, it's important for the GPU to have a centralised brain that intelligently distributes and load-balances the geometry rendered. So it's something that's very focused on, say, geometry shading and tessellation, though there is some basic vertex work as well that it will distribute," Mark Cerny shares, before explaining how it improves on AMD's existing architecture.
"The work distributor in PS4 Pro is very advanced. Not only does it have the fairly dramatic tessellation improvements from Polaris, it also has some post-Polaris functionality that accelerates rendering in scenes with many small objects... So the improvement is that a single patch is intelligently distributed between a number of compute units, and that's trickier than it sounds because the process of sub-dividing and rendering a patch is quite complex."
Dual GPU possibility here?
 
Saw this posted by someone else in the PS4 Pro thread. Has some interesting hints in regards to Polaris and Vega.
It's referred to in the post previous to yours ;)

Shared memory? Or something along the lines of SSG?
He's talking about features that appeared in the console first and desktop GPU later.
We did observe that with the original PS4. Having 8 ACEs to better distribute graphics and compute workloads appeared on Liverpool first and only later in GCN2 Hawaii, and dedicated audio DSPs appeared in Bonaire and later.
None of these were present in 2012's first generation of GCN chips.
Same thing will probably be happening with 2*FP16 throughput and the new work distributor for Vega chips.

Dual GPU possibility here?
Nope, it's better work distribution among CUs within the same GPU, leading to better utilization.
 
wccf's Khalid is back with claims about a 1TB SSD card with Vega 10 gpu.

AMD’s next generation Vega 10 GPU will power this unique board which will also be equipped with an excess of one terabyte of directly accessible on-board graphics memory. Represented in 16GB, interposer linked, 2.5D stacked, second generation high bandwidth memory in addition to an on-board solid state device. So, in many ways, Dracarys is quite similar to AMD’s recently announced Radeon Pro SSG. Which features a Fiji GPU as well as an integrated SSD storage option, expandable via an M.2 slot.

However, the storage solution in Dracarys is far more integrated and a lot “closer to the metal” so to speak compared to the Radeon Pro SSG. Which is more like a graphics card and a PCIe SSD integrated into one expansion card. Where there are a few hoops that have to be jumped over for the Fiji GPU to access all of the available storage space on the SSD, rather than having it as a direct pool of memory. In that respect alone, the new Vega 10 powered Dracarys board is very different. Another key difference is the new 20+ FP16 teraflops VEGA 10 graphics engine which has more than double the graphics horsepower of Fiji in the Radeon Pro SSG. Suffice to say, Dracarys is dramatically more potent.

http://wccftech.com/amd-dracrays-vega-10/


Also,

The new AMD Crimson 16.10.2 drivers support a new Fury card: AMD7300.5.

http://www.bitsandchips.it/52-english-news/7582-new-fury-card-spotted-in-the-newest-crimson-drivers
 
On the followup of Vega possibly being produced outside GlobalFoundries, there's one interesting tidbit from yesterday's AMD Q3 results:


Apparently this was in the original slide from back in September:

[image: fci0Q8.png]



So this agreement not only came into place but it's already been accounted for in their GAAP sheet. Meaning it's not just a fallback for something that could happen until 2020, it's something that is already happening for 2017 products from AMD.
 
On the followup of Vega possibly being produced outside GlobalFoundries, there's one interesting tidbit from yesterday's AMD Q3 results:



Apparently this was in the original slide from back in September:

[image: fci0Q8.png]



So this agreement not only came into place but it's already been accounted for in their GAAP sheet. Meaning it's not just a fallback for something that could happen until 2020, it's something that is already happening for 2017 products from AMD.



Yep with the current rumors of GF process not being up to the task, this is why I think Vega would be on TSMC's 16nm.
 
Yep with the current rumors of GF process not being up to the task, this is why I think Vega would be on TSMC's 16nm.
Well... Rumors of Vega using custom cell libraries seem suspect now, unless they were developing for 16nm FF+ from the outset
 
Yeah probably not going to be any custom on the silicon level. The time it would take to do that, just hasn't been there.
 
@sebbbi I believe this was the register packing you were trying to figure out previously. Not sure if you saw this or not. Context missing in my quote, but the register packing should exist on Polaris.

These instructions can be found when compiling for Polaris (and Fiji AFAIR) and observing the assembly in CodeXL. Polaris even produces min/max/avg and similar for half floats. Only muls (and adds, don't remember that one) are converted to floats. I guess the author didn't ask a programmer.
 
Nope, it's better work distribution among CUs within the same GPU, leading to better utilization.
So a HWS that also does geometry? Nice addition, but I'm not sure I'd consider it revolutionary. It seems more like a simple iteration on the existing design than something spearheaded by a console.

He's talking about features that appeared in the console first and desktop GPU later.
We did observe that with the original PS4. Having 8 ACEs to better distribute graphics and compute workloads appeared on Liverpool first and only later in GCN2 Hawaii, and dedicated audio DSPs appeared in Bonaire and later.
None of these were present in 2012's first generation of GCN chips.
Same thing will probably be happening with 2*FP16 throughput and the new work distributor for Vega chips.
Again, they are nice features, but not something I would consider revolutionary.

wccf's Khalid is back with claims about a 1TB SSD card with Vega 10 gpu.
Not surprising considering the SSG. 1TB seems a bit limited, but maybe an entry level product. Considering the FP16 news on Vega that could make a respectable deep learning product.

Well... Rumors of Vega using custom cell libraries seem suspect now, unless they were developing for 16nm FF+ from the outset
Wouldn't be surprising with the consoles. While there could be some differences in the designs, I'd imagine some parts could be copied.

These instructions can be found when compiling for Polaris (and Fiji AFAIR) and observing the assembly in CodeXL. Polaris even produces min/max/avg and similar for half floats. Only muls (and adds, don't remember that one) are converted to floats. I guess the author didn't ask a programmer.
Instructions for packed registers and execution? FP16 has been around for a while, but the doubling/halving is relatively new from my recollection. I believe he was after halving the register counts for his shaders.
 
So a HWS that also does geometry? Nice addition, but I'm not sure I'd consider it revolutionary. It seems more like a simple iteration on the existing design than something spearheaded by a console.
HWS is for compute, and it serves to schedule user mode queues, and bind them to the limited hardware ACE slots dynamically.

The workload distributor described here is more likely to be something IN the intermediate dispatch layer (presumably still being called Shader Processor Interpolator), that deals with the dispatched grids from graphics pipelines and the ACEs. Especially when you saw lines about balancing dispatches from patches among CUs.
 
@sebbbi I believe this was the register packing you were trying to figure out previously. Not sure if you saw this or not. Context missing in my quote, but the register packing should exist on Polaris.

These instructions can be found when compiling for Polaris (and Fiji AFAIR) and observing the assembly in CodeXL. Polaris even produces min/max/avg and similar for half floats. Only muls (and adds, don't remember that one) are converted to floats. I guess the author didn't ask a programmer.
There are native FP16 instructions, but the ISA manual does not specify them being packed.

Description D.f16 = S0.f16 + S1.f16. Supports denormals, round mode, exception flags, saturation.

If it does support packed mode, it should have specified it like `[15:0]` and `[31:16]`.
 
HWS is for compute, and it serves to schedule user mode queues, and bind them to the limited hardware ACE slots dynamically.

The workload distributor described here is more likely to be something IN the intermediate dispatch layer (presumably still being called Shader Processor Interpolator), that deals with the dispatched grids from graphics pipelines and the ACEs. Especially when you saw lines about balancing dispatches from patches among CUs.
Agreed, but in principle it seems like they may follow similar designs. ACEs dispatching compute queues and the new hardware dispatching patches or sub-patches? Different hardware units, most likely, separated only by their dispatch requirements and capable of some synchronization. Not sure how widespread that ID buffer will be, or if it exists in Vega, but that would seemingly be able to provide hints at how much geometry was actually in a patch. Possibly allowing the new units to more accurately divide the workload prior to actually evaluating the patch. That could significantly change how they go about the interpolation or even shift away from fixed function geometry units.
 
For reference which architectures are "included" (to whichever degree) by AMD:
https://radeonopencompute.github.io/GCN_Float16.html

There are native FP16 instructions, but the ISA manual does not specify them being packed.

If it does support packed mode, it should have specified it like `[15:0]` and `[31:16]`.

I see. Then I probably misunderstood, or just looked too quickly. The assembly looked very much like "load (16fp texture), min (loaded packed content with some constant), unpack (and continue with fp32 mul)" last time I generated some with various arithmetics to probe what's the status of the instructions. A pass through shader just consisted of a load+export without exporting with conversion from float to half, the rendertarget was half (the PSO knows this because it needs to be specified), but that one's ordinary if packing is a feature.
 
wccftech's Khalid likely got it from this,

PRINTED CIRCUIT BOARD ASSEMBLY (VIDEO/ GRAPHIC CARD)DRACARYS FIJI/SSD P/N:102-D02702-00 (FOC)

https://www.zauba.com/import-printed+circuit+board+assembly/hs-code-84733030/ip-INHYD4-hs-code.html
Strange they'd call it "Fiji" though. It looks like a standard SSG (Fiji + SSD) from the description. Maybe Fiji and Vega are somewhat interchangeable with the mounting? Could be an interesting form factor as I'd have expected most of the SSG designs to be past the design phase by now.

Vega + SSD seems a no brainer considering the FP16 news, but I don't really see that insight coming from this shipping manifest.

For reference which architectures are "included" (to whichever degree) by AMD:
https://radeonopencompute.github.io/GCN_Float16.html
The current GPUs execute at same speed as Float32.
The speed really is the big difference. The capability for FP16 has been there, but the benefits were a bit more limited.
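For a sense of scale, the 4.2 vs 8.4 teraflops quoted for the PS4 Pro fall out of simple peak-rate arithmetic. This is only a sketch: the 36 CUs and 911 MHz are the commonly reported PS4 Pro figures and are my assumption here (the article only gives the teraflops totals), as are the 64 lanes per CU and counting an FMA as 2 flops.

```python
def peak_tflops(compute_units: int, clock_mhz: float,
                lanes_per_cu: int = 64, flops_per_lane: int = 2,
                packed_factor: int = 1) -> float:
    """Theoretical peak in TFLOPS.

    flops_per_lane=2 counts a fused multiply-add as two operations.
    packed_factor=2 models two FP16 operations per lane per clock.
    """
    flops = (compute_units * lanes_per_cu * flops_per_lane
             * packed_factor * clock_mhz * 1e6)
    return flops / 1e12


# Assumed PS4 Pro configuration: 36 CUs at 911 MHz.
fp32 = peak_tflops(36, 911)                         # about 4.2
fp16_packed = peak_tflops(36, 911, packed_factor=2)  # about 8.4
fp16_unpacked = peak_tflops(36, 911)                 # same rate as FP32
```

Without packed execution, FP16 only saves registers and bandwidth; the arithmetic rate stays at the FP32 figure.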
 
I see. Then I probably misunderstood, or just looked too quickly. The assembly looked very much like "load (16fp texture), min (loaded packed content with some constant), unpack (and continue with fp32 mul)" last time I generated some with various arithmetics to probe what's the status of the instructions. A pass through shader just consisted of a load+export without exporting with conversion from float to half, the rendertarget was half (the PSO knows this because it needs to be specified), but that one's ordinary if packing is a feature.
As far as I know, GCN supports packed FP16 format in buffer/image load/store since day one, and also a variety of buffer formats. But they are expanded/packed by the memory pipeline and take a full register for each component regardless of the buffer format. Having said that, there are packing instructions that are perhaps mainly for export.

Only since GCN3 they have added packed 16-bit x2 load/store.
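A rough Python model of that "expanded by the memory pipeline" behaviour, assuming an RGBA16F texel (8 bytes in memory) for the sake of example. The function name is made up; the point is only that four packed halves in memory come back as four full 32-bit registers.

```python
import struct
from typing import List


def load_rgba16f(buffer: bytes, texel_index: int) -> List[int]:
    """Return the four 32-bit register images for one RGBA16F texel.

    The texel is stored as four packed FP16 values (8 bytes), but each
    component is widened to FP32 and occupies its own register.
    """
    halves = struct.unpack_from("<4e", buffer, texel_index * 8)
    return [struct.unpack("<I", struct.pack("<f", h))[0] for h in halves]


texel = struct.pack("<4e", 1.0, 0.5, 0.25, 2.0)
regs = load_rgba16f(texel, 0)
```

So the format is packed in memory (8 bytes) but costs 16 bytes of register space per lane once loaded, which is why packed buffer formats alone never reduced register pressure.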
 
As far as I know, GCN supports packed FP16 format in buffer/image load/store since day one, and also a variety of buffer formats. But they are expanded/packed by the memory pipeline and take a full register for each component regardless of the buffer format. Having said that, there are packing instructions that are perhaps mainly for export.

Only since GCN3 they have added packed 16-bit x2 load/store.
It is interesting, though, that from what I can tell the first we heard about packed 16-bit x2 operations was in regard to a console, and not from any of the AMD presentation slides, where it would make sense to announce it for HPC.
Context here is Polaris rather than Vega.
The Polaris slides only mention native support for int16 and fp16, so I'm curious whether these packed operations can be enabled on existing Polaris GPUs, or whether they will be limited to console SoCs and Vega.
Sort of like how Nvidia initially had packed fp16 on Tegra X1 with specific functions but not on other Maxwell discrete GPUs, and it is still somewhat limited in implementation on select Pascal GPUs.
Cheers
 
It is interesting, though, that from what I can tell the first we heard about packed 16-bit x2 operations was in regard to a console, and not from any of the AMD presentation slides, where it would make sense to announce it for HPC.
Context here is Polaris rather than Vega.
The Polaris slides only mention native support for int16 and fp16, so I'm curious whether these packed operations can be enabled on existing Polaris GPUs, or whether they will be limited to console SoCs and Vega.
Sort of like how Nvidia initially had packed fp16 on Tegra X1 with specific functions but not on other Maxwell discrete GPUs, and it is still somewhat limited in implementation on select Pascal GPUs.
Cheers
AMD doesn't seem to be pushing Polaris for HPC at all, and clearly Polaris focuses on TTM and lower-end segments that packed FP16 would not have an immediate effect on.
 
AMD doesn't seem to be pushing Polaris for HPC at all, and clearly Polaris focuses on TTM and lower-end segments that packed FP16 would not have an immediate effect on.
Which is part of my point: the packed FP16 [edit functions-operations] is possibly only going to be available for Vega and console SoCs.
So the other challenge is then possibly porting games that are heavily optimised around the packed FP16 operations, as they would not get the same level of performance on mainstream Polaris discrete GPUs in PCs/Macs/laptops.
That could have potentially hurt Nvidia more than DX12, in the context of the consumer gaming market over the next 12-16 months, which is a large proportion of GPUs.
Cheers
 