AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Hmm, 4× the power efficiency is good, but compared to what?
While the "Nx" comparisons could be against multiple GPUs, the "bandwidth per pin" and "capacity/stack" comparisons are specifically against Fiji*. So perhaps the other comparisons are also against Fiji.

I think this assumption is consistent with existing rumors and information:
  • "2x Peak Throughput per Clock": makes sense from the double rate FP16.
  • "4x Power Efficiency": The Radeon Instinct MI25 has 25 FP16 TFLOPS and a < 300 W TDP, which gives > 2.7x compared to the Fury X. I've seen a rumored 230 W TDP for some Vega 10 part, which gives 3.5x using the same TFLOPS value, so the 4x number could make sense.
Exactly 4x the FP16 GFLOPS/W of Fiji in ~230 W would give ~29 TFLOPS. It's possible that consumer Vega has higher clock speeds than the Instinct MI25. (The 4x number could also have been rounded up.)

* Or a non-AMD GPU, presumably, but I think Fiji makes the most sense.
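The arithmetic behind those ratios can be sketched quickly. Note the Fury X baseline here (~8.6 FP16 TFLOPS at a 275 W TDP) is an assumption, since Fiji runs FP16 at the FP32 rate:

```python
# Rough FP16 efficiency comparison, assuming Fury X: 8.6 FP16 TFLOPS
# @ 275 W (Fiji has no double-rate FP16) and MI25: 25 FP16 TFLOPS.
fury_x = 8.6e3 / 275             # FP16 GFLOPS per watt
mi25_at_300w = 25e3 / 300
mi25_at_230w = 25e3 / 230

print(round(mi25_at_300w / fury_x, 2))   # 2.66 -- a TDP a bit under 300 W gives ~2.7x
print(round(mi25_at_230w / fury_x, 2))   # 3.48 -- the ~3.5x above

# FP16 TFLOPS needed for exactly 4x Fiji's efficiency in a 230 W TDP:
print(round(4 * fury_x * 230 / 1e3, 1))  # 28.8 -- the ~29 TFLOPS above
```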
 
Could just be the power efficiency of HBM2 compared to GDDR5 per unit of bandwidth. They gave HBM 3x the efficiency, so HBM2 could easily be 4x.
 
What's Vega NCU? Is it the same as "Next Generation Computer Engine"? Also is that a typo "computer engine"?
Also a draw stream binning rasterizer sounds interesting alongside "Next Generation Pixel Engine". Maybe conservative rasterization?
 
What's Vega NCU? Is it the same as "Next Generation Computer Engine"? Also is that a typo "computer engine"?
Also a draw stream binning rasterizer sounds interesting alongside "Next Generation Pixel Engine". Maybe conservative rasterization?
I hope so on CR.
 
Hmm, 4× the power efficiency is good, but compared to what?

I'll guess the previous top-end solution, meaning Fiji in Fury X form.

Though it could/should be closer to 3x Hawaii's efficiency in actual performance.
 
Seem like a new cache hierarchy and a new rasteriser are coming.
What's Vega NCU? Is it the same as "Next Generation Computer Engine"? Also is that a typo "computer engine"?
Also a draw stream binning rasterizer sounds interesting alongside "Next Generation Pixel Engine". Maybe conservative rasterization?
Some guess NCU stands for Next Compute Unit. A binning rasteriser sounds like a match to Nvidia's tiled rasteriser, with "draw stream" probably meaning it isn't TBDR (?).
 
Any guesses as to why we have a bullet point for the cache and cache controller?

Perhaps the HBM sits in between the GPU and a traditional GDDR pool?
 
Any guesses as to why we have a bullet point for the cache and cache controller?
The simplest guess without too much fanciness is a new cache hierarchy, which might be a complement with AMD's claim (reported by EETimes) of Vega utilising the same data fabric (NoC) as Zen SoCs do, which is said to be scalable from SoC uses (<40 GB/s) to beyond 512 GB/s (two HBM2 stacks).
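That "beyond 512 GB/s" figure lines up with two standard HBM2 stacks; a quick check, assuming the per-pin rate is the HBM2 spec maximum of 2 Gb/s:

```python
# Two HBM2 stacks, each with a 1024-bit interface at 2 Gb/s per pin.
stacks, bus_bits, gbps_per_pin = 2, 1024, 2.0
print(stacks * bus_bits * gbps_per_pin / 8)   # 512.0 GB/s
```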
 
512TB virtual address space presumably means there is one extra address bit (49 bits) in GPU's own VM hierarchy over GCN3 (48 bits). No idea why they would just bump up one bit though... Are they going to map the entire host virtual address space into the GPUVM and unify the address translation hierarchies (ATC/GPUVM), heh?
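The 512 TB figure does follow directly from that one extra bit:

```python
# Virtual address space sizes in TiB: GCN3 and modern x86-64 use
# 48 bits; Vega's quoted 512 TB corresponds to 49 bits.
print(2**48 // 2**40)   # 256 -- 48-bit virtual address space
print(2**49 // 2**40)   # 512 -- one extra bit doubles it
```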
 
Yes, same reason why nVidia supports this with Pascal:
GP100 extends GPU addressing capabilities to enable 49-bit virtual addressing. This is large enough to cover the 48-bit virtual address spaces of modern CPUs, as well as the GPU's own memory.
This allows GP100 Unified Memory programs to access the full address spaces of all CPUs and GPUs in the system as a single virtual address space, unlimited by the physical memory size of any one processor.
 
Yes, same reason why nVidia supports this with Pascal:
AMD kinda did this in GCN already though. There is an ATC bit in various descriptors that specifies whether the address is in GPUVM or in ATC (host address space through IOMMU).
 
Guess they wouldn't want Compute Unit NexT. ;)
As opposed to Graphics Core Next? I'm still unsure on what exactly NCU is in reference to. It might be their command processor or something.

The simplest guess without too much fanciness is a new cache hierarchy, which might be a complement with AMD's claim (reported by EETimes) of Vega utilising the same data fabric (NoC) as Zen SoCs do, which is said to be scalable from SoC uses (<40 GB/s) to beyond 512 GB/s (two HBM2 stacks).
Using the same fabric was extremely likely as separate Naples and Zen dice would have to coexist on the same package. Cache controller might also be related to ESRAM, partitioning of cache and units, or off package storage like SSG.

512TB virtual address space presumably means there is one extra address bit (49 bits) in GPU's own VM hierarchy over GCN3 (48 bits). No idea why they would just bump up one bit though... Are they going to map the entire host virtual address space into the GPUVM and unify the address translation hierarchies (ATC/GPUVM), heh?
They were already leaning on the CPU's IOMMU unit for addressing with the ATC. Considering the number of memory controllers in a GPU as opposed to a CPU, it could be a pin/pad issue requiring additional routing. Additional bits could be used for addressing virtual pools, encryption, CRC, etc. While not a lot, it could add up.
 
So, I have to log in to link this patent application:

http://www.freepatentsonline.com/y2016/0371873.html

HYBRID RENDER WITH PREFERRED PRIMITIVE BATCH BINNING AND SORTING

A system, method and a computer program product are provided for hybrid rendering with deferred primitive batch binning. A primitive batch is generated from a sequence of primitives. Initial bin intercepts are identified for primitives in the primitive batch. A bin for processing is identified. The bin corresponds to a region of a screen space. Pixels of the primitives intercepting the identified bin are processed. Next bin intercepts are identified while the primitives intercepting the identified bin are processed.

Don't understand it as yet...
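One loose reading of the abstract, sketched in software with entirely invented names: buffer a batch of primitives, work out which screen-space bins each one touches, then process the batch one bin at a time.

```python
# Hypothetical sketch of deferred primitive batch binning. The bin
# size and all names are invented for illustration.
BIN = 32  # bin size in pixels (assumed)

def bins_intercepted(prim):
    """Bins overlapped by a primitive's screen-space bounding box."""
    (x0, y0), (x1, y1) = prim["bbox"]
    return {(bx, by)
            for bx in range(x0 // BIN, x1 // BIN + 1)
            for by in range(y0 // BIN, y1 // BIN + 1)}

def render_batch(batch):
    # "Initial bin intercepts are identified for primitives in the batch."
    intercepts = {p["name"]: bins_intercepted(p) for p in batch}
    # "A bin for processing is identified" -- walk the touched bins in
    # order; per bin, process only the primitives that intercept it.
    for b in sorted(set().union(*intercepts.values())):
        for p in batch:
            if b in intercepts[p["name"]]:
                print(f"shade {p['name']} in bin {b}")

render_batch([
    {"name": "tri0", "bbox": ((0, 0), (40, 40))},
    {"name": "tri1", "bbox": ((30, 30), (70, 70))},
])
```

The "next bin intercepts are identified while ... processed" part presumably means the intercept search is pipelined with the shading of the current bin, which a serial sketch like this can't show.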
 
They were already leaning on the CPU's IOMMU unit for addressing with the ATC. Considering the number of memory controllers in a GPU as opposed to a CPU, it could be a pin/pad issue requiring additional routing. Additional bits could be used for addressing virtual pools, encryption, CRC, etc. While not a lot, it could add up.
Those possibilities you mentioned aren't likely covered by this one addressing bit, however. Let's say ESRAM, it is hard to imagine it not being virtualised behind the per-process virtual address space. For the LDS or scratch in the flat address space, they just need an aperture base pointer to remap from a full 64-bit flat address, and those addresses would not touch address translation at all (scratch memory would, but the address would be the resolved one). Encryption like Zen would be a bit in the physical address, not the virtual memory address.

Perhaps it is a sign of supporting Linux HMM in its ideal form — flipping ATC/GPUVM at page granularity to support flexible hot migration of pages to the GPU local memory.
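The aperture remapping described above amounts to a range check on the 64-bit flat address; a minimal sketch, with the base and size values invented for illustration:

```python
# Hypothetical flat-address decode: LDS sits behind an aperture base
# pointer carved out of the 64-bit flat space, so hits need no
# page-table walk. Base/size are made-up values.
LDS_APERTURE_BASE = 0x0000_7F00_0000_0000
LDS_APERTURE_SIZE = 64 * 1024

def decode_flat(addr):
    if LDS_APERTURE_BASE <= addr < LDS_APERTURE_BASE + LDS_APERTURE_SIZE:
        return ("lds", addr - LDS_APERTURE_BASE)   # remapped, untranslated
    return ("global", addr)                        # goes through GPUVM/ATC

print(decode_flat(LDS_APERTURE_BASE + 0x40))   # ('lds', 64)
print(decode_flat(0x1000))                     # ('global', 4096)
```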
 
Those possibilities you mentioned aren't likely covered by this one addressing bit, however. Let's say ESRAM, it is hard to imagine it not being virtualised behind the per-process virtual address space. For the LDS or scratch in the flat address space, they just need an aperture base pointer to remap from a full 64-bit flat address, and those addresses would not touch address translation at all (scratch memory would, but the address would be the resolved one). Encryption like Zen would be a bit in the physical address, not the virtual memory address.

Perhaps it is a sign of supporting Linux HMM in its ideal form — flipping ATC/GPUVM at page granularity to support flexible hot migration of pages to the GPU local memory.
Not by the one bit, but they could have influenced the ability to add more bits. While 512TB is a lot of space, that's not a whole lot for the exascale systems. Might be something they could change for customers actually needing more than 512TB on a GPU in a single pool. The HMM would definitely be a possibility along the same lines as that compute wave save/restore.
 