AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

I remember back in the G80 days, NVIDIA had an enormous amount of scheduling done in hardware compared to AMD's VLIW5, which relied heavily on the compiler; that remained the case from Tesla through Fermi. Kepler changed that a bit and relied on a mixture of compiler and hardware.
That goes to the point that there are multiple levels of scheduling, and different kinds.
That specific element is hardware that was tracking warp/wave register dependences at an instruction level for ALU operations.
That hardware was removed and compiler-driven stall counts were added instead.

The claim was that AMD was wasting hardware optimizing this. VLIW did not, and GCN does not. It either cannot physically have an instruction issue before the last operation is done (the 4-cycle cadence), or it requires the programmer/compiler to add a NOP or stall count.
I think there's more to question as to the implications of the first option.
GCN represents a relatively tidy and elegant balancing of a lot of different design parameters--but is it an ideal, or a local minimum where the costs of adjustment have become steeper (or AMD progressively unwilling to pay them)?
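To make the scoreboard-versus-stall-count distinction concrete, here is a toy Python model. The instruction format, register names and the 4-cycle latency are all invented for illustration; this is not a model of any real G80 or GCN pipeline.

```python
# Toy contrast between hardware dependence tracking (a scoreboard) and
# compiler-precomputed stall counts. All values are invented.

def schedule_with_scoreboard(instrs, latency):
    """Hardware tracks when each destination register becomes ready
    and stalls issue automatically."""
    ready = {}   # register -> cycle its value is available
    cycle = 0
    for dst, srcs in instrs:
        # issue waits until all source registers are ready
        cycle = max([cycle] + [ready.get(r, 0) for r in srcs])
        ready[dst] = cycle + latency
        cycle += 1
    return cycle  # total cycles to issue the whole program

def schedule_with_stall_counts(instrs, latency):
    """The compiler does the same bookkeeping ahead of time and bakes a
    stall count into each instruction; at run time the hardware only
    counts down, with no dependence-tracking logic needed."""
    annotated, ready, cycle = [], {}, 0
    for dst, srcs in instrs:
        need = max([cycle] + [ready.get(r, 0) for r in srcs])
        annotated.append((need - cycle, dst, srcs))  # stalls before issue
        cycle = need
        ready[dst] = cycle + latency
        cycle += 1
    return annotated

prog = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r0"])]
print(schedule_with_scoreboard(prog, latency=4))                        # 6
print([s for s, _, _ in schedule_with_stall_counts(prog, latency=4)])   # [0, 3, 0]
```

Both routes reach the same issue schedule; the difference is only where the bookkeeping lives, which is the hardware-cost trade-off being argued about.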
 
I remember back in the G80 days, NVIDIA had an enormous amount of scheduling done in hardware compared to AMD's VLIW5, which relied heavily on the compiler; that remained the case from Tesla through Fermi. Kepler changed that a bit and relied on a mixture of compiler and hardware.


AMD was never ahead in clock speeds from the moment TeraScale was created. NV had double-pumped ALUs which compensated for their lower ALU count, and they also relied more on hardware scheduling, while AMD's VLIW suffered both lower clocks and lower utilization.

A pointless nitpick; nobody cared about NVIDIA's double-pumped ALUs when it came to crowning the HD 4890 as the first card to reach 1 GHz. Besides, NVIDIA's ALU count was low enough that the doubling gained it little, compared to Kepler raising the count without double-pumped shaders. The problems AMD faces against NVIDIA's current crop are a lack of front-end performance and ROPs; neither would be an issue if the situations were reversed as they were earlier.

I don't know if it's straightforward to take clock speed out of it like that when judging performance per area. It comes down to who made the right call with regard to the market, workloads, and the realities of manufacturing.
GP102 is roughly the area of Vega, has roughly the same base/turbo range for standard versions, is faster, and draws the same or less power.
Vega is acting like a Polaris/Fiji that is twice the size--with a memory bus that should have freed up area and power for the main silicon to waste.

GP102 also has substantially more hardware, which wouldn't be a problem for GCN if it clocked higher. The power issue is again one card being in its comfort range: the 1080 Ti easily gets to 2 GHz, while Vega overclocked to merely 1682 MHz in PCPer's review. Maybe devs can play to AMD's strengths, but that isn't likely, and even if they did, AMD's designs don't scale: Hawaii does pretty well relative to its compute numbers, while Fiji falls flat.

The broader point is that neither AMD nor NVIDIA has the engineering chops to make architectural improvements so major that they can overcome the other with a ~50% clock deficit, VLIW, Tesla, Maxwell, Pascal and GCN differences notwithstanding.

Vega acts better than a dual Polaris in that it can at least touch clock speeds in regular use that Polaris can only reach once in a blue moon on a golden chip, but it's too late and not enough.
 
GP102 also has substantially more hardware, which wouldn't be a problem for GCN if it clocked higher. The power issue is again one card being in its comfort range: the 1080 Ti easily gets to 2 GHz, while Vega overclocked to merely 1682 MHz in PCPer's review.
I may have been wrong about the area comparison.
Per PCPerspective, Vega is ~564 mm². GP102 is significantly smaller for the performance it gets. It doesn't need to overclock to compete on a perf/mm² basis.
What extra hardware does it have over Vega?
 
I may have been wrong about the area comparison.
Per PCPerspective, Vega is ~564 mm². GP102 is significantly smaller for the performance it gets.

If the chip size was measured correctly and logic density is in line with Polaris, we are looking at a ~13.5B transistor chip.
 
Density for larger chips is usually higher, since smaller chips expend disproportionately more area on IO and memory PHY along their perimeters.
For Polaris versus Vega it should be even more pronounced, since Polaris is almost totally surrounded by GDDR5 interfaces, whereas Vega is almost Fury-sized while only having two (probably?) HBM2 interfaces.
 
It's still binning per the patent because the hardware is still exercised and the additional latency is still there.
The binning is ineffective. It's also potentially injecting a specific order where ordering isn't required.
In that case it's just a longer pipeline, which is effectively not binning. Geometry is not collected, accumulated and then processed in batches, but triangle after triangle. That is how I would define it as not binning: binning implies at least some accumulation within a bin. Otherwise it's sequential processing of triangles as known from traditional IMRs.
Even if the binning were in play, skipping to the end of a shader that is supposed to keep a global count incremented by every pixel doesn't seem valid.
I still don't get what you are referring to. Vega processes geometry sequentially, not binned into tiles, in that test. That is obvious; otherwise it would complete some tiles while others had not even started.
Depending on how the hardware works, all the hardware locally evaluating a tile may know is that bin 1 has some unspecified serial dependence on bin 0. The binning process is at least locally sequential. It creates a first bin, then iteratively sends primitives that intersect with other tiles to those tiles for binning.
But the bins/tiles should be independent of each other. And as the access order to an unordered access view in a pixel shader (what the triangle bin test uses for pixel counting) isn't guaranteed at all that means there are no actual dependencies between tiles (only false ones maybe).
This first bin is reducing everything it is doing to one triangle, and the hardware is flagged as needing to pause any issue for the next tile until this tile is done.
Again, if it is processing triangles one by one, that is basically equivalent to saying binning is disabled. And by the way, a binning rasterizer should process the geometry tile by tile anyway (with a small number of tiles potentially overlapping in execution). But it should process all the geometry for a given tile instead of going triangle after triangle (across all the bins each one covers).
If the hardware/driver are conservative, they may say a screen tile cannot issue primitives in bin N if N-1 is still in progress and there was a dependence flagged for the initial bin.
A bin is a screen tile. One bins into screen tiles. I'm afraid, that somehow we are not using the same terminology.
And bins are supposed to be processed somewhat sequentially (save for some overlap if cache size and shader resources allow it). Even in absence of any potential dependencies.
In this case there's just one primitive, and all tiles have the exact same sequence of bins due to identical coverage. The patent also indicates something like double-buffering the binning process, so it may also only queue up a few bins before stalling.
Again, for me the screen tiles are the bins the geometry is binned to. Therefore, the first sentence doesn't make too much sense to me.
And as said, bins are supposed to be processed more or less sequentially with a binning rasterizer. That's the whole difference: process bins/screen tiles (with accumulated geometry) sequentially instead of processing the geometry directly in a sequential manner.
If binning is inactive (for whatever reason, maybe because of a false dependency), the rasterizer(s) work effectively like in an IMR. Primitives are processed sequentially (pixel waves are started over all screen tiles the triangle covers before the next triangle is rastered [as mentioned before, some small overlap between different triangles in different screen tiles is possible]).
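To pin down the ordering distinction being argued here, a toy model with made-up tiles and coverage lists (not actual hardware behavior): an IMR walks each triangle across every tile it covers before the next triangle starts, while a binner accumulates the geometry first and then retires each tile with everything that touches it.

```python
# Toy ordering model: which (tile, triangle) pairs get rasterized in
# what order under an immediate-mode rasterizer vs a binning one.
# Tiles are 1-D and coverage is invented, purely for illustration.

triangles = {          # triangle -> screen tiles it covers
    "A": [0, 1],
    "B": [1, 2],
    "C": [0, 2],
}

def immediate_mode(tris):
    """IMR: each triangle is rastered across all tiles it covers
    before the next triangle starts."""
    return [(tile, t) for t, tiles in tris.items() for tile in tiles]

def binned(tris, num_tiles=3):
    """Binning: geometry is accumulated into per-tile bins first, then
    each tile is processed to completion with all triangles in its bin."""
    bins = {i: [] for i in range(num_tiles)}
    for t, tiles in tris.items():
        for tile in tiles:
            bins[tile].append(t)
    return [(tile, t) for tile in sorted(bins) for t in bins[tile]]

print(immediate_mode(triangles))  # triangle-major order
print(binned(triangles))          # tile-major order
```

In the immediate-mode order, tile 0 is revisited late for triangle C; in the binned order, tile 0 finishes everything it will ever draw before tile 1 starts, which is the tile-completion behavior the post says the test should have revealed.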
 
In that case it's just a longer pipeline, which is effectively not binning. Geometry is not collected, accumulated and then processed in batches, but triangle after triangle. That is how I would define it as not binning: binning implies at least some accumulation within a bin.
I may have reversed the terms used by AMD, it's one primitive per batch. A batch gets put into various bins that its primitive intersects with.

I still don't get what you are referring to. Vega processes geometry sequentially, not binned into tiles, in that test. That is obvious; otherwise it would complete some tiles while others had not even started.
My question is what would happen if it were allowed to batch and then cull hidden surfaces before their pixels can increment the counter.

But the bins/tiles should be independent of each other. And as the access order to an unordered access view in a pixel shader (what the triangle bin test uses for pixel counting) isn't guaranteed at all that means there are no actual dependencies between tiles (only false ones maybe).
The scenario I'm discussing is a conservative solution where the hardware/driver is not intelligent enough to know whether it can avoid a dependence. The driver might just see that the pixels are writing to a common location.

And by the way, the binning rasterizer should process the geometry tile by tile anyway (with a small numbers of tiles potentially overlapping execution). But it should process all geometry for a given tile instead of one triangle after triangle (for all bins it covers).
...
A bin is a screen tile. One bins into screen tiles. I'm afraid, that somehow we are not using the same terminology.
AMD's batching method injects the serial component. I should have written that it tiles the elements in the batch across the bins, except in this case there is one triangle. Bins may have some freedom, but the system won't give them more primitives to work with until the current batch is done.
 
AMD's batching method injects the serial component. I should have written that it tiles the elements in the batch across the bins, except in this case there is one triangle. Bins may have some freedom, but the system won't give them more primitives to work with until the current batch is done.
Isn't a batch a complete draw call (as long as it fits the caches)?
 
Isn't a batch a complete draw call (as long as it fits the caches)?
A batch is built up in the pipeline until some condition is met. It could be storage limits, a total number of triangles evaluated is hit, or a total number of contributing triangles is hit, or some other case such as:


Sequential primitives are captured until a predetermined condition is met, such as batch full condition, state storage full condition, or a dependency on previously rendered primitives is determined, according to an embodiment.

The shader is updating a counter per-pixel and each pixel is reading from it.
This is all theorizing about why the triangle test might not capture Vega's tiling in the same manner as Pascal's or Maxwell's.
It could very well be deactivated, but it might have a hard time showing itself since everything perfectly overlaps and reads the same data.
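For reference, a stripped-down sketch of the counting trick the test relies on; the pixel orders and the limit are invented. Every shaded pixel bumps a shared counter, and only pixels shaded while the counter is under the limit stay visible, so the visible set encodes the hardware's shading order.

```python
# Toy version of the triangle-bin visualizer's counting mechanism.
# The orders and limit are made up; a real test uses a UAV atomic.

def run_counter_test(pixel_order, limit):
    """Shade pixels in the given hardware order; each shaded pixel
    increments a shared counter, and only pixels shaded while the
    counter is at or below `limit` end up visible."""
    counter = 0
    visible = []
    for px in pixel_order:
        counter += 1            # stand-in for the UAV atomic increment
        if counter <= limit:
            visible.append(px)
    return visible

# scanline-ish (IMR-like) order vs tile-first order over a 2x2 "screen"
seq   = [(0, 0), (0, 1), (1, 0), (1, 1)]
tiled = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(run_counter_test(seq, limit=2))    # first two pixels in scan order
print(run_counter_test(tiled, limit=2))  # first tile's pixels instead
```

The catch raised above: if batched hardware culled pixels before they ever ran the shader, those pixels would never increment the counter at all, and the visible pattern would no longer be a faithful record of rasterization order.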
 
A batch is built up in the pipeline until some condition is met. It could be storage limits, a total number of triangles evaluated is hit, or a total number of contributing triangles is hit, or some other case such as:




The shader is updating a counter per-pixel and each pixel is reading from it.
This is all theorizing about why the triangle test might not capture Vega's tiling in the same manner as Pascal's or Maxwell's.
It could very well be deactivated, but it might have a hard time showing itself since everything perfectly overlaps and reads the same data.

But is the behavior of the Vega rasterizer exactly the same as Fiji's? It also really is serial. Maybe Vega's behavior is different, but it should not be serial.

Also, there is a guy at ComputerBase who has a Vega, and in GPU-Z the tile-based rasterizer shows as off:
https://www.computerbase.de/forum/showthread.php?t=1692170&page=21&p=20221597#post20221597
 
A batch is built up in the pipeline until some condition is met. It could be storage limits, a total number of triangles evaluated is hit, or a total number of contributing triangles is hit, or some other case such as:
Sequential primitives are captured until a predetermined condition is met, such as batch full condition, state storage full condition, or a dependency on previously rendered primitives is determined, according to an embodiment.
So in the case of DX12 and Vulkan, it could potentially bin even across draw calls if there is no barrier in between?
:runaway:
 
But is the behavior of the Vega rasterizer exactly the same as Fiji's? It also really is serial. Maybe Vega's behavior is different, but it should not be serial.
Unless the rasterizer's fallback is to act like Fiji? This is just speculation about how it might behave on a very specific test.

So in the case of DX12 and Vulkan, it could potentially bin even across draw calls if there is no barrier in between?
:runaway:
Unless that's another one of the system's batch closure conditions. The patent is pretty agnostic on how primitives are sourced, although a batch can stay open until a "last" primitive is encountered. Not sure what that means.

Temporally related primitives are segmented into a batch until a predetermined threshold is met. For example, sequential primitives may be captured into a primitive batch until a predetermined condition is met, such as batch full threshold, state storage full threshold, a primitive dependency threshold, or if the incoming primitive is identified as a last primitive.

Re-reference: https://www.google.com/patents/US20140292756
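A hedged sketch of the batch-closure logic the patent excerpt describes. The thresholds, the primitive fields, and the decision to flush before a dependent primitive are all invented here; a real implementation would differ.

```python
# Toy batcher: sequential primitives accumulate until a closure
# condition fires (batch full, state storage full, dependency on a
# previous primitive, or a "last" primitive). Thresholds are made up.

def batch_primitives(stream, batch_full=4, state_full=8):
    batch, state_cost, batches = [], 0, []
    for prim in stream:
        closes = (
            len(batch) >= batch_full                    # batch full
            or state_cost + prim["state"] > state_full  # state storage full
            or prim.get("depends_on_prev")              # dependency found
        )
        if closes and batch:
            batches.append(batch)         # flush before the new primitive
            batch, state_cost = [], 0
        batch.append(prim)
        state_cost += prim["state"]
        if prim.get("last"):              # a "last" primitive also closes
            batches.append(batch)
            batch, state_cost = [], 0
    if batch:
        batches.append(batch)
    return batches

stream = [{"state": 2}, {"state": 2},
          {"state": 2, "depends_on_prev": True},
          {"state": 2, "last": True}]
print([len(b) for b in batch_primitives(stream)])  # [2, 2]
```

The relevant consequence for the thread: if a dependency (real or falsely detected) closes the batch on every primitive, each batch holds one triangle and the pipeline degenerates to the sequential behavior the test observed.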
 
Unless the rasterizer's fallback is to act like Fiji? This is just speculation about how it might behave on a very specific test.

But this test was written to test exactly this behavior, so why should it behave differently in other programs? The tile-based rasterizer should be the first choice, and if it is not working, it should fall back to the immediate-mode rasterizer.
 
The fallback can be implemented differently, I guess: either two rasterizer methods, or one with a "do not sort" option that traverses the binning buffer nevertheless. And of course we could be looking at simply misdetected application behavior.

It seems a likely explanation.
 
But this test was written to test exactly this behavior, so why should it behave differently in other programs? The tile-based rasterizer should be the first choice, and if it is not working, it should fall back to the immediate-mode rasterizer.

As far as we know, the tiling method used by Maxwell and Pascal is not the same as what AMD has promised with Vega.
Rasterization is done over tiles that stay on-chip, but hidden surface removal or culling pixels before they are shaded is not part of that.

That is why multiple triangles show up as in-progress even though they would all be at different depths. It doesn't matter if they issue out of order as long as, at the very end, their pixel outputs are correctly stored or rejected based on depth. That means the counter the test uses to control the number of pixels rendered still increases even if the screen doesn't show the pixels.
This may mean this test isn't going to capture Vega's behavior the same way.

Vega's rasterizer is supposed to collect triangles, and prevent pixels that are blocked by a closer primitive from being shaded.
If Vega's new mode were on, what would this test actually do? The pixel counter is increased when a pixel is shaded, but if they are batched almost all of them should be stopped ahead of time.
If the per-pixel counter is considered a condition for reducing the number of triangles in a batch to 1, one possible outcome is that the GPU cannot move on to the next triangle until the current batch is done. That starts looking sequential even if the rasterizer is fully enabled.
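A toy model of that point, with made-up coverage data: if the shader runs for every covered pixel and depth is resolved afterwards, the counter counts everything; if the batch is culled first, occluded pixels never touch it.

```python
# Toy contrast between shade-then-depth-test (what the test assumes)
# and batch-then-cull (what Vega's DSBR promises). Depths and coverage
# are invented; "screen" is just a list of pixel ids per primitive.

def shade_first(prims):
    """Shade every covered pixel, resolve depth afterwards: the
    counter counts all pixels, visible or not."""
    return sum(len(cov) for _depth, cov in prims)

def cull_first(prims):
    """Batch the primitives, keep only the nearest one per pixel,
    shade the survivors: occluded pixels never touch the counter."""
    nearest = {}
    for depth, cov in prims:
        for px in cov:
            if px not in nearest or depth < nearest[px]:
                nearest[px] = depth
    return len(nearest)

# three full-screen layers at different depths over a 4-pixel screen
screen = [(0.1, [0, 1, 2, 3]), (0.5, [0, 1, 2, 3]), (0.9, [0, 1, 2, 3])]
print(shade_first(screen))  # 12: every layer increments the counter
print(cull_first(screen))   # 4: only the visible surface does
```

Which is exactly why a counter-driven visualizer could look misleadingly "sequential" or simply under-count if hidden-surface removal inside a batch were active.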
 
Also, I think that because David Kanter made a pure rasterizer-behavior test, he prevented the test from using the fallback solution and kept the tile-based rasterizer activated.

But in the test you don't see any tiles. If you have tiles, you don't get the behavior of drawing triangle by triangle; you would see parts of a few triangles in one tile, then the next tile would come. Did you watch David Kanter's video?
http://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/
 
Also, I think that because David Kanter made a pure rasterizer-behavior test, he prevented the test from using the fallback solution and kept the tile-based rasterizer activated.

But in the test you don't see any tiles. If you have tiles, you don't get the behavior of drawing triangle by triangle; you would see parts of a few triangles in one tile, then the next tile would come. Did you watch David Kanter's video?
http://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/
The test is almost a year old. It was meant to tease out the behavior of the Nvidia GPUs that existed at the time.
The tester written by nlguillemot wasn't meant to help or hinder a GPU and rasterization method that didn't exist at the time, and it depends on a specific behavior that may not hold true for other methods.

I do not follow what the last section meant.
 
It doesn't matter how old this test is. It uses DirectX as the interface, so the code has no influence on which type of rasterizer is used; the driver and DirectX select the rasterizer type.

http://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/

Using simple DirectX shaders, we demonstrate the tile-based rasterization in Nvidia’s Maxwell and Pascal GPUs and contrast this behavior to the immediate-mode rasterizer used by AMD.

Also, David Kanter didn't know how the Maxwell rasterizer behaved, so he wrote universal code to find out how the rasterizer behaves.
 