AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Do we expect GV104 to have tensor cores, part of Volta feature set? :)
No, but I would expect it to have their tiling rasterizer. I do not expect R9 390X to have half-rate DP either.

edit: And I do not expect GV104 to have a price premium on Tesla V100, as is probably the case with Vega FE vs. RX Vega.
 
no-X said:
Do we expect GV104 to have tensor cores, part of Volta feature set? :)
Taking JHH's words at GTC at face value, I do.... but this is another can of worms that has nothing to do with Vega, best to stay on topic.
 
Cache hitrate is a valid point, yes. But on a single-colored full screen quad as many fillrate tests use (or 2x2 textures and the like), that would not be an issue at all, unlike in more real-world applications.
It depends on how the fillrate test is designed. I've seen tests that render multiple full screen quads per frame (depending on what exactly you want to measure, either with the z test disabled or drawn back to front) and can have significant overdraw. This is done to get rid of the overhead associated with a new frame (at >40,000 fps @1080p for a GPU with 90 GPixel/s fillrate [if you draw only one fullscreen quad per frame] that overhead can be significant [you can also cut down on it by not clearing the framebuffer for a new frame]) and to approach the theoretical maximum. It works pretty well for an IMR (that approach can give too high numbers for a TBDR without enabling blending, as you basically measure the performance of the HSR). It also means the bandwidth savings with binning are potentially significant in that case.
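To put a number on that overhead point, here is my own back-of-the-envelope sketch (not from the post; the 90 GPixel/s peak and the 10 µs per-frame overhead are assumptions for illustration only). It shows why one fullscreen quad per frame understates the fillrate and why stacking many quads per frame approaches the theoretical maximum.

Code:
/* sketch: how per-frame overhead distorts a fillrate measurement */
#include <stdio.h>

int main(void)
{
    const double peak_fill      = 90e9;            /* pixels/s (assumed)     */
    const double pixels_1080p   = 1920.0 * 1080.0;
    const double frame_overhead = 10e-6;           /* s per frame (assumed)  */

    printf("theoretical fps at 1080p, 1 quad/frame: %.0f\n",
           peak_fill / pixels_1080p);              /* roughly 43,000 fps     */

    /* measured fillrate = pixels drawn / (pure fill time + per-frame overhead) */
    for (int quads = 1; quads <= 256; quads *= 4) {
        double pixels   = quads * pixels_1080p;
        double fill_t   = pixels / peak_fill;
        double measured = pixels / (fill_t + frame_overhead);
        printf("%3d quads/frame -> %.1f GPixel/s measured (%.1f%% of peak)\n",
               quads, measured / 1e9, 100.0 * measured / peak_fill);
    }
    return 0;
}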
If you don't want to implement both options in hardware though, you could have a kind of passthrough mode through your TBR, basically a compatibility mode (for whatever reasons) where you do not bin and do not tile.
I would expect that solution. Nobody (not even nV) is going to implement two separate rasterizers in hardware. All you need to do is skip the binning and just pass the primitives through and rasterize them immediately (as an IMR does). And as I said before, any binning rasterizer basically needs this capability as a fallback anyway, afaiu.
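Purely as an illustration of that passthrough/fallback idea, here is a tiny software model (everything in it is hypothetical; it is not AMD's or anybody's actual hardware or driver logic, and it ignores the per-screen-tile nature of real binning): with binning disabled a primitive goes straight to the rasterizer, and a batch size of one degrades to the same immediate-mode ordering.

Code:
/* hypothetical sketch of a binning front end with an immediate-mode fallback */
#include <stdio.h>
#include <stdbool.h>

#define BATCH_CAPACITY 16               /* assumed finite on-chip batch storage */

typedef struct { int id; } prim_t;

static prim_t batch[BATCH_CAPACITY];
static int    batch_count = 0;

static void rasterize(const prim_t *p)  /* stand-in for the raster back end */
{
    printf("raster prim %d\n", p->id);
}

static void flush_batch(void)           /* a closed batch/bin gets processed */
{
    for (int i = 0; i < batch_count; ++i)
        rasterize(&batch[i]);
    batch_count = 0;
}

/* batch_size <= BATCH_CAPACITY is assumed; batch_size == 1 or
 * binning_enabled == false both give classic immediate-mode ordering. */
static void submit(prim_t p, bool binning_enabled, int batch_size)
{
    if (!binning_enabled) {             /* passthrough / "legacy" mode */
        rasterize(&p);
        return;
    }
    batch[batch_count++] = p;
    if (batch_count >= batch_size)
        flush_batch();
}

int main(void)
{
    for (int i = 0; i < 4; ++i)
        submit((prim_t){ i }, true, 1); /* fail-safe: one primitive per batch */
    flush_batch();                      /* flush whatever is left             */
    return 0;
}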
 
But this makes no sense. The draw-stream binning rasterizer always has an advantage over a common rasterizer. Why avoid it?
The sort of localized deferred binning that it is attempting is limited by the finite capacity of the binning hardware, and by certain operations that make it unknown at the time of binning what should be culled, that may unpredictably or incorrectly affect something outside of the tile, or that may compromise the intended function of the code.

That means bins can overflow or stop prematurely, and that there may be a dependence or culling relationship between bins, or transparency. Given where it plugs in, the binning hardware's link to the rest of the GPU may constrain their ordering. Once binning starts, the data in the bin may have dropped or compressed information that would make it difficult to do to bins what was done to the geometry in them, and it may constrain the overall geometry pipeline.

At a minimum, it makes this front end of the pipeline have longer latency, which, if it degrades to the fail-safe method, means it's a strict immediate mode renderer with a longer-latency setup pipeline.


The geometry appears not to be binned at all; it is processed serially rather than binned into tiles.
The fail-safe position from AMD's patent that seems most applicable is that binning doesn't stop. The hardware instead drops to 1 triangle per bin.
The way the triangle tiling test works depends on each pixel rendered incrementing a counter, allowing those whose count is higher than the threshold to be killed. If the deferred binning method were in play, why wouldn't it just show the very last triangle all the time if all other occluded pixels were never permitted to run the increment?

Although it has to be said the "shade once" part (i.e., HSR) cannot work with the UAV counter used in this code (as that's certainly a side effect which would change the rendered result). But that should work independently of the actual binning part, I'd hope...

Per this https://www.google.com/patents/US20140292756:
The use of deferred primitive batch binning in an immediate mode rendering system is opportunistically employed and degrades to an immediate mode rendering system with small delay when a primitive batch size is based on one or more conditions forcing receipt of only one primitive per batch.

There may be constraints on how serialized execution gets because there is binning and culling going on. As a first-gen product there could also be more conservative fallback cases, assuming it's working correctly.
 
I came across this post on Reddit and thought, "this makes the most sense of any analysis I've seen re: AMD's design decisions over the last several years."
...
Comments?

I find some items believable, and have questions about some of the points.
The work stealing and scheduling optimization in hardware vs software seems like it might be conflating a number of things.
Nvidia still puts a lot of instruction issue and kernel scheduling work on hardware. Volta's thread grouping and handling of divergence actually expands this. Items like that are general parallel computation problems that AMD ***(edit: incomplete thought, would have added that AMD wouldn't be alone in this)
The specific points where it doesn't are more about optimization of the in-kernel stream. However, I would say that generally AMD isn't devoting scads of hardware to optimize bad instruction streams--it just doesn't optimize.

Technically, AMD does have a gaming-dedicated set of GPUs. It's announced at least two bespoke GPU architectures in the last two years, one with leading features ahead of Vega and one that will be larger than Polaris and using GDDR5. I wouldn't blame the PS4 Pro and Xbox One X on HBM, although perhaps the diversion of resources into two variants of several-generations-old architectures should be examined.

Elements such as the underinvestment and lack of due diligence throughout its IP range seem plausible. My personal addition to the over-generalized architecture point, which I feel is broadly true, is that some of that isn't so much out of commitment to a jack-of-all-trades philosophy as a decision not to improve any existing mastery, out of an economy of effort and/or not valuing it.
 
Exactly. Vega - as per the techreport quote earlier - seems to be able to switch back and forth between both rasterizers, thus showing no trace of binning if the traditional option is used (or - at your discretion - the DSBR is not enabled in the driver). If you don't want to implement both options in hardware though, you could have a kind of passthrough mode through your TBR, basically a compatibility mode (for whatever reasons) where you do not bin and do not tile.
#define C_028C44_BINNING_MODE 0xFFFFFFFC
#define V_028C44_BINNING_ALLOWED 0
#define V_028C44_FORCE_BINNING_ON 1
#define V_028C44_DISABLE_BINNING_USE_NEW_SC 2
#define V_028C44_DISABLE_BINNING_USE_LEGACY_SC 3
https://cgit.freedesktop.org/mesa/mesa/tree/src/amd/common/gfx9d.h
Looks like they have a few options.
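For what it's worth, the C_* define is the "clear" mask in mesa's register headers, so the mode occupies the low two bits of that register (PA_SC_BINNER_CNTL_0, if I remember the GFX9 offsets correctly). A quick sketch of how the field would be packed and read; the S_/G_ helpers below follow mesa's usual naming pattern but are written from memory, not quoted from gfx9d.h:

Code:
#include <stdio.h>

#define C_028C44_BINNING_MODE                  0xFFFFFFFC
#define V_028C44_BINNING_ALLOWED               0
#define V_028C44_FORCE_BINNING_ON              1
#define V_028C44_DISABLE_BINNING_USE_NEW_SC    2
#define V_028C44_DISABLE_BINNING_USE_LEGACY_SC 3

/* pack/extract helpers in the usual S_/G_ style (assumed, not quoted) */
#define S_028C44_BINNING_MODE(x) (((unsigned)(x) & 0x3) << 0)
#define G_028C44_BINNING_MODE(x) (((x) >> 0) & 0x3)

int main(void)
{
    unsigned reg = 0;

    /* clear the field, then select the legacy (non-binning) scan converter */
    reg = (reg & C_028C44_BINNING_MODE) |
          S_028C44_BINNING_MODE(V_028C44_DISABLE_BINNING_USE_LEGACY_SC);

    printf("binning mode = %u\n", G_028C44_BINNING_MODE(reg)); /* prints 3 */
    return 0;
}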
 
The tile-based rasterizer has been said to be the secret sauce for Nvidia's power efficiency, but I think the answer is less arcane. Nvidia has such a large lead in clockspeeds that AMD can't overcome it despite being ahead in perf/mm2 (normalized for clockspeed). So AMD has to turn up the voltages, and the result is a power-hungry chip that just nips at the heels of the corresponding Nvidia card. AMD had a similar lead in TFLOPs during the VLIW era that didn't translate into gaming performance, but they were ahead on clockspeeds (the 48xx series vs. the GTX 2xx series being the best example), and that made up for it. Polaris with clockspeeds north of 2GHz would've been right next to the 1080; the 48xx was on a similar chip size, with the >300mm2 DX11 chip coming in the 58xx series a year later. Nothing new under the sun.

Anyway, the latest TPU review puts the Fury X right next to a 1070, close enough that I think if Fury had 8GB it would have made up the difference. A straight-up Fury shrink clocked to 1600MHz would have ended up comfortably ahead of the 1080, but the Vega FE results have been pretty disappointing even if you account for the throttling and excuse the power draw. Even the LN2 1400MHz Fury chip got ahead of the 1080 in 3DMark. :no:
 
The work stealing and scheduling optimization in hardware vs software seems like it might be conflating a number of things.
Nvidia still puts a lot of instruction issue and kernel scheduling work on hardware. Volta's thread grouping and handling of divergence actually expands this. Items like that are general parallel computation problems that AMD
I remember back in the G80 days, NVIDIA had an enormous amount of scheduling done in hardware compared to AMD's VLIW5, which relied heavily on the compiler; that remained the case from Tesla through Fermi. Kepler changed that a bit and relied on a mixture of compiler and hardware.

but they were ahead on clockspeeds (the 48xx series vs. the GTX 2xx series being the best example),
AMD was never ahead in clock speeds ever since TeraScale was created. NV had double-pumped ALUs (the hot clock), which compensated for their lower ALU count, and they also relied more on hardware scheduling, while AMD's VLIW suffered from both lower clocks and lower utilization rates.
 
The tile-based rasterizer has been said to be the secret sauce for Nvidia's power efficiency, but I think the answer is less arcane. Nvidia has such a large lead in clockspeeds that AMD can't overcome it despite being ahead in perf/mm2 (normalized for clockspeed).
I don't know if it's straightforward to factor out clockspeed like that when it comes to judging performance per area. It comes down to who made the right call with regard to the market, workloads, and the realities of manufacturing.
GP102 is roughly the area of Vega, has roughly the same base/turbo range for standard versions, is faster, and draws the same or less power.
Vega is acting like a Polaris/Fiji that is twice the size--with a memory bus that should have freed up area and power for the main silicon to waste.
 
The fail-safe position from AMD's patent that seems most applicable is that binning doesn't stop. The hardware instead drops to 1 triangle per bin.
Which basically means it is not binning anymore but processing the triangles serially (save for some overlap between different triangles falling into different screen tiles assigned to different parallel raster engines). You wrote yourself that it is basically reverting to the classical IMR behaviour.
The way the triangle tiling test works depends on each pixel rendered incrementing a counter, allowing those whose count is higher than the threshold to be killed. If the deferred binning method were in play, why wouldn't it just show the very last triangle all the time if all other occluded pixels were never permitted to run the increment?
I don't get it. My conclusion was that the observed behaviour is a hint that the binning is not in play for Vega right now. That's why it doesn't show just the last triangle color.
If you ask about Maxwell/Pascal: well, if you don't do sorting of geometry / hidden surface removal (either because the circumstances don't allow it [transparencies, side effects or whatever] or your hardware is incapable of doing so), you still do all the work and just save bandwidth from the increased cache hitrate. That means you can still catch pixels after any number of drawn triangles. But it still processes tile by tile (you have completed tiles with all triangles drawn and tiles with no triangles drawn on screen [in the framebuffer] at the same time), so binning is still working.
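To make the difference concrete, here is a toy software model of the counter trick (my own sketch, not the shader the actual tiling tests use): every fragment bumps a global counter and only the first THRESHOLD fragments are allowed to write. Replaying identical fullscreen "triangles" in primitive order versus tile order produces visibly different survivors, which is exactly what those tests visualize.

Code:
/* toy model of the counter-based tiling test (illustrative only) */
#include <stdio.h>
#include <string.h>

#define W 8
#define H 8
#define TILE 4
#define TRIS 4                    /* fullscreen "triangles", colours 0..3  */
#define THRESHOLD (W * H)         /* only the first 64 fragments may write */

static int fb[H][W];
static int counter;

static void shade(int x, int y, int colour)
{
    if (++counter <= THRESHOLD)   /* fragments past the threshold are killed */
        fb[y][x] = colour;
}

static void draw_immediate(void)  /* one primitive at a time, whole screen */
{
    for (int t = 0; t < TRIS; ++t)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                shade(x, y, t);
}

static void draw_tiled(void)      /* all primitives per tile, tile by tile */
{
    for (int ty = 0; ty < H; ty += TILE)
        for (int tx = 0; tx < W; tx += TILE)
            for (int t = 0; t < TRIS; ++t)
                for (int y = ty; y < ty + TILE; ++y)
                    for (int x = tx; x < tx + TILE; ++x)
                        shade(x, y, t);
}

static void show(const char *name, void (*draw)(void))
{
    counter = 0;
    memset(fb, -1, sizeof fb);    /* -1 = no surviving fragment here */
    draw();
    printf("%s:\n", name);
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x)
            printf("%3d", fb[y][x]);
        printf("\n");
    }
}

int main(void)
{
    show("immediate order (only triangle 0 survives, everywhere)", draw_immediate);
    show("tile order (first tile shows the last triangle, rest stay empty)", draw_tiled);
    return 0;
}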
 
Looking at old AMD material, I am starting to wonder if TBR might be connected to the use of primitive shaders in some way.
 
Which basically means it is not binning anymore but processing the triangles serially (save for some overlap between different triangles falling into different screen tiles assigned to different parallel raster engines).
It's still binning per the patent because the hardware is still exercised and the additional latency is still there.
The binning is ineffective. It's also potentially injecting a specific order where ordering isn't required.

I don't get it. My conclusion was that the observed behaviour is a hint that the binning is not in play for Vega right now. That's why it doesn't show just the last triangle color.
Even if the binning were in play, it doesn't seem like skipping to the end of a shader that is supposed to be keeping a global count incremented by every pixel is valid.

If you ask about Maxwell/Pascal: well, if you don't do sorting of geometry / hidden surface removal (either because the circumstances don't allow it [transparencies, side effects or whatever] or your hardware is incapable of doing so), you still do all the work and just save bandwidth from the increased cache hitrate. That means you can still catch pixels after any number of drawn triangles. But it still processes tile by tile (you have completed tiles with all triangles drawn and tiles with no triangles drawn on screen [in the framebuffer] at the same time), so binning is still working.

Depending on how the hardware works, all the hardware locally evaluating a tile may know is that bin 1 has some unspecified serial dependence on bin 0. The binning process is at least locally sequential. It creates a first bin, then iteratively sends primitives that intersect with other tiles to those tiles for binning.

This first bin is reducing everything it is doing to one triangle, and the hardware is flagged as needing to pause any issue for the next tile until this tile is done.
That information is passed to every other tile, which will mark this in-sequence behavior and the dependence on primitive 0--or it's because each triangle in the test has the exact same starting coordinates and bin coverage.
If the hardware/driver are conservative, they may say a screen tile cannot issue primitives in bin N if N-1 is still in progress and there was a dependence flagged for the initial bin.
In this case there's just one primitive, and all tiles have the exact same sequence of bins due to identical coverage. The patent also indicates something like double-buffering the binning process, so it may also only queue up a few bins before stalling.

The process could be more parallel if there weren't a dependence, and probably if there were more than one initiating tile.
 
Comments?
There might be some historical truth there... say in the AMD of 18 months to 2+ years ago. However, I don't get the impression that is an accurate representation of the AMD of today. Of course, Vega did not just manifest itself out of thin air a week ago.

My impression is that after Maxwell, attitudes regarding energy efficiency started to shift, but the full gravity of the situation probably hadn't struck them. I suspect they genuinely expected to gain a little ground with Polaris... not lose more, and that Pascal was a wake up call that they actually took to heart. I do believe that AMD "gets it" now, but getting it and translating that into a product already in the pipeline while being stuck at 16/14nm when your competitor already has existing products in the market aren't the best conditions for success.

While the early benchmarks show a practically flat efficiency curve, tiling is apparently not being used. Assuming that gets rectified, I would anticipate a 10-15% efficiency gain in general usage (perhaps even more in certain pathological situations). And that would be progress. Now, if I did not sincerely believe that they "get it" I would be less forgiving. And by the time products hit with the next node shrink (next year), they damn well better have made significant progress.

As for execution and ecosystems, yeah, they need to settle on something and stick with it. With the launches, I think one easy improvement they could make is to understand that paper launching architectures can be quite beneficial (both for them and their customers). You don't have to reveal everything, but get some solid details out there (mostly for devs but maybe even consumers). Plant the seed so that people start thinking about how they might want to use the new features, even before any specific product is announced. That way, when a product becomes available, people already have in mind things they want to try and do with the hardware. Inspiration happens randomly and sporadically; give it some time to grow. That way you foster demand for your products even before they are launched. Paper launch of specific products - bad.... paper launch of future architectures - good.
 
Code:
#define     V_008F14_IMG_DATA_FORMAT_16_AS_16_16_16_16_GFX9         0x2B
#define     V_008F14_IMG_DATA_FORMAT_16_AS_32_32_32_32_GFX9         0x2C
...
#define     V_008F14_IMG_DATA_FORMAT_32_AS_32_32_32_32              0x3F
https://cgit.freedesktop.org/mesa/mesa/tree/src/amd/common/gfx9d.h#n1332
Anyone know what these are exactly? Context is all the usual color and compressed texture formats.

/* GFX9 has the ESGS ring in LDS. */
/* GFX9 has only 4KB of CE, while previous chips had 32KB. In order
* to make CE RAM as useful as possible, this defines limits
* for the number slots that can be in CE RAM on GFX9. If a shader
* is using more, descriptors will be uploaded to memory directly and
* CE won't be used.
*
* These numbers are based on shader-db.
*/
https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/radeonsi/si_descriptors.c#n2717

Other tidbits I came across.
 
I sadly don't have a link (as the guy who posted this to our forum at io-tech.fi didn't provide one), but this is supposedly from Reddit:
The reason it's not doing tiled rasterization is that it has not been "turned on" yet. Vega is using the fallback rendering method instead of the tiled one.

When it was first discovered that Maxwell used tile-based rendering, there was talk about a lot of software that needed to be written or rewritten in order to utilize it correctly, and Nvidia implemented that in their drivers.

Vega is using a predominantly Fiji driver and this feature has not been "turned on". Actually, all but one of the new features in Vega are not functional right now, the exception being the pixel engine being connected to the L2 cache, as that is hardwired. I tore apart the new drivers in IDA today and the code paths between Fiji and Vega are very close and only differ slightly.

This arch is a massive change from anything they have released with GCN. They built fallbacks into the hardware because of the massive changes. It's a protection against poorly written games and helps AMD have a starting point for driver development. Hell, even architecturally Vega is essentially Fiji at its most basic; that's why it is performing exactly like it, because none of its new features are enabled or have a driver published for them yet. It is performing like a Fury X at 1400 MHz because that is exactly how every computer is treating it.
 