Do we expect GV104 to have tensor cores, part of Volta feature set?

LOL. It is based on Vega, but not on Vega's feature set. That would be a new dimension. Thinking GF4MX.
Do we expect GV104 to have tensor cores, part of Volta feature set?

No, but I would expect it to have their tiling rasterizer. I do not expect R9 390X to have half-rate DP either.
no-X said:
Do we expect GV104 to have tensor cores, part of Volta feature set?

Taking JHH's words at GTC at face value, I do... but this is another can of worms that has nothing to do with Vega, best to stay on topic.
Cache hitrate is a valid point, yes. But on a single-colored full screen quad as many fillrate tests use (or 2x2 textures and the like), that would not be an issue at all, contrary to more real-world applications.

It depends on how the fillrate test is designed. I've seen tests which render multiple full-screen quads per frame (depending on what you want to measure exactly, either with z test disabled or from back to front) and can have significant overdraw. This is done to get rid of the overhead associated with a new frame (at >40,000 fps @ 1080p for a GPU with 90 GPixel/s fillrate [if you draw only one fullscreen quad per frame], that overhead used to be significant [you can also cut down on it by not clearing the framebuffer for each new frame]) and to approach the theoretical maximum. It works pretty well for an IMR (that approach can give too-high numbers for a TBDR without enabling blending, as you basically measure the performance of the HSR). It also means the bandwidth savings with binning are potentially significant in that case.
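To put numbers on the per-frame overhead argument, here is a quick back-of-the-envelope check in C. It only uses the figures from the post above (90 GPixel/s, 1080p); nothing in it is vendor- or test-specific. A single fullscreen quad per frame would require well over 40,000 fps to reach peak fillrate, which is why such tests draw many quads per frame to amortize the per-frame overhead.

#include <stdio.h>

int main(void)
{
    const double fillrate_pix_per_s = 90e9;            /* 90 GPixel/s, from the post */
    const double pixels_per_quad    = 1920.0 * 1080.0; /* one fullscreen quad at 1080p */

    for (int quads_per_frame = 1; quads_per_frame <= 64; quads_per_frame *= 4) {
        double fps_needed = fillrate_pix_per_s / (pixels_per_quad * quads_per_frame);
        printf("%2d fullscreen quad(s)/frame -> %8.0f fps to hit peak fillrate\n",
               quads_per_frame, fps_needed);
    }
    return 0;
}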
If you don't want to implement both options in hardware though, you could have a kind of passthrough mode through your TBR, basically a compatibility mode (for whatever reasons) where you do not bin and do not tile.

I would expect that solution. Nobody (not even nV) is going to implement two separate rasterizers in hardware. All you need to do is skip the binning and just pass the primitives through and raster them immediately (as an IMR used to do). And as I said before, any binning rasterizer basically needs this capability as a fallback anyway, afaiu.
But this makes no sense. The draw-stream binning rasterizer always has an advantage over a common rasterizer. Why avoid it?

The sort of localized deferred binning that it is attempting is limited by the finite capacity of the binning hardware, and by certain operations that make it unknown at the time of binning what should be culled, that may unpredictably/incorrectly affect something outside of the tile, or that may compromise the intended function of the code.
The geometry appears not to be binned at all, it is processed serially and not binned in tiles.

The fail-safe position from AMD's patent that seems most applicable is that binning doesn't stop. The hardware instead drops to 1 triangle per bin.
Albeit it has to be said that the "shade once" part (i.e., HSR) cannot work with the UAV counter used in this code (as that's certainly a side effect which would change the rendered result). But that should work independently of the actual binning part, I'd hope...
The use of deferred primitive batch binning in an immediate mode rendering system is opportunistically employed and degrades to an immediate mode rendering system with small delay when a primitive batch size is based on one or more conditions forcing receipt of only one primitive per batch.
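As a rough illustration of that wording, here is a sketch in C of how such an opportunistic fallback could behave. This is entirely my own construction, not code from the patent or any driver; the batch capacity, struct fields, and function names are invented. The point is simply that a condition which makes deferred binning unsafe forces the batch down to a single primitive, i.e. effectively immediate-mode behaviour with only a small added delay.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct primitive { int id; bool has_side_effects; };

#define MAX_BATCH 64   /* finite binning capacity (made-up number) */

static struct primitive batch[MAX_BATCH];
static size_t batch_count;

/* Stand-in for binning, culling and shading everything batched so far. */
static void flush_batch(void)
{
    printf("flush: rasterizing a batch of %zu primitive(s)\n", batch_count);
    batch_count = 0;
}

static void submit_primitive(struct primitive p, bool binning_unsafe)
{
    /* When a condition makes deferred binning unsafe, fall back to one
     * primitive per batch: flush whatever is pending, then process this
     * primitive on its own - immediate-mode-like, with a small delay. */
    if (binning_unsafe || p.has_side_effects) {
        if (batch_count > 0)
            flush_batch();
        batch[batch_count++] = p;
        flush_batch();
        return;
    }

    /* Normal opportunistic path: keep batching until capacity is reached. */
    batch[batch_count++] = p;
    if (batch_count == MAX_BATCH)
        flush_batch();
}

int main(void)
{
    submit_primitive((struct primitive){ 0, false }, false);
    submit_primitive((struct primitive){ 1, false }, false);
    submit_primitive((struct primitive){ 2, true  }, false);  /* side effect: pending batch flushed, then a 1-primitive batch */
    submit_primitive((struct primitive){ 3, false }, false);
    return 0;
}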
I came across this post on Reddit and thought, "this makes the most sense of any analysis I've seen re: AMD's design decisions over the last several years."
...
Comments?
Exactly. Vega - as per the techreport quote earlier - seems to be able to switch back and forth between both rasterizers, thus showing no trace of binning if the traditional option is used (or - at your discretion - the DSBR is not enabled in the driver). If you don't want to implement both options in hardware though, you could have a kind of passthrough mode through your TBR, basically a compatibility mode (for whatever reasons) where you do not bin and do not tile.
Looks like they have a few options.

#define C_028C44_BINNING_MODE 0xFFFFFFFC
#define V_028C44_BINNING_ALLOWED 0
#define V_028C44_FORCE_BINNING_ON 1
#define V_028C44_DISABLE_BINNING_USE_NEW_SC 2
#define V_028C44_DISABLE_BINNING_USE_LEGACY_SC 3
https://cgit.freedesktop.org/mesa/mesa/tree/src/amd/common/gfx9d.h
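Going by those values alone, the 0xFFFFFFFC clear mask implies the mode lives in bits [1:0] of that register, so a driver would just pack one of the four values into that field. The helper below is my own illustrative sketch, not actual Mesa/radeonsi code; only the #define values are taken from gfx9d.h above.

#include <stdint.h>
#include <stdio.h>

#define C_028C44_BINNING_MODE                  0xFFFFFFFC
#define V_028C44_BINNING_ALLOWED               0
#define V_028C44_FORCE_BINNING_ON              1
#define V_028C44_DISABLE_BINNING_USE_NEW_SC    2
#define V_028C44_DISABLE_BINNING_USE_LEGACY_SC 3

static uint32_t set_binning_mode(uint32_t reg, uint32_t mode)
{
    /* Clear bits [1:0] (that's what the 0xFFFFFFFC mask implies) and insert the mode. */
    return (reg & C_028C44_BINNING_MODE) | (mode & 0x3u);
}

int main(void)
{
    uint32_t reg = 0;
    reg = set_binning_mode(reg, V_028C44_DISABLE_BINNING_USE_LEGACY_SC);
    printf("register value: 0x%08X\n", reg);   /* 0x00000003 */
    return 0;
}

If DISABLE_BINNING_USE_LEGACY_SC does what its name suggests, it looks a lot like the passthrough/legacy fallback discussed above.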
The work stealing and scheduling optimization in hardware vs software seems like it might be conflating a number of things. Nvidia still puts a lot of instruction issue and kernel scheduling work on hardware. Volta's thread grouping and handling of divergence actually expands this. Items like that are general parallel computation problems that AMD ...

I remember back in the G80 days, NVIDIA had an enormous amount of scheduling done in hardware compared to AMD's VLIW5, which relied heavily on the compiler; that remained the case through Tesla to Fermi. Kepler changed that a bit and relied on a mixture of compiler and hardware.
but they were ahead on clockspeeds, the 48xx series vs. the gtx 2xx series being the best example,

AMD was never ahead in clock speeds ever since TeraScale was created. NV had double-pumped ALUs which compensated for their lower ALU count, and they also relied more on hardware scheduling, while AMD's VLIW suffered both lower clocks and lower utilization rates.
The tile-based rasterizer has been said to be the secret sauce for nvidia's power efficiency, but I think the answer is less arcane. nvidia has such a large lead in clockspeeds that AMD can't overcome it despite being ahead in perf/mm2 (normalized for clockspeeds).

I don't know if it's straightforward to take clockspeed out like that when it comes to judging performance per area. It goes to who made the right call with regards to the market, workloads, and the realities of manufacturing.
The fail-safe position from AMD's patent that seems most applicable is that binning doesn't stop. The hardware instead drops to 1 triangle per bin.

Which basically means it is not binning anymore but processing the triangles serially (save for some overlap between different triangles falling into different screen tiles assigned to different parallel raster engines). You wrote yourself that it is basically reverting to the classical IMR behaviour.
The way the triangle tiling test works depends on each pixel rendered incrementing a counter, allowing those whose count is higher than the threshold to be killed. If the deferred binning method were in play, why wouldn't it just show the very last triangle all the time if all other occluded pixels were never permitted to run the increment?

I don't get it. My conclusion was that the observed behaviour is a hint that the binning is not in play for Vega right now. That's why it doesn't show just the last triangle color.
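For anyone following along, here is a toy model in C of the mechanism described in that quote. It is my own reconstruction, not the actual test's source; the fully overlapping "triangles", the 64-pixel screen and the threshold are arbitrary. It just shows the difference being argued: with "shade once"/HSR only the last triangle would ever take tickets from the global counter, while serial processing lets the earlier triangles claim them first.

#include <stdbool.h>
#include <stdio.h>

#define PIXELS    64   /* a tiny 8x8 "screen" */
#define TRIANGLES 3    /* all fully overlapping, drawn back to front */
#define THRESHOLD 40   /* pixels with a ticket above this are killed */

static void run(bool shade_once)
{
    int counter = 0, fb[PIXELS] = {0};

    for (int tri = 1; tri <= TRIANGLES; tri++) {
        if (shade_once && tri != TRIANGLES)
            continue;                        /* occluded pixels never run the shader */
        for (int p = 0; p < PIXELS; p++)
            if (counter++ < THRESHOLD)       /* take a ticket; under threshold -> visible */
                fb[p] = tri;
    }

    int visible[TRIANGLES + 1] = {0};
    for (int p = 0; p < PIXELS; p++)
        visible[fb[p]]++;
    printf("shade_once=%d: tri1=%d tri2=%d tri3=%d visible pixels\n",
           (int)shade_once, visible[1], visible[2], visible[3]);
}

int main(void)
{
    run(false);   /* serial/IMR-like: the earliest triangle claims the tickets */
    run(true);    /* HSR: only the last triangle is ever shaded */
    return 0;
}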
Which basically means it is not binning anymore but processing the triangles serially (save for some overlap between different triangles falling into different screen tiles assigned to different parallel raster engines).

It's still binning per the patent because the hardware is still exercised and the additional latency is still there.
I don't get it. My conclusion was that the observed behaviour is a hint that the binning is not in play for Vega right now. That's why it doesn't show just the last triangle color.

Even if the binning were in play, it doesn't seem like skipping to the end of a shader that is supposed to be keeping a global count incremented by every pixel would be valid.
If you ask about Maxwell/Pascal: well, if you don't do sorting of geometry / hidden surface removal (either because the circumstances don't allow it [transparencies, side effects or whatever] or your hardware is incapable of doing so), you still do all the work and just save bandwidth from the increased cache hitrate. That means you can still catch pixels after any amount of drawn triangles. But it still processes tile by tile (you have completed tiles with all triangles drawn and tiles with no triangles drawn on screen [in the framebuffer] at the same time), so binning is still working.
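To make that ordering difference concrete, here is a structural sketch in C (purely illustrative; the helper is a stub and the loop bounds are made up): without HSR, a binning rasterizer shades exactly the same pixels as an IMR, it just visits them tile-major instead of triangle-major, which is where the cache-hit-rate and bandwidth savings come from.

#include <stdio.h>

static void shade_covered_pixels(int tri, int tile)
{
    /* Stub: stands in for shading the pixels of triangle 'tri' inside 'tile'. */
    printf("shade tri %d in tile %d\n", tri, tile);
}

/* Immediate-mode order: triangle-major, each triangle touches many tiles in turn. */
static void rasterize_immediate(int num_triangles, int num_tiles)
{
    for (int tri = 0; tri < num_triangles; tri++)
        for (int tile = 0; tile < num_tiles; tile++)
            shade_covered_pixels(tri, tile);
}

/* Binned order without HSR: tile-major, same shading work, better locality. */
static void rasterize_binned(int num_triangles, int num_tiles)
{
    for (int tile = 0; tile < num_tiles; tile++)
        for (int tri = 0; tri < num_triangles; tri++)
            shade_covered_pixels(tri, tile);
}

int main(void)
{
    rasterize_immediate(3, 4);
    rasterize_binned(3, 4);
    return 0;
}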
Comments?

There might be some historical truth there... say in the AMD of 18 months to 2+ years ago. However, I don't get the impression that is an accurate representation of the AMD of today. Of course, Vega did not just manifest itself out of thin air a week ago. My impression is that after Maxwell, attitudes regarding energy efficiency started to shift, but the full gravity of the situation probably hadn't struck them. I suspect they genuinely expected to gain a little ground with Polaris... not lose more, and that Pascal was a wake-up call that they actually took to heart.

I do believe that AMD "gets it" now, but getting it and translating that into a product already in the pipeline while being stuck at 16/14nm when your competitor already has existing products in the market aren't the best conditions for success. While the early benchmarks show a practically flat efficiency curve, tiling is apparently not being used. Assuming that gets rectified, I would anticipate a 10-15% efficiency gain in general usage (perhaps even more in certain pathological situations). And that would be progress.

Now, if I did not sincerely believe that they "get it" I would be less forgiving. And by the time products hit with the next node shrink (next year), they damn well better have made significant progress.
#define V_008F14_IMG_DATA_FORMAT_16_AS_16_16_16_16_GFX9 0x2B
#define V_008F14_IMG_DATA_FORMAT_16_AS_32_32_32_32_GFX9 0x2C
...
#define V_008F14_IMG_DATA_FORMAT_32_AS_32_32_32_32 0x3F
/* GFX9 has the ESGS ring in LDS. */
https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/radeonsi/si_descriptors.c#n2717

/* GFX9 has only 4KB of CE, while previous chips had 32KB. In order
 * to make CE RAM as useful as possible, this defines limits
 * for the number of slots that can be in CE RAM on GFX9. If a shader
 * is using more, descriptors will be uploaded to memory directly and
 * CE won't be used.
 *
 * These numbers are based on shader-db.
 */
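Paraphrasing that comment as code, purely as my own sketch with made-up names and an arbitrary example limit (the real limits in radeonsi are tuned from shader-db statistics, as the comment says): the driver effectively checks whether a shader's descriptor slots fit the GFX9 CE RAM budget, and if not, it skips the Constant Engine and uploads descriptors to memory instead.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-shader slot limit for illustration only. */
#define GFX9_MAX_CE_SLOTS_EXAMPLE 32

static bool shader_can_use_ce(int descriptor_slots_used)
{
    return descriptor_slots_used <= GFX9_MAX_CE_SLOTS_EXAMPLE;
}

int main(void)
{
    printf("24 slots -> use CE: %d\n", shader_can_use_ce(24));   /* fits: keep descriptors in CE RAM */
    printf("80 slots -> use CE: %d\n", shader_can_use_ce(80));   /* too many: upload to memory, skip CE */
    return 0;
}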
The reason it's not doing tiled rasterization is that it has not been "turned on" yet. Vega is using the fallback rendering method instead of tiled. When it was first discovered that Maxwell used tile-based rendering, there was talk about a lot of software that needed to be written or rewritten in order to utilize it correctly, and Nvidia implemented that in their drivers.

Vega is using a predominantly Fiji driver and this feature has not been "turned on". Actually, all but one of the new features in Vega are not functional right now, the exception being the pixel engine being connected to the L2 cache, as that is hardwired. I tore apart the new drivers in IDA today and the code paths between Fiji and Vega are very close and only differ slightly.

This arch is a massive change from anything they have released with GCN. They built fallbacks into the hardware because of the massive changes. It's a protection against poorly written games and helps AMD have a starting point for driver development. Hell, even architecturally Vega is essentially Fiji at its most basic; that's why it is performing exactly like it, because none of its new features are enabled or have a driver published for them yet. It is performing like a Fury X at 1400 MHz because that is exactly how every computer is treating it.