The flickering happens when you enable the 48-100 mode.
See https://forum.beyond3d.com/posts/1993993/
How much of Vega's (and Polaris's?) power problems are solely due to Global Foundries, versus TSMC?

I'd wager none. They basically used the new node to increase clock speeds at roughly the same power consumption as the Fury X. Apparently GCN required a lot of juice and area for that. Ryzen turned out fine despite being fabbed at GF.
Reference?
Sorry, per thread group, not per SIMD. Regardless, that seems a rather interesting change in the scheme of things. However, I thought 1024 was an established limit for most APIs.

radeonsi/gfx9: reduce max threads per block to 1024 on gfx9+
The number of supported waves per thread group has been reduced to 16 with gfx9. Trying to use 32 waves causes hangs, and barriers might not work correctly with > 16 waves.
https://cgit.freedesktop.org/mesa/mesa/commit/?id=a0e6b9a2db5aa5f06a4f60d270aca8344e7d8b3f
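For reference, the 1024 figure falls straight out of the wave math: a GCN wavefront is 64 work items, so a 16-wave cap per thread group allows at most 16 x 64 = 1024 work items, and a 2048-item group would need the 32 waves the commit says can hang. A minimal C sketch of that arithmetic (the helper name is just illustrative):

```c
#include <stdio.h>

#define WAVEFRONT_SIZE      64  /* work items per GCN wavefront */
#define MAX_WAVES_PER_GROUP 16  /* gfx9 cap per the Mesa commit above */

/* Illustrative helper: wavefronts needed for a given workgroup size. */
static unsigned waves_for_workgroup(unsigned work_items)
{
    return (work_items + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

int main(void)
{
    const unsigned sizes[] = { 256, 1024, 2048 };
    for (int i = 0; i < 3; i++) {
        unsigned waves = waves_for_workgroup(sizes[i]);
        printf("%4u work items -> %2u waves (%s)\n", sizes[i], waves,
               waves <= MAX_WAVES_PER_GROUP ? "fits the gfx9 limit"
                                            : "exceeds the 16-wave limit");
    }
    return 0;
}
```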
AMD has had a bunch of patents recently that were all quickly filed and published. For the most part they seem to be software techniques for ambiguous hardware, like most patents. Anyways: I would reference the filing date for AMD's binning rasterizer, which is 2013. Vega's development may have been prolonged by internal issues, but even in other contexts you wouldn't expect something patented now to have been sat on for the years the design was moving down the pipeline. Given the timing, we might have to wonder if or when it might show up. That it was filed and made public may mean AMD isn't worried about competitors seeing it too early (publication can be delayed significantly from filing if you care).
128K of what, page table entries?

In a more compressed form, yes. While certainly possible, I'm guessing the paged memory is a subset of the overall pool, leaving the HBCC to only track active pages. Some resources (framebuffer, meshes, stacks) simply won't lend themselves to paging very well and will likely be kept in a separate pool.
GCN does support 4KB x86 page tables, but 128K entries at that granularity would not come close to covering 16GB. Even going with the coarse 64KB PRT granularity, the additional context and history tracking would run over a MB.
On top of that, CPU TLB hierarchies and the page table hierarchy are backed up by their caches. The HBCC may not have that option, or it might want to avoid using it that much given the L2 isn't that expansive.
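Rough numbers behind that estimate, assuming 8 bytes per entry (a typical x86-64 PTE size; the actual per-page state the HBCC keeps, including any history bits, isn't public):

```c
#include <stdio.h>

int main(void)
{
    const unsigned long long pool_bytes   = 16ULL << 30;  /* 16 GiB local pool */
    const unsigned long long entry_bytes  = 8;            /* assumed per-entry size */
    const unsigned long long page_sizes[] = { 4ULL << 10, 64ULL << 10 };

    for (int i = 0; i < 2; i++) {
        unsigned long long pages = pool_bytes / page_sizes[i];
        printf("%2llu KiB pages: %7llu entries (~%5llu KiB of entries); "
               "128K entries would map only %llu MiB\n",
               page_sizes[i] >> 10, pages, (pages * entry_bytes) >> 10,
               (131072ULL * page_sizes[i]) >> 20);
    }
    return 0;
}
```

That works out to roughly 2 MiB of entries at 64 KiB granularity (the "over a MB" above) and 32 MiB at 4 KiB, while 128K entries of either size would cover only 8 GiB or 512 MiB respectively.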
Some references to "Texture Caches" in drivers so there could be multiple generic L2s. That might actually make sense for async to avoid trashing.Possible additions so far - without actual sizes though:
The GCN3 ISA indicates the maximum workgroup size is 16 wavefronts (1024 work items). Whatever limit is being set here has some other confounding issue if they got away with 2048 before.
The "explanation" of the primitive shader is entirely unconvincing. Seems likely to be a white elephant. I wonder if this was built by AMD as the basis for a console chip at some later date. In a console it would be totally awesome, I presume.

It could be an evolution of a customization created for a console already built.
"There are a broad variety of techniques we've come up with to reduce the vertex bottlenecks, in some cases they are enhancements to the hardware," said Cerny. "The most interesting of those is that you can use compute as a frontend for your graphics."
This technique, he said, is "a mix of hardware, firmware inside of the GPU, and compiler technology. What happens is you take your vertex shader, and you compile it twice, once as a compute shader, once as a vertex shader. The compute shader does a triangle sieve -- it just does the position computations from the original vertex shader and sees if the triangle is backfaced, or the like. And it's generating, on the fly, a reduced set of triangles for the vertex shader to use. This compute shader and the vertex shader are very, very tightly linked inside of the hardware."
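Not AMD's or Sony's actual implementation, but a minimal CPU-side sketch of the "triangle sieve" idea Cerny describes: run only the position math, drop back-facing triangles, and emit a compacted index list for the real vertex shader pass. All names here are hypothetical.

```c
#include <stddef.h>

typedef struct { float x, y; } vec2;  /* post-projection screen-space positions */

/* Twice the signed area of the screen-space triangle; <= 0 means
 * back-facing or degenerate under a counter-clockwise front-face rule. */
static float signed_area2(vec2 a, vec2 b, vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* "Triangle sieve": walk the original index buffer, keep only triangles that
 * pass the facing test, and write a reduced index buffer for the second pass.
 * Returns the number of indices written. */
static size_t sieve_triangles(const vec2 *pos, const unsigned *idx,
                              size_t idx_count, unsigned *out_idx)
{
    size_t out = 0;
    for (size_t i = 0; i + 2 < idx_count; i += 3) {
        unsigned i0 = idx[i], i1 = idx[i + 1], i2 = idx[i + 2];
        if (signed_area2(pos[i0], pos[i1], pos[i2]) > 0.0f) {
            out_idx[out++] = i0;
            out_idx[out++] = i1;
            out_idx[out++] = i2;
        }
    }
    return out;
}
```

The interesting part in Cerny's description is that this filtering runs as a compute shader generated from the same vertex shader source and feeds the vertex stage on the fly, which a plain CPU loop obviously can't capture.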
Exact wording may be important. There are references to "texture channel caches", which are actually describing the L2.
Which slide suggests that INT and FP are running concurrently (on the vec16-SIMDs)?
Going to hold off on the concurrently part given the context for the time being, after re-reading that paragraph. That 4-5x part is still interesting though. INT8 would be 4x, but the extra +1 relative to Polaris I'm unsure about. The rest of the paragraph is INT16/FP16, which I figured ran concurrently for 4x.

AMD estimates a Vega NGCU to be able to handle 4-5x the number of operations per clock cycle relative to the previous CUs in Polaris. They demonstrate a use case of Rapid Packed Math using 3DMark Serra, a custom demo created by Futuremark for AMD to show off this technology, wherein 16-bit integer and floating point operations result in as much as 25% benefit in operation count.
https://www.techpowerup.com/reviews/AMD/Vega_Microarchitecture_Technical_Overview/3.html
It may be a Linux thing, because that code would have been actively used for years now. All the documentation I recall has the 1024 work item limit as you mentioned, but obviously they were exceeding that limit with some success.
If some of the instructions in the addressing category have scalar and vector variants, the scalar portion running a chained operation can be the +1 if running concurrently with a 4x INT8 operation. It seems like a sensible thing to have in both domains.
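On the counting itself, the packing factors are just lane-width arithmetic: a 32-bit lane holds two 16-bit or four 8-bit values, which gives the 2x and 4x figures; the extra +1 from a co-issued scalar op is the speculation above, not anything AMD has confirmed. A small C sketch of the per-CU numbers:

```c
#include <stdio.h>

int main(void)
{
    const int simds_per_cu   = 4;   /* four vec16 SIMDs per (NG)CU */
    const int lanes_per_simd = 16;
    const int lane_bits      = 32;

    const int op_widths[] = { 32, 16, 8 };  /* FP32, FP16/INT16 packed, INT8 packed */
    for (int i = 0; i < 3; i++) {
        int packing = lane_bits / op_widths[i];  /* values per 32-bit lane */
        printf("%2d-bit ops: %dx packing -> %3d ops/clock per CU\n",
               op_widths[i], packing,
               packing * lanes_per_simd * simds_per_cu);
    }
    return 0;
}
```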
It's only inactive in Vega FE; it's active in AMD's final RX performance targets. AMD stated it's active in the 17.20 driver (page 43 note), and AMD tested all of their games using driver 17.30, which is the driver after it.
I'd say you're oversimplifying things. Ryzen is too different from Vega / Fury to compare (and who's to say it doesn't perform as it does in spite of the process hampering it, rather than because of it).
Further, for Fury vs. Vega, both the architecture and the process are different. Which is the main cause of Vega's underperformance could be anyone's guess.
Honestly no idea why Samsung does the multiple modes when other vendors using the same panels don't, because yeah, 80-100 is very small.

I think the real question is: with so many FreeSync monitors on the market, why do they bundle the one with the worst reputation? I think Samsung simply saw this as an opportunity to offload a lemon, and AMD took the bait.
that indicates Vega would be 10-15% slower than the 1080 without that new feature enabled.

It was indeed; Vega FE is in between the 1070 and 1080, and it had DSBR disabled. AMD will provide a patch to enable it for FE when RX launches, so FE and RX will have equal gaming performance. Maybe @Rys can shed more light on the matter, if his hands aren't tied, that is.
If all of AMD's slides were made using the newer drivers with DSBR enabled, and those slides show Vega ~= 1080 FE, that indicates Vega would be 10-15% slower than the 1080 without that new feature enabled. I'm not saying you're wrong, but my brain can't wrap itself around what that means (i.e. with a 50% clock increase, Vega would only be ~15% faster than a Fury X). What has AMD been doing for the last 2 years?

Exotic memory raising platform cost, a lengthened pipeline for higher clock speeds costing a lot of transistors and reducing IPC, feature extensions which require significant developer effort to implement, an architecture pitched as best suited for 'tomorrow's workloads'.
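To put numbers on that gap: both chips are 64-CU parts, so if performance tracked clocks alone, a roughly 1050 MHz to 1600 MHz move (approximate reference/boost clocks, used purely for illustration) would predict about +50%, not +15%. A trivial sketch of that comparison:

```c
#include <stdio.h>

int main(void)
{
    /* Approximate clocks, for illustration only; both parts have 64 CUs. */
    const double fury_x_mhz = 1050.0;
    const double vega_mhz   = 1600.0;

    double clock_scaling = vega_mhz / fury_x_mhz;  /* ~1.52x */
    double estimated     = 1.15;  /* "only ~15% faster than a Fury X", per the post above */

    printf("Ideal gain from clocks alone:  %.0f%%\n", (clock_scaling - 1.0) * 100.0);
    printf("Estimated gain without DSBR:   %.0f%%\n", (estimated - 1.0) * 100.0);
    printf("Implied per-clock throughput:  %.0f%% of Fury X\n",
           estimated / clock_scaling * 100.0);
    return 0;
}
```

In other words, if the DSBR-off estimate is right, roughly a quarter of the per-clock throughput would have gone missing relative to Fury X, which is the part that is hard to wrap one's head around.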
GP107 is the counter point. Although with baseline frequencies rather low, it boosts easily to 1700 MHz, uses less power per FPS than the AMD competition, and is made on the same process, but at Samsung.

Nevertheless, it does clock lower than the same architecture implemented on the similar TSMC process.