AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Sure, but again this is not likely something that happens automatically. There's "fast geometry shader" logic on the NV side as well, and that's something that has to be coded for specifically (via NvAPI, not standard D3D).
If it wasn't obvious, the hw simply does not have separate shader stages for all these API shader stages. So I'd very much say shader stage merging happens "automatically" (in the driver), regardless of whether the app knows anything about primitive shaders.
 
Whoever told you that was right, and it probably referred to many professional users' real-world environments, but not to all benchmarks in SPECviewperf 12.
For example, in the Maya-04 benchmark, Pro cards see only a small boost over gaming graphics cards:

[chart: SPECviewperf 12 Maya-04 results across Pro and gaming cards]


Yet in this benchmark, Vega FE sees a 40% improvement over Fury X, clock-for-clock.
Wait, is that the benchmark where Vega FE scores 114-ish at GamersNexus? And downclocked to Fury X level it's 97-ish, while Fury X is at 70-ish? Don't you think something's fishy here?
 
If it wasn't obvious, the hw simply does not have separate shader stages for all these API shader stages. So I'd very much say shader stage merging happens "automatically" (in the driver), regardless of whether the app knows anything about primitive shaders.
That would only work for DX12 and Vulkan, since the Pipeline State Object is monolithic; with all the older APIs you'd be left to explicitly program for it.

^would/could
 
But I will wait for the Vega RX launch reviews. I am sure AMD will spill more details about their architectural changes with regard to gaming then (ROP flushes = gaming).
Can we get your perspective on the matter? Anything weird you noticed during your playtime with Vega? Do you feel the drivers are not working at full capacity, or does the hardware have limitations, or something else?
 
That would only work for DX12 and Vulkan, since the Pipeline State Object is monolithic; with all the older APIs you'd be left to explicitly program for it.

^would/could
No. Monolithic PSOs or not, the driver _has to_ merge these shader stages.
(In practice, this shouldn't be that much of a problem. Typically there shouldn't be too many different combinations of all those separately specified VS/GS/tessellation shaders, and the driver maintains multiple binaries for even single shaders in much the same way already, since there's still non-shader state which will force it to recompile a shader.)
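To illustrate what I mean by "in the driver", here's a rough host-side sketch of such a variant cache; every name and field in it is invented for illustration, it's not how any real driver is structured:

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct MergedShaderKey {
    uint64_t vs_hash;     // hash of the API vertex shader
    uint64_t gs_hash;     // hash of the API geometry shader (0 if none)
    uint64_t state_hash;  // non-shader state that can force a recompile
    bool operator==(const MergedShaderKey& o) const {
        return vs_hash == o.vs_hash && gs_hash == o.gs_hash && state_hash == o.state_hash;
    }
};

struct KeyHash {
    std::size_t operator()(const MergedShaderKey& k) const {
        return std::size_t(k.vs_hash ^ (k.gs_hash * 0x9e3779b97f4a7c15ull) ^ (k.state_hash << 1));
    }
};

using GpuBinary = std::vector<uint8_t>;

class ShaderVariantCache {
public:
    // Returns the single merged hardware binary for this VS+GS+state combo,
    // compiling (i.e. merging the two API stages into one hw stage) only on a miss.
    const GpuBinary& get(const MergedShaderKey& key) {
        auto it = cache_.find(key);
        if (it == cache_.end())
            it = cache_.emplace(key, compileMerged(key)).first;
        return it->second;
    }

private:
    GpuBinary compileMerged(const MergedShaderKey&) {
        return GpuBinary{};  // placeholder for the real backend compile
    }
    std::unordered_map<MergedShaderKey, GpuBinary, KeyHash> cache_;
};

int main() {
    ShaderVariantCache cache;
    MergedShaderKey key{0x1234, 0x5678, 0x9abc};
    cache.get(key);  // first call compiles the merged shader, later calls reuse it
}

The app only ever sees its separate API shaders; the merging is hidden entirely behind a lookup like this.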
 
That's an option, yes. But how do you map this VS/GS merger to SIMDs? They don't run at the same frequency...
You can figure out how it works from the Mesa driver; this was committed about 2 months ago. I don't quite follow the details, it's rather complex (and would probably be easier to understand with hw docs), albeit there's a barrier between the two parts of such a merged shader :). Also, it may not even be necessary to compile multiple shaders together, as it may be possible to compile them independently and stitch the binary pieces together (the driver could do something similar even before for non-merged shaders, by having pre- and post-amble sections, so it's not necessary to recompile everything for non-shader state changes).
Maybe somehow more/different threads are spawned for executing the second part; I really didn't study it...
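For what it's worth, the way I picture a merged ES+GS shader is something like the toy model below, with CPU threads standing in for lanes and a plain array standing in for LDS; it's purely my mental model, not anything from the actual Mesa code:

#include <array>
#include <barrier>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

constexpr int kWaveSize = 4;                 // toy wave; real hw uses 64 lanes
std::array<float, kWaveSize> lds_positions;  // stand-in for LDS / on-chip storage

void merged_es_gs(int lane, std::barrier<>& wave_barrier) {
    // --- first half: the ES/VS code, writing its outputs to "LDS" ---
    lds_positions[lane] = float(lane) * 2.0f;

    // --- the barrier between the two halves of the merged shader ---
    wave_barrier.arrive_and_wait();

    // --- second half: the GS code, reading other lanes' ES outputs ---
    float neighbour = lds_positions[(lane + 1) % kWaveSize];
    std::printf("lane %d emits a primitive from %.1f and %.1f\n",
                lane, lds_positions[lane], neighbour);
}

int main() {
    std::barrier<> wave_barrier(kWaveSize);
    std::vector<std::thread> lanes;
    for (int i = 0; i < kWaveSize; ++i)
        lanes.emplace_back(merged_es_gs, i, std::ref(wave_barrier));
    for (auto& t : lanes) t.join();
}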
 
I think it's launching on the 29th at SIGGRAPH.
They've already stated it won't be available to order during that week. They haven't guaranteed it's going to be the week after, but it hasn't been ruled out either, so we'll see. It looks like they'll be talking about Vega in detail at SIGGRAPH, though, so we should get a lot more details and, depending on how work on drivers and such goes over the next few weeks, maybe a firmer actual launch date.
 
They've already stated it won't be available to order during that week. They haven't guaranteed it's going to be the week after, but it hasn't been ruled out either, so we'll see. It looks like they'll be talking about Vega in detail at SIGGRAPH, though, so we should get a lot more details and, depending on how work on drivers and such goes over the next few weeks, maybe a firmer actual launch date.
For me, I just want to see the final product and benches. I won't buy anything till late fall anyway, and if Vega is a dud I may upgrade my CPU and wait it out with my 290.
 
If it wasn't obvious, the hw simply does not have separate shader stages for all these API shader stages. So I'd very much say shader stage merging happens "automatically" (in the driver)
No. Monolithic PSOs or not, the driver _has to_ merge these shader stages.
I'm not quite sure how DX11/OpenGL drivers handle shader permutations... don't remember. But some insight can be gleaned from two diagrams from back when D3D12 first came out: pages 6 and 7 of the following presentation: https://www.slideshare.net/DevCentralAMD/introduction-to-dx12-by-ivan-nevraev

They are both merged and separated under DX11, leading me to believe there is "module" reuse by the driver. Would primitive shaders break down into different hardware states, or fewer states? Maybe we don't know yet. Either way, the driver would need to keep track of shader permutations and compile outputs to match the HW states from the diagram. I don't know, I just get the feeling it has to be done explicitly, given the way it was introduced by AMD.
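For reference, my rough understanding (from the open-source driver, so take the details with a grain of salt) of where the API vertex shader lands among the GCN hardware stages:

#include <cstdio>

enum class HwStage { LS, HS, ES, GS, VS, PS };

// Which hardware stage the API vertex shader lands in, depending on which
// later stages are present. On gfx9/Vega the driver then merges LS+HS and
// ES+GS into single hardware stages; before that they ran separately.
HwStage hw_stage_for_api_vs(bool has_tess, bool has_gs) {
    if (has_tess) return HwStage::LS;  // feeds the hull / tess-control shader
    if (has_gs)   return HwStage::ES;  // feeds the geometry shader
    return HwStage::VS;                // plain VS -> hardware VS stage
}

int main() {
    std::printf("VS only   -> hw stage %d\n", int(hw_stage_for_api_vs(false, false)));
    std::printf("VS + GS   -> hw stage %d\n", int(hw_stage_for_api_vs(false, true)));
    std::printf("VS + tess -> hw stage %d\n", int(hw_stage_for_api_vs(true, false)));
}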
 
Does anybody know where I can get the Beyond3D suite? It's a cool tool which gives you a good hint about the bottlenecks of a GPU. I want to buy an RX Vega and want to test whether the tile-based rasterizer is working.
 
Not public :runaway: why not? Such a good tool! I think there are a lot of people out there who would spend 20-50 bucks on this tool. We are living in a time of benchmarks, and this is the king of GPU benchmarks.
 
You don't need more than 4 shader engines. The best example is comparing Nvidia's GP102 and GP104. If you look at the polygon output test of the Beyond3D suite, you see no difference between GP102 and GP104 once culling comes into play.
http://www.pcgameshardware.de/Titan...hmark-Tuning-Overclocking-vs-1080-1206879/#a5

http://techreport.com/review/31562/nvidia-geforce-gtx-1080-ti-graphics-card-reviewed/3

So the limitation is not the rasterizer; the culling is the issue. PCGamesHardware say in the article that they wanted to check this behaviour, but they never posted an answer about it!?

Also, if you look at it clock-normalized, Fiji doesn't look so bad there.
The Tech Report conclusion is incorrect. Nvidia had a faster culling rate prior to the tiled rasterizer.
 
They've already stated it won't be available to order during that week. They haven't guaranteed it's going to be the week after, but it hasn't been ruled out either, so we'll see. It looks like they'll be talking about Vega in detail at SIGGRAPH, though, so we should get a lot more details and, depending on how work on drivers and such goes over the next few weeks, maybe a firmer actual launch date.

Is this where they said it wouldn't be available that week? Because if so, he was talking about Computex.

 
The Tech Report conclusion is incorrect. Nvidia had a faster culling rate prior to the tiled rasterizer.
Ever since they've had distributed setup, to be exact (with the "PolyMorph Engine", starting with Fermi). (The tiled rasterizer would not help with that in any case.)
FWIW, GP102 is a bit of an anomaly, as it shows no scaling over GP104 in the culled-polygon throughput test (which I think is what Tech Report must have been using), even though the theoretical culled throughput is nominally simply 1/3 tri per clock per SM, suggesting it hits some other limit on GP102.
 
Efficient sparse matrix-vector multiplication on parallel processors
A method of multiplication of a sparse matrix and a vector to obtain a new vector and a system for implementing the method are claimed. Embodiments of the method are intended to optimize the performance of sparse matrix-vector multiplication in highly parallel processors, such as GPUs. The sparse matrix is stored in compressed sparse row (CSR) format.
Wave level operations like Nvidia's Tensor Core.
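For anyone unfamiliar with CSR, the operation the abstract describes boils down to the textbook loop below; this is not AMD's claimed method, and a GPU version would spread the rows (or row segments) across wavefronts:

#include <cstdio>
#include <vector>

// CSR storage: values and column indices of the nonzeros, plus row_ptr[i]
// giving where row i starts in those arrays (row_ptr has rows+1 entries).
struct CsrMatrix {
    int rows;
    std::vector<int>    row_ptr;
    std::vector<int>    col_idx;
    std::vector<double> values;
};

std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (int i = 0; i < A.rows; ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.values[k] * x[A.col_idx[k]];
    return y;
}

int main() {
    // 3x3 example:  [10  0  2]
    //               [ 0  3  0]
    //               [ 0  4  5]
    CsrMatrix A{3, {0, 2, 3, 5}, {0, 2, 1, 1, 2}, {10, 2, 3, 4, 5}};
    std::vector<double> x{1, 2, 3};
    auto y = spmv(A, x);
    std::printf("%g %g %g\n", y[0], y[1], y[2]);  // 16 6 23
}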

METHOD AND APPARATUS FOR PERFORMING HIGH THROUGHPUT TESSELLATION
A method, a system, and a computer-readable storage medium directed to performing high-speed parallel tessellation of 3D surface patches are disclosed. The method includes generating a plurality of primitives in parallel. Each primitive in the plurality is generated by a sequence of functional blocks, in which each sequence acts independently of all the other sequences.
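The "each sequence acts independently" part is basically per-patch parallelism; a toy illustration with uniform quad tessellation (nothing here reflects the patent's actual functional-block layout):

#include <cstdio>
#include <thread>
#include <vector>

struct Tri { float x[3], y[3]; };

// Tessellate one unit quad patch at the given level into 2*level*level triangles.
std::vector<Tri> tessellate_patch(int level) {
    std::vector<Tri> tris;
    const float step = 1.0f / level;
    for (int j = 0; j < level; ++j)
        for (int i = 0; i < level; ++i) {
            float x0 = i * step, y0 = j * step, x1 = x0 + step, y1 = y0 + step;
            tris.push_back(Tri{{x0, x1, x0}, {y0, y0, y1}});
            tris.push_back(Tri{{x1, x1, x0}, {y0, y1, y1}});
        }
    return tris;
}

int main() {
    const int num_patches = 4, level = 8;
    std::vector<std::vector<Tri>> out(num_patches);
    std::vector<std::thread> workers;
    for (int p = 0; p < num_patches; ++p)  // one independent "pipeline" per patch
        workers.emplace_back([&out, p] { out[p] = tessellate_patch(level); });
    for (auto& w : workers) w.join();
    for (int p = 0; p < num_patches; ++p)
        std::printf("patch %d -> %zu triangles\n", p, out[p].size());
}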

Programmable substitutions for microcode
Looks like IF routing updates.

Method and system for yield operation supporting thread-like behavior
A method, system, and computer program product synchronize a group of workitems executing an instruction stream on a processor. The processor is yielded by a first workitem responsive to a synchronization instruction in the instruction stream. A first one of a plurality of program counters is updated to point to a next instruction following the synchronization instruction in the instruction stream to be executed by the first workitem. A second workitem is run on the processor after the yielding.
Volta's sync threads?
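The per-workitem program counter idea can be shown with a toy interpreter; everything below is invented just to illustrate "bump the PC past the yield, then run another workitem":

#include <cstddef>
#include <cstdio>
#include <vector>

enum class Op { WORK, YIELD, HALT };

int main() {
    // One shared instruction stream for the whole workgroup.
    const std::vector<Op> program = {Op::WORK, Op::YIELD, Op::WORK, Op::HALT};

    const int num_items = 2;
    std::vector<std::size_t> pc(num_items, 0);  // one program counter per workitem
    std::vector<bool> done(num_items, false);

    int current = 0;            // workitem currently holding the processor
    int remaining = num_items;
    while (remaining > 0) {
        if (done[current]) { current = (current + 1) % num_items; continue; }
        switch (program[pc[current]]) {
        case Op::WORK:
            std::printf("workitem %d executes WORK at pc=%zu\n", current, pc[current]);
            ++pc[current];
            break;
        case Op::YIELD:
            // Point this workitem's PC at the next instruction, then yield.
            ++pc[current];
            std::printf("workitem %d yields\n", current);
            current = (current + 1) % num_items;
            break;
        case Op::HALT:
            done[current] = true;
            --remaining;
            current = (current + 1) % num_items;
            break;
        }
    }
}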

Memory access monitor
For each access request received at a shared cache of the data processing device, a memory access pattern (MAP) monitor predicts which of the memory banks, and corresponding row buffers, would be accessed by the access request if the requesting thread were the only thread executing at the data processing device. By recording predicted accesses over time for a number of access requests, the MAP monitor develops a pattern of predicted memory accesses by executing threads. The pattern can be employed to assign resources at the shared cache, thereby managing memory more efficiently.
Possibly HBCC, but may be Zen or both.
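A crude model of the bookkeeping being described, with a made-up address layout (64 B lines, 32 lines per row, 16 banks), just to show the per-thread hit/miss prediction:

#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>

constexpr int kOffsetBits = 6;   // 64 B cache lines
constexpr int kColBits    = 5;   // 32 lines per 2 KiB row
constexpr int kBankBits   = 4;   // 16 banks
constexpr uint64_t kBankMask = (1u << kBankBits) - 1;

struct MapMonitor {
    // Last predicted open row, tracked separately per (thread, bank).
    std::map<std::pair<int, uint64_t>, uint64_t> open_row;
    uint64_t hits = 0, misses = 0;

    // Predict the bank/row this access would touch if the thread ran alone,
    // and record whether it would have hit the currently open row.
    void observe(int thread, uint64_t addr) {
        uint64_t bank = (addr >> (kOffsetBits + kColBits)) & kBankMask;
        uint64_t row  =  addr >> (kOffsetBits + kColBits + kBankBits);
        auto key = std::make_pair(thread, bank);
        auto it  = open_row.find(key);
        if (it != open_row.end() && it->second == row) ++hits; else ++misses;
        open_row[key] = row;
    }
};

int main() {
    MapMonitor mon;
    // Thread 0 streams sequentially (row-buffer friendly),
    // thread 1 strides by 1 MiB (a new row on every access).
    for (uint64_t a = 0; a < (64u << 10); a += 64)         mon.observe(0, a);
    for (uint64_t a = 0; a < (16u << 20); a += (1u << 20)) mon.observe(1, a);
    std::printf("predicted row hits: %llu  misses: %llu\n",
                (unsigned long long)mon.hits, (unsigned long long)mon.misses);
}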

Stacked memory device with metadata management
A processing system comprises one or more processor devices and other system components coupled to a stacked memory device having a set of stacked memory layers and a set of one or more logic layers. The set of logic layers implements a metadata manager that offloads metadata management from the other system components. The set of logic layers also includes a memory interface coupled to memory cell circuitry implemented in the set of stacked memory layers and coupleable to the devices external to the stacked memory device. The memory interface operates to perform memory accesses for the external devices and for the metadata manager. By virtue of the metadata manager's tight integration with the stacked memory layers, the metadata manager may perform certain memory-intensive metadata management operations more efficiently than could be performed by the external devices.
Examples of metadata include data integrity/security information, such as parity bits, checksums, and error correcting codes (ECCs), address translation information (e.g., page table entries and translation lookaside buffer entries), status indicators (e.g., dirty bits, valid bits, reachable bits), and the like. More generally, operational data is data provided to the stacked memory device for storage, and metadata is data used by the stacked memory device to access, characterize, or modify the stored operational data.
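A toy version of "metadata maintenance lives in the logic layer", with a simple XOR checksum standing in for real ECC; the structure is invented purely for illustration:

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <numeric>

constexpr std::size_t kBlockSize = 64;
constexpr std::size_t kNumBlocks = 4;

struct StackedMemory {
    std::array<std::array<uint8_t, kBlockSize>, kNumBlocks> data{};
    std::array<uint8_t, kNumBlocks> checksum{};  // metadata kept in the logic layer

    void write(std::size_t block, const std::array<uint8_t, kBlockSize>& bytes) {
        data[block] = bytes;
        // Metadata maintenance happens here, inside the memory stack,
        // so the host never touches the checksum path.
        checksum[block] = std::accumulate(bytes.begin(), bytes.end(), uint8_t{0},
            [](uint8_t a, uint8_t b) { return uint8_t(a ^ b); });
    }

    bool read(std::size_t block, std::array<uint8_t, kBlockSize>& out) const {
        out = data[block];
        uint8_t sum = std::accumulate(out.begin(), out.end(), uint8_t{0},
            [](uint8_t a, uint8_t b) { return uint8_t(a ^ b); });
        return sum == checksum[block];  // integrity check, also offloaded
    }
};

int main() {
    StackedMemory mem;
    std::array<uint8_t, kBlockSize> buf{};
    buf.fill(0xAB);
    mem.write(0, buf);
    std::array<uint8_t, kBlockSize> out;
    std::printf("block 0 intact: %s\n", mem.read(0, out) ? "yes" : "no");
}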

Computer architecture using rapidly reconfigurable circuits and high-bandwidth memory interfaces
A programmable device comprises one or more programming regions, each comprising a plurality of configurable logic blocks, where each of the plurality of configurable logic blocks is selectively connectable to any other configurable logic block via a programmable interconnect fabric. The programmable device further comprises configuration logic configured to, in response to an instruction in an instruction stream, reconfigure hardware in one or more of the configurable logic blocks in a programming region independently from any of the other programming regions.
Configuring Infinity to make pipelines on the fly?
 