AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Didn't Spencer say all Xbox exclusives would be getting a dual-release between Xbox and UWP from now on?
No, he said that games will go where it makes sense.
It just so happens that all the games make sense to be on both. :LOL:
Maybe Crackdown won't? Simply because it relies heavily on the cloud, which Gold could offset the cost of but PC wouldn't?
 
https://lists.freedesktop.org/archives/amd-gfx/2017-March/006570.html

Thx to Ieldra

switch (adev->asic_type) {
+ case CHIP_VEGA10:
+ adev->gfx.config.max_shader_engines = 4;
+ adev->gfx.config.max_tile_pipes = 8; //??
+ adev->gfx.config.max_cu_per_sh = 16;
+ adev->gfx.config.max_sh_per_se = 1;
+ adev->gfx.config.max_backends_per_se = 4;
+ adev->gfx.config.max_texture_channel_caches = 16;
+ adev->gfx.config.max_gprs = 256;
+ adev->gfx.config.max_gs_threads = 32;
+ adev->gfx.config.max_hw_contexts = 8;
+
+ adev->gfx.config.sc_prim_fifo_size_frontend = 0x20;
+ adev->gfx.config.sc_prim_fifo_size_backend = 0x100;
+ adev->gfx.config.sc_hiz_tile_fifo_size = 0x30;
+ adev->gfx.config.sc_earlyz_tile_fifo_size = 0x4C0;
 
+ case CHIP_FIJI:
+ adev->gfx.config.max_shader_engines = 4;
+ adev->gfx.config.max_tile_pipes = 16;
+ adev->gfx.config.max_cu_per_sh = 16;
+ adev->gfx.config.max_sh_per_se = 1;
+ adev->gfx.config.max_backends_per_se = 4;
+ adev->gfx.config.max_texture_channel_caches = 8;
+ adev->gfx.config.max_gprs = 256;
+ adev->gfx.config.max_gs_threads = 32;
+ adev->gfx.config.max_hw_contexts = 8;
https://patchwork.kernel.org/patch/6932301/

The basic data is the same: 4096 ALUs, 4 Shader/Geometry-Engines, and probably 64 ROPs, unless AMD changed the number of ROPs per Render-Backend from 4 to 8.
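For reference, the headline numbers work out from the quoted config values like this (the 64 lanes per CU and the 4-or-8 ROPs per render back-end figures are the usual GCN assumptions from the sentence above, not values stated in the patch):

#include <stdio.h>

/* Illustrative arithmetic only: ses/sh_per_se/cu_per_sh/rb_per_se come from the
 * quoted Vega 10 gfx config; 64 lanes per CU and 4 (or 8) ROPs per render
 * back-end are assumed GCN figures, not values from the patch. */
int main(void)
{
    unsigned ses = 4, sh_per_se = 1, cu_per_sh = 16, rb_per_se = 4;

    printf("ALUs: %u\n", ses * sh_per_se * cu_per_sh * 64); /* 4096 */
    printf("ROPs at 4 per RB: %u\n", ses * rb_per_se * 4);  /* 64  */
    printf("ROPs at 8 per RB: %u\n", ses * rb_per_se * 8);  /* 128 */
    return 0;
}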
 
I can't tell what you mean by "execution" here and why you say "four" instructions are in execution at once. Perhaps you're referring to decode, operand fetch, compute and resultant write?

The logic to compute an instruction runs for four clocks. After that, another instruction runs for four clocks. Therefore at any point, the 4-clock computation logic is working on either one or two instructions. Labelling the logic stages as A, B, C and D:

A - instruction 2, work items 0 to 15
B - instruction 1, work items 48 to 63
C - instruction 1, work items 32 to 47
D - instruction 1, work items 16 to 31

I want to check what you're saying before going any further.
3 cycles to fetch input registers, one cycle to execute (ALU) for each 16-wide part of the instruction (sub-instruction). The 16-wide parts start execution one cycle after each other (4 cycles in total). Sorry for the ambiguous use of terms: I used "instruction" instead of "sub-instruction". I meant that four 16-wide sub-instructions are processed simultaneously. Only two actual (64-wide) instructions interleave each other.

Each 16-wide part of the previous instruction is finished one clock cycle before the same 16-wide part of the next instruction fetches its first operand. I am not sure whether it works like this, but it would pipeline perfectly without any bubbles. The execution unit would be fed all the time and the register files would serve one load (or store) per cycle. Disclaimer: I am not a hardware engineer, so this mental model might be totally wrong.
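A minimal sketch of that mental model (my reading of the post, not documented AMD behaviour): four register-file banks, one per 16-wide slice; each sub-instruction spends 3 cycles fetching operands from its bank and 1 cycle on the ALU, and sub-instruction k of instruction n starts at cycle 4n + k. Printing the occupancy shows one 16-wide ALU pass per cycle once the pipeline fills, and at most one register-file access per bank per cycle (write-back is left out for simplicity):

#include <stdio.h>

/* Sketch of the cadence described above: each 64-wide instruction is split
 * into four 16-wide sub-instructions, one per register-file bank. A
 * sub-instruction spends 3 cycles on operand fetch ('F') and 1 cycle on the
 * ALU ('X'), and sub-instruction k of instruction n starts at cycle 4*n + k.
 * Result write-back is not modelled. */
#define INSTRUCTIONS 3
#define CYCLES (4 * INSTRUCTIONS + 4)

int main(void)
{
    /* stage[cycle][bank]: '.' idle, 'F' operand fetch, 'X' 16-wide ALU pass */
    char stage[CYCLES][4];

    for (int c = 0; c < CYCLES; c++)
        for (int b = 0; b < 4; b++)
            stage[c][b] = '.';

    for (int n = 0; n < INSTRUCTIONS; n++) {
        for (int b = 0; b < 4; b++) {
            int start = 4 * n + b;          /* first operand fetch cycle */
            stage[start][b]     = 'F';
            stage[start + 1][b] = 'F';
            stage[start + 2][b] = 'F';
            stage[start + 3][b] = 'X';      /* finishes one cycle before the
                                             * next instruction's same slice
                                             * starts fetching */
        }
    }

    printf("cycle  bank0 bank1 bank2 bank3\n");
    for (int c = 0; c < CYCLES; c++)
        printf("%5d    %c     %c     %c     %c\n",
               c, stage[c][0], stage[c][1], stage[c][2], stage[c][3]);

    return 0;
}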
 
+ case CHIP_FIJI:
+ adev->gfx.config.max_shader_engines = 4;
+ adev->gfx.config.max_tile_pipes = 16;
+ adev->gfx.config.max_cu_per_sh = 16;
+ adev->gfx.config.max_sh_per_se = 1;
+ adev->gfx.config.max_backends_per_se = 4;
+ adev->gfx.config.max_texture_channel_caches = 8;
+ adev->gfx.config.max_gprs = 256;
+ adev->gfx.config.max_gs_threads = 32;
+ adev->gfx.config.max_hw_contexts = 8;
https://patchwork.kernel.org/patch/6932301/

One item that was later changed was the max_texture_channel_caches value, which was bumped up to 16.
https://patchwork.kernel.org/patch/7613411/
 
"SE" refers to "Shader Engine", but can't help you on SH or Tile_Pipes

Going by the following, tile_pipes corresponds in some manner to the number of memory channels.
https://lists.freedesktop.org/archives/mesa-dev/2016-February/107122.html

Tahiti's non-power-of-2 memory bus made it something of a special case.
The tile pipe and texture channel cache values generally match. Perhaps that is why the Vega tile pipe count has a "??" after it, given it does not match Fiji's tile pipe count or its own texture cache count.

I'm not clear on SH, but perhaps it is related to the shader array within an SE? It's a separate count in GCN, although it's been 1 at least most of the time.
I'm not sure about this reference, where it was set to 2 for some of the discrete parts in the past.
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/radeon/si.c


It may be how to reference the CUs in an SE without referencing the SE's other components (I cannot speak to the full context of the following comment, however).
https://patchwork.freedesktop.org/patch/145143/
if (sscreen->b.info.num_good_compute_units /
    (sscreen->b.info.max_se * sscreen->b.info.max_sh_per_se) <= 4) {
        /* Too few available compute units per SH. Disallowing
         * VS to run on CU0 could hurt us more than late VS
         * allocation would help.
         *
         * LATE_ALLOC_VS = 2 is the highest safe number.
         */
 
Probably CU reservation for geometry to avoid saturating the pipeline.

These values were covered a month or so ago, not too much new with them besides the similar pipeline structure.
 
Probably CU reservation for geometry to avoid saturating the pipeline.

These values were covered a month or so ago, not too much new with them besides the similar pipeline structure.

References to this section outside of the diff seem to show that this low-resource case does something to expand the number of CUs immediately available, since it sets CU0 to be enabled for VS work. The value for determining if the CU is available or not looks to be 0xffff in that case, 0xfffe in the case where resources are more plentiful.
The number set for SH per SE seems like it can be 1 or 2, but the higher number seems to be less-used.

I recall discussing some of the config variables earlier in the month, with regards to the cache and pipe values.
 
The value for determining if the CU is available or not looks to be 0xffff in that case, 0xfffe in the case where resources are more plentiful.
Assuming "SH" is literally shaders, the 0xffff looks like a 16-bit execution mask. Fiji has 16 CUs per SE (4x16 = 64 CUs), and there is no scheduling across CUs, although that should be coming with Vega and PS4 Pro according to interviews. Might also be workgroups.
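A minimal sketch of that reading, one enable bit per CU in a shader array with bit 0 = CU0 (the interpretation is speculative, and the helper below is illustrative rather than driver code):

#include <stdio.h>

/* Hedged sketch: read the value as one enable bit per CU in a shader array,
 * bit 0 = CU0. 0xffff would then let VS waves run on every CU, while 0xfffe
 * would keep VS off CU0 (the reservation discussed above). Illustrative only,
 * not driver code or a documented register layout. */
static int cu_enabled_for_vs(unsigned short mask, unsigned cu)
{
    return (mask >> cu) & 1;
}

int main(void)
{
    unsigned short plentiful = 0xfffe; /* more CUs available: VS kept off CU0 */
    unsigned short scarce    = 0xffff; /* few CUs per array: all 16 usable by VS */

    printf("CU0 usable by VS (plentiful case): %d\n", cu_enabled_for_vs(plentiful, 0)); /* 0 */
    printf("CU0 usable by VS (scarce case):    %d\n", cu_enabled_for_vs(scarce, 0));    /* 1 */
    return 0;
}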

3 cycles to fetch input registers, one cycle to execute (ALU) for each 16-wide part of the instruction (sub-instruction).
3R1W would read 3 registers and write 1 in a single cycle. There may be a delay if swizzling within those registers as the data shifts according to the DPP patterns. Flipping along powers of two with write masks.

Realistically it's probably 5R2W, with the extra ports servicing LDS or scalar instructions, which is why you get a maximum of 5 instructions from different waves per SIMD. The scheduler should be doing a round robin across SIMDs, so the start of each full instruction issue will be offset by a clock cycle. Data copies with masks are fast (unlike adders, where the signal has to propagate through every bit), so the DPP could fit 3 (probably more, but too m) shifts into a single clock cycle given limited patterns.
 
3R1W would read 3 registers and write 1 in a single cycle. There may be a delay if swizzling within those registers as the data shifts according to the DPP patterns. Flipping along powers of two with write masks.

...so the DPP could fit 3 (probably more, but too m) shifts into a single clock cycle given limited patterns.
DPP needs 2 wait cycles before data can be used. See here: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/. This clearly tells us that DPP operates outside the common 4-cycle cadence (no visible latency). Unlike all other instructions, DPP needs manual wait states. The compiler needs to ensure this. DPP also needs 5 wait cycles in the case of the execution mask and SGPRs. Clearly there's some lane swizzle hardware operating in parallel to the register files, and it adds some additional latency.

DPP is not the common case. Only a small minority of instructions use the DPP modifier. DPP was also added later (GCN3). The original GCN used the LDS crossbar for all cross-lane swizzles. There were no instructions that allowed you to directly access data from other lanes.
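For illustration, the wait-state rule from the linked GPUOpen article can be expressed as a tiny helper showing what a compiler has to guarantee: 2 wait states before a DPP result is consumed, 5 when the execution mask or SGPR operands were just written. This is a sketch, not real LLVM or driver logic:

#include <stdio.h>

/* Hedged sketch of the DPP wait-state rule described above. The 2- and 5-cycle
 * counts come from the linked GPUOpen article; the helper itself is
 * illustrative, not a real compiler API. */
static unsigned dpp_wait_states_needed(unsigned independent_instrs_between,
                                       int exec_or_sgpr_just_written)
{
    unsigned required = exec_or_sgpr_just_written ? 5 : 2;

    if (independent_instrs_between >= required)
        return 0; /* unrelated work already covers the gap */
    return required - independent_instrs_between; /* NOPs to insert */
}

int main(void)
{
    printf("back-to-back use of a DPP result: %u wait states\n",
           dpp_wait_states_needed(0, 0)); /* 2 */
    printf("same, but EXEC/SGPR just written: %u wait states\n",
           dpp_wait_states_needed(0, 1)); /* 5 */
    printf("with 3 independent instructions in between: %u wait states\n",
           dpp_wait_states_needed(3, 0)); /* 0 */
    return 0;
}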
 
I'm not clear on SH, but perhaps it is related to the shader array within an SE? It's a separate count in GCN, although it's been 1 at least most of the time.
I'm not sure about this reference, where it was set to 2 for some of the discrete parts in the past.
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/radeon/si.c
You are probably right:
ws->info.max_sh_per_se = ws->amdinfo.num_shader_arrays_per_engine;

-> https://github.com/anholt/mesa/blob/master/src/gallium/winsys/amdgpu/drm/amdgpu_winsys.c
 
Assuming "SH" is literally shaders, the 0xffff looks like a 16-bit execution mask. Fiji has 16 CUs per SE (4x16 = 64 CUs), and there is no scheduling across CUs, although that should be coming with Vega and PS4 Pro according to interviews. Might also be workgroups.
Setting the mask predates Fiji; it is done for anything GCN2 and newer, regardless of the CU count per SE. The setup for GCN seems to assume that is the max, and other similar CU masks added for Vega have the same length.
 
This clearly tells us that DPP operates outside the common 4-cycle cadence (no visible latency). Unlike all other instructions, DPP needs manual wait states.
It would be pipelined in, so while one instruction is executing for 4 cycles, the data for the next is being prepared at least one clock ahead. How this occurs is unclear, but it would be a design decision. Every permutation could occur in a single clock if they had transistors to burn, each permutation being an instruction. The cadence and contention from other SIMDs are likely what would require the waits for any permutation not approaching random.

An SGPR read should be able to occur without a wait if the data is aligned properly. I don't believe any functionality like that was exposed, as it breaks the cadence. SGPRs are likely just VGPRs using an extra read port with the added ability to address the lanes. The problem being that the actual address may fall into the RF domain of another SIMD, requiring some synchronization as the bank may not be active.

Setting the mask predates Fiji; it is done for anything GCN2 and newer, regardless of the CU count per SE. The setup for GCN seems to assume that is the max, and other similar CU masks added for Vega have the same length.
Sixteen may be the theoretical max CUs per SE, which is what I was getting at. Workgroups might also make sense as they haven't changed much. What hardware was using the 0xfffe? I'd guess the setting is along the lines of no more than one wave/workgroup per CU of geometry. Maybe 2 on more robust hardware or CAD/pro products.
 
DPP needs 2 wait cycles before data can be used. See here: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/. This clearly tells us that DPP operates outside the common 4-cycle cadence (no visible latency). Unlike all other instructions, DPP needs manual wait states. The compiler needs to ensure this. DPP also needs 5 wait cycles in the case of the execution mask and SGPRs.
Earlier discussions about the 3-read 1-write cadence had some speculation on the writeback being skewed into a later cycle. Having 2 vector issues of latency for DPP may indicate that it is skewed by that many vector issue cycles.
The 5 wait cycles for EXEC may align with the other 5-cycle wait state for EXECZ and VCCZ sourcing. It is possible there's a gap in the forwarding for those as well.
5-cycle waits for non-forwarded execution values seem to show the round trip from the stages generating them to their changes reaching the upstream stage that uses them. That makes the register file the final point of communication for side units.

DPP may still operate in the 4-cycle cadence, given the permutations that either stick to within a quad, use the 1-value broadcast path, or are bound by the 16-wide waves. It may need to fit the cadence if only to prevent it from obstructing the non-dependent work being done in its shadow. What this may mean is that the SIMD's sequencing logic detects instruction issue and when it executes, which isn't the same as whether it's done forwarding/writing back. If new units are added in the SIMD, they can be slotted into the issue logic with a counter or readiness flag, but the pipeline may not provision for them disrupting the VALU.

It would be pipelined in, so while one instruction is executing for 4 cycles, the data for the next is being prepared at least one clock ahead. How this occurs is unclear, but it would be a design decision. Every permutation could occur in a single clock if they had transistors to burn, each permutation being an instruction.
Even if there were transistors to burn, that doesn't give unlimited space for the wires, and maybe not the capacity to freely re-architect the execution loop of the vector unit.
The DPP path's wait states seem like they have more to do with it not being able to fit into the VALU's forwarding paths.
The lack of transistors and the constraints on how many permutations can fit into a cycle in the space of a subsection of a SIMD, plus trying not to use too much encoding space for control masks, may explain the restrictions on the combinations.

SGPRs are likely just VGPRs using an extra read port with the added ability to address the lanes.
The scalar and vector units and their register files are physically distinct elements in the die shots.

What hardware was using the 0xfffe?
Going by the math, it's a GPU with 4 or fewer usable CUs per shader array, or a salvage die that somehow falls to that level. This seems to be mostly a concern for the low-end APUs.
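For concreteness, here is the quoted Mesa condition run over a few made-up part configurations (the names and CU/SE counts below are illustrative, not real product specs), showing where the 4-CUs-per-shader-array threshold falls:

#include <stdio.h>

/* Hypothetical configs run through the condition quoted from the Mesa patch:
 * num_good_compute_units / (max_se * max_sh_per_se) <= 4.
 * The part names and counts are illustrative only. */
struct cfg { const char *name; unsigned cus, se, sh_per_se; };

int main(void)
{
    struct cfg cfgs[] = {
        { "small APU",       4, 1, 1 },  /* 4 CUs per array  */
        { "salvage part",    8, 2, 1 },  /* 4 CUs per array  */
        { "Fiji-like dGPU", 64, 4, 1 },  /* 16 CUs per array */
    };

    for (unsigned i = 0; i < sizeof(cfgs) / sizeof(cfgs[0]); i++) {
        unsigned per_array = cfgs[i].cus / (cfgs[i].se * cfgs[i].sh_per_se);
        printf("%-14s %2u CUs/array -> %s\n", cfgs[i].name, per_array,
               per_array <= 4 ? "low-resource path (VS allowed on CU0, 0xffff)"
                              : "normal path (CU0 kept free of VS, 0xfffe)");
    }
    return 0;
}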
 