> Didn't Spencer say all Xbox exclusives would be getting a dual-release between Xbox and UWP from now on?
I think it was later clarified that most first-party titles, not necessarily all.
> Didn't Spencer say all Xbox exclusives would be getting a dual-release between Xbox and UWP from now on?
No, he said that games will go where it makes sense.
> Crackdown is a W10 game as well.
Oh well, that could change by E3, haha.
> I can't tell what you mean by "execution" here and why you say "four" instructions are in execution at once. Perhaps you're referring to decode, operand fetch, compute and resultant write?
3 cycles to fetch input registers, one cycle to execute (ALU) for each 16-wide part of the instruction (sub-instruction). Sorry for the ambiguous use of terms. I used "instruction" instead of "sub-instruction". I meant that four 16-wide sub-instructions are processed simultaneously. Only two actual (64-wide) instructions interleave each other. Each 16-wide part of the previous instruction is finished one clock cycle before the same 16-wide part of the next instruction fetches its first operand. I am not sure whether it works like this, but it would pipeline perfectly without any bubbles. The execution unit would be fed all the time and the register files would serve one load (or store) per cycle. Disclaimer: I am not a hardware engineer, so this mental model might be totally wrong.
The logic to compute an instruction runs for four clocks. After that, another instruction runs for four clocks. Therefore at any point, the 4-clock computation logic is working on either one or two instructions. Labelling the logic stages as A, B, C and D:
A - instruction 2, work items 0 to 15
B - instruction 1, work items 48 to 63
C - instruction 1, work items 32 to 47
D - instruction 1, work items 16 to 31
I want to check what you're saying before going any further.
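For what it's worth, the A/B/C/D table above can be written down as a tiny scheduling model. This is only a sketch of the schedule being described, assuming back-to-back issue with one 16-wide slice entering the first stage every clock; the stage names are just labels and nothing in it is a claim about what the real pipeline stages do.

#include <stdio.h>

int main(void)
{
    const char stage_name[4] = { 'A', 'B', 'C', 'D' };

    /* One 16-wide slice of a 64-wide instruction enters stage A each
     * clock and advances one stage per clock, so at any instant the
     * four stages hold slices of at most two consecutive instructions. */
    for (int clock = 4; clock < 8; clock++) {
        printf("clock %d:\n", clock);
        for (int s = 0; s < 4; s++) {
            int issued = clock - s;   /* clock this slice entered stage A */
            int instr  = issued / 4;  /* which 64-wide instruction it is  */
            int slice  = issued % 4;  /* which 16-wide quarter of it      */
            printf("  %c - instruction %d, work items %2d to %2d\n",
                   stage_name[s], instr + 1, slice * 16, slice * 16 + 15);
        }
    }
    return 0;
}

At clock 4 this prints exactly the A/B/C/D assignment listed above, and every later clock keeps all four stages occupied, which is the no-bubbles case described earlier.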
+ case CHIP_FIJI:
+ adev->gfx.config.max_shader_engines = 4;
+ adev->gfx.config.max_tile_pipes = 16;
+ adev->gfx.config.max_cu_per_sh = 16;
+ adev->gfx.config.max_sh_per_se = 1;
+ adev->gfx.config.max_backends_per_se = 4;
+ adev->gfx.config.max_texture_channel_caches = 8;
+ adev->gfx.config.max_gprs = 256;
+ adev->gfx.config.max_gs_threads = 32;
+ adev->gfx.config.max_hw_contexts = 8;
https://patchwork.kernel.org/patch/6932301/
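As a quick sanity check (just arithmetic on the values in the patch, not necessarily how the driver combines them), the three CU-related counts multiply out to the expected full-die total for Fiji:

#include <stdio.h>

int main(void)
{
    /* Values from the CHIP_FIJI case above. */
    int max_shader_engines = 4;
    int max_sh_per_se      = 1;
    int max_cu_per_sh      = 16;

    /* 4 * 1 * 16 = 64, which is Fiji's CU count. */
    printf("max CUs = %d\n",
           max_shader_engines * max_sh_per_se * max_cu_per_sh);
    return 0;
}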
"SE" refers to "Shader Engine", but can't help you on SH or Tile_Pipeswhat are tile_pipes
and sh_per_se ?
"SE" refers to "Shader Engine", but can't help you on SH or Tile_Pipes
if (sscreen->b.info.num_good_compute_units /
(sscreen->b.info.max_se * sscreen->b.info.max_sh_per_se) <= 4) {
/* Too few available compute units per SH. Disallowing
* VS to run on CU0 could hurt us more than late VS
* allocation would help.
*
* LATE_ALLOC_VS = 2 is the highest safe number.
*/
Probably CU reservation for geometry to avoid saturating the pipeline.
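Plugging in numbers shows why that branch only matters on very small parts. The Fiji values come from the patch earlier in the thread; the 3-CU, single-SE part and the helper function are purely hypothetical, used here just to trip the condition.

#include <stdio.h>

/* Evaluate the check from the radeonsi snippet above for a given part. */
static void check_late_alloc(const char *name, int num_good_compute_units,
                             int max_se, int max_sh_per_se)
{
    int cu_per_sh = num_good_compute_units / (max_se * max_sh_per_se);

    printf("%s: %2d usable CUs per SH -> %s\n", name, cu_per_sh,
           cu_per_sh <= 4 ? "LATE_ALLOC_VS clamped to 2"
                          : "higher late-alloc limit allowed");
}

int main(void)
{
    check_late_alloc("Fiji (4 SEs, 1 SH/SE, 64 CUs)", 64, 4, 1);
    check_late_alloc("hypothetical 3-CU, 1-SE APU  ",  3, 1, 1);
    return 0;
}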
These values were covered a month or so ago, not too much new with them besides the similar pipeline structure.
Assuming "SH" is literally shaders, the 0xffff looks like a 16 bit execution mask. Fiji being 16, 4x16=64CUs and not scheduling across CUs. Although that should be coming with Vega and PS4 Pro according to interviews. Might also be workgroups.The value for determining if the CU is available or not looks to be 0xffff in that case, 0xfffe in the case where resources are more plentiful.
> 3 cycles to fetch input registers, one cycle to execute (ALU) for each 16-wide part of the instruction (sub-instruction).
3R1W would read 3 registers and write 1 in a single cycle. There may be a delay if swizzling within those registers as the data shifts according to the DPP patterns. Flipping along powers of two with write masks.
> 3R1W would read 3 registers and write 1 in a single cycle. There may be a delay if swizzling within those registers as the data shifts according to the DPP patterns. Flipping along powers of two with write masks.
DPP needs 2 wait cycles before data can be used. See here: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/. This clearly tells us that DPP operates outside the common 4-cycle cadence (no visible latency). Unlike all other instructions, DPP needs manual wait states. The compiler needs to ensure this. DPP also needs 5 wait cycles in the case of the execution mask and SGPRs. Clearly there's some lane-swizzle hardware operating in parallel to the register files, and it adds some additional latency. DPP is not the common case. Only a small minority of instructions use the DPP modifier. DPP was also added later (GCN3). The original GCN used the LDS crossbar for all cross-lane swizzles. There were no instructions that allowed you to directly access data from other lanes.
...so the DPP could fit 3 (probably more, but too m) shifts into a single clock cycle given limited patterns.
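To make "cross-lane" concrete, here is a small software model of a DPP-style row shift: lane i reads the value of lane i-1 within its own 16-lane row, and lanes whose source would fall outside the row get 0 (the bound_ctrl:0 behaviour, as I read the GPUOpen article linked above). This models only the data movement; on real hardware DPP is a modifier applied to a VALU instruction, and the wait-state rules discussed above still apply.

#include <stdio.h>

#define WAVE_SIZE 64
#define ROW_SIZE  16

/* row_shr-style shift by n lanes, confined to each 16-lane row. */
static void dpp_row_shr(const int *src, int *dst, int n)
{
    for (int lane = 0; lane < WAVE_SIZE; lane++) {
        int row_base = lane & ~(ROW_SIZE - 1);
        int from     = lane - n;

        dst[lane] = (from >= row_base) ? src[from] : 0;
    }
}

int main(void)
{
    int v[WAVE_SIZE], out[WAVE_SIZE];

    for (int i = 0; i < WAVE_SIZE; i++)
        v[i] = i;

    dpp_row_shr(v, out, 1);

    /* Lanes 0, 16, 32 and 48 have no in-row source and read 0;
     * every other lane sees its lower neighbour's value. */
    for (int i = 14; i < 19; i++)
        printf("lane %2d: %2d\n", i, out[i]);
    return 0;
}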
> I'm not clear on SH, but perhaps it is related to the shader array within an SE? It's a separate count in GCN, although it's been 1 at least most of the time.
You are probably right.
Not sure about this reference, though, where sh_per_se was set to 2 for some of the discretes in the past:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/radeon/si.c
> Assuming "SH" is literally shaders, the 0xffff looks like a 16-bit execution mask. Fiji being 16 (4x16 = 64 CUs) and not scheduling across CUs. Although that should be coming with Vega and PS4 Pro according to interviews. Might also be workgroups.
Setting the mask predates Fiji, and is set for anything GCN2 and newer, regardless of the CU count per SE. The setup for GCN seems to assume that is the max, and other similar CU masks added for Vega have the same length.
> This clearly tells us that DPP operates outside the common 4-cycle cadence (no visible latency). Unlike all other instructions, DPP needs manual wait states.
It would be pipelined in, so while one instruction is executing for 4 cycles the data for the next is being prepared at least one clock ahead. How this occurs is unclear, but would be a design decision. Every permutation could occur in a single clock if they had transistors to burn, each permutation being an instruction. The cadence and contention from other SIMDs are likely what would require the waits for any permutation not approaching random.
> Setting the mask predates Fiji, and is set for anything GCN2 and newer, regardless of the CU count per SE. The setup for GCN seems to assume that is the max, and other similar CU masks added for Vega have the same length.
Sixteen may be the theoretical max CUs per SE, which is what I was getting at. Workgroups might also make sense, as they haven't changed much. What hardware was using the 0xfffe? I'd guess the setting is along the lines of no more than one wave/workgroup per CU of geometry. Maybe 2 on more robust hardware or CAD/pro products.
> DPP needs 2 wait cycles before data can be used. See here: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/. This clearly tells us that DPP operates outside the common 4-cycle cadence (no visible latency). Unlike all other instructions, DPP needs manual wait states. The compiler needs to ensure this. DPP also needs 5 wait cycles in the case of the execution mask and SGPRs.
Earlier discussions about the 3-read 1-write cadence had some speculation on the writeback being skewed into a later cycle. Having 2 vector issues of latency for DPP may indicate that it is skewed by that many vector issue cycles.
> It would be pipelined in, so while one instruction is executing for 4 cycles the data for the next is being prepared at least one clock ahead. How this occurs is unclear, but would be a design decision. Every permutation could occur in a single clock if they had transistors to burn, each permutation being an instruction.
Even if there were transistors to burn, that doesn't give unlimited space for the wires, and maybe not the capacity to freely re-architect the execution loop of the vector unit.
> SGPRs are likely just VGPRs using an extra read port with the added ability to address the lanes.
The scalar and vector units and their register files are physically distinct elements in the die shots.
> What hardware was using the 0xfffe?
Going by the math, it's a GPU with 4 or fewer usable CUs per shader array, or a salvage die that somehow falls to that level. This seems to be mostly a concern for the low-end APUs.