[STATE MODES]
That just about wraps up the GPU specifications speculation. I could probably focus a bit on display output support, resolution, etc., but a lot of that can likely be assumed: 4K, 8K, HDMI of whatever standard is in place by then, the usual codecs, etc.
>There is an opportunity with future AMD hardware to figure out a way for relatively wider GPUs to dynamically scale saturation workloads down to a smaller cluster of CUs, proportionately increasing the clock frequency of those active hardware components while the inactive hardware components/CUs idle at a dramatically lower clock (sub-100 MHz) until they are needed for more work.
>This assumes that AMD can continue to scale GPU clock frequencies higher (4 GHz - 5 GHz) with future RDNA designs, provided they can make that work on smaller process nodes. Since any given cluster of the GPU would need to be able to clock this high, the entire GPU design must be able to clock in this range, potentially across the whole chip, for this to be feasible.
>Power delivery designs may also have to be reworked; a chiplet approach will help a lot here.
>This approach would be more suitable for products that need to squeeze out and scale performance across various workloads, support variable frequency (this is, essentially, variable frequency within portions of the GPU itself), and have to stay within a fixed power budget... such as a games console. It might therefore be less necessary (though potentially beneficial) for PC GPUs, where it would offer a different means of scaling clocks with workloads while giving more granular control of the GPU's power consumption.
>AMD's implementation would be based on Shader Array counts, so the loads would be adjusted per Shader Array. On chiplet-based designs, each chiplet would theoretically be its own Shader Array, so this is essentially a way of scaling power delivery between the multiple chiplets dynamically.
>This could be used in tandem with the already-established power budget sharing between the CPU and GPU seen in designs like PS5. In that case it would let the GPU keep using this particular feature for games that have lighter volume workloads but intense iteration workloads that could stress a given peak frequency. However, this should be minimal; its fuller use would be in the more traditional fashion, with full GPU volume workloads.
>Another benefit of State Mode is that when targeting power delivery to a smaller cluster of the GPU hardware and increasing the clock, clock-bound rates (pixel fillrate, instructions per second, primitives per second) see large gains, generally in inverse proportion to the decrease in active CU count. However, some other things, such as the total amounts of L0$ and L1$, will shrink, even if actual bandwidths scale better than linearly relative to the total active silicon.
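To make the State Mode idea above a bit more concrete, here's a minimal Python sketch of the scaling rule as described: park some Shader Arrays, raise the clock on the rest. The function names, the `scaling_efficiency` derating factor and the example numbers are all hypothetical, made up for illustration. Nothing here is an AMD mechanism or a figure from the speculation itself.

```python
def state_mode_clock(full_clock_mhz, total_arrays, active_arrays,
                     scaling_efficiency=1.0):
    """Estimate the clock of the Shader Arrays that stay active.

    Idealised rule from the text: the clock rises in proportion to the
    reduction in active hardware. scaling_efficiency is a HYPOTHETICAL
    derating factor (0 < e <= 1) standing in for non-linear clock/power
    scaling; 1.0 means perfectly proportional scaling.
    """
    if not 0 < active_arrays <= total_arrays:
        raise ValueError("active_arrays must be in 1..total_arrays")
    if not 0 < scaling_efficiency <= 1:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    ratio = total_arrays / active_arrays
    return full_clock_mhz * (1 + (ratio - 1) * scaling_efficiency)


def per_unit_gain(full_clock_mhz, state_clock_mhz):
    """Speed-up of a clock-bound rate on each unit that stays active."""
    return state_clock_mhz / full_clock_mhz


# Illustrative numbers only, not figures from the post.
FULL_CLOCK_MHZ = 2000.0
ideal = state_mode_clock(FULL_CLOCK_MHZ, total_arrays=4, active_arrays=2)
derated = state_mode_clock(FULL_CLOCK_MHZ, total_arrays=4, active_arrays=2,
                           scaling_efficiency=0.5)
print("ideal:", ideal, "MHz, per-unit gain", per_unit_gain(FULL_CLOCK_MHZ, ideal))
print("derated:", derated, "MHz, per-unit gain", per_unit_gain(FULL_CLOCK_MHZ, derated))
```

With `scaling_efficiency` at 1.0 the per-unit gain is exactly the inverse of the active fraction, which is the idealised case; anything below 1.0 models the clock being shaved to account for non-linear scaling.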
[PS6 - STATE MODE IMPLEMENTATION]
>SHADER ENGINES: 1
>SHADER ARRAYs (PER SE): 2
>CUs: 20
>SHADER CORES (PER CU): 128
>SHADER CORES (TOTAL): 2,560
>ROPs: 128
>Future RDNA chiplet designs will probably keep the back-end in its own block. However, for design reasons ROP allocation would likely scale evenly per chiplet cluster, so each chiplet (or SE, if an SE is essentially a chiplet) would have its own assigned group of ROPs. This equals 2x 64 ROPs for PS6.
>TMUs (PER CU): 8
>TMUs (TOTAL): 160
>MAXIMUM WORKLOAD THREADS: 20,480
>MAXIMUM GPU CLOCK: 4113.449 MHz (shaved off some clock from earlier calcs to account for non-linear clock scaling with power scaling)
>PRIMITIVES (TRIANGLES) PER CLOCK (IN/OUT): Up to 8 PPC (IN), up to 6 PPC (OUT)
>PRIMITIVES PER SECOND (IN/OUT): Up to 32.9 billion PPS (IN), up to 24.68 billion PPS (OUT)
>GIGAPIXELS PER SECOND: Up to 263.26 G/pixels per second (4113.449 MHz * 64 ROPs)
>INSTRUCTIONS PER CLOCK: 2
>INSTRUCTIONS PER SECOND: 8.226898 billion IPS
>RAY INTERSECTIONS PER SECOND: 658.151 G/rays per second (4113.449 MHz * 20 CUs * 8 TMUs)
>THEORETICAL FLOATING POINT OPERATIONS PER SECOND: 21.06 TF
>CACHES:
>L0$: 256 KB (per CU), 5.12 MB (total)
>L1$: 1 MB (per Dual CU), 10 MB (total)
>L2$: 24 MB
**Unified cache shared with both chiplets
>L3$: 192 MB
**Unified cache shared with both chiplets
>>TOTAL: 231.12 MB
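For anyone who wants to check the arithmetic behind the derived figures in the list above, here's a small Python sketch that recomputes them from the base specs. The inputs are copied from the list; the one thing not stated there is the factor of 2 FLOPs per shader core per clock (one fused multiply-add), which is an assumption, though it is the factor that gives the listed TF figure. The variable and function names are just for illustration.

```python
# Inputs copied from the spec list.
CLOCK_MHZ = 4113.449
CUS = 20
SHADER_CORES_PER_CU = 128
TMUS_PER_CU = 8
PRIMS_IN_PER_CLOCK = 8
PRIMS_OUT_PER_CLOCK = 6
INSTRUCTIONS_PER_CLOCK = 2
FILLRATE_ROPS = 64          # the fill-rate line multiplies the clock by 64 ROPs
L0_KB_PER_CU = 256
L1_MB_PER_DUAL_CU = 1
L2_MB = 24
L3_MB = 192

# ASSUMPTION (not stated in the post): 2 FLOPs per core per clock (one FMA).
FLOPS_PER_CORE_PER_CLOCK = 2


def derived_figures():
    """Recompute the per-second rates and cache totals from the base specs."""
    clock_ghz = CLOCK_MHZ / 1000
    shader_cores = CUS * SHADER_CORES_PER_CU
    tmus = CUS * TMUS_PER_CU
    # Decimal units (1 MB = 1000 KB), which is what yields 5.12 MB of L0$.
    l0_mb = CUS * L0_KB_PER_CU / 1000
    l1_mb = (CUS // 2) * L1_MB_PER_DUAL_CU   # one L1$ per Dual CU
    return {
        "shader_cores": shader_cores,
        "tmus": tmus,
        "prims_in_billion_per_s": PRIMS_IN_PER_CLOCK * clock_ghz,
        "prims_out_billion_per_s": PRIMS_OUT_PER_CLOCK * clock_ghz,
        "gigapixels_per_s": FILLRATE_ROPS * clock_ghz,
        "instructions_billion_per_s": INSTRUCTIONS_PER_CLOCK * clock_ghz,
        "ray_intersections_g_per_s": tmus * clock_ghz,
        "tflops": shader_cores * FLOPS_PER_CORE_PER_CLOCK * clock_ghz / 1000,
        "l0_total_mb": l0_mb,
        "l1_total_mb": l1_mb,
        "cache_total_mb": l0_mb + l1_mb + L2_MB + L3_MB,
    }


for name, value in derived_figures().items():
    print(f"{name}: {value}")
```

Changing `CLOCK_MHZ` or `CUS` at the top is a quick way to see how the clock-bound rates move if the State Mode clock or active CU count ends up different.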
Should be able to cover the CPU, memory, storage and audio stuff in the next post and then move on to Microsoft's possible design. If there's stuff anyone'd like to add on top or alter then do share, because a lot of these specifications I've come to settle on after having some discussion and insight from many people on these boards since posting.