No problem, thanks.
https://forum.beyond3d.com/posts/2177723/
The post above discusses using 2 scan converters per raster unit: a coarse one, and a finer one for smaller triangles. I don't expect high performance for triangles the size of a single fragment. However, I do expect better performance than RDNA1, and it's a step in the right direction.
I'm not up to date on the thread, but there was talk of benchmarks.
Yes, sure. Capabilities at the unit level are building blocks for a variety of configurations. The unit block itself and its capabilities get upgraded over time. Here we see differences at the unit block level between XSX and RDNA2.
I wouldn't be surprised one bit if PS5's building blocks use much the same CUs as XSX. However, we still haven't seen a detailed block diagram of PS5 like we have for XSX and Navi21.
One other thing that baffles me is that Cerny discussed small triangles in his presentation. If PS5 doesn't use the RDNA2 Raster Units, that is a missed opportunity, given that he had access to them.
XSX has 64 ROPs.
From the earlier Hot Chips diagram, the yellow arrow highlights 116 Gpix/s:
64x1.825 = 116.8 Gpix/s
The driver leak lists 4 RBs per Shader Engine for XSX/Navi21 Lite. XSX has 2 Shader Engines, so 8 RBs in total. Each RDNA2 RB+ outputs 8 pixels per cycle:
8 RBs x 8 pixels = 64 pixels per cycle, which matches 64 ROPs and XSX's pixel fillrate.
4 triangles per clock x 16 fragments per triangle = 64 fragments per clock to match 64 ROPs.
4 Raster Units and 8 RB+ units align for XSX.
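The arithmetic above can be checked with a short sketch. The numbers (1.825 GHz, 2 Shader Engines, 4 RBs per Shader Engine, 8 pixels per RB+, 4 triangles per clock, 16 fragments per triangle) are the ones quoted in this post. The function and variable names are my own and purely illustrative, they don't come from any driver or SDK.

```python
# Sanity check of the XSX fillrate figures quoted in this post.
# All names are illustrative; none are taken from a driver or SDK.

GPU_CLOCK_GHZ = 1.825        # XSX GPU clock
NUM_SHADER_ENGINES = 2       # XSX Shader Engines
RBS_PER_SE = 4               # num_rb_per_se for Navi21 Lite in the leak
PIXELS_PER_RB = 8            # RDNA2 RB+ output per cycle
TRIANGLES_PER_CLOCK = 4      # one per Raster Unit
FRAGMENTS_PER_TRIANGLE = 16  # coarse rasterisation rate per triangle


def pixels_per_clock(num_se, rbs_per_se, pixels_per_rb):
    """Total pixels the render backends can output each cycle."""
    return num_se * rbs_per_se * pixels_per_rb


def fillrate_gpix(pixels_per_clk, clock_ghz):
    """Peak pixel fillrate in Gpix/s."""
    return pixels_per_clk * clock_ghz


rops = pixels_per_clock(NUM_SHADER_ENGINES, RBS_PER_SE, PIXELS_PER_RB)
raster_fragments = TRIANGLES_PER_CLOCK * FRAGMENTS_PER_TRIANGLE
fillrate = fillrate_gpix(rops, GPU_CLOCK_GHZ)

print(f"RB output:     {rops} pixels per clock")
print(f"Raster output: {raster_fragments} fragments per clock")
print(f"Fillrate:      {fillrate:.1f} Gpix/s")
```

Both the RB side and the raster side come out at 64 per clock, which is the point of the comparison: neither end of the pipeline is the odd one out.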
Okay, thanks. As mentioned earlier, I don't think there's much of a difference between RDNA1 and RDNA2 CUs. They seem to be tweaked for their respective RDNA1 or RDNA2 Raster Units.
Yes, you can look at it that way. However, the workload is computed on the wide SIMD units that make up the CUs, so those are a more appropriate measure.
Fast and narrow, or wide and slow are relative terms. You can already see preferences between Nvidia and AMD high-end GPUs.
Look at an RTX 3090: it has more than 10000 CUDA Cores, whereas AMD's flagship 6900 XT has just over 5000 Shader Cores. Nvidia is already wider. AMD is relatively narrow, but clocks much faster than the slower-clocked Nvidia flagship.
We already see AMD as fast and narrow, relative to Nvidia being wide and slow.
You can see the bus widths in the driver leak below.
The num_tccs entry gives you the number of 16-bit memory channels.
Navi21 Lite, 20x16 = 320 bit
Navi21, 16x16 = 256 bit
Navi22, 12x16 = 192 bit
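As a rough sketch of that calculation: the num_tccs values are the ones from the leak, and the 16-bit channel width is the reading used above. The helper name is made up for illustration.

```python
# Bus width from the num_tccs entry, assuming one 16-bit channel per TCC
# (the interpretation used in this post). Names are illustrative only.

CHANNEL_WIDTH_BITS = 16

NUM_TCCS = {
    "Navi21 Lite": 20,
    "Navi21": 16,
    "Navi22": 12,
}


def bus_width_bits(num_tccs, channel_bits=CHANNEL_WIDTH_BITS):
    """Memory bus width in bits for a given number of channels."""
    return num_tccs * channel_bits


bus_widths = {chip: bus_width_bits(n) for chip, n in NUM_TCCS.items()}

for chip, width in bus_widths.items():
    print(f"{chip}: {width} bit")
```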
Code:
Property                   Navi10  Navi14  Navi12  Navi21Lite  Navi21  Navi22  Navi23  Navi31
num_se                     2       1       2       2           4       2       2       4
num_cu_per_sh              10      12      10      14          10      10      8       10
num_sh_per_se              2       2       2       2           2       2       2       2
num_rb_per_se              8       8       8       4           4       4       4       4
num_tccs                   16      8       16      20          16      12      8       16
num_gprs                   1024    1024    1024    1024        1024    1024    1024    1024
num_max_gs_thds            32      32      32      32          32      32      32      32
gs_table_depth             32      32      32      32          32      32      32      32
gsprim_buff_depth          1792    1792    1792    1792        1792    1792    1792    1792
parameter_cache_depth      1024    512     1024    1024        1024    1024    1024    1024
double_offchip_lds_buffer  1       1       1       1           1       1       1       1
wave_size                  32      32      32      32          32      32      32      32
max_waves_per_simd         20      20      20      20          16      16      16      16
max_scratch_slots_per_cu   32      32      32      32          32      32      32      32
lds_size                   64      64      64      64          64      64      64      64
num_sc_per_sh              1       1       1       1           1       1       1       1
num_packer_per_sc          2       2       2       2           4       4       4       4
num_gl2a                   N/A     N/A     N/A     4           4       2       2       4
unknown0                   N/A     N/A     N/A     N/A         10      10      8       10
unknown1                   N/A     N/A     N/A     N/A         16      12      8       16
unknown2                   N/A     N/A     N/A     N/A         80      40      32      80
num_cus (computed)         40      24      40      56          80      40      32      80
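The "num_cus (computed)" row isn't explained in the leak itself. My reading is that it is num_se x num_sh_per_se x num_cu_per_sh, and a quick sketch shows that this reproduces every column. The values are copied from the table; the variable names are only for illustration.

```python
# Check that num_cus (computed) = num_se * num_sh_per_se * num_cu_per_sh.
# The formula is an assumption; it is not stated in the leak itself.

CHIPS = ["Navi10", "Navi14", "Navi12", "Navi21Lite",
         "Navi21", "Navi22", "Navi23", "Navi31"]
NUM_SE = [2, 1, 2, 2, 4, 2, 2, 4]
NUM_SH_PER_SE = [2, 2, 2, 2, 2, 2, 2, 2]
NUM_CU_PER_SH = [10, 12, 10, 14, 10, 10, 8, 10]
LISTED_NUM_CUS = [40, 24, 40, 56, 80, 40, 32, 80]

# Shader Engines x Shader Arrays per Engine x CUs per Array
computed = {
    chip: se * sh * cu
    for chip, se, sh, cu in zip(CHIPS, NUM_SE, NUM_SH_PER_SE, NUM_CU_PER_SH)
}

for chip, listed in zip(CHIPS, LISTED_NUM_CUS):
    status = "ok" if computed[chip] == listed else "MISMATCH"
    print(f"{chip:<11} computed={computed[chip]:<3} listed={listed:<3} {status}")
```

Navi21 Lite comes out at 56 CUs, which lines up with the XSX die (52 of them enabled).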