Yeah I noticed the mismatch in GFLOP count. The SIMD unit must be different but still GPU-based, and maybe there's some overhead when loading/switching the program and the wavefronts? Or do they need to wait for certain inputs from the GPU/CPU?
There would be overhead in switching wavefronts because time is needed to load the necessary context for a new kernel into the CU's general-purpose and system registers. For a standard CU, part of that transfer comes from the GPU's dispatch pipeline, while the rest comes from initialization code compiled into the shader itself. Part of the recommendation for running more than the minimum number of wavefronts is to help hide spin-up periods like that. The level of parallelism described for Tempest amounts to almost no concurrency from the developer's standpoint. Perhaps Sony expects the audio workload to consistently utilize what the developer cannot, or the developer's fraction is whatever scraps the system reserve would consider lost to switch overhead anyway.
The AMD patent referenced earlier describes a custom CU with persistent wavefronts, which would remove a good chunk of the launch overhead. That may be consistent with the idea that high utilization can be managed with such a limited number of wavefronts.
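As a rough illustration of why persistence helps (just a CPU-side analogy in Python, not the patent's actual mechanism): a worker launched once that pulls work from a queue pays its setup cost a single time, while a worker relaunched per task pays it on every dispatch.

```python
import queue
import threading
import time

SETUP_COST = 0.001  # hypothetical per-launch context-load cost, in seconds

def run_task(task):
    pass  # stand-in for the actual audio kernel work

def relaunch_per_task(tasks):
    """Launch-per-task model: the setup cost is paid on every dispatch."""
    for task in tasks:
        time.sleep(SETUP_COST)  # context load / spin-up
        run_task(task)

def persistent_worker(tasks):
    """Persistent model: one launch, then work is pulled from a queue."""
    q = queue.Queue()
    for task in tasks:
        q.put(task)
    q.put(None)  # sentinel to stop the worker

    def worker():
        time.sleep(SETUP_COST)  # setup paid once
        while (task := q.get()) is not None:
            run_task(task)

    t = threading.Thread(target=worker)
    t.start()
    t.join()

for fn in (relaunch_per_task, persistent_worker):
    start = time.perf_counter()
    fn(range(200))
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f} s")
```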
As for the 20% penalty, it sounds too high. Wondering if Leadbetter misspoke.
It's possible. 20 GB/s seems like a reasonable number, but is a worst-case 4% consumption enough of a problem to require special mention?
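For reference, 20 GB/s against the commonly quoted 448 GB/s of total GDDR6 bandwidth (my figure, not something stated above) works out to roughly that 4%:

```python
tempest_bw_gbs = 20.0   # hypothetical worst-case Tempest streaming load
system_bw_gbs = 448.0   # often-quoted total GDDR6 bandwidth (assumption)
print(f"{tempest_bw_gbs / system_bw_gbs:.1%}")  # -> 4.5%
```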
Saving CPU and GPU resources?
Maybe some of it, if the developer is that short on resources. The 3D audio/system wavefront would consume a large fraction of Tempest's throughput. 8 Zen 2 cores at 3.5 GHz are just shy of 0.9 TF, and the GPU is just short of 10.3 TF.
If the developer could use all of Tempest, that's 11% of CPU peak and 1% of the GPU, but I don't think the system-reserved features would allow that. With a likely single-digit percentage of the CPU's capability and less than a percent of the GPU's, would developers be so tightly resource-constrained that they would look to Tempest? Programming for it is going to be a third architectural target and a source of implementation complexity, so does this modest amount of compute have specific advantages that make it more appealing than finding a fraction of a percent in GPU idle time?
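Back-of-the-envelope math behind those figures; the per-core FLOP width and GPU configuration are standard Zen 2/RDNA peak assumptions on my part, and the ~0.1 TF Tempest number is only what the 11%/1% fractions imply:

```python
# Peak FP32 throughput, assuming Zen 2's 2x 256-bit FMA pipes per core
# and the announced 36 CU, 2.23 GHz GPU configuration.
cpu_tf = 8 * 3.5e9 * 32 / 1e12        # cores * clock * FLOPs/cycle ~= 0.90 TF
gpu_tf = 36 * 64 * 2 * 2.23e9 / 1e12  # CUs * lanes * FMA * clock  ~= 10.28 TF

tempest_tf = 0.1                      # assumption implied by the 11% / 1% figures
print(f"CPU {cpu_tf:.2f} TF, GPU {gpu_tf:.2f} TF")
print(f"Tempest vs CPU: {tempest_tf / cpu_tf:.0%}, vs GPU: {tempest_tf / gpu_tf:.1%}")
```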
I think just having a dedicated unit means that studios can't re-prioritize the silicon away from audio, and that in itself seems ideal.
If Sony were really only concerned about developer priorities, they could have reserved standard CUs for themselves. TrueAudio Next makes provision for CU reservation, where a developer can set aside CUs before graphics shaders get access to them. It would seem a straightforward extension for Sony to take such a reservation first without going through the effort of modifying the architecture. That they did change the architecture indicates there are capabilities missing from, or possible downsides to, what's already there.
What if it's based on a CDNA CU, even? Modern and compute-focused (and I know nothing about it).
Details aren't confirmed at this time, but there's a good chance that CDNA includes an additional vector unit with matrix multiplication capabilities and a wide range of formats.
While it's possible Sony would only mention the normal vector units, since the matrix operations are more specialized and many of the formats have lower precision, that would be neglecting a large number of operations per clock.
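To put a rough number on how much could go unmentioned (the tile size here is purely illustrative, not a confirmed CDNA instruction shape): even a modest matrix-multiply instruction dwarfs a 64-lane vector FMA in operations per issue.

```python
vector_fma_ops = 64 * 2      # one wave-wide FMA: 64 lanes, mul + add
tile = 16                    # hypothetical 16x16x16 matrix tile per instruction
matrix_ops = 2 * tile ** 3   # 2*N^3 multiply-accumulate ops for an NxN tile
print(vector_fma_ops, matrix_ops, matrix_ops // vector_fma_ops)  # 128 8192 64
```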