AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

The only way to get twice the FP32 throughput per clock is if the SIMDs got chained together and executed consecutive instructions in a single cycle. I wouldn't call simply doubling the number of ALUs "twice the throughput". If that were the claim, it should mean each ALU doubled its throughput. Or they are quoting native FP64 performance, which doesn't appear to be the case.
They specifically talked about the throughput of one NCU compared to one traditional CU with 64 SPs. And doubling the number of SPs is the simple way of doing it. How they would feed a higher number of SPs and how they are organized, no idea. Could be dual issue to two separate vector ALUs each clock. Or something more out of the ordinary like dual issue to the same vALU and computing over 8 cycles to match a round robin scheme over 8 vALUs (I don't think this will happen) or something else. There is also the possibility they can somehow fuse certain combinations of ops in the scheduler and issue the fused ops (meaning the higher throughput is only usable for relatively specific cases). This is still unknown right now.
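
To put rough numbers on the options above, here is a back-of-the-envelope sketch. The baseline is the known GCN layout (4 SIMD16 units per CU, 64-wide wavefronts, an FMA counted as 2 FLOPs). The two "NCU" layouts are pure assumptions for illustration, not anything AMD has confirmed, and the function names are made up for this example.

```python
WAVEFRONT_SIZE = 64  # work-items per GCN wavefront


def fp32_flops_per_clock(simds, lanes_per_simd, flops_per_op=2):
    """Peak FP32 FLOPs per clock for one CU; an FMA counts as 2 FLOPs."""
    return simds * lanes_per_simd * flops_per_op


def cycles_per_wavefront_op(lanes_per_simd):
    """Cycles one SIMD needs to push a full wavefront through one instruction."""
    return WAVEFRONT_SIZE // lanes_per_simd


# Known GCN CU: 4 x SIMD16 = 64 SPs.
gcn_cu = fp32_flops_per_clock(simds=4, lanes_per_simd=16)

# Hypothetical layouts that would both double the SP count to 128.
ncu_eight_simd16 = fp32_flops_per_clock(simds=8, lanes_per_simd=16)
ncu_four_simd32 = fp32_flops_per_clock(simds=4, lanes_per_simd=32)

print("GCN CU:", gcn_cu, "FLOPs/clk,", cycles_per_wavefront_op(16), "cycles per wavefront op")
print("8 x SIMD16 (assumed):", ncu_eight_simd16, "FLOPs/clk,", cycles_per_wavefront_op(16), "cycles")
print("4 x SIMD32 (assumed):", ncu_four_simd32, "FLOPs/clk,", cycles_per_wavefront_op(32), "cycles")
```

Both assumed layouts land on the same peak number, so the "2x per clock" claim alone cannot tell them apart. What differs is how many wavefronts the scheduler has to keep in flight to fill the lanes.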
 
New slides have been coming up:

http://cdn.videocardz.com/1/2017/01/AMD-VEGA-VIDEOCARDZ-37.jpg
11 polygons with 4 geometry engines?
Wut?
Up to 11 triangles get clipped/rejected per clock, I guess. Up to now, the geometry throughput of AMD GPUs hasn't changed that much depending on the visibility of the triangles.
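
A quick sanity check on why 11 looks odd next to 4 geometry engines, using only the two figures mentioned in this thread (11 per clock from the slide, 4 per clock for Fiji). The variable names are made up for this sketch.

```python
from fractions import Fraction

GEOMETRY_ENGINES = 4
VEGA_PEAK_POLYS_PER_CLOCK = 11  # "up to" figure from the slide
FIJI_TRIS_PER_CLOCK = 4         # figure quoted in this thread for Fiji

per_engine = Fraction(VEGA_PEAK_POLYS_PER_CLOCK, GEOMETRY_ENGINES)
vs_fiji = Fraction(VEGA_PEAK_POLYS_PER_CLOCK, FIJI_TRIS_PER_CLOCK)

print("per engine per clock:", float(per_engine))
print("relative to Fiji:", float(vs_fiji))
print("integer per-engine rate:", per_engine.denominator == 1)
```

11 does not divide evenly by 4, so it is hard to read as a fixed per-engine rate. That fits the guess that it is a peak cull/reject rate rather than a rate of rendered triangles.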
 
Edited my response earlier, but it could be FMA4 style instructions with 4 operands. With all the packed math being performed that would make a lot of sense.

EDIT: It would also work well with that scalar-per-SIMD design I was theorizing. When not using the 4th operand, it could feed 16x4 scalar registers into L0 registers for a scalar. Translating the opcodes to do that shouldn't be difficult. Bulldozer had the FMA4 instructions, and I think GCN had the extra operands, but they were used to feed the single scalar or move data around.
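
To make the FMA4 plus packed math idea concrete, here is a minimal sketch of what a 4-operand packed FP16 FMA would compute: d = a * b + c on two half-precision lanes stored in one 32-bit register, with the destination separate from the three sources. This is an illustration of the concept only, not Vega's actual ISA, and all function names are hypothetical.

```python
import struct


def pack_half2(lo, hi):
    """Pack two FP16 values into one 32-bit word (lo in the low 16 bits)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word):
    """Split a 32-bit word back into its two FP16 lanes as Python floats."""
    return struct.unpack("<2e", struct.pack("<I", word))


def packed_fma4(a, b, c):
    """Return d = a * b + c per FP16 lane; the sources a, b, c are not overwritten.

    The intermediate is computed in double precision and rounded once when
    repacked, which only approximates a true fused multiply-add.
    """
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    c_lo, c_hi = unpack_half2(c)
    return pack_half2(a_lo * b_lo + c_lo, a_hi * b_hi + c_hi)


a = pack_half2(1.5, 2.0)
b = pack_half2(2.0, 0.5)
c = pack_half2(0.25, 1.0)
d = packed_fma4(a, b, c)
print(unpack_half2(d))
```

One 32-bit register operation here does two FMAs, which is how packed math can double the per-lane rate for FP16 without adding lanes.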
 
I'm more curious about the ROPs and geometry (the strange 11 triangles instead of 4 on Fiji seems nice). Fiji was already a "compute" monster, imo...
 
It is quite interesting that they turn the local graphics memory into a cache. But it remains to be seen whether it is a page table and software magic (like Linux VMM) or a real hardware cache.
 