You got it wrong. It's 512kB in total, or 16kB per VLIW group of ALUs. There are apparently 32 of these groups (32*5=160 ALUs), so the aggregate register file size is 32*16kB=512kB. All VLIW GPUs AMD ever produced have these 16kB of registers per VLIW group (I think some call them "thread processors" or whatever).
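For anyone who wants to check that arithmetic, here is a minimal Python sketch. It assumes only the figures quoted in the post (5 ALUs and 16kB of registers per VLIW group); the function and constant names are made up for illustration.

```python
# Figures taken from the post above; names are hypothetical.
KB_PER_VLIW_GROUP = 16      # register file per VLIW group, in kB
ALUS_PER_VLIW5_GROUP = 5    # ALUs in one VLIW5 group


def aggregate_register_file_kb(total_alus,
                               alus_per_group=ALUS_PER_VLIW5_GROUP,
                               kb_per_group=KB_PER_VLIW_GROUP):
    """Return the total register file size in kB for a VLIW part."""
    groups, remainder = divmod(total_alus, alus_per_group)
    if remainder:
        raise ValueError("ALU count is not a whole number of VLIW groups")
    return groups * kb_per_group


# 160 ALUs -> 32 groups -> 32 * 16kB
print(aggregate_register_file_kb(160))  # 512
```

The per-group size stays fixed; only the number of groups changes the total.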
Yeah, I've been swimming in discombobulated info in my head... I was combining that with the info about there being four of those per VLIW unit. So four 16kB register sets per VLIW5 unit is what I interpreted from all that. I assumed it was 4 times whatever was considered normal, so I treated the normal amount as a single set, 'n' in my mind, giving 4n. Otherwise I just don't see any reason for the information to have been stated in the first place, and we have heard multiple times that it has an abnormally large register count for the line it's based on.
What? As I wrote, RV770 could handle up to 256 wavefronts (which would be 16384 "threads", or better, work items). The 192 number being tossed around has to be some count of how many wavefronts the command processor of Latte can handle. But 192 still appears like a lot compared to the 256 of RV770. Or it is a bit inflexible and separate limits exist for vertex, geometry and pixel shaders (64 tops for each, which would be a somewhat reasonable count), as bgassassin may have implied.
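To put the wavefront counts in work-item terms, a quick Python sketch. The 64 work items per wavefront is the ratio implied by the post (256 wavefronts = 16384 work items); the names are hypothetical.

```python
# 64 work items per wavefront, as implied by 256 wavefronts = 16384 work items.
WORK_ITEMS_PER_WAVEFRONT = 64


def work_items(wavefronts, per_wavefront=WORK_ITEMS_PER_WAVEFRONT):
    """Convert a wavefront count into a work-item ("thread") count."""
    return wavefronts * per_wavefront


print(work_items(256))  # 16384, the RV770 figure
print(work_items(192))  # 12288, the number quoted for Latte
print(work_items(64))   # 4096, per shader type if the limit is 3 x 64
```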
Without the appropriate context, it could mean almost anything. I mean, the VLIW architectures all have this 4-way banked register file structure. And for each physical unit there are always four sets of registers for the four work items processed by one unit over 4 cycles (the latency is actually 8 cycles and two wavefronts are always executed in an interleaved manner, but anyway).
The E6760 has just 6 SIMDs (480 ALUs), so the aggregate size of the register files is 1.5MB, but per unit it is the same as every other VLIW GPU (including Latte).
Ha ha, yeah, my bad. Earlier in the thread I found out Nintendo has decided to refer to warps as threads now. I then commented on how horribly confusing that must be, and then spread horrible confusion.
Yes, 192 warps, 12,000-something threads. I believe you stated before that this was strangely high for a GPU the size of Latte? I agree with that; I figure there must be a reason why. The context is clearly what's missing, and that's what I'm spitballing to try to get a handle on.
From what I gathered of BG's info, the geometry shaders, if utilized, took 36 warps from the max warp count, while the pixel and vertex shaders shared the remaining 156 warps (no structure was given). When geometry shaders are disabled, they had full access to all 192.
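As a rough illustration of that split, a small Python sketch. The 192 and 36 figures are the ones quoted from BG's info; how the remainder is divided between pixel and vertex shaders was not given, so the sketch only computes the shared pool. All names are hypothetical.

```python
MAX_WARPS = 192            # global limit quoted in the thread
GEOMETRY_RESERVED = 36     # warps taken by geometry shaders when in use


def pixel_vertex_pool(geometry_enabled):
    """Warps left for pixel and vertex shaders to share."""
    reserved = GEOMETRY_RESERVED if geometry_enabled else 0
    return MAX_WARPS - reserved


print(pixel_vertex_pool(True))   # 156
print(pixel_vertex_pool(False))  # 192
```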
It does seem like an awful lot, but according to this link (in Appendix D), even 80-shader parts have the 192 global limit. AMD must scale it funky, or somehow lower-shader parts can benefit from the spare wavefronts queued up in cache?
http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Could the rumoured extra register memory on Latte be used with those queued-up warps to move dependencies out of the ALUs, move in a warp without dependencies from the queue so that the ALUs can operate at or closer to peak use, and then insert the formerly dependent warp back in once it's been resolved?