Wii U hardware discussion and investigation *rename

Status
Not open for further replies.
Here is the piece that I have about that. GPRs per SIMD - 256 vec4s (4 x 32-bit components)


Also for those interested here is another post I made elaborating a little more on the 192 threads.

http://www.neogaf.com/forum/showpost.php?p=89811956&postcount=699

That's like 2 megabytes of registers? If it's not being used to go wider, what's its purpose? Am I missing something obvious?

TEVs in the Wii U? Eh... I think something like that would have shown up somewhere in all the build logs I've gone through.

Wait, in Nintendo-ese threads are now wavefronts?

192 warps is a pretty big difference from 192 threads. Isn't that a pretty confusing use of nomenclature? That's like 12,000-something threads over 4 clocks....

Ooh, geometry shaders, could those be used for some computational tasks like compute shaders? So, if so chosen, you can dedicate a warp to compute? Is this where Iwata was coming from?
 
There are more registers per shader compared with a normal R7xx design. 4x the amount it would seem (I think). This would probably explain why the shader blocks are larger than they should be for a 40nm fab. What would be the purpose of this? Compute perhaps? Compute doesn't seem like a good idea with only 160 shaders.

I doubt there is much (if any) legacy Hollywood stuff on the GPU other than the RAM pool identified on the die (the one that cannot be accessed by games). Nintendo themselves even said AMD was able to modify the new GPU to be compatible with the old. A frankenstein design with TEVs wouldn't make much sense anyways.
 
I doubt there is much (if any) legacy Hollywood stuff on the GPU other than the RAM pool identified on the die (the one that cannot be accessed by games).
There's RAM on the die that can't be accessed by games? What the hell!

Nintendo never ceases in its amazing ways to fuck up.
 
So there are too many registers, and too many threads, for what we believe to be the actual shader count.

What can make this make sense? Does that extra memory (4x????) really accelerate/salvage(?) the performance of 160 shaders enough to make it worthwhile? Is that even a plausible situation?

Or are we missing a puzzle piece....

Okay, what about this: we know the Wii U has multithreaded rendering. It's in use on Project CARS.

Direct3D got around GPU multithreading thread conflicts by adding secondary command buffers that could be saved and used later. We have 4x 'too many' registers; perhaps.... the main one (1), and a secondary buffer for each CPU thread (3)?

Yay? Nay? Scooby snacks?
 
Thats like 2 megabytes of registers?
No, just 512kB or 16 kB per VLIW unit (possessing 5 ALUs with the VLIW5) as with all of AMD's VLIW architectures.

Btw., a capacity of 192 wavefronts for the command processor sounds like an awful lot for such a small GPU. RV770 (which had 10 SIMDs) was able to handle 256 wavefronts, iirc.
 
There's RAM on the die that can't be accessed by games? What the hell!

Nintendo never ceases in its amazing ways to fuck up.

Not being usable by the game itself doesn't mean that memory isn't being used while the game is running. Nothing goes to waste, even the memory that's part of the old Wii GPU on the die.

Games get full control over the main 32 MB of on-die memory at 31.7 GB/s. Developers complain enough about having to manage two separate memory pools. Imagine the complaints about adding in a tiny 3rd pool of lower bandwidth memory.
 
No, just 512kB or 16 kB per VLIW unit (possessing 5 ALUs with the VLIW5) as with all of AMD's VLIW architectures.

Btw., a capacity of 192 wavefronts for the command processor sounds like an awful lot for such a small GPU. RV770 (which had 10 SIMDs) was able to handle 256 wavefronts, iirc.

kB?

I was referring to the total across the 2 SIMD banks/32 ALUs. The wholeeee enchilada.

So 512x32=16384kB=16 MB?? (So... no /8 for 2048 kB total? Are we sure.....? I feel reeeeally old now....) But that completely answers my question regardless. It's supposed to be like that for VLIW5, it's normal; I was under the impression it was rather high.

But the thread count IS abnormally high. Why do we have so many threads? Took a quick glance and confirmed the RV770 has 800 shaders, and only 1 more warp than what's been leaked for Latte. WTF?

That's weird, right? I know that's weird, it's been bugging the crap outta me. What's really weird is that it came from the most reliable source I can imagine. Named and everything. Guy worked on NFSMWU, and is currently doing Wii U work on Project CARS.... It should never have left the Project CARS forum, but it did, and here we are.
 
No, just 512kB or 16 kB per VLIW unit (possessing 5 ALUs with the VLIW5) as with all of AMD's VLIW architectures.

Apparently the SDK documentation has a diagram saying that it has "4 GPRs" per unit. Which seems to be a strange way to describe it. There is also this earlier leak:

http://beyond3d.com/showpost.php?p=1668212&postcount=2552

about the GPU...it is modeled on the R700 series, but it has significantly more GPRs. However, it seems to have fewer GPRs then the E6760, so...make your own conclusions

From somebody who was a reputable source, just probably not a technical one. There is still the issue of the shader blocks being larger (was it 90% larger?) than they should be for 160 shaders on a 40nm fab. Since we know that it is 160 shaders, could this 'mystery space' be taken up by larger registers ("more GPRs")?
 
Apparently the SDK documentation has a diagram saying that it has "4 GPRs" per unit. Which seems to be a strange way to describe it. There is also this earlier leak:

http://beyond3d.com/showpost.php?p=1668212&postcount=2552



From somebody who was a reputable source, just probably not a technical one. There is still the issue of the shader blocks being larger (was it 90% larger?) than they should be for 160 shaders on a 40nm fab. Since we know that it is 160 shaders, could this 'mystery space' be taken up by larger registers ("more GPRs")?

I suppose that's why I was under the impression it was so high. I recall something to that effect now. That and the fact that you can have things like 32 freaking megabytes embedded onto a microprocessor as an on-die L2, and that's on a machine that's considered low end by today's standards. Do you have any idea how many floppy disks that is? Madness.

I guess that pulls the shotgun away from the head of my multithreaded rendering secondary command buffers pitch?

Or maybe they are used to stash away dependencies so the ALU can still be used while the stored dependency is in time out waiting to be resolved?
 
kB?

I was referring to the total across the 2 SIMD banks/32 ALUs. The wholeeee enchilada.

So 512x32=16384kB=16 MB?? (So... no /8 for 2048 kB total? Are we sure.....? I feel reeeeally old now....) But that completely answers my question regardless. It's supposed to be like that for VLIW5, it's normal; I was under the impression it was rather high.
You got it wrong. It's 512kB in total or 16kB per VLIW group of ALUs. There are apparently 32 of these groups (32*5=160 ALUs), so the aggregate register file size is 32*16kB=512kB. All VLIW GPUs AMD ever produced have these 16kB registers per VLIW group (I think some call them "thread processors" or whatever).
But the thread count IS abnormally high. Why do we have so many threads? Took a quick glance and confirmed the RV770 has 800 shaders, and only 1 more warp than what's been leaked for Latte. WTF?
What? As I wrote, RV770 could handle up to 256 wavefronts (which would be 16384 "threads", or better, work items). The 192 number being tossed around has to be some count of how many wavefronts the command processor of Latte can handle. But 192 still seems like a lot compared to the 256 of RV770, or it is a bit inflexible and separate counts exist for vertex, geometry and pixel shaders (64 tops for each, that would be a somewhat reasonable count), as bgassassin may have implied.
Apparently the SDK documentation has a diagram saying that it has "4 GPRs" per unit. Which seems to be a strange way to describe it. There is also this earlier leak:

http://beyond3d.com/showpost.php?p=1668212&postcount=2552

From somebody who was a reputable source, just probably not a technical one. There is still the issue of the shader blocks being larger (was it 90% larger?) than they should be for 160 shaders on a 40nm fab. Since we know that it is 160 shaders, could this 'mystery space' be taken up by larger registers ("more GPRs")?
Without the appropriate context, it could mean almost anything. I mean, the register files of the VLIW architectures all have this 4-way banked register file structure. And for each physical unit there are always four sets of registers for the four work items processed by one unit over 4 cycles (the latency is actually 8 cycles and two wavefronts are always executed in an interleaved manner, but anyway).
The E6760 has just 6 SIMDs (480 ALUs), so the aggregate size of the register files is 1.5 MB, but per unit it is the same as every other VLIW GPU (including Latte).
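
The register figures quoted in this exchange can be tallied in a few lines. This is only a rough sketch: it assumes the "256 vec4s per SIMD" figure is a per-work-item pool replicated across the 64 work items of a wavefront, and that a SIMD holds 16 VLIW5 units. Under those assumptions it lands on the 16 kB per unit, 512 kB for Latte and 1.5 MB for the E6760 mentioned above. The function names are made up for illustration.

```python
# Register-file sizing sketch for an AMD VLIW5 part, using figures from this thread.
BYTES_PER_COMPONENT = 4        # one 32-bit component
COMPONENTS_PER_GPR = 4         # a GPR is a vec4
GPRS_PER_WORK_ITEM = 256       # "256 vec4s" per SIMD, read as a per-work-item pool (assumption)
WORK_ITEMS_PER_WAVEFRONT = 64  # assumption: the pool is replicated for 64 work items
VLIW_UNITS_PER_SIMD = 16       # assumption: 16 units x 4 cycles = 64 work items
ALUS_PER_VLIW_UNIT = 5         # VLIW5
KB = 1024


def simd_register_bytes() -> int:
    """Register file size of one SIMD, in bytes."""
    gpr_bytes = COMPONENTS_PER_GPR * BYTES_PER_COMPONENT
    return GPRS_PER_WORK_ITEM * gpr_bytes * WORK_ITEMS_PER_WAVEFRONT


def unit_register_bytes() -> int:
    """Register file share of one VLIW unit, in bytes."""
    return simd_register_bytes() // VLIW_UNITS_PER_SIMD


def aggregate_register_bytes(simd_count: int) -> int:
    """Total register file size for a GPU with simd_count SIMDs."""
    return simd_count * simd_register_bytes()


def alu_count(simd_count: int) -> int:
    """Total ALUs ("shaders") for a GPU with simd_count SIMDs."""
    return simd_count * VLIW_UNITS_PER_SIMD * ALUS_PER_VLIW_UNIT


if __name__ == "__main__":
    for name, simds in (("Latte", 2), ("E6760", 6)):
        total_kb = aggregate_register_bytes(simds) // KB
        print(f"{name}: {alu_count(simds)} ALUs, {total_kb} kB of registers, "
              f"{unit_register_bytes() // KB} kB per VLIW unit")
```

The per-unit figure does not change with the SIMD count, which is the point made above: only the aggregate scales.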
 
Unless they provide sources you can corroborate.

He's not 'our' eyeofcore. Eyeofcore was evicted because he's most definitely not one of 'us'.

Yeah, I was kidding. I read through his exchange in this thread right before seeing the Wiki actually.

Not being usable by the game itself doesn't mean that memory isn't being used while the game is running. Nothing goes to waste, even the memory that's part of the old Wii GPU on the die.

Games get full control over the main 32 MB of on-die memory at 31.7 GB/s. Developers complain enough about having to manage two separate memory pools. Imagine the complaints about adding in a tiny 3rd pool of lower bandwidth memory.

31.7 GB/s. How did they arrive at that number? eDRAM on separate clock than the rest of the GPU?

No, just 512kB or 16 kB per VLIW unit (possessing 5 ALUs with the VLIW5) as with all of AMD's VLIW architectures.

Btw., a capacity of 192 wavefronts for the command processor sounds like an awful lot for such a small GPU. RV770 (which had 10 SIMDs) was able to handle 256 wavefronts, iirc.

It does seem like an awful lot, but according to this link (in Appendix D), even 80 shader parts have the 192 global limit. AMD must scale it funky or somehow lower shader parts can benefit from the spare wavefronts queued up in cache?

http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
 
You got it wrong. It's 512kB in total or 16kB per VLIW group of ALUs. There are apparently 32 of these groups (32*5=160 ALUs), so the aggregate register file size is 32*16kB=512kB. All VLIW GPUs AMD ever produced have these 16kB registers per VLIW group (I think some call them "thread processors" or whatever).

Yeah, I've been swimming in discombobulated info in my head... I was combining that with the info about there being four of those per VLIW. So four 16 kB registers per VLIW5 machine is what I interpreted from all that.... I assumed it was 4 times what was considered normal, whatever that may be, so I took what was normal as a single set, 'n' in my mind, so 4n. I just don't see any reason for the information to ever have been stated in the first place otherwise, and we have heard multiple times it has an abnormally large register count for the line it's based on.

What? As I wrote, RV770 could handle up to 256 wavefronts (which would be 16384 "threads", or better, work items). The 192 number being tossed around has to be some count of how many wavefronts the command processor of Latte can handle. But 192 still seems like a lot compared to the 256 of RV770, or it is a bit inflexible and separate counts exist for vertex, geometry and pixel shaders (64 tops for each, that would be a somewhat reasonable count), as bgassassin may have implied.
Without the appropriate context, it could mean almost anything. I mean, the register files of the VLIW architectures all have this 4-way banked register file structure. And for each physical unit there are always four sets of registers for the four work items processed by one unit over 4 cycles (the latency is actually 8 cycles and two wavefronts are always executed in an interleaved manner, but anyway).
The E6760 has just 6 SIMDs (480 ALUs), so the aggregate size of the register files is 1.5 MB, but per unit it is the same as every other VLIW GPU (including Latte).

Ha ha, yeah, my bad. Earlier in the thread I found out Nintendo has decided to refer to warps as threads now. I then commented on how horribly confusing that must be, and then spread horrible confusion.

Yes, 192 warps, 12,000-something threads. I believe you stated before that that was strangely high for a GPU the size of Latte? I agree with that; I figure there must be a reason why. The context is clearly what's missing, and what I'm spitballing to try and get a handle on.

From what I gathered of BG's info, the geometry shaders if utilized, took 36 warps from the max warp count, while the pixel and vertex shaders shared what was left (no structure was given) of the 156 warps, while when geometry shaders are disabled they had full access to all 192.
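
To pin down the "12,000-something" figure and the geometry shader split described here, a quick sketch of the conversion. It assumes 64 work items per wavefront (warp) and takes the 192, 36 and 256 counts straight from the posts; the names are illustrative only.

```python
# Wavefront ("warp") to work-item conversion, using counts quoted in the thread.
WORK_ITEMS_PER_WAVEFRONT = 64   # assumption: standard AMD wavefront width

LATTE_MAX_WAVEFRONTS = 192      # the leaked total
GS_WAVEFRONTS = 36              # taken by geometry shaders when enabled (per BG's info)
RV770_MAX_WAVEFRONTS = 256      # figure recalled earlier in the thread


def work_items(wavefronts: int) -> int:
    """Number of work items ("threads" in GPU parlance) in the given wavefronts."""
    return wavefronts * WORK_ITEMS_PER_WAVEFRONT


def remaining_for_vs_ps(total: int, gs: int) -> int:
    """Wavefronts left for vertex and pixel shaders once GS takes its share."""
    return total - gs


if __name__ == "__main__":
    print("Latte work items:", work_items(LATTE_MAX_WAVEFRONTS))
    print("RV770 work items:", work_items(RV770_MAX_WAVEFRONTS))
    print("VS+PS wavefronts with GS on:",
          remaining_for_vs_ps(LATTE_MAX_WAVEFRONTS, GS_WAVEFRONTS))
```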


It does seem like an awful lot, but according to this link (in Appendix D), even 80 shader parts have the 192 global limit. AMD must scale it funky or somehow lower shader parts can benefit from the spare wavefronts queued up in cache?

http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Could the rumoured extra register memory on Latte be used with those queued-up warps to move dependencies out of the ALU, move in a warp without dependencies from the queue so that the ALUs can operate at or closer to peak use, and then insert the formerly dependent warp back in once it's been resolved?
 
1 Wavefront= 64 Threads.

1 Wavefront is completed in 4 cycles (16 threads per cycle).

In the old AMD architecture 1 thread=1 VLIW5.

Then:

192 Threads= 3 Wavefronts.
1 Cycle=48 Threads
48 Threads*5=240 Stream Processors?

Well, in the GPU it seems that we have 160 Stream Processors:

32 Threads per cycle= 128 Threads.

Does this mean that the other 64 threads are memory operations?
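
The arithmetic in this post can be replayed as a short sketch. Note the premise: it treats 192 as a count of work items, one per VLIW5 unit per cycle, which other replies in this thread dispute (they read 192 as wavefronts). The code only reproduces the numbers; it does not settle which reading is right. Function names are made up for illustration.

```python
# Replays the post's arithmetic under its own premise: 192 is a work-item count.
WORK_ITEMS_PER_WAVEFRONT = 64
CYCLES_PER_WAVEFRONT = 4      # 16 work items issued per cycle
ALUS_PER_VLIW5_UNIT = 5       # one work item occupies one VLIW5 unit


def implied_stream_processors(threads: int) -> int:
    """Stream processors needed if `threads` work items issue over 4 cycles."""
    threads_per_cycle = threads // CYCLES_PER_WAVEFRONT
    return threads_per_cycle * ALUS_PER_VLIW5_UNIT


def threads_covered(stream_processors: int) -> int:
    """Work items that `stream_processors` ALUs can issue over 4 cycles."""
    threads_per_cycle = stream_processors // ALUS_PER_VLIW5_UNIT
    return threads_per_cycle * CYCLES_PER_WAVEFRONT


if __name__ == "__main__":
    threads = 192
    print("wavefronts:", threads // WORK_ITEMS_PER_WAVEFRONT)
    print("implied stream processors:", implied_stream_processors(threads))
    covered = threads_covered(160)
    print("covered by 160 SPs:", covered, "left over:", threads - covered)
```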
 
Could the rumoured extra register memory on Latte be used with those queued-up warps to move dependencies out of the ALU, move in a warp without dependencies from the queue so that the ALUs can operate at or closer to peak use, and then insert the formerly dependent warp back in once it's been resolved?

The person who talked about having more GPRs was under the impression that RV770 only had 128. Meanwhile, Latte has 256 per SIMD. The 128 was a misreading of the info, however, as 128 registers is, I believe, what each thread has access to, and not total per SIMD. Maybe someone else can step in and state that more coherently.
 
The person who talked about having more GPRs was under the impression that RV770 only had 128. Meanwhile, Latte has 256 per SIMD. The 128 was a misreading of the info, however, as 128 registers is, I believe, what each thread has access to, and not total per SIMD. Maybe someone else can step in and state that more coherently.

THAT clears up the four fold confusion. Thank you.

I certainly hope some of the customizations Nintendo has made to Latte do SOMETHING to address VLIW's dependency shortcomings. It's bad enough in VLIW4; it's pretty hairy with VLIW5.

For all that talk Iwata gave about having a straightforward machine 'that just does what you expect', having your performance eaten up by idle ALUs waiting on dependencies, sometimes 3 or even 4 out of 5 machines... Well, that does not fit that statement in my opinion.
 
But the thread count IS abnormally high. Why do we have so many threads? Took a quick glance and confirmed the RV770 has 800 shaders, and only 1 more warp than what's been leaked for Latte. WTF?

It's just the way Nintendo is adding the threads.

The max number of threads for the Pixel and Vertex shaders when Geometry Shader threads are disabled, plus the max GS threads equals 192. In other words Nintendo took the highest number for each one, added them, and got 192. It's a weird math conclusion that I can say isn't worth over-thinking, IMO. After all you aren't using the GS threads if they are disabled, but they are added in to reach 192.

What? As I wrote, RV770 could handle up to 256 wavefronts (which would be 16384 "threads", or better, work items). The 192 number being tossed around has to be some count of how many wavefronts the command processor of Latte can handle. But 192 still seems like a lot compared to the 256 of RV770, or it is a bit inflexible and separate counts exist for vertex, geometry and pixel shaders (64 tops for each, that would be a somewhat reasonable count), as bgassassin may have implied.

The GS max is under 10. PS have the most, and each is almost a multiple of four of the next smaller one when GS threads are enabled.
 
That and the fact that you can have things like 32 freaking megabytes embedded onto a microprocessor as an on-die L2, and that's on a machine that's considered low end by today's standards. Do you have any idea how many floppy disks that is? Madness.
Depends on what kind of floppies of course. 3.5"HD, it's less than 25. ;) If you're talking actual floppy floppies (as in 5.25", or 8", without the rigid plastic shell), then potentially vastly more of course. 40-track formatted discs as used back in the 8-bit era held as little as 160kB IIRC. May be for a single side of course. You had to flip the discs over back in the day to access the reverse side...

The first harddrives had ~5MB capacity, had a stack of like 12" or larger platters and were roughly the size of a fridge. Prolly weighed one or a couple hundred kilos or thereabouts too.

...Of course, this is all irrelevant when talking about microprocessors, as on-die caches and other storage isn't non-volatile, and is used for different purposes in different ways. :) Still, it does put things into perspective.
 
It's just the way Nintendo is adding the threads.

The max number of threads for the Pixel and Vertex shaders when Geometry Shader threads are disabled, plus the max GS threads equals 192. In other words Nintendo took the highest number for each one, added them, and got 192. It's a weird math conclusion that I can say isn't worth over-thinking, IMO. After all you aren't using the GS threads if they are disabled, but they are added in to reach 192.



The GS max is under 10. PS have the most, and each is almost a multiple of four of the next smaller one when GS threads are enabled.

Ha, really? That's kinda weird.

So what, the geometry shaders' max warps is 9, vertex is 36, and pixel is 144? (Plus another warp here or there to get from 189 to 192.)

Okay. Who's Nintendo trying to impress within their own documents?
 
Ha, really? That's kinda weird.

So what, the geometry shaders' max warps is 9, vertex is 36, and pixel is 144? (Plus another warp here or there to get from 189 to 192.)

Okay. Who's Nintendo trying to impress within their own documents?

The multiples of four only work for when GS are enabled (look back to the post I linked to from GAF to see the totals for that), though your total max for VS and PS isn't that far off when GS are disabled. An example based on your guess would look like this: GS = 9, VS = 38, PS = 145. That will make up for the three you were missing.

My take is that when GS are enabled, the total usable threads go up to 160. Or even when they are disabled it doesn't exceed 160 usable threads. And as you can probably tell, none of the individual maxes exceed or reach 160 (Another hint: or 140 for that matter).
 
My take is that when GS are enabled, the total threads go up to 160. And as you can probably tell, none of the individual maxes exceed or reach 160.
Are you mixing up wavefronts with "threads" (work items)? While the wavefronts are the actual threads of the hardware, in the context of GPUs it is usually the work items (of which there are 64 in a wavefront) that are called that. But there is no way you can fill even the very small Latte GPU with just 160 or 192 of these "threads". For starters, the VLIW architectures always interleave two wavefronts on a single SIMD to cover instruction latencies (the command processor keeps more wavefronts to swap in in case one hits a long-latency instruction [memory access] or control flow). That means one already needs 256 "threads" (4 wavefronts) at minimum to even be able to hide the ALU latencies for a tiny GPU with just two SIMDs. To run efficiently, you would want significantly more than that (an order of magnitude or something in that range).
And there is actually no efficient way to run fewer "threads" than one has in a wavefront. So 10 "threads" for GS doesn't make the slightest sense; 10 wavefronts (640 "threads") do.
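
The minimum-occupancy argument above boils down to a one-line product. A minimal sketch, assuming two interleaved wavefronts per SIMD and 64 work items per wavefront as stated in the post; the function name and defaults are illustrative, not taken from any SDK.

```python
# Minimum work items needed just to hide ALU latency on a VLIW part.
WORK_ITEMS_PER_WAVEFRONT = 64
INTERLEAVED_WAVEFRONTS_PER_SIMD = 2   # two wavefronts alternate on each SIMD


def min_work_items_for_alu_latency(simd_count: int) -> int:
    """Work items required so every SIMD has two wavefronts to interleave."""
    wavefronts = simd_count * INTERLEAVED_WAVEFRONTS_PER_SIMD
    return wavefronts * WORK_ITEMS_PER_WAVEFRONT


if __name__ == "__main__":
    # A two-SIMD GPU (as Latte is believed to be in this thread).
    print("minimum work items:", min_work_items_for_alu_latency(2))
    # Ten GS wavefronts expressed as work items.
    print("10 wavefronts:", 10 * WORK_ITEMS_PER_WAVEFRONT, "work items")
```

This covers ALU latency only; hiding memory latency needs the extra queued wavefronts the post mentions, so real occupancy targets would be well above this floor.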
 