Wii U hardware discussion and investigation *rename

Discussion in 'Console Technology' started by TheAlSpark, Jul 29, 2011.

Thread Status:
Not open for further replies.
  1. creaks

    Newcomer

    Joined:
    Apr 9, 2013
    Messages:
    81
    Likes Received:
    0
    That's like 2 megabytes of registers? If it's not being used to go wider, what's its purpose? Am I missing something obvious?

    TEVs in Wii U? Eh... I think something like that would have shown up somewhere in all the build logs I've gone through.

    Wait, in Nintendo-ese, threads are now wavefronts?

    192 warps is a pretty big difference from 192 threads. Isn't that a pretty confusing use of nomenclature? That's like 12,000-something threads over 4 clocks...

    Ooh, geometry shaders, could those be used for some computational tasks like compute shaders? So, if so chosen, you can dedicate a warp to compute? Is this where Iwata was coming from?
     
    #5481 creaks, Nov 15, 2013
    Last edited by a moderator: Nov 15, 2013
  2. Melqart

    Newcomer

    Joined:
    Nov 12, 2013
    Messages:
    36
    Likes Received:
    0
    There are more registers per shader compared with a normal R7xx design. 4x the amount it would seem (I think). This would probably explain why the shader blocks are larger than they should be for a 40nm fab. What would be the purpose of this? Compute perhaps? Compute doesn't seem like a good idea with only 160 shaders.

    I doubt there is much (if any) legacy Hollywood stuff on the GPU other than the RAM pool identified on the die (the one that cannot be accessed by games). Nintendo themselves even said AMD was able to modify the new GPU to be compatible with the old. A frankenstein design with TEVs wouldn't make much sense anyways.
     
  3. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    There's RAM on the die that can't be accessed by games? What the hell!

    Nintendo never ceases in its amazing ways to fuck up.
     
  4. creaks

    Newcomer

    Joined:
    Apr 9, 2013
    Messages:
    81
    Likes Received:
    0
    So there are too many registers, and too many threads, for what we believe to be the actual shader count.

    What can make this make sense? Does that extra memory (4x????) really accelerate/salvage the performance of 160 shaders enough to make it worthwhile? Is that even a plausible situation?

    Or are we missing a puzzle piece...

    Okay, what about this: we know the Wii U has multithreaded rendering. It's in use on Project CARS.

    Direct3D got around GPU multithreading thread conflicts by adding secondary command buffers that could be saved and used later. We have 4x 'too many' registers; perhaps... the main (1), and a secondary buffer for each CPU thread (3)?

    Yay? Nay? Scooby snacks?
     
    #5484 creaks, Nov 15, 2013
    Last edited by a moderator: Nov 15, 2013
  5. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    No, just 512 kB, or 16 kB per VLIW unit (possessing 5 ALUs with VLIW5), as with all of AMD's VLIW architectures.

    Btw., a capacity of 192 wavefronts for the command processor sounds like an awful lot for such a small GPU. RV770 (which had 10 SIMDs) was able to handle 256 wavefronts, iirc.
     
  6. BobbleHead

    Newcomer

    Joined:
    Sep 24, 2002
    Messages:
    58
    Likes Received:
    2
    Not being usable by the game itself doesn't mean that memory isn't being used while the game is running. Nothing goes to waste, even the memory that's part of the old Wii GPU on the die.

    Games get full control over the main 32 MB of on-die memory at 31.7 GB/s. Developers complain enough about having to manage two separate memory pools. Imagine the complaints about adding in a tiny 3rd pool of lower bandwidth memory.
     
  7. creaks

    Newcomer

    Joined:
    Apr 9, 2013
    Messages:
    81
    Likes Received:
    0
    kB?

    I was referring to the total between the 2 SIMD banks/32 ALUs. The wholeeee enchilada.

    So 512x32=16384kB=16 MB?? (so... no /8 for 2048 kB total? are we sure.....? I feel reeeeally old now....) But that completely answers my question regardless. It's supposed to be like that for VLIW5, it's normal; I was under the impression it was rather high.

    But the thread count IS abnormally high. Why do we have so many threads? Took a quick glance and confirmed the RV770 has 800 shaders, and only 1 more warp than what's been leaked for Latte. WTF?

    That's weird, right? I know that's weird, it's been bugging the crap outta me. What's really weird is that it came from the most reliable source I can imagine. Named and everything. The guy worked on NFSMWU, and is currently doing Wii U work on Project CARS.... It should never have left the Project CARS forum, but it did, and here we are.
     
  8. Melqart

    Newcomer

    Joined:
    Nov 12, 2013
    Messages:
    36
    Likes Received:
    0
    Apparently the SDK documentation has a diagram saying that it has "4 GPRs" per unit. Which seems to be a strange way to describe it. There is also this earlier leak:

    http://beyond3d.com/showpost.php?p=1668212&postcount=2552

    From somebody who was a reputable source, just probably not a technical one. There is still the issue of the shader blocks being larger (was it 90% larger?) than they should be for 160 shaders on a 40nm fab. Since we know that it is 160 shaders, could this 'mystery space' be taken up by larger registers ("more GPRs")?
     
  9. creaks

    Newcomer

    Joined:
    Apr 9, 2013
    Messages:
    81
    Likes Received:
    0
    I suppose that's why I was under the impression it was so high. I recall something to that effect now. That, and the fact that you can have things like 32 freaking megabytes embedded onto a microprocessor as an on-die L2, and that's on a machine that's considered low end by today's standards. Do you have any idea how many floppy disks that is? Madness.

    I guess that pulls the shotgun away from the head of my multithreaded rendering secondary command buffers pitch?

    Or maybe they are used to stash away dependencies so the ALU can still be used while the stored dependency is in time-out waiting to be resolved?
     
    #5489 creaks, Nov 15, 2013
    Last edited by a moderator: Nov 15, 2013
  10. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    You got it wrong. It's 512 kB in total, or 16 kB per VLIW group of ALUs. There are apparently 32 of these groups (32*5=160 ALUs), so the aggregate register file size is 32*16 kB=512 kB. All VLIW GPUs AMD ever produced have these 16 kB of registers per VLIW group (I think some call them "thread processors" or whatever).
    What? As I wrote, RV770 could handle up to 256 wavefronts (which would be 16384 "threads", or better, work items). The 192 number being tossed around has to be some count of how many wavefronts the command processor of Latte can handle. But 192 still appears like a lot compared to the 256 of RV770, or it is a bit inflexible and separate accounts exist for vertex, geometry and pixel shaders (64 tops for each; that would be a somewhat reasonable count), as bgassassin may have implied.
    Without the appropriate context, it could mean almost anything. I mean, the register files of the VLIW architectures all have this 4-way banked register file structure. And for each physical unit there are always four sets of registers for the four work items processed by one unit over 4 cycles (the latency is actually 8 cycles and two wavefronts are always executed in an interleaved manner, but anyway).
    The E6760 has just 6 SIMDs (480 ALUs), so the aggregate size of the register files is 1.5 MB, but per unit it is the same as every other VLIW GPU (including Latte).
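
    The register file arithmetic above can be checked with a minimal sketch. It assumes only the figures quoted in this thread (16 kB of registers per VLIW5 unit, 5 ALUs per unit, 160 ALUs for Latte, 480 for the E6760, 800 for RV770); the function and constant names are made up for illustration.

```python
# Figures quoted in the thread, not taken from any official documentation.
KB_PER_VLIW_UNIT = 16      # register file per VLIW5 unit, per the post above
ALUS_PER_VLIW_UNIT = 5     # VLIW5: five ALUs per unit


def register_file_kb(alu_count):
    """Aggregate register file size in kB for a VLIW5 part with alu_count ALUs."""
    if alu_count % ALUS_PER_VLIW_UNIT != 0:
        raise ValueError("ALU count must be a multiple of 5 for VLIW5")
    vliw_units = alu_count // ALUS_PER_VLIW_UNIT
    return vliw_units * KB_PER_VLIW_UNIT


# ALU counts as stated by posters in this thread (Latte's 160 is the thread's working assumption).
for name, alus in [("Latte", 160), ("E6760", 480), ("RV770", 800)]:
    print(f"{name}: {alus} ALUs -> {register_file_kb(alus)} kB of registers")
```

    Under those assumptions Latte comes out at 512 kB and the E6760 at 1536 kB (1.5 MB), so the per-unit size is the same and only the unit count differs.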
     
  11. Fourth Storm

    Newcomer

    Joined:
    Nov 13, 2012
    Messages:
    135
    Likes Received:
    10
    Yeah, I was kidding. I read through his exchange in this thread right before seeing the Wiki actually.

    31.7 GB/s. How did they arrive at that number? eDRAM on separate clock than the rest of the GPU?

    It does seem like an awful lot, but according to this link (in Appendix D), even 80 shader parts have the 192 global limit. AMD must scale it funky or somehow lower shader parts can benefit from the spare wavefronts queued up in cache?

    http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
     
  12. creaks

    Newcomer

    Joined:
    Apr 9, 2013
    Messages:
    81
    Likes Received:
    0
    Yeah, I've been swimming in discombobulated info in my head... I was combining that with the info about there being four of those per VLIW. So four 16 kB registers per VLIW5 machine is what I interpreted from all that.... I assumed it was 4 times what was considered normal, whatever that may be, so I treated what was normal as a single set, 'n' in my mind, so 4n. I just don't see any reason for the information to ever have been stated in the first place otherwise, and we have heard multiple times that it has an abnormally large register count for the line it's based on.

    Ha ha, yeah, my bad. Earlier in the thread conversation I found out Nintendo has decided to refer to warps as threads now. I then commented on how horribly confusing that must be, and then spread horrible confusion.

    Yes, 192 warps, 12,000-something threads. I believe you stated before that this was strangely high for a GPU the size of Latte? I agree with that, and I figure there must be a reason why. The context is clearly what's missing, and that's what I'm spitballing to try and get a handle on.

    From what I gathered of BG's info, the geometry shaders, if utilized, took 36 warps from the max warp count, while the pixel and vertex shaders shared what was left (no structure was given) of the 156 warps; when geometry shaders are disabled, they had full access to all 192.

    Could the rumoured extra register memory on Latte be used with those queued-up warps to move dependencies out of the ALU, move in a warp without dependencies from the queue so that the ALUs can operate at/closer to peak use, and then insert the formerly dependent warp back in once it's been resolved?
     
    #5492 creaks, Nov 15, 2013
    Last edited by a moderator: Nov 15, 2013
  13. Urian

    Regular

    Joined:
    Aug 23, 2003
    Messages:
    622
    Likes Received:
    55
    1 Wavefront= 64 Threads.

    1 Wavefront is completed in 4 cycles (16 threads per cycle).

    In the old AMD architecture 1 thread=1 VLIW5.

    Then:

    192 Threads= 3 Wavefronts.
    1 Cycle=48 Threads
    48 Threads*5=240 Stream Processors?

    Well, in the GPU it seems that we have 160 Stream Processors:

    32 Threads per cycle= 128 Threads over 4 cycles.

    Does this mean that the other 64 threads are memory operations?
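
    The two readings of "192" in this thread can be put side by side in a minimal sketch. It assumes the figures given above (64 work items per wavefront, 4 cycles per wavefront, one work item per VLIW5 unit per cycle); the helper names are hypothetical.

```python
# Figures from the posts above; names are illustrative only.
WORK_ITEMS_PER_WAVEFRONT = 64
CYCLES_PER_WAVEFRONT = 4
ALUS_PER_VLIW5_UNIT = 5


def work_items_in(wavefronts):
    """Work items ("threads") contained in a number of wavefronts."""
    return wavefronts * WORK_ITEMS_PER_WAVEFRONT


def work_items_per_cycle(alu_count):
    """Work items issued per cycle, one per VLIW5 unit."""
    return alu_count // ALUS_PER_VLIW5_UNIT


def alus_for(work_items_per_wavefront_period):
    """ALUs needed to issue this many work items in one 4-cycle period."""
    per_cycle = work_items_per_wavefront_period // CYCLES_PER_WAVEFRONT
    return per_cycle * ALUS_PER_VLIW5_UNIT


# Reading 1: 192 is a count of work items (3 wavefronts).
print(alus_for(192))                 # ALUs implied if 192 work items issue per 4 cycles

# Reading 2: 192 is a count of wavefronts.
print(work_items_in(192))            # total work items in 192 wavefronts

# What 160 ALUs can issue.
per_cycle = work_items_per_cycle(160)
print(per_cycle, per_cycle * CYCLES_PER_WAVEFRONT)
```

    Read as work items, 192 implies 240 ALUs; read as wavefronts, it is 12288 work items, which matches the "12,000-something" figure mentioned earlier in the thread.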
     
  14. Fourth Storm

    Newcomer

    Joined:
    Nov 13, 2012
    Messages:
    135
    Likes Received:
    10
    The person who talked about having more GPRs was under the impression that RV770 only had 128. Meanwhile, Latte has 256 per SIMD. The 128 was a misreading of the info, however, as 128 registers is, I believe, what each thread has access to, and not total per SIMD. Maybe someone else can step in and state that more coherently.
     
  15. creaks

    Newcomer

    Joined:
    Apr 9, 2013
    Messages:
    81
    Likes Received:
    0
    THAT clears up the four fold confusion. Thank you.

    I certainly hope some of the customizations Nintendo has made to Latte do SOMETHING to address VLIW's dependency shortcomings. It's bad enough in VLIW4, it's pretty hairy with VLIW5.

    For all that talk Iwata gave about having a straightforward machine 'that just does what you expect', having your performance eaten up by idle ALUs waiting on dependencies, sometimes 3 and even 4 out of 5 machines... Well, that does not fit that statement, in my opinion.
     
  16. bgassassin

    Regular

    Joined:
    Aug 12, 2011
    Messages:
    507
    Likes Received:
    0
    It's just the way Nintendo is adding the threads.

    The max number of threads for the Pixel and Vertex shaders when Geometry Shader threads are disabled, plus the max GS threads equals 192. In other words Nintendo took the highest number for each one, added them, and got 192. It's a weird math conclusion that I can say isn't worth over-thinking, IMO. After all you aren't using the GS threads if they are disabled, but they are added in to reach 192.

    The GS max is under 10. PS have the most, and each is almost a multiple of four of the next smaller one when GS threads are enabled.
     
    #5496 bgassassin, Nov 16, 2013
    Last edited by a moderator: Nov 16, 2013
  17. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    Depends on what kind of floppies, of course. 3.5" HD, it's fewer than 25. ;) If you're talking actual floppy floppies (as in 5.25", or 8", without the rigid plastic shell), then potentially vastly more, of course. 40-track formatted discs as used back in the 8-bit era held as little as 160 kB IIRC. Maybe that was for a single side, of course. You had to flip the discs over back in the day to access the reverse side...

    The first harddrives had ~5MB capacity, had a stack of like 12" or larger platters and were roughly the size of a fridge. Prolly weighed one or a couple hundred kilos or thereabouts too.

    ...Of course, this is all irrelevant when talking about microprocessors, as on-die caches and other storage isn't non-volatile, and is used for different purposes in different ways. :) Still, it does put things into perspective.
     
  18. creaks

    Newcomer

    Joined:
    Apr 9, 2013
    Messages:
    81
    Likes Received:
    0
    Ha, really? That's kinda weird.

    So what, the geometry shaders' max warps is 9, vertex is 36, and pixel is 144? (Plus another warp here or there to get from 189 to 192.)

    Okay. Who's Nintendo trying to impress within their own documents?
     
  19. bgassassin

    Regular

    Joined:
    Aug 12, 2011
    Messages:
    507
    Likes Received:
    0
    The multiples of four only work for when GS are enabled (look back to the post I linked to from GAF to see the totals for that), though your total max for VS and PS isn't that far off when GS are disabled. An example based on your guess would look like this: GS = 9, VS = 38, PS = 145. That will make up for the three you were missing.

    My take is that when GS are enabled, the total usable threads go up to 160. Or even when they are disabled it doesn't exceed 160 usable threads. And as you can probably tell, none of the individual maxes exceed or reach 160 (Another hint: or 140 for that matter).
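
    The "add up the maxes" counting described above can be sketched as follows. The per-stage numbers are only the guesses floated in this exchange (not the real SDK figures, which are not given here), and the function name is made up.

```python
def summed_max_threads(per_stage_max):
    """Sum the per-stage maximums, the way the 192 figure is said to be derived."""
    return sum(per_stage_max.values())


# creaks' guess from the multiples-of-four hint (illustrative only).
creaks_guess = {"GS": 9, "VS": 36, "PS": 144}

# bgassassin's adjusted example that makes up the missing three (illustrative only).
adjusted_example = {"GS": 9, "VS": 38, "PS": 145}

for label, budget in [("creaks' guess", creaks_guess), ("adjusted example", adjusted_example)]:
    print(f"{label}: {summed_max_threads(budget)}")
```

    The first split sums to 189 and the second to 192, which is all the example is meant to show: the total is a sum of maximums that are never in use at the same time.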
     
    #5499 bgassassin, Nov 16, 2013
    Last edited by a moderator: Nov 16, 2013
  20. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Are you mixing up wavefronts with "threads" (work items)? While the wavefronts are the actual threads of the hardware, usually the work items (from which you have 64 in a wavefront) are called that way in the context of GPUs. But there is no way you can fill even the very small Latte GPU with just 160 or 192 of these "threads". For starters, the VLIW architectures always interleave two wavefronts on a single SIMD to cover instruction latencies (the command processor keeps more wavefronts to swap in in case one hits a long latency instruction [memory access] or control flow). That means one needs already 256 "threads" (4 wavefronts) at minimum, to be even able to hide the ALU latencies for a tiny GPU with just two SIMDs. For running efficiently, you would want significantly (an order of magnitude or something in that range) more than that.
    And there is actually no efficient way to run fewer "threads" than one has in a wavefront. So 10 "threads" for GS doesn't make the slightest sense; 10 wavefronts (640 "threads") do.
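
    The occupancy argument above works out as in this minimal sketch. It assumes the figures given in the post (64 work items per wavefront, two wavefronts interleaved per SIMD to cover ALU latency); the names are hypothetical, and the result is a floor for ALU latency hiding only, not a figure for efficient operation.

```python
# Figures from the post above; names are illustrative only.
WORK_ITEMS_PER_WAVEFRONT = 64
INTERLEAVED_WAVEFRONTS_PER_SIMD = 2


def min_work_items_in_flight(simd_count):
    """Fewest work items needed so every SIMD has two wavefronts to interleave."""
    return simd_count * INTERLEAVED_WAVEFRONTS_PER_SIMD * WORK_ITEMS_PER_WAVEFRONT


def wavefronts_occupied(work_items):
    """Wavefronts consumed by a batch of work items (partial wavefronts still cost a full one)."""
    return -(-work_items // WORK_ITEMS_PER_WAVEFRONT)  # ceiling division


print(min_work_items_in_flight(2))    # a two-SIMD part such as Latte is assumed to be
print(wavefronts_occupied(10))        # 10 work items still occupy a whole wavefront
print(10 * WORK_ITEMS_PER_WAVEFRONT)  # work items in 10 full wavefronts
```

    With two SIMDs the floor is 256 work items (4 wavefronts), and 10 work items would still tie up one full wavefront, which is why a per-stage limit of about 10 only makes sense as a wavefront count.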
     