Wii U hardware discussion and investigation *rename

Discussion in 'Console Technology' started by TheAlSpark, Jul 29, 2011.

Thread Status:
Not open for further replies.
  1. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Thanks.

    4x4 matrix multiply involves basically 32 loads, 16 stores, 16 MULs, and 48 MADDs. This is the instruction breakdown..

    PPC750:
    15 lfs (load single fp32)
    1 lfsx (load single fp32 using reg index - compiler uses this w/constant 0 register instead of using an imm offset of 0)

    16 ps_merge00 (used to create a paired single from two separate singles; compiler is using this like a 2x broadcast instruction)
    8 ps_mul (2x multiply)
    24 ps_madd (2x madd)

    8 psq_lx (loads a paired single)
    8 psq_stx (stores a paired single)
    4 add (pointer arithmetic)
    2 slwi (left shifts, pointer arithmetic)
    1 bdnz (flow control, probably free)

    x86 version is:

    1 add (pointer arithmetic)
    1 mov (pointer arithmetic)
    1 lea (pointer arithmetic)
    2 sal (left shift, pointer arithmetic)

    1 sub (flow control)
    1 jne (flow control)

    12 addps (add 4x FP32)
    16 movaps (move 4x FP32 - 4 of these are loads and 4 are stores, the other 8 are reg/reg)
    16 movss (move 1x FP32 - these are all used as loads)
    16 mulps (MUL 4x FP32)
    16 shufps (re-organize lanes of 4x FP32)

    I haven't made any attempt at timing analysis for the PPC750 version, so I don't yet know how many dependency stalls it's hitting (probably not terribly many). The FLOP count for the x86 version is the same, but the x86 version does some extra work, both from moving data around and from not having FMADDs. It also doesn't attempt to fold loads into the adds or muls. Both are using a bunch of scalar loads + broadcasts to get the scalar multiplication, instead of using vector loads. But Bobcat gets artificially penalized because its shufps instructions take 2 cycles like the addps/mulps do, and alarmingly, so do the movss loads, so it's paying a lot more for its broadcasts.
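
    To make the scalar load + broadcast pattern concrete, here is a rough Python model of it, not the actual benchmark code. The helper names (`broadcast`, `vmul`, `vadd`, `matmul_broadcast`) and the test matrices are made up for illustration. With `WIDTH = 4` and no fused multiply-add, the operation counts come out the same as the mulps/addps counts in the x86 listing above.

```python
WIDTH = 4  # lanes per vector: 4 models an SSE register (assumption for this sketch)

def broadcast(x):
    """Copy one scalar into every lane (movss + shufps in the x86 listing)."""
    return [x] * WIDTH

def vmul(u, v):
    """Lane-wise multiply, standing in for mulps."""
    return [p * q for p, q in zip(u, v)]

def vadd(u, v):
    """Lane-wise add, standing in for addps."""
    return [p + q for p, q in zip(u, v)]

def matmul_broadcast(a, b):
    """4x4 multiply where row i of the result is sum_k a[i][k] * (row k of b).

    No horizontal (dot product) operation is needed in this ordering.
    """
    counts = {"broadcast": 0, "vmul": 0, "vadd": 0}
    c = []
    for i in range(4):
        row = None
        for k in range(4):
            s = broadcast(a[i][k])
            counts["broadcast"] += 1
            term = vmul(s, b[k])
            counts["vmul"] += 1
            if row is None:
                row = term          # first term of the row: multiply only
            else:
                row = vadd(row, term)
                counts["vadd"] += 1
        c.append(row)
    return c, counts

# Hypothetical test matrices, chosen only so the result is easy to check.
a = [[float(i + 2 * k + 1) for k in range(4)] for i in range(4)]
b = [[float(3 * k - j) for j in range(4)] for k in range(4)]
c, counts = matmul_broadcast(a, b)
```

    With FMADD available, each `vadd` would fuse with its `vmul`, leaving 4 vector MULs and 12 vector MADDs at this width.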

    Moving to Jaguar's improved ISA support could improve things. That gives you three-address arithmetic with AVX128, as well as broadcasts. FMA would improve things more.

    A more heavily optimized version would probably consider some friendlier storage formats if at all possible. Bet a lowly Cortex-A8 would fare nicely here with some hand-written ASM, since it has vector * scalar FMADD.. that is, if you could hide the huge latency..

    You can find some good examples of AMD vs nVidia GPUs where the difference in peak FLOP count painted a very different picture from the difference in typical game performance. That should answer your question, I hope.
     
    #4741 Exophase, Feb 12, 2013
    Last edited by a moderator: Feb 12, 2013
  2. Inuhanyou

    Regular

    Joined:
    Dec 23, 2012
    Messages:
    995
    Likes Received:
    189
    Location:
    New Jersey, USA
    Correct, but I had assumed that Nvidia "flops" and AMD "flops" were usually different in proportion to actual performance values.

    I'm just wondering if that still applies when one is comparing two GPUs from the same vendor.
     
  3. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    They're the same kind of FLOPs as the ones people talk about with Xbox 360 or PS3, or the ones people are speculating about in this thread. It's just a matter of how high utilization can be. Different designs are going to have different potential for utilizing those FLOPs - and having worse FLOP efficiency doesn't necessarily mean the design is worse because those FLOPs may not have been as expensive to implement (for instance AMD's 4-wide or 5-wide VLIW gives more peak perf/cycle than single-issue or dual-issue scalar).

    Going from Xbox 360 level all the way to GCN changes the uarch at least as much as any same-generation difference between AMD and nVidia ever has (probably a lot more), so yes it can still apply when comparing different GPU families from the same vendor.
     
  4. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,551
    Likes Received:
    735
    Location:
    Texas
    Yuck, that code is ugly.

    Why are there so many mul/madd ops? A matrix multiply should only be 16 mults and 12 adds. Are multiple matrix multiplies interleaved (loop in main unrolled)? If the inner loop was only operating on one matrix and was properly vectorized, I would expect something around 2 vector mults and 6 madds.
     
  5. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Enh? I think you're confusing it with a vec4 transform (vec4 * mat4). A 4x4 matrix multiply is 4x4 dot products, which are each 4 MUL + 3 ADD. So 64 MUL and 48 ADD total. Of course the order of operations has been rearranged in order to prevent needing horizontal ops like dot product.

    I meant to be describing the FLOP load for the algorithm, but accidentally used 2x SIMD figures instead of scalar ones. Fixed it in my post
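
    As a sanity check on those figures, a small Python sketch (the function name and test matrices are made up for illustration) that counts the scalar operations in a straightforward 4x4 * 4x4 multiply:

```python
N = 4

def matmul_counted(a, b):
    """Return (product, mul_count, add_count) for two N x N matrices."""
    muls = adds = 0
    c = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = a[i][0] * b[0][j]       # first term: a bare MUL
            muls += 1
            for k in range(1, N):
                acc += a[i][k] * b[k][j]  # one MUL + one ADD (a single MADD if fused)
                muls += 1
                adds += 1
            c[i][j] = acc
    return c, muls, adds

identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
m = [[float(4 * i + j) for j in range(N)] for i in range(N)]
product, muls, adds = matmul_counted(m, identity)

# With fused multiply-add, every ADD pairs up with one of the MULs.
madds = adds
bare_muls = muls - adds
flops = muls + adds
```

    That gives 64 MUL + 48 ADD without fusing, or 16 MUL + 48 MADD with it, and 112 FLOPs either way.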
     
  6. darkblu

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,642
    Likes Received:
    22
    I think you overlooked something. A 4x4 matmul is 16 MULs and 48 MADDs - 1 MUL + 3 MADDs per element, 112 FLOPs altogether.

    edit: Ah, I see what you did. You fixed it already.

    I'm not sure how you can avoid the shuffles in this algorithm. If you want to preserve the access pattern the shuffles are mandatory. The fact that shuffles are expensive on Bobcat is unfortunate, but I would not pin the entire fault on the compiler, given the register file size restrictions and the register pressure stemming from that. Keeping yet another register as a shuffle source would not help the pressure.

    I hope they did address the serious shortcomings of the SSE ISA. We'll see.

    I have run the same test on an A8 as well - yes, it does FMADDs just as the PPC does, but it's not performing stellarly by any measure - it had the worst normalized performance among the three CPUs. I might try to sit down and analyze what its issues are some day, but not today.
     
  7. Inuhanyou

    Regular

    Joined:
    Dec 23, 2012
    Messages:
    995
    Likes Received:
    189
    Location:
    New Jersey, USA


    Ah, thanks, that was very informative.

    Does that mean that the previous indications of 320 SPUs are now in flux?
     
  8. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    I fixed it :p

    Yes, the shuffles are unavoidable if you don't have a broadcast instruction. The scalar loads are the bigger fault. I haven't worked out if you can avoid any without worsening register pressure but I imagine you can meet it halfway at worst.

    AVX and SSE4 do address some, and Jaguar has both. The 256-bit version isn't too useful but it should fit well with AVX128.

    I don't trust the compiler's ability to generate NEON at all (scheduling NEON well is.. hard. It has a lot of weird caveats). But it's probably just too hard to avoid ~8 cycles of latency for an FMADD..
     
  9. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,551
    Likes Received:
    735
    Location:
    Texas
    Wait, is it a 4x4x4 times a 4x4x4 matrix? (4x4 matrix of 4 element vectors?)
     
  10. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Currently it at least seems like 160 is plausible.

    We don't really know the nature of these SPs, who knows what kind of special instructions AMD could have added for Nintendo.. fast normalization was brought up as one possibility..

    It's a 4x4 matrix multiplied by another 4x4 matrix.
     
  11. darkblu

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,642
    Likes Received:
    22
    Here, help yourself.
     
  12. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,551
    Likes Received:
    735
    Location:
    Texas
    Ooops. Who knew that a 4x4 matrix has 16 elements and not 4. Thanks.
     
  13. EpyonXYZ

    Newcomer

    Joined:
    Feb 12, 2013
    Messages:
    13
    Likes Received:
    0
    A Question.

    Hello.

    I am brand new here so first a great hello!

    I am not into this technical stuff, but I noticed one thing. I measured the logic area in these illustrations (the areas without a visible structure): http://www.neogaf.com/forum/showpost.php?p=47369662&postcount=1486

    The difference between the two areas is ~12.6%.

    So if "Latte" is at 40nm then it is maybe more in the area of > 160sp.

    If Latte is 55 nm or so then it is more like = 160sp.

    Or the scaling of the illustration is wrong.

    And why is everyone talking about a ~15 Watt GPU in Wii U and not a 20 or 25 Watt GPU? Are the CPU and the other components so energy hungry?

    I hope these aren't too stupid questions and thoughts. Bye :oops:
     
  14. Ika

    Ika
    Newcomer Subscriber

    Joined:
    Jun 3, 2012
    Messages:
    67
    Likes Received:
    1
    And also look at the average power consumption figures of the DDR3 version (I know it's for the whole card, but still). It looks low enough.
     
  15. (((interference)))

    Veteran

    Joined:
    Sep 10, 2009
    Messages:
    2,499
    Likes Received:
    70
    FWIW I was asking Richard from DF what he made of function's analysis and he came back with this

     
  16. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,229
    Likes Received:
    2,490
    Location:
    Wrong thread
    He's mistaken then!

    To be absolutely clear, it's about vastly lower frame rates and / or resolutions (the two are somewhat interchangeable) than on supposedly similarly powerful ~352 gflop, 8 ROP PC parts such as the HD 5550. The Xbox 360 <-> PC multi-platforms are just a standout indicator of how massively a 320 shader 8 ROP part can outperform the 360 even in a less than optimal setting.

    To emphasise: it's *not* about the Wii U being slightly slower or faster than the Xbox 360. If a Wii U port had a slightly higher frame rate, or a resolution a few percent higher, it wouldn't change anything. We should actually expect that sometimes - its ROPs and TMUs are faster after all, and even the HD 6450 can give the 360 a run for its money. If, for example, Wii U ports started to run at a 50 ~ 75% higher resolution while still maintaining the same frame rate, then we'd be looking at compelling evidence of HD 5550 or "320 shader" performance.

    So nope, the argument isn't based entirely on "lower frame-rates" (and I hope he didn't mean "than the 360")!

    Edit: And you could actually have a 160 shader GPU that was stronger than the PS360. If the GDDR5 variant of HD 6450 had 16 TMUs and 8 ROPs it would probably be a very good candidate.
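
    For reference, the peak figures being thrown around here are just ALU count x 2 FLOPs (one MADD) x clock. A quick sketch: the function name is made up, the 550 MHz Wii U GPU clock is the figure being discussed in this thread, and the 500 MHz Xbox 360 GPU clock is the commonly quoted one rather than something stated in this post.

```python
def peak_gflops(alus, clock_mhz, flops_per_alu_per_clock=2):
    """Peak single-precision GFLOPS, counting one MADD (2 FLOPs) per ALU per clock."""
    return alus * flops_per_alu_per_clock * clock_mhz / 1000

hd5550 = peak_gflops(320, 550)       # 320 shaders at 550 MHz
wiiu_if_160 = peak_gflops(160, 550)  # hypothetical 160 shader part at the same clock
xbox360 = peak_gflops(240, 500)      # 240 shaders, assumed 500 MHz clock
```

    These are peak numbers only and say nothing about utilization, which is the whole point of the comparison against real PC parts.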
     
    #4756 function, Feb 12, 2013
    Last edited by a moderator: Feb 12, 2013
  17. DRS

    DRS
    Newcomer

    Joined:
    May 22, 2009
    Messages:
    135
    Likes Received:
    0
    Honestly, even though 160 may be plausible, I don't really favor this idea because of my own hypothesis :) In my opinion, to achieve maximum fillrate with 8 ROPs and 8 SIMDs calculating 4 pixels each, shader programs are limited to 4 clocks (or instructions). Since I'd guess programs are usually bigger than that, 8 ROPs wouldn't make sense. Looking at AMD's offerings, I don't see any 160:16:8 config available either.
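
    The arithmetic behind that hypothesis can be sketched as below. This is only a simplified model that assumes each SIMD retires one instruction per clock for its 4 pixels and that the ROPs drain one pixel each per clock; the function names and the 16-instruction example are made up for illustration.

```python
def shader_cycle_budget(simds, pixels_per_simd, rops):
    """Instructions per pixel that still keep every ROP fed each clock."""
    pixel_instructions_per_clock = simds * pixels_per_simd
    return pixel_instructions_per_clock / rops

def fill_fraction(instructions_per_pixel, budget):
    """Fraction of peak ROP fillrate reached by a shader of the given length."""
    return min(1.0, budget / instructions_per_pixel)

budget = shader_cycle_budget(simds=8, pixels_per_simd=4, rops=8)
# A hypothetical 16-instruction pixel shader under this model.
longer_shader = fill_fraction(16, budget)
```

    Under this model any shader longer than the budget leaves the ROPs partly idle, which is the reasoning behind the hypothesis.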

    Besides that, most texture units would load the same texture, most certainly resulting in higher texel bandwidth requirements to the external world. Therefore, I find it more plausible that we are looking at a 8x8 SIMD design at least. It would be more efficient in reaching its goals.

    About the ports: None of these online comparisons show models close up, so how can we judge texture quality? BLOPS 2 has better shadow resolution on Wii U BTW! It looks like they use mipmapping on the cubemaps or so. The proof is on YouTube: look at the shadow of the mic and its cable on the Wii U and how it is missing on 360.

    Ok burn me down if I'm wrong.

    Another thing that keeps bugging me. Hacking a console is one thing, tracing down the clockspeeds something else and grepping on a directory the complete opposite. Why do we take for granted that the GPU is clocked at 550MHz? Someone else pointed out that the eDram is not likely to be asynchronous. So why would the CPU and GPU be when they're on the same die? It makes no sense to me at all!
     
  18. AzaK

    Newcomer

    Joined:
    Jun 10, 2012
    Messages:
    43
    Likes Received:
    0
    But can't that be explained away by memory bandwidth and sub-optimal engines? The Wii U has low main memory bandwidth, while PC parts generally have dedicated fast VRAM. So an engine/game that assumes VRAM and/or main memory access has a certain minimum bandwidth for reading and writing (à la 360) would suffer on Wii U regardless of the shaders, would it not?
     
  19. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,229
    Likes Received:
    2,490
    Location:
    Wrong thread
    Nintendo don't really have to be limited to commercially available configurations. RV730 (a.k.a. Mario) actually has a configuration of 320 shaders to 32 TMUs. If you're going to be running Xbox ports you're probably going to want 8 ROPs - and the ROPs are independent of the SIMD pipelines.

    Assuming Wii U does have 8 conventional ROPs, of course. There was some interesting talk about shader transparencies and no-one has identified the ROPs on the die with certainty!

    That depends on what you anticipate your workload being. If your projections show you'll be texture bound with 80 shaders to 4 TMUs you'd probably want a different ratio. RV730 has 40 shaders to 4 TMUs, and AMD's low end GPUs maintained this ratio right up to the HD 6350.

    I'm not aware of BLOPs 2 on the Wii U having higher res shadows, and DF doesn't mention it. Early screenshots showed the opposite, but that was probably fixed before release. Do you have a particular comparison in mind?

    CPU and GPU are on the same package, but on different dies and at different clockspeeds, going by Marcan's figures. Wouldn't it be interesting if the eDRAM was clocked differently to the GPU? Renesas eDRAM is supposed to go up to 800 MHz!
     
  20. terraincognita

    Newcomer

    Joined:
    Feb 12, 2013
    Messages:
    16
    Likes Received:
    0
    Teasy was probably referring to the dynamic resolution on the PS3/360 versions:

    http://www.eurogamer.net/articles/digitalfoundry-trine-2-face-off

    From DF (context is about PS3's FXAA implementation). Trine 2 uses Frozenbyte's own proprietary dx9 engine (I think) and they claim they were able to port the game from PC -> Wii U in two days, further tweaking/optimizing it from there. They also claim they were able to port the expansion without trouble which wouldn't be the case with the PS3/360 (possibly because of the higher RAM pool, obviously if they really wanted to they could port it to PS3/360 with optimizations/modifications).

    In the DF face-offs for BO2, Batman, and ME3 they mention the assets being a close match for the 360 version. On a superficial level the Wii U is a closer match to the 360 than it is to the PS3 (3 PPC cores and a radeon GPU + eDRAM). Now at this point in the lifecycle of current gen consoles, cross-platform engines/middleware are optimized to the different hardware of PS3/360 and devs are able to achieve somewhat a level of parity with many multi-platform games.

    Last year a GAF user named IdeaMan mentioned poor CPU performance being reported with multiplatform ports and a need to optimize in the middleware/engine (probably not something many publishers/devs would be interested in). The context indicated (for me anyways) 360 -> Wii U issues. ME3 and Batman are both UE3 games and Epic lists Wii U as an officially supported platform for UE3. After the last E3, DF wrote a bit about Batman on Wii U. The draw distance was lower but there were improvements in textures and lighting including stuff not in the PC version (IIRC). The game that shipped though had none of these extras and again matched the 360 assets (with a poor framerate during combat and no 3d).

    This might sound like a dumb question (forgive me if I sound ignorant, it's my first post). Since the CPU/GPU/RAM architecture is a bit different than the 360, if (big if) the engines/middleware used were based off of a 360 "path" what kind of an impact would it have? If that's even how things like that work. If Trine 2 was really ported from the PC version rather than from the 360 version, could that be a potential reason it fared better than others? That's assuming the other titles were even ported from the 360 "path" to begin with of course.

    I get what you guys are saying about how if it was a better GPU compared with the 360's (better meaning more than the 360's 240 shaders) the results should be better than they were. I'm curious if there are any 'tricks' being used by the 360 which would give a later/improved architecture with slightly more shaders some issues in porting. I remember my nvidia 460 gtx getting a lower 3dmark 06 score than an 8800gt that it replaced. I think the drivers improved it, but I had the feeling that DX10/DX11 cards potentially had issues with DX9 games due to architecture differences. Then again, in the case of the Wii U that wouldn't apply in Trine 2's case if it's a DX9 game (I think it is anyways, not that Wii U uses DX of course but you know what I mean in terms of feature set).

    I'm doubting the 320 shaders based on the size comparison, but if it is 40nm and if the SIMD blocks are larger than expected, perhaps it has more than the anemic 160 shaders you're speculating about. If it does only have 160 shaders, the 3rd party stuff is more impressive than I thought (for the hardware; unimpressive compared to the current gen). Nintendo fans with GPGPU hopes will be even more disappointed though.
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.