As promised, here are the assembly listings of the mat4 x mat4 testcase testvect_intrinsic.cpp. Apologies for the wonky syntax highlighting - apparently that is not among pastebin's strengths.
ppc750cl assembly - the timed innermost loop starts at L103
bobcat assembly - the timed innermost loop starts at L98
Thanks.
4x4 matrix multiply involves basically 32 loads, 16 stores, 16 MULs, and 48 MADDs. This is the instruction breakdown:
PPC750:
15 lfs (load single fp32)
1 lfsx (load single fp32 using reg index - compiler uses this w/constant 0 register instead of using an imm offset of 0)
16 ps_merge00 (used to create a paired single from two separate singles; compiler is using this like a 2x broadcast instruction)
8 ps_mul (2x multiply)
24 ps_madd (2x madd)
8 psq_lx (loads a paired single)
8 psq_stx (stores a paired single)
4 add (pointer arithmetic)
2 slwi (left shifts, pointer arithmetic)
1 bdnz (flow control, probably free)
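For reference, here's a plain scalar sketch of the operation being compiled (not the actual testcase source, just the math). Writing the inner sum with the first term separated makes the 16 MUL / 48 MADD count visible, and it's exactly what the paired-single code does two lanes at a time: ps_merge00 is the broadcast of a[i][k], ps_mul starts the sum, ps_madd accumulates.

```cpp
// C = A * B for row-major 4x4 matrices, scalar reference.
void mat4_mul(const float a[16], const float b[16], float c[16]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = a[i * 4 + 0] * b[0 * 4 + j];  // 16 of these -> the MULs
            for (int k = 1; k < 4; ++k)
                s += a[i * 4 + k] * b[k * 4 + j];   // 48 of these -> the MADDs
            c[i * 4 + j] = s;
        }
}
```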
The x86 version is:
1 add (pointer arithmetic)
1 mov (pointer arithmetic)
1 lea (pointer arithmetic)
2 sal (left shift, pointer arithmetic)
1 sub (flow control)
1 jne (flow control)
12 addps (add 4x FP32)
16 movaps (move 4x FP32 - 4 of these are loads and 4 are stores, the other 8 are reg/reg)
16 movss (move 1x FP32 - these are all used as loads)
16 mulps (MUL 4x FP32)
16 shufps (re-organize lanes of 4x FP32)
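The broadcast-heavy pattern behind that instruction mix can be sketched with SSE intrinsics like this (a simplified illustration, not the compiler's exact output: _mm_set1_ps stands in for the movss + shufps broadcast pair, and I'm using unaligned loads/stores for simplicity where the real code uses movaps):

```cpp
#include <immintrin.h>

// C = A * B for row-major 4x4 matrices. Each row of C is built as a
// sum of B's rows, each scaled by one broadcast element of A's row --
// the source of the 16 broadcasts, 16 mulps, and 12 addps.
void mat4_mul_sse(const float* A, const float* B, float* C) {
    for (int i = 0; i < 4; ++i) {
        // broadcast A[i][0], start the accumulator with a plain mul
        __m128 acc = _mm_mul_ps(_mm_set1_ps(A[i * 4 + 0]), _mm_loadu_ps(B));
        for (int k = 1; k < 4; ++k) {
            // broadcast A[i][k], scale row k of B, accumulate
            __m128 bk = _mm_loadu_ps(B + k * 4);
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(A[i * 4 + k]), bk));
        }
        _mm_storeu_ps(C + i * 4, acc);
    }
}
```

With FMA (see below) each mul/add pair here would collapse into a single vfmadd, which is where the x86 side gives back the extra adds.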
I haven't made any attempt at timing analysis for the PPC750 version, so I don't yet know how many dependency stalls it's hitting (probably not terribly many). The FLOP count for the x86 version is the same, but the x86 version is doing some extra work moving data around, since it doesn't have FMADDs, and it makes no attempt to fold the loads into the adds or muls. Both versions use a bunch of scalar loads plus broadcasts to get the scalar multiplications, instead of using vector loads. But Bobcat gets artificially penalized: its shufps instructions take 2 cycles like addps/mulps do, and, alarmingly, so do the movss loads, so it's paying a lot more for its broadcasts.
Moving to Jaguar's improved ISA support could improve things. That gives you three-address arithmetic with AVX128, as well as broadcasts. FMA would improve things more.
A more heavily optimized version would probably consider some friendlier storage formats, if at all possible. I bet a lowly Cortex-A8 would fare nicely here with some hand-written ASM, since it has a vector * scalar FMADD.. that is, if you could hide the huge latency..
So now our expectation is around 170 GFLOPS? That's actually pretty interesting to me. Is it really possible that GPU architectures have advanced so much since Xenos was conceived that, with optimization, you can extract comparable results from a GPU with only a fraction of the raw processing power? With the same number of ROPs, slower main memory bandwidth, and a weaker CPU, to boot.
I guess it bodes well for 720 and Orbis's architecture efficiencies over their predecessors.
You can find some good examples of AMD vs nVidia GPUs where the difference in peak FLOP count painted a very different picture from the difference in typical game performance. That should answer your question, I hope.