Wii U hardware discussion and investigation

Okay, let's see the arguments presented:

- function shows that an HD5550, with a core configuration matching what people are speculating for Wii U (320SP, 16TMU, 8ROP, 550MHz) vastly outperforms an XBox 360, and that XBox 360 slightly outperforms Wii U in most cases (games from several developers)

- You show that an HD4670 (also 320:16:8) at 800MHz performs similarly to Wii U while pushing twice as many pixels

The only thing that makes this comparison tricky is the big difference in memory hierarchy. The video cards have something very different from Wii U, that much we know for sure.

Memory hierarchy is far from the only thing that lets these comparisons down. We're talking about taking a game from one system (PC), comparing it to a version of the same game on another system (360), and then using that to compare to different games on a third system. Then using that to come to a conclusion about one particular part of the third system's architecture :???:
 
Memory hierarchy is far from the only thing that lets the comparison down. We're talking about taking a game from one system (PC), comparing it to a version of the same game on another system (360), and then using that to compare to different games on a third system. Then using that to come to a conclusion about one particular part of the third system's architecture. I agree the other argument against was no better; neither one is worth anyone's time IMO.

I meant hardware comparison. But I don't agree with you that these cross-platform comparisons aren't worth anyone's time. The entire argument about SPs is presupposing at least a moderately modern-Radeon like design, and I don't see why AMD would be tasked with something grossly different. Furthermore, I don't see why it'd have worse driver overhead than the PC. Developers might need some work to better utilize the eDRAM but they should have no problem using it for render targets at least as competently as they would have on XBox 360.
 
- function shows that an HD5550, with a core configuration matching what people are speculating for Wii U (320SP, 16TMU, 8ROP, 550MHz) vastly outperforms an XBox 360, and that XBox 360 slightly outperforms Wii U in most cases (games from several developers)

function showed a link to a review with FPS numbers for some games on an HD5550, using IQ settings that very few people in the world would know how to relate to the IQ settings used on the X360.

Furthermore, there's the "even though it's in a PC" argument that I find equally flawed. Even setting aside that "horrible, inefficient" API overhead (which, BTW, I'm starting to think is terribly overrated), the PC used in that review had a Core i7 (roughly 4x larger than Xenon in transistor count) overclocked to 3.8GHz, along with 6GB of triple-channel 1500MHz DDR3 memory.

That said, function didn't really show anything that can be objectively used to compare Xenos to a HD5550, Xenos to Latte or Latte to a HD5550.

Yes, a HD5550 should be a lot faster than Xenos at the same clock, at least because it has about twice as many transistors inside and uses an architecture that is reportedly more efficient, but that review doesn't really say anything about it.


I do think the CPU+RAM is holding back the entire system, and that maybe they built the system with GPGPU in mind.
Iwata did claim that GPGPU was an important point for the console. And although I think he and many others need to go, because the Wii U is a terribly short-sighted console from a hardware POV, he wouldn't be so stupid as to suggest that developers should be using the GPGPU capabilities of a 160-ALU VLIW5 GPU.
 
It's been brought up like a million times now. If the Wii U games were consistently CPU limited (including due to insufficient bandwidth) and had GPU resources to spare, they'd be running at higher resolutions. CPU utilization doesn't scale with resolution. No one is saying "horrible, inefficient" API overhead per se, just that the PC is probably not going to have a software advantage.

The review does say that all the games were tested at the highest quality settings, and I think it's pretty safe to say that it'll be at least as high quality as XBox 360. It's not important to determine an exact ratio in performance here, all that matters is establishing how much better this PC is doing at minimum.

Comments about GPGPU from Nintendo don't really mean anything. They amount to little more than a feature bullet point, not a statement that they play a big role in how they expect games to be developed (or that that's how games actually are being developed).
 
Right, it was my point that the GPU is 800 GFLOPS, has 3 times the RAM available to it, and a much faster CPU.

800 GFLOPS? Where do you get that the 4670 is an 800 GFLOPS card? It's 480 at standard clocks, and so double the 360, but it has only 50% more theoretical fillrate. With the Wii U specs you're pushing (16:320:8), the 4670 has only 35% more GFLOPS and fillrate, but while the 4670 can run at more than twice the resolution of the 360, the Wii U can't quite keep up with the 360 at the same resolution.
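For anyone following the arithmetic, the GFLOPS figures being traded here come from a simple peak-rate formula: each VLIW5 shader unit can issue one multiply-add (2 FLOPs) per clock. A quick sketch, using the SP counts and clocks quoted in this thread (not measured values):

```python
# Peak-FLOPS sketch for VLIW5 Radeons: each shader unit (SP) can issue one
# multiply-add per clock, i.e. 2 FLOPs. Clocks and SP counts are the figures
# quoted in this thread, not measured values.
def peak_gflops(shader_units: int, clock_mhz: float) -> float:
    return shader_units * 2 * clock_mhz / 1000.0

hd4670 = peak_gflops(320, 750)   # "480 at standard clocks"
xenos  = peak_gflops(240, 500)   # the 360's Xenos
wiiu   = peak_gflops(320, 550)   # the 320 SP Wii U assumption

print(hd4670, xenos, wiiu)                           # 480.0 240.0 352.0
print(f"4670 over Wii U: +{hd4670 / wiiu - 1:.0%}")  # +36%, the "35% more" above
```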

The closest configuration to the one you think is in the Wii U is the HD 5550 btw, as it (probably) has the same number of TMUs.

Your GDDR5 comment makes no sense btw, as 360 and Wii U lack this as well.

That 4670 is much faster than the 360 and the Wii U, but without GDDR5 will be bandwidth constrained and not able to fully show its stuff. The 360 and Wii U have edram. The 360 has a monster load of BW for the ROPs, and the Wii U probably does too. I originally thought there was a BW constraint on the Wii U ROPs affecting transparencies and MSAA but now I'm not so sure.

I am starting to think you are another poster on Neogaf...

I like NeoGaf, but I don't have an account.

... just come here to sound like you are more technically apt; not sure it is working though, as your inability to read posts properly and notice that that PC has much more power than the Wii U should give you some pause before stating that it hurts my point. That build is clearly struggling with the game despite having over twice the power we are assuming the Wii U to have; well, 4 times the power in your case. Even given the resolution change, you can't make up that much ground with only 352 GFLOPS and a much slower CPU.

I suggest you run the numbers on the 4670 again, you're well off. When you check 'em you'll see that you're actually supporting my point.

The reality is, the only thing that points to 160SPs right now and completely dismisses 320SPs is a group of fanatics who have clearly never ported games to different platforms and would like nothing more than for the Wii U to lack the basic ability to outperform last-generation consoles. I've left you to your ridiculous theories of R600 integration and 160SPs; from developers' comments we have been told that it is R700, Matt from NeoGAF clarified it to me directly in the Latte thread, and quite a number of developers have said the GPU is about 50% better. Well... considering 320 is about a third more SPs than 240, and 160 is about a third in the opposite direction, I think it is premature to say the Wii U is 160SPs and that 320SPs is out of the question.

My R6xx origin theory might be far out, but I've never claimed it was fact and I had fun exploring the idea. I'm not particularly attached to the idea, but if it was true I'd probably be more inclined to give Nintendo a pass on the incredibly low level of performance that the Wii U offers (for a 2012, $350, multiplatform "radically better graphics" dream box). Not sure where your developer chums would get info on the architectural background of the Wii U though, if detailed documentation is as hard to come by as Eurogamer reckon.

Meanwhile you should check out the Wikipedia page with the 4670 specs:

http://en.wikipedia.org/wiki/Compar...essing_units#Radeon_R700_.28HD_4xxx.29_series
 
Doesn't eDRAM draw quite a bit of power, especially with so much of it on the die?
I'm assuming that it accounts for at least 5W of the entire chip or something. DRAM needs continuous power to refresh the cells, and a decent amount of power to access, IIRC.

Also, to anyone using an e6750 or any AMD embedded chip for comparison: these chips are binned very, very thoroughly from normal Radeons for low power consumption. You cannot do something like that for a custom chip like Latte unless you want to throw out 90% of the working chips.

If you want a console-to-console comparison of perf/W without binned chips, look at the 360. At 40nm(?) it draws 90W max for CPU and GPU. If you assume a 100W TDP, with 60% of the power going to the GPU since it's larger, you get 240 GFLOPS / 60W = 4 GFLOPS/W. This is with less eDRAM eating power.

How would it be realistic for the Wii U's GPU to triple the perf/W of the 360?
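The perf/W figure above is easy to verify. A back-of-envelope sketch, where the 100W TDP and the 60% GPU share are the post's own assumptions, not measurements:

```python
# Back-of-envelope perf/W check for the 360 figures above. The 100W TDP and
# the 60% GPU share are the post's assumptions, not measured values.
gpu_share = 0.60
tdp_w = 100.0
xenos_gflops = 240.0

gpu_watts = gpu_share * tdp_w             # 60.0 W
perf_per_watt = xenos_gflops / gpu_watts  # 4.0 GFLOPS/W
print(gpu_watts, perf_per_watt)           # 60.0 4.0
```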
 
Has anyone taken a stab at SRAM density to try to see if this could really be made on TSMC 40nm? It looks to me like the 8KB cell making up the 1MB of SRAM in the upper left hand corner is about 0.0306mm^2. That gives a cell size of 0.467um^2 per bit.

TSMC claims 0.202um^2 to 0.324um^2 for their 40G process, vs around 0.5um^2 for 65nm. So this doesn't seem like TSMC 40nm. TSMC does have very good density (you can see it for instance in comparison with Cortex-A9 die size in Tegra 2 and 3 vs something in a 45nm Samsung die) so it's still possible that this is what they're getting with NEC's 40nm process.

EDIT: The 16KB cell in the 128KB block near the Starlet port (Marcan's claims seem to fit very well when compared with the Hollywood die shot he scrounged up) seems to be about 0.354mm^2, and the 2KB block in the Starlet cache about 0.5mm^2. Maybe those are dual-ported? They look different. Can someone check my numbers to make sure I'm not totally off on these?

Renesas says ~0.3um^2 for their 40nm process: http://www.nec.com/en/global/techrep/journal/g09/n01/pdf/090113.pdf
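The per-bit arithmetic in the first measurement can be double-checked directly; the 0.0306mm^2 per 8KB figure is the die-shot estimate from the post above:

```python
# Double-checking the bit-cell arithmetic: block area divided by capacity in
# bits. The 0.0306 mm^2 per 8 KB measurement is the die-shot estimate above.
def cell_um2_per_bit(area_mm2: float, size_kib: int) -> float:
    bits = size_kib * 1024 * 8
    return area_mm2 * 1e6 / bits   # convert mm^2 to um^2, then divide per bit

print(round(cell_um2_per_bit(0.0306, 8), 3))  # 0.467, matching the figure above
```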
 
Doesn't eDRAM draw quite a bit of power, especially with so much of it on the die?
I'm assuming that it accounts for at least 5W of the entire chip or something. DRAM needs continuous power to refresh the cells, and a decent amount of power to access, IIRC.

That eDRAM is going to draw a lot less W/GB/s than external DDR3 or especially GDDR5. Those things are included in the TDP for discrete GPUs.

If you want a console-to-console comparison of perf/W without binned chips, look at the 360. At 40nm(?) it draws 90W max for CPU and GPU. If you assume a 100W TDP, with 60% of the power going to the GPU since it's larger, you get 240 GFLOPS / 60W = 4 GFLOPS/W. This is with less eDRAM eating power.

How would it be realistic for the Wii U's GPU to triple the perf/W of the 360?

360 and Cell use very inefficient CPU designs. No reason to assume 60% of the power consumption is in the GPU. If you shrunk a Pentium 4 to a modern process but kept it near its top clocks the perf/W would probably be terrible too.
 
It's been brought up like a million times now. If the Wii U games were consistently CPU limited (including due to insufficient bandwidth) and had GPU resources to spare, they'd be running at higher resolutions. CPU utilization doesn't scale with resolution.

DDR3 is not really the perfect solution for 1080p gaming even with a 256bit wide bus (let alone narrower ones which is the case here for sure), so it might be that there are "spare" GPU resources indeed, but only shader performance/processing power and not bandwidth wise.

360 and Cell use very inefficient CPU designs. No reason to assume 60% of the power consumption is in the GPU.
TDP != power consumption; it's close at full load, but still far from the same thing.
 
Doesn't eDRAM draw quite a bit of power, especially with so much of it on the die?
Im assuming that it accounts for like 5W of the entire chip or something at least. Dram needs continuous power to refresh the cells and decent amount of power to access IIRC.

The opposite seems to be true these days, although I'm now wandering too far afield from where I could characterize which factors are likely to most strongly affect something like the Wii U's design.

SRAM at finer geometries and at the usual speeds it operates leaks quite a bit. DRAM or the equivalent on-die can be configured such that the comparison is between 6-8 leaky SRAM transistors per cell that leak all the time versus a capacitor and transistor pair that occasionally need refresh.
 
DDR3 is not really the perfect solution for 1080p gaming even with a 256bit wide bus (let alone narrower ones which is the case here for sure), so it might be that there are "spare" GPU resources indeed, but only shader performance/processing power and not bandwidth wise.

But that's completely ignoring eDRAM/eSRAM. Guess we'll see first hand how much of a difference it makes when reviews compare Haswell GT3 with and without Crystalwell, although that'll probably be using more than 32MB.
 
The 160SP idea seems bizarre (in a "why would they do that" kind of way), but I'm inclined to agree with function here; the performance we're seeing fits it much better. You can find HD 6450 (a 160SP part, although with a higher core clock than the Wii U) results that seem competitive with XBox 360 too, although I haven't looked that deeply into it.

That's a good idea, I'll try and find some 6450 benches. Wikipedia reckons 4 ROPs, so I'll need some low-res results, and a 64-bit bus, so I'll have to look for the GDDR5 variant.

A good matrix multiply kernel will push down main memory access overhead asymptotically towards zero because it grows at n^2 while FMADDs grow at n^3. It's a good test for when you want to try to show off something near peak FLOP performance.
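The n^2-vs-n^3 point can be made concrete with a rough sketch of arithmetic intensity for an n x n matrix multiply (idealized: each matrix touched exactly once, 4-byte floats; real caches make the traffic somewhat higher):

```python
# Why a matrix multiply can approach peak FLOPs: for n x n matrices the
# multiply-adds grow as n^3 while the (ideal) memory traffic grows as n^2,
# so arithmetic intensity (FLOPs per byte) rises linearly with n.
def matmul_intensity(n: int, bytes_per_elem: int = 4) -> float:
    flops = 2 * n ** 3                    # one FMADD (2 FLOPs) per (i, j, k)
    traffic = 3 * n * n * bytes_per_elem  # read A and B, write C, each once
    return flops / traffic

for n in (4, 64, 1024):
    print(n, matmul_intensity(n))  # intensity grows proportionally to n
```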

I'd like to see generated assembly for both cases (I saw blu offered, yes I'm interested!) but it's not hard to imagine how Broadway could get better IPC than Bobcat here. It has the advantage of FMADDs, three-way addressing, and more registers (well, more register addressing flexibility anyway, 32 2x32-bit registers vs 16 4x32 - this is assuming x86-64 was used). While the peak FLOPs are similar the FMADDs free up a dual-issue opportunity for loads, stores, and flow control stuff, plus it can issue branches outside the normal dual-issue. x86 can soak up some of that back with load + op but the compiler might be afraid to do that since IIRC it requires alignment for SSE.

This doesn't mean that it applies to general purpose code though, although it might somewhat transfer to some other FP heavy stuff.

Perhaps I was underestimating the CPU. Having seen Bobcat vs Atom benches (where Atom wins at flops, but Bobcat wins at just about everything else) I had it in my head that it would be a walkover for Bobcat. But I guess Bobcat has different design goals and constraints, so you can't just make assumptions like that.

Hopefully blu can do some additional tests to give us a more rounded picture. Access latency for the eDRAM vs the GDDR3 would be interesting, as it might give us some insight into the Wii U and possibly even Durango. That leaked Xbox 360 diagram reckoned 525 cycles for main memory access for Xenon, which seems huge (although maybe it was a little lower after the CPU down-clock to 3.2GHz).
 
It's also worth mentioning Trine 2, a game which runs with better shader quality on Wii U and at a higher resolution than 360 or PS3.

Same resolution, and the Wii U won't have the copy-out overhead that the 360 does for at least 2 buffers.
 
That's a good idea, I'll try and find some 6450 benches. Wikipedia reckons 4 ROPs, so I'll need some low-res results, and a 64-bit bus, so I'll have to look for the GDDR5 variant.

If you could find some comparisons of the DDR3 and GDDR5 versions that'd give us an interesting data point of how bad things can get near the bottom of memory bandwidth limits.

Most of the reviews seem to use the GDDR5 version. Phoronix covered the DDR3 one (http://www.phoronix.com/scan.php?page=article&item=amd_radeon_hd6450&num=1) but since that's Linux-only it's hard to draw much of a comparison, except between it and the other cards tested..
 
Success! This review has both the GDDR5 and DDR3 versions.

Green bar is 625 MHz with DDR3 @ 1333 (so less bandwidth in total than the Wii U main memory alone).
Blue bar is 750 MHz with GDDR5 @ 3600.

And for anyone else reading this, remember these tests are with all settings on max (so Crysis 2 suffers), and that these are 8:160:4 parts. That's right folks, only 4 ROPs!

http://www.techpowerup.com/reviews/Sapphire/HD_6450_Passive/9.html

Despite being only 20% faster, the GDDR5 version is pulling in figures 40%+ higher. And check out HAWX at 1024 x 768:

http://www.techpowerup.com/reviews/Sapphire/HD_6450_Passive/16.html

Holy crap! 76% faster for the GDDR5 version!

So yeah, double up the ROPs and TMUs and give it a chunk of fast memory and it looks like a 16:160:8 Wii U could hang with the 360 pretty well. The GDDR5 equipped 6450 can do it with only 8 TMUs and 4 ROPs.
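For context on why the gap between the two variants is so large, the raw bandwidth can be estimated from the quoted memory rates, assuming the 64-bit bus Wikipedia lists and reading the figures as effective transfer rates (the usual convention for DDR-style memory):

```python
# Rough peak bandwidth for the two HD 6450 variants in that review, assuming
# the 64-bit bus Wikipedia lists. The 1333 and 3600 figures are taken as
# effective MT/s, the usual convention for DDR-style memory.
bus_bytes = 64 // 8                      # 64-bit bus -> 8 bytes per transfer

ddr3_gbs  = 1333 * bus_bytes / 1000.0    # ~10.7 GB/s for DDR3 @ 1333
gddr5_gbs = 3600 * bus_bytes / 1000.0    # ~28.8 GB/s for GDDR5 @ 3600

print(ddr3_gbs, gddr5_gbs, round(gddr5_gbs / ddr3_gbs, 2))  # ~2.7x the bandwidth
```

So the GDDR5 card has roughly 2.7x the memory bandwidth despite only a 20% core clock advantage, which lines up with the benchmark gaps being far larger than the clock difference alone would suggest.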
 
If you're bandwidth limited, adding more TMUs and ROPs won't help you; they'll be starved. Of course, that only applies to homogeneous situations. It could make up for the parts of the frame where it isn't bandwidth limited, so you probably would see some benefit. I don't know how much the workload changes.

XBox 360 CoD:MW is 1024x600 w/2xMSAA. I expect the XBox 360 to pay less for the AA than the 6450 does, meaning I could easily see this being a bit above XBox 360 performance ignoring that (for this game anyway). Apparently most Wii U games aren't using MSAA.
 
As promised, here are the assembly listings of the mat4 x mat4 testcase testvect_intrinsic.cpp. Apologies for the wonky syntax highlight - apparently that is not among pastebin's strengths.

ppc750cl assembly - the timed innermost loop starts at L103
bobcat assembly - the timed innermost loop starts at L98
 
So now our expectation is around 170 GFLOPS? That's actually pretty interesting to me. Is it really possible that GPU architectures have advanced so much in the time since Xenos's conception that, with optimization, you can extract comparable results from a GPU with only a fraction of the raw processing power? With the same number of ROPs to boot, slower main memory bandwidth, and a weaker CPU.

I guess it bodes well for 720 and Orbis's architecture efficiencies over their predecessors.
 