Wii U hardware discussion and investigation *rename

Goodtwin · Jan 2, 2014

TheWretched said:
So then... they just removed two players from the Online Mode just because... why? I am not saying it's impossible, but there's a reason. And since the GPU and the RAM size is better than on PS360, it more or less only leaves the CPU as the culprit.

I am quite sure that the WiiU "budget" for those games is miniscule compared to the rest, so... there's the reason of "making it work".

Its certainly possible that the time it would have required to optimize the online mode simply wasnt worth it. They only spent about 3 months on the Wii U build. So its possible that the sacrifice of a couple players online made gave them the performance they wanted with no extra work, and that allowed them to focus on the single player portion of the game. I also doubt this game is multi threaded. It runs just as good on the PS3, a single core processor, so I kind of doubt the game scales across all three Espresso cores. Its probably maxing out the main Espresso core with the other two doing very little. I do believe a single core on the Xenon can do more work than a single Espresso core, but since the Xenon was in order with very limited L2 cache, it couldnt keep multiple threads fed. The Espresso on the other hand is out of order (although limited out of order) with tons of L2 cache. Flops performance will likely still be anemic compared to the Xenon and especially Cell, but developers who effectively scale across all three Espresso cores should find respectable performance.

function · Jan 2, 2014

Goodtwin said:
I also doubt this game is multi threaded. It runs just as good on the PS3, a single core processor ...

Stop.

Goodtwin · Jan 2, 2014

function said:
Stop.

Your such a tool. Contribute something for goodness sake. Members here on these boards have already spoken of the fact that the limited L2 cache didnt allow the Xenon to keep multiple threads fed. Multi threading wasnt done very effectively on the 360, and PS3 is a single core with SPE's that are limited in the type of work they can effectively do.

Shifty Geezer · Jan 2, 2014

Goodtwin said:
Multi threading wasnt done very effectively on the 360, and PS3 is a single core with SPE's that are limited in the type of work they can effectively do.

You are so utterly, utterly wrong with that, function was right in his terse reply. There's no way any sane technology enthusiast who believes themselves capable of discussing consoles on a technological level can be that confused over Cell at this point. And I in turn am extremely suspicious that you're an already banned member...

creaks · Jan 2, 2014

wtf is going on with these last few posts?

TheWretched said:
So then... they just removed two players from the Online Mode just because... why? I am not saying it's impossible, but there's a reason. And since the GPU and the RAM size is better than on PS360, it more or less only leaves the CPU as the culprit.

I am quite sure that the WiiU "budget" for those games is miniscule compared to the rest, so... there's the reason of "making it work".

As function reminded me, most mainstream genre online console games use peer to peer networks. That means one system is chosen as the host, the user chosen is the one determined to on average be closer to all the other clients.

Now, in a high player density environment, this works well enough, when there are hundreds to thousands of people online in your state or nearby states, the likelyhood of good lag free sessions and performance are high. a few outliers can be handled, sure sometimes you run into THAT GUY whos probably trying to play from korea to na or something and is jumping all over the place... But more often than not its good enough.

When you are in a low player density environment, your host system is likely to be undesirably far away from many of the clients, add in problems like users with spotty connections, and you will soon find with the more players you add in such a stretched network that everybodies becoming 'that guy', and every person you add thats not in a good range makes the problem worse.

The latency caused by such low player density was likely determined to be unacceptable at 8 people in a match, so it was lowered to 6 to mitigate the low player density issue.

The Wii u's launch userbase, and continued low user base was the reason for the change.

Deleted member 11852 · Jan 2, 2014

creaks said:
wtf is going on with these last few posts?

I don't know but I was a couple of weeks ahead of Shifty in terms of suspicions.

The latency caused by such low player density was likely determined to be unacceptable at 8 people in a match, so it was lowered to 6 to mitigate the low player density issue.

Interesting hypothesis, which is also logical and fits the facts. GET OUT OF THIS THREAD NOW!

Goodtwin · Jan 2, 2014

Shifty Geezer said:
You are so utterly, utterly wrong with that, function was right in his terse reply. There's no way any sane technology enthusiast who believes themselves capable of discussing consoles on a technological level can be that confused over Cell at this point. And I in turn am extremely suspicious that you're an already banned member...

Not so much. The reason I joined this website was because I knew I could learn things from members here. I dont mind me being told I am wrong, as long as there is some explanation. I dont claim to be anything more than a Nintendo enthusiast who is trying to acquire a better understanding of the Wii U's hardware. So many people just want to say its weak, end of conversation.

If this forum is only for seasoned techies then you can close my account. I have browsed through this thread, and while there has been tons of discussion, very little was ever considered to be conclusive. At one point it was 320 SPU's on the GPU, and then came the 160 SPU numbers. Early on everyone claimed the Wii U is memory bandwidth starved, but now between games and developer comments, that doesnt seem to be true at all.

Shifty Geezer · Jan 2, 2014

Discussion we have no problem with, but if you don't really know what you're talking about, it's far, far better to say so up front rather than make ridiculous 'technological' observations. What you did was comparable to walking into an archaeological convention and telling them your theory of the latest fossil find based on an assumption that the universe is 5000 years old - such creationists are entitled to their opinions but not to bring them to a community that's dealing with...different knowledge. Wandering into B3D with the theory that Cell was a single-core processor is almost equally incredulous.

Cell is a heterogeneous multiprocessor CPU. SPE's are capable of running any CPU jobs, just some better than others. Cell in PS3 has a single core, dual threaded PPE and 6 single core SPEs available for devs to use. Devs have been wrestling with multithreaded code for the past 7 years on both last-gen consoles.

The discussion on Wii U has stalled because there hasn't be any new info and there's really not much to go on. The latest discussion has resorted to me software comparisons with dubious comparisons between different titles on different platforms. When the number of vehicles between platforms on the same game aren't considered a fair comparison, I'm not sure there's any way forward with this discussion. The idea of using comparable software is going to at least need a modicum of understanding of the software to interpret the (generally inaccurate) observations. If one's interpretation is based on a level of understanding that PS3 games aren't even multithreaded, one has to do a lot of background learning before one can really start to try to pick apart the software signs regards Wii U.

Goodtwin · Jan 2, 2014

Thats all fair. My knowledge is pretty limited to articles written on websites like Eurogamer and IGN, so more often than not, those article dont go in depth on certain details. For example, I had always read that the SPE's on the Cell werent very good at branching code, and thus were more or less limited to graphics rendering. I have personally accepted that Wii U fits pretty close to 360 and PS3 in terms of outright performance.

I do have a question on the GPU, what is the consensus on the GPU? Tecmo KOEI, Fronzebyte, and Shin'en havent been hesitant to say the Wii U has a more powerful GPU, so how does developers impression line up with what is believed to be a 176Glfop GPU compared to a 240Gflop X360 gpu? I am curious since most websites dont talk very highly of the VLIW5 efficiency? Why are developers getting more from less?

creaks · Jan 2, 2014

Goodtwin said:
Thats all fair. My knowledge is pretty limited to articles written on websites like Eurogamer and IGN, so more often than not, those article dont go in depth on certain details. For example, I had always read that the SPE's on the Cell werent very good at branching code, and thus were more or less limited to graphics rendering. I have personally accepted that Wii U fits pretty close to 360 and PS3 in terms of outright performance.

I do have a question on the GPU, what is the consensus on the GPU? Tecmo KOEI, Fronzebyte, and Shin'en havent been hesitant to say the Wii U has a more powerful GPU, so how does developers impression line up with what is believed to be a 176Glfop GPU compared to a 240Gflop X360 gpu? I am curious since most websites dont talk very highly of the VLIW5 efficiency? Why are developers getting more from less?

I would wager the main reason cell became known for graphics rendering was because the rsx, well, needed the assistance to get that multiplatform good enough parity, rather than any inherent shortcoming, though it is rather on the nose to point out its spu's dont excel at tasks that dont benefit from paralellization.

Flops are not consistant between architectures. Thats really the problem with the run away marketing/buzz wording on flops these days.

This is the reason why we have standardized 'X86 flops', to use as, well standardized measurement between architectures. What you would do is compare the architectures 'native flops', to how many 'x86 flops' the work would be, and then you could compare to other architectures that had done the same.

Which inevitably brings us to the realization that this recent flop pushing is an unfinished picture.

The flop count is only how many floating point operations can be performed in a certain measurement of time... But it doesnt tell us how many flops are needed to complete a certain operation.

For example square root is typically a 1 flop operation for gpu's.

But its a 15 flop operation on x86 architecture.

(following this for this example http://ai.stanford.edu/~paskin/slam/javadoc/javaslam/util/Flops.html)

Finding an apples to apples comparison is not as easy as all the flop slinging would lead one to believe.

jlippo · Jan 3, 2014

Goodtwin said:
Thats all fair. My knowledge is pretty limited to articles written on websites like Eurogamer and IGN, so more often than not, those article dont go in depth on certain details. For example, I had always read that the SPE's on the Cell werent very good at branching code, and thus were more or less limited to graphics rendering.

As always, when searching for answers go to the source.
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/7A77CCDF14FE70D5852575CA0074E8ED
That should cover most of your questions on architecture.

Let's not derail the thread more.

creaks · Jan 5, 2014

Perhaps we should stop focusing on the width of the bus, and focus on the length, the latency. It must be awful short compared to the systems we are comparing the bandwidth too.

What effect does such a short distance/fast turnaround have on performance, never really considered latency too much with graphics performance, but maybe its time to take a look. How much of a boon is info arriving faster over sending more at once? how fast would it have to be to even out with a wider but slower/longer bus? To surpass? Can we get an estimate to the time in ns?

Smaller transfers would see the most benefit, what does this mean for optimizing throughput? More unique smaller tiles to be assembled into larger ones after transfer?

Shifty Geezer · Jan 5, 2014

creaks said:
What effect does such a short distance/fast turnaround have on performance?

Virtually none. See the DF interview with the XB1 architects when asked about the low-latency benefits of ESRAM in XB1. GPUs are designed around high-latency stores and mitigating them with massive pipelining and cache structures. Because Wii U is using a conventional AMD architecture AFAWK, it'll have the same memory management and so not have anything much to gain from extremely low latency eDRAM.

creaks · Jan 5, 2014

Shifty Geezer said:
Virtually none. See the DF interview with the XB1 architects when asked about the low-latency benefits of ESRAM in XB1. GPUs are designed around high-latency stores and mitigating them with massive pipelining and cache structures. Because Wii U is using a conventional AMD architecture AFAWK, it'll have the same memory management and so not have anything much to gain from extremely low latency eDRAM.

Im going to definately read that interview now. However nothing is a VERY big pill you are asking me to swallow. I agree that amd will have similar ddr3 memory controllers to main memory, but how may 7xx-8xx can you think of with 32Mb of edram embedded onto the processor itself? Doesnt that already change the architecture from conventional commercial gpu products?

Okay, ive read this interview, and I just dont see this as an applicable situation.For starters, we are looking at great bandwidth, so, who cares about latency, its a traditional gpu, no need to consider latency sensitivity. And past that, the gulf in system power is just too great, the assets these systems are going to be playing with are just going to be too different. With bandwidth like this, traditional throughput optimizations will work just fine. Why work towards a latency sensitive gpu when youve set up such high bandwidth? No amount of short latency in the world can bridge the gap between xbones and wii u's bandwidth unless it has a flux capacitor on the mc that sends the data back in time.

The wii u I see a very different situation, it does NOT have bandwidth like the xbones esram, and I see a very real situation where a large chunk of data sits waiting for bandwidth, where, had it been organized in smaller more numerous packages it would have been sent, and recieved much, much more quickly. Usually, the 'start up cost' of individual transfers is a pain, which is why it is mitigated by moving more at once. Here, we cant move as much at once, but the low latency mitigates the start up cost.

Being latency has greater impact on smaller more numerous data transfers, I see it being something that is of concern to how to approach the wii u, but not the xbone/ps4, that deal with much larger transfers of data... Too large of transfers, I was more thinking along the lines of ps360 as comparators, than xbone/ps4

Shifty Geezer · Jan 5, 2014

creaks said:
Being latency has greater impact on smaller more numerous data transfers, I see it being something that is of concern to how to approach the wii u, but not the xbone/ps4, that deal with much larger transfers of data...

Not really. They're all dealing with mesh data and textures and shader programmes. The GPUs are architecturally the same in how they queue workloads and preload required data into caches. Don't look at XB1 but every other AMD (and nVidia) GPU - they all work the same way. And XB1's GPU also works the same despite have low-latency ESRAM.

Unless AMD have completely rejigged the underlying GPU architecture, it'll work exactly the same. The GPU will work on whatever it has in its cache and precache what it's going to work on next. In discussing ESRAM's low latency for XB1, some devs suggested it may be useful for some compute workloads. For general graphics though it brings nothing. The hardware isn't designed to use it.

creaks · Jan 5, 2014

Shifty Geezer said:
Not really. They're all dealing with mesh data and textures and shader programmes. The GPUs are architecturally the same in how they queue workloads and preload required data into caches. Don't look at XB1 but every other AMD (and nVidia) GPU - they all work the same way. And XB1's GPU also works the same despite have low-latency ESRAM.

Unless AMD have completely rejigged the underlying GPU architecture, it'll work exactly the same. The GPU will work on whatever it has in its cache and precache what it's going to work on next. In discussing ESRAM's low latency for XB1, some devs suggested it may be useful for some compute workloads. For general graphics though it brings nothing. The hardware isn't designed to use it.

Compute has been on my mind, definately not compute shaders in wii u, but geometry shaders could be used fairly flexibly for some compute.... low latency is definately a boon for interaction... But im not looking into it right now. Too much at once I suppose.

32Mb isnt usually what one talks about with a gpu cache. Usually its what, a few texels worth of data for texture sampling? Which is why I guess gpu compute focuses on er, well, compute power where seqential and random access pattern memory performance arent such a big deal. Look at this, I specifically said I wouldnt talk about compute, I am so damn easily distracted.

All right, usually a gpu has to get texel data off chip when texture fetching. To hide that latency we get to exactly what you brought up, the gpu works on the next fragment in the meantime, until the data arrives... But with so much memory now on chip, arent we looking at a situation where we dont have nearly as much latency to hide?

Shifty Geezer · Jan 5, 2014

creaks said:
But with so much memory now on chip, arent we looking at a situation where we dont have nearly as much latency to hide?

Possibly, but it makes no difference. If a GPU architecture is capable of hiding/dealing with 100 ms RAM latency, attaching that same GPU to 10 ms RAM won't really get you anything. You'll still be working within the limits of the GPU's internal memory management.

creaks · Jan 5, 2014

Shifty Geezer said:
Possibly, but it makes no difference. If a GPU architecture is capable of hiding/dealing with 100 ms RAM latency, attaching that same GPU to 10 ms RAM won't really get you anything. You'll still be working within the limits of the GPU's internal memory management.

Which begs the question WHY attach to said low latency memory, which of course nintendo aint answering... So we get to foolaround for a bit. I still think there is something here, there is just too dang much memory on die for me to dismiss it. Perhaps its not just about texture memory. How many tlb levels could be in all that memory? Maybe its because amd memory latency kind of sucks compared to the competition and it affects performance...

I dont know how drastically a gpu would need to be changed to specifically target low memory latency, but I dont believe its necessary just to see benefits of fixing bad latency issues. I recall amd getting spanked quite bad in Sandra benchmarks, because of its gpu latency. 6850 and llano were horribad. Guess hiding that latency wasnt working out to well either.

Constant memory performance was horrible because of the latency. Suggested just dont use it. I remember that, it was harsh.

Shared memory latency was another amd failure. But not nearly as abysmal as constant, suggested copying constant to shared to improve performance by 50%.

Private memory failed, hard, overspill was nightmare fuel.

Im thinking AMD could greatly benefit from fixing memory latency issues on its gpu's.
Maybe nintendo figured the same. That damn die sure has a lot of its footprint taken up by memory.

function · Jan 5, 2014

To match the edram's likely bandwidth of 35 GB/s using Nintendo's preferred DDR3 1600 memory would have required another 192-bit external bus and another 12 memory chips. That'd be a 256-bit bus total.

The GPU would have needed to be physically larger (and more expensive) just to accommodate the I/O from the chip, the board would need to be larger (look at the 360's DDR3 memory arrangement), and power draw would almost certainly have been higher too.

What Nintendo chose was almost certainly cheaper and smaller and cooler (and therefore smaller and cheaper) than using an external memory bus to do the job, and it also gave the possibility of shrinking at some point in the future.

They could probably have got away with DDR 2133 and a 128-bit bus, going by Richland performance, but that's two steps of memory speed up from the DDR1600 that Nintendo chose, and would still have given them lower frame buffer B/W.

HTupolev · Jan 5, 2014

creaks said:
Which begs the question WHY attach to said low latency memory

It's possible you're just looking at this from the wrong angle.

There doesn't necessarily need to be a reason for the low latency, when the bandwidth advantages are a sufficient explaination. Embedded DRAM allows for bandaiding of DDR3 memory bandwidth while maintaining a simple PCB, overall inexpensive BOM, low power consumption, etc...

Low latency might be a non-bothersome side effect, as opposed to a goal in itself.

Wii U hardware discussion and investigation *rename

Goodtwin

function

None functional

Goodtwin

Shifty Geezer

uber-Troll!

creaks

Deleted member 11852

Guest

Goodtwin

Shifty Geezer

uber-Troll!

Goodtwin

creaks

jlippo

creaks

Shifty Geezer

uber-Troll!

creaks

Shifty Geezer

uber-Troll!

creaks

Shifty Geezer

uber-Troll!

creaks

function

None functional

HTupolev

Similar threads