Wii U hardware discussion and investigation

It's possible you're just looking at this from the wrong angle.

There doesn't necessarily need to be a reason for the low latency when the bandwidth advantages are a sufficient explanation. Embedded DRAM allows the DDR3 memory bandwidth to be band-aided while maintaining a simple PCB, an overall inexpensive BOM, low power consumption, etc.

Low latency might be a non-bothersome side effect, as opposed to a goal in itself.

Most likely, but it's hard to overlook how badly AMD GPUs compare specifically because of memory latency.


http://www.sisoftware.net/?d=qa&f=gpu_mem_latency


Putting a band-aid on that would be a non-bothersome side effect as well.
 
Which begs the question WHY attach to said low-latency memory, which of course Nintendo isn't answering...
As HTupolev says, it's there to address the issue of (cheap) bandwidth. The latency of the eDRAM is immaterial to the design. If a GPU has to wait on main RAM latency for only one in a million of its memory reads, then a large improvement in main RAM latency is only going to yield minuscule gains, because those waits are such a small part of a GPU's operation.

Most likely, but it's hard to overlook how badly AMD GPUs compare specifically because of memory latency.
I won't rule out the possibility 100%, but MS are in exactly the same position with exactly the same option. That ESRAM sure takes up a lot of the die space, like Wii U. AMD GPUs sure are slow accessing main RAM, but then all GPUs are, and that sure could be improved. So how come MS + AMD hasn't resulted in an uber-efficient ESRAM enhanced memory architecture utilising the low latency to fabulous effect? Whatever the answer, the same philosophy has to apply to Wii U. Even more so, as there's more intention to use GPU compute on XB1 than Wii U going by the silicon design. At the end of the day, GPUs aren't affected much by the slowness of main RAM because they have been designed around this system-level limitation over the past 2 decades. You can't even improve on that design's performance by using lower-latency RAM because GPUs just aren't hitting that RAM latency limit. It will be beneficial when a GPU has to access RAM directly, but that's basically for GPGPU and not graphics work. Any time you are using a GPU that way, you'd probably be better off moving that workload onto the CPU.

So in conclusion, there's no great logic nor evidence to the notion of the eDRAM providing a latency advantage that yields performance gains to the platform for the GPU. If eDRAM is directly accessible by CPU then it will yield some small benefits on those L2 cache misses.
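
To put rough numbers behind that, here's a toy Amdahl-style calculation; every figure below is invented purely for illustration:

```c
/* Toy arithmetic for the point above: if only a tiny fraction of a
 * GPU's memory operations actually stall on main RAM latency, even
 * halving that latency barely moves the total. All numbers invented. */
#include <stdio.h>

int main(void) {
    double ops          = 1e9;   /* memory operations per frame (made up) */
    double stall_frac   = 1e-6;  /* fraction that actually wait on DRAM   */
    double hidden_cost  = 1.0;   /* cycles for a latency-hidden access    */
    double dram_latency = 400.0; /* cycles when the stall is exposed      */

    double base = ops * ((1.0 - stall_frac) * hidden_cost + stall_frac * dram_latency);
    double fast = ops * ((1.0 - stall_frac) * hidden_cost + stall_frac * (dram_latency / 2.0));

    printf("Speedup from halving DRAM latency: %.4fx\n", base / fast);
    /* Prints roughly 1.0002x, i.e. noise, which is the argument above. */
    return 0;
}
```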
 
As HTupolev says, it's there to address the issue of (cheap) bandwidth. The latency of the eDRAM is immaterial to the design. If a GPU has to wait on main RAM latency for only one in a million of its memory reads, then a large improvement in main RAM latency is only going to yield minuscule gains, because those waits are such a small part of a GPU's operation.

I won't rule out the possibility 100%, but MS are in exactly the same position with exactly the same option. That ESRAM sure takes up a lot of the die space, like Wii U. AMD GPUs sure are slow accessing main RAM, but then all GPUs are, and that sure could be improved. So how come MS + AMD hasn't resulted in an uber-efficient ESRAM enhanced memory architecture utilising the low latency to fabulous effect? Whatever the answer, the same philosophy has to apply to Wii U. Even more so, as there's more intention to use GPU compute on XB1 than Wii U going by the silicon design. At the end of the day, GPUs aren't affected much by the slowness of main RAM because they have been designed around this system-level limitation over the past 2 decades. You can't even improve on that design's performance by using lower-latency RAM because GPUs just aren't hitting that RAM latency limit. It will be beneficial when a GPU has to access RAM directly, but that's basically for GPGPU and not graphics work. Any time you are using a GPU that way, you'd probably be better off moving that workload onto the CPU.

So in conclusion, there's no great logic nor evidence to the notion of the eDRAM providing a latency advantage that yields performance gains to the platform for the GPU. If eDRAM is directly accessible by CPU then it will yield some small benefits on those L2 cache misses.

I don't think 'super efficient' and 'fabulous effect' really apply here. Maybe more along the lines of 'this doesn't suck horribly'. I was referring specifically to GPU cache hierarchy latency rather than main RAM access (the 8650 and Llano are so bad their L1 is as slow as Fermi's L3). The AMD latency shortcomings I'm talking about are the result of the same past 20 years of designing around main RAM wait times you are talking about. They just did it poorly in comparison to others.


Those embedded RAM pools are sure to show a great improvement in, say, a SANDRA GPU/APU Cache and Memory Latency benchmark.


But I am at a loss for what that improvement would actually do. I got nothing.

Been thinking about it. The volume of stream processors hides latencies through the volume of things in flight. The fewer stream processors you have, the less you can have in flight, the less latency you can hide, and the more the desire for lower latencies increases.
 
Those embedded RAM pools are sure to show a great improvement in, say, a SANDRA GPU/APU Cache and Memory Latency benchmark.

But I am at a loss for what that improvement would actually do. I got nothing.

If the L1 cache latency sucks in the GPU then the eDRAM latency will probably be even worse, regardless of what the actual access time is for the eDRAM macro. It just goes to show that latency isn't just about how fast the RAM is; I'm sure the SRAM cells they use for L1 can be accessed very quickly, but the underlying uarch dictates a much higher latency. That doesn't necessarily mean that it sucks either, just that it's not optimized for low-latency accesses. Fermi may have this great L1 cache, but going back just one generation shows no general-purpose cache at all.

We don't really know where the Wii U's GPU eDRAM fits into the uarch picture. It could be managed by the memory controller. The latency could be almost as bad or even the same as main RAM.
 
We don't really know where the Wii U's GPU eDRAM fits into the uarch picture. It could be managed by the memory controller. The latency could be almost as bad or even the same as main RAM.
And it doesn't make any difference, as you say, because the GPU is working from other memory caches. If the various small GPU caches in SRAM are as slow as the 32 MB eDRAM, they are very poor caches indeed!
 
Been thinking about it. The volume of stream processors hides latencies through the volume of things in flight. The fewer stream processors you have, the less you can have in flight, the less latency you can hide, and the more the desire for lower latencies increases.
Not exactly. They hide latencies by usually operating on a large amount of patterned/sequential data, so it's only the kick-off to a major operation that's "slow".

Let's consider a (very) simplistic hypothetical situation. Suppose we have a chip with a very high-bandwidth memory that correctly pre-fetches things for our task, but has an initial latency of 100 cycles. Suppose that it takes a single processor in the chip 10 cycles to execute our program over one unit of data.

Suppose we have two parts: one with a single built-in processor, and one with ten built-in processors.

Now, suppose we're going to run the program over 100 units of data.
For the smaller part, we have 100 cycles of initial latency plus 10 cycles on each of the 100 units of data, for a total of 100+(10*100) = 1100 cycles. For the larger part, we have 100 cycles of initial latency, but the execution takes only 1/10th as long since we have ten processors, and our total is 200 cycles. In this scenario, a 10x larger part is only 5.5x faster; the initial latency constitutes a larger fraction of the execution time, so the utilization is actually lower than for the small part. So, for tasks of the same size, wider parts can actually be more susceptible to latency issues.

Now, suppose we make the large part execute the program over 1000 units of data. Then we have 100 cycles of initial latency plus 10*1000/10 cycles of execution time, for a total of 1100 cycles. This is the same as the time it takes the /10-size part to execute over a /10-size data set. We see that if we expand the task proportionally with the width of the part, utilization stays the same.

I think you might have been imagining that wider parts will operate on bigger data sets, and so the initial latency represents a smaller fraction of the overall memory bus usage time. The problem is that this thinking is in terms of cycles rather than data units, and wider parts usually have wider memory buses to keep their processors just as well-fed as those in smaller parts. So even though there's more data being transferred for a larger task on a larger part, the fraction of memory bus usage taken up by the initial latency might be the same, since the data transfer has a higher bitrate. And internally, whereas in the smaller part you're only stalling 1 processor during the initial access, in the larger part you're stalling 10 processors.
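
If it helps, here's the same toy arithmetic written out as code, using the same invented numbers as above (100-cycle initial latency, 10 cycles of work per data unit per processor):

```c
/* Replays the hypothetical numbers from the post: a fixed 100-cycle
 * initial latency plus 10 cycles of work per data unit, divided
 * across however many processors the part has. */
#include <stdio.h>

static double total_cycles(double processors, double data_units) {
    const double initial_latency = 100.0;
    const double cycles_per_unit = 10.0;
    return initial_latency + cycles_per_unit * data_units / processors;
}

int main(void) {
    printf("1 processor,   100 units:  %.0f cycles\n", total_cycles(1, 100));    /* 1100 */
    printf("10 processors, 100 units:  %.0f cycles\n", total_cycles(10, 100));   /*  200 */
    printf("10 processors, 1000 units: %.0f cycles\n", total_cycles(10, 1000));  /* 1100 */
    return 0;
}
```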

Hopefully this all makes some semblance of sense.
 
I think you might have been imagining that wider parts will operate on bigger data sets, and so the initial latency represents a smaller fraction of the overall memory bus usage time.
Yes.... Yes I was.


The problem is that this thinking is:
Well then. Suppose I'll be chewing on that for a while. Thanks.
 
And it doesn't make any difference, as you say, because the GPU is working from other memory caches. If the various small GPU caches in SRAM are as slow as the 32 MB eDRAM, they are very poor caches indeed!

But they're not poor if the driving point is to act as bandwidth amplification, not latency enhancement.
 
Most likely, but it's hard to overlook how badly AMD GPUs compare specifically because of memory latency.

http://www.sisoftware.net/?d=qa&f=gpu_mem_latency
I wrote this several times already and repeat it here:
These benchmarks are probably just wrong; they didn't measure what they claimed to measure. While nV GPUs certainly had some advantages in that area versus the old VLIW designs, the latency numbers for the AMD GPUs are still wrong (the LDS latency is a blatant example).
 
Since GPU architecture and preferred methods have co-developed for a long time, it stands to reason that improved latency through the memory subsystem will have limited impact for legacy code.

A more interesting question might be if having access to a relatively large pool of lower latency memory allows alternative approaches. It would seem reasonable that this would be the case.

(Furthermore, the EDRAM pool of memory is accessible to the CPU as well. Even though it has healthy amounts of private cache, there may be scenarios where keeping a larger set of data close could be useful. Also, as has already been pointed out, the latency advantage isn't necessarily all that great compared to main memory, although it would be strange if it wasn't at least roughly a factor of two or so. Note to Globalisateur: this is another example of rounding/approximation/"fuzzy numbers" as opposed to "fuzzy logic". :))
 
For graphics work I don't see how the EDRAM latency can affect the device unless they use shaders that randomly access a larger dataset. For GPGPU, yes it can be beneficial for some workloads. So the question then arises if programmable behaviour with random memory access can lead to something faster than conventional rendering. I doubt it.
 
For graphics work I don't see how the EDRAM latency can affect the device unless they use shaders that randomly access a larger dataset. For GPGPU, yes it can be beneficial for some workloads. So the question then arises if programmable behaviour with random memory access can lead to something faster than conventional rendering. I doubt it.

Oh, I'm rather convinced that a large pool of fast memory can be quite useful. But it won't be particularly useful if you are running typically latency tolerant code on hardware that is otherwise optimized for exactly those use cases. Having fast (bandwidth) local memory can relatively straightforwardly speed up buffer accesses and be utilized as local cache. Coming up with algorithms to do NewNeatStuff(tm) however requires more, and it is pretty much impossible to evaluate the possibilities from the perspective of the old paradigm, and therefore, the armchair. Someone in the trenches might want to chip in.

My guess is that the bulk of code will be business as usual, but that there will be instances where someone comes up with ideas to do specific things that suddenly has become viable when you don't have to trudge all the way out to main memory to poke around in your data. I would be particularly curious about whether having the CPU and GPU accessing the same area in EDRAM could be used for something worthwhile. Quick Read/Test/Modify/Write turnaround has to be useful for something, even in graphics. :) But it's not my field, so I have no idea for what.
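
Purely as a CPU-thread analogy for that quick read/test/modify/write turnaround, here's a toy sketch; nothing about it is Wii U specific, and the names and sizes are made up:

```c
/* A toy analogy using two CPU threads for the idea of two agents
 * (think CPU and GPU) taking quick turns reading, testing, modifying
 * and writing the same small pool of fast shared memory.
 * Nothing here is Wii U specific; names and sizes are invented. */
#include <pthread.h>
#include <stdio.h>

#define POOL_WORDS 8
#define ROUNDS     4

static int shared_pool[POOL_WORDS];   /* stand-in for a shared eDRAM region   */
static int turn = 0;                  /* 0 = "CPU" side writes, 1 = "GPU" side */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *gpu_like_worker(void *arg) {
    (void)arg;
    for (int round = 0; round < ROUNDS; ++round) {
        pthread_mutex_lock(&lock);
        while (turn != 1)
            pthread_cond_wait(&cond, &lock);
        for (int i = 0; i < POOL_WORDS; ++i)
            shared_pool[i] *= 2;       /* read, modify, write in place */
        turn = 0;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t worker;
    pthread_create(&worker, NULL, gpu_like_worker, NULL);

    for (int round = 0; round < ROUNDS; ++round) {
        pthread_mutex_lock(&lock);
        while (turn != 0)
            pthread_cond_wait(&cond, &lock);
        for (int i = 0; i < POOL_WORDS; ++i)
            shared_pool[i] += 1;       /* "CPU" tweaks the same data in place */
        turn = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    pthread_join(worker, NULL);
    printf("pool[0] after %d round trips: %d\n", ROUNDS, shared_pool[0]); /* 30 */
    return 0;
}
```

Whether anything graphics-related actually benefits from that kind of fine-grained turnaround is exactly the open question.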
 
Oh, I'm rather convinced that a large pool of fast memory can be quite useful. But it won't be particularly useful if you are running typically latency tolerant code on hardware that is otherwise optimized for exactly those use cases. Having fast (bandwidth) local memory can relatively straightforwardly speed up buffer accesses and be utilized as local cache. Coming up with algorithms to do NewNeatStuff(tm) however requires more, and it is pretty much impossible to evaluate the possibilities from the perspective of the old paradigm, and therefore, the armchair. Someone in the trenches might want to chip in.

My guess is that the bulk of code will be business as usual, but that there will be instances where someone comes up with ideas to do specific things that suddenly has become viable when you don't have to trudge all the way out to main memory to poke around in your data. I would be particularly curious about whether having the CPU and GPU accessing the same area in EDRAM could be used for something worthwhile. Quick Read/Test/Modify/Write turnaround has to be useful for something, even in graphics. :) But it's not my field, so I have no idea for what.

I'm curious about this. Are the memory pools still broken down into virtual system and graphics sectors? Isn't this a big thing with the PS4 and AMD's APUs? That there's a unified memory architecture, but we're just now getting to the point where the CPU and GPU can actually work on the same data set in memory?
 
In other news, the GamePad has been "hacked":

https://docs.google.com/presentatio...v7-YH_A0LZO0Phxedh9deiE/edit?pli=1#slide=id.p

Some high/low lights:

- Custom ARM SoC with embedded H.264 codec. Camera images are also encoded into H.264.
- ST micro for sensors
- Broadcom 5 GHz Wifi
- WPA2/AES-CCMP is used with WPS; they managed to bypass WPS (the weakest link)
- Streaming to the GamePad works from a PC, but still buggy

I'm a bit surprised they didn't use encrypted firmware if they are so keen on protecting their stuff... the WPS basically got hacked because the code could be analyzed... I wouldn't call reading out a SPI NOR flash "expert tooling".
Maybe they found the code because the entire console has been hacked already? I wonder why they aren't interested in converting the console into a multimedia center or something like that. It could be a neat piece of hardware for those functions.

On another note, Slightly Mad Studios say that their Project Cars game will show the hidden power of the Wii U to the world.

http://gamingbolt.com/project-cars-...ii-u-slightly-mad-studios#aVm7w6wcLVsbFF7T.99
 
Bah. Car games are mindnumbingly boring. Unless the cars have machine guns, missiles and mine launchers, nitro boosts and race through an obstacle-filled course a la Rock'n'Roll Racing or such.

Also, car games = typically all rigid polygon models, usually no deformation whatsoever, certainly all-rigid track and terrain. Not technically exciting in the slightest.

The guy sounds like he's PR spinning for the fanboys to hype his game. Very likely the "hidden power" he's talking about is simply good art and not much more, it's really hard to see where there would be much in the way of hidden power in a 35W, 40nm silicon process system.
 
Bah. Car games are mindnumbingly boring. Unless the cars have machine guns, missiles and mine launchers, nitro boosts and race through an obstacle-filled course a la Rock'n'Roll Racing or such.

Also, car games = typically all rigid polygon models, usually no deformation whatsoever, certainly all-rigid track and terrain. Not technically exciting in the slightest.

The guy sounds like he's PR spinning for the fanboys to hype his game. Very likely the "hidden power" he's talking about is simply good art and not much more, it's really hard to see where there would be much in the way of hidden power in a 35W, 40nm silicon process system.
You are basically saying that there is nothing worse than racing games for showcasing graphics, and that they are always eye candy that never shows off how the best games on the system will actually look.... Maybe, maybe not.


I don't know, this is the first official screengrab of the Wii U version. You can't discern much in such a low-quality picture, but it looks okay to me (perhaps a bit too aliased, but that could easily be due to the JPG compression).

[Image: qNpz8tB.jpg]


Also, this is what they said recently on the matter.

http://www.nintendolife.com/news/20...e_core_project_cars_experience_says_developer
 
Nintendo really should moneyhat Sega to get a decent-sized team to port that Aliens game onto the Wii U. Firstly, to prove to the masses that the Wii U is not less capable than a 360, despite what most 3rd-party ports are suggesting, but also because if there's a perfect gimmick, if you will, for the Wii U tablet, then it has to be the trademark sonar from Aliens. I remember when the PS3 was in a similar position, with people doubting it could match the 360, and then Sony released Uncharted, which was the first proper showcase of the system's 'powah'.
 
You are basically saying that there is nothing worse than racing games for showcasing graphics, and that they are always eye candy that never shows off how the best games on the system will actually look....
I don't think I was saying that; what I meant was that car games are technically relatively uncomplicated rendering-wise (physics and simulation may be a different matter, depending on how accurate the underlying driving model is). You basically have a fixed perspective (the car follows the track at all times) and can thus accurately predict what the game will be rendering at all points during the race, so devs can allocate sufficient resources to ensure a stable framerate. A great-looking car game is therefore not as technically impressive as a great-looking free-roaming title because of this.

Also, that screenshot is not even an in-game shot; it's some kind of mostly static garage/showroom scene. Of course it's going to look great. I doubt the in-race car models are as detailed, TBH.
 
At a very basic level, look at the power draw taken by the next-gen consoles compared to the Wii U. The PlayStation 4 draws over 100W more from the mains than Nintendo's console, and it does so using the latest, most power-efficient x86 cores from AMD in concert with a much larger GPU that's a generation ahead and runs on a much smaller fabrication process - 28nm vs. what I'm reliably informed is the 55nm process from Japanese company Renesas.
This may explain the unexpected size of the Wii U's GPU die.

As far as the CPU optimisations went, yes we did have to cut back on some features due to the CPU not being powerful enough. As we originally feared, trying to support a detailed game running in HD put a lot of strain on the CPUs and we couldn't do as much as we would have liked. Cutting back on some of the features was an easy thing to do, but impacted the game as a whole. Code optimised for the PowerPC processors found in the Xbox 360 and PlayStation 3 wasn't always a good fit for the Wii U CPU, so while the chip has some interesting features that let the CPU punch above its weight, we couldn't fully take advantage of them. However, some code could see substantial improvements that did mitigate the lower clocks - anything up to a 4x boost owing to the removal of Load-Hit-Stores, and higher IPC (instructions per cycle) via the inclusion of out-of-order execution.

On the GPU side, the story was reversed. The GPU proved very capable and we ended up adding additional "polish" features as the GPU had capacity to do it. There was even some discussion on trying to utilise the GPU via compute shaders (GPGPU) to offload work from the CPU - exactly the approach I expect to see gain traction on the next-gen consoles - but with very limited development time and no examples or guidance from Nintendo, we didn't feel that we could risk attempting this work. If we had a larger development team or a longer timeframe, maybe we would have attempted it, but in hindsight we would have been limited as to what we could have done before we maxed out the GPU again. The GPU is better than on PS3 or Xbox 360, but leagues away from the graphics hardware in the PS4 or Xbox One.
The relevant quote regarding CPU performance.
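
For anyone who hasn't run into the term, a load-hit-store is just a store that's immediately followed by a load from the same address. Here's a minimal generic C sketch of the pattern (not from the article; the function and type names are invented):

```c
/* Generic illustration of a load-hit-store pattern: on the in-order
 * PowerPC cores of that console generation, a load that hits an
 * address with a store still sitting in the store queue stalls for
 * tens of cycles. Function and type names are invented. */
typedef struct { int total; } Accumulator;

/* Bad: every iteration stores to acc->total and then immediately
 * reloads it, so each pass can eat a load-hit-store stall. */
void sum_slow(Accumulator *acc, const int *values, int n) {
    for (int i = 0; i < n; ++i)
        acc->total += values[i];
}

/* Better: keep the running total in a local (a register), touch
 * memory once at the end, and the stall pattern disappears. */
void sum_fast(Accumulator *acc, const int *values, int n) {
    int total = acc->total;
    for (int i = 0; i < n; ++i)
        total += values[i];
    acc->total = total;
}
```

Which presumably is why, per the quote above, code that no longer paid for that pattern saw such outsized gains on the Wii U CPU.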
 
That's an interesting article on the Wii U. Confirms what we'd already worked out: better GPU, main memory bandwidth not a problem, and a really weak CPU.

The new and really interesting stuff is just how bad Nintendo are at supporting 3rd party developers (particularly Western developers) in using the hardware. Undocumented changes and fixes to hardware, slow to use and broken tools, no way to debug some critical stuff, late with final OS (and no way to properly test and debug with it), and a truly shitty 1980's era way of handling requests for information on the hardware.

The Wii U: hardware specs chosen so "mum wouldn't mind having it in the living room". And true enough, there are not many mums bothered about it being in the living room.
 