Wii U hardware discussion and investigation *rename

Do we know the native resolution of New Super Mario Bros. U ?


As for the GPU, I'm thrilled that it has 32 MB of eDRAM. It will be interesting to find out if it's on a separate die, as with the XB360, or if it's part of the GPU die.
 
Do we know the native resolution of New Super Mario Bros. U ?

All Nintendo games will be 720p:

While the Wii U technically supports 1080p resolution, current Wii U games only run in 720p, Wii U Daily can confirm. Yesterday, Nintendo released screenshots of New Super Mario Bros U, which were in the native resolution of 720p. So were the screenshots of ZombiU and Project P-100. We initially thought that those were simply the 720p versions of the game and that 1080p would be available. But as it turns out, all of Nintendo’s upcoming first party Wii U titles will be in 720p.

Speaking to a Nintendo representative on the E3 show floor today, I was told that all of the playable Wii U games at E3 will ship with native 720p support, including Nintendo Land, New Super Mario Bros, and Pikmin 3. However, the representative said that some of those titles, like Pikmin 3, will run at 60 frames per second. The representative couldn’t confirm whether the games can be upscaled to 1080p, only that the “native resolution is 720p for all titles showcased”.

The Wii U hardware is still very new and it’s likely that it’ll take developers some time before they can push it to its full potential. But for now, it seems that Nintendo and its first party Wii U games will be sticking to 720p.

As for the GPU, I'm thrilled that it has 32 MB of eDRAM. It will be interesting to find out if it's on a separate die, as with the XB360, or if it's part of the GPU die.

But the eDRAM will be on the CPU, won't it? I guess the eDRAM setup in the Xbox 360 is an MS design, and they cannot use it in the same way.
 
Issues weren't my concern though. Efficiency is.
The issue was feature support. And it is supported.

And as long as you don't hit some limit of the shader model, it is very well possible to run the shaders efficiently. Btw., that is a very general statement. It applies to all shader types. Or are you also claiming that with pixel shader model 4 you can't run them efficiently so it does not make sense to use them and one absolutely needs SM5 to run pixel shaders well? :rolleyes:

There is no inherent difference between PS and CS (or every other shader type) in this regard. Of course SM5 offers a few more features and different hardware has different performance, but this applies in principle to all shader types, not only compute shaders. The actual execution of a CS in the ALUs does not differ from other shader types. What differs is for instance how parameters are passed, how the threads/work items (fragments in a pixel shader) get enumerated and therefore also how one addresses data in buffers. But the other shader types also differ there.
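As a rough illustration of that last point, here is a minimal sketch (CUDA for concreteness rather than the HLSL being discussed; the kernel is purely illustrative): a compute-style kernel enumerates its threads explicitly and uses that index to address buffers, the analogue of SV_DispatchThreadID, while a pixel shader would instead receive an implicit fragment position. The arithmetic itself is the same.

```cuda
// Minimal sketch: explicit thread enumeration in a compute-style kernel.
// blockIdx/blockDim/threadIdx play the role HLSL's SV_DispatchThreadID plays.
__global__ void scale_buffer(float* data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)
        data[i] *= k;   // the ALU work is no different from any other shader type

    // Example launch, assuming d_data holds n floats on the device:
    // scale_buffer<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
}
```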

After all the way I see it if it didn't work well, MS would have swept it under the rug and pretended like it never happened.
MS didn't. I don't get what you want to say here.
Everything I've seen in regards to that suggests that level of hardware is not as efficient due to lacking certain features. If you have something that shows otherwise, I'm willing to back away from that belief.
What do you want? Proof that there is code out there where the added features of SM5 are not needed? That's basically a given fact. ;)
Remember that people already did compute wrapped in pixel shaders before the advent of compute shaders? :LOL:
 
All Nintendo games will be 720p:
This makes sense. You want a quality experience. Trying to get 1080p on a launch title means running the risk of inconsistent framerates. 720p will preserve the Nintendo consistent framerate standard for the first titles while devs learn how to best develop for the machine. Remember, Nintendo's 1st party developers have zero experience with a modern architecture. Their programming experience and knowledge of architecture is a dozen years out of date.

But the eDRAM will be on the CPU, won't it? I guess the eDRAM setup in the Xbox 360 is an MS design, and they cannot use it in the same way.
I'm not tight on all the rumours, but I think we have eDRAM on both processors. There's eDRAM on the GPU as you'd expect. We're also told by IBM that there's eDRAM on the CPU. This could just be a smaller amount for cache, rather than be related to the GPU. eDRAM shared between GPU and CPU doesn't make sense to me, and would be an added cost for no benefit.
 
Outside of the difficulty of writing efficient compute shaders, and how efficiently they run on any given piece of hardware, devs will think long and hard before stealing GPU resources from rendering.
Compute shaders are very useful for graphics rendering as well. If you can perform some graphics rendering steps more efficiently using compute shaders, you are naturally going to do it. In addition to speeding up existing processing steps, compute shaders allow completely new graphics algorithms to be implemented. Current pixel shader based rendering pipelines rely heavily on brute force rendering (huge regular sampling grid, etc). Compute shaders allow more clever algorithms to be implemented instead. In many cases you also want to run some algorithm steps on GPU because GPU->CPU->GPU latency roundtrip is too long.
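As a hedged illustration of the kind of graphics step that moves naturally to compute, here is a minimal sketch (CUDA standing in for an HLSL compute shader; the kernel and names are illustrative): a tiled luminance reduction of the sort used for auto-exposure, where groupshared memory lets one pass do work that a pixel-shader pipeline would handle with a chain of brute-force downsamples.

```cuda
// Sketch: per-tile average luminance using shared ("groupshared") memory.
// Assumes 256 threads per block; lum holds one luminance value per pixel.
__global__ void tile_average_luminance(const float* lum, float* tileAvg, int count)
{
    __shared__ float s[256];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid]  = (i < count) ? lum[i] : 0.0f;
    __syncthreads();

    // Tree reduction inside the tile -- the data sharing a pixel shader cannot do.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        tileAvg[blockIdx.x] = s[0] / 256.0f;   // one average per 256-pixel tile
}
```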
There is no inherent difference between PS and CS (or every other shader type) in this regard.
I do not agree with this. Since PS3.0 you have been able to do basically anything you wanted in a pixel shader. PS4.0 and 5.0 didn't bring any features that allowed anything radically new, except for maybe integer support. And you can emulate 24-bit integers with 32-bit floating point values very well (I have been doing, for example, image compression / bit packing in pixel shaders using floats as integers on consoles, and the performance is good).
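For what it's worth, the float-as-integer packing mentioned here rests on the fact that fp32 represents every integer up to 2^24 exactly. A minimal sketch (CUDA syntax purely for illustration; on a DX9-era console this would live in a pixel shader, and the 12+12-bit split is just an example):

```cuda
// Pack two whole numbers in [0, 4095] into one float; the result stays below 2^24,
// so every intermediate value is exactly representable in fp32.
__device__ float pack12_12(float hi, float lo)
{
    return hi * 4096.0f + lo;
}

// Recover both fields. Multiplying by 1/4096 only shifts the exponent, so it is exact.
__device__ void unpack12_12(float packed, float& hi, float& lo)
{
    hi = floorf(packed * (1.0f / 4096.0f));
    lo = packed - hi * 4096.0f;
}
```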

CS4_0/4_1 however lack many very important compute features (picked the most important):
- A thread can only access its own region in groupshared memory for writing
- SV_GroupIndex or SV_DispatchThreadID must be used when accessing groupshared memory for writing
- A single thread is limited to a 256 byte region of groupshared memory for writing
- Only one unordered-access view can be bound to the shader
- No atomic instructions are available

Without atomics and/or scatter to groupshared memory, many algorithms are impossible (or very difficult/inefficient) to write. Groupshared memory is the most important feature that differentiates compute shaders from pixel shaders, and unfortunately groupshared memory usage is very much crippled in CS4_0/4_1.
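To make the missing features concrete, here is a hedged sketch (CUDA standing in for an SM5-class compute shader; names are illustrative) of a per-block histogram, a textbook case that needs both data-dependent scatter into groupshared memory and atomics, i.e. exactly what the CS4_0/4_1 restrictions above rule out:

```cuda
#define BINS 256

// Sketch: per-block histogram merged into a global one. Requires scattered,
// data-dependent writes to shared memory plus shared and global atomics.
__global__ void histogram(const unsigned char* data, int n, unsigned int* globalHist)
{
    __shared__ unsigned int localHist[BINS];

    // Cooperatively clear the shared histogram.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        localHist[b] = 0;
    __syncthreads();

    // Scattered writes into shared memory, made safe by atomics.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&localHist[data[i]], 1u);
    __syncthreads();

    // Merge this block's result into the global histogram.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&globalHist[b], localHist[b]);
}
```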

Also, none of the GPUs that support CS_4_0/4_1 have generic read/write caches, or other new features (parallel kernel execution, context switching, etc.) that make compute shaders so much more usable. DX10 chips might be technically "compute shader capable", but this doesn't mean they come anywhere near the efficiency, flexibility and feature set of GCN/Fermi/Kepler. Personally I have never counted DX10 cards as being "compute capable", and I would likely run these cards using a traditional pixel shader code path instead (grouping DX10 cards together with DX9 cards).
 
Compute shaders are very useful for graphics rendering as well. If you can perform some graphics rendering steps more efficiently using compute shaders, you are naturally going to do it. In addition to speeding up existing processing steps, compute shaders allow completely new graphics algorithms to be implemented. Current pixel shader based rendering pipelines rely heavily on brute force rendering (huge regular sampling grid, etc). Compute shaders allow more clever algorithms to be implemented instead. In many cases you also want to run some algorithm steps on GPU because GPU->CPU->GPU latency roundtrip is too long.

Absolutely but I was largely referring to the use of compute to compensate for processor deficiencies specifically.

I actually do think it has value in this space if it's implemented properly. GPUs waste a lot of compute resources because not everything in a frame is ALU bound. I'd guess that most renderers don't exceed 60% utilization of the ALUs when looking at the frame in its entirety. There are too many parts of a frame that are fill bound (shadows, the first pass of a deferred renderer, etc.) or texture bound to do much better.

Any good solution needs to let you exploit those "idle" ALU cycles without a large impact on the ongoing rendering.

And even a good solution doesn't make writing efficient compute shaders any easier, though I think a lot of that is just experience. You just have to think very wide, and understand that sometimes it's faster to do a lot more work reformatting the problem than just generating a solution directly.
 
Are there situations where a non-graphics job on the CPU relates so closely to the GPU's workload that efficiency gains can be made shifting the work to the GPU? eg. If by moving physics to the GPU, data can be recycled and used elsewhere like...motion vectors or something during the graphics rendering, such that instead of using 5 ms of CPU time and 5 ms of GPU time to calculate physics and render motion blur, instead 7 ms of GPU time is used and the CPU can be applied on other jobs.
 
Are there situations where a non-graphics job on the CPU relates so closely to the GPU's workload that efficiency gains can be made shifting the work to the GPU? eg. If by moving physics to the GPU, data can be recycled and used elsewhere like...motion vectors or something during the graphics rendering, such that instead of using 5 ms of CPU time and 5 ms of GPU time to calculate physics and render motion blur, instead 7 ms of GPU time is used and the CPU can be applied on other jobs.
Modern GPUs have a significant advantage in both raw FLOPS and memory BW compared to modern CPUs. For example, a top-of-the-line Ivy Bridge peaks at 8*2*4*3.5 GHz = 224 GFLOP/s. A top-of-the-line Radeon 7970 (GHz Edition) peaks at around 4300 GFLOP/s. That's nearly a 20x difference.
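For reference, those peak figures break down as follows (assuming the factors are 8-wide AVX × separate add and multiply issue per cycle × 4 cores × 3.5 GHz for Ivy Bridge, and 2048 ALUs × 2 FLOPs per multiply-add × roughly 1.05 GHz for the 7970 GHz Edition):

\[
8 \times 2 \times 4 \times 3.5\,\mathrm{GHz} = 224\ \mathrm{GFLOP/s},
\qquad
2048 \times 2 \times 1.05\,\mathrm{GHz} \approx 4301\ \mathrm{GFLOP/s},
\qquad
\frac{4301}{224} \approx 19.2
\]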

If you sacrifice 5% of your GPU FLOP/s to help with the non-graphics tasks, you have 215 GFLOP/s more for those tasks. That's basically double the original 224 GFLOP/s the Ivy Bridge could do on its own.

Those extra 215 GFLOP/s need to be used for straightforward stream processing tasks, such as viewport culling, batch matrix math (in scene setup / animation), particle animation, etc. Basically, most tasks that are usually processed by SSE/AVX in SoA-style batch processing can be processed on the GPU with good (near 1:1) efficiency. You can use the GPU for more complex tasks as well, but then you are likely not getting 1:1 usage of resources. For example, number sorting can be done on the GPU, but the efficiency will be worse than 4:1 (GPU FLOP/s needed for one CPU FLOP/s). Even with efficiency this bad, the GPU can still sort around 5x faster than the CPU (as it has 20x more FLOP/s). So it might still be viable for some scenarios (especially if CPU sorting would require you to move data from GPU to CPU and back).
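As a hedged sketch of what such a near-1:1 stream-processing task looks like (CUDA for concreteness; the sphere layout and plane convention are illustrative), viewport culling is essentially one independent, branch-light computation per object:

```cuda
// Sketch: sphere-vs-frustum culling as an SoA-style batch job that maps
// almost directly from SSE/AVX code onto the GPU.
struct Plane { float nx, ny, nz, d; };   // n.x*x + n.y*y + n.z*z + d >= 0 means "inside"

__global__ void cull_spheres(const float4* spheres,   // xyz = center, w = radius
                             const Plane*  planes,    // 6 frustum planes
                             unsigned char* visible,
                             int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count)
        return;

    float4 s = spheres[i];
    unsigned char inside = 1;
    for (int p = 0; p < 6; ++p) {
        float dist = planes[p].nx * s.x + planes[p].ny * s.y + planes[p].nz * s.z + planes[p].d;
        if (dist < -s.w)              // completely behind this plane
            inside = 0;
    }
    visible[i] = inside;
}
```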

The biggest problem in using the GPU to help the CPU (to batch process tasks) is the latency. The GPU is running asynchronously, and it's fed by a ring buffer. In order to keep the GPU fed at all times, there must be several work items waiting in the ring buffer all the time. Because of this, there is often up to half a frame (8/16 ms) of latency before the GPU starts the tasks you send it. The newest compute-capable GPUs (Kepler and GCN) support multiple hardware ring buffers and context switching, so you can push the priority tasks (compute tasks) to a separate ring buffer. This functionality is not yet available in the DirectX API, however. Even if you generate separate command lists with multithreaded command buffer devices, you must merge them all into a single hardware device (single hardware ring buffer). Basically, currently (with current APIs and all except the newest GPUs) you can only use GPU compute for tasks that are later consumed by the GPU (and do not need results sent back to the CPU) or tasks that have no latency requirements (results can be collected next frame by the CPU). Lower-level console APIs and unified memory would of course improve the situation and allow more tasks to be offloaded to the GPU.
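A minimal sketch of the "no latency requirement" pattern described here, written as CUDA host code for concreteness (the double-buffering scheme and some_batch_kernel are illustrative assumptions, not the D3D11 path the post refers to): work is queued this frame and its result is consumed by the CPU only on the following frame, so the ring-buffer latency never stalls anything.

```cuda
#include <cuda_runtime.h>

// Stand-in for a real batch job (culling results, batch matrix math, etc.).
__global__ void some_batch_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;
}

// Call once per frame. h_result[0/1] must be pinned host buffers (cudaHostAlloc)
// for the copy to be truly asynchronous; done[0/1] are pre-created events.
void frame_update(int frame, const float* d_in, float* d_out, float* h_result[2],
                  int n, cudaStream_t stream, cudaEvent_t done[2])
{
    int cur  = frame & 1;   // buffer being produced this frame
    int prev = cur ^ 1;     // buffer produced last frame

    // Consume last frame's result on the CPU; by now it has almost always finished.
    if (frame > 0) {
        cudaEventSynchronize(done[prev]);
        // ... read h_result[prev] on the CPU ...
    }

    // Queue this frame's GPU work and the copy back to host memory, then move on.
    some_batch_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_result[cur], d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done[cur], stream);
}
```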
 
The issue was feature support. And it is supported.
No it wasn't. Go back and look at where I started this. I said it was "weird/wasteful". Wasteful meaning inefficient to implement on 10.1-level hardware.

And as long as you don't hit some limit of the shader model, it is very well possible to run the shaders efficiently. Btw., that is a very general statement. It applies to all shader types. Or are you also claiming that with pixel shader model 4 you can't run them efficiently so it does not make sense to use them and one absolutely needs SM5 to run pixel shaders well? :rolleyes:

Not understanding the context of the eyeroll. That's what I'm trying to understand based on what I know, so I may not be saying it clearly enough for you. So let me try again. Compute shaders were introduced in DX11. Only through the DX11 API can DX10.1 hardware now utilize compute shaders. 10.1 hardware is not as efficient as 11 hardware, and in turn compute shader usage is not as efficient on 10.1 hardware as it is on 11 hardware. The previous sentence also came from reading to try to understand CS usage better. And from there, because of how picky Nintendo is about efficiency, I'm not understanding the logic of employing compute shaders if the hardware would be inefficient at them when a more efficient alternative could be employed with customization. In other words, if CS is worth mentioning, I have a tough time seeing why some continue to focus on the R700 aspect of the GPU as if there will be virtually no changes.

There is no inherent difference between PS and CS (or every other shader type) in this regard. Of course SM5 offers a few more features and different hardware has different performance, but this applies in principle to all shader types, not only compute shaders. The actual execution of a CS in the ALUs does not differ from other shader types. What differs is for instance how parameters are passed, how the threads/work items (fragments in a pixel shader) get enumerated and therefore also how one addresses data in buffers. But the other shader types also differ there.

Sebbbi blows away any type of response I would have made (because I don't see them as being virtually similar, even though that's with my limited understanding), so I'll lean on his response. :p

MS didn't. I don't get what you want to say here.

Exactly. Which is what I said. And that's probably because I hadn't thoroughly explained the premise for my view.


What do you want? Proof that there is code out there where the added features of SM5 are not needed? That's basically a given fact. ;)
Remember that people already did compute wrapped in pixel shaders before the advent of compute shaders? :LOL:

Right. Like I mentioned earlier, I have a general understanding of the history of GPGPU. :smile:

But I guess, based on the way you asked it, what I would be looking for, if possible, is something comparing the compute performance of a DX11 GPU vs. a DX10.1 GPU through DirectCompute, since OpenGL only recently added compute shaders.
 
This makes sense. You want a quality experience. Trying to get 1080p on a launch title means running the risk of inconsistent framerates. 720p will preserve the Nintendo consistent framerate standard for the first titles while devs learn how to best develop for the machine. Remember, Nintendo's 1st party developers have zero experience with a modern architecture. Their programming experience and knowledge of architecture is a dozen years out of date.

But what about this?

[Image: Zelda-HD-Demo.jpg]


It comes from the same Nintendo developers, and it is great looking, on par with or even better than the current gen.
 
But what about this?

It comes from the same Nintendo developers, and it is great looking, on par with or even better than the current gen.
I don't understand your point. :???: Nintendo's developers have been working on fixed-pipeline Gamecube hardware for over a decade. They don't do multiplatform, so they have no experience with modern hardware. Ergo it's going to be harder for them to eke out performance from day one, just as it was for devs experienced with EE+GS switching to Cell+RSX. That Nintendo's RnD can build a short tech demo utilising modern techniques on modern hardware doesn't change the situation with their many studios. There's no reason to expect Nintendo's developers to transition to new hardware better than any others, and no reason to expect them to get more performance from their hardware from day one than any other developers. That flies in the face of reason.
 
I don't understand your point. :???: Nintendo's developers have been working on fixed-pipeline Gamecube hardware for over a decade. They don't do multiplatform, so they have no experience with modern hardware. Ergo it's going to be harder for them to eke out performance from day one, just as it was for devs experienced with EE+GS switching to Cell+RSX. That Nintendo's RnD can build a short tech demo utilising modern techniques on modern hardware doesn't change the situation with their many studios. There's no reason to expect Nintendo's developers to transition to new hardware better than any others, and no reason to expect them to get more performance from their hardware from day one than any other developers. That flies in the face of reason.

Interesting, I hadn't given it much thought but I would assume Nintendo would have the resources to train their staff to be up to date.
They would have had to see into the future and be ready for the technology they would eventually use. On the other hand, they seem to be stingy, so maybe they would only do what they need to at the time.
So you think their software developers only work on their most current hardware?
 
I don't understand your point. :???: Nintendo's developers have been working on fixed-pipeline Gamecube hardware for over a decade. They don't do multiplatform, so they have no experience with modern hardware. Ergo it's going to be harder for them to eke out performance from day one, just as it was for devs experienced with EE+GS switching to Cell+RSX. That Nintendo's RnD can build a short tech demo utilising modern techniques on modern hardware doesn't change the situation with their many studios. There's no reason to expect Nintendo's developers to transition to new hardware better than any others, and no reason to expect them to get more performance from their hardware from day one than any other developers. That flies in the face of reason.

Are you saying that people at Nintendo are not capable of getting some juice from their hardware? They have very tech-capable people, like the people at Retro Studios, or they can contract some techies, or they can buy UE3 for some in-house titles.

Microsoft's first-party devs did great work with really new hardware in 2005.
 
Interesting, I hadn't given it much thought but I would assume Nintendo would have the resources to train their staff to be up to date.

So you think their software developers only work on their most current hardware?
If you're making a game, you're working on the hardware targets. Their coders would have been spending their days writing GC and Wii code while those were the platforms being targeted, rather than spending time writing GPU code on PC. A shift to new hardware means those coders move from one architecture to another with zero experience on the new hardware. They'll have various RnD, but they are no different to every other developer AFAIK. Which means new hardware brings a whole load of new lessons, and the more different the hardware, the longer it's gonna take to make the most of it - affected also by the ease of the hardware and development tools. Efficient hardware usage can't be learnt with two weeks of training. Training teaches how to get the hardware to render some graphics. Experience (and shared experience) teaches how to understand the hardware at a lower level and make the most of it.

Are you saying that people at Nintendo are not capable of getting some juice from their hardware? They have very tech-capable people, like the people at Retro Studios, or they can contract some techies, or they can buy UE3 for some in-house titles.

Microsoft's first-party devs did great work with really new hardware in 2005.
But it was still well below what the machine was capable of! Launch titles on any console aren't a patch on what it can do further down the line. I'm not in any way belittling Nintendo's developers here - all devs are in the same boat. MS's first party, and Epic, went from shader-based graphics to shader-based graphics, so for them the transition was easier. For next-gen, they'll hit the ground running. Well, jogging. Sony's devs went from simple PlayStation code to the impenetrable wall of PS2, which took them forever to work out (isn't the reference anecdote several weeks just to get a triangle to draw for the first time?), to the complexities of Cell and the unknowns of GPU shaders. Nintendo's devs had the luxury of the same hardware from GC to Wii, meaning no learning to do. But that also means no experience with modern systems, and they'll be hitting a major learning wall. Modern hardware is easy to use. Clever developers with a copy of GPU Gems can produce a decent-looking demo in a short time. But making the most of the hardware, doing things efficiently, is going to take a lot of experience. Thus it makes sense not to try to produce a fifth-year-quality game, but to aim for a launch-title-quality game for launch, which means setting more realistic goals.

If UE3 is supported on a platform, then some games will benefit from that and make life easier for the devs in getting something up and running, but UE3 is hardly known for its efficiency so you also won't be getting particularly good utilisation. Unless there was a lot of extra performance under the hood, it's unlikely a game could hit 1080p30 with suitable eye-candy through UE3. Thus, it makes sense no matter what engine you are using to aim for a starting game with simpler goals, use that to gain experience, and then become more adventurous in subsequent titles when you have a better feel for how to approach problems more efficiently with the hardware you have.
 
I can't believe people think Nintendo wasn't getting ready for HD development.

In 2009, it was revealed that Nintendo was expanding both its Redmond and Kyoto offices. The new office building complex of Nintendo of America in Redmond is 275,250 square feet (25,572 m2) and would expand its localization, development, debugging, production, and clerical teams. Nintendo Co., Ltd. announced the purchase of a 40,000 square-meter lot that would house an all-new research and development (R&D) office ($141M), making it easier for the company's two other Kyoto R&D offices to collaborate, as well as expanding the total workforce on upcoming console development and new software for current and future hardware.
 
A lot of Retro guys have left...


The following is a breakdown of recent additions and losses of personnel for Retro Studios Inc.

2001
(+) Ryan Harris (Nintendo of America) (Production)

2008
(+) Shane Lewis (Nintendo of America) (Production)

Employees Acquired 2008-2012
Chris Torres, Reed Ketcham, Jonathan Delange, Stephen Dupree, Andrew Orlando, Brad Taylor, Robert Kovach, Nathan Nordfelt, Tony Bernardin, Dominic Pallotta, Kyle Ruegg, Timothy Wilson, Sylvia Rowland, Eric Koslowski, Gray Ginther, Crystel Land, Adam Schulman, Aaron Black, Nestor Hernandez, Paul Schwanz, Chris Carroll, Allison Theus, Jessica Spence, Toph Gorham, Mookie Weisbrod, Rhett Baldwin

Employees Dismissed 2008-2012
Jay Epperson, Bryan Walker, Mike Wikan, Kynan Pearson, Mike Miller

Openings: 0 Positions
Total : 79 employees

http://kyoto-report.wikidot.com/retro-studios-personnel-tracker


They may have lost some people, but they were replaced and then some.
 
I can't believe people think Nintendo wasn't getting ready for HD development.
No-one has said that. If you want to believe that Nintendo's developers can enter into a new hardware architecture and gain very effective utilisation from it from their very first titles in a way no other developers can, that's your prerogative.
 