Wii U hardware discussion and investigation

The issue with memory bandwidth isn't just the CPU; in fact it's primarily the GPU.
Even with EDRAM the textures have to come from main memory. In most of the recent performance captures I've seen for high-end GPUs, they are limited by texture memory bandwidth, not ALUs, and they have >10x the bandwidth that the WiiU has.
The EDRAM alleviates the frame buffer bandwidth, and I'll be nice and assume it can be used for intermediate buffers, but it's still going to be an issue.

If I remember correctly, the peak read bandwidth from the main memory in the Xbox 360 is just 10.8 GB/s (10.8 GB/s read and 10.8 GB/s write). On the other hand, I believe the Wii U can use 12.8 GB/s in either direction. Assuming that most of the memory accesses are read requests (e.g., for texturing), memory bandwidth might not be an issue when porting current generation titles. Also, I assume that larger caches on the CPUs leave more bandwidth free for the GPU to use.
 
If I remember correctly, the peak read bandwidth from the main memory in the Xbox 360 is just 10.8 GB/s (10.8 GB/s read and 10.8 GB/s write). On the other hand, I believe the Wii U can use 12.8 GB/s in either direction. Assuming that most of the memory accesses are read requests (e.g., for texturing), memory bandwidth might not be an issue when porting current generation titles. Also, I assume that larger caches on the CPUs leave more bandwidth free for the GPU to use.

I was wrong, only the FSB on the Xbox 360 is limited to 10.8 GB/s in each direction. The GPU can read/write from/to the memory using all of the 128 bit interface (hence, the peak is 22.4 GB/s).

Still, one would assume that the newer GPU in the Wii U, paired with a larger and possibly more usable RAM pool, would make better use of the available bandwidth. I think Nintendo made pretty bad decisions this time around, which is disappointing. I know I will eventually still get a Wii U once a next-generation Zelda is released.
 
If, for the sake of argument, the WiiU had 16 texture sampling units pulling 4x4 bit DXTC/S3TC texels from main memory per clock at 500 MHz (like Xenos), then that would almost saturate the entire main memory bus. Assuming I've got this right, of course.

Unless the WiiU has a massively more efficient memory controller than the Xbox 360, it seems likely that either the system will be bandwidth starved, or that the GPU is running at less than 500 MHz and has no more than 16 or 20 TMUs.

I think Nintendo made pretty bad decisions this time around, which is disappointing. I know I will eventually still get a Wii U once a next-generation Zelda is released.

Know that feel bro.
 
If, for the sake of argument, the WiiU had 16 texture sampling units pulling 4x4 bit DXTC/S3TC texels from main memory per clock at 500 MHz (like Xenos), then that would almost saturate the entire main memory bus.
Assuming no reuse is quite unrealistic. Texture caches are there for a reason.
Or just compare it with a performance/high end GPU from the PC space:
GTX680 has 128 TMUs running at ~1GHz and has 192 GB/s memory bandwidth. Is the ratio any better there? Wait, it's even worse! :LOL:

12.8 GB/s / (16 TMUs * 0.5 GHz) = 1.6 Bytes per Texel
192 GB/s / (128 TMUs * 1.0 GHz) = 1.5 Bytes per Texel

The Wii U has to serve the CPU from the same bandwidth, but it also has the eDRAM (reducing the bandwidth requirements).
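
As a quick sanity check of those ratios, here is a tiny Python sketch (the 12.8 GB/s and the 16 TMUs at 500 MHz are the rumored/hypothetical figures from above, not confirmed specs):

def bytes_per_texel(bandwidth_gb_s, tmus, clock_ghz):
    # Memory bytes available per texel at peak texturing rate (one texel per TMU per clock)
    return bandwidth_gb_s / (tmus * clock_ghz)

print(bytes_per_texel(12.8, 16, 0.5))    # Wii U (assumed figures) -> 1.6
print(bytes_per_texel(192.0, 128, 1.0))  # GTX 680                 -> 1.5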
 
Well the Xbox CPU and GPU dropped from ~2 x 180 mm^2 to one ~180 mm^2 SoC in two full process nodes (90nm to 45nm), so it's quite possible that scaling is rather less than linear.

What was the Wii CPU die size @ 90nm?

I didn't use the Broadway die size for calculations because it is an extreme outlier, probably demonstrating why scaling down an already small CPU may not be an ideal course of action. The terrible scaling between the GC Gekko (180nm) and the Wii Broadway (90nm) isn't justified by, for instance, IBM's SRAM cell sizes on the two processes. The likely culprit is that Broadway has to drive the same external I/O pins. Conspicuously, I haven't been able to find a high res die shot of Broadway, but judging from other members of the family, it has to have issues there, to the point that it may even have some unused space (or unleaked extra qualities).

As I said, the straight scaling example actually includes three sets of external I/O circuitry, whereas the WiiU CPU doesn't have to drive any off-package data at all! So there you actually save quite a bit of die area. The 3MB of L2 cache (if we believe the rumors) is more than the 768kB of L2 that three Broadways would add up to, but then it uses eDRAM instead of SRAM, which (for complete arrays) is 2-4 times denser. For back-of-the-envelope estimation purposes, it's pretty much a wash in terms of die size for the L2.

So there is a more thorough explanation of why I think there is more to the WiiU CPU than straight shrinks of Gekko/Broadway + minimum support for MP.
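
For what it's worth, a back-of-the-envelope version of that L2 argument in Python (the cache sizes are the rumored ones, and the 2-4x density factor is just the range quoted above, not a measurement):

sram_l2_kb  = 3 * 256    # three Broadway-class cores with 256 kB SRAM L2 each
edram_l2_kb = 3 * 1024   # rumored 3 MB of eDRAM L2 in the WiiU CPU

for density in (2, 4):   # assumed eDRAM density advantage over SRAM per bit
    relative_area = (edram_l2_kb / sram_l2_kb) / density
    print(f"{density}x denser eDRAM -> L2 area is {relative_area:.1f}x the SRAM baseline")
# Prints 2.0x and 1.0x, i.e. somewhere between parity and double -- roughly a wash.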
 
If the 12.8 GB/s memory bandwidth is indeed true, that can surely be problematic. AMD's last-generation Llano APUs (Radeon HD 6550D) already have 29.9 GB/s memory bandwidth and are highly bandwidth starved (proven by memory overclocks that give almost linear performance improvements). Llano is running most console ports slightly faster (~40 fps) than the Xbox 360 & PS3 with similar IQ settings and resolution.

In comparison Radeon 4000 series have memory bandwidth of 115 GB/s. And that's just for the GPU alone. WiiU is supposed to have EDRAM (Wikipedia), so that of course helps. But if the main memory bandwidth is really that low (9x lower than Radeon 4000), the EDRAM needs to be extensively used throughout the rendering. If the EDRAM supports read&write it's possible to limit the memory traffic a lot, but it requires lots of algorithm tuning / compromises. For example you might want to render shadow maps to EDRAM and read them directly from there (to save all shadow map rendering and sampling bandwidth). However this technique would require more passes if you have lots of lights or want to use high resolution shadow maps (EDRAM has limited space).
Assuming no reuse is quite unrealistic. Texture caches are there for a reason.
The four bits (0.5 bytes) per pixel (DXT1) figure is realistic. That already includes reuse.

Without reuse (+trilinear and +bad access pattern) the worst case figure is: eight 4x4 DXT1 blocks per sample. Four blocks for filtering if the sample is on a DXT block border, multiplied by two, because trilinear filtering uses two mipmaps. That's 8*8=64 bytes per pixel. If the pixel is not on a DXT block border (a more common case for random accessing), you need to fetch 2*8=16 bytes per pixel instead.

The main purpose of GPU texture cache is to keep the filtering data (and the remaining pixels from DXT blocks) in the cache. With good cache utilization you can achieve the 4 bit per pixel ratio. Anything better than that is uncommon for generic cases as the GPU texture caches are very small (the cache is most likely completely reused before the next draw call).
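
Spelling out the block arithmetic above as a rough Python sketch (a DXT1 block packs a 4x4 texel tile into 8 bytes, i.e. 4 bits per texel):

BLOCK_BYTES = 8                          # one 4x4 DXT1 block
TEXELS_PER_BLOCK = 16

# Worst case, no reuse: 4 blocks per mip (sample on a block border) * 2 mips (trilinear)
worst_case_bytes = 4 * 2 * BLOCK_BYTES             # 64 bytes per pixel

# No reuse, sample inside a block: 1 block per mip * 2 mips
no_reuse_bytes = 1 * 2 * BLOCK_BYTES               # 16 bytes per pixel

# Good cache reuse: each texel is effectively fetched only once
with_reuse_bytes = BLOCK_BYTES / TEXELS_PER_BLOCK  # 0.5 bytes (4 bits) per pixel

print(worst_case_bytes, no_reuse_bytes, with_reuse_bytes)   # 64 16 0.5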
 
If the 12.8 GB/s memory bandwidth is indeed true, that can surely be problematic. AMD's last-generation Llano APUs (Radeon HD 6550D) already have 29.9 GB/s memory bandwidth and are highly bandwidth starved (proven by memory overclocks that give almost linear performance improvements). Llano is running most console ports slightly faster (~40 fps) than the Xbox 360 & PS3 with similar IQ settings and resolution.


Question,
What benefits does placing the GPU and CPU so close together accomplish?

Nintendo states:
An MCM is where the aforementioned Multi-core CPU chip and the GPU chip are built into a single component. The GPU itself also contains quite a large on-chip memory. Due to this MCM, the package costs less and we could speed up data exchange between the two LSIs while lowering power consumption.

Doesn't this mean that bandwidths, clock speeds, etc can be reduced but still provide the same performance? Doesn't this also help in offloading processing to the GPU as much as possible?
 
Talking about the "slow cpu" issue...

Is GPGPU a good solution?
Can a GPU do GPGPU work and graphics operations at the same time without performance penalties?
 
The four bits (0.5 bytes) per pixel (DXT1) figure is realistic. That already includes reuse.
If you assume one needs to fetch only about 0.5 bytes per filtered texel including the reuse, it doesn't saturate the memory bandwidth (as it is less than a third of the available one). ;)
Without reuse (+trilinear and +bad access pattern) the worst case figure is: Eight 4x4 DXT1 blocks per sample. Four blocks for filtering if the sample is on DXT block border, multiplied by two, because trilinear filtering uses two mipmaps. That's 8*8=64 bytes per pixel. If the pixel is not on DXT block border (a more common case for random accessing), you need to fetch 2*8=16 bytes per pixel instead.
The worst case is not sustainable for any realistic scenario. My guess would be that with trilinear filtering and reuse through the texture cache one will probably arrive not too much above 5 bits/filtered texel (that's what the trilinear filtering/LOD algorithm shoots for), or more generally, one needs to fetch about 1.25 individual unfiltered texels for each filtered one (edit: after thinking about it a bit, it's probably a bit more).
The main purpose of GPU texture cache is to keep the filtering data (and the remaining pixels from DXT blocks) in the cache. With good cache utilization you can achieve the 4 bit per pixel ratio. Anything better than that is uncommon for generic cases as the GPU texture caches are very small (the cache is most likely completely reused before the next draw call).
True.
Of course one can construct situations where one needs a bit less (or significantly more), but I assume it works quite well on average when one considers the reluctance of nV or AMD to increase the size of the texture caches. L1 caches have been stuck in the same size region for more than a decade; it stood at 6-8 kB per quad TMU in most GPUs. Only Southern Islands increased it to 16kB (Kepler to 12kB), probably mainly because it now doubles as a general purpose data cache (read only in case of Kepler) and it gets quite expensive to scale the bandwidth of the texture L2 with the rising number of consumers.
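
To connect the two numbers above (DXT1's 4 bits per stored texel, and the number of unfiltered texels fetched per filtered texel), a small sketch with assumed overhead factors:

dxt1_bits_per_texel = 4

for fetched_per_filtered in (1.0, 1.25, 1.5):   # assumed cache-miss overhead factors
    bits = dxt1_bits_per_texel * fetched_per_filtered
    print(f"{fetched_per_filtered:.2f} texels fetched per filtered texel -> {bits:.1f} bits")
# 1.25 fetched texels per filtered texel is where the ~5 bits/filtered texel estimate comes from.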
 
Assuming no reuse is quite unrealistic. Texture caches are there for a reason.

I was looking at a worst case scenario - it doesn't need to be happening all the time to cause issues for anything else on the memory bus (i.e. the CPU). Even with only 16 4-bit texel fetches per clock at 500 MHz you're looking at 8 bytes x 500 MHz = ~4 GB/s, or about a third of a theoretical 12.8 GB/s (in practice this may be significantly lower). If you wanted to sample from a 24-bit or 32-bit texture then you'd be in pretty bad shape.
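
In numbers (again treating the 16 TMUs at 500 MHz and the 12.8 GB/s bus as assumptions for the sake of argument):

tmus, clock_hz, bus_gb_s = 16, 500e6, 12.8

def fetch_gb_s(bits_per_texel):
    # Bytes pulled from main memory per second if every TMU misses the cache each clock
    return tmus * clock_hz * bits_per_texel / 8 / 1e9

for bits in (4, 24, 32):   # DXT1, 24-bit and 32-bit uncompressed texels
    gb_s = fetch_gb_s(bits)
    print(f"{bits}-bit texels: {gb_s:.1f} GB/s ({gb_s / bus_gb_s:.0%} of the bus)")
# 4-bit: 4.0 GB/s (31%), 24-bit: 24.0 GB/s (188%), 32-bit: 32.0 GB/s (250%)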

Or just compare it with a performance/high end GPU from the PC space:
GTX680 has 128 TMUs running at ~1GHz and has 192 GB/s memory bandwidth. Is the ratio any better there? Wait, it's even worse! :LOL:

12.8 GB/s / (16 TMUs * 0.5 GHz) = 1.6 Bytes per Texel
192 GB/s / (128 TMUs * 1.0 GHz) = 1.5 Bytes per Texel

And what's to say that high end GPUs aren't bottlenecked by texture fetch bandwidth? Here's ERP, yesterday:

The issue with memory bandwidth isn't just the CPU; in fact it's primarily the GPU.
Even with EDRAM the textures have to come from main memory. In most of the recent performance captures I've seen for high-end GPUs, they are limited by texture memory bandwidth, not ALUs, and they have >10x the bandwidth that the WiiU has.
 
Question,
What benefits does placing the GPU and CPU so close together accomplish?

Nintendo states:


Doesn't this mean that bandwidths, clock speeds, etc can be reduced but still provide the same performance? Doesn't this also help in offloading processing to the GPU as much as possible?

I think the most important takeaway from that statement is that it costs less and consumes less power.
 
I was looking at a worst case scenario - it doesn't need to be happening all the time to cause issues for anything else on the memory bus (i.e. the CPU). Even with only 16 4-bit texel fetches per clock at 500 MHz you're looking at 8 bytes x 500 MHz = ~4 GB/s, or about a third of a theoretical 12.8 GB/s (in practice this may be significantly lower). If you wanted to sample from a 24-bit or 32-bit texture then you'd be in pretty bad shape.

And what's to say that high end GPUs aren't bottlenecked by texture fetch bandwidth? Here's ERP, yesterday:
I completely agree, the bandwidth of the Wii U looks to be a bit anemic. But so does the whole machine (the bandwidth may be an order of magnitude behind a good performance GPU, but so is the raw texturing speed). What I wanted to say is that it's probably not completely off balance, it is just slow. ;)

And as a side note, texture fetch bandwidth does not have to mean memory bandwidth. For example, if you take a HD7970 and vary the memory bandwidth, performance tends not to scale that much. More TMUs increase the usable texture fetch bandwidth (as basically each TMU comes with its own L1 cache), and the L2 cache and its bandwidth may also play a crucial role (the L1 caches are quite tiny as already mentioned, they need the backup of the L2).
 
From the limited amount of information that we have, I think the design of the Wii U is more a composition of a low-cost / high specific power console.

The memory has low latency, but there is a lot of it.
So say you can render 300 MB worth of data in one frame, you can change the rendered scene by 30 MB/frame, and you can show 900 MB of data within one second.

So if you write the game for the Wii U, and it is not a port, then you can show nice things of a kind not possible on the Xbox 360/PS3.
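
For reference, reading those numbers at an assumed 30 fps against the rumored 12.8 GB/s main memory bandwidth (both of these are assumptions on my part):

fps = 30
main_bw_mb_s = 12.8 * 1000                 # rumored main memory bandwidth in MB/s

per_frame_budget_mb = main_bw_mb_s / fps   # ~427 MB readable per frame
streamed_per_frame_mb = 30                 # new data swapped into the scene per frame
streamed_per_second_mb = streamed_per_frame_mb * fps

print(round(per_frame_budget_mb), streamed_per_second_mb)   # 427 900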

On the other side, the CPU is weak, but it should have bandwidth to the GPU (and to the 32 MB of eDRAM) in the range of 30-100 GB/s with low latency.

So it is possible to get the result from a shader, use it on the CPU, and send it back to the shader again, and all of this can happen on the CPU side.
If bandwidth is the limit, then they can still have a lot of GPU capacity.
 
I completely agree, the bandwidth of the Wii U looks to be a bit anemic. But so does the whole machine (the bandwidth may be an order of magnitude behind a good performance GPU, but so is the raw texturing speed). What I wanted to say is that it's probably not completely off balance, it is just slow. ;)

Yeah, I think you're right. Nintendo seem to have a reputation for making fairly well balanced machines and the bandwidth is likely to be representative of their approach as a whole rather than a single outstanding problem.

And as a side note, texture fetch bandwidth does not have to mean memory bandwidth. For example, if you take a HD7970 and vary the memory bandwidth, performance tends not to scale that much. More TMUs increase the usable texture fetch bandwidth (as basically each TMU comes with its own L1 cache), and the L2 cache and its bandwidth may also play a crucial role (the L1 caches are quite tiny as already mentioned, they need the backup of the L2).

Thanks, I've been (inaccurately) talking about texture fetch bandwidth as being the amount of main memory bandwidth available (or needed) to keep the TMUs fed with anything not in the caches. As you say, this is only part of the total possible texturing bandwidth available or used.
 
So what do we know about the Audio DSP at this point? Anything new that wasn't known before launch?

Will it have a real effect on the CPU to have Audio handled separately?
 
I didn't use the Broadway die size for calculations because it is an extreme outlier, probably demonstrating why scaling down an already small CPU may not be an ideal course of action. The terrible scaling between the GC Gekko (180nm) and the Wii Broadway (90nm) isn't justified by, for instance, IBM's SRAM cell sizes on the two processes. The likely culprit is that Broadway has to drive the same external I/O pins. Conspicuously, I haven't been able to find a high res die shot of Broadway, but judging from other members of the family, it has to have issues there, to the point that it may even have some unused space (or unleaked extra qualities).

As I said, the straight scaling example actually includes three sets of external I/O circuitry, whereas the WiiU CPU doesn't have to drive any off-package data at all! So there you actually save quite a bit of die area. The 3MB of L2 cache (if we believe the rumors) is more than the 768kB of L2 that three Broadways would add up to, but then it uses eDRAM instead of SRAM, which (for complete arrays) is 2-4 times denser. For back-of-the-envelope estimation purposes, it's pretty much a wash in terms of die size for the L2.

So there is a more thorough explanation of why I think there is more to the WiiU CPU than straight shrinks of Gekko/Broadway + minimum support for MP.

I've been thinking about what you've said here and I've got a couple of questions about it.

Regarding IO, I know you specifically mention that it isn't likely to be an issue, but I'm not sure why. If we were to assume that Broadway was pad limited for IO based on its scaling from Gekko, couldn't it also be that the WiiU CPU is IO limited? It could potentially need 6 times the data that Broadway did (3 cores x twice the speed, even assuming no other increases). Couldn't that mean a potentially greater number of pads that overwhelmed the benefits of only needing on package communication?

On that subject, what is it about on-package communication that reduces IO area requirements? Is it that fewer pads are needed because you can signal faster over the shorter distance, or that smaller contact points are needed because you use less power per 'pin'? Or something else?

Finally, what do you think Nintendo have added to the cores, or are they different cores entirely? I think you're probably correct and I've been wondering what the changes might be. Some kind of beefed up SIMD / Vector support seems desirable, especially given the expected low clocks.

Sorry for the all the questions, but this is quite an interesting topic!
 
When I first read recently about OOOE, I was surprised as the CPU was rumoured to have access to 'fast local memory', whilst OOOE is obviously more advantageous for dealing with slow memory.

Is it possible that Nintendo have somehow tweaked the "memory control logic" to make the system prioritize GPU requests, taking advantage of OOOE on the CPU to avoid that component stalling?
 
Pretty sure the DDR3 isn't clocked at 800MHz. It should be 729MHz. I was told the DSP would be running at 120MHz. Looking at Nintendo's MO, it's probably not really 120MHz, but 121.5MHz - same base clock as the Wii. Nintendo likes clean multipliers, so I would assume the RAM to be clocked at 729MHz (6 x 121.5). Same as the Wii CPU. Nintendo likes to keep RAM and CPU in sync, so the CPU should be running at 1458MHz (12 x 121.5). Accordingly, the GPU would be clocked at 486MHz (4 x 121.5), and the eDRAM at either 486 or 729MHz.

I don't know why Nintendo always seems to do this. I guess using a single fixed base clock and only changing multipliers for various components is simpler. And it definitely gives more predictable results. I don't see Nintendo giving that up.
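
A tiny sketch of that clocking scheme (the base clock and the multipliers are all guesses from the post above, not confirmed figures):

BASE_MHZ = 121.5   # assumed Wii-derived base clock

multipliers = {
    "DSP":   1,    # 121.5 MHz
    "GPU":   4,    # 486 MHz
    "DDR3":  6,    # 729 MHz (eDRAM possibly 4x or 6x)
    "CPU":  12,    # 1458 MHz
}

for part, mult in multipliers.items():
    print(f"{part:5s} {mult:2d} x {BASE_MHZ} MHz = {mult * BASE_MHZ:.1f} MHz")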
 
I've been thinking about what you've said here and I've got a couple of questions about it.

Regarding IO, I know you specifically mention that it isn't likely to be an issue, but I'm not sure why. If we were to assume that Broadway was pad limited for IO based on its scaling from Gekko, couldn't it also be that the WiiU CPU is IO limited? It could potentially need 6 times the data that Broadway did (3 cores x twice the speed, even assuming no other increases). Couldn't that mean a potentially greater number of pads that overwhelmed the benefits of only needing on package communication?
No, not really. My argument was two-fold, with the first part being that a chip like Gekko/Broadway needs off-chip connections, but if you make a tricore version, the number of connections isn't going to triple. As you point out, the off-chip data communication needs would increase, but that part is addressed by keeping signals on-package.
On that subject, what is it about on-package communication that reduces IO area requirements? Is it that fewer pads are needed because you can signal faster over the shorter distance, or that smaller contact points are needed because you use less power per 'pin'? Or something else?
Bearing in mind that I'm no IC designer but a computational scientist, to the best of my knowledge both of your points above are correct. What I don't have is hard numbers, that is, if you really want to push the signaling speed per connection, how does that affect the necessary area for the associated drive circuitry? On the other hand, I can't really see that it would be an issue here, and in the cases where I've heard it described in more detail, they've claimed both benefits - much faster signaling at lower cost in die area.

Finally, what do you think Nintendo have added to the cores, or are they different cores entirely? I think you're probably correct and I've been wondering what the changes might be. Some kind of beefed up SIMD / Vector support seems desirable, especially given the expected low clocks.

Sorry for the all the questions, but this is quite an interesting topic!
Although the thread title says GPU, I'm inclined to agree. :)
As to your question, I'll be damned if I know. No developer has yet been heard gnashing his teeth about having to rewrite all SIMD code, so Nintendo/IBM adding SIMD blocks to facilitate ports is a possibility. On the other hand, Iwata has publicly made vague noises that could be interpreted as meaning that the GPU would be the way to go for parallel FP. Or not. They could also have made a complete rework of the core, a la how different manufacturers produce ARMv7 cores of differing complexity. That would cost a bit, though. Or they could have spent gates to beef up only what they deem to be key areas - after all, they have quite a bit of experience by now with where the bottlenecks have proven to be for their particular application space.

While the lack of information is frustrating for the curious, we do know a few things. We know that the die area is 33 mm² on 45nm SOI, and that the power draw is in the ballpark of 5W. We also know that it is going to be compatible with Wii titles, which makes it an open question (but not impossible) whether IBM has used a completely unrelated PPC core with sufficient performance headroom per core that performance corner cases can be avoided. "Enhanced" Broadway may indeed be the case.

It's not going to be a powerhouse under any circumstances in raw ALU capabilities compared to contemporary processors. It spends roughly a fifth of the process-adjusted die size per core (logic + cache) of the Apple A6, for instance. On the other hand, the Cell PPE or the Xenon cores aren't particularly strong either for anything but vector-parallel FP code that fits into local storage or L1 cache respectively. (An imperfect example: the iPhone 5 trumps the PS3 in both Geekbench integer and floating point tests.) The take-home message being that even if the WiiU CPU isn't a powerhouse, it isn't necessarily at much of a disadvantage vs. the current HD twins in general processing tasks, even if we think of it as a tweaked Broadway design. If the more modern GPU architecture of the WiiU indeed makes some of the applications that the SIMD units were used for unnecessary, maybe it is a better call to simply skip CPU SIMD. This is a game console, after all.

I have to say though that given what we know today, it seems to punch above its weight even at this point in time. There are a number of multi-platform ports on the system, at launch day with what that implies, that perform roughly on par with the established competitors. And those games are not developed with the greater programmability, nor the memory organization, of the WiiU in mind. So even without having its strengths fully exploited, it does a similar job at less than half the power draw of its competitors on similar lithographic processes! And it's backwards compatible. To what extent its greater programmability and substantial pool of eDRAM can be exploited to improve visuals further down the line will be interesting to follow.
How what we have seen so far can be construed as demonstrating hardware design incompetence on the part of Nintendo is an enigma to me.
 