Wii U hardware discussion and investigation *rename

Discussion in 'Console Technology' started by TheAlSpark, Jul 29, 2011.

Thread Status:
Not open for further replies.
  1. Hornet

    Newcomer

    Joined:
    Nov 28, 2009
    Messages:
    120
    Likes Received:
    0
    Location:
    Italy
    If I remember correctly, the peak read bandwidth from main memory in the Xbox 360 is just 10.8 GB/s (10.8 GB/s read and 10.8 GB/s write). On the other hand, I believe the Wii U can use 12.8 GB/s in either direction. Assuming that most of the memory accesses are read requests (e.g., for texturing), memory bandwidth might not be an issue when porting current generation titles. Also, I assume that larger caches on the CPUs leave more bandwidth free for the GPU to use.
     
  2. Hornet

    Newcomer

    Joined:
    Nov 28, 2009
    Messages:
    120
    Likes Received:
    0
    Location:
    Italy
    I was wrong, only the FSB on the Xbox 360 is limited to 10.8 GB/s in each direction. The GPU can read/write from/to the memory using all of the 128 bit interface (hence, the peak is 22.4 GB/s).

    Still, one would assume that the newer GPU in the Wii U, paired with a larger and possibly more usable RAM pool, would make better use of the available bandwidth. I think Nintendo made pretty bad decisions this time around, which is disappointing. In any case, I know I will still get a Wii U once a next-generation Zelda is released.
     
  3. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    If, for the sake of argument, the Wii U had 16 texture sampling units each pulling a 4x4 DXTC/S3TC block from main memory per clock at 500 MHz (like Xenos) then that would almost saturate the entire main memory bus. Assuming I've got this right, of course.

    Unless the Wii U has a massively more efficient memory controller than the Xbox 360, it seems likely that either the system will be bandwidth starved, or that the GPU is running at less than 500 MHz and has no more than 16 or 20 TMUs.

    Know that feel bro.
     
  4. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Assuming no reuse is quite unrealistic. Texture caches are there for a reason.
    Or just compare it with a performance/high end GPU from the PC space:
    GTX680 has 128 TMUs running at ~1GHz and has 192 GB/s memory bandwidth. Is the ratio any better there? Wait, it's even worse! :lol:

    12.8 GB/s / (16 TMUs * 0.5 GHz) = 1.6 Bytes per Texel
    192 GB/s / (128 TMUs * 1.0 GHz) = 1.5 Bytes per Texel

    The Wii U has to serve the CPU from the same bandwidth, but it also has the eDRAM (reducing the bandwidth requirements).
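    As a sanity check on the bytes-per-texel ratios above (a back-of-the-envelope sketch in Python; the Wii U figures are the assumed ones from this thread, not confirmed specs):

```python
# Main-memory bytes available per texel fetched per clock, for a given
# bandwidth / TMU count / clock. Figures are the ones quoted in the post.

def bytes_per_texel(bandwidth_gbs, tmus, clock_ghz):
    """Bandwidth budget (bytes) per bilinear texel fetch per clock."""
    return bandwidth_gbs / (tmus * clock_ghz)

wii_u = bytes_per_texel(12.8, tmus=16, clock_ghz=0.5)     # assumed Wii U config
gtx680 = bytes_per_texel(192.0, tmus=128, clock_ghz=1.0)  # GTX 680

print(f"Wii U (assumed): {wii_u:.2f} B/texel")   # 1.60
print(f"GTX 680:         {gtx680:.2f} B/texel")  # 1.50
```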
     
  5. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,360
    Likes Received:
    1,377
    I didn't use the Broadway die size for calculations because it is an extreme outlier, probably demonstrating why scaling down an already small CPU may not be an ideal course of action. The terrible scaling between the GC Gekko (180nm) and the Wii Broadway (90nm) isn't justified by, for instance, IBM's SRAM cell sizes on the two processes. The likely culprit is that Broadway has to drive the same external I/O pins. Conspicuously, I haven't been able to find a high-res die shot of Broadway, but judging from other members of the family, it has to have issues there, to the point that it may even have some unused space (or unleaked extra qualities).

    As I said, the straight scaling example actually includes three sets of external I/O circuitry, whereas the WiiU CPU doesn't have to drive any off-package data at all! So there you actually save quite a bit of die area. The 3 MB of L2 cache (if we believe the rumors) is more than the 768 kB of L2 that three Broadways would add up to, but then it uses eDRAM instead of SRAM, which (for complete arrays) is 2-4 times denser. For back-of-the-envelope estimation purposes, it's pretty much a wash in terms of die size for the L2.

    So there is a more thorough explanation of why I think there is more to the WiiU CPU than straight shrinks of Gekko/Broadway + minimum support for MP.
     
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    If the 12.8 GB/s memory bandwidth is indeed true, that can surely be problematic. AMD's last-generation Llano APUs (Radeon HD 6550D) already have 29.9 GB/s of memory bandwidth and are highly bandwidth starved (proven by memory overclocks that give almost linear performance improvements). Llano runs most console ports slightly faster (~40 fps) than the Xbox 360 & PS3 with similar IQ settings and resolution.

    In comparison, the Radeon 4000 series has a memory bandwidth of 115 GB/s. And that's just for the GPU alone. The Wii U is supposed to have eDRAM (Wikipedia), so that of course helps. But if the main memory bandwidth is really that low (9x lower than Radeon 4000), the eDRAM needs to be used extensively throughout the rendering. If the eDRAM supports read & write, it's possible to limit the memory traffic a lot, but it requires lots of algorithm tuning / compromises. For example you might want to render shadow maps to eDRAM and read them directly from there (to save all shadow map rendering and sampling bandwidth). However, this technique would require more passes if you have lots of lights or want to use high resolution shadow maps (eDRAM has limited space).
    The four bits (0.5 bytes) per pixel (DXT1) figure is realistic. That already includes reuse.

    Without reuse (+trilinear and +bad access pattern) the worst case figure is: eight 4x4 DXT1 blocks per sample. Four blocks for filtering if the sample is on a DXT block border, multiplied by two, because trilinear filtering uses two mipmaps. That's 8*8=64 bytes per pixel. If the pixel is not on a DXT block border (the more common case for random accessing), you need to fetch 2*8=16 bytes per pixel instead.

    The main purpose of GPU texture cache is to keep the filtering data (and the remaining pixels from DXT blocks) in the cache. With good cache utilization you can achieve the 4 bit per pixel ratio. Anything better than that is uncommon for generic cases as the GPU texture caches are very small (the cache is most likely completely reused before the next draw call).
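    sebbbi's block-fetch arithmetic can be written out explicitly (a sketch; the block size follows from the DXT1 format, the scenarios are the ones described above):

```python
# A 4x4 DXT1 block stores 16 texels at 4 bits each = 8 bytes.
DXT1_BLOCK_BYTES = 8

def bytes_per_pixel(blocks_per_mip, mip_levels):
    """Bytes fetched per filtered pixel when nothing is reused from cache."""
    return blocks_per_mip * mip_levels * DXT1_BLOCK_BYTES

# Worst case: a sample on a block border touches 4 blocks for bilinear
# filtering, and trilinear doubles that across two mip levels.
worst = bytes_per_pixel(blocks_per_mip=4, mip_levels=2)   # 64 bytes
# More common random-access case: one block per mip level.
common = bytes_per_pixel(blocks_per_mip=1, mip_levels=2)  # 16 bytes
print(worst, common)  # 64 16
```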
     
  7. MDX

    MDX
    Newcomer

    Joined:
    Nov 28, 2006
    Messages:
    206
    Likes Received:
    0

    Question,
    What benefits does placing the GPU and CPU so close together provide?

    Nintendo states:
    Doesn't this mean that bandwidths, clock speeds, etc. can be reduced while still providing the same performance? Doesn't this also help in offloading as much processing as possible to the GPU?
     
  8. XpiderMX

    Veteran

    Joined:
    Mar 14, 2012
    Messages:
    1,768
    Likes Received:
    0
    Talking about the "slow cpu" issue...

    Is GPGPU a good solution?
    Can a GPU do GPGPU work and graphics operations at the same time without a performance penalty?
     
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    If you assume one needs to fetch only about 0.5 bytes per filtered texel including the reuse, it doesn't saturate the memory bandwidth (as it is less than a third of the available amount). ;)
    The worst case is not sustainable in any realistic scenario. My guess would be that with trilinear filtering and reuse through the texture cache, one will probably arrive not too much above 5 bits/filtered texel (that's what the trilinear filtering/LOD algorithm shoots for), or more generally, one needs to fetch about 1.25 individual unfiltered texels for each filtered one (edit: after thinking about it a bit, it's probably a bit more).
    True.
    Of course one can construct situations where one needs a bit less (or significantly more), but I assume it works quite well on average when one considers the reluctance of nV or AMD to increase the size of their texture caches. L1 caches have been stuck in the same size range for more than a decade, at 6-8 kB per quad TMU in most GPUs. Only Southern Islands increased it to 16 kB (Kepler to 12 kB), probably mainly because it now doubles as a general purpose data cache (read only in the case of Kepler) and it gets quite expensive to scale the bandwidth of the texture L2 with the rising number of consumers.
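    The ~5 bits/filtered texel figure follows directly from the 1.25-fetch estimate (a trivial check; the 1.25 factor is Gipsel's estimate, not a measured value):

```python
# DXT1 stores 4 bits per texel; fetching ~1.25 distinct texels per
# filtered output texel gives the ~5 bits/filtered texel estimate.
DXT1_BITS_PER_TEXEL = 4
fetches_per_filtered_texel = 1.25  # estimate from the post

bits_per_filtered_texel = fetches_per_filtered_texel * DXT1_BITS_PER_TEXEL
print(bits_per_filtered_texel)  # 5.0
```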
     
    #3409 Gipsel, Nov 22, 2012
    Last edited by a moderator: Nov 22, 2012
  10. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    I was looking at a worst case scenario - it doesn't need to be happening all the time to cause issues for anything else on the memory bus (i.e. the CPU). Even with only 16 4-bit texel fetches per clock at 500 MHz you're looking at 8 bytes x 500 MHz = ~4 GB/s, or about a third of a theoretical 12.8 GB/s (in practice this may be significantly lower). If you wanted to sample from 24-bit or 32-bit textures then you'd be in pretty bad shape.
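    The arithmetic above, plus the uncompressed-texture case, sketched in Python (assumed figures from the post, not confirmed hardware specs):

```python
# Main-memory bandwidth needed to feed the TMUs at a sustained fetch rate.
def fetch_bandwidth_gbs(fetches_per_clock, bytes_per_fetch, clock_mhz):
    """Returns GB/s consumed by texture fetches that miss all caches."""
    return fetches_per_clock * bytes_per_fetch * clock_mhz * 1e6 / 1e9

# 16 fetches/clock of 4-bit (0.5 byte) DXT1 texels at 500 MHz:
dxt1 = fetch_bandwidth_gbs(16, 0.5, 500)   # 4.0 GB/s, ~1/3 of 12.8 GB/s
# The same fetch rate from uncompressed 32-bit (4 byte) textures:
rgba8 = fetch_bandwidth_gbs(16, 4.0, 500)  # 32.0 GB/s, far beyond the bus
```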

    And what's to say that high end GPUs aren't bottlenecked by texture fetch bandwidth? Here's ERP, yesterday:

     
  11. Strange

    Veteran

    Joined:
    May 16, 2007
    Messages:
    1,698
    Likes Received:
    428
    Location:
    Somewhere out there
    I think the most important takeaway from that statement is that it costs less and consumes less power.
     
  12. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I completely agree, the bandwidth of the Wii U looks to be a bit anemic. But so does the whole machine (the bandwidth may be an order of magnitude behind a good performance GPU, but so is the raw texturing speed). What I wanted to say is that it's probably not completely off balance, it is just slow. :wink:

    And as a side note, texture fetch bandwidth does not have to mean memory bandwidth. For example, if you take an HD 7970 and vary the memory bandwidth, performance doesn't scale all that much. More TMUs increase the usable texture fetch bandwidth (as basically each TMU comes with its own L1 cache), and the L2 cache and its bandwidth may also play a crucial role (the L1 caches are quite tiny, as already mentioned; they need the backup of the L2).
     
  13. bomlat

    Regular

    Joined:
    Nov 5, 2006
    Messages:
    327
    Likes Received:
    0
    From the limited amount of information that we have, I think the Wii U's design reads as a composition of low cost and high performance per watt.

    The memory has low bandwidth, but there is a lot of it.
    So say you can render 300 MB worth of data in one frame, you can change the rendered scene by 30 MB per frame, and you can show 900 MB of data within one second.

    So if you write a game for the Wii U, and it is not a port, then you can show nice things of a kind that are not possible on the Xbox/PS3.

    On the other side, the CPU is weak, but it should have a link to the GPU (and the 32 MB of eDRAM) with bandwidth in the range of 30-100 GB/s and low latency.

    So it is possible to get a result from a shader, use it on the CPU, and send it back to the shader again, and all of this can happen on-chip.
    If they are bandwidth limited, they can still have a lot of GPU capacity.
     
  14. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    Yeah, I think you're right. Nintendo seem to have a reputation for making fairly well balanced machines and the bandwidth is likely to be representative of their approach as a whole rather than a single outstanding problem.

    Thanks, I've been (inaccurately) talking about texture fetch bandwidth as being the amount of main memory bandwidth available (or needed) to keep the TMUs fed with anything not in the caches. As you say, this is only part of the total possible texturing bandwidth available or used.
     
  15. TheLump

    Regular

    Joined:
    Jul 13, 2012
    Messages:
    280
    Likes Received:
    9
    So what do we know about the audio DSP at this point? Anything new that wasn't known before launch?

    Will it have a real effect on the CPU to have audio handled separately?
     
  16. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    I've been thinking about what you've said here and I've got a couple of questions about it.

    Regarding IO, I know you specifically mention that it isn't likely to be an issue, but I'm not sure why. If we were to assume that Broadway was pad limited for IO based on its scaling from Gekko, couldn't it also be that the Wii U CPU is IO limited? It could potentially need 6 times the data that Broadway did (3 cores x twice the speed, even assuming no other increases). Couldn't that mean a potentially greater number of pads, overwhelming the benefits of only needing on-package communication?

    On that subject, what is it about on-package communication that reduces IO area requirements? Is it that fewer pads are needed because you can signal faster over the shorter distance, or that smaller contact points are needed because you use less power per 'pin'? Or something else?

    Finally, what do you think Nintendo have added to the cores, or are they different cores entirely? I think you're probably correct and I've been wondering what the changes might be. Some kind of beefed up SIMD / Vector support seems desirable, especially given the expected low clocks.

    Sorry for all the questions, but this is quite an interesting topic!
     
  17. dumbo11

    Regular

    Joined:
    Apr 21, 2010
    Messages:
    440
    Likes Received:
    7
    When I first read about OOOE recently, I was surprised, as the CPU was rumoured to have access to 'fast local memory', whilst OOOE is obviously more advantageous for dealing with slow memory.

    Is it possible that Nintendo have somehow tweaked the "memory control logic" to make the system prioritize GPU requests, taking advantage of OOOE on the CPU to avoid that component stalling?
     
  18. wsippel

    Newcomer

    Joined:
    Nov 24, 2006
    Messages:
    229
    Likes Received:
    0
    Pretty sure the DDR3 isn't clocked at 800MHz. It should be 729MHz. I was told the DSP would be running at 120MHz. Looking at Nintendo's MO, it's probably not really 120MHz, but 121.5MHz - same base clock as the Wii. Nintendo likes clean multipliers, so I would assume the RAM to be clocked at 729MHz (6 x 121.5). Same as the Wii CPU. Nintendo likes to keep RAM and CPU in sync, so the CPU should be running at 1458MHz (12 x 121.5). Accordingly, the GPU would be clocked at 486MHz (4 x 121.5), and the eDRAM at either 486 or 729MHz.

    I don't know why Nintendo always seems to do this. I guess using a single fixed base clock and only changing multipliers for various components is simpler. And it definitely gives more predictable results. I don't see Nintendo giving that up.
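    wsippel's proposed clock tree is easy to tabulate (speculative multipliers from the post; none of these clocks are confirmed):

```python
# Every component clock as an integer multiple of the Wii's 121.5 MHz base.
BASE_MHZ = 121.5
MULTIPLIERS = {"DSP": 1, "GPU": 4, "RAM (DDR3)": 6, "CPU": 12}

clocks = {name: mult * BASE_MHZ for name, mult in MULTIPLIERS.items()}
for name, mhz in clocks.items():
    print(f"{name:>10}: {mhz:g} MHz")
# DSP 121.5, GPU 486, RAM 729, CPU 1458
```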
     
  19. Rangers

    Legend

    Joined:
    Aug 4, 2006
    Messages:
    12,791
    Likes Received:
    1,596
    Where does this fit in?

    http://n4g.com/news/1123950/rumor-wii-u-cpu-could-be-clocked-at-1-22ghz

    It's very close to, but not quite, 10 x 121.5.

    On the RAM though, that's even LESS than the worst case bandwidth previously thought. Ouch. Would be more like 11.7 GB/s. HALF the 360's. Just ouch.
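    The 11.7 GB/s figure falls out of the speculated 729 MHz DDR3 clock on a 64-bit bus (a sketch under those assumptions; DDR transfers data twice per clock):

```python
# Peak DDR3 bandwidth: clock * 2 transfers/clock * bus width in bytes.
def ddr3_bandwidth_gbs(clock_mhz, bus_bits=64):
    return clock_mhz * 1e6 * 2 * (bus_bits / 8) / 1e9

print(f"{ddr3_bandwidth_gbs(729):.2f} GB/s")  # 11.66 GB/s, vs 22.4 GB/s on 360
```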
     
  20. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,360
    Likes Received:
    1,377
    No, not really. My argument was two-fold, with the first part being that a chip like Gekko/Broadway needs off-chip connections, but if you make a tri-core version, the number of connections isn't going to triple. As you point out, the off-chip data communication needs would increase, but that part is addressed by keeping signals on-package.
    Bearing in mind that I'm no IC designer but a computational scientist, to the best of my knowledge both of your points above are correct. What I don't have is hard numbers; that is, if you really want to push the signaling speed per connection, how does that affect the necessary area for the associated drive circuitry? On the other hand, I can't really see that it would be an issue here, and in the cases where I've heard it described in more detail, they've claimed both benefits - much faster signaling at lower cost in die area.

    Although the thread title says GPU, I'm inclined to agree. :)
    As to your question, I'll be damned if I know. No developer has yet been heard gnashing his teeth about having to rewrite all SIMD code so Nintendo/IBM adding SIMD blocks to facilitate ports is a possibility. On the other hand Iwata has publicly made vague noises that could be interpreted as that the GPU would be the way to go for parallel FP. Or not. They could also have made a complete rework of the core, a la how different manufacturers produce ARMv7 cores of differing complexity. That would cost a bit though. Or they could have spent gates to beef up only what they deem to be key areas - after all, they have quite a bit of experience by now with where the bottlenecks have proven to be for their particular application space.

    While the lack of information is frustrating for the curious, we do know a few things. We know that the die area is 33 mm² on 45nm SOI, and that the power draw is in the ballpark of 5 W. We also know that it is going to be compatible with Wii titles, which makes it an open question (but not impossible) whether IBM has used a completely unrelated PPC core with sufficient performance headroom per core that performance corner cases can be avoided. "Enhanced" Broadway may indeed be the case.

    It's not going to be a powerhouse in raw ALU capabilities under any circumstances compared to contemporary processors. It spends roughly a fifth of the process-adjusted die size per core (logic+cache) as the Apple A6, for instance. On the other hand, the Cell PPE or the Xenon cores aren't particularly strong either for anything but vector-parallel FP code that fits into local storage or L1 cache respectively. (An imperfect example: the iPhone 5 trumps the PS3 in both Geekbench integer and floating point tests.) The take-home message is that even if the WiiU CPU isn't a powerhouse, it isn't necessarily at much of a disadvantage vs. the current HD twins in general processing tasks, even if we think of it as a tweaked Broadway design. If the more modern GPU architecture of the WiiU indeed makes some of the applications that the SIMD units were used for unnecessary, maybe it is a better call to simply skip CPU SIMD. This is a game console, after all.

    I have to say though that, given what we know today, it seems to punch above its weight even at this point in time. There are a number of multi-platform ports on the system, at launch day with all that implies, that perform roughly on par with the established competitors. And those games were not developed with the greater programmability, nor the memory organization, of the WiiU in mind. So even without having its strengths fully exploited, it does a similar job at less than half the power draw of its competitors on similar lithographic processes! And it's backwards compatible. To what extent its greater programmability and substantial pool of eDRAM can be exploited to improve visuals further down the line will be interesting to follow.
    How what we have seen so far can be construed as demonstrating hardware design incompetence on the part of Nintendo is an enigma to me.
     