Wii U hardware discussion and investigation *rename

Discussion in 'Console Technology' started by TheAlSpark, Jul 29, 2011.

Thread Status:
Not open for further replies.
  1. Commenter

    Newcomer

    Joined:
    Jan 9, 2010
    Messages:
    234
    Likes Received:
    17
    I wonder, would paired singles allow 2 integer multiply operations at a time?
     
    #5401 Commenter, Aug 30, 2013
    Last edited by a moderator: Aug 30, 2013
  2. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    Well Google found this for me:

    https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/A88091CAFE0F19CE852575EE0073078A/$file/To%20CL%20-%20CL%20Special%20Features%206-22-09.pdf

    Section 4, "Paired-Single Precision Floating Point Operations", doesn't seem to give any indication of SIMD-like integer capabilities...?
     
  3. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Like you found, singles = "single precision float", so no, there's no integer SIMD.
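
    As a rough illustration of why float-only lanes don't stand in for integer SIMD, here is a small Python model of a two-lane single-precision multiply-add. The function names are made up for this sketch, and it computes in double precision before rounding the result to single, so it is not guaranteed to be bit-exact against the hardware instruction. The last line shows the catch for integer data: a single-precision lane cannot hold every integer above 2^24.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ps_madd_model(a, c, b):
    """Hypothetical model of a paired multiply-add: each lane gets a*c + b."""
    return tuple(
        to_f32(to_f32(ai) * to_f32(ci) + to_f32(bi))
        for ai, ci, bi in zip(a, c, b)
    )

# Two float multiply-adds in one "instruction".
result = ps_madd_model((1.5, 2.0), (2.0, 0.25), (0.5, 1.0))

# 2^24 + 1 is not representable in single precision, so it rounds to 2^24.
rounded = to_f32(float(2**24 + 1))
```

    So even if you pushed integers through the float lanes, anything needing more than 24 bits of precision would silently lose its low bits.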
     
  4. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    Thanks again.

    So the performance difference for integer workloads is potentially even greater than for floating point workloads. So much for Nbench based beard stroking then.

    After 7+ years of vectorising as many performance-critical tasks as possible, I guess it makes sense that developers looked at the Wii U with its float-only paired-singles and groaned.
     
  5. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Yeah, although people often do use "integer" synonymously with "scalar" or "branchy" code, stuff that doesn't vectorize well. Some people are still scratching their heads over what integer SIMD is useful for, although I personally use it like crazy (and rarely use float SIMD).

    There was at least one dev who gave an interesting insight into Wii U optimization. He basically said to go for L2 cache locality or go home. On aggregate Wii U has substantially more L2 cache than Xbox 360. Sizing out algorithms to try to fit a particular cache size is often not very high on a typical developer's checklist, although in the case of Xbox 360 it was probably pretty vital to be as cache resident as possible since the main RAM latency was so bad. It could be that the latency isn't that great on Wii U either, what with it going through a memory controller on the GPU. So that means resize your stuff to try to get a better hit-rate on Wii U's L2 if you can. But the weird 2MB + 512KB + 512KB local L2 structure is very different from shared 1MB, and I'm sure that presents some of its own challenges (there could be a substantial penalty for sharing data structures between cores now; it could even be as bad as going out to memory or even worse, depending on how they did it).
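
    A minimal sketch of what "sizing out an algorithm to fit a particular cache" can look like: pick a tile size from a cache budget, then walk the data tile by tile. The helper names, the 4-byte element size, and the choice to use only half the cache are assumptions for illustration. Pure Python lists don't give real control over cache behaviour; the loop structure is the part that would carry over to C or C++.

```python
import math

KIB = 1024

def tile_side(cache_bytes, elem_bytes=4, arrays=2, fraction=0.5):
    """Largest square tile side so `arrays` tiles fit in `fraction` of the cache."""
    budget = int(cache_bytes * fraction)
    return math.isqrt(budget // (arrays * elem_bytes))

def blocked_transpose(src, tile):
    """Transpose a square matrix one tile at a time to keep accesses local."""
    n = len(src)
    dst = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # Everything touched in these inner loops stays within one tile
            # of the source and one tile of the destination.
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, n)):
                    dst[j][i] = src[i][j]
    return dst

small_core_tile = tile_side(512 * KIB)    # for a 512 KB L2
large_core_tile = tile_side(2048 * KIB)   # for a 2 MB L2
```

    Under these assumptions the tile side roughly doubles between the 512 KB and 2 MB cases, which is one reason code tuned for one core's L2 may not be tuned for the other's.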
     
  6. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    One would hope the CPU die contains some mechanism to share data between CPU caches. Otherwise you'll be doing a round-trip through that not-particularly-awesome 12GB/s main RAM, which'll be hammered by GPU accesses at the same time.
     
  7. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    I agree, one would definitely hope, but I would have hoped for a lot of things that Nintendo didn't do here :p
     
  8. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    I had originally thought that the 2MB L2 of the "master core" might have been a way of trying to engineer what was effectively an L3 cache without actually engineering the L3 bit, i.e. L2 cache misses from the "slave" cores would benefit from improved performance and save on main memory bandwidth by pulling data from the large L2 of the master core.

    But it sounds like that's the last thing that you'd want. It sounds like the large L2 of the master core is actually there to assuage the fears of developers who are scared shitless of symmetrical multicore development, and who need a core that they can fall back on that can easily handle the L2 accesses for large data sets.

    Kind of seems to go against the idea of symmetrical multicore, i.e. some algorithms will need to be optimised for the non-master cores rather than the master core, or they'll suffer from performance penalties. 3 x 1MB would have made a lot more sense from a layperson's perspective: fulfilling the objective (as stated explicitly in the Project Cafe leaked slides) of making 360 ports easily manageable by allowing L2 datasets of the same size as the 360's shared cache to fit in each of Espresso's CPU L2 caches.
     
    #5408 function, Aug 31, 2013
    Last edited by a moderator: Sep 1, 2013
  9. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    I'm having difficulties understanding how 3x1MB caches would be superior to 2x1MB + 1x2MB, from any perspective.

    Just doesn't make sense to me, but perhaps there's some drug that's currently lacking in my system that could shed additional enlightenment...? :)
     
  10. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    Espresso is 1 x 2MB and 2 x 0.5 MB for a total of 3 MB.

    I was suggesting 3 x 1 MB (same total L2 and presumably same die size) so that the "weaker" 0.5 MB cores would have improved performance (and have access to the same amount of L2 as each Xenon core), and so that you could run any task on any core with the same level of performance (greater flexibility in scheduling tasks).
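
    To spell out the arithmetic behind the two layouts, here is a tiny sketch that counts how many cores could hold a given working set entirely in their own L2. The function name is hypothetical and the model ignores everything except raw capacity (associativity, code footprint, other data sharing the cache).

```python
# Per-core L2 sizes in KB, as stated in the thread.
ESPRESSO_L2_KB = [2048, 512, 512]        # 1 x 2 MB + 2 x 0.5 MB
SUGGESTED_L2_KB = [1024, 1024, 1024]     # the 3 x 1 MB alternative

def cores_that_fit(l2_sizes_kb, working_set_kb):
    """Count cores whose private L2 is at least as large as the working set."""
    return sum(1 for size in l2_sizes_kb if working_set_kb <= size)

same_total = sum(ESPRESSO_L2_KB) == sum(SUGGESTED_L2_KB)

# A 1 MB working set (the size of the 360's shared L2):
fit_1mb_espresso = cores_that_fit(ESPRESSO_L2_KB, 1024)
fit_1mb_suggested = cores_that_fit(SUGGESTED_L2_KB, 1024)

# A 2 MB working set:
fit_2mb_espresso = cores_that_fit(ESPRESSO_L2_KB, 2048)
fit_2mb_suggested = cores_that_fit(SUGGESTED_L2_KB, 2048)
```

    The trade-off cuts both ways: a 1 MB set fits on every core of the 3 x 1 MB layout but only one Espresso core, while a 2 MB set fits on Espresso's big core and nowhere in the symmetric layout.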
     
  11. Commenter

    Newcomer

    Joined:
    Jan 9, 2010
    Messages:
    234
    Likes Received:
    17
    Can you ever have too much cache?
     
  12. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
    I guess cash or cache it all works the same :lol:
     
  13. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    Yes. The more you have, the slower it is to access (higher latency). A balance has to be struck between quantity and speed. To gain the benefits of different quantity and speed combinations, different cache layers are used - instruction and data caches, L1, L2, and sometimes L3, each smaller and faster than the next (this concept extends to RAM and storage, all increasing capacity at a reduction in speed).
     
  14. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    Ah, right. I forgot; my bad.

    I wonder what real performance difference there is between half and one meg of cache, considering say an Intel Core i-series CPU has only 256k L2 (although paired with an L3) and manages quite well. Perhaps only on the order of a few percent in most cases? Making cache four times larger for one out of the three cores is odd though, but then again SO MUCH of the wuu is just fricken odd, so better not get hung up on this particular oddity, eh? :)
     
  15. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    If the developer who swears that they only got acceptable performance by optimizing for Wii U's cache sizes is to be believed (sorry I don't remember the source) then I doubt the difference between 512KB and 2MB is negligible. It's not just that Core i has L3 cache but also most likely much lower latency (and higher bandwidth) main RAM, much more reordering capability, automatic prefetching, SMT, and more sophisticated memory controllers with more concurrency. Many more facilities for hiding the cost of a cache miss.

    It makes you think, if half as much L2 cache would have been about as good then they really shouldn't have bothered going with eDRAM on this. But then again, I don't know if there was a great technical reason for this, they could have just been suckered into whatever IBM wanted to sell them. Ideally they wouldn't be getting IBM to fab this at all, and would have managed some kind of CPU on the same die as the rest of the stuff.
     
  16. jlippo

    Veteran

    Joined:
    Oct 7, 2004
    Messages:
    1,744
    Likes Received:
    1,090
    Location:
    Finland
    Cache size seems to affect power usage as well. (Reason why nvidia proposed additional 1KB L0 cache.)
     
  17. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    Gamecube only had 256k cache and ~2.6GB/s main RAM bandwidth (although ridiculously low latency, at least from the GPU side since that's where the memory controller is), it did fine. 512k caches get 90+ percent hitrate typical case, so I have a hard time seeing how diff between 512k and 2MB would really be all that monstrous. Must be something else going on as well methinks if your source is correct.
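
    One way to put numbers on this is the standard average-memory-access-time formula: hit time plus miss rate times miss penalty. The cycle counts below are invented for illustration and are not Wii U measurements; the point is only that the cost of a few percent of extra misses scales with how expensive a miss is.

```python
def amat(hit_cycles, hit_rate, miss_penalty_cycles):
    """Average memory access time in cycles for a single cache level."""
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles

L2_HIT_CYCLES = 20   # assumed L2 hit latency, for illustration only

# Same two hit rates, evaluated with a cheap and an expensive miss.
cheap_90 = amat(L2_HIT_CYCLES, 0.90, 100)
cheap_95 = amat(L2_HIT_CYCLES, 0.95, 100)
costly_90 = amat(L2_HIT_CYCLES, 0.90, 400)
costly_95 = amat(L2_HIT_CYCLES, 0.95, 400)

gap_cheap = cheap_90 - cheap_95
gap_costly = costly_90 - costly_95
```

    With the assumed 100-cycle miss, the two hit rates differ by about 5 cycles per access; with the assumed 400-cycle miss the gap is about 20 cycles. A jump from 90% to 95% also halves the number of misses, so "90+ percent" can hide a large spread.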

    L2 wouldn't have helped GPU performance; without eDRAM the system would have choked on that pathetic 12GB/s main RAM B/W.

    You really think nintendo's dumb enough to be suckered into a deal like this? They've been building consoles for thirty years, you'd think they would know better than THAT at least. :)

    Why not? IBM has an advanced eDRAM process, and they're the developer of the CPU core, so I would think they'd be ideal for ninty to work with. Especially since they probably have spare fab capacity too and might be able to offer a good price, since they only make chips for the big iron enterprise market, which isn't all that big despite the moniker. :)
     
  18. DRS

    DRS
    Newcomer

    Joined:
    May 22, 2009
    Messages:
    135
    Likes Received:
    0
    Do you think it gets 90% hits on data access too? My guess is you can hold a pretty big model in that cache and perform the operations you want to do on it sequentially.
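
    A toy simulation of that scenario: repeatedly sweep a working set in order through a cache and measure the hit rate. It assumes a fully associative LRU cache and 64-byte lines, neither of which is claimed to match Espresso's actual L2, and the function name is made up for the sketch.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed line size, for illustration only

def sweep_hit_rate(working_set_bytes, cache_bytes, passes):
    """Hit rate when sweeping a working set sequentially `passes` times."""
    ws_lines = working_set_bytes // LINE_BYTES
    cache_lines = cache_bytes // LINE_BYTES
    cache = OrderedDict()          # keys are line indices, ordered by recency
    hits = accesses = 0
    for _ in range(passes):
        for line in range(ws_lines):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)        # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return hits / accesses

MIB = 1024 * 1024
fits = sweep_hit_rate(1 * MIB, 2 * MIB, passes=4)       # 1 MB set, 2 MB cache
thrashes = sweep_hit_rate(1 * MIB, MIB // 2, passes=4)  # 1 MB set, 512 KB cache
```

    In this model a working set that fits hits on every pass after the first, while one that is even modestly too large misses on every access, because LRU evicts each line just before it is needed again. Real replacement policies and prefetching soften the cliff, but the shape is why fitting the cache can matter so much.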
     
  19. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Gamecube's competitor PS2 had no L2 cache at all. Did it not do fine? What are your criteria here, exactly? We're talking about older systems with much lower clock speeds working on different datasets; things change.

    Surely you don't think everyone else in the industry was also wrong for making caches bigger and bigger.

    I'm not sure you're aware that there are two eDRAM pools on the system. The L2 cache itself is implemented with eDRAM. I'm not talking about the eDRAM on the GPU chip.

    Yes.

    So do they need the eDRAM for L2 cache or not? If you think 512KB per core is enough then the answer would be "no." The GPU chip has nothing to do with IBM.

    You're right, they're ideal for Nintendo because Nintendo over-values backwards compatibility. Had they sacrificed it and moved on to a better CPU they could have done a unified SoC and a more powerful one at that. Everyone else is dropping BC, I don't think they needed it to survive, not at the expense of a better system.

    Mind you, going with Renesas is looking pretty bad too.
     
  20. see colon

    see colon All Ham & No Potatos
    Veteran

    Joined:
    Oct 22, 2003
    Messages:
    2,758
    Likes Received:
    2,207
    720P with FSAA is 14/28 MB for 2x/4x. I'm not sure what Megafenix is talking about, because it is the same everywhere, but you would have to tile with FSAA.

    Do we know how much eDram is there?
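
    For reference, the 14/28 MB figures fall out of a simple calculation if you assume 4 bytes of colour plus 4 bytes of depth/stencil per sample (that per-sample layout is an assumption here, not something stated in the thread). The tile count uses a made-up 10 MiB pool purely as an example, since the eDRAM size is the open question.

```python
import math

def framebuffer_mib(width, height, samples, color_bytes=4, depth_bytes=4):
    """Size in MiB of a multisampled colour + depth buffer."""
    total_bytes = width * height * samples * (color_bytes + depth_bytes)
    return total_bytes / (1024 * 1024)

def tiles_needed(buffer_mib, pool_mib):
    """How many passes (tiles) are needed if the buffer exceeds the pool."""
    return math.ceil(buffer_mib / pool_mib)

size_1x = framebuffer_mib(1280, 720, samples=1)
size_2x = framebuffer_mib(1280, 720, samples=2)
size_4x = framebuffer_mib(1280, 720, samples=4)

EXAMPLE_POOL_MIB = 10  # hypothetical pool size, for illustration only
tiles_2x = tiles_needed(size_2x, EXAMPLE_POOL_MIB)
tiles_4x = tiles_needed(size_4x, EXAMPLE_POOL_MIB)
```

    That gives roughly 7 MiB with no multisampling, 14 MiB at 2x, and 28 MiB at 4x, so whether tiling is needed depends entirely on how much eDRAM is actually available.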
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.