Wii U hardware discussion and investigation *rename

so does this change your opinion on the 20 shaders per block theory?
Not too much. 160 shaders with 8 ROPs and 16 TMUs is possible, just as 320 with 16 TMUs and 8 ROPs is. I am not 100% sure it's 20 SPs per block, but I think it's still very possible. 160:16:8 would be exactly double the numbers on an ATI 4550; 320:16:8 would be exactly double the numbers on an AMD 6450.
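To put rough numbers on those two candidate configurations, here is a back-of-the-envelope sketch (assuming the commonly reported 550 MHz GPU clock and the usual 2 FLOPs, i.e. one MADD, per shader per clock):

Code:
#include <stdio.h>

/* Rough peak ALU throughput of the two candidate configs,
   assuming 550 MHz and 2 FLOPs (one MADD) per shader per clock. */
int main(void)
{
    const double clock_ghz = 0.550;                          /* commonly reported GPU clock */
    printf("160 SPs: %.0f GFLOPS\n", 160 * 2 * clock_ghz);   /* 2x an HD 4550 (80:8:4)  */
    printf("320 SPs: %.0f GFLOPS\n", 320 * 2 * clock_ghz);   /* 2x an HD 6450 (160:8:4) */
    return 0;
}

That works out to roughly 176 GFLOPS for the 160 SP case and 352 GFLOPS for the 320 SP case.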


Anybody catch the Sonic & All-Stars Racing Transformed face-off from Digital Foundry? The Wii U had the worst resolution of all 3 versions. I'm starting to think the Wii U has 160 SPs like Function and Esrever suggested, or it has some nasty bottlenecks.
There are probably lots of bottlenecks with workarounds that developers aren't familiar with yet. Even with 320 shaders there could be problems with other parts of the system that require more optimization to get games running as well as they do on current-gen consoles.
 
Not too much. 160 shaders with 8 ROPs and 16 TMUs is possible, just as 320 with 16 TMUs and 8 ROPs is. I am not 100% sure it's 20 SPs per block, but I think it's still very possible. 160:16:8 would be exactly double the numbers on an ATI 4550; 320:16:8 would be exactly double the numbers on an AMD 6450.


There are probably lots of bottlenecks with workarounds that developers aren't familiar with yet. Even with 320 shaders there could be problems with other parts of the system that require more optimization to get games running as well as they do on current-gen consoles.

I understand your stance on considering 160 SPs from a technical point of view, but looking at the game titles, I don't see how that is feasible. Developers already have to optimize their games for a CPU that is definitely weaker at the tasks the current-gen CPUs are strongest at, so I don't see how Wii U's launch ports could be as roughly on par as they are if the GPU were also lacking raw power compared to the other systems.

I agree with the second part of your post. This GPU seems to have a very unusual architecture, so it is not surprising to see some weird issues even if the system has a stronger GPU.
 
R600's was a one-off in that it was a read/write cache. In the following generations it was write-only. Your size analysis explains why the read caching was dropped.
It couldn't have been write-only, otherwise it wouldn't have been possible to get the readback from atomic operations ;). Actually, it was explicitly labeled as an R/W cache in the Cypress block diagrams. There were 8 sections of 16 kB each, aligned to the address spaces handled by the 8 memory channels.

edit:
fellix posted the part of the block diagram I'm talking about already some time ago.
 
It couldn't have been write-only, otherwise it wouldn't have been possible to get the readback from atomic operations ;). Actually, it was explicitly labeled as an R/W cache in the Cypress block diagrams. There were 8 sections of 16 kB each, aligned to the address spaces handled by the 8 memory channels.
Maybe we're talking about different things, but there's no read/write cache in Cypress. I was referring to the write combining cache used for non-framebuffer writes. It was read/write in R600. In Cypress, atomics were handled by the LDS/GDS.
 
I was referring to the write combining cache used for non-framebuffer writes. It was read/write in R600. In Cypress, atomics were handled by the LDS/GDS.
Atomics on global memory were handled by this cache. The GDS also has atomic operations, but they are not used for global atomics. The GDS atomics are used in OpenCL exclusively for the atomic counters, which are much faster than global atomics for exactly this reason. Cypress actually has both: write combining buffers and an R/W cache for global memory/UAVs. I edited my post above already, but I'll just steal the part of the official Cypress block diagram fellix posted 3 years ago.

[attached: excerpt of the official Cypress block diagram showing the R/W cache slices and write combining buffers]


As I said, there are eight 16 kB slices (aligned to the memory channels) of R/W cache handling the global atomics, in addition to the write combining buffers (8 times 4 kB). It's kind of hard to make good use of it, but at least it serves the purpose of providing the global atomics (I think the only justification for calling it an R/W cache is that one can access cached memory through it with atomic ops, but that is neither very convenient nor extremely fast). The R600 R/W cache basically served no purpose, which is why I called it a glorified write combining buffer. ;)
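For anyone wondering what those global atomics actually get used for, here is a minimal, purely illustrative OpenCL C kernel sketch (not taken from any real codebase): it builds a compacted output list by reserving slots with atomic_inc on a counter in global memory. On Cypress such an atomic goes through the R/W cache slices described above, whereas a dedicated atomic counter via AMD's cl_ext_atomic_counters_32 extension would be serviced by the GDS and is much faster.

Code:
/* Illustrative stream compaction with a global atomic counter.
   Each work-item that passes the predicate reserves an output slot.
   On Cypress, atomic_inc on __global memory is handled by the R/W cache;
   an atomic counter (cl_ext_atomic_counters_32) would use the GDS instead. */
__kernel void compact_positive(__global const float *in,
                               __global float       *out,
                               __global uint        *count,   /* host initialises this to 0 */
                               const uint            n)
{
    uint gid = get_global_id(0);
    if (gid < n && in[gid] > 0.0f) {
        uint slot = atomic_inc(count);   /* returns the old value = our slot */
        out[slot] = in[gid];
    }
}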
 
Well yes, it is a 512-bit bus with higher bandwidth than the fb. But if used correctly, the 32 addressable macros could also amount to fewer overall page switches in the cache. Additionally, we really don't know what kind of latency Renesas' eDRAM boasts. I think 1T-SRAM is 1 clock cycle, and the Wii U likely downclocks for Wii BC mode.
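A 512-bit interface is at least easy to put a rough number on. Here is a quick sketch, assuming the eDRAM moves 512 bits per clock at the GPU's reported 550 MHz (the actual width, clock and read/write arrangement are exactly the unknowns being discussed):

Code:
#include <stdio.h>

/* Back-of-the-envelope eDRAM bandwidth: 512 bits per clock at an assumed 550 MHz. */
int main(void)
{
    const double bus_bits  = 512.0;
    const double clock_mhz = 550.0;                              /* assumption: GPU clock */
    double gb_per_s = (bus_bits / 8.0) * clock_mhz * 1e6 / 1e9;
    printf("512-bit @ %.0f MHz -> %.1f GB/s\n", clock_mhz, gb_per_s);
    return 0;
}

That gives about 35 GB/s of raw bandwidth under those assumptions.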

8 texture maps per texture unit and 4 ROPs?

BTW, does the 360 support early Z checking? If there is a one-way path from the GPU to the ROPs, I'd guess not?
 
Could you be more specific?

I know it's a long thread, but it has been covered many times. It has strengths and weaknesses compared to the PS360. Ultimately the games themselves tell you what you need to know; if you're waiting for developers to take years to release the hidden power of the Wii U, you can stop. Games might get a bit better, but it doesn't look to have any magic packed into its 40W.
 
Atomics on global memory were handled by this cache. The GDS also has atomic operations, but they are not used for global atomics. The GDS atomics are used in OpenCL exclusively for the atomic counters, which are much faster than global atomics for exactly this reason. Cypress actually has both: write combining buffers and an R/W cache for global memory/UAVs. I edited my post above already, but I'll just steal the part of the official Cypress block diagram fellix posted 3 years ago.

[attached: excerpt of the official Cypress block diagram showing the R/W cache slices and write combining buffers]


As I said, there are eight 16 kB slices (aligned to the memory channels) of R/W cache handling the global atomics, in addition to the write combining buffers (8 times 4 kB). It's kind of hard to make good use of it, but at least it serves the purpose of providing the global atomics (I think the only justification for calling it an R/W cache is that one can access cached memory through it with atomic ops, but that is neither very convenient nor extremely fast). The R600 R/W cache basically served no purpose, which is why I called it a glorified write combining buffer. ;)
I see what you mean. I never considered that to be a cache and didn't realize it was described as such.
 
They have the 3DS, with a terrible dual-core CPU (ARM11, as in the Raspberry Pi), and the GPU is a mixed bag but at least has vertex shaders and looks more advanced than the Wii one.
http://en.wikipedia.org/wiki/PICA200

Given the acute slowness of the 3DS CPU, I hope they have been smart enough to learn to program it well and thus have some multicore programming experience.
 
They have the 3DS, with a terrible dual-core CPU (ARM11, as in the Raspberry Pi), and the GPU is a mixed bag but at least has vertex shaders and looks more advanced than the Wii one.
http://en.wikipedia.org/wiki/PICA200

Given the acute slowness of the 3DS CPU, I hope they have been smart enough to learn to program it well and thus have some multicore programming experience.

I really doubt that 3DS development can be compared to a multicore PowerPC and modern shaders.
 
Yes it supports early Z (and hierarchical Z). http://www.beyond3d.com/content/articles/4/5

Early Z or late Z, the operations are still the same. The ROPs don't know or care when it's done, they just need to have the Z part decoupled from the color part.

Thanks for the clarification. I reckon the operations are the same, but in the early Z case triangle setup should query the ROP, while in the late Z case the ROP itself can handle it.

Anyway, to clarify my line of thought: if the 360 weren't able to do it, the Wii U GPU could get away with somewhat lower rendering power, and 160 GFLOPS might be sufficient.

I know it's a long thread, but it has been covered many times. It has strengths and weaknesses compared to the PS360. Ultimately the games themselves tell you what you need to know; if you're waiting for developers to take years to release the hidden power of the Wii U, you can stop. Games might get a bit better, but it doesn't look to have any magic packed into its 40W.


Yes, and people also point out that the 40W average was measured with all USB ports hooked up, which doesn't seem like an average usage case to me. Also, games (such as Doom 3) copy vertex data from memory, animate it and write it to another location for the GPU to pick up, bumping up the bandwidth requirements.
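To illustrate the pattern being described, here is a hypothetical CPU-side animation loop (all names made up): the CPU reads the source vertices, transforms them, and writes them to a second buffer that the GPU will fetch, so every animated vertex crosses the memory bus twice per frame.

Code:
#include <stddef.h>

/* Illustrative CPU-side vertex animation: read source vertices, transform them,
   and write them to a separate streaming buffer for the GPU to pick up.
   Each vertex is both read from and written back to memory every frame,
   which is what inflates the bandwidth requirement. */
typedef struct { float x, y, z; } Vec3;

static Vec3 transform(const float m[12], Vec3 v)        /* 3x4 bone/model matrix */
{
    Vec3 r;
    r.x = m[0]*v.x + m[1]*v.y + m[2]*v.z  + m[3];
    r.y = m[4]*v.x + m[5]*v.y + m[6]*v.z  + m[7];
    r.z = m[8]*v.x + m[9]*v.y + m[10]*v.z + m[11];
    return r;
}

void animate_vertices(const Vec3 *src, Vec3 *dst, size_t count, const float m[12])
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = transform(m, src[i]);                  /* dst is what the GPU reads */
}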
 
Kingsley on whether he finds the Wii U to be more powerful than the PS3/360…

“In a word yes – it’s very capable, and we’re taking advantage of this to bring greater graphical fidelity to the Wii U version of Sniper Elite V2.”

Source: http://www.nintendolife.com/news/20...iper_elite_v2_wii_u_is_the_definitive_version
Not very constructive, sadly. On a Nintendo fansite where the developer wants to drum up interest in their new game, they're going to be in full PR mode. "Powerful" is unqualified. It'll be interesting to compare the game, though, to see what they manage. It was very much in need of better IQ on the PS3.
 
So who wants to take bets on it looking like the 360 version, possibly with a few improved textures, but a less stable frame rate?
 
Thanks for the clarification. I reckon the operations are the same, but in the early Z case triangle setup should query the ROP, while in the late Z case the ROP itself can handle it.

That's correct. The Xbox 360 has HiZ for sure, but unfortunately does not have EarlyZ. The per-pixel depth values are processed via LateZ (i.e. post-pixel-shader) directly in the ROPs.
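In rough pseudocode (an illustrative sketch, not actual hardware behaviour), the ordering difference looks like this: with HiZ + LateZ the coarse tile test happens before shading, but the exact per-pixel test only happens in the ROP after the pixel shader has run, whereas EarlyZ would do the exact test before shading.

Code:
#include <stdbool.h>

/* Illustrative sketch of the ordering difference; stubs stand in for the
   depth buffer, the pixel shader and the colour write. */
static bool  hiz_tile_test(int tile)            { (void)tile; return true; }                /* coarse, conservative */
static bool  z_test(int x, int y, float z)      { (void)x; (void)y; (void)z; return true; } /* exact per-pixel test + update */
static float run_pixel_shader(int x, int y)     { return (float)(x + y); }                  /* the expensive part */
static void  write_color(int x, int y, float c) { (void)x; (void)y; (void)c; }

/* Xenos-style HiZ + LateZ: coarse reject before shading,
   exact per-pixel Z only in the ROP after the pixel shader. */
static void shade_hiz_latez(int tile, int x, int y, float z)
{
    if (!hiz_tile_test(tile)) return;            /* whole tile rejected early             */
    float c = run_pixel_shader(x, y);            /* shader still runs for occluded pixels */
    if (z_test(x, y, z)) write_color(x, y, c);   /* exact test happens late, in the ROP   */
}

/* EarlyZ: exact per-pixel test before the shader, so occluded pixels skip shading. */
static void shade_earlyz(int x, int y, float z)
{
    if (z_test(x, y, z)) write_color(x, y, run_pixel_shader(x, y));
}

int main(void) { shade_hiz_latez(0, 1, 2, 0.5f); shade_earlyz(1, 2, 0.5f); return 0; }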
 
That's correct. The Xbox 360 has HiZ for sure, but unfortunately does not have EarlyZ. The per-pixel depth values are processed via LateZ (i.e. post-pixel-shader) directly in the ROPs.

Looks like that is the case. I figured the eDRAM logic handled color and depth separately, but I guess not. It's confusing, since a lot of stuff out there refers to the HiZ as a form of early Z (which it is, I guess).
 
With all the peeping at die shots (which has been tremendous fun) I think we might have gotten tunnel vision and be losing the "big picture". The question of "320 vs 160" shaders is still unanswered and stepping back should help us answer it.

The current popular hypothesis is that Latte is a 16:320:8 part @ 550 MHz. Fortunately, we can see how such a part runs games on the PC. You know, the PC, that inefficient beast that's held back by Windows, thick APIs, DirectX draw-call bottlenecks that break the back of even fast CPUs, and all that stuff. Here is an HD 5550, a VLIW5 GPU with a 16:320:8 configuration running @ 550 MHz:

http://www.techpowerup.com/reviews/HIS/Radeon_HD_5550/7.html

And it blows past the 360 without any problems. It's not even close. And that's despite being on the PC!

Now let's scale things back a bit. This is the Llano A8-3500M w/ Radeon 6620G - a 20:400:8 configuration GPU, but it runs @ 444 MHz, meaning it has exactly the same number of gflops and TMU ops as the HD 5550, only it's got about 20% lower triangle setup and fillrate *and* it's crippled by a 128-bit DDR3-1333 memory pool *and* it's linked to a slower CPU than the above benchmark (so more likely to suffer from Windows/DX bottlenecks). No super fast pool of edram for this poor boy!

http://www.anandtech.com/show/4444/amd-llano-notebook-review-a-series-fusion-apu-a8-3500m/11
http://www.anandtech.com/show/4444/amd-llano-notebook-review-a-series-fusion-apu-a8-3500m/12
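The "same flops, roughly 20% lower fillrate" claim is easy to check against the published specs of the two cards - a quick sketch, assuming the usual 2 FLOPs (one MADD) per SP per clock:

Code:
#include <stdio.h>

/* Peak-rate comparison of the two reference parts discussed above. */
typedef struct { const char *name; int sp, tmu, rop; double mhz; } Gpu;

int main(void)
{
    const Gpu gpus[] = {
        {"HD 5550 ", 320, 16, 8, 550.0},
        {"HD 6620G", 400, 20, 8, 444.0},
    };
    for (int i = 0; i < 2; ++i) {
        double ghz = gpus[i].mhz / 1000.0;
        printf("%s: %6.1f GFLOPS  %5.2f Gtex/s  %5.2f Gpix/s\n",
               gpus[i].name,
               gpus[i].sp  * 2 * ghz,     /* ALU: 2 FLOPs (MADD) per SP per clock */
               gpus[i].tmu * ghz,         /* texture rate                          */
               gpus[i].rop * ghz);        /* pixel fillrate                        */
    }
    return 0;
}

That works out to roughly 352 vs 355 GFLOPS and 8.8 vs 8.9 Gtex/s (effectively identical), with the 6620G's fillrate about 19% lower (4.4 vs 3.6 Gpix/s).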

And it *still* comfortably exceeds the 360 in terms of the performance that it delivers. Now let's look again at the Wii U. Does it blow past the 360? Does it even comfortably exceed the 360? No, it:

keeps
losing
marginally
to
the
Xbox
360

... and that's despite it *not* being born into the performance wheelchair that is the Windows PC ecosystem. Even if the Wii U can crawl past the 360 - marginally - in a game like Trine 2, it's still far below what we'd expect from an HD 5550 or even the slower and BW-crippled 6620G. So why is this?

It appears that there are two options. Either Latte is horrendously crippled by something (API? memory? documentation? "drivers"?) to the point that even an equivalent or less-than-equivalent PC part can bounce its ass around the field, or ... it's not actually a 16:320:8 part.

TL;DR version:
Latte seems to be either:
1) a horrendously crippled part compared to equivalent (or lower) PC GPUs, or
2) actually a rather efficient 160 shader part

Aaaaaaand I'll go with the latte(r) as the most likely option. Face it dawgs, the word on the street just don't jive with the scenes on the screens.
 
I didn't know you did comedy, function. Looking at the Call of Duty port, it is fairly clear that they took the 360 game and forced it to run on the Wii U. It runs at the same exact frame resolution, with the same AA and the same image quality. Now if it had an ALU problem vs the 360, then it wouldn't be able to produce the multiplayer game on both the TV and the GamePad in split screen, since that would require more polygon-pushing power, not less... The reality is these ports are running modified 360 code, something similar to what Vigil did to get DS2 running on the Wii U in only a few weeks.
 
So you think games are running massively more efficiently on the PC than on the Wii U?

I don't actually think the Wii U has an "ALU problem" compared to the 360. I think it gets by okay compared to the PS360, especially considering the power draw. I think the "problem" is that, in a bizarre feat of back-pedalling and entrenchment, the 360 - a seven year old system - is seen as some kind of benchmark for "console power" for a brand new $350 / £300 games console pimped by Nintendo as being some kind of 3rd party dream box.
 