Wii U hardware discussion and investigation *rename

No array density? There's still overhead in between the cells when constructing the arrays. For example, the TSMC 40nm eDRAM cell size is ~0.0583um^2 (right in the same ballpark as your 40nm figure), whereas the macro arrays (1Mb) in the literature are 0.145mm^2. That's nearly 2.4x bloat in this case (you'd expect something closer to 0.0611mm^2 for 1024x1024 bits), though I've seen as low as 2x bloat depending on who's manufacturing, the performance targets, etc.
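To make the arithmetic explicit, here's a quick sketch using only the figures quoted above (cell size, 1Mb macro, reported macro area):

```python
# Rough eDRAM array-overhead check, using the numbers from this post
cell_area_um2 = 0.0583          # TSMC 40nm eDRAM cell, um^2
bits_per_macro = 1024 * 1024    # 1 Mb macro

ideal_macro_mm2 = cell_area_um2 * bits_per_macro / 1e6   # um^2 -> mm^2
reported_macro_mm2 = 0.145                                # 1 Mb macro from the literature
bloat = reported_macro_mm2 / ideal_macro_mm2

print(f"ideal macro area : {ideal_macro_mm2:.4f} mm^2")   # ~0.0611 mm^2
print(f"reported macro   : {reported_macro_mm2:.3f} mm^2")
print(f"overhead factor  : {bloat:.2f}x")                 # ~2.4x
```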

The array sizes I calculated above were for a group of yellow tiles, not just a single one btw, so there's still spacing in between those tiles. Hope that makes sense. :oops:

Makes perfect sense. I confused your numbers with the post by fellix, which subtracted the overhead and still came to about 32 mm^2. Could sense amps account for the difference?
 
So, based on what I'm seeing here, either the GPU is 40nm with 320 SPs, or each SIMD has fewer than 40 SPs (I would assume most of you are thinking 20 per block, meaning 160 SPs) and it's 55nm, or the guy who did the x-ray of the GPU is misidentifying the SIMD blocks or is at least mistaken about something in the layout.

Before I continue, I want to note that I promise not to argue. I know that you guys are the experts here. So, here are my questions:
a.) How would 160 SPs at ~550MHz compare to Xenos?
b.) What is the benefit (cost or otherwise) of cutting down the number of SPs per block vs just cutting the number of blocks? Would it not be cheaper to just use fewer SIMDs?
c.) If we assume that it actually has 320 SPs, is it possible that Nintendo locked out part of the GPU, either reserving it for some other purpose or to increase yields?

My opinion is that it has 320 SPs, and those of you who assumed it had to be fewer either overestimated the power of the RV730 or underestimated the power of Xenos relative to the RV730.

EDIT: By the way, I don't think the 32MB eDRAM was ever confirmed. Correct me if I'm wrong.
 
Regarding that slow transparencies business

A few interesting things have happened recently. First marcan posted this:

https://twitter.com/marcan42/status/298922364740190208
marcan said:
Oh, and for those who claim it's not a Radeon-like design: http://www.marcansoft.com/paste/Kq0OLb0X.txt … . R6xx. Register names match AMD ones too.

Then on NeoGAF a poster called Popstar posted this:

http://www.neogaf.com/forum/showpost.php?p=47409504&postcount=3275
Popstar said:
*Random thinking out loud probably not related to the actual Wii U GPU*

If you have all that memory embedded right on the GPU and accessible to the shader units with low latency, do you need conventional ROP hardware at all? Or can you just do blending in the shader like a PowerVR / Tegra chip? Perhaps with mini-rops for Z / stencil test?

And that got me remembering that the R6xx series had something funny going on with the ROPs. I think the shaders had to be involved in the MSAA resolve. So I was thinking: what if, on the Wii U, MSAA resolve and also some transparency operations were shader based instead of ROP based?

The low edram bandwidth hypothesis came about because of things like:

- Poor frame rate on MSAA-enabled COD:BLOPS 2, and awful dips when there's lots of transparent overdraw
- Trees removed from Darksiders 2
- Foliage removed from Tekken, and motion blur stage removed


But if, unlike the 360 where alpha blends and MSAA resolve are done on the ROPs, these operations were done on shaders, then there could be a big impact that looked like a ROP/bandwidth issue. Sampling from a 24-bit render target for a blend would hit the texture cache hard and evict many times the number of S3TC compressed texels, and that could starve the ALUs of texture data and bring the GPU to its knees, looking all the while like a ROP / BW issue. And trees of the kind removed from the Wii U Darksiders 2 could have many layers of transparently textured geometry to draw...
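To put a rough number on that cache-pressure idea (purely illustrative; assuming DXT1 at 4 bits/texel, and that the "24-bit" colour target is stored as 32 bits per pixel, as is typical):

```python
# Rough texture-cache footprint of reading the destination colour for a "shader blend"
dxt1_bits_per_texel   = 4    # S3TC/DXT1 compressed texture data
target_bits_per_pixel = 32   # 24-bit colour is normally stored as a 32-bit pixel

# Each destination pixel pulled back through the texture cache occupies roughly
# as much cache as this many DXT1 texels that could otherwise live there:
print(target_bits_per_pixel / dxt1_bits_per_texel, "DXT1 texels displaced per blended pixel")  # 8.0
```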

AlStrong said that Z and colour were functionally separate parts of the ROP and that they could be separate units (I think he said this, can't find the quote right now). So perhaps the Wii U could have fast Z test/write on dedicated units but not colour. ERP said that fast PC GPUs frequently had idling ALUs because they were texture fetch bound; perhaps transparency and/or MSAA make this an even bigger problem for the Wii U too?

If you were running games designed for a 16 TMU Xbox 360 with smart ROPs, and suddenly had to port them to a 16 TMU Wii U that handled some or all transparency via shaders, then perhaps that could cause some of the Wii U port performance / modification issues that we've seen.

... perhaps?
 
Good point there, function. I was also reading up a bit on R600 architecture after seeing Marcan's post. Something that struck me was that R600 SPUs were practically identical to their R700 counterparts but less dense on the same process. Hmmm.
 
Good point there, function. I was also reading up a bit on R600 architecture after seeing Marcan's post. Something that struck me was that R600 SPUs were practically identical to their R700 counterparts but less dense on the same process. Hmmm.
I think it's unlikely that they used R600, but it's plausible. Why port R600 onto 40nm and have it be as big, if not bigger, than R700 on 55nm?
 
But if, unlike the 360 where alpha blends and MSAA resolve are done on the ROPs, these operations were done on shaders, then there could be a big impact that looked like a ROP/bandwidth issue. Sampling from a 24-bit render target for a blend would hit the texture cache hard and evict many times the number of S3TC compressed texels, and that could starve the ALUs of texture data and bring the GPU to its knees, looking all the while like a ROP / BW issue. And trees of the kind removed from the Wii U Darksiders 2 could have many layers of transparently textured geometry to draw...

I don't think AMD is going to design a GPU that has no blending support at all. There are platforms where blending is done in the shader, but in that case the shaders have special load access to the render target, so they wouldn't have to go through the texture cache. If it did go through the texture cache, the cache would see worse pressure, but the impact on eDRAM or DDR3 bandwidth wouldn't change unless the eDRAM can't store textures at all.

I hear on NeoGAF that Nintendoland has tons of blending but does okay. Is it possible that the problem arises with translucency handled in a deferred engine? I'm not up to date on techniques for this but I know you have to handle it in a special way. And it's possible Nintendoland isn't using a deferred engine; Nintendo could lack experience with such a thing since it wouldn't have been suitable on Wii.

Note that translucency does reduce the amount of occlusion in the scene, meaning that, regardless of how the blending itself is handled, the overall pressure on shading, including texture fetch bandwidth, is going to be higher than if those fragments were opaque (and early-Z were doing a good job). A game could, in theory, be texture bandwidth limited on the Wii U where it wasn't on Xbox 360 or PS3, due to having worse main memory bandwidth (if all the textures are coming from main memory), and in those cases increased pressure would make things worse.
 
Something that struck me was that R600 SPUs were practically identical to their R700 counterparts
AMD actually added a barrel shifter to each ALU. In R600, only the t (transcendental) ALU had one. The R700 generation could do shifts in all 5 ALUs. That was the beginning of AMD's complete dominance in crypto algorithms that lean heavily on bit shifts.
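Purely as an illustration of the kind of operation we're talking about (a generic 32-bit rotate, the bread and butter of MD5/SHA-style hashing; with a barrel shifter in every ALU, each of these maps to a single-cycle op):

```python
def rotl32(x: int, n: int) -> int:
    """32-bit rotate left: one barrel-shift on hardware that has it,
    several shift/OR operations otherwise."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

print(hex(rotl32(0x12345678, 5)))  # 0x468acf02
```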
 
So, based on what I'm seeing here, either the GPU is 40nm with 320 SPs, or each SIMD has fewer than 40 SPs (I would assume most of you are thinking 20 per block, meaning 160 SPs) and it's 55nm, or the guy who did the x-ray of the GPU is misidentifying the SIMD blocks or is at least mistaken about something in the layout.

Before I continue, I want to note that I promise not to argue. I know that you guys are the experts here. So, here are my questions:
a.) How would 160 SPs at ~550MHz compare to Xenos?
b.) What is the benefit (cost or otherwise) of cutting down the number of SPs per block vs just cutting the number of blocks? Would it not be cheaper to just use fewer SIMDs?
c.) If we assume that it actually has 320 SPs, is it possible that Nintendo locked out part of the GPU, either reserving it for some other purpose or to increase yields?

My opinion is that it has 320 SPs, and those of you who assumed it had to be fewer either overestimated the power of the RV730 or underestimated the power of Xenos relative to the RV730.

EDIT: By the way, I don't think the 32MB eDRAM was ever confirmed. Correct me if I'm wrong.

Are people assuming it's 160 SPs? I thought DF confirmed it was 320 SPs. I think the chances of 160 SPs are very slim, but then again I'm just reading this thread and having a hard time figuring out what they are even talking about :/ But if it did have 160 SPs, the 360 would easily be the more powerful system.
 
Are people assuming it's 160 SPs? I thought DF confirmed it was 320 SPs. I think the chances of 160 SPs are very slim, but then again I'm just reading this thread and having a hard time figuring out what they are even talking about :/ But if it did have 160 SPs, the 360 would easily be the more powerful system.

Last time I was here, people kept saying that it's most likely a cut-down RV730 with either 160 or 240 SPs (depending on who you ask). Before the launch, several people here even claimed that it would have 80 SPs. Their reasoning was the power consumption.

On the current topic, the R600 theory does clear a lot up, but the thing is that literally every reliable rumor and leak about the GPU points to R700. I think low eDRAM bandwidth makes more sense.
 
Good point there, function. I was also reading up a bit on R600 architecture after seeing Marcan's post. Something that struck me was that R600 SPUs were practically identical to their R700 counterparts but less dense on the same process. Hmmm.

Ooh, now that's interesting! Do you have a die shot of R600 or any measurements of the shader blocks? Would be interested in any 65nm Radeon stuff too.

As usual I have a hypothesis but I don't want to go in unarmed and get mugged! :p

Edit: Too late! I can't help myself!

I think it's unlikely that they used R600, but it's plausible. Why port R600 onto 40nm and have it be as big, if not bigger, than R700 on 55nm?

You'd do it if your original design path began back when R600/R610/R630 were current designs.

If R600 (or R6xx) is indeed the basis of the Wii U graphics technology, then it would mean that, just as the Wii was an evolution of the GC, the Wii U is an evolution of a console that was originally planned to arrive around 2006/2007, either instead of the Wii or as a backup plan in case the unknown risk that was the Wii tanked.

Perhaps there was enough work done on a GameCube-backwards-compatible, R600-based platform that it was worth reigniting the project as the Wii lifecycle progressed. It might explain the shader block scaling. Heck, it might even explain the edram BW - if indeed that is an issue at all - if the edram was originally planned to be connected via an off-chip bus.

All this is assuming R6xx is indeed the technology base, of course. Which we don't know to be true yet.
 
Are people assuming it's 160 SPs? I thought DF confirmed it was 320 SPs. I think the chances of 160 SPs are very slim, but then again I'm just reading this thread and having a hard time figuring out what they are even talking about :/ But if it did have 160 SPs, the 360 would easily be the more powerful system.

Not necessarily. A more modern DX10/DX11 architecture would probably easily squeeze out more performance than the ancient Xenos, even at a lower flop count. The 32 MB of eDRAM would go a really long way toward helping performance as well.
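Back-of-envelope only, using the commonly cited MADD = 2 flops per lane per clock, the usual 240 GFLOPS figure for Xenos (48 x 5-wide ALUs at 500MHz), and the rumoured ~550MHz Wii U GPU clock:

```python
# Back-of-envelope GFLOPS comparison (MADD counted as 2 flops per lane per clock)
def gflops(lanes: int, clock_mhz: float) -> float:
    return lanes * 2 * clock_mhz / 1000.0

print(f"Xenos: 48 x 5-wide ALUs @ 500MHz -> {gflops(48 * 5, 500):.0f} GFLOPS")   # 240
print(f"Wii U, 160 SPs @ ~550MHz         -> {gflops(160, 550):.0f} GFLOPS")      # 176
print(f"Wii U, 320 SPs @ ~550MHz         -> {gflops(320, 550):.0f} GFLOPS")      # 352
```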
 
Not necessarily. A more modern DX10/DX11 architecture would probably easily squeeze out more performance than the ancient Xenos, even at a lower flop count. The 32 MB of eDRAM would go a really long way toward helping performance as well.

What about the limited bandwidth and weak CPU?
 
I don't know about this new supposed R600 connection.... But I do recall Wii HD rumors shortly into Wii's lifetime.

R600 and friends have characteristics like the ring bus and unique, big ROPs. That would be recognizable, I would think.
 
I don't think AMD is going to design a GPU that has no blending support at all. There are platforms where blending is done in the shader, but in that case the shaders have special load access to the render target, so they wouldn't have to go through the texture cache. If it did go through the texture cache, the cache would see worse pressure, but the impact on eDRAM or DDR3 bandwidth wouldn't change unless the eDRAM can't store textures at all.

Texture cache and TMU pressure are the basis of the supposition, in particular for games designed primarily for the 360 with its smart edram, but yeah, I get that the total read/write for the memory should be the same.

I hear on NeoGAF that Nintendoland has tons of blending but does okay. Is it possible that the problem arises with translucency handled in a deferred engine? I'm not up to date on techniques for this but I know you have to handle it in a special way. And it's possible Nintendoland isn't using a deferred engine; Nintendo could lack experience with such a thing since it wouldn't have been suitable on Wii.

My thought regarding this is that Nintendoland probably isn't pushing as many texture layers (or things like hi-res bump maps) as AAA Xbox 360 ports, and so if the TMUs and texture cache are indeed limiting factors in blending ops, then Nintendoland might have a much easier time handling transparencies.

I haven't actually played Nintendoland, though, so I'm kind of going out on a guess-based limb here. :eek:

Note that translucency does reduce the amount of occlusion in the scene, meaning that, regardless of how the blending itself is handled, the overall pressure on shading, including texture fetch bandwidth, is going to be higher than if those fragments were opaque (and early-Z were doing a good job). A game could, in theory, be texture bandwidth limited on the Wii U where it wasn't on Xbox 360 or PS3, due to having worse main memory bandwidth (if all the textures are coming from main memory), and in those cases increased pressure would make things worse.

Yeah, you're absolutely right of course. It could be down to that and/or edram BW rather than shader-based transparency, but with the "R600" output from one of marcan's tests and the old story about "broken" ROPs on the R6xx Radeons, it's fun to speculate. :D
 
Not necessarily. A more modern DX10/DX11 architecture would probably easily squeeze out more performance than the ancient Xenos, even at a lower flop count. The 32 MB of eDRAM would go a really long way toward helping performance as well.

And then there's the higher clock, and the likelihood that, with 8 ROPs and 16 TMUs serving fewer ALUs, they should be less likely to stall due to outside bottlenecks. The 32 MB of edram would save on the tiling and copy-out overheads too, which can be anywhere between 2 and 30% based on dev comments.
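For a rough sense of why 32 MB avoids tiling where 10 MB doesn't, here's a simple sizing sketch, assuming a 720p target with 32-bit colour plus 32-bit depth/stencil (8 bytes per sample); actual formats obviously vary per game:

```python
# Does a 720p render target fit in eDRAM without tiling? (rough sizing sketch)
def target_mib(width: int, height: int, msaa: int, bytes_per_sample: int = 8) -> float:
    # 8 bytes per sample = 4 bytes colour + 4 bytes depth/stencil
    return width * height * msaa * bytes_per_sample / (1024 ** 2)

for samples in (1, 2, 4):
    size = target_mib(1280, 720, samples)
    print(f"720p {samples}x MSAA: {size:5.1f} MiB | "
          f"fits in 32 MB: {size <= 32} | fits in 10 MB (360): {size <= 10}")
```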
 
And then there's the higher clock, and the likelihood that, with 8 ROPs and 16 TMUs serving fewer ALUs, they should be less likely to stall due to outside bottlenecks. The 32 MB of edram would save on the tiling and copy-out overheads too, which can be anywhere between 2 and 30% based on dev comments.

I don't understand. I thought the general idea here was that the Wii U is on par with current gen with 320 SPs; now if it has 160 SPs, wouldn't that make it weaker than current gen, or is SP count not all that important?
 
Not necessarily. A more modern DX10/DX11 architecture would probably easily squeeze out more performance than the ancient Xenos, even at a lower flop count. The 32 MB of eDRAM would go a really long way toward helping performance as well.

Is this based on anything beyond theory?
 
I don't know about this new supposed R600 connection.... But I do recall Wii HD rumors shortly into Wii's lifetime.

R600 and friends have characteristics like the ring bus and unique, big ROPs. That would be recognizable, I would think.

IIRC the low-end HD2400/HD3400 series used a conventional 64-bit crossbar rather than a ring bus, as it was simpler and used less power. The ring bus was not integral to the R6xx architecture.
 
Isn't R600 what's inside the 360?

I still think the 55nm theory is plausible. The SIMD components match Brazos almost perfectly, except for being 30% larger. We just need to see what the eDRAM density really is and whether it's possible.
 