Wii U hardware discussion and investigation *rename

No array density? There's still overhead in between the cells when constructing the arrays. For example, the TSMC 40nm eDRAM cell size is ~0.0583um^2 (right in the same ballpark as your 40nm figure), whereas the macro arrays (1Mb) in the literature are 0.145mm^2. That's nearly 2.4x bloat in this case (you'd expect something closer to 0.0611mm^2 for 1024x1024 bits), though I've seen as low as 2x bloat depending on who's manufacturing, the performance targets, etc.
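To make the arithmetic explicit, here's a quick sketch using only the figures quoted above (cell size, 1Mb macro, reported macro area):

```python
# Rough eDRAM array-overhead check, using the numbers from this post
cell_area_um2 = 0.0583          # TSMC 40nm eDRAM cell, um^2
bits_per_macro = 1024 * 1024    # 1 Mb macro

ideal_macro_mm2 = cell_area_um2 * bits_per_macro / 1e6   # um^2 -> mm^2
reported_macro_mm2 = 0.145                                # 1 Mb macro from the literature
bloat = reported_macro_mm2 / ideal_macro_mm2

print(f"ideal macro area : {ideal_macro_mm2:.4f} mm^2")   # ~0.0611 mm^2
print(f"reported macro   : {reported_macro_mm2:.3f} mm^2")
print(f"overhead factor  : {bloat:.2f}x")                 # ~2.4x
```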

The array sizes I calculated above were for a group of yellow tiles, not just a single one btw, so there's still spacing in between those tiles. Hope that makes sense. :oops:

Makes perfect sense. I confused your numbers with the post by fellix, which subtracted the overhead and still came to about 32 mm^2. Could sense amps account for the difference?
 
So, based on what I'm seeing here, either the GPU is 40nm with 320 SPs, or each SIMD has fewer than 40 SPs (I would assume most of you are thinking 20 per block, meaning 160 SPs) and it's 55nm, or the guy who did the x-ray of the GPU is misidentifying the SIMD blocks or is at least mistaken about something in the layout.

Before I continue, I want to note that I promise not to argue. I know that you guys are the experts here. So, here are my questions:
a.) How would 160 SPs at ~550MHz compare to Xenos?
b.) What is the benefit (cost or otherwise) of cutting down the number of SPs per block vs just cutting the number of blocks? Would it not be cheaper to just use fewer SIMDs?
c.) If we assume that it actually has 320 SPs, is it possible that Nintendo locked out part of the GPU, either reserving it for some other purpose or to increase yields?

My opinion is that it has 320 SPs, and those of you who assumed it had to be fewer either overestimated the power of the RV730 or underestimated the power of Xenos relative to the RV730.

EDIT: By the way, I don't think the 32MB eDRAM was ever confirmed. Correct me if I'm wrong.
 
Regarding that slow transparencies business

A few interesting things have happened recently. First marcan posted this:

https://twitter.com/marcan42/status/298922364740190208
marcan said:
Oh, and for those who claim it's not a Radeon-like design: http://www.marcansoft.com/paste/Kq0OLb0X.txt … . R6xx. Register names match AMD ones too.

Then on NeoGAF a poster called Popstar posted this:

http://www.neogaf.com/forum/showpost.php?p=47409504&postcount=3275
Popstar said:
*Random thinking out loud probably not related to the actual Wii U GPU*

If you have all that memory embedded right on the GPU and accessible to the shader units with low latency, do you need conventional ROP hardware at all? Or can you just do blending in the shader like a PowerVR / Tegra chip? Perhaps with mini-rops for Z / stencil test?

And that got me remembering that the R6xx series had something funny going on with the ROPs. I think the shaders had to be involved in the MSAA resolve. So I was thinking: what if, on the Wii U, MSAA resolve and also some transparency operations were shader based instead of ROP based?

The low edram bandwidth hypothesis came about because of things like:

- Poor frame rate on MSAA-enabled COD:BLOPS 2, and awful dips when there's lots of transparent overdraw
- Trees removed from Darksiders 2
- Foliage removed from Tekken, and motion blur stage removed


But if, unlike the 360 where alpha blends and MSAA resolve are done on the ROPs, these operations were done on shaders, then there could be a big impact that looked like a ROP/bandwidth issue. Sampling from a 24-bit render target for a blend would hit the texture cache hard and evict many times the number of S3TC compressed texels, and that could starve the ALUs of texture data and bring the GPU to its knees, looking all the while like a ROP / BW issue. And trees of the kind removed from the Wii U Darksiders 2 could have many layers of transparently textured geometry to draw...
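To put a rough number on that cache-pressure idea (purely illustrative; assuming DXT1 at 4 bits/texel, and that the "24-bit" colour target is stored as 32 bits per pixel, as is typical):

```python
# Rough texture-cache footprint of reading the destination colour for a "shader blend"
dxt1_bits_per_texel   = 4    # S3TC/DXT1 compressed texture data
target_bits_per_pixel = 32   # 24-bit colour is normally stored as a 32-bit pixel

# Each destination pixel pulled back through the texture cache occupies roughly
# as much cache as this many DXT1 texels that could otherwise live there:
print(target_bits_per_pixel / dxt1_bits_per_texel, "DXT1 texels displaced per blended pixel")  # 8.0
```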

AlStrong said that Z and colour were functionally separate parts of the ROP and that they could be separate units (I think he said this, can't find the quote right now). So perhaps the Wii U could have fast Z test/write on dedicated units but not colour. ERP said that fast PC GPUs frequently had idling ALUs because they were texture fetch bound; perhaps transparency and/or MSAA make this an even bigger problem for the Wii U too?

If you were running games designed for a 16 TMU Xbox 360 with smart ROPs, and suddenly had to port them to a 16 TMU Wii U that handled some or all transparency via shaders, then perhaps that could cause some of the Wii U port performance / modification issues that we've seen.

... perhaps?
 
Good point there, function. I was also reading up a bit on R600 architecture after seeing Marcan's post. Something that struck me was that R600 SPUs were practically identical to their R700 counterparts but less dense on the same process. Hmmm.
 
Good point there, function. I was also reading up a bit on R600 architecture after seeing Marcan's post. Something that struck me was that R600 SPUs were practically identical to their R700 counterparts but less dense on the same process. Hmmm.
I think it's unlikely that they used R600, but it's plausible. Why port R600 onto 40nm and have it be as big, if not bigger, than R700 on 55nm?
 
But if, unlike the 360 where alpha blends and MSAA resolve are done on the ROPs, these operations were done on shaders, then there could be a big impact that looked like a ROP/bandwidth issue. Sampling from a 24-bit render target for a blend would hit the texture cache hard and evict many times the number of S3TC compressed texels, and that could starve the ALUs of texture data and bring the GPU to its knees, looking all the while like a ROP / BW issue. And trees of the kind removed from the Wii U Darksiders 2 could have many layers of transparently textured geometry to draw...

I don't think AMD is going to design a GPU that has no blending support at all. There are platforms where blending is done in the shader, but in that case the shaders have special load access to the render target, so they wouldn't have to go through the texture cache. If it did go through the texture cache, the cache would see worse pressure, but the impact on eDRAM or DDR3 bandwidth wouldn't change unless the eDRAM can't store textures at all.

I hear on NeoGAF that Nintendoland has tons of blending but does okay. Is it possible that the problem arises with translucency handled in a deferred engine? I'm not up to date on techniques for this but I know you have to handle it in a special way. And it's possible Nintendoland isn't using a deferred engine; Nintendo could lack experience with such a thing since it wouldn't have been suitable on Wii.

Note that translucency does reduce the amount of occlusion in the scene, meaning that, regardless of how the blending itself is handled, the overall pressure on shading, including texture fetch bandwidth, is going to be higher than if those fragments were opaque (and early-Z were doing a good job). A game could, in theory, be texture bandwidth limited on the Wii U where it wasn't on Xbox 360 or PS3, due to having worse main memory bandwidth (if all the textures are coming from main memory), and in those cases increased pressure would make things worse.
 
Something that struck me was that R600 SPUs were practically identical to their R700 counterparts
AMD actually added a barrel shifter to each ALU. In R600, only the t (transcendental) ALU had one. The R700 generation could do shifts in all 5 ALUs. That was the beginning of AMD's complete dominance in crypto algorithms that lean heavily on bit shifts.
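Purely as an illustration of the kind of operation we're talking about (a generic 32-bit rotate, the bread and butter of MD5/SHA-style hashing; with a barrel shifter in every ALU, each of these maps to a single-cycle op):

```python
def rotl32(x: int, n: int) -> int:
    """32-bit rotate left: one barrel-shift on hardware that has it,
    several shift/OR operations otherwise."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

print(hex(rotl32(0x12345678, 5)))  # 0x468acf02
```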
 
So, based on what I'm seeing here, either the GPU is 40nm with 320 SPs, or each SIMD has fewer than 40 SPs (I would assume most of you are thinking 20 per block, meaning 160 SPs) and it's 55nm, or the guy who did the x-ray of the GPU is misidentifying the SIMD blocks or is at least mistaken about something in the layout.

Before I continue, I want to note that I promise not to argue. I know that you guys are the experts here. So, here are my questions:
a.) How would 160 SPs at ~550MHz compare to Xenos?
b.) What is the benefit (cost or otherwise) of cutting down the number of SPs per block vs just cutting the number of blocks? Would it not be cheaper to just use fewer SIMDs?
c.) If we assume that it actually has 320 SPs, is it possible that Nintendo locked out part of the GPU, either reserving it for some other purpose or to increase yields?

My opinion is that it has 320 SPs, and those of you who assumed it had to be fewer either overestimated the power of the RV730 or underestimated the power of Xenos relative to the RV730.

EDIT: By the way, I don't think the 32MB eDRAM was ever confirmed. Correct me if I'm wrong.

Are people assuming it's 160 SPs? I thought DF confirmed it was 320 SPs. I think the chances of 160 SPs are very slim, but then again I'm just reading this thread and having a hard time figuring out what they are even talking about :/ But if it did have 160 SPs, the 360 would easily be the more powerful system.
 
Are people assuming it's 160 SPs? I thought DF confirmed it was 320 SPs. I think the chances of 160 SPs are very slim, but then again I'm just reading this thread and having a hard time figuring out what they are even talking about :/ But if it did have 160 SPs, the 360 would easily be the more powerful system.

Last time I was here, people kept saying that it's most likely a cut-down RV730 with either 160 or 240 SPs (depending on who you ask). Before the launch, several people here even claimed that it would have 80 SPs. Their reasoning was the power consumption.

On the current topic, the R600 theory does clear a lot up, but the thing is that literally every reliable rumor and leak about the GPU points to R700. I think low eDRAM bandwidth makes more sense.
 
Good point there, function. I was also reading up a bit on R600 architecture after seeing Marcan's post. Something that struck me was that R600 SPUs were practically identical to their R700 counterparts but less dense on the same process. Hmmm.

Ooh, now that's interesting! Do you have a die shot of R600 or any measurements of the shader blocks? Would be interested in any 65nm Radeon stuff too.

As usual I have a hypothesis but I don't want to go in unarmed and get mugged! :p

Edit: Too late! I can't help myself!

I think it's unlikely that they used R600, but it's plausible. Why port R600 onto 40nm and have it be as big, if not bigger, than R700 on 55nm?

You'd do it if your original design path began back when R600/R610/R630 were current designs.

If R600 (or R6xx) is indeed the basis of the Wii U graphics technology, then it would mean that, just as the Wii was an evolution of the GC, the Wii U is an evolution of a console that was originally planned to arrive around 2006/2007, either instead of the Wii or as a backup plan in case the unknown risk that was the Wii tanked.

Perhaps there was enough work done on a GameCube-backwards-compatible, R600-based platform that it was worth reigniting the project as the Wii lifecycle progressed. It might explain the shader block scaling. Heck, it might even explain the edram BW - if indeed that is an issue at all - if the edram was originally planned to be connected via an off-chip bus.

All this is assuming R6xx is indeed the technology base, of course. Which we don't know to be true yet.
 
Are people assuming it's 160 SPs? I thought DF confirmed it was 320 SPs. I think the chances of 160 SPs are very slim, but then again I'm just reading this thread and having a hard time figuring out what they are even talking about :/ But if it did have 160 SPs, the 360 would easily be the more powerful system.

Not necessarily. A more modern DX10/DX11 architecture would probably easily squeeze out more performance than the ancient Xenos, even at a lower flop count. The 32 MB of eDRAM would go a really long way toward helping performance as well.
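Back-of-envelope only, using the commonly cited MADD = 2 flops per lane per clock, the usual 240 GFLOPS figure for Xenos (48 x 5-wide ALUs at 500MHz), and the rumoured ~550MHz Wii U GPU clock:

```python
# Back-of-envelope GFLOPS comparison (MADD counted as 2 flops per lane per clock)
def gflops(lanes: int, clock_mhz: float) -> float:
    return lanes * 2 * clock_mhz / 1000.0

print(f"Xenos: 48 x 5-wide ALUs @ 500MHz -> {gflops(48 * 5, 500):.0f} GFLOPS")   # 240
print(f"Wii U, 160 SPs @ ~550MHz         -> {gflops(160, 550):.0f} GFLOPS")      # 176
print(f"Wii U, 320 SPs @ ~550MHz         -> {gflops(320, 550):.0f} GFLOPS")      # 352
```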
 
Not necessarily. A more modern DX10/DX11 architecture would probably easily squeeze out more performance than the ancient Xenos, even at a lower flop count. The 32 MB of eDRAM would go a really long way toward helping performance as well.

What about the limited bandwidth and weak CPU?
 
I don't know about this new supposed R600 connection.... But I do recall Wii HD rumors shortly into Wii's lifetime.

R600 and friends have characteristics like the ring bus and unique, big ROPs. That would be recognizable, I would think.
 
I don't think AMD is going to design a GPU that has no blending support at all. There are platforms where blending is done in the shader, but in that case the shaders have special load access to the render target, so they wouldn't have to go through the texture cache. If it did go through the texture cache, the cache would see worse pressure, but the impact on eDRAM or DDR3 bandwidth wouldn't change unless the eDRAM can't store textures at all.

Texture cache and TMU pressure are the basis of the supposition, in particular for games designed primarily for the 360 with its smart edram, but yeah, I get that the total read/write for the memory should be the same.

I hear on NeoGAF that Nintendoland has tons of blending but does okay. Is it possible that the problem arises with translucency handled in a deferred engine? I'm not up to date on techniques for this but I know you have to handle it in a special way. And it's possible Nintendoland isn't using a deferred engine; Nintendo could lack experience with such a thing since it wouldn't have been suitable on Wii.

My thought regarding this is that Nintendoland probably isn't pushing as many texture layers (or things like hi-res bump maps) as AAA Xbox 360 ports, and so if the TMUs and texture cache are indeed limiting factors in blending ops, then Nintendoland might have a much easier time handling transparencies.

I haven't actually played Nintendoland, though, so I'm kind of going out on a guess-based limb here. :eek:

Note that translucency does reduce the amount of occlusion in the scene, meaning that, regardless of how the blending itself is handled, the overall pressure on shading, including texture fetch bandwidth, is going to be higher than if those fragments were opaque (and early-Z were doing a good job). A game could, in theory, be texture bandwidth limited on the Wii U where it wasn't on Xbox 360 or PS3, due to having worse main memory bandwidth (if all the textures are coming from main memory), and in those cases increased pressure would make things worse.

Yeah, you're absolutely right of course. It could be down to that and/or edram BW rather than shader-based transparency, but with the "R600" output from one of marcan's tests and the old story about "broken" ROPs on the R6xx Radeons, it's fun to speculate. :D
 
Not necessarily. A more modern DX10/DX11 architecture would probably easily squeeze out more performance than the ancient Xenos, even at a lower flop count. The 32 MB of eDRAM would go a really long way toward helping performance as well.

And then there's the higher clock, and the likelihood that, with 8 ROPs and 16 TMUs serving fewer ALUs, they should be less likely to stall due to outside bottlenecks. The 32 MB of edram would save on the tiling and copy-out overheads too, which can be anywhere between 2 and 30% based on dev comments.
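For a rough sense of why 32 MB avoids tiling where 10 MB doesn't, here's a simple sizing sketch, assuming a 720p target with 32-bit colour plus 32-bit depth/stencil (8 bytes per sample); actual formats obviously vary per game:

```python
# Does a 720p render target fit in eDRAM without tiling? (rough sizing sketch)
def target_mib(width: int, height: int, msaa: int, bytes_per_sample: int = 8) -> float:
    # 8 bytes per sample = 4 bytes colour + 4 bytes depth/stencil
    return width * height * msaa * bytes_per_sample / (1024 ** 2)

for samples in (1, 2, 4):
    size = target_mib(1280, 720, samples)
    print(f"720p {samples}x MSAA: {size:5.1f} MiB | "
          f"fits in 32 MB: {size <= 32} | fits in 10 MB (360): {size <= 10}")
```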
 
And then there's the higher clock, and the likelihood that, with 8 ROPs and 16 TMUs serving fewer ALUs, they should be less likely to stall due to outside bottlenecks. The 32 MB of edram would save on the tiling and copy-out overheads too, which can be anywhere between 2 and 30% based on dev comments.

I don't understand. I thought the general idea here was that the Wii U is on par with current gen with 320 SPs; now if it has 160 SPs, wouldn't that make it weaker than current gen, or is SP count not all that important?
 
Not necessarily. A more modern DX10/DX11 architecture would probably easily squeeze out more performance than the ancient Xenos, even at a lower flop count. The 32 MB of eDRAM would go a really long way toward helping performance as well.

Is this based on anything beyond theory?
 
I don't know about this new supposed R600 connection.... But I do recall Wii HD rumors shortly into Wii's lifetime.

R600 and friends have characteristics like the ring bus and unique, big ROPs. That would be recognizable, I would think.

IIRC the low-end HD2400/HD3400 series used a conventional 64-bit crossbar rather than a ring bus, as it was simpler and used less power. The ring bus was not integral to the R6xx architecture.
 
Isn't R600 what's inside the 360?

I still think the 55nm theory is plausible. The SIMD components match Brazos almost perfectly, except for being 30% larger. We just need to see what the eDRAM density really is and whether it's possible.
 