Wii U hardware discussion and investigation *rename

Grall · Jan 27, 2013

DRS said:
Thinking a bit about caches, I kind of wonder what a texture cache's efficiency is and how much a miss actually costs.

Texture caches themselves are very efficient as texturing access is typically quite regular and predictable (barring indirect access through a pixel shader using say, a bumpmap to offset the final texels being looked up, in which case access patterns can become majorly chaotic). However you still need to put stuff into the texture cache first before you can read from it, so basically every texturing operation involves a stall, which can last 1000+ clock cycles. So to get around that, GPUs juggle thousands of these pixel (and vertex) job threads, most of which are bound to be stalled while waiting for data at any one time. When you can have enough of these threads flying around at any one time, theoretically enough data should have trickled in so you have something to do all the time and can keep the hardware busy that way.
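As a rough illustration of the latency-hiding argument above, here is a minimal sketch of how many threads would need to be in flight to cover a stall. The 1000-cycle stall is the figure from this post; the amount of work each thread does between texture fetches (`work_cycles_per_thread`) is a made-up number for illustration, and real schedulers are far more complicated than this.

```python
import math


def threads_to_hide_latency(stall_cycles: int, work_cycles_per_thread: int) -> int:
    """Estimate the threads needed so that one is always ready to run.

    While one thread waits stall_cycles for its texels, the remaining
    threads must supply at least stall_cycles of work between them.
    """
    if work_cycles_per_thread <= 0:
        raise ValueError("work_cycles_per_thread must be positive")
    # Threads covering the stall, plus the one that is stalled.
    return math.ceil(stall_cycles / work_cycles_per_thread) + 1


# Hypothetical workloads: 4 cycles or 1 cycle of ALU work per fetch.
print(threads_to_hide_latency(1000, 4))  # 251
print(threads_to_hide_latency(1000, 1))  # 1001
```

The less arithmetic a shader does per texture fetch, the more threads are needed, which fits the "thousands of threads" figure.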

Because of this juggling, it's probably very difficult, if not outright impossible, to accurately measure the latency of any single set of pixel operations. It's probably not something AMD, Nvidia etc. document publicly (low-level hardware information is often surrounded by trade secrets and all that jazz), and also, the GPU schedules these threads by itself. While the scheduling is probably controlled by an algorithm of some sort (which may again be an undocumented trade secret), there's probably a lot of flexibility in what order it actually completes each batch of threads/pixels, making any measurement unpredictable.

So AFAIK we don't know exact latencies, and it's probably hard to find out, but then again we don't really have to know either. It's not that interesting a number, except for the engineers who work on designing these things in the first place. As users, we want smooth framerates, so what counts is that drawing of each frame finishes in a short, even timespan.

Perhaps it is possible to estimate how much texel bw is needed for 1280x720 anyways.

I believe the base formula per pixel is four texels per MIP map times 2 MIP levels for trilinear, but more, or possibly less, for anisotropic filter, so generally 8 texels = 32 bytes for 32-bit RGBA texture map. More for "deep" format textures. But then there's texture compression, so you'll never hit these high numbers except when reading from render target buffers, which will be uncompressed since you're generating them in realtime. Repeat until you've textured every pixel of the whole screen. Of course, this doesn't include multitexturing, in which case you will need to multiply with the number of layers per pixel for that particular polygon. ...And then there's overdraw, but that's highly variable so hard to put any single number on.
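The estimate above can be sketched as a few lines of arithmetic. This assumes every texel comes from memory (no cache reuse between neighbouring pixels, no compression), so it is an upper bound and not a measurement; the 60 fps figure and the function name are assumptions for illustration only.

```python
def texel_bytes_per_frame(width, height, texels_per_pixel=8,
                          bytes_per_texel=4, layers=1, overdraw=1.0):
    """Upper-bound texel traffic for one frame, in bytes.

    texels_per_pixel=8 is the trilinear case (4 texels x 2 MIP levels),
    bytes_per_texel=4 is an uncompressed 32-bit RGBA texture.
    """
    pixels = width * height
    return int(pixels * texels_per_pixel * bytes_per_texel * layers * overdraw)


FPS = 60  # assumed frame rate, not a figure from the thread

per_frame = texel_bytes_per_frame(1280, 720)
print(per_frame)                             # 29491200
print(f"{per_frame * FPS / 1e9:.2f} GB/s")   # 1.77 GB/s

# Two texture layers per pixel doubles the figure.
print(texel_bytes_per_frame(1280, 720, layers=2))  # 58982400
```

Overdraw and extra layers multiply the result directly, which is why it's hard to put a single number on it.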

...Or my math's off, but then hopefully one of the 3D wizards in this forum will come flying in and stomp all over me.

haihoo · Jan 27, 2013

Shifty Geezer said:
On a par. Any performance advantages Wii U may have are offset by limitations (being small and low power draw), such that any increase in overall performance (nigh impossible to measure) above current gen will be fractional rather than a multiple.

We don't have enough information on the Wii U hardware to answer this question. This thread is pure speculation.

function · Jan 27, 2013

haihoo said:
We don't have enough information on the Wii U hardware to answer this question. This thread is pure speculation.

Die sizes, CPU process node, main memory type and quantity and bus size, clocks, power consumption, and the DF analyses are not pure speculation.

We easily know enough to say that the Wii U is in the PS360 ballpark.

Shifty Geezer · Jan 27, 2013

haihoo said:
We don't have enough information on the Wii U hardware to answer this question. This thread is pure speculation.

Informed speculation based on a number of data points, as function has outlined (including dev comments). It is nigh impossible for Nintendo to have produced a smaller, lower-power-draw device that improves (at all, let alone significantly) on overall performance, especially when we know the memory is so damned slow. Only if they have secretly used a much smaller node is that possible.

I don't see how anyone can question the evidence. That's illogical.

DRS · Jan 27, 2013

@Grall, ok I get the point. When the first pixel group is rendered, we're looking at 1000 clocks of latency. But after that, the caches should produce enough bandwidth to keep things going. Thanks for the clarification. Though I'm still a bit unclear about the framebuffer latency. As far as I know DRAM has low latency when writing continuous blocks but what if the writes are scattered? I'd expect that it is inefficient for a ROP to wait one or two cycles in between writes. On the other hand, all 16 ROP PC cards utilize DDR, so do GPUs utilize cache between ROPs and framebuffer as well?

Grall said:
I believe the base formula per pixel is four texels per MIP map times 2 MIP levels for trilinear, but more, or possibly less, for anisotropic filter, so generally 8 texels = 32 bytes for 32-bit RGBA texture map. More for "deep" format textures. But then there's texture compression, so you'll never hit these high numbers except when reading from render target buffers, which will be uncompressed since you're generating them in realtime. Repeat until you've textured every pixel of the whole screen. Of course, this doesn't include multitexturing, in which case you will need to multiply with the number of layers per pixel for that particular polygon. ...And then there's overdraw, but that's highly variable so hard to put any single number on.

I think your math is correct. Though the overhead of mipmapping should be covered by caches and not result in additional external bw requirements of course. EDIT: I'm stupid, fetching another mip level's texels of course requires additional external reads. If it reads only 8 texels per cache miss, the worst case texel requirement for a single textured 1280x720 screen would be about 8MB in case of 8bpp textures. However, since you mentioned 1000 cycles stall it's probably much higher.

function said:
Die sizes, CPU process node, main memory type and quantity and bus size, clocks, power consumption, and the DF analyses are not pure speculation.

We easily know enough to say that the Wii U is in the PS360 ballpark.

So isn't the GPU die size big enough to fit 400-480 shading units (be it FP24) along with the eDRAM? And did someone measure power consumption with highly demanding games (I've only heard of Mario and the system menu being measured)? Sure, the fact that WiiU games don't use higher resolutions points in the direction of fillrates similar to XBOX, but what if Nintendo offers 8MB of framebuffer space and these games require more than one concurrent render target? I think coding can still be a factor that limits the performance.

Not that I care that much... I just wanted to play BO2 with wiimote, and WiiU sees to that

BRiT · Jan 27, 2013

haihoo said:
We don't have enough information on the Wii U hardware to answer this question. This thread is pure speculation.

Please keep your Nintendo Fanboyism in check. As already stated, given the information we do know, it's evident that the WiiU has no magic that remains to be seen. Yet every single one of your posts tries to suggest otherwise.

DRS · Jan 27, 2013

BRiT said:
As already stated, given the information we do know, it's evident that the WiiU has no magic that remains to be seen

With current facts I don't think we know if WiiU has gamecube or xbox like design, memory wise. In my opinion the GPU assumption is still like claiming something like: if it has breasts, it must be a girl.

AlphaWolf · Jan 27, 2013

No, it's more like 'if the whole system draws 40W, the GPU isn't a monster waiting to be unleashed'

Shifty Geezer · Jan 27, 2013

DRS said:
With current facts I don't think we know if WiiU has gamecube or xbox like design, memory wise.

That's not going to make the difference between Wii U being on par with current gen or considerably better though. All the question marks at this point are the difference between being 80% of PS360 or 130%, sort of thing.

shinobi · Jan 27, 2013

Shifty Geezer said:
That's not going to make the difference between Wii U being on par with current gen or considerably better though. All the question marks at this point are the difference between being 80% of PS360 or 130%, sort of thing.

So you're saying the Wii U could be 30% stronger or 20% weaker than current gen consoles.

function · Jan 27, 2013

shinobi said:
So you're saying the Wii U could be 30% stronger or 20% weaker than current gen consoles.

That's just an example. Different parts of the system will compare differently, and everything will vary by task.

shinobi · Jan 27, 2013

function said:
That's just an example. Different parts of the system will compare differently, and everything will vary by task.

Basically it's like PS3 and 360: they each do some things better than the other, so it's a wash. Same thing for the Wii U, I guess.

DRS · Jan 28, 2013

Shifty Geezer said:
That's not going to make the difference between Wii U being on par with current gen or considerably better though

One GPU related difference could be having EFB and texture cache on chip too (and even less space for GPU circuits). Being small it wouldn't cost much and is more likely to have high bw. It allows the GPU to be less demanding on the eDram. It would also explain why rendering hi-res isn't an obvious thing to do. It would be limiting but also be able to perform better.

Though, I kind of agree with you Shifty, if this system is twice as fast it is unlikely that XBOX ports run slower than on XBOX itself. And ERP's statement about the output resolution is pretty strong. WiiU has lower bandwidth and less CPU so XBOX ports will struggle to start. I'm just not that easily convinced that the GPU isn't able to render more vertices and pixel operations than XBOX GPU. How big would a 40nm Xenos be compared to Wiiu's CPU?

Grall · Jan 28, 2013

DRS said:
As far as I know DRAM has low latency when writing continuous blocks but what if the writes are scattered? I'd expect that it is inefficient for a ROP to wait one or two cycles in between writes. On the other hand, all 16 ROP PC cards utilize DDR, so do GPUs utilize cache between ROPs and framebuffer as well?

These low-level tech specs are never given out publicly, but in general terms there are always on-chip write buffers that queue up a bunch of writes while waiting for an opportune time to commit them to RAM (because of RAM latency, refresh or page-miss wait states and so on, or because they're waiting for more data to complete one burst write, for example). This helps with scattered writes, as the RAM controllers can queue up separate writes that go to the same memory page.

The write buffer also frees the device trying to write to RAM to go off and do other things, while the buffer takes care of actually completing the write to memory.
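To make the write-buffer idea concrete, here is a toy sketch that queues writes and, when full, hands them back grouped by memory page so each page only has to be opened once. The page size, the buffer capacity and the `WriteBuffer` class are all hypothetical values and names for illustration; no actual GPU is documented to work exactly like this.

```python
from collections import defaultdict

PAGE_SIZE = 2048  # bytes per DRAM page; hypothetical value


class WriteBuffer:
    """Toy write buffer that batches queued writes by DRAM page."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = []  # list of (address, data) tuples

    def write(self, address, data):
        """Queue a write; returns flushed batches once the buffer fills."""
        self.pending.append((address, data))
        if len(self.pending) >= self.capacity:
            return self.flush()
        return []

    def flush(self):
        """Group pending writes by page, one batch per page."""
        by_page = defaultdict(list)
        for address, data in self.pending:
            by_page[address // PAGE_SIZE].append((address, data))
        self.pending = []
        return [sorted(batch, key=lambda w: w[0])
                for _, batch in sorted(by_page.items())]


buf = WriteBuffer(capacity=4)
batches = []
# Scattered writes that alternate between two pages.
for addr, data in [(0, "a"), (5000, "b"), (64, "c"), (5064, "d")]:
    batches = buf.write(addr, data) or batches
print(batches)
# [[(0, 'a'), (64, 'c')], [(5000, 'b'), (5064, 'd')]]
```

The writer returns immediately after `write()`, and the four scattered writes end up as two page-local bursts.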

Also, graphics cards don't store the frame buffer (and also textures, other bits of data I suspect) in one contiguous chunk. Instead they break them up into pieces and map them out according to RAM pages and the separate RAM channels/memory controllers of the GPU, to spread accesses evenly across all RAM channels. This also helps with scattered writes, as you're less likely to have two scattered writes in a row hit the same memory controller. By staggering memory accesses, penalties can be spread out more evenly which increases efficiency.
...Or something like that.
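A minimal sketch of the interleaving described above, comparing it with a naive layout where each channel owns one big contiguous region. The chunk size, channel count and memory size are hypothetical numbers picked for illustration, not figures from any real GPU.

```python
CHUNK_BYTES = 256        # interleave granularity; hypothetical
NUM_CHANNELS = 4         # number of memory channels; hypothetical
MEMORY_BYTES = 1 << 20   # total memory in this toy model; hypothetical


def interleaved_channel(address):
    """Consecutive chunks rotate through the channels."""
    return (address // CHUNK_BYTES) % NUM_CHANNELS


def contiguous_channel(address):
    """Each channel owns one large contiguous quarter of memory."""
    return address // (MEMORY_BYTES // NUM_CHANNELS)


def channel_load(addresses, mapping):
    """Count how many accesses land on each channel."""
    load = [0] * NUM_CHANNELS
    for address in addresses:
        load[mapping(address)] += 1
    return load


# Sixteen consecutive chunks, e.g. part of one scanline of a framebuffer.
sweep = [i * CHUNK_BYTES for i in range(16)]
print(channel_load(sweep, interleaved_channel))  # [4, 4, 4, 4]
print(channel_load(sweep, contiguous_channel))   # [16, 0, 0, 0]
```

With interleaving the same access stream is spread over all four channels instead of hammering one.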

I'm no super expert on this sort of thing.

function · Jan 28, 2013

DRS said:
One GPU related difference could be having EFB and texture cache on chip too (and even less space for GPU circuits). Being small it wouldn't cost much and is more likely to have high bw. It allows the GPU to be less demanding on the eDram. It would also explain why rendering hi-res isn't an obvious thing to do. It would be limiting but also be able to perform better.

Though, I kind of agree with you Shifty, if this system is twice as fast it is unlikely that XBOX ports run slower than on XBOX itself. And ERP's statement about the output resolution is pretty strong. WiiU has lower bandwidth and less CPU so XBOX ports will struggle to start. I'm just not that easily convinced that the GPU isn't able to render more vertices and pixel operations than XBOX GPU. How big would a 40nm Xenos be compared to Wiiu's CPU?

You have to look at 5xxx series mobile binned parts on 40nm to find 400 shader GPUs that might fit into the Wii U power envelope. And even with 8 ROPs they could outperform the 360.

The Wii U GPU runs at 550 MHz and so should be able to run at 10% higher resolution than the 360 even if ROPs are the sole limiting factor (triangles/second should be 10% higher too).
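The 10% figure is just the clock ratio, sketched below. It assumes both GPUs have 8 ROPs writing one pixel per clock (the Wii U's ROP count is an assumption here, not a known spec) and uses the 360 GPU's 500 MHz clock.

```python
def pixel_fillrate(rops, clock_mhz):
    """Peak pixels per second, assuming one pixel per ROP per clock."""
    return rops * clock_mhz * 1_000_000


x360 = pixel_fillrate(rops=8, clock_mhz=500)
wiiu = pixel_fillrate(rops=8, clock_mhz=550)  # ROP count is assumed

print(f"{wiiu / x360:.2f}")  # 1.10
```

Peak fillrate says nothing about whether the memory can keep the ROPs fed, which is the other half of the question.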

Chances are that either the GPU is BW limited in some way or it doesn't actually have a lot of extra grunt. Or both.

What was that old Eurogamer claim back when the dev kits were clocked low (about 400 MHz iirc)? Not enough shaders? Doesn't seem to be a problem now, but if there was any truth behind what their source said then the Wii U could be a 30% drop in clock speed away from having less effective shader powah than the 360.

ERP · Jan 28, 2013

Usually frame buffer memory is organized into what vendors refer to as tiles, which are just 2D blocks that are contiguous in memory. This means that the 2x2 block of ROPs usually writes to a single tile. I believe that large triangles are rasterized in tile order, and I would assume there is some caching that occurs.
I believe this tiling is what puts the often severe alignment restrictions on target buffer location.
Most GPUs I'm aware of have the option to render to non-tiled memory, and at least some to swizzled targets, but there is usually a fairly significant penalty for doing so. With render to texture there is something of a juggling act over whether you should tile it, swizzle it, or neither, because tiling it reduces the efficiency of the texture cache, which is optimized for the swizzled case.
All things you don't get to care about on a PC.
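For anyone unfamiliar with the three layouts mentioned above, here is a small sketch of how a pixel coordinate could map to a memory offset (in pixels, not bytes) in each case. The 8x8 tile size and the use of plain Morton order for the swizzle are assumptions for illustration; actual tile sizes and swizzle patterns are vendor-specific.

```python
TILE_W = 8  # tile width in pixels; hypothetical
TILE_H = 8  # tile height in pixels; hypothetical


def linear_offset(x, y, width):
    """Plain row-major layout."""
    return y * width + x


def tiled_offset(x, y, width):
    """Each TILE_W x TILE_H block is contiguous; tiles are row-major."""
    tiles_per_row = width // TILE_W  # width assumed a multiple of TILE_W
    tile_index = (y // TILE_H) * tiles_per_row + (x // TILE_W)
    within_tile = (y % TILE_H) * TILE_W + (x % TILE_W)
    return tile_index * TILE_W * TILE_H + within_tile


def morton_offset(x, y):
    """Swizzled (Morton/Z-order) layout: interleave the bits of x and y."""
    result = 0
    for bit in range(16):
        result |= ((x >> bit) & 1) << (2 * bit)
        result |= ((y >> bit) & 1) << (2 * bit + 1)
    return result


# Offsets touched by one 2x2 pixel quad in a 1280-wide target.
quad = [(0, 0), (1, 0), (0, 1), (1, 1)]
print([linear_offset(x, y, 1280) for x, y in quad])  # [0, 1, 1280, 1281]
print([tiled_offset(x, y, 1280) for x, y in quad])   # [0, 1, 8, 9]
print([morton_offset(x, y) for x, y in quad])        # [0, 1, 2, 3]
```

In the linear layout the quad spans two rows 1280 pixels apart, while the tiled and swizzled layouts keep it within a few adjacent locations.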

jlippo · Jan 28, 2013

DRS said:
I think your math is correct. Though the overhead of mipmapping should be covered by caches and not result in additional external bw requirements of course. EDIT: I'm stupid, fetching another mip level's texels of course requires additional external reads. If it reads only 8 texels per cache miss, the worst case texel requirement for a single textured 1280x720 screen would be about 8MB in case of 8bpp textures. However, since you mentioned 1000 cycles stall it's probably much higher.

Also, mipmaps are needed to get any decent performance from the cache.
This Gamefest presentation had nice information on the subject:
http://www.microsoft.com/en-us/download/details.aspx?id=1166

I wonder if part of the eDRAM could be used as an additional cache layer for the GPU on WiiU and coming platforms.

function · Jan 28, 2013

So I keep hearing that the Wii U GPU is made using Renesas' 40nm UX8GD edram. But I can't find it on the website.

All I can find is a reference to UX8LD - the 40nm low power edram - which says "under development".

http://am.renesas.com/products/soc/asic/cbic/ipcore/edram/

So ... can anyone find the Renesas web page of the process and edram that the Wuu is supposed to be using?

AlNom · Jan 28, 2013

I think the Renesas eDRAM was just speculation based on their prior history. No one really knows.

wsippel · Jan 28, 2013

function said:
So I keep hearing that the Wii U GPU is made using Renesas' 40nm UX8GD edram. But I can't find it on the website.

All I can find is a reference to UX8LD - the 40nm low power edram - which says "under development".

http://am.renesas.com/products/soc/asic/cbic/ipcore/edram/

So ... can anyone find the Renesas web page of the process and edram that the Wuu is supposed to be using?

UX8GD is like the high capacity Macronix ROMs the 3DS uses: It exists, but is only available on request. UX8GD is the high performance version of Renesas' 40nm eDRAM (up to 800MHz), UX8LD is the low power version (up to 150MHz). 150MHz would be too slow even for Wii compatibility mode, so they have to use the performance version.
