Wii U hardware discussion and investigation *rename

I'm assuming the big swath of eDRAM that is not near the SIMDs isn't LDS.
On a more regular layout like RV770, the LDS is between the texture block and the ALU blocks.
The less regular small APUs don't do this. Brazos might have its LDS to the left of one of its SIMDs. I'm not sure about the Wii U's setup.

The usage model described for the eDRAM doesn't match up with what the LDS is capable of doing, so I didn't have a reason to think it relevant.
 
I'm assuming the big swath of eDRAM that is not near the SIMDs isn't LDS.
That much is clear - the device should have its 'proper' LDS. Question is, how close the eDRAM block would need to be to have a chance of being accessed by the SIMDs, Evergreen style.

On a more regular layout like RV770, the LDS is between the texture block and the ALU blocks.
The less regular small APUs don't do this. Brazos might have its LDS to the left of one of its SIMDs. I'm not sure about the Wii U's setup.
Yes, according to the official floorplan, the LDS block sits right to the left of the SIMD0 block, and not so close to the SIMD1 block.

The usage model described for the eDRAM doesn't match up with what the LDS is capable of doing, so I didn't have a reason to think it relevant.
What do we know about the usage model of the eDRAM, though?

Apropos, backtracing from your earlier discussion with DRS, I'm curious to see what benchmarks demonstrate little latency benefits from the caches on an AMD VLIW design. Any pointers would be welcome.
 
What do we know about the usage model of the eDRAM, though?
There are descriptions of games using it to hold framebuffer and intermediate buffer data. That involves handling ROP output, which the LDS does not have a path to receive.

Apropos, backtracing from your earlier discussion with DRS, I'm curious to see what benchmarks demonstrate little latency benefits from the caches on an AMD VLIW design. Any pointers would be welcome.

The first link is http://www.sisoftware.net/?d=qa&f=gpu_mem_latency .

One correction is that I had misremembered which lines belonged to which architectures. There is a difference of about 100-150ns from best case to worst case, with the best case being roughly 500ns for a discrete GPU.
The physical external memory devices would have less than 40ns of latency, leaving the majority of the hit to be taken up by the cache and memory pipeline of the GPU.

The big question I have about the numbers is that they do seem very high, especially relative to what was tested for RV770 here:
http://forum.beyond3d.com/showthread.php?p=1322632#post1322632

That earlier testing showed 180 cycles for an L1 hit, which is in the neighborhood of half the Sandra numbers. In either case, the baseline latency for a GPU is either very long, or extremely long. External DRAM has a fixed component of total latency, and it's not the majority.
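For a rough sense of scale, here is a back-of-the-envelope conversion of those figures (a sketch only; it assumes RV770's stock 750MHz core clock, and the Sandra and DRAM numbers are the approximate values quoted above):

```python
# Back-of-the-envelope latency comparison (assumed figures, not new measurements).
RV770_CLOCK_HZ = 750e6   # assumed RV770 reference core clock
l1_hit_cycles  = 180     # L1 hit latency from the linked RV770 test

l1_hit_ns      = l1_hit_cycles / RV770_CLOCK_HZ * 1e9
sandra_best_ns = 500     # rough best-case discrete-GPU latency from Sandra
dram_device_ns = 40      # rough upper bound for the external DRAM devices

print(f"RV770 L1 hit: ~{l1_hit_ns:.0f} ns")          # ~240 ns, about half the Sandra figure
print(f"Sandra best case: ~{sandra_best_ns} ns")
print(f"DRAM device share: ~{dram_device_ns / sandra_best_ns:.0%}")  # under 10% of the total
```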

The question comes back to whether the memory pipeline has been customized to bypass some of the main memory stages for the eDRAM, although even this is watered down by the very long L1 time.
 
There are descriptions of games using it to hold framebuffer and intermediate buffer data. That involves handling ROP output, which the LDS does not have a path to receive.
Yes, but we don't know how the multiple banks in there factor in - are fb accesses interleaved across all banks, or are they interleaved across a few, and if the latter, what do the 'idle' banks do - are they open for access for other purposes?

The first link is http://www.sisoftware.net/?d=qa&f=gpu_mem_latency .

One correction is that I had misremembered which lines belonged to which architectures. There is a difference of about 100-150ns from best case to worst case, with the best case being roughly 500ns for a discrete GPU.
The physical external memory devices would have less than 40ns of latency, leaving the majority of the hit to be taken up by the cache and memory pipeline of the GPU.
Thank you. I'm really curious to see Sandra's code, though - do they use vfetch or texsample? If the former, in the global-mem random access case Evergreen would use its caches effectively as a mere coalescing apparatus - the cache lines are invalidated after each wave. Not so with the texfetch path - caches would behave as actual caches there.
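To illustrate the distinction (a toy model only, not a description of Evergreen's actual cache hardware): with lines dropped after every wave the cache still coalesces accesses within a wave, but a second wave touching the same buffer gets no reuse, whereas a persistent cache serves it entirely from hits.

```python
# Toy model of the two access paths described above (illustrative assumptions only).
CACHE_LINE = 64  # assumed bytes per cache line

def run_wave(addresses, cache, keep_cache):
    """Return (hits, misses) for one wave's loads; optionally drop the lines afterwards."""
    hits = misses = 0
    for addr in addresses:
        line = addr // CACHE_LINE
        if line in cache:
            hits += 1
        else:
            misses += 1
            cache.add(line)
    if not keep_cache:   # 'coalescing only': lines invalidated once the wave completes
        cache.clear()
    return hits, misses

addrs = range(0, 4096, 4)  # two waves reading the same 4 KiB buffer
for keep in (False, True):
    cache = set()
    w1, w2 = run_wave(addrs, cache, keep), run_wave(addrs, cache, keep)
    print(f"persistent cache={keep}: wave1 {w1}, wave2 {w2}")
```

With invalidation both waves see the same 64 misses; with a persistent cache the second wave is all hits.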

The big question I have about the numbers is that they do seem very high, especially relative to what was tested for RV770 here:
http://forum.beyond3d.com/showthread.php?p=1322632#post1322632

That earlier testing showed 180 cycles for an L1 hit, which is in the neighborhood of half the Sandra numbers. In either case, the baseline latency for a GPU is either very long, or extremely long. External DRAM has a fixed component of total latency, and it's not the majority.
That's a very good read, thanks (and props to prunedtree). I did a similar, albeit rudimentary test (involving 4x4 matrices) myself, and it demonstrates a clear gradation across the vfetch, texsample and lds accesses - roughly a mem efficiency factor of 2x, going from left to right (while the matmul alu clause remains essentially the same). I can post the asm listings as well if needed.

I really think serious benchmark providers should publish the source code of their benchmarks. Now we have to guess what Sandra might be doing there to measure those latencies.

The question comes back to whether the memory pipeline has been customized to bypass some of the main memory stages for the eDRAM, although even this is watered down by the very long L1 time.
That sounds reasonable.
 
The numbers from SiSoft Sandra are in some cases just wrong (or misrepresent the operation of the caches in normal use). They probably didn't measure what they wanted to measure because of a lack of understanding of the different architectures. I wouldn't put too much faith in it.
 
The numbers from SiSoft Sandra are in some cases just wrong (or misrepresent the operation of the caches in normal use). They probably didn't measure what they wanted to measure because of a lack of understanding of the different architectures. I wouldn't put too much faith in it.
That's part of my concerns re such benches as well - CPU caches are strictly domestic animals compared to the GPU caches in the wild.
 
Interesting follow-up, thanks for spending some time on it, guys! So does LDS enable direct framebuffer access for SIMD cores? Does this, for example, explain the excessive use of DOF effects in modern games?

About the scheduling, could it be that the scheduling layer also translates TEV blending and indirect texture commands to SIMD clauses (though I guess Latte's clock is a bit low for that)? Or that the scheduler delegates these commands to a set of dedicated TEV cores? This would allow shader code to make use of both lines concurrently and increase performance a bit, though I don't see how that fits a genuine shader compiler.

Also back to the 160SP thing, which of course is the basis of my questions. :) I think I can understand why a SIMD core needs more than 1 SRAM bank per lane and why thread execution is interleaved per clock; there is overlap between the current thread reading registers and the previous thread writing its result. However, this only justifies 2 (single-ported) banks per lane. So why must they have 4? In my head it seems that having 4 banks results in having 2 spare r/w cycles (assuming that a minimum of 4 threads is required in that case). 4 banks is ok for interfacing with the outside world without interfering with SIMD operations, of course, but doesn't seem to be an absolute necessity.
 
Or that the scheduler delegates these commands to a set of dedicated TEV cores?
I would be surprised if it was genuinely necessary to include fifteen year old integer, fixed-function hardware in a...*ahem*...current-ish...GPU. What would be so magically special about the TEV you think that it would necessitate being included in wuugpu? It was clocked at what, 180MHz or somesuch. Seriously, I think today's shader processors could do everything it could with both hands tied behind their backs and then some. :)
 
Yep. The only reason to include TEVs would be ease of emulation, but Nintendo do seem to like that, so I wouldn't put it past them.
 
I would be surprised if it was genuinely necessary to include fifteen year old integer, fixed-function hardware in a...*ahem*...current-ish...GPU. What would be so magically special about the TEV you think that it would necessitate being included in wuugpu? It was clocked at what, 180MHz or somesuch. Seriously, I think today's shader processors could do everything it could with both hands tied behind their backs and then some. :)

TEV performs a blend-like A*B + (1-A)*C + D operation per clock (4 ROPs, so 20 FLOPS), and since a SIMD ALU does 2 FLOPS per clock it takes a few clocks to 'emulate' that. The indirect texture operation performs a dual DOT3 and an add on 2D texture coordinates, so that's 8 FLOPS more. I agree that modern GPUs can handle it perfectly. A 160 SP GPU does 32 FLOPS per clock which exceeds the 28 required. At ~240MHz it's no issue either.

My thinking is that the timing might end up different, though. Not sure how much of a problem that could be. Perhaps my thinking is too flat?

[EDIT] Perhaps the numbers are completely wrong; I don't know how RGBA quads relate to FLOPs. And smoked some.
 
a SIMD ALU does 2 FLOPS per clock it takes a few clocks to 'emulate' that.
Wuu has many more SIMDs than the TEV has ROPs. Many, many more. So not an issue to worry about really. :) Not that it'd be very costly to include it, hardware-wise, it's got to be a tiny thing at today's silicon processes, merely an imperceptible blip on the chip probably, but you gotta tie it into an alien rendering pipeline somehow, interface it with hardware it was never designed to co-exist with. That's gotta be a lot more complicated and costly than just the (relatively) small amount of logic for the TEV itself. It'll never be hardware compatible with the old Hollywood chip anyway, Nintendo's already stated they're not including the whole Wii kit and caboodle in Wuu, so I wonder if there really would be any point in tearing out one piece of non-critical hardware and transplanting it into Wuu...
 
Wuu has many more SIMDs than the TEV has ROPs. Many, many more. So not an issue to worry about really. :) Not that it'd be very costly to include it, hardware-wise, it's got to be a tiny thing at today's silicon processes, merely an imperceptible blip on the chip probably, but you gotta tie it into an alien rendering pipeline somehow, interface it with hardware it was never designed to co-exist with. That's gotta be a lot more complicated and costly than just the (relatively) small amount of logic for the TEV itself. It'll never be hardware compatible with the old Hollywood chip anyway, Nintendo's already stated they're not including the whole Wii kit and caboodle in Wuu, so I wonder if there really would be any point in tearing out one piece of non-critical hardware and transplanting it into Wuu...

Seems reasonable! And there are more ways to get the timing correct afterwards, I assume. Dolphin emu does pretty well too, though it has its glitches. It doesn't render stencil shadows correctly, for example. The error I made in my previous post is that a 160SP GPU does 320 FLOPS per clock and not 32, of course. And about the Wii FLOPS, to be honest I don't even know if a DOT3 is considered to be a single FLOP, 3 FLOPS since it's 3 MADDs, or 5 because it's 3 multiplies and 2 adds.
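Just to make that arithmetic explicit (a sketch using the per-operation counts from the earlier post, which are themselves rough estimates rather than verified hardware figures):

```python
# Rough per-clock FLOP budget, using the counts quoted in the posts above
# (the posters' own estimates; not verified hardware figures).
tev_blend_flops    = 20  # A*B + (1-A)*C + D style blend across the 4 ROPs quoted above
tev_indirect_flops = 8   # dual DOT3 plus an add on 2D texture coordinates
tev_total = tev_blend_flops + tev_indirect_flops   # 28 FLOPS per TEV clock

simd_sps   = 160
simd_flops = simd_sps * 2                          # MADD = 2 FLOPS per SP -> 320

print(f"TEV-equivalent work per clock: {tev_total} FLOPS")
print(f"160 SP part per clock: {simd_flops} FLOPS "
      f"(~{simd_flops / tev_total:.0f}x the requirement, before clock differences)")
```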

I'd like a bit more insight into the SIMDs' SRAM setup, though; people say that a modern 160SP@550MHz GPU can keep up with Xenos, which seems fair enough. But an RV6XX DX10 design can't be considered modern. And if eDRAM doesn't help either, I'd guess it should be clocked >550MHz or have other optimizations to reach the performance NFS MW shows. It seems easier to reach that performance with a 256 or 320SP part at a lower clock. This should be more power efficient too. And 320 units seem to use less than 15W, considering the differences in TDP between several AMD parts.
 
Wuu has many more SIMDs than the TEV has ROPs. Many, many more.
I assume you're suggesting that multiple threads on the SIMD unit could emulate a single Flipper pipeline, as AMD's VLIW architectures have exactly 5 (or 4) ALUs per clock per thread on the SIMD.
 
I'm loath to link to Lens of Truth, on the grounds that they are shit, but as no-one else seems to care enough about the Wii U any more, here is Lens of Truth on Splinter Cell:

http://www.lensoftruth.com/head2hea...arison-and-analysis-ps3-vs-xbox-360-vs-wii-u/

They refuse to install the HD texture pack for the 360 because it's too confusing, or some shit like that. But it's the Wii U version that's really interesting because ... it's pretty much exactly the same as the PS360 versions (who saw that coming?).

And when I say pretty much exactly the same, what I mean is worse frame rates than the 360 and really shit loading times. No torn frames though, which is a definite bonus, and seems to be achieved through triple buffering, which the PS360 don't use (memory limits?). Too bad the extra memory isn't used for higher res textures, but maybe the load times would have been even worse (and totally unacceptable if they had). Memory - both main RAM and eDRAM - is probably the only area where the Wii U has a significant and genuine advantage, so it's a shame to see it seemingly underutilised again.
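For what it's worth, the memory cost of that triple buffering is small (a quick estimate only, assuming a 1280x720 render target at 32 bits per pixel; the game's actual resolution and formats may differ):

```python
# Ballpark colour-buffer footprint for triple buffering
# (assumes 1280x720 at 32bpp, which is an assumption, not a confirmed figure).
width, height, bytes_per_pixel, buffers = 1280, 720, 4, 3
total_mb = width * height * bytes_per_pixel * buffers / (1024 ** 2)
print(f"~{total_mb:.1f} MB of colour buffers")  # ~10.5 MB total
```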

So anyway, once again the Wii U is coming in at more or less PS360 levels: a little better in some ways, and a little worse in others. (ModTweak: Let's not go there please)
 
Those load times reveal why DICE didn't even bother with Frostbite. Textures in Wii U games would be limited, similar to 360 BF3 without an HDD.

No HDD cache support really shows that Nintendo doesn't give a shit about games like GTA or BF performing better than on current-gen.
 
Also back to the 160SP thing, which of course is the basis of my questions. :) I think I can understand why a SIMD core needs more than 1 SRAM bank per lane and why thread execution is interleaved per clock; there is overlap between the current thread reading registers and the previous thread writing its result. However, this only justifies 2 (single-ported) banks per lane. So why must they have 4? In my head it seems that having 4 banks results in having 2 spare r/w cycles (assuming that a minimum of 4 threads is required in that case). 4 banks is ok for interfacing with the outside world without interfering with SIMD operations, of course, but doesn't seem to be an absolute necessity.
I think I can answer this myself now. I thought that Single Instruction Multiple Data referred to a complete SIMD core, but it doesn't. Each horizontal row (wavefront) can execute instructions independently of the others. So to do 40SP with only 16 banks, the banks should either be clocked at twice the speed or the SIMD must pair two rows to a single instruction (which doesn't help branching operations of course, but for graphics it would still be faster than having only half the rows). Is this the correct perception?
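For reference, the standard arrangement on AMD's VLIW parts is that a SIMD has 16 lanes and a 64-thread wavefront is issued over 4 clocks, so consecutive quarters of the wavefront naturally overlap their register reads and writes. A minimal sketch of that mapping (Latte's exact configuration isn't confirmed, so treat the numbers as the generic AMD case rather than Wii U specifics):

```python
# Simplified sketch of wavefront-to-lane mapping on a generic AMD VLIW SIMD
# (16 lanes, 64-thread wavefront issued over 4 clocks; not confirmed for Latte).
LANES = 16
WAVEFRONT = 64
CYCLES = WAVEFRONT // LANES   # 4 clocks to issue one instruction for the whole wavefront

for cycle in range(CYCLES):
    first = cycle * LANES
    last = first + LANES - 1
    print(f"clock {cycle}: lanes 0-{LANES - 1} run threads {first}-{last}")
```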

@Function, I glanced over that review quickly and failed to notice. Did they include a pixel count?
 
The "lazy devs" mantra is still going strong. Seems to be the heir apparant to the "unfinished dev kits" catch-all.

There are good reasons why multi platform ports should perform rather poorly on the WiiU.

We can definitely point to the CPU for some issues. Both the PS3 and the 360 have CPUs that devote major resources to SIMD floating point units, capabilities that may well be completely absent from the WiiU CPU. If a developer has managed even moderate utilization of the SIMD units of the PS360, this will create headaches for a WiiU port that may or may not be solvable. Indeed, more than one developer has explicitly pointed to the CPU as creating issues for their ports.

The other obvious suspect for performance issues is main memory bandwidth. The 360 has some 75% higher main memory bandwidth than the WiiU, and the PS3 has roughly the 360's bandwidth twice over, split between its separate CPU and GPU memory pools.

Hypothetical example: call the main memory bandwidth of the 360 "X". A title that targets the PS360 may use the full bandwidth available to the PS3 GPU, and let's assume the developers avoid using the full PS3 CPU bandwidth in order not to create unnecessary issues with the 360 version - say only 0.3X. The PS3 would then use 1.3X bandwidth, more than is available to the 360. However, let's assume that the developers can use the 360 eDRAM to reduce GPU main memory traffic so that the title fits in the 360's envelope of X: 0.3X for CPU game engine needs and just 0.7X for the GPU.

So what happens when this code is to be ported to a device that offers 0.6X total main memory bandwidth? Well, if the game engine running on the CPU still requires 0.3X, that only leaves 0.3X for the GPU. Ouch. That's less than half of what the GPU on the 360 had available to it, and that was with the eDRAM optimizations taken into account. It seems a given that there will be areas in any game that makes reasonably good use of the PS360 memory bandwidth which will run into rather severe performance issues on the WiiU, unless it can be recoded to utilize the larger eDRAM pool of the WiiU to fully compensate. I doubt that will always be possible, and reduced complexity or frame rate issues will result.
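Putting those hypothetical numbers in one place (a sketch of the same back-of-the-envelope budget; every fraction here is an illustrative assumption from the paragraphs above, not a measurement):

```python
# The same hypothetical budget, normalised so that 360 main memory bandwidth = 1.0 ("X").
# All fractions are illustrative assumptions, not measured figures.
x360_total   = 1.0
ps3_gpu_pool = 1.0   # assume the PS3 GPU pool is used fully
ps3_cpu_pool = 0.3   # deliberately held back for 360 parity
wiiu_total   = 0.6   # WiiU main memory bandwidth relative to the 360

cpu_need    = 0.3                      # game-engine (CPU) traffic carried over unchanged
gpu_on_360  = x360_total - cpu_need    # 0.7X left for the GPU (eDRAM already helping)
gpu_on_wiiu = wiiu_total - cpu_need    # 0.3X left for the GPU

print(f"PS3 total use:  {ps3_gpu_pool + ps3_cpu_pool:.1f}X")
print(f"360 GPU share:  {gpu_on_360:.1f}X")
print(f"WiiU GPU share: {gpu_on_wiiu:.1f}X "
      f"({gpu_on_wiiu / gpu_on_360:.0%} of the 360 GPU share)")
```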

It's worth pointing out that any strengths the WiiU might have will go unused unless the game is explicitly recoded to take advantage of them. The original code will simply stay within the demands defined by the PS360. For a multi-platform port, a strength in one area generally cannot compensate for a weakness in another.
And it does seem that there are such areas of strength, both from developer commentary and specific instances where the WiiU version actually performs better than the HD twins, or, such as in NFS-MW, provides additional eye-candy.

But in general, a game that was developed targeting the PS360, was, and still is, most likely to perform worse on the WiiU due to obvious and well known hardware limitations, which are unrelated to GPU ALU capabilities.
 
Different systems have different architectures, and therefore different strengths and weaknesses; running the same code on them will end up with different results. It's so obvious I didn't see the point in saying it, but you're most likely right to write it down for everyone to learn.
 
I'm loath to link to Lens of Truth, on the grounds that they are shit, but as no-one else seems to care enough about the Wii U any more, here is Lens of Truth on Splinter Cell:

http://www.lensoftruth.com/head2hea...arison-and-analysis-ps3-vs-xbox-360-vs-wii-u/

They refuse to install the HD texture pack for the 360 because it's too confusing, or some shit like that. But it's the Wii U version that's really interesting because ... it's pretty much exactly the same as the PS360 versions (who saw that coming?).

And when I say pretty much exactly the same, what I mean is worse frame rates than the 360 and really shit loading times. No torn frames though, which is a definite bonus, and seems to be achieved through triple buffering, which the PS360 don't use (memory limits?). Too bad the extra memory isn't used for higher res textures, but maybe the load times would have been even worse (and totally unacceptable if they had). Memory - both main RAM and eDRAM - is probably the only area where the Wii U has a significant and genuine advantage, so it's a shame to see it seemingly underutilised again.

So anyway, once again the Wii U is coming in at more or less PS360 levels: a little better in some ways, and a little worse in others. Surely there's nowhere left for the reality detached Nintendo fanboys to go now? I mean, surely?

I'm not really seeing these "reality detached fanboys" tbh. There might be a very small minority who are obviously living in a dream world (as with every platform) but the vast majority - even those who openly try and defend the wiiu at every turn - aren't implying it's a huge jump in specs over current gen.

Regarding the LoT comparison. Actually pretty impressive (knowing what we know about the WiiU's innards) to maintain ~same framerates as the other two whilst having v-sync locked. Shame about the loading times but I'd be interested to see if a DD version solves that? Yeah they've left the texture pack out on 360 but if anything that would have decreased the performance wouldn't it? Makes it an unfair comparison though, yes.

The "lazy devs" mantra is still going strong. Seems to be the heir apparant to the "unfinished dev kits" catch-all.

Again, it's a very small minority that seem to be trying to find excuses like the "lazy dev" thing. And to play devil's advocate (sorry, I can't resist), the 'unfinished dev kit' comments were fairly relevant with the launch window stuff - as it was something brought up by a developer (Criterion) and since even mentioned by Platinum in the Iwata Asks thing. The games coming out now are, I think, a fair representation of the WiiU's capabilities with regards to multi-platform releases. It's clearly designed, as many suggested from the start, to be current gen+ rather than 'next gen'. It can pump out the same/slightly better graphics as PS360 with half the power consumption (yay, dat USP!), with zero screen tearing and potentially a few extra bells and whistles (purely from the more modern feature set/more advanced shaders).

imo, this would all be great - if they started selling it for $50/£50/€50 cheaper than it currently is ;)
 