Wii U hardware discussion and investigation *rename

swaaye · Aug 13, 2013

R600 had tessellation too. Oh oh resurrect the R600-in-WiiU hypothesis.

Grall · Aug 13, 2013

psychogmv said:
The only customization that I believe Nintendo did was adding the Renesas eDram to the GPU logic.

I believe the unnamed person from AMD you refer to to be misinformed, or possibly being deliberately deceptive. There's two pools of (probably) SRAM too on the die, and very likely other hardware as well, meant to facilitate emulation of the ancient gamecube/wii chipset. There's the CPU bus/northbridge interface as well, DSP, ARM CPU integration and so on.

DRS · Aug 14, 2013

psychogmv said:
And in my point of view, changin SPU count is a customization, and this was ruled off by the AMD rep. The only customization that I believe Nintendo did was adding the Renesas eDram to the GPU logic. For me, its either a vanila HD4650 or HD5550, with no Nintendo secret sauce at all. And the tesselation abilities points more to the later option.

The requirement is also backward compatibility with Hollywood/Flipper. How does a vanilla HD4650 or HD5550 achieve that? I think Iwata asks mentions something about hollywood/flipper features *extending* the design and actually contribute.

wario said:
some devs have said its on par, some have said its even weaker.

Yes and unfortunately they don't all really elaborate their opinion (or do they?).

psychogmv · Aug 14, 2013

DRS said:
The requirement is also backward compatibility with Hollywood/Flipper. How does a vanilla HD4650 or HD5550 achieve that? I think Iwata asks mentions something about hollywood/flipper features *extending* the design and actually contribute.

Yes and unfortunately they don't all really elaborate their opinion (or do they?).

Yeah, you both are completely right. I forgot about those. But I still don't think Nintendo would modify the base GPU architecture that much. And maybe that what's that AMD guy was talking about. No new secret Nintendo instructions or a completely different SPU/TMU balance from R7xx parts available. But, I also don't think that this would change Wii Us position as somewhat better than PS360, and way inferior than PS4/XB1.

Thanks for the clarifications guys,

Rangers · Aug 14, 2013

willardjuice said:
I love it.

[edit] Okay perhaps that is slightly too trollish, but why can't ps2->xbox logic be applied to 360/ps3->wiiu logic? I understand the magnitude of the transitions are different, but I don't think that changes the underlining point.

Probably because Xbox usually had the slightly superior ports to PS2 in every case. even the "lazy" ports looked better on Xbox in almost every case.

With Wii U we see that's not the case, it has a spotty record on ports so far.

But sure, I guess it can be applied. I think the CPU is preventing easy porting to Wii U early on.

DRS · Aug 15, 2013

Question for the specialist: the GPU's on die eDram reduces latencies compared to GDDR5 stuff right? Would it be possible that due to this reduced latency, the SIMD cores stall a lot less? And would that reduce the need for running a multitude of threads per core? And wouldn't that reduce SRAM register bank requirements compared to AMDs mainstream designs?

psychogmv said:
But I still don't think Nintendo would modify the base GPU architecture that much. And maybe that what's that AMD guy was talking about. No new secret Nintendo instructions or a completely different SPU/TMU balance from R7xx parts available. But, I also don't think that this would change Wii Us position as somewhat better than PS360, and way inferior than PS4/XB1.

Well I guess we all can agree with that. Nintendo's problem is that they promised a console capable of delivering next gen quality but didn't show anything like that. Which led to no interest from core gamers and thus bad sales. So either they lied flat out or the system is capable of running PS360 ports without major optimization just below par and exceeds that largely when resources are used properly. I don't think anyone beliefs the latter,there is no proof for that whatsoever. I just hope that Nintendo did, for their own sake.

rangers said:
Probably because Xbox usually had the slightly superior ports to PS2 in every case. even the "lazy" ports looked better on Xbox in almost every case.

With Wii U we see that's not the case, it has a spotty record on ports so far.

Then again, PS2 didn't use the most sophisticated form of shading I believe while the XBOX and Gamecube probably already looked better by turning on simple gourraud and/or phong shading

Shifty Geezer · Aug 15, 2013

DRS said:
Question for the specialist: the GPU's on die eDram reduces latencies compared to GDDR5 stuff right? Would it be possible that due to this reduced latency, the SIMD cores stall a lot less?

Only in certain GPGPU workloads (as discussed in threads on the subject of eDRAM/ESRAM). GPU's are highly developed to manage high latency RAM access when rendering graphics - it's not a problem that the GPU vendors shrugged their shoulders at and said, "high latency, huh? Whatever."

3dilettante · Aug 15, 2013

DRS said:
Question for the specialist: the GPU's on die eDram reduces latencies compared to GDDR5 stuff right? Would it be possible that due to this reduced latency, the SIMD cores stall a lot less? And would that reduce the need for running a multitude of threads per core? And wouldn't that reduce SRAM register bank requirements compared to AMDs mainstream designs?

Until Microsoft or AMD discloses more about how the eSRAM interfaces with the GPU memory subsystems, we don't know how much faster it would be. That it should be faster to some degree is the current assumption, but until we know how it works and how it fits in the overall system, we are making assumptions beyond what is publicly available.
The GDDR5 memory itself contributes measurable latency, but in terms of the latency the GPU memory subsystem has it is a small fraction of the total. If the path the eSRAM is mostly the same as general memory traffic, the overall reduction in stalls won't be enough to significantly reduce the need for multiple threads.
The comparisons where people point out the full GPU memory pipeline's latency, and then a magical 2ns for the eSRAM have no basis in currently disclosed information.

GCN, at a minimum, is going to need four threads per CU to avoid cutting off whole SIMDs.
Within each SIMD, if you want to take advantage of its full issue width, you are going to want at least 5 threads. This isn't an absolute requirement if a workload doesn't leverage every issue slot, but it leaves per-cycle functionality on the table if you don't. That's 20 threads per CU out of 40 as a rough floor for the thread count.
If you want to watch out for the many other sources of latency, such as contention for local resources, instruction fetch, and shared memory, you might want some other threads to pitch in.
That can put a balance somewhere in the range of 20-40 threads per CU, which is sort of where we are now, although the pressure is probably closer to 40 with GDDR5 than it would be for eSRAM.

GCN at an architectural level is not strongly interested in catching a latency reduction, although it can still benefit.

DRS · Aug 15, 2013

Hi Shifty thanks for pointing out that thread, its very informative. Though my question wasn't really about latency being an issue but more about if reduced latency allows for a reduced number of concurrent threads/wavefronts being executed. Since these threads are supposed to hide the latency.

A page or 30 ago in this thread it was discussed if 320 shaders would be possible with only 16 SRAM banks. I think the conclusion was something like it would lead to deficiencies in handling wavefronts and limited number of threads. So I'm curious if less latency reduces the requirement for a great number of concurrent threads and perhaps allows for less SRAM banks.

Not that I'm convinced that it has 320 shaders, but this popped my mind back then when I read it but never dared to ask.

[EDIT]

3dilettante said:
The GDDR5 memory itself contributes measurable latency, but in terms of the latency the GPU memory subsystem has it is a small fraction of the total. If the path the eSRAM is mostly the same as general memory traffic, the overall reduction in stalls won't be enough to significantly reduce the need for multiple threads.

Ah, think my question is answered with that, thanks

3dilettante · Aug 15, 2013

Brain fart on my part, I was thinking GCN when it should be one of the VLIW architectures.

That would require at least two threads per SIMD, at a minimum. The latency numbers are less definite, and I've forgotten some of the gotchas from that gen, but unless you are running highly homogenous and branch-free code, you need to hide clause switch latencies that are in the tens of cycles.
That's going to put a floor on the thread counts irrespective of memory latency.

The point that we don't know how the on-die memory plugs into the same memory pipeline stands.
Also, the publicly known VLIW memory latency is even worse on average than GCN, with benchmarked latencies constant regardless of where in the hierarchy they are supplied. The R/W hierarchy in modern GPUs shows an actual latency difference for things like an L1 hit.

babybumb · Aug 15, 2013

willardjuice said:
I love it.

[edit] Okay perhaps that is slightly too trollish, but why can't ps2->xbox logic be applied to 360/ps3->wiiu logic? I understand the magnitude of the transitions are different, but I don't think that changes the underlining point.

Lets be real here. The gulf between PS2 and Xbox is astronomical compared to PS360->WiiU. There was simply no comparison between PS2 GPU and Geforce 3/4 based Xbox GPU. PS2 didnt even have a "GPU".. The Wii U GPU is simply few generations ahead in architechture but not much in power.

Pretty much every single game was superior on Xbox no matter how little efforts the ports got. The Wii U versions from major publishers are worse by a huge ratio than they are better.

The people arguing that Bayonetta->Bayonetta2 is some proof of that Wii U is superior are completely ignoring that Bayonetta was the first game by Platinium Games on HD consoles back in 2009. This engine has been improved multiple times (most recent version used Metal Gear Rising: Revengeance) and their Wii U titles are benefitting from that. Wii U development is not starting from zero.

This is like arguing that Gears of War and Gears of War 3 cant be possible on the same console and ignoring the software r&d updates to Unreal Engine 3. Wii U is just getting the latest versions from these engines and experience the developers have.

The reason people shouldnt expect much progress in Wii U graphics in its lifetime is simply because there is very little software r&d happening on this platform. The existing currentgen engines are simple brought over (hi dice!) just like PS2/Xbox engines to the original Wii. Its also arguable if there is much to be extracted even with huge efforts.

If Wii U had been released same time as PS3 i´m sure it would have had similar amounts of graphical proggression. Xbox 360 got BF2/COD2 and MW/BF3.. There was simply nothing better available even in the PC space in 2005. The massive graphical leap comes only from software R&D.

The engine tech suitable for Wii U (2008 GPU hardware at best) is fully understood now but it wasnt ready for Xbox 360 (multicore+directx9+) in 2005 which was ahead of times.

Nintendo games running at 60fpsish is also nothing new because they always have. It would be news if they all run 30fps like Pikmin 3. Games running 1080p are very similar to games running 1080p on other current gen consoles ie. not the most demanding.

Squeak · Aug 16, 2013

PS2 was a much different idea of how to build graphics rendering hardware. It can't be compared to other consumer approaches at the time. Some of the thinking in it is old and goes back to the 60's, some of it is very modern, and a lot of it both.

xbox's architecture on the other hand was mainly grown out the more or less arbitrary way PC had approached 3D, with a lot of arbitrary bottlenecks an redundancy.

xbox saving grace was that it had a very mature and well known API and toolset (and probably more RAM also helped because microsoft was ready to do just about anything to succeed. Though I don't think it made a lot of difference for the individual frames rendering quality).
That helped more than any kind of clever hardware in the short run.

And trouble is console cycles are short runs.
People are not willing to change their entire approach for just 5 years. Especially if they have an alternative and think that "the madness" is going to disappear within a reasonable timeframe.

That's why technically ambitious titles where developed for the xbox and made to fit on PS2.
I know it sounds like madness to some, but I'd still wager that the PS2 was never even close to being fully used in any commercial game for the above reasons.
It was not a perfect system, but neither where any of the three.

tongue_of_colicab · Aug 16, 2013

With a few exceptions like D3 and HL2 aren't you mixing up ps2 and xbox? ps2 had a 70+% market share and a whole bunch of exclusive games. I highly doubt that given the large differences in hardware & the market difference devs worked on xbox first and that shoehorned it to ps2. The other way is most likely.

PS2 hardware is probably much more "maxed out" than xbox or gc hardware. I remember devs on this board commenting about what kind of crazy stuff some people did with ps2 to make it do things it never really was designed to do.

DRS · Aug 16, 2013

@3dilettante:

So do I understand you correctly and does a VLIW SIMD core need 2 threads to mask it's own latencies? Thinking a bit further, compared to a 4650, the WiiU's eDram may have at least twice the bandwidth available. So texture fetch times will be much smaller, less threads needed to fill that gap. Does this mean that, in theory, 40SP per SIMD core would be possible with only 16 SRAM banks per core as it would need less SRAM to hold thread bound registers?

@babybumb:

Did you think of back in Ps2/XBOX age the financial picture involved with platform optimized ports was different from today? That's why you see small games such as Trine 2 improve much more; it costs less to do so.

The engine tech may be understood now, DX10.1 may be fully understood now. Its a nobrainer that Wii U (and PS4/XBONE) steps in where others are right now. I wouldn't even call it a benefit. But what about a system having a rather small pool of fast eDram, 2GB of slow memory, a CPU core with huge cache and low latency instead of high bandwith? How does one map the previously mentioned knowledge on a system with such a different architecture? Developers need to balance this in order to achieve the best results possible. For example, XB360 stores textures in relatively fast main memory, PS3 and PC store textures in dedicated VRAM. On Wii U it might be feasible to code another abstraction of texture storage in eDram (DISK->DDR->eDram->GPU) to improve GPU performance.

Now you expect developers not to invest in R&D. I agree that this won't happen as long developers can't make a good profit out of doing so. Besides that, they already have to invest in, for example, doing something useful with Wii U's gamepad and Miiverse integration. Also you rule out Nintendo themselves. By now they should know that their system doesn't sell well since people judge it on not producing next gen graphics. Do you think they want to keep it that way? I can imagine that the Unity and Crytek engine ports are pretty much optimized for Wii U as well, and paid by Nintendo. From that perspective I think it is harsh to say that Wii U's performance won't improve further.

Squeak · Aug 16, 2013

tongue_of_colicab said:
With a few exceptions like D3 and HL2 aren't you mixing up ps2 and xbox? ps2 had a 70+% market share and a whole bunch of exclusive games. I highly doubt that given the large differences in hardware & the market difference devs worked on xbox first and that shoehorned it to ps2. The other way is most likely.

PS2 hardware is probably much more "maxed out" than xbox or gc hardware. I remember devs on this board commenting about what kind of crazy stuff some people did with ps2 to make it do things it never really was designed to do.

That is the usual argument. Devs could easily get away with not putting enough effort into PS2 games (or was it that they simply didn't have the time or resources to do it?) They would sell anyway due to the huge base.
It was much much easier to get the best out of xbox, I don't think anyone would contest that.
And devs and publishers simply didn't see the gain from putting in the effort to truly make PS2 sing, when it would be much easier to put the blame on the hardware.
Sony on the other hand vastly overestimated the pride and adventurousness of devs.
The era of democoders and hackers who loved to tinker with stuff, learn and improve, where more or less over in 2000. Games had become big business and assembly line (no pun intended) work.

Sony should have put a lot more work into the API and libs (they where virtually nonexistent from the get go) and probably should also have taken time to explain and test the hardware to and with devs. If that meant launching a year later then so be it.
Many devs had a bad start with the machine and that soured the relationship. After the first year xbox was out and became a distraction and a morality breaker (why work hard to master PS2 when xbox was so much easier to get good results from, if high end graphics was your aim?), and then the generation was already almost over for many teams (it takes a couple of years to make a high budget game) so why even try?

3dilettante · Aug 16, 2013

DRS said:
@3dilettante:

So do I understand you correctly and does a VLIW SIMD core need 2 threads to mask it's own latencies? Thinking a bit further, compared to a 4650, the WiiU's eDram may have at least twice the bandwidth available. So texture fetch times will be much smaller, less threads needed to fill that gap. Does this mean that, in theory, 40SP per SIMD core would be possible with only 16 SRAM banks per core as it would need less SRAM to hold thread bound registers?

It needs at least two threads before it can utilize all its issue cycles. If there is only one thread, half the cycles on a SIMD do nothing. For GCN, with its 4 SIMDs and round-robin issue, the minimum is four. This is before all other considerations.

VLIW has different requirements for peak performance, including instruction-packing and register conflict restrictions that aren't latency-related but which add a lot of complexity for performance analysis.
VLIW also has additional complications like clause scheduling, since different types of instruction get put into separate packets of code, each of which incurs 40 or so cycles to switch.
The memory latency for even cache hits has been benchmarked in the realm of several hundred cycles. Rv770 was around 180 cycles on an L1 hit, as it was measured on this forum some time ago.
An optimized dense matrix multiply could get 8 or 6 threads per SIMD.
Other, more general uses have much more complex requirements and more places where the inflexibility of the clause approach injects more latency.

Suboptimal clause packing, hiding switch penalties, and getting around high minimum latencies would put threading requirements up, and since the memory latency within the GPU is almost unvaryingly high, the eDRAM can't start to make a difference until after the majority of the latency penalty is already incurred.

DRS · Aug 16, 2013

@3dilettante
Ok I had to read it 3 times before I got it, but I think I got it. You mention that there is 180 cycles of latency in case of a *hit*. Does this mean there is a queue in between? I mean I don't suppose the TMU's databus is kept busy so long when it accesses its cache or do I get that wrong?

My misconception was I always assumed that SIMDs were able to fetch instructions, texels etc themselves. But instead its the scheduling layer above, that manages those things and feeds the SIMDs with 'simplified' instuctions such as MADD, though introducing more latency. And as long as the amount of cycles spent on fetching data in case of a cache miss isn't huge in comparison to the other latencies, the eDram won't help much. Correct?

Or am I raging as a complete idiot now?

3dilettante · Aug 17, 2013

DRS said:
@3dilettante
Ok I had to read it 3 times before I got it, but I think I got it. You mention that there is 180 cycles of latency in case of a *hit*. Does this mean there is a queue in between? I mean I don't suppose the TMU's databus is kept busy so long when it accesses its cache or do I get that wrong?

I'm not sure of what the implementation is, or why latency benchmarks show the VLIW GPUs as having a flat latency graph that's a flat line at the worst-case latency level. This may be a combination of a memory subsystem that very heavily trades latency for high bandwidth utilization and some kind of very long pipelining for texture fetches.

My misconception was I always assumed that SIMDs were able to fetch instructions, texels etc themselves. But instead its the scheduling layer above, that manages those things and feeds the SIMDs with 'simplified' instuctions such as MADD, though introducing more latency. And as long as the amount of cycles spent on fetching data in case of a cache miss isn't huge in comparison to the other latencies, the eDram won't help much. Correct?

The VLIW GPUs have a more centralized scheduling and control setup, which GCN distributed amongst the CUs. Some of the hardware was probably in the VLIW, just unexposed and physically distant from the SIMDs.
Memory fetches are still the biggest latency hits. I question the difference the eDRAM can make because if the memory pipeline isn't customized, the GPU forces a very high minimum latency that the eDRAM is not allowed to improve.

darkblu · Aug 19, 2013

DRS said:
My misconception was I always assumed that SIMDs were able to fetch instructions, texels etc themselves. But instead its the scheduling layer above, that manages those things and feeds the SIMDs with 'simplified' instuctions such as MADD, though introducing more latency. And as long as the amount of cycles spent on fetching data in case of a cache miss isn't huge in comparison to the other latencies, the eDram won't help much. Correct?

Or am I raging as a complete idiot now?

No, you're not. But 3dilettante was exclusive talking of texture and vertex fetches (i.e. GATHER and VFETCH ops) and missed one third kind of data access - LDS. Such access is done from within ALU clauses, without the interference of the scheduler. Or put in other words, VLIW SIMDs *can* make use of low-latency mempools.

3dilettante · Aug 19, 2013

Unless the claim is that the eDRAM is plugged directly into the LDS (it is visibly separate from the SIMDs), I didn't see a reason to discuss it.

Wii U hardware discussion and investigation *rename

swaaye

Entirely Suboptimal

Grall

Invisible Member

DRS

psychogmv

Rangers

DRS

Shifty Geezer

uber-Troll!

3dilettante

DRS

3dilettante

babybumb

Squeak

tongue_of_colicab

DRS

Squeak

3dilettante

DRS

3dilettante

darkblu

3dilettante

Similar threads