Wii U hardware discussion and investigation *rename

Are you mixing up wavefronts with "threads" (work items)? While the wavefronts are the actual threads of the hardware, it's usually the work items (of which you have 64 in a wavefront) that are called that way in the context of GPUs. But there is no way you can fill even the very small Latte GPU with just 160 or 192 of these "threads". For starters, the VLIW architectures always interleave two wavefronts on a single SIMD to cover instruction latencies (the command processor keeps more wavefronts around to swap in in case one hits a long-latency instruction [memory access] or control flow). That means one already needs 256 "threads" (4 wavefronts) at minimum just to be able to hide the ALU latencies on a tiny GPU with just two SIMDs. To run efficiently, you would want significantly more than that (an order of magnitude or something in that range).
And there is actually no efficient way to run fewer "threads" than one has in a wavefront. So 10 "threads" for GS doesn't make the slightest sense; 10 wavefronts (640 "threads") do.
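To put rough numbers on that, a quick back-of-the-envelope sketch (the two-SIMD count and the 64-wide wavefronts are the assumptions being discussed in this thread, not official figures):

```c
/* Rough occupancy arithmetic for a small VLIW GPU like the assumed Latte
 * configuration (2 SIMDs, 64-wide wavefronts). The numbers are assumptions
 * taken from the discussion above, not from any official document. */
#include <stdio.h>

int main(void)
{
    int simds               = 2;   /* SIMD engines assumed for Latte               */
    int wavefront_width     = 64;  /* work items per wavefront on AMD VLIW parts   */
    int interleave_per_simd = 2;   /* wavefronts interleaved to cover ALU latency  */

    int min_wavefronts = simds * interleave_per_simd;
    int min_work_items = min_wavefronts * wavefront_width;

    printf("Minimum wavefronts just to hide ALU latency: %d\n", min_wavefronts);
    printf("That is %d work items (\"threads\"), so 160 or 192 work items "
           "cannot even fill the chip.\n", min_work_items);
    return 0;
}
```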

Probably due to some recent explanations that amounted to 1 thread:1 shader. I have a limited understanding of threads and even that was from a long time back. So thanks for the clarification. :D

So what would be your explanation as to why Nintendo is listing the numbers in this manner, which apparently even caused a dev to say Latte had 192 ALUs?
 
And it actually makes a lot of sense. A wavefront (or warp for nV) is the closest thing you can get to a thread as known from CPUs. There, it doesn't make a single CPU core have 8 threads just because you process eight 32-bit floats with a single AVX instruction. The vectors a GPU processes are just a bit wider (32 or 64 wide instead of 8 wide with AVX), but all elements are always processed in lockstep; they can't really diverge but only be masked (that is not completely true anymore for GCN, but it doesn't matter here), and there is only a single instruction pointer for one wavefront. That's why I hesitate to call the work items (vector elements) of a wavefront "threads". At the GCN ISA level, it even becomes very clear that the wavefronts are the actual threads, as one has a clear distinction between scalar and vector instructions (the latter operating on a vector of 64 values, the former not), just like on a CPU.
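To make the CPU analogy concrete, a minimal sketch: one AVX instruction operating on eight floats from a single thread, which is the same situation as one wavefront "thread" driving 64 lanes. Nothing Latte-specific here, just the analogy.

```c
/* One AVX instruction operates on eight 32-bit floats at once, yet nobody
 * would call this "8 threads" -- it is a single thread issuing one wide
 * vector instruction, which is exactly the wavefront/warp situation on GPUs.
 * Compile with -mavx. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);   /* eight adds, one instruction, one thread */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; ++i)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```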
 

From what I've gathered of bg's access to the information... Nintendo, in a brilliant display of being, er, Nintendo, has apparently decided to declare 'warps' and 'wavefronts' as 'threads' now in their documentation. Just to be jerkpants and confuse everybody.

I remember a conversation from a while back where it was mentioned that wavefronts were called threads back in some old Xenos presentations.

Here: http://beyond3d.com/showpost.php?p=1766062&postcount=5273


Honestly, looking back at terms that keep surfacing, like general purpose registers instead of temporary registers, and threads instead of wavefronts/warps, it could just be that Nintendo is sticking to their old ATI terminology.

I still say they are doing it to be jerkpants.
 
As I said, I actually prefer that "old" terminology; it's much closer to the one applied to all other processors. Why did one have to come up with something else (reusing or even abusing well-established terms) just for GPUs? That's where the confusion started.
 
Thanks for the link FS.


So keeping in line with what I was originally saying along with your explanation, would Latte be limited to 160 concurrent wavefronts?


I don't know if it was intentional, but the first thing I picked up on was the old terminology. I remembered most of them being in the R600 docs I read a while back. Having some familiarity with that allowed me to figure out it was a 160:8:8 design.

Anybody have suggestions on why Nintendo chose to go with 1:1 TMUs and ROPs?
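For anyone following along, the 160 part of that 160:8:8 reading falls straight out of the SIMD layout being assumed in this thread; a trivial sketch (the two-SIMD count is the thread's working assumption, not something official):

```c
/* The 160 in "160:8:8" as discussed in this thread: 2 SIMD engines, each
 * 16 VLIW5 units wide, 5 ALUs per unit. The SIMD count is the thread's
 * working assumption, not something from official documentation. */
#include <stdio.h>

int main(void)
{
    int simds         = 2;
    int vliw_units    = 16;  /* the "ALUs" in Nintendo's wording, per the thread */
    int alus_per_unit = 5;   /* VLIW5 slots */

    printf("Stream processors: %d\n", simds * vliw_units * alus_per_unit); /* 160 */
    printf("With 8 TMUs and 8 ROPs that gives the 160:8:8 configuration.\n");
    return 0;
}
```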
 

No prob, bg. I hope gipsel can weigh in, but let me take a stab at this just to see if I'm getting things straight myself. From what I gather, if a wavefront (64 threads/work items) is executed over four cycles, then each cycle 16 threads are executed by a SIMD engine. This is one thread for each of the "ALUs" as Nintendo lists them (a group of 5 shaders).

I've read that other AMD GPUs have a limit of 8 "work groups" per SIMD engine and that each work group is 1 or 4 wavefronts. Thus, if this is also true of Latte, it would seem that, in total, there can be up to 64 wavefronts in flight at any given time. 192 seems to be a global limit in the command processor.

As for the TMU to ROP ratio - some quick guesses. ROPs were needed for the additional fillrate of the GamePad? More TMUs would be bottlenecked by low MEM1 bandwidth? TMUs in Xenos were possibly bottlenecked to some degree? Or splitting the SIMD engines would require more complexity, which Nintendo vetoed? That would be a wavefront size of 32. Or they just had a set silicon budget and two more sets of TMUs and L1 caches would take up too much space (probably in combination with some of the other factors) :?:
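A quick sketch of the wavefront arithmetic from the first half of this post (the 8-work-group and 4-wavefront-per-group figures are the ones quoted above, so treat the whole thing as an assumption):

```c
/* Back-of-the-envelope for wavefronts in flight, using the figures quoted
 * in this post (8 work groups per SIMD, up to 4 wavefronts per work group,
 * 2 SIMDs, 16-lane SIMDs) -- assumptions from the discussion, not a spec. */
#include <stdio.h>

int main(void)
{
    int simds                = 2;
    int work_groups_per_simd = 8;
    int wavefronts_per_group = 4;   /* upper bound; can also be 1 */
    int wavefront_width      = 64;
    int simd_lanes           = 16;  /* a 64-wide wavefront issues over 64/16 cycles */

    int wavefronts_in_flight = simds * work_groups_per_simd * wavefronts_per_group;

    printf("Cycles to issue one wavefront: %d\n", wavefront_width / simd_lanes);
    printf("Wavefronts in flight (both SIMDs): %d\n", wavefronts_in_flight);
    printf("The 192 figure would then be a global command-processor limit, "
           "well above what the SIMDs can actually hold.\n");
    return 0;
}
```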
 
Didn't nV start this? Or was it MS? :rolleyes:

At least nVidia's SIMT supports divergence within a warp, I think potentially up to the number of ALUs, where sequencing is done to support all the divergences. Andy Glew argues that this should qualify as threading since there are independent PCs.
 
My take on Nintendo keeping a 1:1 TMU and ROP ratio: it was either intentional or incompetence. I'm guessing both. It does seem like they went out of their way to make a severely handicapped piece of hardware for the costs involved.
 
So keeping in line with what I was originally saying along with your explanation, would Latte be limited to 160 concurrent wavefronts?
If it has just two full-size SIMDs (so 16x5 ALUs and 64-element-wide wavefronts), it is actually limited to less. IIRC, the GPR allocation has a granularity of 4 float4 registers (and a few [4 by default, I think] per SIMD are usually reserved as clause temporary registers). That means you can't keep more than 63 wavefronts in flight on each SIMD (126 in total) at the absolute maximum. And this is only true for absolutely minimal shaders requiring not more than 4 registers (and there could even be a minimum allocation of 8 registers, I don't remember and would have to look it up). That's why I think 192 is very generous. Latte will never be limited by this number (also evident from the fact that an RV770 with 10 SIMDs was fine with 256 wavefronts for the whole chip). But maybe it made no sense to invest time and money to scale the "ultra-threaded dispatch processor" further down. Or there are additional small gains under some circumstances from having wavefronts waiting in there without any resources (GPRs, LDS) of the SIMD allocated to them. But nothing jumps to my mind (as the data for the wavefronts has to be stored somewhere). No idea.
Anybody have suggestions on why Nintendo chose to go with 1:1 TMUs and ROPs?
4 ROPs may have been a bottleneck for 1080p titles?
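To put the GPR math from the first answer above into numbers: a minimal sketch, assuming a 256 KB register file per SIMD (typical for an R700-class VLIW5 SIMD, my assumption); the 4-register allocation and the reserved clause temporaries are the figures from the post.

```c
/* Sketch of the GPR-limit arithmetic. The 256 KB register file per SIMD is
 * an assumption (typical for R700-class VLIW5 SIMDs); the 4-register
 * granularity and the reserved clause temporaries come from the post above. */
#include <stdio.h>

int main(void)
{
    int regfile_bytes_per_simd = 256 * 1024;   /* assumption                    */
    int bytes_per_float4       = 16;
    int wavefront_width        = 64;
    int min_regs_per_element   = 4;            /* absolutely minimal shader     */
    int clause_temps_per_elem  = 4;            /* reserved, per the post        */

    int total_regs    = regfile_bytes_per_simd / bytes_per_float4;   /* 16384 */
    int reserved      = clause_temps_per_elem * wavefront_width;     /*   256 */
    int per_wavefront = min_regs_per_element * wavefront_width;      /*   256 */

    int max_wf_per_simd = (total_regs - reserved) / per_wavefront;   /*    63 */

    printf("Max wavefronts per SIMD (minimal shader): %d\n", max_wf_per_simd);
    printf("Two SIMDs: %d -- already below the 192 \"threads\" in the docs.\n",
           2 * max_wf_per_simd);
    return 0;
}
```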


At least nVidia's SIMT supports divergence within a warp, I think potentially up to the number of ALUs, where sequencing is done to support all the divergences. Andy Glew argues that this should qualify as threading since there are independent PCs.
That is simply not true. It is just lane masking. One still always has a single PC for all elements in a warp, which are always executed in lockstep. It doesn't allow for so-called irreducible control flow. It is simply impossible on nV hardware at the moment.
SIMT right now just describes SIMD with lane masking (which is why it is a redundant term made up by nV, in my opinion; the "threads" lack a fundamental property of a real thread, their own PC and a logically independent execution). It may evolve into something more (nV's proposed architectures for Einstein and this exascale thing appeared to plan for it, according to presentations), but so far it isn't there. On the other side, GCN added a somewhat kludgy mechanism to enable irreducible control flow. It isn't fast, but it works (by dynamically "dividing" a wavefront into multiple wavefronts with the help of a stack [deep enough to be able to split it up to individual elements, so a forked wavefront may have only a single active element] and joining them together afterwards). There you really have different PCs for elements in different branches.
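A toy model of what "SIMD with lane masking" means in practice: one PC for the whole warp, both sides of a branch executed, inactive lanes simply not writing results. This is only an illustration of the concept, not how any particular chip is wired.

```c
/* Minimal illustration of "SIMT as SIMD with lane masking": one program
 * counter for the whole warp, divergence handled by executing both sides of
 * the branch while masking off the inactive lanes. A toy model only. */
#include <stdio.h>
#include <stdint.h>

#define WARP 8  /* toy warp width */

int main(void)
{
    int data[WARP] = {3, -1, 4, -1, 5, -9, 2, -6};
    int out[WARP];
    uint8_t active = 0xFF;          /* execution mask: all lanes on */

    /* if (data[i] >= 0) ... : build the mask from the per-lane condition */
    uint8_t take_if = 0;
    for (int i = 0; i < WARP; ++i)
        if (data[i] >= 0) take_if |= (1u << i);

    /* "then" side: the warp steps through it, only masked-in lanes write */
    for (int i = 0; i < WARP; ++i)
        if (active & take_if & (1u << i)) out[i] = data[i] * 2;

    /* "else" side: the same warp runs it again for the complementary mask */
    for (int i = 0; i < WARP; ++i)
        if (active & ~take_if & (1u << i)) out[i] = -data[i];

    for (int i = 0; i < WARP; ++i) printf("%d ", out[i]);
    printf("\n");   /* the warp paid for both sides; no lane had its own PC */
    return 0;
}
```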
 
So... Nintendo was intentionally incompetent?

Please understand.


Does anyone know if Nintendo's engineering had any major key people retire or any leadership changes between Cube and Wii, and between Wii and Wii U? I know the engineers behind Wii U are named in the Iwata Asks, but that's about the extent of what I know. The names I can't remember.
 
That is simply not true. It is just lane masking. One still always has a single PC for all elements in a warp, which are always executed in lockstep. It doesn't allow for so-called irreducible control flow. It is simply impossible on nV hardware at the moment.
SIMT right now just describes SIMD with lane masking (which is why it is a redundant term made up by nV, in my opinion; the "threads" lack a fundamental property of a real thread, their own PC and a logically independent execution). It may evolve into something more (nV's proposed architectures for Einstein and this exascale thing appeared to plan for it, according to presentations), but so far it isn't there. On the other side, GCN added a somewhat kludgy mechanism to enable irreducible control flow. It isn't fast, but it works (by dynamically "dividing" a wavefront into multiple wavefronts with the help of a stack [deep enough to be able to split it up to individual elements, so a forked wavefront may have only a single active element] and joining them together afterwards). There you really have different PCs for elements in different branches.

Okay, then I must have been very misled by Andy Glew's presentations (which implement what I described and call it SIMT) and by claims that it's what nVidia GPUs are using now. I was also thrown off by other comments, such as David Kanter's: "If there are N divergent paths in a warp, performance decreases by about a factor of N, depending on the length of each path." If the predication is handled with a flat set of paths, then it wouldn't be a factor of N but a factor proportional to the number of paths the code could take, which could be totally unrelated to N. This struck me as the main benefit of the approach I thought it was using: you could have, say, a switch statement with hundreds of targets and not have to go through some hierarchy in software, weeding out big chunks of the space that no "threads" are executing.

I've also seen the description (here, for example: https://engineering.purdue.edu/~smidkiff/ece563/files/GPU1.pdf) that SIMT is different from SIMD by having single instruction -> multiple register sets, multiple addresses, and multiple flow paths. The first one is purely semantics (although you could call a lack of horizontal instructions a "feature" of this), and the second is not actually something excluded in SIMD with scatter/gather. That leaves the third, and I don't agree that it's qualified via predication alone. But I guess nVidia was describing a software model and not a hardware one? I guess the __any and __all functions should have been evidence of this.
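To illustrate the factor-of-N point: with flat predication the warp would pay for every case the code contains, while sequencing only the targets that lanes actually take costs N. A toy cost model, purely to show the distinction I had in mind, not measured hardware behaviour.

```c
/* Toy cost model for the "factor of N" question: with flat predication a
 * warp issues every case of a switch, masked or not, so the cost tracks the
 * number of paths the code *could* take; if only the targets that lanes
 * actually hit are sequenced, the cost tracks the number of *distinct*
 * targets taken (N). Purely illustrative. */
#include <stdio.h>
#include <stdbool.h>

#define WARP 8

int main(void)
{
    int total_cases = 100;                                /* a switch with 100 targets */
    int lane_target[WARP] = {7, 7, 42, 7, 42, 7, 7, 42};  /* what the lanes pick       */

    /* count distinct targets actually taken by the warp */
    bool taken[100] = {false};
    int  distinct = 0;
    for (int i = 0; i < WARP; ++i)
        if (!taken[lane_target[i]]) { taken[lane_target[i]] = true; ++distinct; }

    printf("Flat predication cost ~ %d case bodies issued\n", total_cases);
    printf("Per-taken-target cost ~ %d case bodies issued (N)\n", distinct);
    return 0;
}
```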
 
I've also seen the description (here for example: https://engineering.purdue.edu/~smidkiff/ece563/files/GPU1.pdf)
The description there is really rubbish. In one sentence they claim that the elements ("threads") in a warp are executed in lockstep, but in the next sentence each one can progress at its own rate (which is utterly wrong). How is that supposed to fit together?
your link said:
All threads running at some cycle execute the same instruction, giving rise to the name Single Instruction, Multiple Threads.
Each thread has its own registers, and can progress at its own rate, unlike SIMD.
WTF?!?
That leaves the third, and I don't agree that it's qualified via predication alone. But I guess nVidia was describing a software model and not a hardware one? I guess the __any and __all functions should have been evidence to this.
With these warp vote functions one can get coherent branching of all work items/elements in a vector ("threads" in a warp), depending on a population count of a bitfield consisting of the flags for all elements (for instance, set by a comparison in each element). It's basically a kludge for not having a proper scalar instruction set (like GCN's, which can do all that and more). I would say it doesn't pertain to the "SIMD with lane masking equals SIMT" issue. A CPU with a SIMD extension can usually do the same.
And if you think nV does more than just lane masking to handle divergent branches, what is that supposed to be?

Edit:
Anyway, I think this was discussed already in another thread (I just remember that the prior discussion even mentioned the existence of translation tables between the common processor-architecture terminology and "nVidian"), and we've drifted quite a bit off track here.
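To make the warp-vote idea concrete, here's a toy emulation of the concept: reduce the per-lane predicate to one scalar flag so the whole warp can branch coherently. CUDA's real __any()/__all() do this in hardware; this sketch is just the idea, not their implementation.

```c
/* Toy emulation of the warp vote idea described above: reduce the per-lane
 * predicate to a single flag so the whole warp can branch coherently.
 * CUDA's __any()/__all() do this in hardware; this is only the concept. */
#include <stdio.h>
#include <stdbool.h>

#define WARP 8

static bool vote_all(const bool pred[WARP])
{
    for (int i = 0; i < WARP; ++i)
        if (!pred[i]) return false;
    return true;
}

static bool vote_any(const bool pred[WARP])
{
    for (int i = 0; i < WARP; ++i)
        if (pred[i]) return true;
    return false;
}

int main(void)
{
    int data[WARP] = {3, 1, 4, 1, 5, 9, 2, 6};
    bool pred[WARP];
    for (int i = 0; i < WARP; ++i)
        pred[i] = (data[i] > 2);            /* per-lane condition */

    /* One scalar decision for the whole warp -- no per-lane masking needed
     * if everybody agrees, which is the point of the vote. */
    if (vote_all(pred))
        printf("all lanes agree: take the branch without masking\n");
    else if (vote_any(pred))
        printf("lanes disagree: fall back to lane masking\n");
    else
        printf("no lane takes it: skip the branch entirely\n");
    return 0;
}
```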
 
This blog post http://tangentvector.wordpress.com/2013/04/12/a-digression-on-divergence/ by Tim Foley (Intel engineer) is interesting for learning more details on the way GPUs handle divergent branches.
Reading this made me chuckle:
Aside: Some people will insist that GPUs have a distinct program counter for every invocation in the warp. These people are either (A) confused, (B) trying to confuse you, or (C) work with some really odd hardware.
Btw., Intel isn't innocent either, with their strands and fibers. :LOL:
 