So you proved it with your wishful thinking?
No, those were separate things. I proved my claim of his prejudice with current data.
For your allegation of wishful thinking (which isn't about lacking hard proof, since we're talking about a prediction, but about going against all logic), you would need to show that there are no trends supporting unification. There clearly are several trends that do support it, and others that make it a big challenge. So my theory is at least as good as yours, and you cannot accuse me of wishful thinking. Otherwise anyone with a theory and arguments to support it would be a wishful thinker. If that's what you call all of us, I'll take it as a compliment from now on.
But sorry, you proved nothing.
I never claimed I had full proof. And neither do you have hard proof for the opposite claim, because there are no true unified architectures yet (that cannot clearly be significantly improved upon). It's just one theory against another for now. There will be many more arguments and experiments before either of us can start proving anything.
All you offered is some kind of (in my opinion flawed) conjectures.
Good. All we can do is offer conjectures, and agree or disagree on them. I value your arguments and opinions.
I suggest you think again about that. It is not fully unified. The FPU in BD/PD/SR is even shared between two cores. How can this be fully unified with the integer core?
I never defined a unified architecture as having floating-point units as part of the integer core. So I apologize for not specifying it before but you're attacking a straw man. To me a unified architecture has a homogeneous ISA for high ILP and high DLP workloads, and very low overhead for switching between them. Basically, it should be able to take any code in which the loops with independent iterations have been vectorized, with practically no risk of being slower than legacy CPUs.
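To make that concrete, here's a minimal sketch in C (the function and names are just mine for illustration) of the kind of loop I mean: independent iterations that a compiler can vectorize for whatever SIMD width the unified core offers, and that compiles down to perfectly ordinary scalar code when it has to.

```c
#include <stddef.h>

/* Independent iterations: each a[i] depends only on b[i] and c[i],
 * so the compiler is free to vectorize this for SSE, AVX2, AVX-512, ...
 * The same source runs unchanged, and no slower, as scalar code. */
void saxpy_like(float *restrict a, const float *restrict b,
                const float *restrict c, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = k * b[i] + c[i];
}
```

Code like that should map onto a unified architecture with zero switching overhead, which is the whole point of the definition.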
I'm very open to alternate implementations, but it appears clear that each integer core will need access to a nearby SIMD cluster that feeds off the same instruction stream. I wouldn't recommend sharing such a SIMD cluster between two scalar cores like AMD currently does, but I wouldn't say that makes it not a unified architecture. It's just a weak one right now, with a seemingly bleak future. But I'd love to be proven wrong in that regard.
You are wrong. You miss the important point here. Vertex and pixel processing loads do not differ in some fundamental point of view. Both are throughput oriented workloads on 32bit floats.
Yes, but that wasn't the case a priori. Vertex and pixel processing did differ very much in the early days. Polygon counts were really low, in part because vertex processing used to be done on the CPU and it didn't have a lot of computing power. And pixel pipelines were fixed-function and integer only. It would have been madness back then to suggest unifying them because of some common ground. The differences far outweighed the similarities - which is why they were processed by separate hardware back then. It's only after vertex processing became programmable, after it used multiple cores, after polygons became smaller, after pixel processing became programmable, and after pixel processing became floating-point, that unification became a viable theory. I still recall though that there were naysayers on this very forum, only days before the GeForce 8800 announcement. So no, unification was never an obvious thing. Nowadays it's a given, because the software started to fully exploit the unification. We take it for granted and find it very valuable. But it is a posteriori knowledge talking when we call it obvious.
Unification of the CPU and GPU may not seem very obvious to you right now because there is no software to really exploit it, which is only the case because there are no unified architectures with such high compute density yet. It's a chicken-and-egg issue that is only going to get resolved gradually. This lack of software also makes it hard to see the common ground. But the CPU and GPU are already trading blows when it comes to OpenCL performance (which ironically was targeted specifically at the GPU). So there are some faint signs that in my opinion should not be brushed off as irrelevant.
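For what it's worth, the 'trading blows' part is easy to check yourself, since the same OpenCL kernel source can be built for either device type. A minimal host-side sketch (plain OpenCL 1.x calls, assuming nothing beyond an installed OpenCL runtime) that just lists which of your devices are CPUs and which are GPUs:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);
    if (num_platforms > 8) num_platforms = 8;

    for (cl_uint p = 0; p < num_platforms; ++p) {
        cl_device_id devices[8];
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8,
                           devices, &num_devices) != CL_SUCCESS)
            continue;
        if (num_devices > 8) num_devices = 8;
        for (cl_uint d = 0; d < num_devices; ++d) {
            char name[256] = {0};
            cl_device_type type = 0;
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
            /* The same clBuildProgram/clEnqueueNDRangeKernel calls work on both. */
            printf("%s %s\n", (type & CL_DEVICE_TYPE_GPU) ? "GPU:" :
                              (type & CL_DEVICE_TYPE_CPU) ? "CPU:" : "?:", name);
        }
    }
    return 0;
}
```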
Really the common ground between the CPU and GPU is 'computing'. Any application is a mix of ILP, DLP and TLP. It won't make sense much longer to drive a wedge down the middle. Intel is adding AVX2+ to its CPUs because there's a lot of DLP left to exploit. GPUs are lowering their average instruction latency to increase efficiency on complex workloads with limited parallelism and to keep scaling under limited bandwidth conditions.
Once this convergence culminates into a unified architecture, it will spark the development of new applications that depend on it, and we'll soon wonder how we could ever live with heterogeneous compute hardware.
But you want to overcome the fundamental difference between latency and throughput oriented workloads. That is not going to happen so easily as the processed data types won't change anything on that distinction.
I beg to differ. Despite being throughput oriented, today's GPUs cannot deal with an average instruction latency of a hundred cycles or more, like they used to in the fixed-function pixel pipeline days. So clearly they have become far more latency oriented too, whether you like it or not. It's really simple: Once you're out of data parallelism, everything becomes latency sensitive. That includes graphics.
So it's becoming harder to label these workloads as fundamentally different. Sure, these so-called throughput oriented workloads can use thousands of ALUs instead of a few, but from the individual ALU's point of view it's all just a thread with similar characteristics to the one you'd feed a scalar core. Again, once you're out of data parallelism, which means your SIMD units can't be made wider without losing performance, everything becomes latency sensitive.
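To illustrate what 'out of data parallelism' means in practice, here's a textbook recurrence (the names are mine, just for illustration): every iteration depends on the previous one, so no SIMD width in the world helps, and performance is set purely by the latency of the multiply-add chain, on a GPU just as much as on a CPU.

```c
/* Exponentially weighted moving average: a loop-carried dependence.
 * There are no independent iterations to vectorize, so throughput
 * hardware gains nothing from wider SIMD here; only the per-instruction
 * latency of the chain matters. Assumes n >= 1. */
float ewma(const float *x, int n, float alpha)
{
    float y = x[0];
    for (int i = 1; i < n; ++i)
        y = alpha * x[i] + (1.0f - alpha) * y;  /* y depends on previous y */
    return y;
}
```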
That was discussed already endless times I think. A throughput oriented task usually accesses amounts of data not fitting into the L1 or L2. Streaming it through a cache structure optimized for low latency access is just wasteful, also from a power perspective. Calling an asynchronous subroutine on a dedicated throughput unit with its own L1/L2 specialized to those tasks doesn't increase data movement at all. The data just gets moved from the LLC or the memory hierarchy to a different place than your latency optimized core.
The problem is that you're assuming "a throughput oriented task usually accesses amounts of data not fitting into the L1 or L2" and you somehow appear to think that's something that cannot and should not be changed, as if it's a good thing. It is not a good thing. For argument's sake let's say L1 and L2 are your only caches, so everything else is a RAM access. When your ALU count doubles with the next silicon node, your RAM bandwidth does not. So the only way to feed your ALUs is to get more use out of your caches. To increase the hit ratio you need fewer threads, and that can only be achieved through latency-oriented techniques.
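A quick back-of-envelope to show the shape of the problem (the numbers are made up for illustration, not measurements): if off-chip bandwidth stays fixed while the ALUs' demand doubles each node, the required hit ratio climbs so that the miss ratio halves every time.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not measurements: */
    double dram_bw = 200e9;   /* bytes/s of off-chip bandwidth (fixed)   */
    double demand  = 1e12;    /* bytes/s the ALUs want to consume        */

    for (int node = 0; node < 3; ++node) {
        /* Misses must fit in DRAM bandwidth: (1 - hit) * demand <= dram_bw */
        double min_hit = 1.0 - dram_bw / demand;
        printf("demand %6.1f GB/s -> caches must satisfy >= %.1f%% of it\n",
               demand / 1e9, 100.0 * min_hit);
        demand *= 2.0;        /* next node: twice the ALUs, same DRAM bandwidth */
    }
    return 0;
}
```

With these numbers the required hit ratio goes 80%, 90%, 95%: the tolerable miss ratio halves every node, and the only way to get there is fewer threads touching the same cache.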
This isn't some recent problem. It has been going on for years. Early graphics chips relied on multi-pass techniques to make pixels a little more interesting. All data was read from RAM, sent through the pipeline, written back to RAM, read back from RAM in the second pass, and then written back to RAM. Things evolved to single-pass techniques because we ran out of RAM bandwidth to keep doing that. But that meant we needed on-chip storage for the soon to be reused data. To avoid needing tons of it to cover the entire latency of the fixed-function pipeline while the ALU:TEX ratio kept increasing, the solution was to decouple the arithmetic and texture pipelines. That's a latency-oriented optimization, and it was a necessary step toward the much celebrated unification of vertex and pixel processing.
The bandwidth wall is real. And it has to be dealt with every time a new process node gives us more transistors to work with. It's not going away, and it's only getting worse. And yes, the dark silicon issue is real too, but fighting it by moving data around only makes the bandwidth wall hit you in the face faster. Fighting the bandwidth wall head-on by aggressively providing more bandwidth isn't smart either, because an off-chip DRAM access takes far more power than an SRAM access.
The solution for the near future is to make it an on-chip DRAM access. But everyone agrees that's only going to work for one or two process generations, and it comes with a considerable added cost. So it's not the only avenue being pursued. NVIDIA has revealed that it is also experimenting with an architecture that has a really tiny register file right next to the ALUs. The purpose of this seems very similar to the CPU's bypass network: to minimize the latency between dependent arithmetic instructions.
If you look back a bit, AMD's VLIW architectures could issue up to 5 different ops from the same thread in one cycle. That actually even beats Intel's 4 wide OoOE cores. Does it help a lot or is it even necessary? Obviously not as usually one can exchange ILP for DLP and TLP in throughput oriented cores quite easily. It's just a matter of balancing execution latencies and data access latencies with the expected amount of DLP and TLP to keep everything running at full load with minimized hardware effort.
I mostly agree. Note though that Intel has had only 3 arithmetic ports so far (Haswell adds a fourth), and out-of-order scheduling and Hyper-Threading improve the occupancy. Trying to statically schedule 5 instructions from a single thread was clearly overkill. VLIW4 improved things, but there are signs that AMD went too far in abandoning multi-issue with GCN. Dual-issue appears to be the sweet spot for this generation.
As detailed above, the pressure is on to keep improving the IPC per thread to reduce the number of threads. Multi-issue plays a role in that but as you correctly point out it should not be exaggerated. It is clear that future GPUs will need moderate amounts of many different techniques. But they have one thing in common: they're all used by CPU architectures. CPUs in turn can learn from GPUs by utilizing multi-core, SMT, wider SIMD units, gather/scatter, FMA, long-running instructions, etc.
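To make those last few items concrete on the CPU side, here's a small AVX2/FMA sketch (function and array names are mine, purely for illustration) doing a gathered multiply-accumulate with the intrinsics Haswell introduces:

```c
#include <immintrin.h>

/* a[i] += w[idx[i]] * x[i], 8 floats per iteration.
 * Gather and FMA are exactly the kind of GPU-ish features AVX2 brings. */
void gathered_fma(float *a, const float *w, const int *idx,
                  const float *x, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i vi = _mm256_loadu_si256((const __m256i *)(idx + i));
        __m256  vw = _mm256_i32gather_ps(w, vi, 4);   /* gather (AVX2)      */
        __m256  vx = _mm256_loadu_ps(x + i);
        __m256  va = _mm256_loadu_ps(a + i);
        va = _mm256_fmadd_ps(vw, vx, va);             /* fused multiply-add */
        _mm256_storeu_ps(a + i, va);
    }
    for (; i < n; ++i)                                /* scalar tail        */
        a[i] += w[idx[i]] * x[i];
}
```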
The convergence may appear slow and bumpy when comparing one year to the next, but over multiple years there has clearly been a phenomenal exchange of technology between the CPU and GPU already. The most logical hypothesis is that the old but strong forces which have caused/allowed this will continue to drive more convergence. I also like to think that the thousands of Intel engineers who put so much effort into AVX2 believe it will be of great relevance for many years to come (and not become a burden instead), and that they made it extendable to 1024-bit for good reason.
You mean like the 8 MB of register files of a Tahiti GPU? I want to see that in a CPU with heavily multiported reg files running at 4 GHz.
Tahiti does not have 8 MB of register files. They may call them that, and they may be used that way from a software perspective, but they are not register files in the same sense as the ones you're trying to compare against in the CPU. They are SRAM banks. And with that cleared up, the CPU does have copious amounts of SRAM already. Furthermore, the experimental NVIDIA architecture I mentioned earlier has an 'operand register file' of just 16 entries.
So you really have to look at the entire storage hierarchy and not single out any layers that happen to be called the same due to historic reasons more than anything else. That said, the CPU can increase its latency hiding capabilities with long-running vector instructions, which necessitates growing the register file. AVX-1024 on 512-bit execution units would increase the architectural register space fourfold over what we have today and double the latency hiding per thread. How much the physical register space would grow depends on the interaction with out-of-order execution and Hyper-Threading, but in any case the register space would no longer be small compared to the L1 space, which is certainly atypical compared to a legacy CPU, and more GPU-like.
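The back-of-envelope behind that fourfold figure, assuming the architectural register count stays at 16 like today's AVX:

```c
#include <stdio.h>

int main(void)
{
    const int regs = 16;  /* architectural vector registers in x86-64 AVX */
    printf("AVX2, 256-bit regs : %d bytes\n", regs * 256 / 8);   /*  512 B          */
    printf("AVX-1024 regs      : %d bytes\n", regs * 1024 / 8);  /* 2048 B, i.e. 4x */
    /* And a 1024-bit instruction issued to 512-bit execution units occupies
     * them for two cycles instead of one, so each instruction in flight
     * covers twice as much latency per thread. */
    return 0;
}
```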
Or all the write combining and memory access coalescing going on to use the available bandwidth as efficiently as possible, which causes a massive (about one order of magnitude) latency penalty that would be completely prohibitive for a latency oriented task? Putting this on top of a latency optimized cache architecture would just burn an unnecessary amount of power. It's very likely more efficient to keep these structures separate, maybe save for the LLC (with special provisions to avoid excessive thrashing of the data for latency optimized tasks by throughput tasks).
It used to make a lot of sense to keep the GPU's memory separate, and yet for a while now the most frequently sold GPU is one that uses system memory. And now you're telling me it makes sense to share the LLC... Can't you see the trend here? I call that convergence, and apparently with every step of it we obtained qualities that are considered better than what separation gave us. It used to also be inconceivable to unify vertex and pixel processing due to prohibitive differences, and yet one step at a time and perhaps not fully consciously we were converging them, until the final leap was made.
What I'm trying to say here is that CPU-GPU unification will not happen overnight, and there are most definitely many more issues to overcome. But rather than looking at these as prohibitive and throwing in the towel, there are qualities of unification that are worth pursuing which should at least motivate us to try and take the next convergence step.
I'm taking it as a challenge. It is why I started this thread: to look for part of the solution for the remaining fixed-function logic. The write combining and read coalescing aspects are definitely also interesting, but again I see ways to continue converging them one step at a time. The CPU already does write combining, and Intel claims that Haswell does not suffer from bank conflicts despite having two read ports and a write port, while it also implements gather. Furthermore, long-running vector instructions would easily allow the minimum latency to be a bit higher. Also note that Nehalem increased the access latency by one cycle and it had no measurable effect on single-threaded performance. SIMD accesses also take a few more cycles than scalar integer accesses. With long-running instructions and Hyper-Threading I'm sure that many more additional cycles of latency can be covered. Also note that FO4 continues to go down while clock frequency stagnates. More work can be done per cycle, which has already resulted in some instruction latencies going down. It is also what allows Haswell to implement FMA at no added latency, and to have four scalar integer units with zero cycle bypass. And despite all that, Haswell is going to be more power efficient. So if improved coalescing is required, I'm sure they can handle it, even if it takes a couple more generations.