The ISA for a UPU

Nick · Mar 14, 2013

Gipsel said:
The argument was specifically for the coprocessor model of the x87 FPU. So if anything, I addressed the exact strawman you introduced to the discussion.

Like I said before, I don't see any fundamental problem with something like Flex FP. Keep calling it a coprocessor for all I care, but the x87 coprocessor most definitely got unified into x86. So you're not going to win this argument by changing the meaning of unified. If there comes an architecture that can extract high ILP and high DLP using a homogeneous ISA, and is fully viable as a CPU and low-end GPU, I will consider my theory proven.

And if you look back through the (hijacked?) Larrabee thread, you will find we (including others, especially 3dilletante for instance sketching the coprocessor model for CUs in the post following the linked one) talked about some kind of a coprocessor model for throughput tasks already quite some time ago. But you want a "true" unification, not some throughput CU(s) sitting next to the latency optimized cores to which the latency optimized cores can shift the execution (or call [asynchronous] subroutines there as I described it in this thread as a variation of it). So we went through all the arguments already almost two years ago. No need to repeat that.

Let's go back in time again to when the GPU's vertex and pixel processing were separate and fixed-function. Anyone suggesting unification would have heard a lot of arguments why it's not a good idea. But when things became programmable, and then floating-point, it had progressed a lot closer to being feasible and desirable. And still even right before the GeForce 8800 was launched, NVIDIA's David Kirk was saying the cost of unification is huge. Today, not having unified shader would be unimaginable.

The morale here is that any person who brought up unification in the early days of the GPU should not be told to shush because they've been over all the arguments already. Clearly they had not.

Neither do I think we've been over all the arguments for CPU-GPU unification yet. Year after year more actual convergence can be observed, and new techniques are identified to continue convergence, as well as hard laws of physics that either enable it or force us into it. But if you don't wish to be part of this exciting new conversation, nobody's forcing you to stay.

So what you are basically saying is that the evolution of games/GPU or 3D tech in general made the vertex load to become a throughput oriented task. So what does this change on my claim that there will be always latency sensitive tasks and throughput tasks?

So-called latency-sensitive tasks still consist of a lot of loops and independent sections. If you vectorize and thread this code, it becomes throughput-oriented. Maybe that's only for parts of it, and it may switch characteristics fast, but you have to do a good job at it because extracting more DLP and TLP is cheaper than more ILP so it's critical in today's power constrained designs. This is why we're getting AVX2 and TSX.

Likewise your throughput-oriented tasks likely contain a lot of code with naturally short latency arithmetic instructions. During those sections you shouldn't be switching from one thread to another, because that causes poor locality of reference which, again, isn't ideal in a power constrained design. Storing large amount of temporary data in distant locations takes a lot of power. So there's a lot of latency sensitivity in what you're calling throughput-oriented tasks. It is why NVIDIA is experimenting with an architecture that has an ORF that is similar in function to the CPU's bypass network, to lower the average instruction latency and keep the last few results in close proximity of the ALUs for reuse in the next few cycles.

And just look at what both of the movements away from pure latency-oriented and throughput-oriented have already accomplished. A lot of today's applications would be unthinkable without SIMD units in the CPU, or unbearable without multi-core. Likewise the dramatic reduction in average instruction latency on the GPU has made it highly programmable and unified. Developers have welcomed each convergence step and have created applications that use the new abilities. But the developers themselves have also pushed for more convergence. John Carmack was a big proponent of floating-point pixel shaders, and GPU manufacturers also had a hard time keeping up with the increase in ALU:TEX ratio of the shaders that developers wrote, which lead to the decoupling of the texture and arithmetic pipelines to achieve lower latencies. Both things were key parts in the unification of vertex and pixel processing. Which then sparked the whole GPGPU idea, which required extending the capabilities even more.

The important thing to observe here is that latency-sensitivity and throughput-orientedness is in fact orthogonal. So it doesn't make a lot of sense to have a latency-sensitive core and a throughput-oriented core. You wouldn't get good results for workloads that are both throughput-oriented and latency-sensitive. What's more, you also wouldn't get good results for workloads that frequently switch from being more latency-sensitive and less throughput-oriented, to less latency-sensitive and more throughput-oriented, because there's considerable overhead in the switching. A unified architecture would no doubt spark the development of a lot of new applications that make it unthinkable to go back.

That's not the point. But the parts of an application posessing a vast amount of DLP (that's the easiest one) and potentially TLP can be characterized as very parallel parts or throughput oriented parts. The parts exhibiting relatively small amounts of DLP and TLP can be characterized as serial or latency sensitive tasks. That's not a distinction I just came up with, you find it in one or the other form already for decades. And I think there is a good reason for that and also that different hardware approaches were developed for tackling these two different classes of tasks.

But which came first, the chicken or the egg?

Do we have software that has tasks targeted at the CPU and tasks targeted at the GPU, because we lack a unified architecture to do it any other way, or do we have a separate CPU and GPU because all the software you can come up with has large sections of code which are purely latency-sensitive or throughput-oriented. Based solely on a snapshot of the situation as it stands today, with no further context, either explanation is as good as the other.

We have to look at history and future trends to see the answer. Or better yet: answers. The GPU was originally created to process one and only one task much faster: old school rasterization graphics. It started off ridiculously simple with one texture unit and one raster unit. But that use case was important enough for gamers to want to buy a graphics card. So at that moment we had a separate CPU and GPU because all of the valued applications at that time had workloads that could be clearly divided into graphics, and everything else. But there was en enormous desire to go back to the flexibility offered by the CPU, and over the years the GPU became highly programmable and latency sensitive. And at the same time, the CPU became a lot more parallel for various other workloads. So now the division isn't so clear-cut any more. You can run OpenCL kernels on either the CPU or the GPU, and for various reasons one or the other could be faster.

So once again I come to no other conclusion than a strong convergence. We are stuck for now with these heterogeneous systems for historic reasons that are fleeting. I'm not saying graphics isn't important any more. I'm saying graphics evolved from single-texturing into something that is both throughput-oriented and latency-sensitive.

That is not exactly limited by the instruction latency. It is usually limited by the task scheduling latencies which are orders of magnitudes larger than instruction latencies and often heavily influenced by driver overhead. That overhead is what AMD and nV want to get rid of (and can do so already in some cases in the latest GPUs).

By instruction latency I meant the latency between the start of one instruction in a thread, and the ability to execute the next dependent one. So aside from execution latency this includes any latency added by scheduling, operand collection, and writeback.

Anyway, if AMD and NVIDIA want to get rid of some of that latency, then that clearly converges the GPU closer to the CPU architecture again.

What hundred cycles latencies are you talking about regarding old GPUs? I have something at the back of my head telling me that older GPUs tended to stall if something wasn't in the L1 texture cache (was still the case with the GF7/PS3 GPU)?

Before the shader cores were decoupled from the texture units (and unified), texturing and arithmetic operations were part of one long pipeline with over 200 stages. This covers the entire RAM latency. The texture cache is only there to lower bandwidth in case of a hit. It would only stall when running out of bandwidth.

Today, they can easily tolerate a few cache misses and hundreds of cycles latency to load it from DRAM without stalling. So something appears to be wrong here. GPUs definitely got more throughput oriented and more latency tolerant to increase their performance.

No, they didn't become more latency tolerant. The GeForce 8800 was optimized for two or three registers per thread. If you needed more than that, a high number of texture fetches would cause it to stall due to not having sufficient threads to cover the latency (assuming a sufficient texture cache hit rate to not stall due to bandwidth).

Davros · Mar 14, 2013

While your here Nick SwiftShader support for earlier directx versions please

Nick · Mar 14, 2013

Gipsel said:
Nick said:

Once you're out of data parallelism, everything becomes latency sensitive. That includes graphics.

Click to expand...

But graphics is still quite a bit away from that point.

No, it's there. It has been at that point for a while even. You see, there is a trivial way to infinitely increase the data parallelism of graphics: process more frames simultaneously. However, this is highly undesirable for real-time graphics since it is observed as lag. So GPU manufacturers have to keep a tight balance between extracting more data parallelism and keeping the lag acceptable. Given that they are already in that situation, i.e. there is no more data parallelism within frames and they can't increase the lag to more frames than the few they already have, this is irrefutable proof that (real-time) graphics is out of data parallelism and therefore has become a latency sensitive workload.

So please allow me to refine my statement: Once you're out of data parallelism, for a given amount of time, everything becomes latency sensitive. It becomes embarrassingly obvious when you read it like: Once you're out of data parallelism, for a given macro-latency, everything becomes micro-latency sensitive.

Arun · Mar 14, 2013

Nick said:
No, it's there. It has been at that point for a while even. You see, there is a trivial way to infinitely increase the data parallelism of graphics: process more frames simultaneously. However, this is highly undesirable for real-time graphics since it is observed as lag. So GPU manufacturers have to keep a tight balance between extracting more data parallelism and keeping the lag acceptable. Given that they are already in that situation, i.e. there is no more data parallelism within frames and they can't increase the lag to more frames than the few they already have, this is irrefutable proof that (real-time) graphics is out of data parallelism and therefore has become a latency sensitive workload.

What in the world are you talking about? The only thing thing that processes multiple frames at a time is AFR and that's for reasons completely different than lack of available parallelism.

GK110 has 15 SMX with a maximum of 2K threads/SMX for a total of 30K threads. That's ~3% of a 1024x1024 rendertarget making the overhead rather negligible - and that's before considering that a single shadowmap cascade is moving to 2Kx2K and the main framebuffer is moving to 1440p (4MPixel) and 4K (8MPixel). Oh, and GK110 can work at full efficiency with less than 1/4th as many threads if there's sufficient ILP to extract (quite likely in typical long programs). So you're literally only processing ~0.5% of a rendertarget at a time...

One case where this can be a real bottleneck is when working on very small rendertargets (e.g. mipmap generation for average luminance in tonemapping) but this can very often be fixed with DirectCompute already today (e.g. parallel reduce for tonemapping). And some GPUs today are already capable of having multiple independent GPGPU kernels running at the same time so there's no switching overhead (in theory this could be extended to graphics but hardly seems required for now). As Gipsel said, the general overhead for starting tasks on modern GPUs is a bigger problem than available data parallelism, and that is being actively optimised by GPU IHVs.

So please allow me to refine my statement: Once you're out of data parallelism, for a given amount of time, everything becomes latency sensitive. It becomes embarrassingly obvious when you read it like: Once you're out of data parallelism, for a given macro-latency, everything becomes micro-latency sensitive.

I agree completely. And it's even more completely irrelevant.

Gipsel · Mar 14, 2013

Nick said:
Let's go back in time again to when the GPU's vertex and pixel processing were separate and fixed-function. Anyone suggesting unification would have heard a lot of arguments why it's not a good idea. But when things became programmable, and then floating-point, it had progressed a lot closer to being feasible and desirable. And still even right before the GeForce 8800 was launched, NVIDIA's David Kirk was saying the cost of unification is huge. Today, not having unified shader would be unimaginable.

That statement was already heavily contested back then.

Nick said:
John Carmack was a big proponent of floating-point pixel shaders, and GPU manufacturers also had a hard time keeping up with the increase in ALU:TEX ratio of the shaders that developers wrote, which lead to the decoupling of the texture and arithmetic pipelines to achieve lower latencies.

I think you mix something up here. Decoupling the TMUs from the ALUs makes it more flexible and also enables better hiding the latency of a memory accesses behind the arithmetic instructions of other wavefronts/warps. It doesn't lower access latencies, but it usually increases throughput.

Nick said:
The important thing to observe here is that latency-sensitivity and throughput-orientedness is in fact orthogonal. So it doesn't make a lot of sense to have a latency-sensitive core and a throughput-oriented core. You wouldn't get good results for workloads that are both throughput-oriented and latency-sensitive.

I'm not sure I get this. Just a simple example.
Latency oriented: compute a certain algorithm on a single or just on a few data elements as fast as possible
Throughput oriented: compute a certain algorithm on lots and lots (for instance a few million) data elements as fast as possible
In the second case, you usually don't care about the latency of a single instruction or memory access at all, even ILP is irrelevant with the right architecture (without compromising performance!), in the first one you care about those things pretty much.

Having two types of cores enables you to offload throughput oriented task to the throughput cores, while the latency optimized cores would take care of the tasks they excel at. For small problems, the efficiency of a TOC drops off quite a bit, so it is sensible to use the LOCs instead. That's the theory of heteregenuous computing that we all know.

It is of course possible to modify the LOCs to become more throughput oriented as you propose. Wider SIMD units are one of the tools to do so. Looking at the efficiency (some measure of performance/Watt, performance/mm², or something like that), it goes up for larger problems compared to the same core with narrower SIMD units. In principle, one can define some kind of a critical problem size (where the efficiency rise levels off) as a property of the design. Larger problems won't increase the efficiency anymore for the given design.
But doing so, will eventually lower the efficiency for very small (i.e. serial/scalar) tasks. For smaller SIMD units, this is easily tolerable as it is a relatively small effect and the advantages outweigh it by far. But putting excessive size SIMD units and/or other throughput optimization techniques into the design, will eventually start to seriously compromise the serial performance. Unfortunately that still is and will be a very important measure for general workloads. So you open up an "efficiency gap" between a "true" LOC with just medium sized SIMD units and an LOC/TOC hybrid. The same gap also exists for pure TOC. But combining a LOC and a TOC enables to switch the execution from one type of core to the other one at the intersection point.

Just as a quickly sketched illustration (nothing is to scale):

The task for the future is to find efficient mechanisms to do so and to move up the efficiency for intermediate sized problems. This will be probably tackled from both sides. TOCs may evolve to show a steeper scaling of the efficiency with the problem size and LOCs will moderately increase their efficiency for these tasks too. This overlap region is the blurry line I mentioned in my prior post.

Nick said:
What's more, you also wouldn't get good results for workloads that frequently switch from being more latency-sensitive and less throughput-oriented, to less latency-sensitive and more throughput-oriented, because there's considerable overhead in the switching.

There is. But will it stay that way? The solution proposed in the old thread and reiterated here would need just something in the range of tens of clock cycles to hand over the task. That amortizes quite fast. Already now we see that modern GPUs successfully reduce the scheduling latency (especially for cases where it doesn't need to involve the driver anymore, i.e. when the GPU creates tasks for itself).

Nick said:
But which came first, the chicken or the egg?

Some problems naturally contain a certain amount of parallelism. So this was observed first of course. That one tried to tune algorithms to expose more parallelism to better fit the architecture came later.

Nick said:
I'm saying graphics evolved from single-texturing into something that is both throughput-oriented and latency-sensitive.

Curently, it is mostly latency sensitive when the software strangle the system with a huge amount of draw calls for instance. But then we are at the task scheduling latency again, which is a different problem.

Nick said:
By instruction latency I meant the latency between the start of one instruction in a thread, and the ability to execute the next dependent one. So aside from execution latency this includes any latency added by scheduling, operand collection, and writeback.

And as I said, this is largely irrelevant for complex workloads with limited parallelism. Task scheduling latency is the killer for that right now. And the GPU vendors actively work on that as I mentioned already a few times.

Nick said:
Anyway, if AMD and NVIDIA want to get rid of some of that latency, then that clearly converges the GPU closer to the CPU architecture again.

No, it enables you to start throughput tasks with a lower latency.

Nick said:
Before the shader cores were decoupled from the texture units (and unified), texturing and arithmetic operations were part of one long pipeline with over 200 stages. This covers the entire RAM latency. The texture cache is only there to lower bandwidth in case of a hit. It would only stall when running out of bandwidth.

I've heard other things about the GF7 architecture. Stalling because of a texture cache miss was a serious problem, as it couldn't be efficiently filled with other fragments to process. The pipeline simply always assumed a fixed latency for a texture operation (i.e. an L1 hit) and tried to keep the ALUs busy with a fixed number of pixels (roughly comparable to processing of just a single but larger warp). Any longer latency would stall the pipeline. Decoupled architectures get around exactly this problem, they can hide the full memory access latency by dynamically scheduling more threads (warps/wavefronts) to fill up the ALU slots (if there are enough threads).

Nick said:
No, they didn't become more latency tolerant. The GeForce 8800 was optimized for two or three registers per thread.

That's way too low. You can't do anything with just two regs. Heck, a lot of instructions use already 3 input registers.

And you would hit the limit of concurrently running warps per SM much earlier than your register file gets full. I would think the allocation granularity in the reg file is at least 4 registers, if not 8 (would have to look it up).

Nick said:
If you needed more than that, a high number of texture fetches would cause it to stall due to not having sufficient threads to cover the latency (assuming a sufficient texture cache hit rate to not stall due to bandwidth).

A high number of texture fetches will always stall the execution, no matter how many threads you run as you can't serve more fetches as you have TMUs in that SM (the number is usually less then the number of elements in a single warp).

It's the same as a continuous stream of moves to/from memory would be limited by the AGUs and/or L/S units on a CPU, no matter what the size of your SIMD engines is or if the CPU can do SMT or not.

The number of threads one can run on one SM (which is indeed often determined by the number of used regs, local/shared memory is another constraint) together with the instruction (mainly the ALU:TEX ratio) mix defines the amount of latency one can hide for an individual access. But you can use more than just two or three regs, believe me.

Arun · Mar 14, 2013

Nick said:
No, they didn't become more latency tolerant. The GeForce 8800 was optimized for two or three registers per thread.

Ah, looks like that good ole Beyond3D article is wrong. I think Rys probably meant 2 or 3 *Vec4* register equivalent, as G80 had 8192 registers for 768 threads, or 10 scalar FP32 registers/thread at full occupancy (2.5xVec4). However further analysis in some very good papers indicated it was usually better to use more registers and let the scheduler extract ILP, so realistically you should have 32 registers/thread with 256 threads.

I've heard other things about the GF7 architecture. Stalling because of a texture cache miss was a serious problem, as it couldn't be efficiently filled with other fragments to process. The pipeline simply always assumed a fixed latency for a texture operation (i.e. an L1 hit) and tried to keep the ALUs busy with a fixed number of pixels (roughly comparable to processing of just a single but larger warp). Any longer latency would stall the pipeline. Decoupled architectures get around exactly this problem, they can hide the full memory access latency by dynamically scheduling more threads (warps/wavefronts) to fill up the ALU slots (if there are enough threads).

That seems extremely unlikely. The GF7 scheduler worked on 1024 pixels at once (see: branch divergence) in quads, i.e. 256 pipeline slots could theoretically be filled by the available parallelism.

And since they nearly certainly had only a single instruction pointer, that would mean the full 1024 pixels would have to stall if a single pixel didn't hit the L1 cache! The goal of a GPU's L1 texture cache is to extract spatial locality, not temporal locality, so the hit rate is certainly <80% and this would cause the entire pipeline to stall nearly all of the time.

It seems much more likely that they optimised for normal memory latency and can hide that perfectly. But the problem is memory latency is far from being a constant and it varies depending on contention from other subsystems, so in bandwidth-heavy scenarios (but not necessarily bandwidth limited at all!), you will stall a significant amount of time because of latency spikes. One advantage of a decoupled architecture is that spikes can be mostly hidden as long as the average is low enough.

EDIT: And Davros wins the internet for correctly answering the eternal Chicken or Egg question below. Have a cookie!

Davros · Mar 14, 2013

Nick said:
But which came first, the chicken or the egg?

Actually it was the Egg.

Nick said:
You see, there is a trivial way to infinitely increase the data parallelism of graphics: process more frames simultaneously.

You cant because you dont know what future frames are going to be

Gipsel · Mar 14, 2013

Arun said:
The GF7 scheduler worked on 1024 pixels at once (see: branch divergence) in quads, i.e. 256 pipeline slots could theoretically be filled by the available parallelism.

And since they nearly certainly had only a single instruction pointer, that would mean the full 1024 pixels would have to stall if a single pixel didn't hit the L1 cache! The goal of a GPU's L1 texture cache is to extract spatial locality, not temporal locality, so the hit rate is certainly <80% and this would cause the entire pipeline to stall nearly all of the time.

It seems much more likely that they optimised for normal memory latency and can hide that perfectly. But the problem is memory latency is far from being a constant and it varies depending on contention from other subsystems, so in bandwidth-heavy scenarios (but not necessarily bandwidth limited at all!), you will stall a significant amount of time because of latency spikes. One advantage of a decoupled architecture is that spikes can be mostly hidden as long as the average is low enough.

That explanation sounds reasonable. But besides what's the exactly chosen latency that can be covered, it is still a static one determined by the design. My last sentence in that paragraph characterizing the difference to later GPUs therefore still holds: Decoupled architectures get around exactly this problem, they can hide the full memory access latency by dynamically scheduling more threads (warps/wavefronts) to fill up the ALU slots (if there are enough threads).
As you said, it's an optimization not to stall before one hits the bandwidth limitation.

Nick · Mar 14, 2013

Gipsel said:
It only becomes harder if you assume that the serial tasks die out eventually. I give you that there is no hard line of distinction and that this blurry line even moves over time. But as the workloads not only get more complex but also simply larger (which is often the easier route), this line will be there for the foreseeable future and I would predict it will still be there for quite some time in the unforeseeable future.

You're absolutely right that there's ever more data parallelism within each frame due to more complex workloads. However, those complex workloads also come with new dependencies. And dependencies means latency sensitivity. It is why they've been walking the line between throughput and lag for some time now, and the only solution to keep scaling is to keep lowering the average instruction latency.

Nick said:
Nick said:

Sure, these so-called throughput oriented workloads can use thousands of ALUs instead of a few, but from the individual ALU's point of view it's all just a thread with similar characteristics to the one you'd feed a scalar core. Again, once you're out of data parallelism, which means your SIMD units can't be made wider without losing performance, everything becomes latency sensitive.

Click to expand...

You don't have to make the SIMDs wider over time. Where did you get this idea? The SIMD width does not determine if you have a latency or throughput oriented architecture. It may move your core a bit along that axis, but a lot of other factors come into play there too.

I didn't say you have to keep making the SIMD units wider over time. It was a theoretical argument, and it may or may not be a wise thing to do it in practice due to, as you mention too, a lot of other factors. My claim is that these other factors includes latency. But before I get back to that, I fully agree you don't necessarily need wide SIMD to extract data parallelism. You can extract the exact same data parallelism with fully independent scalar cores, but that would be hugely wasteful. So SIMD units are used to extract the pure DLP in an area and power efficient manner. So under the assumption that we don't want to be wasteful, and eliminating other factors to not widen SIMD width (branch and task granularity), does SIMD width have an effect on latency - that's the question at hand.

In other words, with very high DLP and no granularity issues to worry about, would you want to widen your SIMD units? The answer is yes, but with every step you'll expose more latency sensitivity in your workload. And again you only have to look at the history of GPU architectures for proof. There has been a massive increase in ALU count, which went both to wider SIMD and more cores, but instruction latencies had to go down because there are constraints on on-die storage area, bandwidth, wire power, and lag.

It may not exactly double, but it goes up and will continue to do so. And the internal bandwidth (cache, registers) usually doubles. It can bring you quite far with moderate cache size increases. Just look at the developement in the past.

I'm not in the least denying that bandwidth keeps increasing and caches take us quite far. But it's simply not enough to keep up with the faster increase in computing power. Using only increased pin count and frequency and SRAM cache is just going to completely dominate the chip area with structures for these things in only a couple generations. And I'm not even talking about the power implications of such as dumb 'keep doing what we're doing'-approach. A DRAM cache is hailed as the next necessary step, but again that's not a silver bullet. It just fills a gap in the memory hierarchy that has formed and comes at a significant cost. Being too aggressive with it is just going to raise the cost to the point where other techniques are cheaper. And in fact modest amounts of other techniques should be combined with it for the optimal cost-effective result.

GPUs have always combined direct increases in bandwidth, cache technology, and latency reduction to improve the cache effectiveness. There hasn't been any disruptive breakthrough technology that is cheap enough to be a silver bullet and allow ignoring any of the other techniques. A DRAM cache is just another cache level, and like the cache levels that were added before it, it won't eliminate the need for latency-oriented optimizations optimizations.

Nick said:
Nick said:

So the only way to feed your ALUs is to get more use out of your caches. To increase the hit ratio you need fewer thread, and this can only be achieved through latency-oriented techniques.

Click to expand...

That's actually not true in the general case. Assuming you have a lot of DLP and neighboring work items ("threads" in nV speak) are processing a lot of spatially related data (which is often the case), the hit rate is probably very close (may actually be better in an extreme case) to the case where you process them one ofter another. Spatial coherence can be used the same way as temporal coherence. GPUs traditionally rely to a great deal on the former, CPUs on the latter (doesn't exclude neither to use both).

Note my highlights.

That's a lot of assumptions. And they all have to be largely true for your conclusion to be true. For argument's sake let's say each of them is 80% true, and you have a cascade of three of them. That means your end result is 51% true. Phew, you made it. Indeed today's GPUs assume all these things to be mostly true and it is why we still have have heterogeneous systems. But it's hanging by a thread. There are already workloads where these assumptions do not hold and the GPU is put to shame by the CPU. Guess what's going to happen when CPU's gain GPU qualities with AVX2 and beyond.

See, I'm not going to claim the opposite, in that a unified architecture would be better than a heterogeneous at any workload. But we're witnessing that the assumptions under which the GPU is faster, are getting less certain. DLP is limited, spatial correlation between threads is limited, correlated threads may run on other cores and not have access to each other's caches, etc. And to the extent that the assumptions are still true, the CPU is benefiting from them with technology borrowed from GPUs.

This is why Kepler is worse at GPGPU than Fermi. NVIDIA had to specialize for the one area where the assumptions still hold the strongest. But as detailed above even graphics is suffering to keep the assumptions about its characteristics true enough while staying within the imposed limitations of lag, bandwidth, power, and cost.

silent_guy · Mar 14, 2013

Can you clarify the lag argument, because I don't get it: are you saying that a GPU, that is: the hardware, not the driver that runs on the PC, is working on multiple frames in parallel?

If so: I find that very unlikely.
If not: then how does lag enter the picture at all?

OpenGL guy · Mar 14, 2013

Nick said:
No, it's there. It has been at that point for a while even. You see, there is a trivial way to infinitely increase the data parallelism of graphics: process more frames simultaneously. However, this is highly undesirable for real-time graphics since it is observed as lag.

It's also undesirable as it breaks any effect that depends on the results of the previous frame(s). It also would wreak havoc with things like occlusion queries where previous frames contain the queries for later frames.

Gipsel · Mar 14, 2013

Nick said:
You're absolutely right that there's ever more data parallelism within each frame due to more complex workloads. However, those complex workloads also come with new dependencies. And dependencies means latency sensitivity. It is why they've been walking the line between throughput and lag for some time now, and the only solution to keep scaling is to keep lowering the average instruction latency.

As said already, forget the instruction latencies, they are hardly relevant. If they are relevant, it's not a throughput problem anymore and you are probably better off with a CPU core anyway.

Nick said:
That's a lot of assumptions. And they all have to be largely true for your conclusion to be true. For argument's sake let's say each of them is 80% true, and you have a cascade of three of them.

It's basically just a single condition for a througput task: high spatial locality, which is often the case, graphics being a prominent but far from the only example. The caches usually do just fine for that.

Nick said:
There are already workloads where these assumptions do not hold and the GPU is put to shame by the CPU. Guess what's going to happen when CPU's gain GPU qualities with AVX2 and beyond.

Now we are back again at square one: There are always going to be latency and throughput oriented tasks, where the line is blurry in the intermediate range and may move over time. Try to figure out, which classes your examples belongs to.

Btw., probably a significant part of that second example is done by fixed function hardware and not by programmable hardware at all (at least major parts of the decoding).

Nick said:
But we're witnessing that the assumptions under which the GPU is faster, are getting less certain. DLP is limited, spatial correlation between threads is limited, correlated threads may run on other cores and not have access to each other's caches, etc.

We see that the assumptions under which a multiple issue CPU with SIMD units is faster than a simple and slow single issue design without SIMD, are not certain. The ILP is limited, the amounts of accessed data can be far greater than any cache, we may have to snoop other caches, etc. Why not going back to an old 486 just with a wider external bus? The answer is simple: Because it would be slower on average!

Nick said:
This is why Kepler is worse at GPGPU than Fermi. NVIDIA had to specialize for the one area where the assumptions still hold the strongest. But as detailed above even graphics is suffering to keep the assumptions about its characteristics true enough while staying within the imposed limitations of lag, bandwidth, power, and cost.

It was a strategic decision to abandon the hotclock and expensive dynamic scheduling in favour of more raw power, which may (dependent on workload) be used at a lower rate, but still provides a lot better performance/Watt. That is valid also for compute tasks. That nVidia uses relatively small (compared to AMD) register files and local memory which reduces the performance in some scenarios has nothing to do and does not mean that nV "specialize for the one area where the assumptions still hold the strongest". They accepted that obviously as a compromise to increase the average performance/Watt more, than what would have been possible in another way.
You are definitely too fast to declare the death of improvements for GPGPU and even graphics rendering. Your are simply wrong with the latter and you completely miss the trend set on AMD's side with GCN (being a tremendous update for compute) or for instance the fundamantal move to make smaller tasks more efficient.

Nick · Mar 14, 2013

Gipsel said:
Nick said:

And yes, the dark silicon issue is real too, but fighting it by moving data around is only making the bandwidth wall hit you in the face faster.

Click to expand...

And again, you don't need to move any more data! You just move it from RAM or the LLC to a different unit. The product of moved bits*distance is constant.

Gipsel said:
Nick said:

NVIDIA has revealed to also be experimenting with an architecture that has a really tiny register file right next to the ALUs. The purpose of this seems very similar to the CPU's bypass network; to minimize the latency between dependent arithmetic instructions.

Click to expand...

No. The purpose is to minimize the average distance the data has to move. It's mainly a power optimization.

You just contradicted yourself. And no, you can't strictly partition the data between things that can and things that can't move a shorter distance. If that was the case, why use the LLC instead of RAM if it all stays put anyway? Clearly you think something is worth sharing, but why end at the LLC?

There's a sliding scale between data locality. Results that just rolled out of your ALUs can be either reused the next cycle, or a few cycles, or a few tens of cycles, or hundreds, thousands, millions, billions, or never. And yes, you could theoretically analyse some of these things ahead of time and bypass caches, but in my experience it's terribly easy to shoot yourself in the foot. The access pattern depends on what algorithms the application developers use, as well as on the user input. And even if you want to dynamically adapt to each situation, that takes a lot of latency-sensitive analysis.

It may sound terrible to erase your L1 cache with data that isn't going to be reused until the next frame, but that volume of data is negligible compared to the L1 bandwidth. It will be refilled with relevant data before you know it. It's a tiny and ever still smaller price to pay to allow developers to create whatever they like while being hardware-agnostic. It's not hard to find horror stories of GPGPU developers who found that on one vendor's hardware they had to use a completely different kind of temporary storage to get acceptable results. The average developer just won't go that far to rewrite their software multiple times.

The cost of software development is continuously rising and this is an often forgotten part in the cost of a platform as a whole. Your hardware can be insanely fast if used the right way, but if it's too high a cost to hire specialists, while only part of your customers will benefit from it, it's just not worth the ROI. AMD recognized the problem of requiring ninja programmers for GPGPU, but HSA does not eliminate the deeper issues and won't be able to compete with the broad and sustainable ROI from AVX2+.

Btw., Radeon GPUs had a bypass network for ages already.

Interesting. Could you point me to anything that describes it? Thanks.

It clearly wasn't, considering the traditional prevalence of 4 component vector operations in the graphics ecosystem and the tremendous compute density this enabled. It just wasn't the best fit anymore for more modern workloads.

That's what I said. VLIW5 and even VLIW4 became overkill due to changes in workload characteristics, so they switched to single-issue (which I don't think is the right move since modest multi-issue does have valuable benefits).

What are those signs? I'm curious.

I understand your question is genuine but I'm not going to derail this thread with a big AMD versus NVIDIA architecture discussion. Anything relevant to unified architectures has already been discussed and it would take too much weeding through other architectural differences to get to a conclusive agreement about single-issue versus dual-issue. This is a tiny difference anyway compared to the CPU's IPC versus the GPU's CPI which is far more relevant to the topic here. If you're adamant about it I'd be happy to share my opinions on GPU to GPU differences if you cared to create a new thread about it.

In fact, a GCN CU is able of multiple issue. One CU can issue up to 5 instructions per clock, just not 5 instructions for the same thread as with VLIW5 (which would require ILP). It can issue instructions for up to 5 threads in the same cycle (instructions going to different kind of execution units, one can't issue 5 vector ops of course). This doesn't require (expensive) dependency checking. It appears sensible to me.

I know these things but I have to admit I'm not entirely sure what to call this multi-issue-but-not-from-a-single-thread behavior. It's certainly sensible in the sense that GF100 was also sensible to be single-issue but note that GF104 added dual-issue and both GK104 and GK110 inherited it. VLIW5/4 long made sufficient sense too. But if AMD went from quad-issue to single-issue while NVIDIA went from single-issue to dual-issue... I'll let you finish that sentence, or I can do it in another thread.

And what you think CPU registers are made of? Fairy dust and unicorns? It's basically also just a very fast, small, multiported SRAM made of flipflops.

It's uncalled-for to ridicule it. There is overlap between register designs and latch designs and SRAM designs and DRAM design. But it generally ranges from lots of transistors to one, in that order. Each is worthy of its categorization despite the overlap.

What CPUs use for their register storage and what GPUs use for their register storage is radically different. And that's because it's just not at a comparable level of the storage hierarchy. So the large capacity difference should not be mistaken for an insurmountable advantage. CPUs have plenty of temporary storage of various kinds. In fact they were once criticized for spending a lot of area on caches versus GPUs. So it's quite an ironic argument. And as discussed before, the CPU has sufficient opportunity to adjust its register file size if need be.

The difference is that GPUs can use basically standard SRAM macros for it (wouldn't be very practical otherwise), while CPU registers usually need some more extensive effort to get it to run at multiple GHz and zero cycle latency (which also burns quite a bit of power and needs more space).
Most SRAM in CPUs are caches for the memory hierarchy, most SRAM in GPUs isn't.

Exactly. So we can't compare it as such. Their usage and purpose differs greatly but the results obtained are much more closely comparable than the differences appears to make some assume.

Nick · Mar 14, 2013

Gipsel said:
It's the same guy, Bill (William J.) Dally working for nV and Professor at Stanford University.

edit:
A paper describing a possible future chip used to be hosted here, but it is still down as consequence of nV's struggle with some security issues. I'm not exactly sure, but it could be this paper, which someone mirrored here. A presentation of this architecture (with some differences in details) can be found in this video (starting at about 30 minutes into the talk).

Ah, interesting. Well then he's the guy in the NVIDIA video I linked to earlier with the 16-entry operand register file (ORF) that clearly would bring their GPUs closer to a CPU architecture.

Exophase · Mar 14, 2013

You know Nick, I've read hundreds of posts from you on several forums over the years, and almost all of them are about you defending your vision of unified computing.. with the same arguments over and over again, often with the same people.

This thread is inextricably tied to that vision, but I thought it might be interesting to actually discuss what useful ISA additions would be for replacing fixed function GPU hardware, regardless of what anyone thinks the practical implications are. But if you'd rather continue your same arguments and ignore posts that are actually on topic just let me know, so I don't waste my time again.

Nick · Mar 14, 2013

Exophase said:
Staying out of the oft-repeated latency vs throughput optimized core argument and more in response to the original question..

Thanks!

- Linear interpolation instruction

As far as I'm aware all latest GPUs implement it using just FMA.

- Anything to make perspective correct barycentric coordinate calculation faster/easier, including a single instruction that does just that (but can probably be generalized down into something simpler)

It was my understanding that some but not all GPUs do all of that in the shader cores too with just FMA and DIV.

- More variable precision control for long latency operations like division

SSE/AVX has (v)rcpps for a rough reciprocal approximation, which can be refined using Newton's method. I wouldn't mind one that better handles corner cases though.

- Mixed precision integer multiplication

Care to elaborate on uses for this? There are various SIMD instructions for integer multiplication of a certain width, but not "mixed" width.

- Mixed precision select instructions - like for instance, you may only need one byte or even one bit to know if you want to select a field (blend instruction in x86 parlance) that could be several bytes

That should be easy. blend with a select register actually consists of a movmsk micro-instrution who's result gets passed down to the immediate argument blend instruction. This might even have been suggested on the Intel forum before, if my memory serves me...

- To complement the above, compare/test instructions with bitwise outputs; the useful thing here is you can perform a ton of logical operations in parallel with these, if you accumulate a bunch in a vector in a later pass.. of course you can use a parallel pext to compress these as well

Indeed seems useful not to waste AVX registers for masks.

- Some sort of vector sort instruction

What I'm thinking for the last one is if you're doing say, a TBDR in a software rasterizer. You calculate triangle rasterization (visible or not) and depth values over a bounding box for a triangle and accumulate a tile's worth of depth values and triangle IDs. Now I might not be thinking this through properly since I haven't done a software TBDR, but it seems that at this point the conventional step would be to use table lookups to get triangle setup parameters for each pixel. It might make sense to be able to sort by triangle ID and render the same ID in-order. This could also be used to efficiently handle multiple draw calls in the same tile.

Interesting, but I think various discussions about TBDR have concluded that it is not suitable for a tessellated scene.

Sorting is of great importance in a lot of software though. But I wonder if a wide horizontal vector operation like that can be implemented efficiently.

Exophase · Mar 14, 2013

Alright, thanks for getting to my post, you can disregard the last frustrated one

Nick said:
Thanks!

As far as I'm aware all latest GPUs implement it using just FMA.

I don't know about the ISAs on the big discretes (which I need to look over again sometime), but I do know that there are at least mobile GPUs that have instructions that perform MUL + ADD in more complex arrangements than just FMA. They also have simple input modifiers like 1 - X and small shifts/multiplications. I don't really know what of this makes sense to incorporate into functional units but it could help. SSE does have a lot of pretty specialized instructions as well, but not that focused on graphics.

Nick said:
It was my understanding that some but not all GPUs do all of that in the shader cores too with just FMA and DIV.

Based on what I've read, when people say dedicated interpolation units have been dropped they mean the application of barycentric coordinates, which is linear-interpolation, not the calculation of the barycentric coordinates themselves, which is non-linear. I don't know what the latency is like for divides in GPUs so I don't know if there's any cost to having a big dependency on one early on or not.. but for the sort of ISA in a unified system that's more CPU like I could see anything that gets around having to deal with the divide latency upfront as useful.

Nick said:
SSE/AVX has (v)rcpps for a rough reciprocal approximation, which can be refined using Newton's method. I wouldn't mind one that better handles corner cases though.

It's variable, but it requires multiple instructions and not great granularity. Anyway, it'd apply more for dedicated divider units that use subtraction based methods as the current ones do, still getting optimized alongside the options for multiplicative based methods.

Nick said:
Care to elaborate on uses for this? There are various SIMD instructions for integer multiplication of a certain width, but not "mixed" width.

These are pretty common in DSPs.. I don't know where they apply in modern high end GPUs but I know you don't have to dig back terribly far to find more odd multiplication width pairs in GPUs.. they might still lurk in fixed function stuff. But I don't have applications outside of things like emulating archaic platforms like Nintendo DS

Nick said:
That should be easy. blend with a select register actually consists of a movmsk micro-instrution who's result gets passed down to the immediate argument blend instruction. This might even have been suggested on the Intel forum before, if my memory serves me...

Good, they should do it then

Nick said:
Interesting, but I think various discussions about TBDR have concluded that it is not suitable for a tessellated scene.

Was LRB TBDR or was it just tiling? I think some form of tiling is a must for rendering on something with a CPU-like cache hierarchy/bandwidth. Why is it not suitable w/tessellation?

If you don't have TBDR similar arguments can be made for hierarchical/early Z, I think.

Nick said:
Sorting is of great importance in a lot of software though. But I wonder if a wide horizontal vector operation like that can be implemented efficiently.

I don't know how efficient hardware can do it, but I'm pretty confident that a sort network in hardware would be a lot better than the shuffles and masks you have to use now.

I also wonder if a merge instruction would be cheaper than a full sort instruction.

Intel had a paper on parallel merge sort w/SIMD.. it was interesting but the kernels looked really expensive.

Further thoughts on fixed function ISA:

What's most useful for acceleration compressed textures? How about table lookup/shuffles over quantities below 8-bits? Or using different index widths vs access widths? I use the 8-bit lookup instructions in NEON a fair amount, but often (for graphics things) what I really want to do is lookup 16-bit values with an 8-bit index, requiring two lookups and an interleave. Sometimes what I really want is to lookup 16-bit values using only 4-bit indexes. Parallel pext helps a lot with this, but being able to do it directly is even better.

Nick · Mar 14, 2013

Davros said:
No it does but on the other hand 2 separate chips will allways be faster (try a but a geforce 680 and an intel sandybridge on the same die it cant be done)

That's an interesting twist. I'm not suggesting to unify something like a GTX 680 at all, I'm suggesting to unify integrated graphics. So you agree the latter is likely to happen?

No they arnt, (clbench)

That would fall under the definition of "trading blows". I gave examples of CPUs beating (integrated and discrete) GPUs. You're giving examples of (discrete) GPUs beating CPUs. Fits my argument just fine, especially since integrated GPUs are far weaker.

Nick · Mar 14, 2013

tekyfo said:
Well, each one of your instructions works on 4 times the data. So, the number of vectors that fit into the caches is quartered.

Four times what data precisely? Where does this factor of 4 come from? Are you referring to SSE registers having four 32-bit elements? AVX-256 has eight.

Ignoring for now that pixels have to be processed in quads anyhow, yes, that does imply that the caches have to be shared by four or eight pixels (and then there's fibers/strands and Hyper-Threading). And indeed that's the exact argument that I'm using against the GPU's massive use of threading...

But that's what convergence is about. There's a sweet spot somewhere between today's CPU architectures and today's GPU architectures.

Davros · Mar 14, 2013

"Sweet spot" infers compromise, thats not what we want we want the most powerful cpu possible and the most powerful gpu possible

The ISA for a UPU

Nick

Davros

Nick

Arun

Unknown.

Gipsel

Arun

Unknown.

Davros

Gipsel

Nick

silent_guy

OpenGL guy

Gipsel

Nick

Nick

Exophase

Nick

Exophase

Nick

Nick

Davros

Similar threads