The ISA for a UPU

Nick · Mar 6, 2013

Hi all,

For some time now I've been contemplating what the instruction set of a unified architecture should look like. AVX2 is obviously a huge step in the right direction by adding gather support and FMA, but to adequately fulfill the roles of a low-end GPU it seems like a few more things will be needed...

In particular, to implement the fixed-function portions of the GPU we still need a relatively large number of legacy CPU instructions. I observed though that many of these instructions just move small bit fields into the right places. Therefore I think that SIMD versions of the bit-level gather/scatter instructions (pext/pdep) could make a significant difference. Note that unlike fixed-function hardware, they could be used to implement many custom algorithms as well. Programmable rasterization, anti-aliasing, and even texture filtering have lately become popular research topics, but they have yet to be implemented efficiently without sacrificing flexibility or costing lots of single-purpose die area. The pext/pdep instructions could help achieve high efficiency at a low cost, and their uses would even go beyond 3D.
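For readers who have not used pext/pdep, here is a minimal scalar reference model in Python. It only illustrates the bit-level gather/scatter semantics of the existing BMI2 instructions; how a SIMD version would be encoded or how wide its lanes would be is not covered, and the function names simply mirror the mnemonics.

```python
def pext(src, mask):
    """Bit gather: pack the bits of src selected by mask into the low bits."""
    result, out_pos = 0, 0
    while mask:
        lowest = mask & -mask      # isolate the lowest set bit of the mask
        if src & lowest:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1           # clear that bit and move on
    return result


def pdep(src, mask):
    """Bit scatter: spread the low bits of src to the positions set in mask."""
    result, in_pos = 0, 0
    while mask:
        lowest = mask & -mask
        if (src >> in_pos) & 1:
            result |= lowest
        in_pos += 1
        mask &= mask - 1
    return result


# Pull bits 16..23 out of a 32-bit word, then put them back in place.
field = pext(0x12345678, 0x00FF0000)
restored = pdep(field, 0x00FF0000)
```

A single mask can also select non-contiguous bits, which is what makes these instructions more general than a shift-and-mask pair.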

So I'm looking for more examples of instructions that would both aid in implementing legacy functionality and enable developers to create new experiences.

Thanks for your suggestions,
Nick

ssp · Mar 6, 2013

An instruction I wish were available is "extract-and-convert" that would take one of the A/R/G/B fields of a uint32, convert it to single precision floating point, divide by 255, and then store it in the destination. Similarly, convert-and-deposit would be useful for going the other way. As an example,

extcvt xmm0, xmm1, 0x02

would extract the four red channels from xmm1 and deposit the floating point equivalents in the four single precision channels of xmm0. And

cvtdep xmm1, xmm0, 0x02

would take the four floating point channels in xmm0 and deposit the 8-bit equivalents in the red channels of xmm1.

One straightforward generalization would be to support more formats through the immediate. In the example above, "0x02" means that floating point destination n corresponds to the bitfield n * 32 + 16, and that that bitfield is treated as a normalized 8-bit number. That is, the immediate corresponds to three values: the "shift", which is 16, the "pitch", which is 32, and the "type", which is "unsigned normalized 8-bit". Other immediates could correspond to, for example, "shift=5, pitch=16, type=un6", which would extract the 6-bit green channels from the four 565 pixels stored in the lower 64 bits of an xmm register.

It may also be interesting to have a second source that would be concatenated with the first one before doing the conversion. That would allow the channels of four 64-bit pixels with 16-bit channels to be converted into four single precision channels.
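A small Python model may make the shift/pitch/type idea concrete. This is only a sketch of the proposed (non-existent) instructions: registers are modelled as plain integers, the "type" is reduced to a bit count for unsigned normalized fields, Python floats are double rather than single precision, and the rounding and clamping in cvtdep are assumptions, as is the helper pack_lanes.

```python
def extcvt(src, shift, pitch, bits, lanes=4):
    """Model of the proposed extract-and-convert.

    Lane n reads the 'bits'-wide field at bit n * pitch + shift and
    converts it to a normalized float in [0, 1].
    """
    max_val = (1 << bits) - 1
    return [((src >> (n * pitch + shift)) & max_val) / max_val
            for n in range(lanes)]


def cvtdep(dst, values, shift, pitch, bits):
    """Model of the proposed convert-and-deposit (inverse of extcvt).

    Assumption: inputs are clamped to [0, 1] and rounded to nearest.
    """
    max_val = (1 << bits) - 1
    for n, value in enumerate(values):
        field = round(min(max(value, 0.0), 1.0) * max_val)
        pos = n * pitch + shift
        dst = (dst & ~(max_val << pos)) | (field << pos)
    return dst


def pack_lanes(words, width):
    """Hypothetical helper: place words side by side, lane 0 lowest."""
    packed = 0
    for n, word in enumerate(words):
        packed |= word << (n * width)
    return packed


# Four ARGB8888 pixels: the "0x02" case, shift=16, pitch=32, un8.
argb = pack_lanes([0xFFFF0000, 0xFF800000, 0xFF000000, 0xFF330000], 32)
reds = extcvt(argb, shift=16, pitch=32, bits=8)

# Four 565 pixels in the low 64 bits: shift=5, pitch=16, un6.
rgb565 = pack_lanes([0x07E0, 0x0000, 0xFFFF, 0x0400], 16)
greens = extcvt(rgb565, shift=5, pitch=16, bits=6)

# Two sources concatenated: four 64-bit pixels with 16-bit channels.
lo = pack_lanes([0x0000_FFFF_0000_0000, 0x0000_8000_0000_0000], 64)
hi = pack_lanes([0x0000_0000_0000_0000, 0x0000_FFFF_0000_0000], 64)
wide = extcvt((hi << 128) | lo, shift=32, pitch=64, bits=16)
```

The same three parameters cover all three layouts, which is the point of encoding them in the immediate.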

The benefit of these instructions is that a shader can be done with floating point scalar arithmetic without having to keep all the expanded inputs around in registers.

If I'm allowed to redesign the ISA, it could be interesting to make this extract-and-convert functionality available as part of all instructions. That is, for some typical 3-operand instruction:

add d, s1, s2

you would have an additional operand that describes the "shift/pitch/type" format of each of the sources and of the destination.

Novum · Mar 9, 2013

Well, to consider that, you first need to believe that this is actually what will happen, which I do not.

Huge out-of-order speculative cores will always be utterly destroyed in terms of performance/watt by a GPU-like, throughput-oriented architecture. On the other hand, a throughput core will never run a branch-heavy, cache-thrashing algorithm as well as a CPU core, which has the capability to automatically reorder instructions to hide memory latencies and hazards.

Anyway, things like having scatter/gather on CPU will certainly benefit us in other ways, so I'm not against that.

What I really want though is a tight integration of both cores in one chip with low latency to get results back.

Nick · Mar 10, 2013

ssp said:
An instruction I wish were available is "extract-and-convert" that would take one of the A/R/G/B fields of a uint32, convert it to single precision floating point, divide by 255, and then store it in the destination. Similarly, convert-and-deposit would be useful for going the other way. As an example,

extcvt xmm0, xmm1, 0x02

would extract the four red channels from xmm1 and deposit the floating point equivalents in the four single precision channels of xmm0. And

cvtdep xmm1, xmm0, 0x02

would take the four floating point channels in xmm0 and deposit the 8-bit equivalents in the red channels of xmm1.

One straightforward generalization would be to support more formats through the immediate. In the example above, "0x02" means that floating point destination n corresponds to the bitfield n * 32 + 16, and that that bitfield is treated as a normalized 8-bit number. That is, the immediate corresponds to three values: the "shift", which is 16, the "pitch", which is 32, and the "type", which is "unsigned normalized 8-bit". Other immediates could correspond to, for example, "shift=5, pitch=16, type=un6", which would extract the 6-bit green channels from the four 565 pixels stored in the lower 64 bits of an xmm register.

Thanks for your suggestion, that would indeed be quite valuable! My only remark is that it seems a bit too dedicated to me. What if someone wants a 3454 format (not necessarily representing a color)? Can this not be supported efficiently without a hardware update? This is one of the problems I have with today's GPUs. They're really only optimized for what the API specifications say they should be capable of. A developer trying to do something new quickly runs into hardware limitations. The CPU doesn't strongly favor one application or algorithm over the other, and I'd like to see that quality being retained as things evolve toward unification.

With that in mind, I believe the 'extract' part of your 'extract-and-convert' operation would be offered by the pext/pdep instructions, in a more generic form. The 'convert' part, which includes a normalization, does seem quite common and generic so it's worth considering supporting that as a single instruction.
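To illustrate how a generic bit gather would handle an arbitrary layout such as 3454, here is a sketch in Python. The field order (3-bit field in the most significant bits) and the names MASKS_3454 and unpack_unorm are assumptions made for the example; pext is modelled in software rather than taken from any real SIMD instruction.

```python
def pext(src, mask):
    """Scalar model of a bit gather: pack the bits of src selected by mask."""
    result, out_pos = 0, 0
    while mask:
        lowest = mask & -mask
        if src & lowest:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1
    return result


# Assumed layout of a 16-bit "3454" word, most significant field first.
MASKS_3454 = (0xE000, 0x1E00, 0x01F0, 0x000F)


def unpack_unorm(word, masks=MASKS_3454):
    """Extract each field with pext, then normalize it to [0, 1]."""
    fields = []
    for mask in masks:
        raw = pext(word, mask)
        max_val = (1 << bin(mask).count("1")) - 1   # field width from popcount
        fields.append(raw / max_val)
    return fields


# 0b101_1001_10011_0110: raw field values 5, 9, 19 and 6.
raw_fields = [pext(0xB336, mask) for mask in MASKS_3454]
normalized = unpack_unorm(0xB336)
```

Supporting a different layout only means supplying different masks; no format table has to exist in hardware.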

Considering the execution port and latency (including bypass latency) of a conversion instruction, it appears to be typically implemented by a floating-point adder. I imagine it would be easier to do the normalization if the conversion was done in a multiplier. I'll have to give that some thought, since the implementation would have to be very cheap to justify adding an instruction like this...
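As an aside on why an adder is enough for the conversion itself: a well-known software trick places a small integer in the mantissa of the float32 value 2^23 and then subtracts 2^23, so one floating-point subtract performs the whole int-to-float conversion. The sketch below shows that trick in Python and is not a description of how any particular CPU implements its convert instruction; the normalization is written as a separate multiply by 1/255, which is the step a multiplier-based conversion could absorb. All names are hypothetical, and Python computes in double precision here.

```python
import struct

MAGIC_BITS = 0x4B000000    # bit pattern of the float32 value 2**23
MAGIC_VALUE = 8388608.0    # 2**23


def bits_to_float32(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def int_to_float_via_add(n):
    """Convert 0 <= n < 2**23 with one OR and one floating-point subtract."""
    if not 0 <= n < (1 << 23):
        raise ValueError("n must fit in the 23-bit mantissa")
    return bits_to_float32(MAGIC_BITS | n) - MAGIC_VALUE


def normalize_un8(n):
    """Conversion (adder) followed by normalization (multiplier)."""
    return int_to_float_via_add(n) * (1.0 / 255.0)
```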

Nick · Mar 10, 2013

Novum said:
Well, to consider that, you first need to believe that this is actually what will happen, which I do not.

Huge out-of-order speculative cores will always be utterly destroyed in terms of performance/watt by a GPU-like, throughput-oriented architecture.

Please allow me to stop you right there. This seems like a prejudice based on past experience. Since CPU-GPU unification has never properly been done before, I really need you to think outside the box about what could be, not about what has been or what currently is the case.

With that in mind, a Haswell quad-core CPU is expected to deliver about 400 GFLOPS at about 65 Watt (which might be fairly conservative). 65 Watt will get you a GPU such as the GeForce GTX 650, which delivers about 800 GFLOPS. Considering that Haswell really isn't the unified architecture I'm talking about, I don't consider it to be "utterly destroyed" by the GPU's throughput. Also note that the GPU really needs a CPU to run the driver, the application, the operating system, etc. It is completely helpless on its own. So you have to add some of the power consumption of the CPU to make a fair comparison. So it's not even a 2x advantage in theoretical compute throughput per Watt.
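The arithmetic behind that comparison can be written out as follows. The GFLOPS and Watt figures are the ones quoted in the post (theoretical peaks, not measurements). Two things are assumptions added for illustration: the 20 W host CPU share, since the post does not say how much to add, and the breakdown of the 400 GFLOPS figure into two 256-bit FMA units per core, which gives the clock that figure would imply.

```python
# Figures quoted in the post (theoretical peak throughput).
haswell_gflops, haswell_watts = 400.0, 65.0
gtx650_gflops, gtx650_watts = 800.0, 65.0


def gflops_per_watt(gflops, watts):
    return gflops / watts


cpu_eff = gflops_per_watt(haswell_gflops, haswell_watts)
gpu_eff = gflops_per_watt(gtx650_gflops, gtx650_watts)
ratio = gpu_eff / cpu_eff

# Assumption: charge a hypothetical 20 W of host CPU power to the GPU.
host_cpu_watts = 20.0
gpu_eff_with_host = gflops_per_watt(gtx650_gflops, gtx650_watts + host_cpu_watts)
ratio_with_host = gpu_eff_with_host / cpu_eff

# Assumption: 2 FMA units x 8 single-precision lanes x 2 flops per FMA.
flops_per_cycle_per_core = 2 * 8 * 2
cores = 4
implied_clock_ghz = haswell_gflops / (cores * flops_per_cycle_per_core)

print(f"CPU {cpu_eff:.2f} GFLOPS/W, GPU {gpu_eff:.2f} GFLOPS/W, ratio {ratio:.2f}")
print(f"ratio with host CPU share: {ratio_with_host:.2f}")
print(f"clock implied by 400 GFLOPS: {implied_clock_ghz:.3f} GHz")
```

Any non-zero host share pushes the ratio below 2x; how far depends entirely on the share one assumes.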

Now consider that AVX can be extended up to 1024-bit. I don't think I'm going out on a limb when I expect that this, and perhaps a few more tweaks, could help nullify the advantage a CPU+GPU would have over a unified architecture.

Granted, this still leaves a fair bit of dedicated hardware to be implemented in software, but that's exactly why in this thread I'm looking for specialized instructions that could offer much of the same advantage as fully dedicated hardware. Ideally such instructions should still be versatile and enable new algorithms. Note that these new algorithms could run faster on a unified CPU than on a GPU that doesn't have fixed-function hardware for it!

Novum said:
Anyway, things like having scatter/gather on CPU will certainly benefit us in other ways, so I'm not against that.

Think about what you really said there. There will be workloads for which a CPU with AVX2 will be more suited than the GPU. Now, regardless of what they will really look like, imagine what else could be possible with AVX3, AVX4, etc. Eventually the CPU should become pretty darn good at graphics, and you won't be limited by APIs or drivers or heterogeneous behavior. At that point I don't think it will make sense for much longer to keep a GPU on the same die that is only slightly better at very specific tasks. It's better to make the whole chip available to developers by providing a unified architecture.

Novum said:
What I really want though is a tight integration of both cores in one chip with low latency to get results back.

Sure, but why would those cores have to be side-by-side? Note that the x87 FPU used to be a separate co-processor just like the GPU, but then it got integrated straight into the x86 execution pipeline. This eliminated duplicate logic, reduced power consumption, improved latency, and reduced bandwidth needs.

It's a bigger task to unify the CPU and GPU, but I believe we could have the same set of benefits. This is your chance to contribute to what would be a revolutionary change in how we think about computing. Everything can be expressed as code, and to make it run fast we need a versatile architecture that can extract various levels of ILP, DLP and TLP.

Novum · Mar 10, 2013

Nick said:
With that in mind, a Haswell quad-core CPU is expected to deliver about 400 GFLOPS at about 65 Watt (which might be fairly conservative). 65 Watt will get you a GPU such as the GeForce GTX 650, which delivers about 800 GFLOPS.

You are cherry picking your numbers.

GTX 650 is TSMC 28nm, Haswell is Intel 22nm. How can you even compare that? I have no doubt that NVIDIA could at least double the FLOPS/Watt on that Intel process. Also, AMD GPUs are even more efficient (1.2 TFLOPS in that TDP range on 28nm).

Not to forget that the GPU has a vastly more powerful GDDR5 memory interface that gobbles up a lot of that TDP you are stating. AND it has a rasterizer/TMUs etc. that all are additional FLOPS that consume power even if you are not using that logic.

Alexko · Mar 10, 2013

Radeon HD 7950M: 50W, 1792 GFLOPS.

http://www.notebookcheck.net/AMD-Radeon-HD-7950M.72676.0.html

ninelven · Mar 10, 2013

I think the perf/watt issue has been gone over enough...

But I will ask if a uISA is really that important when you can use LLVM with a custom (machine specific) backend?

Gipsel · Mar 10, 2013

Nick said:
I'd like to see [..]
I believe [..]
I imagine [..]

To pick Novum's side here, and as you said he is prejudiced, I have to ask you if your suggestions and evangelization of your "UPU" aren't just wishful thinking.

Nick said:
Note that the x87 FPU used to be a separate co-processor just like the GPU, but then it got integrated straight into the x86 execution pipeline. This eliminated duplicate logic, reduced power consumption, improved latency, and reduced bandwidth needs.

I suggest you look at AMD's implementation. It is still some kind of coprocessor there, and that never changed in AMD's case. The FPU always had its separate scheduler.
What I already said two years(?) ago regarding this topic went roughly along these lines:
Throughput-optimized cores will always be better at throughput tasks, and latency-optimized cores will always be better at tasks requiring low execution latency. In the future, we will likely have vast numbers of transistors available but no possibility to power them all at the same time (the dark silicon problem). The reasonable solution is to put several different cores onto the same die, each optimized for a certain subgroup of tasks. All that is needed then are means of distributing the right parts of a program to the right parts of the processor, and a low-latency communication path for handing over tasks and synchronizing the execution. A coprocessor model isn't the worst thing one could think of. Each "CU" should of course have its own instruction fetch (and L1). One just calls a subroutine for asynchronous execution over there [acall mem64 kernel, mem64 parameterlist] and can check for its completion through some synchronization event, all that with just a few nanoseconds of latency, of course.
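The acall idea can be pictured with a software analogy. In the sketch below a thread pool stands in for the separate throughput unit and a Future stands in for the synchronization event; the names acall, throughput_unit and saxpy_kernel are hypothetical, and nothing here reflects the few-nanosecond latency the hardware version is meant to have.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Assumption: one worker pool plays the role of the separate throughput "CU".
throughput_unit = ThreadPoolExecutor(max_workers=1)


def acall(kernel, parameter_list) -> Future:
    """Stand-in for 'acall mem64 kernel, mem64 parameterlist'.

    Returns immediately; the Future acts as the synchronization event.
    """
    return throughput_unit.submit(kernel, *parameter_list)


def saxpy_kernel(a, xs, ys):
    """A throughput-style kernel: the same operation on every element."""
    return [a * x + y for x, y in zip(xs, ys)]


# The latency-optimized core hands the work over and keeps running its own code.
event = acall(saxpy_kernel, (2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
serial_work = sum(range(10))    # unrelated scalar work done in the meantime
result = event.result()         # wait on the synchronization event
```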

Nick said:
Everything can be expressed as code, and to make it run fast we need a versatile architecture that can extract various levels of ILP, DLP and TLP.

I think I said this before, too, but modern GPUs can use ILP, DLP and TLP already. And they are much better in the DLP field than your usual CPU core, no matter how large SIMD engines you are going to bolt on. Of course both types of cores evolve, but they don't necessarily converge. The requirements are just different.

liolio · Mar 11, 2013

Alexko said:
Radeon HD 7950M: 50W, 1792 GFLOPS.

http://www.notebookcheck.net/AMD-Radeon-HD-7950M.72676.0.html

It is a bit unfair: how much CPU must be burnt for that GPU to express all its potential?

Nick · Mar 11, 2013

Novum said:
You are cherry picking your numbers.

GTX 650 is TSMC 28nm, Haswell is Intel 22nm. How can you even compare that?

Because that is the reality. Intel has an advantage in process technology, and it made AVX extendable up to 1024-bit. These are things that need to be taken into consideration since Intel is the most likely to provide a viable unified architecture first. Besides, they wouldn't compete against discrete GPUs. They would compete against their own integrated graphics.

Anyway, please carefully read the rest of the argument where I used those numbers. I was merely responding to the "utterly destroying" claim, which is provably false by comparing a recent GPU against a soon to be released CPU. If you don't think the GeForce GTX 650 is a GPU worth comparing against, then keep in mind that Haswell also isn't the future unified CPU that I really want to compare against.

Novum said:
Not to forget that the GPU has a vastly more powerful GDDR5 memory interface that gobbles up a lot of that TDP you are stating.

That really shouldn't be taken out of the equation. GPUs are inherently more wasteful with bandwidth because they have a low cache hit rate, which is caused by running many threads to hide latency. Locality of reference is a key issue for GPUs to continue scaling, as it makes them run into the bandwidth wall and it also increases power consumption.

To combat it, they have to increase the cache hit rate by lowering the thread count, which requires several techniques which make it more CPU-like. Hence there will be continued convergence, and eventually unification.

Nick · Mar 11, 2013

ninelven said:
But I will ask if a uISA is really that important when you can use LLVM with a custom (machine specific) backend?

That is an excellent question, and yes, I do think it is important to share the same ISA between high ILP code and high DLP code. LLVM, or any compiler or framework for that matter, cannot decide for you which processor should run which portions of your code. You have to split it up yourself, you have to deal with load balancing and batching, you have to deal with synchronization, you have to deal with communication overhead, and you may even have to rewrite your algorithms. And you may have to do all of this again for different configurations.

That's a ton of effort, and it's a major reason why GPGPU hasn't gotten very far in the consumer market. Bringing the GPU on the same die should have helped, but those integrated GPUs are weaker so the heterogeneous overhead is still very high relative to the potential gains. It would simply be better if each CPU core was equipped with SIMD units equivalent to those of a GPU. The compiler would merely have to create one stream of code, and the hardware can essentially switch between high ILP and high DLP from one cycle to the next. You don't have to worry about how to split things or how to synchronize it or how to hide the communication latency.

Ethatron · Mar 11, 2013

How do you propose to solve the problem of variable registers? You have 16 SIMD-"cores" in AVX (16 registers), GPU can raise that every half year because there is a monopoly for code-generation (the driver), and the ISA can change/accommodate all the time.
I'm not sure 16-byte instructions would be affordable for CPUs, even with that wide SIMD-width.

Nick · Mar 11, 2013

Gipsel said:
To pick Novum's side here, and as you said he is prejudiced, I have to ask you if your suggestions and evangelization of your "UPU" aren't just wishful thinking.

I didn't just say he is prejudiced. I proved it. Wishful thinking would be desiring something that goes against rationale. But CPUs are clearly gaining higher compute throughput fast, and there are some silicon trends which force the GPU to become more CPU-like. So I don't think you can accuse me of wishful thinking.

Gipsel said:
I suggest you look at AMD's implementation. It still is some kind of coprocessor there. And it never changed in AMD's case. The FPU always had its separate scheduler.

I don't see a problem with that. For all intents and purposes it is fully unified. It will be interesting to see whether AMD will adopt AVX2 or whether they have long-term plans for Fusion that make it worthy of its name. If a separate scheduler is the key to that, great!

Gipsel said:
Throughput-optimized cores will always be better at throughput tasks, and latency-optimized cores will always be better at tasks requiring low execution latency.

Vertex processing units will always be better at vertex processing, and pixel processing units will always be better at pixel processing. And yet they became unified. So your statement, true as it may be, does not imply that the CPU and GPU will never unify.

Gipsel said:
In the future, we will likely have vast numbers of transistors available but no possibility to power them all at the same time (the dark silicon problem). The reasonable solution is to put several different cores onto the same die, each optimized for a certain subgroup of tasks. All that is needed then are means of distributing the right parts of a program to the right parts of the processor, and a low-latency communication path for handing over tasks and synchronizing the execution. A coprocessor model isn't the worst thing one could think of.

Actually that would be quite terrible. Computing is not what consumes the most power, it is moving data around. So it's better to have cores that can extract both high ILP and high DLP and perform some specialized operations (the topic of this thread) so your data can stay close.

What's also often forgotten is the impact on software development. Developers just don't want to have to deal with a dozen types of cores. Two is bad enough as it is. One would be much better.

Gipsel said:
I think I said this before, too, but modern GPUs can use ILP, DLP and TLP already. And they are much better in the DLP field than your usual CPU core, no matter how large SIMD engines you are going to bolt on.

GPUs are limited to dual-issue from the same thread, so ILP is very limited. And whatever makes them better at DLP, can also be used in the CPU. Long-running SIMD instructions would be my number one recommendation, next to wider SIMD units of course. In combination with the existing Hyper-Threading and out-of-order execution that should quite suffice to close the gap.

Nick · Mar 11, 2013

Ethatron said:
How do you propose to solve the problem of variable registers? You have 16 SIMD-"cores" in AVX (16 registers), GPU can raise that every half year because there is a monopoly for code-generation (the driver), and the ISA can change/accommodate all the time.
I'm not sure 16-byte instructions would be affordable for CPUs, even with that wide SIMD-width.

I have no idea what you are referring to. Registers and SIMD-"cores" are completely different things. The CPU can run anything and everything the GPU can. Just look at OpenCL implementations for the CPU, and software renderers with dynamic code generation.

Ethatron · Mar 11, 2013

I mean that the ISA of a CPU is essentially frozen, and that of x86 is in a bad shape as well. How do you propose to offer more than 16 registers over time? (Say raising them every half a year) From an ISA perspective ofc, the register-file contains more than enough.

Gipsel · Mar 11, 2013

Nick said:
I didn't just say he is prejudiced. I proved it. Wishful thinking would be desiring something that goes against rationale.

So you proved it with your wishful thinking?

But sorry, you proved nothing. All you offered is some kind of (in my opinion flawed) conjectures.

Nick said:
I don't see a problem with that. For all intents and purposes it is fully unified. It will be interesting to see whether AMD will adopt AVX2 or whether they have long-term plans for Fusion that make it worthy of its name. If a separate scheduler is the key to that, great!

I suggest you think again about that. It is not fully unified. The FPU in BD/PD/SR is even shared between two cores. How can this be fully unified with the integer core? If you neglect the decoders (but in SR each core will have its own), the FPU is still a somewhat separate coprocessor integrated on the same die. Intel has a fully unified approach for its FPU/SIMD units (the same port can accept FPU/SIMD as well as other instructions), AMD does not (the FPU/SIMD pipelines are really separate and don't share the issue ports). But that does not matter too much here anyway.

Nick said:
Vertex processing units will always be better at vertex processing, and pixel processing units will always be better at pixel processing. And yet they became unified. So your statement, true as it may be, does not imply that the CPU and GPU will never unify.

You are wrong. You miss the important point here. Vertex and pixel processing loads do not differ in any fundamental way. Both are throughput-oriented workloads on 32-bit floats. Vertex and pixel shaders can access the exact same types of resources and do the exact same types of calculations. The split made sense when one had 32-bit floats in the vertex stage but could save on the pixel shaders by getting along with 8/16/20/24-bit data types. But those times have been gone for a while. Almost everything in the shaders now works with 32-bit floats. As both vertex and pixel loads were basically identical (both throughput oriented), the vertex shaders could execute pixel shaders at the same speed. So it is simply not true that dedicated shader units would be faster.

But you want to overcome the fundamental difference between latency-oriented and throughput-oriented workloads. That is not going to happen so easily, as the processed data types don't change anything about that distinction.

Nick said:
Actually that would be quite terrible. Computing is not what consumes the most power, it is moving data around. So it's better to have cores that can extract both high ILP and high DLP and perform some specialized operations (the topic of this thread) so your data can stay close.

That has been discussed endless times already, I think. A throughput-oriented task usually accesses amounts of data that don't fit into the L1 or L2. Streaming it through a cache structure optimized for low-latency access is just wasteful, also from a power perspective. Calling an asynchronous subroutine on a dedicated throughput unit with its own L1/L2 specialized for those tasks doesn't increase data movement at all. The data just gets moved from the LLC or the memory hierarchy to a different place than your latency-optimized core.

Nick said:
GPUs are limited to dual-issue from the same thread, so ILP is very limited.

If you look back a bit, AMD's VLIW architectures could issue up to 5 different ops from the same thread in one cycle. That actually even beats Intel's 4 wide OoOE cores. Does it help a lot or is it even necessary? Obviously not as usually one can exchange ILP for DLP and TLP in throughput oriented cores quite easily. It's just a matter of balancing execution latencies and data access latencies with the expected amount of DLP and TLP to keep everything running at full load with minimized hardware effort.

Nick said:
And whatever makes them better at DLP, can also be used in the CPU.

You mean like the 8 MB of register files of a Tahiti GPU? I want to see that in a CPU with heavily multiported reg files running at 4 GHz.

Or all the write combining and memory access coalescing that goes on to use the available bandwidth as efficiently as possible, but that causes a massive (about one order of magnitude) latency penalty which would be completely prohibitive for a latency-oriented task? Putting this on top of a latency-optimized cache architecture would just burn an unnecessary amount of power. It's very likely more efficient to keep these structures separate, maybe save for the LLC (with special provisions to keep throughput tasks from excessively thrashing the data of latency-optimized tasks).
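For scale, the 8 MB register file figure mentioned above can be reproduced with a few lines of arithmetic. The per-unit numbers below (32 compute units, 4 SIMDs per unit, 64 KiB of vector registers per SIMD, 256 registers of 64 work-items each) are assumptions taken from public descriptions of the GCN architecture, and the AVX side counts only the 16 architectural 256-bit registers visible to one thread, not the physical register file.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumed GCN "Tahiti" configuration.
compute_units = 32
simds_per_cu = 4
vgpr_bytes_per_simd = 64 * KIB

tahiti_vgpr_bytes = compute_units * simds_per_cu * vgpr_bytes_per_simd

# Cross-check of the per-SIMD figure: 256 registers x 64 work-items x 4 bytes.
per_simd_check = 256 * 64 * 4

# Architectural AVX state of one thread: 16 registers of 256 bits each.
avx_bytes_per_thread = 16 * (256 // 8)

print(tahiti_vgpr_bytes // MIB, "MiB of vector registers on the GPU")
print(avx_bytes_per_thread, "bytes of architectural AVX registers per CPU thread")
```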

MfA · Mar 11, 2013

It's probably best to design for a SIMD/MIMD hybrid à la Dally's proposal (with MIMD coming at increased power consumption) and then allow some kind of pipeline linking for VLIW, similar to what AMD had, at reduced performance (with the same staggered register access to let you keep the number of ports down). A processor designed to extract a little extra parallelism at the cost of adding a whole lot more ports to the register file is probably never going to be ideal for GPUs (which is not to say Intel's process advantage might not still pull it out ahead, of course).

Dally's proposal is still the closest to the ideal from anyone with some authority to appeal to.

Gipsel · Mar 11, 2013

MfA said:
It's probably best to design for SIMD/MIMD hybrid ala Dally's proposal (with MIMD coming at increased power consumption) and then allow some kind of pipeline linking for VLIW similar to what AMD had at reduced performance.

But Dally's proposal also included a bunch of latency optimized cores in addition to the SIMD/MIMD throughput cores.

MfA · Mar 11, 2013

You can tie those cores to SIMD/MIMD arrays and then leave the arrays doing nothing when you're running single-threaded code on them ... then you can call them unified too.
