Thanks for those details. So the solution is more convergence. However, note that only part of the issues would be addressed by having nearby latency-optimized scalar cores handle task scheduling. You'd still have tasks that aren't parallel enough for the many threads that current compute cores demand.
Currently, the amount of parallelism needed is grossly inflated, because the huge setup costs have to be amortized.
Don't overstate what I agree with. Haswell's 2+1 load/store with three AGUs is a fact, and it's an improvement over all previous Intel architectures. But code doesn't consist exclusively of memory accesses. x86 became the dominant ISA despite its relatively low number of logical registers and despite Intel hanging on to 1+1 L/S for a long time. CPUs can easily schedule around any L/S contention, and Haswell is going to be even harder to bottleneck. GPUs, on the other hand, can easily become L/S bottlenecked, and they demand extra threads to work around temporary contention.
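As a rough illustration of that mix (a minimal sketch; the port assignment is the commonly documented Haswell one, not something taken from this thread): a scalar SAXPY loop needs 2 loads, 1 store and 1 FMA per element, which maps exactly onto the 2+1 L/S ports plus an FMA port, so the out-of-order core can sustain roughly one element per cycle without the memory operations tripping over each other.

    /* Minimal sketch: per element this does 2 loads, 1 FMA, 1 store.
     * On Haswell the two loads go to ports 2/3, the store data to port 4
     * (store address to port 7) and the FMA to port 0 or 1, so the loop
     * body fits the 2+1 L/S throughput with the arithmetic on separate
     * ports -- there's nothing to contend on in steady state. */
    void saxpy(float a, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];   /* load x[i], load y[i], FMA, store y[i] */
    }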
Read what I wrote! I just stated the basic fact that almost every architecture (GPUs as well as common x86 CPUs) will be limited by the throughput of its memory system if that is pegged. Nothing more. No need to argue about something obvious.
Sure, the L1 bandwidth per flop is quite a bit lower on GPUs, but they also enjoy a much larger register space, so they don't need to use the L1 as an extended register file the way x86 CPUs do. Depending on the code and conditions, that may be an advantage or a disadvantage for either side. That's how architectural differences play out. But by itself it's not a solid argument for anything.
First of all, with two load ports it never has to be moved far.
How does the number of load ports influence the memory latency?
I don't think so. You're simply going to run out of threads if the contention on your texture or L/S port is too high, even if just locally.
So you're telling me that a GCN CU easily runs out of its up to 40 wavefronts (each wavefront can have up to 16 reads and 8 writes pending)? And typical code also has a few arithmetic instructions in between. Even looking at just a single instruction buffer and vector ALU, a burst of memory accesses from one wavefront isn't going to stall the vALU/sALU/LDS instructions from the other wavefronts, let alone the vALU/sALU/LDS instructions from the wavefronts running on the neighboring instruction buffers and vALUs.
I've no idea how long the queue in front of the AGUs is (i.e. how many vector memory instructions can be issued before the issue port gets blocked; it becomes available again after a few cycles, once the AGUs have processed the next access in the queue), but in any case that would block exclusively the vmem issue port for a short time, not the vALU port, not the sALU port, not the branch port, not the LDS port, not the export (mem write) port, and certainly not the handling of the internal instructions (like synchronization, which are handled directly in the instruction buffer and don't need an issue port). That means all arithmetic or local memory instructions from other wavefronts continue to be issued. If you have such huge amounts of memory accesses that this starts to matter, you end up bandwidth limited anyway. But short bursts of memory accesses (say, 8 reads directly after each other with no arithmetic instructions in between) are usually handled quite well if that is just a part of the kernel. There is apparently no performance degradation from such a grouping.
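To put a rough number on how much latency that wavefront pool can cover (a back-of-the-envelope sketch; the 10-wavefronts-per-SIMD and 4-cycle vALU issue are the usual GCN figures, while the count of independent ALU ops per wavefront is my own assumption):

    /* Back-of-the-envelope latency hiding for one GCN SIMD.
     * Known GCN figures: up to 10 wavefronts resident per SIMD, and a
     * vALU instruction occupies the 16-lane SIMD for 4 cycles (64-wide
     * wavefront).  The independent-ops-per-wavefront count is assumed. */
    #include <stdio.h>

    int main(void)
    {
        const int waves_per_simd  = 10;  /* resident wavefronts per SIMD        */
        const int cycles_per_valu = 4;   /* 64 work-items over 16 lanes         */
        const int indep_valu_ops  = 4;   /* assumed independent ALU ops between */
                                         /* memory accesses, per wavefront      */

        /* While one wavefront sits on its burst of loads, the other nine
         * can keep the SIMD busy for roughly this many cycles: */
        int hidden = (waves_per_simd - 1) * indep_valu_ops * cycles_per_valu;
        printf("~%d cycles of memory latency covered\n", hidden);  /* ~144 */
        return 0;
    }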
My point was that CPUs can still execute independent arithmetic instructions in this situation.
So do GPUs.
What I'm really trying to get at is that CPU architectures are, on average, every bit as good at high-throughput workloads as the GPU. They just lack SIMD width.
It's as good in the sense that it can, of course, finish the task.
But it burns way more power for the same throughput. Having a 4 GHz CPU with fast caches and low latency is great for serial performance, but it's an unnecessary power burden for throughput tasks. You don't need an L1 hit to be served in a single nanosecond or an FMA completed in 1.3 ns. Having more leeway on the latency side (GCN does an SP FMA in ~4 ns, a DP FMA currently in ~16 ns latency) enables a more power-efficient design of the functional units. And that's before we even get to the simpler and therefore less power-hungry scheduling logic. It's obviously better to have a different set of lower-clocked but wider execution units next to the latency-optimized ones. And it's hard, if not impossible, to scale the frequency of a pipeline by a factor of 4 without running into inefficiencies.
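Those latency figures are just cycle counts divided by the clock (a minimal sketch; the 5-cycle Haswell FMA and the ~1 GHz GCN clock are assumed round numbers that reproduce the nanosecond values quoted above):

    /* Pipeline latency in nanoseconds = cycles / frequency (GHz).
     * The cycle counts and clocks below are assumed round numbers that
     * reproduce the latencies quoted above, not measurements. */
    #include <stdio.h>

    static double latency_ns(int cycles, double ghz) { return cycles / ghz; }

    int main(void)
    {
        printf("Haswell FMA:  %.2f ns  (5 cycles @ 4.0 GHz)\n", latency_ns(5, 4.0));
        printf("GCN SP FMA:   %.2f ns  (4 cycles @ 1.0 GHz)\n", latency_ns(4, 1.0));
        printf("GCN DP FMA:  %.2f ns  (16 cycles @ 1.0 GHz)\n", latency_ns(16, 1.0));
        return 0;
    }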
Just look at what happened to the serial performance of Larrabee/Knights Corner. That is a more or less throughput-optimized design which kept some latency optimizations (or rather, it didn't go as far as GPUs) to fare better at intermediate-sized tasks with dynamic and varying amounts and types of parallelism. Nevertheless, it clocks just above 1 GHz, and it also does an SP FMA in ~3.5 to slightly below 4 ns. Why does it have 62 cores @ 1.1 GHz and not 16 cores @ 3.5 GHz? Coincidence or a pattern?
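A rough rule of thumb makes that trade-off concrete (purely illustrative; the cubic scaling is the usual dynamic-power approximation, since supply voltage has to rise roughly with frequency, and the clocks are the ones mentioned above):

    /* Rule-of-thumb sketch: dynamic power ~ C * V^2 * f, and V scales
     * roughly with f, so per-core power grows roughly with the cube of
     * the clock.  Illustrative only -- real scaling curves are messier. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double f_low = 1.1, f_high = 3.5;                   /* GHz */
        double per_core_power_ratio = pow(f_high / f_low, 3.0);   /* ~32x */
        printf("~%.1fx the clock costs roughly %.0fx the per-core power\n",
               f_high / f_low, per_core_power_ratio);
        return 0;
    }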