The ISA for a UPU

The amount of active threads/instructions in flight is ultimately limited by your registers; you need a place to store results. A CU in an AMD GPU currently has 64KB of registers, Haswell has 5.25KB of AVX2 register space.
A GCN CU has 256kB vector registers (+8kB scalar ones). Each of the four SIMD units in the CU has its own 64kB of registers.
Increasing (or replicating) ROBs will impact either cycle time or scheduling latency, as will a massive two-level register file. Another problem with optimizing for throughput is that work/energy favours wider, lower clocked implementations.
Don't argue with Nick about that. It was mentioned several times already.
 
@Nick:
I won't go into any extended discussion with you anymore; I will just answer questions or correct some things you obviously got wrong.
Clearly that's just stupid. Not only is it hugely impractical to have the CPU take part in the low-level rendering processes, the round-trip latencies make any such attempt futile.
I'd really like to better understand why GPUs suffer from task scheduling latency exactly and what's being done about it (details please). And why do you think it's a fundamentally different problem from other latency issues?
The bolded part of your first quote answers your second. Currently, if you want to do something on a GPU, there is significant driver overhead: buffers get copied in RAM and sent over PCIe, and the results have to come back. That's the overhead that currently kills medium-sized tasks. If everything is on the same die, with the same address space and equal access to everything, one could start a task on a CU array just by a special call providing a pointer to a descriptor structure (detailing the arguments, and a pointer to the code to be executed), and one could get rid of almost all of that overhead. The penalty for starting a task would go from a hundred microseconds or so to nanoseconds.
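To make that concrete, here is a purely illustrative sketch of what such a dispatch could look like once everything shares one address space; all names (TaskDescriptor, doorbell, etc.) are hypothetical and not any vendor's actual interface:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical descriptor: the arguments and a pointer to the code to be executed.
struct TaskDescriptor {
    const void* kernel_code;   // code the CU array should run
    const void* args;          // argument block, visible in the shared address space
    uint32_t    work_items;    // how much work to launch
    uint32_t    flags;
};

// Hypothetical user-mode launch: a couple of stores instead of a driver round trip,
// which is where the "hundred microseconds down to nanoseconds" claim comes from.
void launch_on_cu_array(TaskDescriptor* queue_slot,
                        volatile uint32_t* doorbell,
                        const TaskDescriptor& task) {
    *queue_slot = task;                                   // publish the descriptor
    std::atomic_thread_fence(std::memory_order_release);  // make it visible before signalling
    *doorbell = 1;                                        // ring the hardware scheduler
}
```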
AMD, Intel and NVIDIA are all working on reducing this overhead. In certain situations one can work around it (or at least hide the "setup cost" behind the execution of another task) with the latest generation of GPUs. For example, newer GPUs are basically able to create tasks for themselves (dynamically spawning new threads), which avoids the round-trip latency to the CPU core.
That's what Novum explicitly mentioned in his post at the very beginning of the thread: he wished for low-latency communication with the throughput cores. And that is being worked on.
Pixel Shader 1.3 offered only 2 logical temporary (vector) registers. Input registers aren't a problem because they get computed on the spot. It's the results that you write and read back (effectively 200+ cycles later) that needed temporary registers.
That's all been cleared up already. It is supposed to mean 8 to 12 32-bit registers (which is what G80 offers at full occupancy). No need to dig that up from 2 pages ago.
Limited? Again, Haswell can do two full SIMD width loads and a store per cycle.
Thanks for agreeing that Haswell will be limited to 2 loads and 1 store per cycle for code consisting exclusively of memory accesses (that's what I wrote). ;)
Also, the CPU can easily schedule around any remaining conflicts, so stalls due to L/S contention are practically non-existent.
How far up can one usually move a load? Isn't that often limited in average code?
In comparison the GPU is massively bottlenecked by L/S and texture fetch. Also, since it takes multiple cycles to issue a texture fetch I can easily imagine situations where other threads start convoying behind it even though there's independent arithmetic instructions below the texture access in each/some of the threads. The scheduler just won't schedule them.
That's just wrong. It's not how the schedulers work in current GPUs. If there is any warp/wavefront with an independent instruction left on a CU/SM, it will get scheduled. And the issue of a texture access won't block other threads from issuing (not even the same one). At least for GCN (and I would doubt it for Kepler, too), issuing a vector memory access doesn't take any longer than an arithmetic instruction (the completion of the access does). The only way to block execution is to hit a limit of the memory hierarchy (throughput, the maximum number of outstanding accesses [which is likely very high, as GCN can have up to 16 outstanding vector memory reads and 8 writes per wavefront, with scalar accesses on top of that], or something like that).
 
I don't know about the ISAs on the big discretes (which I need to look over again sometime), but I do know that there are at least mobile GPUs that have instructions that perform MUL + ADD in more complex arrangements than just FMA. They also have simple input modifiers like 1 - X and small shifts/multiplications. I don't really know how much of this makes sense to incorporate into functional units, but it could help. SSE does have a lot of pretty specialized instructions as well, but not that focused on graphics.
Indeed we should keep our eyes open for useful instructions of this kind. All I'm saying is that for a unified architecture it probably doesn't make sense to go beyond what the latest GPUs do.

When I was defending my theory that Haswell would have two FMA units (which turned out to be right), I read a research paper on fused instructions that went beyond FMA. I believe the conclusion was that unless all you're doing is processing huge matrices or long polynomials, it's not worth it over having two FMA units. FMA itself is a significant improvement over just MUL and ADD though. I can't seem to find the paper any more, but the discussion is somewhere on realworldtech if you're interested.
Based on what I've read, when people say dedicated interpolation units have been dropped they mean the application of barycentric coordinates, which is linear interpolation, not the calculation of the barycentric coordinates themselves, which is non-linear. I don't know what the latency is like for divides in GPUs, so I don't know if there's any cost to having a big dependency on one early on or not, but for the sort of ISA in a unified system that's more CPU-like, I could see anything that gets around having to deal with the divide latency upfront as useful.
NVIDIA's SFU design reveals that the complexity of a division (reciprocal really) on the GPU is on the order of a few arithmetic operations. And I believe several architectures match the latency of their FMA units to the SFU latency. However, they have a 6:1 ratio in the number of FMA units and SFU units (which are also used for interpolation). So their biggest concern isn't latency; it's the bottleneck that could occur from having too many operations that require the SFU units. Anyway, back on the topic of a unified architecture, there surely is a lot of room for improvement in the performance of full-precision division. But I don't think the perspective division in graphics has a desperate need for that. There's a 2:1 ratio for FMA and RCP, and one Newton iteration suffices, which is cheap if you have FMA anyway. Most importantly, only one perspective division is needed per pixel. The increase in average shader length made it insignificant a while ago. You can use a low-throughput high-latency division with no appreciable effect on performance.
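As a rough illustration of how cheap that is with FMA on the CPU side, here is a minimal AVX/FMA sketch of a perspective division for 8 pixels: one reciprocal estimate plus a single Newton-Raphson step (the function and variable names are mine, and fix-ups for special inputs are omitted):

```cpp
#include <immintrin.h>

// Perspective division for 8 pixels: one rcp estimate, one FMA-based NR step,
// then three multiplies. Illustrative sketch, not tuned code.
void perspective_divide(__m256 x, __m256 y, __m256 z, __m256 w,
                        __m256* outx, __m256* outy, __m256* outz) {
    __m256 r0 = _mm256_rcp_ps(w);                               // ~12-bit estimate of 1/w
    __m256 e  = _mm256_fnmadd_ps(w, r0, _mm256_set1_ps(1.0f));  // e  = 1 - w*r0
    __m256 r1 = _mm256_fmadd_ps(r0, e, r0);                     // r1 = r0 + r0*e (one NR step)
    *outx = _mm256_mul_ps(x, r1);
    *outy = _mm256_mul_ps(y, r1);
    *outz = _mm256_mul_ps(z, r1);
}
```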
These are pretty common in DSPs.. I don't know where they apply in modern high end GPUs but I know you don't have to dig back terribly far to find more odd multiplication width pairs in GPUs.. they might still lurk in fixed function stuff. But I don't have applications outside of things like emulating archaic platforms like Nintendo DS :p
Well if it's a thing of the past, i.e. GPUs are moving away from it, then I don't think the CPU should be moving toward the old stuff since it would become a burden if it's practically never used.

That said I've started to realize that there are most definitely mixed width multiplication instructions that could be of great help for texture filtering. So thanks for the suggestion to look for these.
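For example (a minimal sketch; the interleaved layout and 7-bit weights are just one common convention, not a claim about any particular GPU), SSSE3/AVX2's u8 x s8 multiply-add is exactly this kind of mixed-width instruction applied to bilinear weighting:

```cpp
#include <immintrin.h>

// One bilinear blend step using the mixed-width u8 x s8 -> s16 multiply-add.
// 'texels' holds interleaved pairs [t0,t1, t0,t1, ...] (unsigned bytes) and
// 'weights' holds [w, 128-w, ...] with w in 0..128 (signed bytes).
static inline __m256i lerp_u8_pairs(__m256i texels, __m256i weights) {
    __m256i acc = _mm256_maddubs_epi16(texels, weights);  // t0*w + t1*(128-w), 16-bit
    return _mm256_srli_epi16(acc, 7);                     // scale back to the 0..255 range
}
```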
Was LRB TBDR or was it just tiling?
I'm terribly sorry but I laughed out loud when I read this. You can't categorize Larrabee's hardware one way or the other. Aside from the dedicated texture units it is highly generic and can be used any way you like. Whether or not Intel's default driver was intended to be TBDR-based or not, or some form of hybrid or configurable renderer, that's a whole other question.
I think some form of tiling is a must for rendering on something with a CPU-like cache hierarchy/bandwidth. Why is it not suitable w/tessellation?
Again I wouldn't say anything is a "must". There are cases where forms of tiling help a lot and other cases where not tiling avoids wasting cycles and power. And that's the beauty of a unified architecture; you can do things in much smarter ways tailored to the situation. The hardware doesn't impose one approach on you. GPUs have already given up fixed-function vertex and pixel processing for something much better, despite the cost, and now we're getting closer to the point where the rendering processes themselves will cease to be fixed-function. For better.

I hope this clarifies why I think it would potentially derail the discussion to respond to why TBDR is not well suited for tessellated geometry. You can find some answers/opinions here: Early Z IMR vs TBDR - tessellation and everything else.
Further thoughts on fixed function ISA:

What's most useful for accelerating compressed textures? How about table lookup/shuffles over quantities below 8 bits? Or using different index widths vs access widths? I use the 8-bit lookup instructions in NEON a fair amount, but often (for graphics things) what I really want to do is look up 16-bit values with an 8-bit index, requiring two lookups and an interleave. Sometimes what I really want is to look up 16-bit values using only 4-bit indexes. Parallel pext helps a lot with this, but being able to do it directly is even better.
Note that GPUs dropped support for palettized textures, despite it being a perfectly useful way to compress textures if you have custom palettes per texture. It's really the latter restriction that killed this feature, because it can't be done efficiently in dedicated hardware when you're constantly sampling from different textures. That is, it can't be done more efficiently than a generic gather operation from memory.

So gather and PDEP/PEXT are all you can hope for. AVX2's gather is limited to 32-bit indices, but in a couple more silicon shrinks there might be timing room to lower that down without affecting latency. For 4-bit to 16-bit lookup an in-register permute (vpermd) should be of great help already.
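For reference, here is a minimal AVX2 sketch of the "two lookups and an interleave" workaround for 4-bit indices into a 16-entry table of 16-bit values (the helper name is mine; the table has to be replicated into both 128-bit lanes and the indices pre-masked to 0..15):

```cpp
#include <immintrin.h>

// tab_lo/tab_hi: low and high bytes of a 16-entry table, replicated per 128-bit lane.
// idx4: one 4-bit index (0..15) per byte, already masked so no byte has its high bit set.
static inline __m256i lut4_to_16(__m256i idx4, __m256i tab_lo, __m256i tab_hi) {
    __m256i lo = _mm256_shuffle_epi8(tab_lo, idx4);  // low bytes of the 16-bit results
    __m256i hi = _mm256_shuffle_epi8(tab_hi, idx4);  // high bytes of the 16-bit results
    return _mm256_unpacklo_epi8(lo, hi);             // interleave into 16-bit values
    // (a complete version also needs _mm256_unpackhi_epi8 for the upper half of each lane)
}
```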
 
Indeed we should keep our eyes open for useful instructions of this kind. All I'm saying is that for a unified architecture it probably doesn't make sense to go beyond what the latest GPUs do.

Going to have to see then if that's "just FMA" or if it includes any of the other stuff I've referred to. But since you're talking about replacing current embedded GPUs it doesn't make sense to only look at what discrete GPU families are doing, meaning you should be looking at Gen and IMG's current GPUs at the very least (or are you drawing the line at some power budget?)

Well if it's a thing of the past, i.e. GPUs are moving away from it, then I don't think the CPU should be moving toward the old stuff since it would become a burden if it's practically never used.

Put it this way: if you have support for 8, 16, and 32-bit multiplication as optimized SIMD cases like AVX2 does, then you WILL benefit from mixed-width multiplication. Not as much, but it's a certainty that it will come up. Whether or not it's worth it depends on the granularity at which the uarch differentiates the performance of different multiplication widths. If you can do a 16x32 multiply twice as fast as a 32x32 one then it's worth it.

GPUs moved away from higher performance operation of anything < 32-bit; CPUs haven't (as Intel proves) but throughput ones like Xeon Phi have. Meanwhile mobile GPUs (and Gen) have offered lower width operations, because if you don't need the precision it's an obvious perf/W improvement. So where exactly does your stance on this lie?

That said I've started to realize that there are most definitely mixed width multiplication instructions that could be of great help for texture filtering. So thanks for the suggestion to look for these.

Yes, I was thinking filtering right after I posted it :p

I'm terribly sorry but I laughed out loud when I read this. You can't categorize Larrabee's hardware one way or the other. Aside from the dedicated texture units it is highly generic and can be used any way you like. Whether or not Intel's default driver was intended to be TBDR-based or not, or some form of hybrid or configurable renderer, that's a whole other question.

Don't be such a snob. Of course I was talking about Michael Abrash's and co's LRB renderer, not what the "hardware" did. Don't see how you couldn't have gone straight to that.

The big thing that sticks out to me is that for software rasterizers, if you end up with a bounding box that says whether the triangle is in or out (and how much in, via linear barycentric coordinates) and you want to go straight to rendering, you then have to step through this block and determine which of the (<50% of) pixels are visible. Then do it again after the depth test. Or do it only after the depth test, but then you're doing the depth test on a bunch of fragments you know you won't use.

The point is, I can't think of a good way to dispatch a bunch of shaders on this without stepping through the tile and checking whether there's going to be a fragment at each location or not. You can do it hierarchically but it's still going to be a fair bit of overhead. If you do a TBDR then you first fill up the tile with the IDs of all fragments contributing to it; the overhead of discarding unused fragment IDs is cheap (blend it against the tile) and you have no gaps in it.
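For what it's worth, here is a rough sketch of the stepping I mean, under purely illustrative assumptions (integer edge functions, a fixed 8x8 block, GCC/Clang builtins): build a coverage mask, then dispatch only on the set bits:

```cpp
#include <cstdint>

struct Edge { int a, b, c; };                     // E(x,y) = a*x + b*y + c

// Evaluate three edge functions over an 8x8 block and pack coverage into 64 bits.
static inline uint64_t coverage8x8(const Edge e[3], int x0, int y0) {
    uint64_t mask = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x) {
            bool inside = true;
            for (int i = 0; i < 3; ++i)
                inside &= (e[i].a * (x0 + x) + e[i].b * (y0 + y) + e[i].c) >= 0;
            if (inside) mask |= 1ull << (y * 8 + x);
        }
    return mask;
}

// Shade only the covered pixels by iterating the set bits of the mask.
template <typename ShadeFn>
static inline void shade_covered(uint64_t mask, int x0, int y0, ShadeFn shade) {
    while (mask) {
        int bit = __builtin_ctzll(mask);          // next covered pixel (GCC/Clang builtin)
        mask &= mask - 1;                         // clear it
        shade(x0 + (bit & 7), y0 + (bit >> 3));
    }
}
```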

The way I see it, if you aren't doing TBDR that sort function could be even more useful..

I hope this clarifies why I think it would potentially derail the discussion to respond to why TBDR is not well suited for tessellated geometry. You can find some answers/opinions here: Early Z IMR vs TBDR - tessellation and everything else.

That thread hardly makes a strong support of your case..

Note that GPUs dropped support for palettized textures, despite it being a perfectly useful way to compress textures if you have custom palettes per texture. It's really the latter restriction that killed this feature, because it can't be done efficiently in dedicated hardware when you're constantly sampling from different textures. That is, it can't be done more efficiently than a generic gather operation from memory.

So gather and PDEP/PEXT are all you can hope for. AVX2's gather is limited to 32-bit indices, but in a couple more silicon shrinks there might be timing room to lower that down without affecting latency. For 4-bit to 16-bit lookup an in-register permute (vpermd) should be of great help already.

I'm not seeing your case here. Compressed textures still use packed small bit width lookups, and they go to larger widths. This is useful all the time outside this case, especially if you're using a lookup vector that doesn't have more than 16 indexes to begin with. Yes you can convert your indexes to wider widths with pdep beforehand but why is it not an advantage to have an instruction which does 4-bit directly? Especially w/o knowing the cost of pdep/pext, I'm led to believe the area is not trivial, while picking out fixed bits for input to the shuffle unit would be..
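A minimal sketch of the pdep-based widening mentioned above (BMI2's _pdep_u64; the helper name is mine): eight packed 4-bit indices are spread into eight bytes in one instruction:

```cpp
#include <immintrin.h>
#include <cstdint>

// Widen eight packed 4-bit indices (one 32-bit word) into eight bytes:
// each nibble lands in the low nibble of a byte of the 64-bit result.
static inline uint64_t widen_nibbles_to_bytes(uint32_t packed4) {
    return _pdep_u64(packed4, 0x0F0F0F0F0F0F0F0FULL);
}
```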
 
NVIDIA's warps are 32 elements but Kepler has 32 element SIMD units so they only take one cycle (at least for the regular 32-bit arithmetic ones).
Not even close. A simple float add is 9 clocks on GTX 680. However, a single thread can multi-issue, meaning there's only a stall if there's not enough other work to do.
Nick said:
GPUs do have to hide ALU latency. It is larger than one cycle so they have to swap threads to hide it.
I don't know about Nvidia, but on GCN, we don't swap threads for ALU latency, because there is none to speak of: Each SIMD in a CU runs separate wavefronts. Most instructions run in 4 clocks and that's the smallest execution granularity. If you hit a longer running instruction, say dmad, then there's no point in swapping threads because the ALUs are already busy.
 
For AMD and NVIDIA hardware the latency for dependent ALU operations isn't worse than on a CPU. For AMD it's actually better, as there's zero latency if the ALU instruction is in the instruction cache. I'm not sure about Kepler, but Fermi supposedly had ~18 cycles of latency if my memory is correct.
Haswell's FMA latency is 5 cycles. So how can you say the GPU's ALU latency isn't worse when Fermi takes 18 cycles? And that's without taking clocks into account. Also, even if the effective latencies were the same, the CPU would still achieve much higher average ILP per thread thanks to out-of-order execution. So it's of a different relevance.

What do you mean by zero latency if the ALU instruction is in the instruction cache? It always takes multiple cycles to have the result available for a dependent instruction.
I think the latency you're really talking about is the latency when the next instruction isn't in cache or there's a memory operation.
No, the instruction cache has nothing to do with the latency I'm talking about. I'm talking about the ALU latency plus the bypass latency.
 
How exactly are you expecting to overcome the bandwidth deficit without increasing the perf/watt gap?
What bandwidth deficit?
If you are going to persist in being obtuse *again*, then we can't really have a conversation. So you can either answer the question you damn well know I was asking or...
No I really don't have a clue what bandwidth deficit you're talking about. The bandwidth wall is an argument in favor of unification of the CPU and integrated GPU. They share the same bandwidth and the CPU's computing power can grow faster than the bandwidth while the GPU already exhausts it so there's continuous convergence and eventually unification.

You make it sound instead like it's something that would need fixing to make unification work. So if there's some other bandwidth deficit you had in mind that impedes unification, I'd really love to know about it and I would happily answer how I expect it will be overcome.
 
There is no such thing as an "OpenCL workload". You can easily formulate scalar/serial workloads or expose only a low amount of parallelism with OpenCL.
But I don't have to formulate such a corner case to see OpenCL applications run faster on a CPU than on a heterogeneous architecture. In other words, workloads that people assume would run faster by making use of the GPU, aren't running faster at all due to inherent inefficiencies in the GPU architecture and/or due to the heterogeneous overhead.
 
This is not true for GCN.
I think there's a slight difference in terminology here. Here's how the vector part of GCN works AFAIK:
- Single shared decoder that issues 1 vector instruction/cycle.
- 4x16-wide ALU pipelines with 4 cycles latency.
- One active 64-wide thread per ALU pipeline.

This results in the following behaviour:
- Decoder issues one instruction to each ALU pipeline every 4 cycles (round-robin).
- Each instruction takes 4 cycles to process, as the threads are 4x as wide as the ALUs.
-> By the time the scheduler comes back to this pipeline, it is ready to process the next dependent instruction.

Now in practice it's a bit more complex than that (the register file is probably the really interesting bit), but basically you don't need any ILP to achieve full throughput with the minimum number of threads. While you do need 256 vector elements (4x64-wide threads) to feed 64 vector elements of processing (4x16-wide ALUs), I think this is basically the exact same trade-off Nick is suggesting for AVX1024.
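A toy model of that issue pattern, with all structure purely illustrative: one round-robin decoder slot per SIMD every 4 cycles and a 4-cycle ALU latency, showing that a chain of fully dependent instructions never stalls:

```cpp
#include <cstdio>

int main() {
    const int kSimds = 4, kLatency = 4;        // four 16-wide SIMDs, 4-cycle ALU latency
    int ready_at[kSimds] = {0, 0, 0, 0};       // cycle at which the previous result is ready
    int issued[kSimds]   = {0, 0, 0, 0};
    for (int cycle = 0; cycle < 32; ++cycle) {
        int simd = cycle % kSimds;             // round-robin: each SIMD gets a slot every 4 cycles
        if (cycle >= ready_at[simd]) {         // dependent instruction can issue: no stall
            ready_at[simd] = cycle + kLatency;
            ++issued[simd];
        } else {
            std::printf("cycle %d: SIMD %d would stall\n", cycle, simd);
        }
    }
    for (int s = 0; s < kSimds; ++s)           // prints 8 per SIMD: full rate, zero stalls
        std::printf("SIMD %d issued %d dependent instructions in 32 cycles\n", s, issued[s]);
}
```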
 
But I don't have to formulate such a corner case to see OpenCL applications run faster on a CPU than on a heterogeneous architecture. In other words, workloads that people assume would run faster by making use of the GPU, aren't running faster at all due to inherent inefficiencies in the GPU architecture and/or due to the heterogeneous overhead.
Handbrake only uses OpenCL for certain parts of the encoding pipeline. This is not a good benchmark, because the slow CPU cores of AMD drag the GPU down (Amdahl's law).

But indeed, video encoding is one of those tasks where a throughput core can't shine because of entropy coding, which is inherently non-parallel.
 
Not sure. What does a unified CPU/GPU get you over separate units with some shared features (memory, caches etc.) on the same die?
If your target is performance then I don't think it is the way to go.
The only reason integrated graphics exist is because they make the CPU attractive to system builders who want to cut costs.
Are there any cost savings with unification over on-die GPUs?
The benefits would be comparable to the unification of vertex and pixel processing: performance, cost, and new applications. For a long time the GPU's unification didn't make sense from a performance point of view due to the inherent differences between vertices and pixels. But soon after they found their common ground they did become unified. Performance wasn't an obstacle any more because unification did not imply sacrificing the qualities of either type of core: it unifies them.

It eliminates duplicate logic, and it eliminates communication overhead between heterogeneous cores (data, code, synchronization, etc.). So it lowers the hardware cost and improves performance, especially for applications that have workloads that fit the 'common ground' or have a mix of workloads that would otherwise be impeded by the heterogeneous overhead. This sparks the development of new applications, which then in turn solidifies the need for a unified architecture.

While the ability to create new high performance applications is already an interesting financial prospect for developers, a unified architecture is also vastly simpler to program than a heterogeneous one and thus reduces the software cost. Compilers can use the UPU's SIMD instructions, with practically no risk of ending up with lower performance like with a heterogeneous architecture. And you can use any programming language and tool set you prefer. Of course the brave still have access to assembly intrinsics and such to squeeze every drop of performance out of it, but the important thing is you're not forced to know anything about the hardware configuration or its characteristics to be able to benefit from it, like you would with all the different heterogeneous setups. So it becomes accessible to the average developer, accelerating his application without effort.

Lastly, the software ecosystem for unified computing is more thriving. You can exchange code, algorithms, libraries, frameworks, etc. much more easily than for a heterogeneous architecture. The amount of OpenCL code pales in comparison to the amount of C, C++, C#, Objective-C, and Java code that would benefit from unified acceleration.

Also note that something like OpenCL requires a committee to define and refine the API, and it requires hardware manufacturers to invest into creating drivers for it. So again unified computing would reduce costs at this level as well and broaden the possibilities.
If you do unify wont you end up with a cpu with a huge number of pipelines(aka cores) which is fantastic for graphics work, but sit idle for normal cpu work ?
Not really. It only takes 8 cores with AVX-1024 to create a 3+ TFLOPS UPU. TSX will make it feasible to use that many cores efficiently for a wider range of applications.

Also keep in mind that leaving cores idle when there's nothing to do isn't a bad thing. If an architecture provides high ILP, high DLP and high TLP it can run any workload in the spectrum, but some will inherently not use all these features.
 
Transactional memory ... easy to write wrong, impossible to debug ... such a great combo. It's never going to be the saviour which will make parallel programming more ubiquitous. Parallel programming isn't easy and transactional memory is not always an efficient solution.

Anyway, if we are going to unify around something I'd like it to be something which can efficiently handle divergent workloads ... MIMD down to at least a VLIW level, perhaps scalar ... certainly not a 32-wide vector.
 
I think there's a slight difference in terminology here. Here's how the vector part of GCN works AFAIK:
- Single shared decoder that issues 1 vector instruction/cycle.
- 4x16-wide ALU pipelines with 4 cycles latency.
- One active 64-wide thread per ALU pipeline.

This results in the following behaviour:
- Decoder issues one instruction to each ALU pipeline every 4 cycles (round-robin).
- Each instruction takes 4 cycles to process, as the threads are 4x as wide as the ALUs.
-> By the time the scheduler comes back to this pipeline, it is ready to process the next dependent instruction.

Now in practice it's a bit more complex than that (the register file is probably the really interesting bit), but basically you don't need any ILP to achieve full throughput with the minimum number of threads. While you do need 256 vector elements (4x64-wide threads) to feed 64 vector elements of processing (4x16-wide ALUs), I think this is basically the exact same trade-off Nick is suggesting for AVX1024.
The 4 vector units in a CU can process separate instructions. This is why I recommend at least four 64-thread work groups per CU as opposed to 256-thread work groups.
 
Nick said:
No I really don't have a clue what bandwidth deficit you're talking about.
So you don't see a problem comparing FLOPs per watt between two devices where one has 128-bit DDR3 and the other has 256-bit GDDR5? Because if you can't, then...
 
What do you mean by zero latency if the ALU instruction is in the instruction cache? It always takes multiple cycles to have the result available for a dependent instruction.

No, the instruction cache has nothing to do with the latency I'm talking about. I'm talking about the ALU latency plus the bypass latency.
My only point was correcting the notion that all GPUs need to swap in other threads to hide ALU latency. As others have said there's effectively zero latency for AMD's GCN to execute dependent instructions. The actual latency is 4 cycles, but that's hidden by the wavefront executing in 4 groups.

You of course need a lot of data parallel work to fill up the GPU in the first place.
 
Right, and your proposed AVX-1024 vectors have 32. Come on, you have used that particular number so often in the course of this discussion.

I know that your point is not about 1024 bits in particular but generally widening the vector width, but you were talking a lot about that number.
That's only because Intel already mentioned AVX-1024 in 2008. Of all the current throughput-oriented architectures, Xeon Phi is 512-bit, NVIDIA's are 1024-bit, and AMD's are 2048-bit, so it's good to know that the CPU's VEX encoding can be extended to 1024-bit. The instruction set certainly needs to be fleshed out (which is what this thread was supposed to mainly be about), but things are going to be straightforward to add as extensions (at least from the ISA point of view). The x86 encoding format is ready for unified high-throughput computing.
For 1024-bit instructions the hit rate per instruction is going to be lower because of more capacity misses. CPUs are much more dependent on good hit rates, and that is why they won't profit as much from wide SIMD.
That is certainly true. It's even a consequence of any increase in performance: if the average scalar ILP increases, the hit rate of an equal cache would indeed go down. That decrease doesn't prevent the performance increase from happening in the first place though. And if you increased your ILP, you probably also had the transistor budget to sufficiently or fully restore the hit rate.

What's also critical to realize here is that this is every bit as bad for GPUs. They may not care much about hit rate for the latency impact, but they do care a lot about the bandwidth impact!

Finally, note that Xeon Phi has 512-bit SIMD and only 4-way round-robin SMT to hide latencies from cache misses, and seems to be doing alright. Haswell has out-of-order execution, and 2-way anything-goes SMT. Of course you need to take core count and SIMD count and cache sizes into account too, but I think you'll find that it's a very broad design space and the CPU should have no major trouble to further increase its SIMD width. One of the techniques that will go a long way is the use of long-running SIMD instructions, which could double the latency that can be hidden or more.

So this shows that a slightly lower cache hit rate could merely be the 'symptom' of much higher performance, for either architecture. I don't regard it as something that will require substantial changes beyond regular progress and the addition of long-running instructions. During high-throughput workloads CPUs would behave more like GPUs (the good and the bad), which is exactly what we want from a unified architecture.
 
RANT ON:
rcpps is just terrible, though reciprocal square root is just as bad.
Both drop denorms and are not very accurate; that part would be OK (though it has to be said the precision is really crappy, comparable to half-float precision).
But what makes it terrible:
- it does not return equality for 1.0. This is a real issue in quite some code.
- you can't fix up precision with Newton-Raphson. Well you can (and indeed just one Newton-Raphson step increases precision to the very useful range though IIRC 1.0 input still won't give 1.0 result) but it will turn 0 (or denorm) and Inf inputs into NaNs, which is often just unacceptable. You can then try to fix that up too but that's probably a couple cmpps/blendps instructions.
So in short, either you can live with the crappy accuracy (including the terrible 1.0 input case) or you just forget about it; the fixups just aren't worth it, at least not if you could live with somewhat lower accuracy but still need "real" float handling (Inf/NaN). FWIW on Ivy Bridge a divps is listed as 10-14 clocks latency, whereas rcpps only needs 5 clocks, but even with a single Newton-Raphson step (2 mul, 1 sub) you are already way over that (5 + 2 * 5 + 3), not including the fixups for the Inf/zero case. Sure, divps isn't pipelined, but unless you have a boatload of them it's probably not much of an issue that the throughput is lower.
llvmpipe actually has completely given up on using rcpps, though maybe could bring it back some day (if cases could be identified which don't need more accuracy).
Reciprocal square root is exactly the same mess, and it's a real pity there (because emulating a reciprocal square root requires TWO unpipelined instructions using the divider unit).
3DNow! actually had some special instructions to increase the accuracy of its rcp/rsqrt instructions; with such methods it would be possible to increase precision without having to sacrifice Inf/zero behavior (but of course 3DNow! couldn't deal with those in any case, so it wasn't a problem there). That's something which didn't carry over to SSE.
RANT OFF
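For reference, a minimal SSE sketch of the refinement and fix-up being described: the 2-mul/1-sub Newton-Raphson step plus a compare/blend to keep 0 and Inf inputs from turning into NaN (it does nothing about denorms or the 1.0 case):

```cpp
#include <immintrin.h>

static inline __m128 rcp_nr_fixed(__m128 x) {
    __m128 y0 = _mm_rcp_ps(x);                                   // ~12-bit estimate
    __m128 y1 = _mm_mul_ps(y0, _mm_sub_ps(_mm_set1_ps(2.0f),
                                          _mm_mul_ps(x, y0)));   // y0*(2 - x*y0): 2 mul, 1 sub
    // For x == 0, y0 == Inf and x*y0 == NaN; for x == Inf, y0 == 0 and x*y0 == NaN.
    // Fall back to the raw estimate (Inf or 0 respectively) in those lanes.
    __m128 is_nan = _mm_cmpunord_ps(y1, y1);
    return _mm_blendv_ps(y1, y0, is_nan);                        // SSE4.1 blendps
}
```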
No need to call this a rant as far as I'm concerned. I fully agree it should be replaced with something better. I'll look around for potential options. Thank you.
 
Most of the discussion here has been about microarchitecture instead of ISA, and rightly so.

The amount of active threads/instructions in flight is ultimately limited by your registers; you need a place to store results. A CU in an AMD GPU currently has 64KB of registers, Haswell has 5.25KB of AVX2 register space.

If CPUs are going to be competitive in throughput computing (graphics), they need to increase the number of instructions in flight and the amount of register space.
As I've discussed before, CPUs can achieve high throughput with fewer threads and registers by having lower latencies and by hiding them with out-of-order execution. It is highly debatable whether increasing the thread count would improve performance since Hyper-Threading only increases performance by 30% for the best known cases, and even with just two threads it is known to cause cache contention which nullifies any advantage in the worst cases. It is probably a better idea to hide more latency with long-running instructions since they keep the cache accesses coherent.

That said, long-running SIMD instructions do in fact increase the number of strands (i.e. work-items or scalar threads if you like), and require increasing the register space. But you still won't need anywhere near the same amount of work in flight or registers as the GPU does to achieve competitive performance. And that's a good thing considering Amdahl's law.
Another problem with optimizing for throughput is that work/energy favours wider, lower clocked implementations.
CPUs already use lower clocks when all cores are active, than when just one or a few are active. By keeping track of the amount of SIMD instructions they could adjust even more dynamically to high ILP or high DLP workloads or a mix of them.
Optimizing for throughput will inevitably lead to poorer single-thread performance, which is where x86 is king (and why it is the most successful architecture in the world), and hence change is slow (but there is change!)
That is clearly a false assumption. Haswell will be substantially faster at throughput computing than its predecessor and yet its single-threaded performance will also improve. I also see no reason to assume that widening the SIMD units to 512-bit and adding support for AVX-1024 would have any negative effect on single-threaded performance either. The new process node used for that would most likely allow increasing it again.

Unification doesn't have to be a compromise. It can be the best of both worlds.
 
The GPU's SIMD width is also unpartitionable. Could you elaborate on what you mean by unthreadable?
If 20 GPU SIMD-units are executing the same program, then you have a virtually very wide SIMD. But you don't need to, you can also run different programs per unit. It's flexible. A traditional CPU SIMD-register is not partitionable in this way...
The logical width of the SIMD operations is the minimum granularity for both the GPU and CPU. NVIDIA's warps are 32 elements while AMD's wavefronts are 64 elements. You can't "partition" it into anything smaller. The CPU's AVX-256 instructions can be used to process 8 32-bit elements. So its granularity is in fact smaller than that of the GPUs.
With the ability to partition SIMD-units you gain also the ability to re-schedule the instruction-streams for one particular SIMD-unit, program-flow between SIMD-units is completely independent.
No. All of the lanes of the GPUs SIMD units perform the same operation. That's why it's called Single Instruction Multiple Data. It's the exact same thing on a CPU, just with different width (for now).
 