You just contradicted yourself. And no, you can't strictly partition the data into things that benefit from moving a shorter distance and things that don't. If that were the case, why use the LLC instead of RAM if it all stays put anyway? Clearly you think something is worth sharing, so why stop at the LLC?
I don't get your reasoning at all. Regarding your question: it is usually more efficient to share it at a higher level, as the lower levels get thrashed (the data doesn't fit in) and the access latency is actually lower (you don't have to do a round trip to other cores' lower-level caches to get the data, just to a common higher-level cache, which is usually faster). And as you yourself said, the bandwidth of an L1-L2 interface is large enough, and the potentially shared data in the L1 small enough, that it is no problem to push it out to a higher level. It is basically nothing other than what a multicore CPU does to ensure coherency anyway. It doesn't matter whether those are symmetric cores or a mix of latency-optimized and throughput-optimized cores. The TOCs will just operate on larger datasets on average. Differences in the expected workload may also favour other choices for the design of the lower-level caches (like the L1).
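To make the latency part of that concrete, here is a toy model; the cycle counts are purely illustrative assumptions, not measurements:

# Toy latency model for the point above; cycle counts are illustrative assumptions only.
SHARED_CACHE_HIT = 30           # hit in the common higher-level cache (assumed)
PEER_FORWARD = 15               # extra hop to pull the line out of another core's lower-level cache (assumed)

def shared_read_latency(line_sits_in_peer_l1: bool) -> int:
    """Cycles to read shared data that is NOT in the requesting core's own L1."""
    # Keeping shared data low means going to the common level *plus* the peer's cache;
    # keeping it in the common higher-level cache avoids the second hop.
    return SHARED_CACHE_HIT + (PEER_FORWARD if line_sits_in_peer_l1 else 0)

print(shared_read_latency(True))    # 45 cycles: data parked in another core's L1
print(shared_read_latency(False))   # 30 cycles: data kept at the shared level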
Interesting. Could you point me to anything that describes it? Thanks.
Just look in any ISA manual for the VLIW architectures. The VLIW architectures explicitly controlled this behaviour through the instructions. If one wanted to use the result of an operation as input to a back-to-back dependent operation, one had to explicitly use the result of the preceding instruction as the source instead of a register (the compiler did this automatically; as the manual stated, that result was not written back to the registers). This encoding into the instructions makes it simpler (the processor doesn't have to detect such dependencies, the behaviour of the bypass multiplexers is controlled directly by the instruction) without loss of functionality in the case of in-order execution with a fixed latency for all ALU operations (as in AMD's VLIW architectures).
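A toy model of the principle (plain Python; the encoding and field names are made up for illustration and have nothing to do with the real TeraScale format):

# Toy model of instruction-controlled bypassing (hypothetical encoding, not a real ISA).
# Each source operand states explicitly whether it reads a register or the previous result.

from dataclasses import dataclass

@dataclass
class Src:
    from_prev: bool      # True -> take the bypass (result of the preceding instruction)
    reg: int = 0         # register index, only used when from_prev is False

@dataclass
class Insn:
    op: str
    a: Src
    b: Src
    dst: int

def run(program, regs):
    prev = 0                              # the "previous result" latch driving the bypass mux
    for insn in program:
        a = prev if insn.a.from_prev else regs[insn.a.reg]
        b = prev if insn.b.from_prev else regs[insn.b.reg]
        prev = a + b if insn.op == "add" else a * b
        regs[insn.dst] = prev             # write-back (the quoted manual says the real hardware
                                          # suppresses this; kept here to keep the toy simple)
    return regs

# r2 = r0 + r1; r3 = (previous result) * r1 -- the second instruction encodes the bypass explicitly
regs = run([Insn("add", Src(False, 0), Src(False, 1), 2),
            Insn("mul", Src(True), Src(False, 1), 3)],
           {0: 3.0, 1: 4.0, 2: 0.0, 3: 0.0})
print(regs[3])   # 28.0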
That's what I said. VLIW5 and even VLIW4 became overkill due to changes in workload characteristics, so they switched to single-issue (which I don't think is the right move since modest multi-issue does have valuable benefits).
Multi-issue costs complexity (which you can try to minimize with static, compiler-determined scheduling) and therefore die space and power. It's not so easy for armchair experts like us to say what the right choice is.
I understand your question is genuine but I'm not going to derail this thread with a big AMD versus NVIDIA architecture discussion. Anything relevant to unified architectures has already been discussed and it would take too much weeding through other architectural differences to get to a conclusive agreement about single-issue versus dual-issue. This is a tiny difference anyway compared to the CPU's IPC versus the GPU's CPI which is far more relevant to the topic here. If you're adamant about it I'd be happy to share my opinions on GPU to GPU differences if you cared to create a new thread about it.
It has nothing to do with AMD vs. nV. I was just curious where your claim (that dual issue is the optimum for a GPU) comes from. I would think this really depends on a lot of factors, like the overall design and the expected workload, and can't be judged in isolation. If you don't want to answer it here, just write me a PM.
I know these things but I have to admit I'm not entirely sure what to call this multi-issue-but-not-from-a-single-thread behavior.
For throughput tasks, single-thread behaviour (the "thread" could also be a subset of the data elements) isn't very important. One GCN processing core (a CU) can, in a single cycle, issue multiple instructions for a subset of the many threads running on it. On that level, it's the same as what an nVidia SM does (and different from a VLIW CU, which only issues for a single thread in a given cycle).
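Roughly like this, if you want the rule spelled out (a simplified Python sketch with made-up structures; the real hardware has more restrictions than shown here):

# Simplified sketch of per-cycle issue in a throughput core: several instructions may issue
# per cycle, but at most one per execution-unit type and at most one per wavefront,
# so single-thread ILP is never exploited.

from collections import deque

def issue_cycle(wavefronts):
    """wavefronts: list of deques of (unit_type, insn) pairs. Returns what issues this cycle."""
    issued, used_units = [], set()
    for wf_id, wf in enumerate(wavefronts):
        if not wf:
            continue
        unit, insn = wf[0]
        if unit not in used_units:          # one instruction per unit type per cycle
            used_units.add(unit)
            issued.append((wf_id, wf.popleft()[1]))
        # issued or not, move on: only one instruction per wavefront per cycle
    return issued

wfs = [deque([("vector_alu", "v_add"), ("vector_mem", "load")]),
       deque([("vector_alu", "v_mul"), ("scalar", "s_cmp")]),
       deque([("scalar", "s_add")])]
print(issue_cycle(wfs))   # [(0, 'v_add'), (2, 's_add')] -- wavefront 1 waits, its v_mul lost the vector ALU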
Exactly. So we can't compare them as such. Their usage and purpose differ greatly, but the results obtained are much more closely comparable than the differences appear to lead some to assume.
What you still ignore to a large extent is the argument about the differing amounts of parallelism (and how it's exploited) in different workloads. What is your graph (which you've now linked for the second time; I already commented on it) really showing?
That would fall under the definition of "trading blows". I gave examples of CPUs beating (integrated and discrete) GPUs. You're giving examples of (discrete) GPUs beating CPUs. Fits my argument just fine, especially since integrated GPUs are far weaker.
No. What you demonstrate is the variety of workloads, and that a certain type of processor is vastly better suited to one kind of workload while another type of processor is vastly better at dealing with other workloads. Defining that as "trading blows" appears a bit ridiculous.
NVIDIA's warps are 32 elements, but Kepler has 32-element SIMD units, so they only take one cycle (at least for the regular 32-bit arithmetic ones).
Are there really 32 physical slots in each SIMD unit, or just 16 (still single-cycle issue, but then the unit is blocked for issue in the next cycle)? The latter would fit better with the number of schedulers in each SM.
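The difference is just how long one warp instruction occupies the unit; a trivial check with both assumed lane counts:

# Quick sanity check: cycles a 32-element warp occupies a SIMD unit, for both assumed lane counts.
WARP_SIZE = 32

for lanes in (32, 16):                      # the two possibilities discussed above
    cycles = WARP_SIZE // lanes             # issue is blocked for this many cycles per instruction
    print(f"{lanes} physical lanes -> {cycles} cycle(s) per warp instruction")
# 32 physical lanes -> 1 cycle(s) per warp instruction
# 16 physical lanes -> 2 cycle(s) per warp instruction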
Anyway, that is not the topic here.
GPUs do have to hide ALU latency. It is larger than one cycle so they have to swap threads to hide it.
GCN apparently doesn't have to do so. You only have to do that if latency > throughput. AMD's GPUs tend to keep latency = throughput (to some extent that is also true for the VLIW architectures; wavefronts are not exactly swapped on an instruction-to-instruction basis but only for larger instruction groups called clauses; the physical latency is a fixed 8 cycles, a wavefront gets a new VLIW instruction every 8 cycles, so the apparent latency is none, and exactly two wavefronts are processed in parallel at any given time).
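To put some approximate numbers on the latency = throughput point (the GCN cadence is my understanding of it, the VLIW figures are the ones stated above):

# Rough illustration of "latency = throughput": a wavefront only stalls on a back-to-back
# dependency if the ALU latency exceeds its own issue interval. Figures are approximate.

def stall_cycles(alu_latency, issue_interval):
    """Extra cycles a dependent instruction of the same wavefront has to wait."""
    return max(0, alu_latency - issue_interval)

# GCN: a 64-wide wavefront over a 16-wide SIMD -> one vector instruction per wavefront every
# 4 cycles, and the vector ALU latency is also ~4 cycles.
print(stall_cycles(alu_latency=4, issue_interval=4))    # 0 -> no thread swapping needed

# VLIW: fixed 8-cycle physical latency, a new VLIW instruction per wavefront every 8 cycles
# (two wavefronts interleaved), so again no apparent latency.
print(stall_cycles(alu_latency=8, issue_interval=8))    # 0

# A design where latency > issue interval has to switch to other threads to hide the gap:
print(stall_cycles(alu_latency=10, issue_interval=1))   # 9 cycles to hide by swapping threads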
The problem with the GPU's L1 cache isn't so much latency as it is bandwidth. GK104 is at 0.33 bytes per FLOP, whereas Haswell can do 4(+2) bytes per FLOP. Of course, as noted before, the usage of registers and caches is different between the GPU and the CPU, and in this case the high cache bandwidth of the CPU isn't that huge an advantage because it needs the L1 cache for more temporary variables (for which the GPU has its larger register file).
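Roughly where those figures come from (the Haswell side follows from its per-cycle capabilities, and the FMA apparently has to be counted as a single operation to arrive at 4(+2); the GK104 line just inverts the quoted number):

# Back-of-the-envelope for the bytes-per-FLOP figures above.

# Haswell core, per cycle:
simd_ops_per_cycle = 2 * 8            # two 256-bit FMA ports x 8 SP lanes (FMA counted as one op)
l1_load_bytes      = 2 * 32           # two 32-byte loads per cycle
l1_store_bytes     = 1 * 32           # one 32-byte store per cycle

print(l1_load_bytes / simd_ops_per_cycle)    # 4.0 -> the "4" in 4(+2) bytes per op
print(l1_store_bytes / simd_ops_per_cycle)   # 2.0 -> the "(+2)"

# GK104: 8 SMX x 192 lanes = 1536 ops per cycle; the quoted 0.33 B/FLOP then implies an
# aggregate L1 bandwidth of roughly 0.33 * 1536 bytes per clock, i.e. ~64 bytes/cycle per SMX.
print(0.33 * 8 * 192)                        # ~507 bytes per clock across the chip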
I definitely agree with the second sentence.
It may also be interesting to look at some size and bandwidth numbers (even if they don't tell us too much, they give a feeling for what the relevant figures are). A single GCN CU (the Tahiti ones with the higher DP speed and ECC; the other GCN members use slightly smaller ones) measures about 5.5mm² in 28nm, while a Haswell core in 22nm measures about 14.5mm² (which would be 23.5mm² normalized to 28nm assuming perfect scaling). A GCN CU @ 1 GHz can do 128 GFLOPS (SP), Haswell at a somewhat optimistic 4 GHz the same (it is faster at DP though). The GCN CU integrates 256kB of vector registers with an aggregate bandwidth of ~1 TB/s, 8kB of scalar registers with a bandwidth of 16GB/s, 64kB of shared memory with a bandwidth of 128GB/s, and 16kB of vector memory L1 cache with a bandwidth of 64GB/s (the L1-L2 connection also provides 64GB/s). Furthermore, it can access the I$ with 32GB/s (one needs fewer instructions for the same arithmetic throughput than a CPU does) and the scalar data cache with about 16GB/s.
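Most of those numbers fall straight out of the per-cycle widths at 1 GHz; a quick sanity check (the operand count per FMA is my assumption):

# Sanity check of the GCN CU figures at 1 GHz (at 1 GHz, X bytes/cycle is simply X GB/s).
clock_ghz    = 1.0
vector_lanes = 4 * 16                 # four 16-wide SIMDs per CU

# Peak SP arithmetic: one FMA per lane per cycle, counted as 2 FLOPs.
print(vector_lanes * 2 * clock_ghz)             # 128 GFLOPS

# Vector register bandwidth: an FMA needs 3 x 4-byte reads and 1 x 4-byte write per lane.
print(vector_lanes * (3 + 1) * 4 * clock_ghz)   # 1024 GB/s, i.e. ~1 TB/s

# The cache/LDS numbers above map directly to bytes per cycle at this clock:
for name, gbps in (("LDS", 128), ("vector L1", 64), ("L1-L2 link", 64), ("I$", 32), ("scalar", 16)):
    print(f"{name}: {gbps} GB/s -> {gbps / clock_ghz:.0f} bytes/cycle")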
Haswell, on the other hand, has 168 physical registers each for integer (64 bits) and floating point/SIMD (256 bits). That's ~1.3 kB of integer registers and 5.25 kB of SIMD registers. Let's concentrate on the SIMD part. The reg file probably has 6 read ports and 3 write ports, or something in that range. That would be a total bandwidth of ~1.1 TB/s (it has to be roughly the same, as the arithmetic throughput is the same as that of a GCN CU). The 32kB L1 cache offers a bandwidth of 256 GB/s for reads and 128GB/s for writes, the 256kB L2 cache can be accessed with 256GB/s, and the L1I cache with 64GB/s.
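And the same exercise for the Haswell side (the SIMD port counts are the guess stated above, the rest follows from the per-cycle widths), plus the resulting area ratio:

# The arithmetic behind the Haswell figures; the SIMD port counts are the guess from above.
clock_ghz = 4.0

print(168 * 8 / 1024)              # ~1.31 kB of 64-bit integer registers
print(168 * 32 / 1024)             # 5.25 kB of 256-bit SIMD registers

# SIMD reg file bandwidth with the assumed 6 read + 3 write ports of 32 bytes each:
print((6 + 3) * 32 * clock_ghz)    # 1152 GB/s, i.e. ~1.1 TB/s

# L1D: two 32-byte loads and one 32-byte store per cycle:
print(2 * 32 * clock_ghz)          # 256 GB/s for reads
print(1 * 32 * clock_ghz)          # 128 GB/s for writes

# Normalized area ratio from the die sizes above (28nm equivalents):
print(23.5 / 5.5)                  # ~4.3x the area of a GCN CU for the same SP throughput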
So in the end, a Haswell core needs about 4 times the normalized area (and, at 4 GHz, a lot of additional power) to provide the same arithmetic throughput and a comparable amount of SRAM in its L1+L2 caches, with somewhat comparable bandwidth numbers (lower than the comparably sized reg files of a GPU of course, but higher than their smaller caches).
So both have qualities and limitations that cancel out against each other to some degree.
Yes, but in the end we are left with the much higher die area and power required for the same throughput. What that means for throughput-oriented tasks is clear. It is even clearer what it means for graphics, as the die size number I gave already includes the conversion and filtering as well as the texture decompression logic contained in the TMUs.
It means unification is closer than you might think from comparing these kinds of numbers without the broader context.
What I left out above is the scalar performance. There, a Haswell core shines against a GCN CU. It simply puts much more emphasis on that area. That is what it means.
That's irrelevant. It only took one example of a GPU not "utterly destroying" a CPU to prove that this is not "always" the case. The "always" qualifier gave me free rein to look for any example fitting the definition of "huge out of order speculative cores" and "a GPU like throughput oriented architecture".
You again forget about the massive differences existing between workloads. The "utterly destroyed" referred to a certain class of workloads (Novum was alluding specifically to graphics!) for which it is true. And the "always" was meant as a temporal qualifier, i.e. it has been that way since the inception of GPUs and will be the same in a few years' time (okay, always is quite strong; but you can't know for sure that it's wrong, and your reasoning doesn't make much sense in this context, as the statement is definitely right for the foreseeable future).