Software/CPU-based 3D Rendering

What's your take on game physics? Not much needed so CPU is fine? Or have the GPU physics efforts not panned out?
That's a tough one... I have argued in the past and will still argue that there's nothing fundamental about most physics stuff that can't be done efficiently on a GPU; in a lot of ways graphics is basically just a series of collision detection problems and related algorithms, so there's a lot of similarity. That said, as you scale complexity you typically need to move to more complicated data structures (in both areas) so GPU physics is going to face similar challenges there.

GPUs are obviously still very suitable for stuff like particles and fluid simulation (especially grid-based solvers), but honestly the crippling issue with using them for anything more than "effects" is the discrete memory/PCI-E bus, and the APIs that are built on top of that model. Stuff is still fairly set up for "pure feed forward" type systems which is inappropriate for physics, and increasingly for graphics.

So yeah, good question and I don't have a firm stance on it really. I think ultimately the gains are probably not going to be worth it for discrete stuff... but it's possible that with some API changes it ends up being a win in unified memory spaces/integrated GPUs; we'll see.

CUDA has supported recursion and proper function calls since Fermi.
Through function pointers/unknown call sites/targets? I don't think so...

As for register spill, GPUs handle it in exactly the same way as CPUs - by dumping things out to memory. For what it's worth, register renaming can't prevent register spilling.
Ah but on the CPU the L1$ basically acts like a dynamic register file. In a lot of cases, spilling to L1$ can be entirely "free", hence why ABIs save/restore register state for function calls, etc. and only the smallest of functions are really worth inlining these days. On a GPU you have to spill to global memory, which is a massive performance cliff that you basically can't afford to do, hence why so much optimization in CUDA, etc. is around "occupancy" and optimizing register pressure - because GPUs are really not capable of handling that efficiently in a "dynamic" manner (i.e. depending on runtime code path selection). Thus having a single register-heavy path present in a kernel - even if it never gets taken at runtime - can reduce the performance of the entire kernel. That's not really acceptable going forward...

So why then are GPUs able to fit this extra power into their budget but CPUs are not?
Mostly because their budget is larger (200-300W vs. <100W)...?

Also realize that the high-end of CPUs is increasingly server-grade CPUs, and those are probably *more* concerned with efficient power usage than mainstream stuff.

As for power consumption, my numbers for a system with a high end CPU and GPU, running at 100% load 24/7 works out to about $200 per year, so power consumption isn't really all that important, especially since even gamers aren't likely to pull even 10% of this. Thus, the question is about performance, to the greatest extent possible without melting anything.
You missed my point... it's not (directly) your hydro bill that matters, it's that increasingly going forward all parts will be power-constrained... to some extent they are already. i.e. the power efficiency of an architecture will completely dictate its performance for a given form factor. Given a target like - say - a 25W laptop, what's the best trade-off of architectures to put in there for a given workload (say a game)? That's the interesting question going forward, not whether a 300W part is faster than an 80W part. If you're not normalizing on power usage I could just respond to every point made by saying "but a cluster of CPUs/GPUs/FPGAs/whatever will be faster...".

And sure, performance/$ is relevant when buying hardware, but ultimately in the long term trend that is largely determined by power efficiency as well (building smaller chips that use less power generally means higher margins). Just in the short term it's also determined by other unrelated economic factors that frankly aren't that interesting :)

Fermi and later can access memory across the PCIe bus, so if you really need it, it's there (though you'll only get maybe 12 GB/s for PCIe 3, iirc). Still, if you look at the VRAM as a 2 GB L3 cache, with about 4x the bandwidth of a CPU's L3, it looks pretty nice. I don't believe that there are many algorithms that use this amount of memory that you can't do in an out of core way?
Uhh, 12GB/s - even if you could realize that in practice - is far too slow to even bother doing most stuff on the GPU (there's a reason why papers typically remove transfer times...). For big server applications they put hundreds of GBs of RAM in the machines so they can have entire databases in RAM and so forth. Arguing that a GPU adds anything useful to that equation is the same as arguing that GPUs don't really need GDDR/memory bandwidth to be effective, which seems to be the opposite of what you were previously saying...
 
Yes that's what I meant - ultimately we have to drop the concept that the GPU must know - before launch of a kernel - the exact register file requirements of the kernel, and thus by extension the entire set of potentially callable code for that kernel.
Why would something like that be a hindrance with the advent of dynamic parallelism?

AFAICS, you are arguing for something like a CPU, where the register file size is fixed and any function, no matter how many registers it needs, can be called via the call stack and stack variables. Well, for variables that are victims of register spills, allocating them out of a static pool is decisively better from a power perspective. And dynamic parallelism fills in for the call stack with a custom hw/sw solution, which is probably more efficient from a power perspective than a rigidly defined ABI + sw based stack.
 
That's a tough one... I have argued in the past and will still argue that there's nothing fundamental about most physics stuff that can't be done efficiently on a GPU; in a lot of ways graphics is basically just a series of collision detection problems and related algorithms, so there's a lot of similarity. That said, as you scale complexity you typically need to move to more complicated data structures (in both areas) so GPU physics is going to face similar challenges there.

GPUs are obviously still very suitable for stuff like particles and fluid simulation (especially grid-based solvers), but honestly the crippling issue with using them for anything more than "effects" is the discrete memory/PCI-E bus, and the APIs that are built on top of that model. Stuff is still fairly set up for "pure feed forward" type systems which is inappropriate for physics, and increasingly for graphics.

So yeah, good question and I don't have a firm stance on it really. I think ultimately the gains are probably not going to be worth it for discrete stuff... but it's possible that with some API changes it ends up being a win in unified memory spaces/integrated GPUs; we'll see.
This sounds more like (and I share this view) a critique of the API and PCI than the GPU architecture itself.

Ah but on the CPU the L1$ basically acts like a dynamic register file. In a lot of cases, spilling to L1$ can be entirely "free", hence why ABIs save/restore register state for function calls, etc. and only the smallest of functions are really worth inlining these days.

Latency free? Sure. Power free?.....
 
Why would something like that be a hindrance with the advent of dynamic parallelism?
"Dynamic parallelism" helps (if it were on consumer cards yet :S), but it still requires dumping a lot of state to memory and reloading it later, since it is fundamentally a "breadth first" sort of construct. So you can't get any of the nice memory locality effects that most work stealing systems can accommodate. Still, I'm thrilled that we at least have *some* way of launching additional work... or will soon if they roll it down to the consumer lines.

AFAICS, you are arguing for something like a CPU, where the register file size is fixed and any function, no matter how many registers it needs, can be called via the call stack and stack variables. Well, for variables that are victims of register spills, allocating them out of a static pool is decisively better from a power perspective. And dynamic parallelism fills in for the call stack with a custom hw/sw solution, which is probably more efficient from a power perspective than a rigidly defined ABI + sw based stack.
I don't think I buy that hand-waving frankly :) Something like the quad-tree based recursion that I described last SIGGRAPH is pretty efficient on a cached, work-stealing scheduler, and I just don't see coming close if you have to spill any part of the transient state while traversing the tree (light lists, etc.), such as you'd have to do with something like dynamic parallelism, which is basically the ability to launch another *layer* in the tree dynamically, rather than recursing into it in a cache-friendly depth-first style (from what I hear, continuations are sort of discouraged with dynamic parallelism since they *do* involve spilling the entire context to memory). And you don't want to base your occupancy on the "worst case", since the leaves are ultimately going to call arbitrarily expensive shading functions (but not all workers will be running expensive shaders at the same time).
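By way of contrast with per-layer launches, here is a rough, hypothetical sketch of keeping the recursion depth-first inside a single kernel with a small per-thread stack. This is not the SIGGRAPH technique referred to above; the node layout and the stand-in "shading" are invented, and real traversal would add culling:

Code:
struct QuadNode {
    int firstChild;   // index of the first of four children, or -1 for a leaf
    int lightBegin;   // start of this node's range in a global light-index buffer
    int lightCount;
};

// One thread walks the tree depth-first; the transient state (stack, accumulator)
// is bounded by tree depth and can stay in registers/local memory rather than
// being rebuilt for every launched layer.
__global__ void traverseQuadTree(const QuadNode* nodes, const int* lightIndices,
                                 float* output, int numItems)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numItems) return;

    int stack[64];            // DFS pushes at most ~3 extra entries per level
    int top = 0;
    stack[top++] = 0;         // start at the root

    float accum = 0.0f;
    while (top > 0) {
        QuadNode n = nodes[stack[--top]];
        if (n.firstChild < 0) {
            // Leaf: consume the light list while it is still hot in the cache.
            for (int l = 0; l < n.lightCount; ++l)
                accum += 0.001f * lightIndices[n.lightBegin + l];  // stand-in for shading
        } else {
            for (int c = 0; c < 4; ++c)
                stack[top++] = n.firstChild + c;                   // push the four children
        }
    }
    output[i] = accum;
}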

Ray tracing is even worse really... the way that it gets handled in OptiX is really the best you can do with GPUs, but it still involves spilling lots of ray states to memory so that you can reorganize them and run slightly more efficient uber-kernels for the following bounce(s). And from what I hear, there's still a lot of tweaking register pressure, etc...
 
This sounds more like (and I share this view) a critique of the API and PCI than the GPU architecture itself.
I agree that's a lot of where the problem lies. On integrated parts the waters are muddier, but there it's also reasonable to be a lot more critical of stuff like duplicated ALUs between CPU/GPUs... something which doesn't seem necessary in the long run.

Latency free? Sure. Power free?.....
Not free power-wise, but almost any sin on-chip is better than going to DRAM... at the very least on GPUs you'll end up losing your ability to hide latency due to bloated occupancy, at which point GPUs don't tend to look very fast anymore.
 
Through function pointers/unknown call sites/targets? I don't think so...
Perhaps you should recheck. CUDA allows you to do arbitrary things with function pointers, as long as you compile for Fermi or later. For example, you can take the address of a device function, store it in DRAM, and then use it in another kernel. C++ virtual functions work in CUDA with this mechanism. You can read the Fermi ABI documentation if you're interested.
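A minimal sketch of the mechanism being described, assuming compilation for sm_20 or later; the names are illustrative, not from any NVIDIA sample:

Code:
#include <cstdio>

typedef float (*UnaryOp)(float);

__device__ float square(float x) { return x * x; }

__device__ UnaryOp g_op;                    // function pointer stored in device memory

__global__ void storePointer()
{
    g_op = square;                          // take the address of a device function
}

__global__ void usePointer(float* out)
{
    out[threadIdx.x] = g_op(threadIdx.x);   // indirect call through an unknown target
}

int main()
{
    float* d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    storePointer<<<1, 1>>>();
    usePointer<<<1, 32>>>(d_out);
    float h_out[32];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("%f\n", h_out[5]);               // expect 25.0
    cudaFree(d_out);
    return 0;
}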
 
GPUs are obviously still very suitable for stuff like particles and fluid simulation (especially grid-based solvers), but honestly the crippling issue with using them for anything more than "effects" is the discrete memory/PCI-E bus, and the APIs that are built on top of that model. Stuff is still fairly set up for "pure feed forward" type systems which is inappropriate for physics, and increasingly for graphics.
I agree that this is currently a big problem for PCs. On consoles you can use the GPU for a much wider field of number crunching (data that needs to get back to the CPU), since you have unified memory and much lower latency. Even simple things such as using a GPU-generated depth pyramid for occlusion culling (to skip draw calls) are hard to do efficiently on PC, as you need to be prepared for up to 2 frames of latency before you can get that data to the CPU (or 3-4 frames on SLI/Crossfire + other complications).

I have been programming a GPU-driven rendering pipeline lately (the goal is to have all viewport culling, render setup, occlusion culling, virtual texture/mesh cache processing, etc. done entirely on the GPU). You can do all this quite efficiently with the current generation of GPUs using compute shaders (+ indirect dispatch and indirect draw calls). The biggest limitation with current hardware + the DirectX 11 API is that the GPU cannot invoke new kernels on its own, so the CPU has to invoke the kernels (even though the indirect kernel invocations and indirect draw calls give the GPU the flexibility to calculate the number of threads / primitives on its own). This limitation for example means that object / particle depth sorting on the GPU cannot efficiently use divide & conquer algorithms (such as merge sort or quick sort), because those require log(n) kernel invocations, where n is the number of elements (and n is unknown to the CPU). Fortunately radix sorting is the most efficient GPU sorting technique, and it has O(n) running time (and a static number of kernel invocations). I am hoping that DX12 will feature some kind of GPU-side kernel invocation method. Kepler K20 already has support for that, but that's not a card available to gamers... and you cannot of course demand that your game requires a K20 to run on PC :)
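To make the launch-count point concrete, here is a sketch in CUDA-style host code rather than DX11 compute (the kernel bodies are stubbed out on purpose, and a/b are assumed to be device ping-pong buffers; only the host-side launch pattern matters here): a host-driven merge sort needs ~log2(n) dependent launches with n known to the CPU, while a 32-bit radix sort needs a fixed number of passes.

Code:
#include <utility>   // std::swap

// Stub kernels: the real merge/radix steps are omitted on purpose.
__global__ void mergePass(const int*, int*, int, int) { }
__global__ void radixPass(const int*, int*, int, int) { }

void sortMerge(int* a, int* b, int n)             // n must be known host-side
{
    for (int width = 1; width < n; width *= 2)    // ~log2(n) dependent launches
    {
        mergePass<<<(n + 255) / 256, 256>>>(a, b, n, width);
        std::swap(a, b);
    }
}

void sortRadix(int* a, int* b, int n)
{
    for (int shift = 0; shift < 32; shift += 4)   // always 8 passes, independent of n
    {
        radixPass<<<(n + 255) / 256, 256>>>(a, b, n, shift);
        std::swap(a, b);
    }
}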

Moving graphics processing entirely to GPU actually seems to work quite nicely. But that pretty much requires no feedback to CPU (just a list of virtual texture page misses and mesh page misses, that CPU should load from HDD). However moving any task to GPU that requires latency sensitive feedback to CPU is not going to happen before PC discrete GPUs are dead, or we have some kind of wide bus between CPU and GPU and unified coherent memory system.
Still, if you look at the VRAM as a 2 GB L3 cache, with about 4x the bandwidth of a CPU's L3, it looks pretty nice.
6 core Sandy Bridge E reaches 256 GB/s in the Sandra L3 cache benchmark: http://www.tomsitpro.com/articles/xeon-e5-2687w-benchmark-review-cores,2-288-5.html (last chart). It's not a 4x BW difference, a tie at best (*). And GDDR5 has over 10x higher latency compared to a CPU's L3 cache. Latency does matter. The higher memory latency you have, the more concurrent threads you need to hide the latency. The more concurrent threads you need, the bigger L1/L2 caches and bigger register files you need (in order to keep all the concurrently used data close to the processing units). GPUs aren't immune to memory latency either. They too need to spend transistors to hide it, one way or another. If you do heavy random texture accessing on the GPU and profile the results, you will notice that the huge majority of the execution time is spent waiting for texture cache misses. Adding more (SMT) threads to a workload that already constantly thrashes the L1/L2 cache does not help at all. Larger cache lines make the random access problem even worse (as the percentage of the loaded data that goes unused doubles). Thus memory techniques that require larger cache lines are not always a win.

(*) 8 core Sandy Bridge E chips (20 MB of L3) have even higher L3 bandwidth. The L3 in Sandy Bridge E is split into several cache banks. The more banks you connect to the ring bus, the more L3 bandwidth you have. The 14 core Ivy Bridge E chips (Q3 next year) will have even more L3 banks (25 MB of L3 according to the "internets") and DDR4. But these will of course be competing against the next gen GPUs (384 bit bus for Nvidia + faster GDDR clocks for both AMD/Nvidia), so let's continue that discussion at a later stage, when we know all the facts :)
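The latency-hiding argument can be put in Little's-law terms; the numbers below are illustrative (GDDR5-era ballpark), not measurements from any of the posts above:

\[
  \text{data in flight} = \text{bandwidth} \times \text{latency}
  \approx 192\,\mathrm{GB/s} \times 400\,\mathrm{ns}
  \approx 77\,\mathrm{KB}
  \approx 600 \text{ cache lines of } 128\,\mathrm{B}
\]

So roughly 10x the latency at the same bandwidth means roughly 10x the outstanding requests, which is exactly where the extra threads, registers and cache capacity go.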
 
You missed the point. Everyone's concentrating on lowering power consumption now.
I've pointed to NTV as being an interesting tool for HPC and mobile graphics. For HPC especially, the goals put forward for Exascale computing set a rough time frame of 2018 for Intel, and 2020 for others.
NTV has been implemented at 32nm, so it can be in the development pipeline prior to 2018 or 2020.
I'm not disputing that awesome things are in the pipeline for after 2020, but it's almost trivial to say that there's almost always something awesome way around the bend.
Until then, there may be years for which NTV is a decent enough choice.

Whenever tunnelling heterojunction nanowire transistors become viable, then NTV may be jettisoned if it provides no significant benefit. It's a tool for the time period that it's needed and can be thrown out when stale. It's not a spouse.

The original Pentium running at several hundred MHz was a design from 1993. So you'll have to wait about 20 years for NTV technology to give your GPU the same reduction in power consumption achieved with that Pentium, but still at today's performance level.
Should I point out that the pipeline complexity of modern mobile GPUs is many years behind that of high-speed modern OoO CPUs?
To clarify, incarnations of the Pentium core at several hundred MHz were being introduced until 1997. Not that it matters, since I think you are making a more fundamental assumption: that design X at clock speed N, if translated to NTV, is design Y at clock speed N/10.
That isn't the case, any more than a Pentium ported to 22nm is going to be running at 3 GHz.

There isn't a linear relationship to the difficulties and demands for running at a given speed. The curve gets a lot steeper at the upper end, and it's much more forgiving at the lower.


Intel is already on top of all of this for its CPUs: using 8T SRAM and early adoption of FinFET.
Neither of those is in the same vein as NTV. FinFET isn't about NTV either, although it probably makes it more compelling, since the subthreshold slope is steeper than what was available for the 32nm Rosemont.

GPUs suddenly using full NTV technology doesn't allow to catch up with this in any way.
If GPUs suddenly sprouted NTV capability today and they started giving 40-50 DP GFLOPS/W for HPC and you could run games at 2560x1600 in less than 100W, you'd better believe people would notice.

It has to be much better than that. The Radeon 7970 has 65% more transistors than the 6970, and offers 40% higher performance in practice (with only 50% higher bandwidth).
I've found later reviews (from Anandtech, using more recent drivers) putting it in the ~50-60% range. The Gigahertz edition adds further margin, but it increases power consumption to do so.

And it will surely be extended upon. Gating is a hot research topic (pun unintended). It's obviously disingenuous to look at an Alpha 21264 when discussing the power consumption per instruction of modern CPUs, but it's equally pointless to only look at today's CPU designs when discussing their future scaling potential. For instance branch prediction confidence estimation is said to save up to 40% in power consumption while only costing 1% in performance due to false negatives.
It's not readily comparable, but the Alpha 21264 is also at this point conservative in many respects to a modern OoO core. Its scheduling and issue capabilities are significantly more restricted, and its branch predictor is somewhere between 2-8 times smaller. The error bars on the prediction would be narrower, but there has been a proliferation of predictors in the front end since then and a reluctance in the latest generations to disclose hard numbers. It's probably something somewhat over 4 times in terms of global history, and then there are possibly 1-2K of storage in different predictor types past that.
Since the branch predictor in particular is a very large component of the energy cost, it's something to be aware of. In size terms alone, even a 40% power savings with confidence prediction puts a modern branch predictor's power consumption significantly higher than that of a small predictor from a decade ago.


I'm obviously just scraping the surface here. There's hundreds if not thousands of researchers and engineers working on stuff like this. Besides, both the CPU and the GPU have the same switching activity reduction problem.
One of them has transistors switching at or below 1 GHz (mobile and integrated GPUs generally run below even that), while the other switches at multi-GHz clocks. This makes a very large difference.

Developers are not very willing to jump through many hoops for extra performance. The failure of GPGPU in the consumer market clearly illustrates this.
GPGPU is not a primary concern for me. If silicon doesn't like being pushed too far out of its sphere, I'm not averse to gating it off. Perhaps one of the ways it was stymied was the introduction of fixed-function encoding and decoding hardware present in most modern consumer CPUs and mobile SOCs.

Keeping track of the ratio of long-running SIMD instructions that are executed seems pretty straightforward. And while I never said that only one type of core could have access to such knowledge, I don't see what a non-unified architecture could do with it. Care to elaborate?
The hand-off latency on chip is a fixed cost of some number of cycles. If we're relying on accumulated instruction data or software hints, that indicates startup latencies or periods of misjudged demand.
Handoff can probably be shaved down to tens of cycles, and a lot of corner cases for accumulating sufficient data to change the direction taken by heuristics can take tens of cycles.
Once in a steady state, both methods will stabilize, so I suppose I'd need more data to consider them that different.

The penalty for mixing SSE and AVX instructions is easy to avoid, and mostly just a guideline for compiler writers. It's on a completely different level than GPU manufacturers telling you to minimize data transfers between the CPU and GPU, which is nowhere near trivial to achieve.
This penalty, at the very least, is not likely to persist except possibly for enthusiast gamer systems for one or two hardware generations.
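For reference, the avoidance amounts to clearing the upper YMM state before handing control to legacy SSE code; a hedged host-side sketch (compiled with AVX enabled, and something compilers already insert for you automatically):

Code:
#include <immintrin.h>

void legacy_sse_routine(float* p)     // stand-in for separately compiled, pre-AVX code
{
    __m128 v = _mm_loadu_ps(p);
    _mm_storeu_ps(p, _mm_add_ps(v, v));
}

void avx_then_sse(float* p)
{
    __m256 v = _mm256_loadu_ps(p);
    _mm256_storeu_ps(p, _mm256_mul_ps(v, v));
    _mm256_zeroupper();               // clear the dirty upper halves: no SSE/AVX transition penalty
    legacy_sse_routine(p);
}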

It also remains to be seen whether Sandy Bridge's penalty will still exist for Haswell, since each SIMD unit will be capable of 256-bit integer and floating-point operations. And finally, the warmup period is very well balanced to ensure that code without AVX instructions doesn't waste any power and that it's unnoticeable to code that does use AVX.
Agner Fog profiled the transition between the cold and warm 256-bit vector states of a CPU to take hundreds of cycles. Until then, the throughput is halved.
This is generally tolerable because over time the initial cost is amortized over millions of cycles of full throughput as long as no cooling off occurs.
Several hundred cycles of warmup is a lot of margin to fit alternative design choices into. Signalling and pertinent parameters can be transferred on-chip in tens of cycles, and once running, the specialized silicon will save power for millions of cycles.

That's hindsight. Today it's plain obvious that we'll never go back to a non-unified GPU architecture. But several years ago it really wasn't cut-and-dry that vertex and pixel processing should use a unified architecture.
The physical argument remains the same. Embarrassingly parallel vertex shaders and embarrassingly parallel fragment shaders don't make significantly different demands in terms of the units and the silicon implementing them, even when not unified.

"It’s not clear to me that an architecture for a good, efficient, and fast vertex shader is the same as the architecture for a good and fast pixel shader." - David Kirk, NVIDIA chief architect, 2004
G80 was already underway at the time.
If it wasn't clear to Kirk, it wasn't too cloudy, either.

Today's out-of-order execution architectures consume far less power than several years ago. It's even becoming standard in mobile devices (the iPhone 5's CPU is faster than the 21264, in every single metric). And things are only getting better, so let's not exaggerate the cost of out-of-order execution.
It would be interesting to see a comparison of power consumption on a process-normalized basis with a design with the same reordering and execution capabilities as the EV6.
It's a small core these days, and as you said inferior.

A "growing trend" of software and hardware fighting over control isn't a sustainable solution.
There's no more a battle over this than developers targeting OoO designs fret over the exact ordering of the instructions being executed.
If the end result is the same, the answer to software is that it's none of its business.
In the cases where the results aren't the same (typically low-level timing issues), the answer is that the OS, compiler, or developer need to get over it or buy an in-order core.

Developers would have to deal with unexpected behavior of several configurations of several generations of several architectures of several vendors with several driver and several firmware versions.
At the same time, software is challenged by difficulties in load-balancing highly parallel HPC systems with complex code.
Hardware can know a lot of things that software cannot, and dynamic information can be used to inform what software functions or paths to use or remove if they have already been generated, or whether an optimization pass can begin.
It's somewhat similar to some shader optimizations in the hardware and software realms; it would just be done faster.

You can't measure what kind of specialized core a thread might need, before you execute it.
Software is allowed to submit pertinent data to the hardware or OS scheduler.
A brief history of performance events can in the space of hundreds of cycles out of billions provide a decent enough clue.

And threads can switch between fibers that have very different characteristics. So a better solution is to have only one type of core which can handle any type of workload and adapts to it during execution. It doesn't have to do any costly migrations (and explicitly "manage" that at minimal latency), it just adjusts on the spot.
This is the asserted ideal, but it is not likely to be borne out in practice to meet the demands of the market.

This indicates that homogenizing the ISA and memory model at a logical level will eventually lead to unifying fetch/decode and the memory subsystem at the physical level.
Why would I need to drive a 4-wide high-speed decoder, and multiple instruction queues and caches if a workload doesn't need them?

The same ALU doesn't have to perform the same operation twice. The result is typically bypassed to all other ALUs that can operate on it, and for any operation they support.
Only if the dependent instructions are in the position to snoop the bus at the right cycle. Later accesses go back to the register file.

Would it be of interest to you to know that it was claimed that one of the discarded cores before AMD's Bulldozer didn't bypass results?

Also, since the scheduler wakes up dependent instructions in the cycle before the result becomes available, the chances of executing an instruction which can pick operands from the bypass network is very high.
If they are in the scheduler by the time the instruction executes, or are able to pull their operand from the bus and execute immediately with no other data or structural hazards.
It's easier with reservation stations that actually snoop the bus and store things over time, but that's not power-efficient anymore.

Note also that writeback can be gated when the value's corresponding register is overwritten by a subsequent instruction.
For an OoO design that couldn't be guaranteed until all intervening instructions have been found to have no exceptions at retirement and no interrupts come up. It's not apparent until then what values can be safely skipped.

Again, things like tag checking are independent of instruction width. So it becomes insignificant for very wide SIMD instructions.
Additional checking or some in-built shifters or adders are necessary for multi-cycle SIMD instructions, so it's more linear unless it's all single-cycle.

So I'm not arguing that the ORF and the bypass network are the exact same thing, but from a power consumption point of view they can serve similar purposes. And while the ORF saves the cost of tag checking, it doesn't help reduce thread count to improve memory locality and thus causes more power to be burned elsewhere.
They're not related at all. The ORF is the top tier of a multi-level register file, with the upper tier aliased with the ones below. It saves tag checking and reads/writes to the larger register file. It's read like any other register file because it is a register file.
 
I agree that's a lot of where the problem lies. On integrated parts the waters are muddier, but there it's also reasonable to be a lot more critical of stuff like duplicated ALUs between CPU/GPUs... something which doesn't seem necessary in the long run.
Duplicated ALUs aren't that big of a bother. It's not like a significant fraction of the flops are going to come from the CPU anyway.

Not free power-wise, but almost any sin on-chip is better than going to DRAM... at the very least on GPUs you'll end up losing your ability to hide latency due to bloated occupancy, at which point GPUs don't tend to look very fast anymore.
That is OK when you are going off-chip a few times. L1 is hit SO often that even those minor power costs can add up substantially, if not equally. And then there is the circuitry for memory disambiguation, LHS handling, memory ordering, tag checking, coherency bits, TLB lookups, etc., which must be used all the time, but which can be saved completely when you just make it a register.
 
IVB-E will be out next year according to Intel's roadmaps. The graphics chip in Ivy Bridge is less than 40% of the die space (according to some die shots). So I agree with you that an 8 core IVB-E would likely be slightly larger than IVB, but the difference should be much smaller than it is with Sandy Bridge.
The high-end versions that go above quad core also get an additional memory controller, an expanded ring bus and L3, and a very large amount of PCIe connectivity on-die.

Sandy Bridge EP has one side dominated by DDR3 and the other side filled by IO. At least one dimension is relatively resistant to being reduced.
There's probably going to be a large die size difference, although like SB the cores don't drive a lot of the growth.
It may be worth their time to throw in cores just to fill in the space between the IMC and IO controllers.
The TDPs with the big chips going forward appear to be rising at a decent clip to match.

Ah but on the CPU the L1$ basically acts like a dynamic register file. In a lot of cases, spilling to L1$ can be entirely "free", hence why ABIs save/restore register state for function calls, etc. and only the smallest of functions are really worth inlining these days. On a GPU you have to spill to global memory, which is a massive performance cliff that you basically can't afford to do, hence why so much optimization in CUDA, etc. is around "occupancy" and optimizing register pressure - because GPUs are really not capable of handling that efficiently in a "dynamic" manner (i.e. depending on runtime code path selection). Thus having a single register-heavy path present in a kernel - even if it never gets taken at runtime - can reduce the performance of the entire kernel. That's not really acceptable going forward...
I'm curious what obstacles there are to reducing this constraint, since current GPUs are physically capable of writing to a read/write on-chip memory hierarchy. Is this an artifact of some inflexibility in the allocation process of a CU, or a bottleneck that makes having a kernel constrained to a fixed register allocation size undesirable?
I believe Intel's GPU just has fixed allocations per thread, so there's that design option.

Also realize that the high-end of CPUs is increasingly server-grade CPUs, and those are probably *more* concerned with efficient power usage than mainstream stuff.
There is a potential bifurcation here as well, with multiple workstation and server chips with 130 and 150 W not being out of the question. The highest speed bins of the next generation may go 160+.
There was one extrapolation from the server TDP that would create a workstation bin of over 180W. I don't know how much credence to give that.

The energy-efficient density servers get the smaller chips, and potentially an Atom variant if market forces will it.
 
The energy-efficient density servers get the smaller chips, and potentially an Atom variant if market forces will it.
I don't understand the new Atoms. The Centerton is using the ancient in-order dual core (+HT) design. I would have expected server-focused Atoms to at least have more cores than the tablet/netbook versions (Intel added virtualization support to Centerton after all). The TDP of a dual core 2.0 GHz Centerton model is 8.6 watts. That's only 1.4 watts lower than ULV Haswell, and ULV Haswell integrates the GPU on the same die.

ULV Haswell chips do cost more than Centerton chips ($250+ vs. sub-$50 in bulk), but they also offer higher performance (and performance per watt). You need more Centerton based servers to match the performance. The CPU isn't the only cost in the server rack. You need memory, motherboards, hard drives, cooling, etc. Haswell can also "race to sleep" faster than Centerton, so it might actually consume less power under many server loads (while at the same time providing better response times). Maybe Intel just can't produce enough 10W Haswells to meet the demand, so they have chosen not to yet target the server market with the chip (17W ULV SB processors were also cherry picked parts when they first arrived).

It would have been interesting to see AMD produce a Jaguar based server chip. With for example four Jaguar CUs it would have had 16 cores. But it seems that AMD instead chose 64 bit ARM cores for their micro servers. It will be interesting to see how this plan proceeds, and whether ARM will be able to conquer the micro server market in general.
 
I'm curious what obstacles there are to reducing this constraint, since current GPUs are physically capable of writing to a read/write on-chip memory hierarchy. Is this an artifact of some inflexibility in the allocation process of a CU, or a bottleneck that makes having a kernel constrained to a fixed register allocation size undesirable?
I believe Intel's GPU just has fixed allocations per thread, so there's that design option.

Dynamic register spilling will end up using the stack to store results, which has some performance issues, but is necessary with register-starved architectures like x86 back in the day (only ~8 registers...). The really big problem, though, is where to store the stacks - when you have 50,000 threads in flight (each SIMD lane counts as a thread operating on 32 bit values for our purposes), you have a very BIG problem figuring out where to store these stacks, since even with room for just 8 words of overflow per thread that's 50,000 x 8 x 4 bytes = 1.6 MB of stack space. Dynamically growing and shrinking stacks are a huge problem when you have many threads.

Current GPUs handle register overflow mostly through the L1 cache. When the kernel is JITed from PTX assembly, the program can decide how many registers to allocate, and it partitions the stuff we'd like to put into registers into actual registers and statically allocated memory. Note that this is a runtime decision.
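In CUDA terms, the knobs behind that decision are the -maxrregcount flag and __launch_bounds__; a hypothetical sketch (the kernel itself is made up):

Code:
// __launch_bounds__ (or -maxrregcount) caps registers per thread at
// compile/JIT time; values that don't fit the cap are spilled to "local"
// memory, which is backed by the cached global memory space.
__global__ void __launch_bounds__(256, 4)   // 256 threads/block, target 4 blocks/SM
sumOfProducts(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc[32];                    // enough live values to stress a tight register budget
    for (int k = 0; k < 32; ++k)
        acc[k] = in[i] * (k + 1);
    float sum = 0.0f;
    for (int k = 0; k < 32; ++k)
        sum += acc[k];                // if the cap forces spills, they show up as local loads/stores in the profiler
    out[i] = sum;
}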

It's also worth noting that in order processors tend to suffer from register pressure less than OOO processors, since they don't have to worry so much about register pressure forcing false dependencies.
 
It's also worth noting that in order processors tend to suffer from register pressure less than OOO processors, since they don't have to worry so much about register pressure forcing false dependencies.
But in-order CPUs need more architectural registers to prevent dependencies in the first place (as there's no register renaming). The Xbox 360 in-order CPU core for example has 2 x 128 VMX registers (128 architectural registers for each of the two SMT threads). You need that many architectural registers, since there's no register renaming (and the vector unit has very long pipelines). You have to aggressively loop unroll, or your performance drops off a cliff (without register renaming, the hardware cannot rotate registers across loop iterations).
 
I don't understand the new Atoms. The Centerton is using the ancient in-order dual core (+HT) design. I would have expected server-focused Atoms to at least have more cores than the tablet/netbook versions (Intel added virtualization support to Centerton after all). The TDP of a dual core 2.0 GHz Centerton model is 8.6 watts. That's only 1.4 watts lower than ULV Haswell, and ULV Haswell integrates the GPU on the same die.
Centerton is an SOC, however. Is the 10W Haswell the one with an on-package southbridge?
It seems that Centerton is a hedge or testing of the waters.

Maybe Intel just can't produce enough 10W Haswells to meet the demand, so they have chosen not to yet target the server market with the chip (17W ULV SB processors were also cherry picked parts when they first arrived).
The ULV is at one narrow end of the bell curve. Atoms are expected to hit these low power levels with most (edit: all?) of their bins.

It would have been interesting to see AMD produce a Jaguar based server chip. With for example four Jaguar CUs it would have had 16 cores. But it seems that AMD instead chose 64 bit ARM cores for their micro servers. It will be interesting to see how this plan proceeds, and whether ARM will be able to conquer the micro server market in general.
I'm not sure of the chances of either, though much less sure of AMD's choice. Part of it may be that someone made a choice to not implement ECC, and now it is too late to try it.
Possibly, AMD is afraid of tying a product that may exist long-term or could be sold off to x86.
AMD's ARM effort seems to be as shoe-string as they can manage.
 
Sorry to disturb a pretty low-level and interesting discussion, but after reading the latest comments I can't really help it and I have to ask something.
It's something I actually asked about a while ago, but I want to see if opinions on the matter have changed in any meaningful way (for example as people have played more with modern CPUs and Larrabee).
It could be me misunderstanding things (quite possible, as quite a few of the points discussed lately are over my head), but even if Nick's point about a unified architecture has been discussed at length and the consensus here seems against it, my synthesis (which could be wrong, hence why I ask) is that CPUs do have more room to grow in the throughput department than GPUs have in the "serial performance" department.
Reading A. Lauritzen's comments, it also seems to me that the move to more complex forms of rendering also favors CPU architectures. Sebbbi's posts also point to quite a lot of potential for software/CPU rendering.
Summing up:
CPUs can be turned into potent throughput devices
Complex workloads map better to CPUs (at least for now)
Having only one type of core doesn't seem like a workable approach
The market (realtime 3D) revolves around APIs, which takes away most of the advantages of something like Larrabee / puts it at a significant (too significant to be overcome) disadvantage.

Though Intel has released Xeon Phi, and while it is imperfect, some kind of placeholder product, I'm confident that Intel should pretty soon (within a couple of years) provide a throughput-oriented core that is at parity with regard to ISA/features (not sure of the wording) with their contemporary desktop architecture.
I would think that if Intel executes (at last, quite some people would say), they indeed have here something that has the potential to kill GPGPU.

So if a "unified architecture" doesn't appear, we are more likely to see different cores with the same features set, ISA (akin to A15 and A7). I might be wrong but that sounds like what Intel is up to to me. Could you see that creating enough intensive for the market to shift toward software rendering?
To me it looks like the market is really tied to API but I start to wonder if it is for the best after reading some comments here.
 
For example, you can take the address of a device function, store it in DRAM, and then use it in another kernel.
Cool, news to me. So how does it partition the register file then? Does an indirect call like that force spilling all state to memory (ouch!)? What about shared memory?

Duplicated ALUs aren't that big of a bother. It's not like a significant fraction of the flops are going to come from the CPU anyway.
500 GFLOPS SP or whatever it is in HSW is not significant? :) 2x FMA in AVX2 massively shifts the conventional comparison here, which is really a lot of what makes me question my preconceptions. It seems a lot easier to just throw throughput onto CPUs than GPU vendors have had us believe.
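For reference, the rough arithmetic behind a number in that ballpark, assuming a quad-core part somewhere around 3.5 GHz:

\[
  \underbrace{4}_{\text{cores}} \times
  \underbrace{2}_{\text{FMA ports}} \times
  \underbrace{8}_{\text{SP lanes}} \times
  \underbrace{2}_{\text{flops/FMA}} \times
  3.5\,\mathrm{GHz} \approx 448\ \mathrm{GFLOPS\ (SP)}
\]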

Summing up:
CPUs can be turned into potent throughput devices
Complex workloads map better to CPUs (at least for now)
Having only one type of core doesn't seem like a workable approach
The market (realtime 3D) revolves around APIs, which takes away most of the advantages of something like Larrabee / puts it at a significant (too significant to be overcome) disadvantage.
I think that's a pretty fair summary actually, but the larger question that you ask is the one being fundamentally debated here and I don't think most people will claim to see the future totally clearly on it.

To me, I see a few potential futures...

1) There continues to be a large enough need for "simple" and highly regular throughput that GPUs or similar solutions survive largely in their current form. They end up being so much more power-efficient at these computations that dedicated hardware is justified.

2) GPU flops are not too interesting, but their programming model of large contexts connected to high-latency, high-throughput memory is compelling and cannot be duplicated adequately in a power-efficient manner in software.

3) GPUs become general enough that they can handle much less regular algorithms. Whether or not their current latency hiding model can survive that transition is unclear, but it's possible that they still end up at a more general design point that is far enough from a CPU to still justify their existence.

4) CPUs add enough throughput that they can largely take over the stuff that GPUs have been doing, and the advantages in algorithmic efficiency that are offered by the more flexible programming models makes up for any delta in power efficiency.

My thoughts...

1) is possible, but that's the one I'm really questioning given Haswell's raw FLOPS.

Personally I think 2) is sort of unlikely, because the tendency is away from dedicated GPUs and towards SoCs. In an SoC you expect the same memory hierarchy to be usable by both the CPU and the GPU, and it will tend to be a more latency-optimized one.

3) and 4) are both possible... I don't really have any insight here.

So as far as my claims, the only one that I'm really firming up on is that I don't think GPUs in their current form (i.e. how they currently do memory, latency hiding, etc.) can survive for the long term, but even that I wouldn't claim I'm totally certain of.

Probably the most likely is that secondary factors dominate the architecture ones here TBH... i.e. IHVs chasing certain form factors and margins. For instance, even with all of the talk of dark silicon and how we are more than happy to put a massive number of transistors on a chip specific to one task that remain mostly free power-wise, is the market really willing to pay for those transistors? I'm sort of doubtful.
 
^ This.

Andrew Lauritzen said:
1) There continues to be a large enough need for "simple" and highly regular throughput that GPUs or similar solutions survive largely in their current form. They end up being so much more power-efficient at these computations that dedicated hardware is justified.

Short term (~next 6 years) I think this will most likely be the case.

Andrew Lauritzen said:
1) is possible, but that's the one I'm really questioning given Haswell's raw FLOPS.
Haswell will be a big step forward, but even it will be ~3-4x less efficient than a GK104 for single precision. Of course, I expect CPUs to continue to improve beyond Haswell, but GPUs will not be sitting still either.

Long term, it is really impossible to say with any amount of certainty. But I do think it is safe to predict that power efficiency (performance/watt) will become the sun, moon, and stars of the semiconductor industry (barring breakthroughs in nuclear fusion and battery technology).
 
Haswell will be a big step forward, but even it will be ~3-4x less efficient than a GK104 for single precision.

It might be closer than you think. Let's take a look at some Sandy Bridge Xeons:
E5-2670, 2.6 GHz, 115 W, 2.9 GFLOPS/W
E5-2660, 2.2 GHz, 95 W, 3.0 GFLOPS/W
E5-2650L, 1.8 GHz, 70 W, 3.3 GFLOPS/W

Even assuming the FMA units consume all the power gains of the 22 nm shrink, you would still double those numbers. Compared to a GTX 680 (15.9 GFLOPS/W according to Wikipedia) that would be 2.4~2.7x less efficient (and this is for single precision).
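For reference, roughly how those per-watt figures fall out (assuming 8-core Sandy Bridge EP parts at 16 SP flops per core per clock, and the commonly quoted GTX 680 numbers; all TDP-based, so treat them as rough):

\[
  \text{E5-2670: } 8 \times 2.6\,\mathrm{GHz} \times 16 \approx 333\ \mathrm{GFLOPS},\quad
  333 / 115\,\mathrm{W} \approx 2.9\ \mathrm{GFLOPS/W}
\]
\[
  \text{GTX 680: } 1536 \times 2 \times 1.006\,\mathrm{GHz} \approx 3090\ \mathrm{GFLOPS},\quad
  3090 / 195\,\mathrm{W} \approx 15.9\ \mathrm{GFLOPS/W}
\]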

Not too bad for a "latency-optimized" architecture...
 
Personally I think 2) is sort of unlikely, because the tendency is away from dedicated GPUs and towards SoCs. In an SoC you expect the same memory hierarchy to be usable by both the CPU and the GPU, and it will tend to be a more latency-optimized one.
At least historically, SOC memory hierarchies have tended to just not be optimal for anything, or perhaps for power if the designers made the effort (or literally an economy of effort if not). Bandwidth constraints and pretty bad latency aren't that uncommon.
Apple's custom ARM SOC got kudos for not having a terrible cache subsystem.

The future directions of memory integration with 2.5D or 3D pose interesting tradeoffs. 3D seems quite a ways away, but 2.5D is a more evolutionary step that seems to be nearing initial rollout.

The primary detraction is the additional materials, yield and manufacturing cost. On the other hand, it can slim down the rest of the system platform, and something like HMC on interposer has the potential (according to its boosters) combination of adding an order of magnitude to bandwidth with a power reduction to boot.

If interposers become commonplace, they may become useful by removing pads from the high-end silicon dies. This may end the era where there's literally die space to burn that is going to have some random widget or nothing, or it can open up another set of implementation choices with regards to how much silicon can be thrown at a design for the same cost on depreciated nodes.

For instance, even with all of the talk of dark silicon and how we are more than happy to put a massive number of transistors on a chip specific to one task that remain mostly free power-wise, is the market really willing to pay for those transistors? I'm sort of doubtful.
It already does, looking at the layout of a modern SOC, or the layout of Intel's chips with integrated GPUs.
Intel likes the value-added argument it can put into its unit pricing, and OEMs like the component reduction and marketing benefits.
Intel could have created a Sandy Bridge quad core without a GPU, but it would be fighting for a subset of sales and salvaged dies with broken GPUs.
The cost of a new physical chip, including engineering, validation, and masks was weighed against having a single larger volume chip that serviced buyers in search of integrated graphics and those not interested.
The quad cores with GPUs fused off got one extra turbo grade and were sold as low-end Xeons.

The latest tepid-to-cool desktop market trends appear to have contributed to the idling of some 22nm fab capacity, since right now the primary product driving it is Ivy Bridge. The very aggressively mobile Haswell and the large high-end Ivy Bridge E chips in the next year may be able to get past the desktop glut.
There is, for now, a problem of there being significant amounts of silicon chasing something to buy it, with the probably temporary exception of the underprovisioned ~28nm nodes at the foundries that have managed to get their act together. The kinds of chips driving that demand don't appeal to design purists.
 
Short term (~next 6 years) I think this will most likely be the case.
I agree, that's all but certain. But I like to talk about further out than that because frankly I know too much about roadmaps in that time-frame already :)

The future directions of memory integration with 2.5D or 3D pose interesting tradeoffs. 3D seems quite a ways away, but 2.5D is a more evolutionary step that seems to be nearing initial rollout.
Right and in the timeframes that we're talking about, you can't even really discount something very disruptive coming along, but all we can do is speculate based on what we know now.

It already does, looking at the layout of a modern SOC, or the layout of Intel's chips with integrated GPUs.
Intel likes the value-added argument it can put into its unit pricing, and OEMs like the component reduction and marketing benefits.
Sure, but that's kind of a special case because without an integrated GPU you have to buy *another* chip to do that, so obviously the OEMs like that. What I'm talking about more is claims like "well we'll throw dedicated logic for everything and the kitchen sink in there that will remain off most of the time, but will be slightly more efficient than general purpose software when we need it". It's clear that stuff like video and audio encoding/decoding is common enough to make that cut, but for economic reasons I don't think the argument extends as far as some would have you believe. Where's my onboard PhysX chip guys? Turns out that's not really enough of a win over CPU/GPU software to justify the area/cost, even if it is a power win.

Anyways my point is just that area is not free - you pay dollars for it, and consumers are still relatively picky on what they spend money on.
 