Software/CPU-based 3D Rendering

For what it's worth, there are other cases where CPUs lag far behind. Good examples are transcendental functions (sin, cos, log, exp, etc.), where the gap, at least for single precision, is quite large.
Sin/cos are really cheap to compute using FMA. Log/exp can use some IEEE-754 format trickery. Currently this takes several masking and shifting instructions, but with an extension of the BMI instructions to SIMD it would be quite efficient (and they would be useful for much more).
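To make the log/exp point concrete, here's a minimal sketch of the format trickery (my own illustrative version, not anyone's production code): the exponent field of an IEEE-754 float already stores floor(log2(x)), so a coarse log2 falls out of a few integer mask/shift operations plus a short polynomial over the mantissa. The quadratic below just interpolates log2 at m = 1, 1.5 and 2, so it's only accurate to a few thousandths.

```cpp
#include <cstdint>
#include <cstring>

// Coarse log2 via IEEE-754 bit manipulation; assumes x is positive and finite.
static float fast_log2(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);               // reinterpret the float as raw bits
    int e = int((bits >> 23) & 0xFF) - 127;            // unbiased exponent = floor(log2 x)
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;         // rebuild the mantissa m in [1, 2)
    float m;
    std::memcpy(&m, &bits, sizeof m);
    float p = (-0.33984f * m + 2.01952f) * m - 1.67968f;  // ~log2(m), interpolating quadratic
    return float(e) + p;
}
```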
Kepler can retire 256 of these instructions per clock cycle. There's just no way that a CPU can come anywhere close, especially since it doesn't have hardware paths for these functions.
It doesn't have to come close to a GK104. For now it merely has to be adequate to replace the integrated GPU.

Also note that there's an excessive number of these units, to avoid creating a bottleneck when a particular shader needs them. It's like car sharing. Even if among 100 people there's on average only 1 hour of driving for every 100 hours, you need more than 1 car to minimize the chances of someone having to wait for a car to become available. Even though GPU threads can wait a bit for resources, they shouldn't be left waiting for long, especially since they lack out-of-order execution. So they have plenty of these units, making their average utilization quite low. And that's before looking at other bottlenecks which cause bursty execution.

This is one of the reasons why CPUs aren't as bad at graphics as their specifications, compared to the cumulative peak specifications of the GPU, would have you believe.
 
Show me where I can get a CPU to run Crysis (you know, the first one, from 5 years ago) at maximum settings. High-end GPUs were capable of this back in late 2007 when it was released, so unless someone can show me a CPU implementation capable of at the very least matching half-decade-old GPUs in this regard, I call BS on CPUs replacing GPUs in the near future.

Larrabee didn't have any "GPU-style" fixed function hardware in addition to its texture filtering units. The whole graphics pipeline (including blending, depth buffering, triangle setup, rasterization) was just pure x86 software. You could implement a similar software pipeline on any CPU.

So the criteria for being a GPU is to have hardware texture units and raster ops? I'm claiming that Larrabee uses pretty much the same programming model as current GPGPU. Unless you're trying to say that the primitive x86 core tacked onto Larrabee's cores somehow lets it do something that you couldn't do by just using the first lane of the SIMD. GPUs are perfectly capable of running full software rendering, and people have written software renderers for them. Of course, they aren't quite as fast as the hardware graphics pipeline, but then what do you expect?
 
Nick said:
GK110 has 7.1 billion transistors. Even if you account for the denser design, there's nothing in the Fermi family to compare it to.
You are aware they are fabricated on different processes? Yes, yes you are. You are a pretty smart guy. And it is clear you are really good at playing dumb and avoiding pieces of reality which don't fit your pre-conceived ideas about it.

Nick said:
Why do you believe this is not readily obvious?
I am a bit confused by what you mean by "this." I have interpreted it to mean why do you not believe that GK104 is the obvious GPGPU successor to GF110. Please let me know if this is incorrect.

Let's see...
GF100/110 & GK110 have similar die sizes; GK104 does not.
GF100/110 & GK110 have similar DP rates; GK104 does not (1/2 and 1/3 vs. 1/24th).
GF100/110 & GK110 have a 384-bit memory interface; GK104 does not (256-bit).
GK110 has Hyper-Q and Dynamic Parallelism; GK104 does not.
GK110 has double the L2 of GF100/110; GK104 has less.


Now, you contend that because GK104 and GF110 share similar transistor count on different processes they are comparable, and that because GK110 slightly more than doubles GF110's transistor count at 28nm vs 40 nm they are not comparable.

Should we then apply this logic to CPUs? I'm guessing you wouldn't like that.

BTW, you still haven't answered the question.
 
Show me where I can get a CPU to run Crysis (you know, the first one, from 5 years ago) at maximum settings. Very high end GPUs were capable of this back in late 2007 when it was released, so unless someone can show me a CPU implementation capable of matching half-decade old GPUs in this regard, I call BS for CPUs replacing GPUs in the near future.

Nick's point is that in the near future CPUs will change radically and have all sorts of GPU-like features.
 
This also shows why the industry eventually needs/wants software rendering. It guarantees that what you see on your development system is what customers see on their system. You basically ship the same 'driver' with your application. Better performance from dedicated hardware isn't worth much if it only actually runs reliably on one class of hardware, and the support costs of graphics issues can be very substantial. Aside from the gaming industry, consumer application developers steer clear of using the GPU, despite its theoretical potential, precisely because of its unreliability and the expertise it requires.
Exactly which consumer applications could benefit greatly from GPU rendering but do it in software instead because of unreliability and API complexity?
 
Nick's point is that in the near future CPUs will change radically and have all sorts of GPU-like features.

Meaning what? That 2 years from now CPUs will be able to run Crysis (now 7 years after its release) at maximum settings?

Nick's primary contention is that a single threaded program will always have better locality than a parallel program, but I don't think this is really true. While certainly TLP has bad locality, DLP is a different beast altogether. In order to have a fast DLP algorithm, you structure the execution so that groups of threads operate on local data, meaning that they tend to cooperatively fetch data that nearby threads reuse. There are actually algorithms where adding say 2x as many processors yields a 2.03x speedup. The reason is that in a tight DLP algorithm, you have threads effectively prefetching data for other threads.

There's a tricky balance there, since you want the processors to share data across the shared L2 or L3 cache, but you don't want them fighting over L1 cache lines. Double buffering helps considerably here, since it not only gets rid of the problem of the results overwriting the input that other threads may need, but it lets the caches load the input in read-only mode, so that they don't fight over cache lines. The output can then bypass the local cache in full write-through mode, since it won't be needed again for a while.
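As a concrete illustration of that double-buffered arrangement, here's a minimal AVX sketch (assuming 32-byte-aligned buffers and a length that's a multiple of 8; the kernel itself is just a placeholder scale): the input buffer is only ever read, so its cache lines can be shared cleanly between cores, and the results leave through streaming stores that bypass the local cache.

```cpp
#include <immintrin.h>
#include <cstddef>

void scale_buffer(const float* in, float* out, std::size_t n, float k)
{
    const __m256 scale = _mm256_set1_ps(k);
    for (std::size_t i = 0; i < n; i += 8)
    {
        __m256 v = _mm256_load_ps(in + i);                   // read-only input buffer
        _mm256_stream_ps(out + i, _mm256_mul_ps(v, scale));  // non-temporal store, bypasses the cache
    }
    _mm_sfence();  // make the streaming stores visible before the buffers are swapped
}
```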

The future I envision has only 1-2 serial latency optimized cores (since how many cores does a serial program use...?), while the rest of the die (besides caches...) is given over to throughput optimized cores.

A little note on TLP vs. DLP: When people think of parallelism, they usually think of TLP. However, I contend that DLP is much, much better to use than TLP. The first reason is that there is generally more DLP than TLP for a given program. How many windows does your web browser have open? 2-3? Now how many pixels are in each of those windows? See what I'm saying?

The second advantage of DLP is probably the most important: ease of programming. Now you might say that DLP needs new algorithms, while TLP can use the old ones, and this is true. So why do I say that DLP is easier to program, even to add to existing code? The reason is that programmers don't spend most of their time programming - they spend it debugging. With TLP, you have large parts of your program operating in parallel. This means that you can have nasty side effects causing bugs across code from wildly different pieces of the program. Imagine trying to track down a bug in the text entry of a window and finding out (or not finding out...) it's caused by something in the code that writes your file to disk. Now imagine that *that* code wasn't even your own code - someone else on the team wrote it. So this sort of bug is incredibly hard to track down and fix. With DLP, the scope of side-effects is reduced to merely the scope of the particular algorithm in question, so instead of having a tricky interaction between far flung subsystems over hundreds of thousands of lines of code, the side-effect causing the problem is very likely to be within the same subsystem, maybe even in the same file.

So, to sum up, use DLP wherever you can, but only use TLP where it's really needed, like for asynchronous stuff that keeps the UI responsive.
 
This also shows why the industry eventually needs/wants software rendering. It guarantees that what you see on your development system is what customers see on their system. You basically ship the same 'driver' with your application. Better performance from dedicated hardware isn't worth much if it only actually runs reliably on one class of hardware, and the support costs of graphics issues can be very substantial. Aside from the gaming industry, consumer application developers steer clear of using the GPU, despite its theoretical potential, precisely because of its unreliability and the expertise it requires. With a plethora of APIs and compute languages of increasing complexity, this isn't getting any better for dedicated hardware, while their software implementations are highly reliable. So all we really need is CPUs with higher throughput and reduced power consumption. That's exactly where they're heading.

Uh, no. The reason drivers have issues is that rendering is a very tricky subject, and that drivers have a very wide scope of things they do. Writing a good renderer is quite difficult, which is why most games these days don't have their own rendering engines - they use someone else's, and that's with APIs that do most of the hard stuff for you! The only thing a software renderer would change would be that instead of buggy drivers, the bugs would find their way into the commercial engines instead, and since there are more commercial engines than hardware vendors, you'd have the same total number of developers having to fix these bugs separately more times, so it would probably be worse. All you'd be doing is moving the difficult code from one platform onto another. Hardware is irrelevant here.

Another thing to note is that many driver bugs are due to bugs in aggressive compiler optimizations, as well as aggressive optimizations in the driver itself. These sorts of things happen for CPUs too, you know, especially for an ISA as bloated and fragmented as x86.
 
I'm claiming that Larrabee uses pretty much the same programming model as current GPGPU.
It can do a lot more than what conventional GPUs can though. The reliance of GPUs on static partitioning of register files based on worst-case analysis of kernels that are known in advance is a significant weakness for irregular code compared to register renaming and L1$ spill/fill. Ultimately everything starts static and then has to move to be more dynamic, and GPUs are going to take a hit when they have to be more general in this area. But there really is no long term option; as the BPS course at SIGGRAPH (among others) keeps demonstrating, even in graphics there's an increasing need for more clever algorithms. Brute force à la typical GPGPU does not scale well enough.

I mean it's great to see that with Kepler - and not even the consumer one - I can finally call a function/launch work without a CPU round trip... but that's the tip of the iceberg here, and the efficiency of that mechanism compared to what Larrabee or any CPU could trivially do (cooperative work stealing, or whatever other scheduler you want to write) remains to be seen.

Exactly which consumer applications could benefit greatly from GPU rendering but do it in software instead because of unreliability and API complexity?
Depends what you mean by "consumer application" I guess. If you saw John Carmack's keynote from QuakeCon, he talked directly to this point. They were initially using GPUs for lightmap baking and other offline ray tracing stuff but eventually switched to CPUs because of reliability issues (and because the performance delta was fairly small to start with, so that may not fit your example). I've heard similar stories from other game developers with respect to their tools pipeline.
 
Depends what you mean by "consumer application" I guess. If you saw John Carmack's keynote from QuakeCon, he talked directly to this point. They were initially using GPUs for lightmap baking and other offline ray tracing stuff but eventually switched to CPUs because of reliability issues (and because the performance delta was fairly small to start with, so that may not fit your example). I've heard similar stories from other game developers with respect to their tools pipeline.
Well, which non-game consumer applications, since that's what Nick was suggesting? I have a hard time coming up with non-game consumer applications that might need lots of GPU rendering. Using non-existent apps as an example of where software rendering is already commonplace makes for a pretty shoddy foundation for the whole argument. ;) But maybe I'm just missing some obvious SW category.
 
Well I would certainly concede that adoption of GPU computing into stuff like Adobe's packages has been slower than I would have thought, since in those places there are clear uses for stuff like texture sampling, but I imagine robustness does play a role there. Even the latest Photoshop which IIRC does include a little bit of GPU stuff is really just dropping a few custom GPU filters in there, not using it for any of the real heavy lifting (which it conceptually would be good at). I'd add that moving data back and forth between the CPU and GPU is a real issue/performance killer in any non-toy application as well, but of course whenever reporting results, people like to leave that (and shader JIT) out of the results ;)

I'll let Nick answer what he was thinking though. I'm just speculating and don't necessarily agree anyways.
 
So the criteria for being a GPU is to have hardware texture units and raster ops? I'm claiming that Larrabee uses pretty much the same programming model as current GPGPU.
No, I meant exactly the opposite. Texture units != GPU. Fixed function rasterization pipeline = GPU.

Larrabee / Xeon Phi doesn't have a single programming model. You can use any programming model that fits your purpose. The x86 software renderer Intel did for Larrabee is just one way to do things. You could be running a completely separate process on each of Xeon Phi's (64*4=) 256 logical cores if you wanted (for example have a web hosting server run all servlets on it). You can't do the same on a GPU.
The second advantage of DLP is probably the most important: ease of programming. Now you might say that DLP needs new algorithms, while TLP can use the old ones, and this is true. So why do I say that DLP is easier to program, even to add to existing code? The reason is that programmers don't spend most of their time programming - they spend it debugging. With TLP, you have large parts of your program operating in parallel. This means that you can have nasty side effects causing bugs across code from wildly different pieces of the program. Imagine trying to track down a bug in the text entry of a window and finding out (or not finding out...) it's caused by something in the code that writes your file to disk. Now imagine that *that* code wasn't even your own code - someone else on the team wrote it. So this sort of bug is incredibly hard to track down and fix. With DLP, the scope of side-effects is reduced to merely the scope of the particular algorithm in question, so instead of having a tricky interaction between far flung subsystems over hundreds of thousands of lines of code, the side-effect causing the problem is very likely to be within the same subsystem, maybe even in the same file.

So, to sum up, use DLP wherever you can, but only use TLP where it's really needed, like for asynchronous stuff that keeps the UI responsive.
What you call "TLP", I call "old style threading". When multicore processors (and multicore consoles) first came out, the first thing that came to coders' minds was: how to distribute all these tasks across the cores. We had a rendering thread, physics thread, particle simulation thread, logic thread, UI thread, AI thread, sound mixing thread, etc, etc. One task per thread, all read/write structures private to each thread (only read-only structures were shared, and synchronized between frames). There are two big problems with this kind of approach:

The first problem is multicore scaling. Whenever a new architecture comes out, you have to figure out new things to do on all the additional CPU cores. As your threads are programmed in a single threaded (serial) fashion, you cannot easily split them to multiple cores (without additional synchronization and code refactoring). Your program is pretty much fixed to a certain core count and cannot scale beyond that (without huge code refactoring).

The second problem, like you said, is threading problems. You are running N completely separate threads at the same time. If a programmer accidentally does something illegal in any of these threads, it will be very hard to find the bug. The game crashes once in a blue moon, and likely at a later stage, due to data corruption in some other thread that has nothing to do with the faulty code.

If all threads instead execute one program step at a time, you never need to think about anything other than making each program step internally thread safe. And as each program step is multithreaded, you can automatically scale it to all available CPU threads. This is a very good programming model. I fully agree on that. It also makes programs much easier to port to the GPU (though it isn't always as simple as that).
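A minimal sketch of that "one program step at a time, each step internally data-parallel" model, using C++17 parallel algorithms (the Entity type and the per-step work are placeholders):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

struct Entity { float x, y, vx, vy; };

void simulation_step(std::vector<Entity>& entities, float dt)
{
    // Step 1: every core applies forces; no other step runs concurrently.
    std::for_each(std::execution::par, entities.begin(), entities.end(),
                  [dt](Entity& e) { e.vy -= 9.81f * dt; });

    // Step 2 starts only when step 1 has finished across all entities, so
    // each step only has to be thread safe with respect to itself.
    std::for_each(std::execution::par, entities.begin(), entities.end(),
                  [dt](Entity& e) { e.x += e.vx * dt; e.y += e.vy * dt; });
}
```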
Nick's primary contention is that a single threaded program will always have better locality than a parallel program, but I don't think this is really true. While certainly TLP has bad locality, DLP is a different beast altogether. In order to have a fast DLP algorithm, you structure the execution so that groups of threads operate on local data, meaning that they tend to cooperatively fetch data that nearby threads reuse. There are actually algorithms where adding say 2x as many processors yields a 2.03x speedup. The reason is that in a tight DLP algorithm, you have threads effectively prefetching data for other threads.
Now we are at the core of the debate. The CPU is optimized for running a single serial thread. In order to get good performance out of it, you need to make sure the data accesses in the serial code flow are cache friendly. A single GPU core on the other hand runs many (almost) lock-stepped threads at the same time. In order to make data accesses cache friendly, you instead need to make sure that the data accessed by different threads in the same part of the code is cache friendly. Neither approach is worse; the problem is just rotated 90 degrees :)

Example: A shader has four memory accesses, each to a separate texture. Accesses to a single texture are cache friendly. There's plenty of ALU instructions between the texture fetches.

Let's assume the CPU processes one pixel at a time. It accesses all four textures (memory locations that do not share cache lines) in a serial manner. The first pixel has four stalls, but the caches will keep the accessed DXT blocks (four blocks in a 64 byte cache line), so future accesses will likely hit the caches. The CPU caches simultaneously hold data for all four textures.

The GPU core on the other hand first accesses texture #1 on all threads. All the threads will first stall, and the GPU will swap new threads into execution (and those threads will stall as well). However, after the rocky start, things will be running smoothly, as fetches from a single texture are cache friendly. Texture #1 data will be flowing into the cache at a fast pace, and latency hiding will cover all the remaining stalls. After texture #1 is fetched and processed, the threads move further on in the shader code and start fetching texture #2, and so on. The most important thing to notice here is: the cache holds data from a single texture at a time.

The GPU has more threads running at the same time, but contrary to what conventional wisdom would tell us, this doesn't actually mean that the GPU needs bigger caches to be as memory efficient... as long as the memory access patterns are similar to the example above (that's a very common access pattern for graphics pixel shaders). All CUDA programming guides tell us to coalesce all global memory writes and reads, no matter how many extra ALU instructions we might need to perform in order to reorganize the data. It's almost always worth it. A coalesced memory access pattern is basically the same access pattern that all efficient graphics rendering algorithms use. If your CUDA software doesn't access memory like this, your performance drops like a rock (many scientific papers claim up to 10x performance boosts just from coalescing properly).
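Back-of-the-envelope numbers for why coalescing matters so much (assuming 64 byte cache lines, 32 lanes and 4 byte loads, matching the discussion above):

```cpp
constexpr int lanes = 32, load_bytes = 4, line_bytes = 64;
constexpr int coalesced_lines = (lanes * load_bytes + line_bytes - 1) / line_bytes;  // contiguous: 2 lines
constexpr int scattered_lines = lanes;                                               // one line per lane: 32
static_assert(coalesced_lines == 2 && scattered_lines == 32,
              "uncoalesced access moves 16x more cache lines for the same amount of useful data");
```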

However, things are not that bad for the CPU either, as the programmer can customize the programming model depending on his/her needs. Andrew mentioned a custom fiber model earlier. It basically turns things 90 degrees on the CPU as well. Now a single CPU thread processes the first part of the shader first (AVX2 SoA-style number crunching), and after that it moves to the second part. This way it needs 1/4-sized caches to reach the same cache utilization. Now it is at least as efficient as the GPU (and has bigger caches to boot).
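A rough sketch of what that 90-degree rotation looks like in code (AVX, SoA layout; the names and the placeholder math are illustrative, and the pixel count is assumed to be a multiple of 8): the first part of the shader runs across a whole batch of pixels before anything touches texture #2, so the cache only has to hold one texture's working set at a time.

```cpp
#include <immintrin.h>

void shade_part1(const float* u, const float* v, float* out, int count)
{
    for (int i = 0; i < count; i += 8)
    {
        __m256 uu = _mm256_loadu_ps(u + i);   // u coordinates of 8 pixels
        __m256 vv = _mm256_loadu_ps(v + i);   // v coordinates of 8 pixels
        // ... sample/filter texture #1 for these 8 pixels, do the ALU work ...
        _mm256_storeu_ps(out + i, _mm256_add_ps(uu, vv));  // placeholder result
    }
}
// shade_part2(), which touches texture #2, then runs over the same batch.
```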

We haven't yet analyzed the most CPU-friendly case: the case where the serial execution itself is cache friendly (accesses the same cache lines repeatedly), but other work items (threads / pixels / CUDA threads) are processing mostly separate data (not much sharing). This case is a nightmare for the GPU. All threads first fetch data #1, but it has no coherency. So the GPU cannot coalesce the reads, and ends up loading a new 64 byte cache line for each thread. It wakes up other threads, but those do not share any data with the existing threads either. The cache starts to thrash badly, as every memory request loads a separate 64 byte cache line. Now the first memory fetch and all code related to it is processed, and the threads start to process data fetch #2. As the GPU cache is so small, and it has already been thrashing a lot during data fetch #1, there are no cache lines remaining in the cache that could instantly give us the data needed for the coherent second fetch. So we are back at square one, and must load a full 64 byte cache line for each thread... again, and again, and again, for all four fetches.

All the GPU friendly access patterns can be optimized for the CPU (to have similar cache efficiency), but some of the CPU friendly access patterns are very hard to optimize for the GPU. Of course the CPU might need a separate code path for each access pattern (while the GPU latency hiding is more automated, but not as robust as most people think).
 
Right, but to some extent in this discussion the ALUs on GPUs are one of their least interesting parts. It seems to be more and more clear that we *are* able to just slap more math power on big cores and do pretty well with it, so it's the other architectural differences that are more interesting to me (scheduling, texture sampling, etc). Your thoughts on these are quite interesting, so thanks for posting!
Big FLOP/s numbers are always interesting, and that's why GPU ALUs are interesting. However I fully agree with you that the really interesting topic is how to feed these ALUs and keep them running as efficiently as possible.

I haven't coded much CPU image processing since the old times. The last time I really spent a lot of time on CPU rendering was when we developed games for the Nokia N-Gage. It didn't have any GPU or FPU, so I had to code a full (fixed point / integer based) 2.5d graphics library for it. We used DXT-style (custom) block compression in our blitters, because of the memory limitations. Making it run at 30 Hz on a 100 MHz old ARM processor required lots of graphics optimizations. The "software SIMD hack" for 565 texture data on standard 32 bit integers is so much fun (you just have to keep clear bits between the fields, and mask them out whenever an under/overflow is possible). I have also done some real time GPU DXT compressors for our virtual texture system, so I have had to learn plenty of DXT tricks. This actually reminds me that even AVX1 can process 8 x 24 bit integer operations per cycle: eight floats, with a 24 bit mantissa in each, perfect for integer processing if you have full speed FMA. Combine that with the "software SIMD hack" and you can process plenty of integer stuff per cycle... unfortunately masking out the overflow becomes tricky, since the floating point hardware constantly normalizes data (and you cannot just mask out the mantissa bits easily).
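For reference, here's the classic form of that 565 SWAR trick (my reconstruction of the general idea, not the actual N-Gage code): spread the three channels of an RGB565 pixel across a 32-bit word with guard bits between them, and one integer multiply then scales all three channels at once.

```cpp
#include <cstdint>

// Scale an RGB565 pixel by alpha in [0, 32]; the guard bits absorb the carries.
static std::uint16_t scale_565(std::uint16_t c, unsigned alpha)
{
    std::uint32_t s = (c | (std::uint32_t(c) << 16)) & 0x07E0F81F; // R,B in the low half; G in the high half
    s = ((s * alpha) >> 5) & 0x07E0F81F;                           // products grow into the guard bits only
    return std::uint16_t(s | (s >> 16));                           // fold G back down, repack to 565
}
```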

If CPU rendering becomes common, I am sure Intel will provide another vector instruction extension to solve the hardest bits of DXT decompression / filtering (if those prove to be a bottleneck). Intel has already introduced domain-specific instruction sets (such as AES-NI) and bit manipulation instructions (BMI). I don't think we need separate dedicated fixed function hardware pipelines for texture sampling/filtering.
 
GPUs have big raw floating point peaks, but the integer processing capacity of GPUs hasn't been discussed that much. And it's an area where CPUs still have an advantage. Kepler's 32-bit integer multiply is 1/6 rate, bit shifts are 1/6 rate. Fermi is slightly better (1/4 rate for integer mul & shift). Algorithms / structures used on GPU programs are getting more and more sophisticated (kd/oct/quad/binary/etc trees, hashing, various search structures, etc), and traversal of these structures is mostly integer processing.

AVX2 doubles the CPU's integer processing capability (to full width 256 bit registers). AVX2 also makes SIMD integer processing more useful, because now SIMD can also be used efficiently for memory address calculation: gather can directly use SIMD register contents as memory offsets (instead of requiring eight separate register moves and load operations). CPU vector processing also supports lower precision integers (8 x 32 bit / 16 x 16 bit / 32 x 8 bit). GPUs on the other hand have "fixed width" SIMD (Nvidia has 32 lanes, AMD has 64). Many algorithms (image processing, image decompression, etc.) do not need integers wider than 8/16 bits. CPUs can process these at 2x or 4x rate. The GPU wastes cycles and perf/watt by using needlessly wide 32 bit integer registers/operations for all integer math.
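The address-calculation point in code terms: with AVX2 the eight indices stay in a SIMD register and feed the loads directly (the table and index vector here are just illustrative).

```cpp
#include <immintrin.h>

// result[i] = table[idx[i]] for 8 lanes at once; scale of 4 bytes per float element.
static __m256 gather8(const float* table, __m256i idx)
{
    return _mm256_i32gather_ps(table, idx, 4);
}
```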

Rasterization, address generation and image decompression (DXT for example) are mostly integer processing. The GPU has fixed function hardware for all of these; the CPU on the other hand has to use its programmable integer units for these tasks. AVX2 integer processing might not be exactly as fast as fixed function hardware, but it is more flexible. This flexibility might allow the CPU developer to choose more efficient algorithms.

Bandwidth:

An eight-core Sandy/Ivy Bridge has 51.2 GB/s of memory bandwidth (DDR3-1600) and 20 MB of last level (L3) cache. Fermi has 192.3 GB/s of memory bandwidth (GDDR5) and 768 KB of last level (L2) cache. For graphics rendering, that 20 MB cache can be used in exactly the same way as gaming consoles use EDRAM. The backbuffer (color & depth buffers) is the single biggest bandwidth consumer in graphics rendering (rasterization). On a traditional PC GPU, you have to spend GDDR5 bandwidth on the backbuffer as well. On a modern PC CPU you could use the 20 MB cache as your render target, and save a lot of memory bandwidth. This approach seems to work for the Xbox 360 and Wii U too. A CPU (Haswell-E) with over twice the memory bandwidth of these devices (and an L3 twice as big as the EDRAM in the Xbox 360) should be just fine for high quality graphics rendering. It wouldn't match Kepler, but at least it would be more flexible.
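A quick sanity check of the "backbuffer fits in the L3" claim (assuming a 1080p target with 32-bit color and 32-bit depth; purely illustrative numbers):

```cpp
#include <cstddef>

constexpr std::size_t kWidth = 1920, kHeight = 1080;
constexpr std::size_t kBackbufferBytes = kWidth * kHeight * (4 /*color*/ + 4 /*depth*/);
static_assert(kBackbufferBytes == 16'588'800, "about 15.8 MiB, comfortably under a 20 MB L3");
```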
What N-Gage games did you work on? Also, the damned thing didn't have an FPU? Wow.
I was the lead programmer in Pathway to Glory. It received excellent reviews, but sadly pretty much nobody had the N-Gage required to play it. The game even supported real time online multiplayer, but GPRS internet connections were very slow and expensive back then. If I remember correctly, the GPRS data was around 50 euros/month, and without a web browser nobody wanted to pay that much. iPhone changed things a few years later, and the rest is history :)
 
Waiting on III-V is 8 or more years. TFET is something more speculative and may be a decade or more.
You missed the point. Everyone's concentrating on lowering power consumption now. There will be short-term results, and long-term results. Short-term results that keep driving the convergence between the CPU and GPU, and long-term results to make unification an achievable and desirable goal not just for integrated graphics but also to scale beyond that.
The original Pentium ran at several hundred MHz and continued to run in the same ballpark with NTV.
Mobile GPUs run at several hundred MHz.
Niche HPC hardware and FPGAs run at several hundred MHz.
They can all do as horribly as they do right now, just many times more power efficiently.
The original Pentium running at several hundred MHz was a design from 1993. So you'll have to wait about 20 years for NTV technology to give your GPU the same reduction in power consumption achieved with that Pentium, but still at today's performance level. Of course the increasing transistor budget can offset this, but it reduces the gain in power efficiency to only about 4x. Obviously all of this is highly dependent on what happens to semiconductor technology in the next two decades...

In any case, you clearly don't want to adopt full NTV technology any time soon. You want to pick the design changes that allow you to lower the supply voltage with the least impact on performance, and only to augment what newer processes and design techniques don't offer. Intel is already on top of all of this for its CPUs: using 8T SRAM and early adoption of FinFET. GPUs suddenly adopting full NTV technology wouldn't let them catch up with this in any way.
20-30% per node, optimistically. That means that the general-purpose silicon can get its 20-30% increase in transistor count, and then there's 70-80% of the chip that they can put something in or just have unexposed wafer.
It has to be much better than that. The Radeon 7970 has 65% more transistors than the 6970, and offers 40% higher performance in practice (with only 50% higher bandwidth). Even if you account for the slightly higher power consumption and clock frequency, that's reasonably good utilization of those extra transistors. Yes, it's possible that on average only 20-30% of those extra transistors are switching, but the 6970 doesn't have transistors that are switching 100% of the time either, so you have to look at the relative increase. And if power were the limiting factor for using more, you'd expect them not to increase the clock frequency. Also, this is with a process that's approaching the limits of planar bulk transistors. FinFET offers a significant improvement in the short term, and there are multiple promising technologies for the longer term.
Already being done.
And it will surely be extended upon. Gating is a hot research topic (pun unintended). It's obviously disingenuous to look at an Alpha 21264 when discussing the power consumption per instruction of modern CPUs, but it's equally pointless to only look at today's CPU designs when discussing their future scaling potential. For instance branch prediction confidence estimation is said to save up to 40% in power consumption while only costing 1% in performance due to false negatives. When you have multiple cores and you're optimizing for throughput/Watt, this should be tuned to allow for a slightly larger single-threaded performance impact and offer a substantially bigger power saving during badly predictable branchy code (and it could also be tuned dynamically depending on the number of active threads).

Wider SIMD units would increase the relative number of transistors that can be gated when running scalar workloads. And long-running SIMD instructions enable more gating in the rest of the pipeline during parallel workloads.

I'm obviously just scraping the surface here. There are hundreds if not thousands of researchers and engineers working on stuff like this. Besides, both the CPU and the GPU have the same switching activity reduction problem. So it's not like a unified architecture would in theory be worse off. It's definitely a challenge to ensure that this unification doesn't completely cancel the power optimizations, but there are clearly ways forward, and it solves all of the efficiency and programmability issues that heterogeneous computing is facing.
The lack of a free lunch doesn't seem like a strong detraction from anything.
There is absolutely nothing we've discussed that comes for free.
Developers are not very willing to jump through many hoops for extra performance. The failure of GPGPU in the consumer market clearly illustrates this. AVX2 on the other hand marks an inflection point for auto-vectorization to extract DLP from generic code. That's a free lunch. Likewise TSX is mostly about enabling the creation of tools and frameworks which assist or automate multi-threaded development. That's lowering the developer cost of TLP extraction. Of course it's not completely free on the hardware end, but I'm sure GPU manufacturers wished they could pay that low a price to make GPGPU a selling point.
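For what it's worth, this is the kind of loop that AVX2 moves into auto-vectorization territory: the indirect read maps onto a hardware gather, so the compiler no longer has to give up on the whole loop (the function and array names are made up for the example).

```cpp
void apply_lut(float* out, const float* in, const int* idx,
               const float* lut, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * lut[idx[i]];   // lut[idx[i]] becomes a gather with AVX2
}
```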
It strikes me as extra baffling because the earlier part of this discussion concerning long-running SIMD is a design that somehow knows the workload it's running and sort of adjusts itself.

It strikes me as very disingenuous to say that only one type of core can have access to this knowledge.
Keeping track of the ratio of long-running SIMD instructions that are executed seems pretty straightforward. And while I never said that only one type of core could have access to such knowledge, I don't see what a non-unified architecture could do with it. Care to elaborate?
It's particularly true since major shifts in unit activity do incur costs, as we see with Sandy Bridge and its warmup period and known performance penalty for excessively switching between SIMD widths.

There are a lot of transfers and costs that can be considered acceptable if they are within the same ballpark in terms of latency and overhead for incidental events such as that.
If Intel is free to caution programmers not to do X, or risk wrecking performance on SB, the same leeway can be granted elsewhere.
The penalty for mixing SSE and AVX instructions is easy to avoid, and mostly just a guideline for compiler writers. It's on a completely different level than GPU manufacturers telling you to minimize data transfers between the CPU and GPU, which is nowhere near trivial to achieve. It also remains to be seen whether Sandy Bridge's penalty will still exist for Haswell, since each SIMD unit will be capable of 256-bit integer and floating-point operations. And finally, the warmup period is very well balanced to ensure that code without AVX instructions doesn't waste any power and that it's unnoticeable to code that does use AVX.
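The guideline in question, in its simplest form (legacy_sse_function() is a stand-in for any non-VEX SSE code, e.g. in a prebuilt library): zero the upper halves of the YMM registers before handing control over, and the transition penalty never occurs.

```cpp
#include <immintrin.h>

void call_legacy_code()
{
    // ... 256-bit AVX work here ...
    _mm256_zeroupper();        // clear the upper YMM halves to avoid the SSE/AVX transition penalty
    // legacy_sse_function();  // hypothetical non-VEX SSE callee, now safe to call
}
```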
Or just moving them within millimeters of each other and use physical integration to provide growth in bandwidth.
Whatever gets the job done works for me.
Sure, and every step along the way that gets the job done, happens to be a convergence step.
I would say that the caches, interconnect, and memory controller would be in the same boat.
And that extra coalescing doesn't come for free, if you consider that a valid argument.
Not free, but cheaper than relying on the developer to handle data locality.
The actual hardware demands for the two types in terms of units and data paths weren't that dissimilar.
That's hindsight. Today it's plain obvious that we'll never go back to a non-unified GPU architecture. But several years ago it really wasn't cut and dried that vertex and pixel processing should use a unified architecture.

"It’s not clear to me that an architecture for a good, efficient, and fast vertex shader is the same as the architecture for a good and fast pixel shader." - David Kirk, NVIDIA chief architect, 2004

There will come a day in the not too distant future when all programmable computing will be handled by one architecture. And from then on it will be considered plain obvious that all workloads are not too dissimilar; they're just a mix of ILP, DLP and TLP.
The physical circuits and units involved are massively overdetermined for that use case, and the cost of that is significantly higher than zero.
Every load and store would be rammed through a memory pipeline designed for 4GHz OoO speculation and run by a scheduler, retirement logic, and bypass networks specified to provide peak performance and data transfer to portions of the core you declare are fine to be unused.
Today's out-of-order execution architectures consume far less power than several years ago. It's even becoming standard in mobile devices (the iPhone 5's CPU is faster than the 21264, in every single metric). And things are only getting better, so let's not exaggerate the cost of out-of-order execution. The 21264 days are long gone. Also, the scheduling cost is the same, regardless of whether it's an 8-bit or a 1024-bit arithmetic instruction, or an 8-bit or 1024-bit load or store. That, together with long-running instructions to increase the gating opportunities, really means that out-of-order execution will become a non-issue for high DLP workloads. And in fact it improves data locality, thus saving on data transfer power consumption.
I guess it is true that if a programmer has more power than he should have and he does pointless things that there could be a problem. It's sort of solved by the growing trend of the chip's cores, microcontrollers, and firmware very quietly overriding what software thinks is happening.
A "growing trend" of software and hardware fighting over control isn't a sustainable solution. Developers would have to deal with unexpected behavior of several configurations of several generations of several architectures of several vendors with several driver and several firmware versions. With ever more abstraction layers to run ever more complex code on this very wide variety of hardware, with each layer thinking it knows what's best and most of them not under the application developer's control, it becomes incredibly hard to write high performance software. Heck, it's a nightmare just to get acceptable stability and provide bug fixes after shipping.

A unified architecture would eliminate these issues. There's no need to balance workloads between different core types of unknown capability. There's no unexpected data transfer behavior. And the 'driver' is whatever software libraries you decide to use and ship.
Threads migrate all the time, even between homogenous cores. The costs are measurable and can be scheduled and managed if they aren't explicitly spelled out by the software so that the chip knows what kind of core a thread needs.
You can't measure what kind of specialized core a thread might need, before you execute it. And threads can switch between fibers that have very different characteristics. So a better solution is to have only one type of core which can handle any type of workload and adapts to it during execution. It doesn't have to do any costly migrations (and explicitly "manage" that at minimal latency), it just adjusts on the spot.
Unifying fetch and decode might be a choice, but it too doesn't seem to be strictly necessary since the fetch and decode requirements can be different between cores. There would be no software-visible difference.
Nothing is "strictly" necessary. Having the same shader capabilities for vertices and pixels while having dedicated cores for each is perfectly possible. That doesn't mean its recommendable though. Even mobile GPUs are going unified.

This indicates that homogenizing the ISA and memory model at a logical level will eventually lead to unifying fetch/decode and the memory subsystem at the physical level.
The value on the bypass bus is whatever value came out of the ALU, irrespective of the destination register, and unless the ALU performs the same operation twice, that value is gone afterwards and further accesses will need to come from the register file.
The same ALU doesn't have to perform the same operation twice. The result is typically bypassed to all other ALUs that can operate on it, and for any operation they support. Also, since the scheduler wakes up dependent instructions in the cycle before the result becomes available, the chances of executing an instruction which can pick operands from the bypass network is very high. Note also that writeback can be gated when the value's corresponding register is overwritten by a subsequent instruction. This can cause a large portion of instructions to execute without even touching the register file.
The ORF is a software-managed set of registers used to keep excessive evictions from occurring from the RFC, and can service multiple accesses across multiple cycles. It is guided by the compiler's choices in register ID usage, and because the source value is in the instruction, no tag checking is needed. It's very much not a bypass network.
Again, things like tag checking are independent of instruction width. So it becomes insignificant for very wide SIMD instructions. Fermi used tag checking in the RFC. So I'm not arguing that the ORF and the bypass network are the exact same thing, but from a power consumption point of view they can serve similar purposes. And while the ORF saves the cost of tag checking, it doesn't help reduce thread count to improve memory locality and thus causes more power to be burned elsewhere.
 
Show me where I can get a CPU to run Crysis (you know, the first one, from 5 years ago) at maximum settings. High-end GPUs were capable of this back in late 2007 when it was released, so unless someone can show me a CPU implementation capable of at the very least matching half-decade-old GPUs in this regard, I call BS on CPUs replacing GPUs in the near future.
Sure, high-end GPUs could run it back in 2007, but that's not where the unification will start. Integrated graphics still has great difficulty running Crysis at high settings, today. That said, nobody's claiming that CPU architectures as we know them today will replace integrated graphics either. Next year's Haswell architecture, which doubles the peak FLOPS per core (again), still won't make it happen, but it shrinks the gap. And more is coming, because Intel has already expressed an interest in consolidating AVX with Xeon Phi's 512-bit ISA, and it can be readily extended to 1024-bit as well. This would dramatically improve the CPU's ability to extract DLP, and I've discussed several other techniques here which would improve performance/Watt.

So eventually we'll end up with an architecture which combines the qualities of legacy CPUs and GPUs, meaning a dedicated (integrated) GPU can be dropped. For business users, who don't care about Crysis, this could potentially happen even before AVX reaches 1024-bit width.
 
You are aware they are fabricated on different processes? Yes, yes you are.
I said so explicitly. I don't see why you question it like that.
You are a pretty smart guy. And it is clear you are really good at playing dumb and avoiding pieces of reality which don't fit your pre-conceived ideas about it.
Thanks, but I'm not playing dumb or avoiding anything. I am well aware of the challenges that lie ahead. But the more I learn about them, the more solutions I see as well. So as I've said before they're just hurdles to cross, not walls that stop any progress, and definitely not something that's reversing the convergence. If there's something that hasn't been mentioned yet, or for which you strongly believe that I'm not appreciating its effect of rendering unification impossible (pun unintended), please state it.
I am a bit confused by what you mean by "this." I have interpreted it to mean why do you not believe that GK104 is the obvious GPGPU successor to GF110. Please let me know if this is incorrect.
This isn't primarily about which chip is considered the successor or not. I was talking about architectural efficiency. And while that could be mostly a theoretical discussion, I believe GK104 versus GF110 provides practical proof since it's faster at graphics in every game but GPGPU results are clearly compromised.

Once we can agree on that, we can start concluding what it might mean to the future of GPUs. We can readily observe that NVIDIA does kind of position GK104 as the successor to GF110. Which is working out fine due to its appreciable increase in graphics performance. It's of course no wonder that with a comparable transistor budget it can't be substantially faster at all workloads, but it's quite telling about the state of GPGPU that NVIDIA made this shift in workload efficiency in the first place. In comparison, GF104 was faster than GT200 at both graphics and compute. Which is impressive even if you take the slight increase in transistor count into account. NVIDIA clearly decided to take a radical turn from that with Kepler. It's not a sustainable strategy though. You can't keep ignoring latency optimizations without sooner or later also compromising certain graphics workloads.

So apparently GPUs can no longer outrun Moore's Law for all workloads like they did before. Meanwhile the CPU is the one outpacing Moore's Law at DLP, with plenty more potential. Hence unification is unavoidable, and will happen from the bottom up. Discrete cards which focus on graphics will be able to outrun it the longest, but the realities of the memory wall will eventually catch up with them too.
Let's see...
GF100/110 & GK110 have similar die sizes; GK104 does not.
GF100/110 & GK110 have similar DP rates; GK104 does not (1/2 and 1/3 vs. 1/24th).
GF100/110 & GK110 have a 384-bit memory interface; GK104 does not (256-bit).
GK110 has Hyper-Q and Dynamic Parallelism; GK104 does not.
GK110 has double the L2 of GF100/110; GK104 has less.
Part of this isn't very relevant to the discussion at hand. And while GF110's memory bus is indeed wider, you fail to mention that GK104's higher memory frequency gives them equal bandwidth. Also, while GF110 has 50% more L2 cache, GK104 has 73% higher L2 bandwidth. If, despite that substantial bandwidth benefit, a mere 256 kB less L2 were the cause of GK104's shameful GPGPU performance, you'd expect them to have invested half a percent more transistors to save its reputation.

So unless there's something else I've missed, which you haven't mentioned yet either, these two chips are fairly closely comparable for isolating the effects of the architecture of the compute cores on performance.
Now, you contend that because GK104 and GF110 share similar transistor count on different processes they are comparable, and that because GK110 slightly more than doubles GF110's transistor count at 28nm vs 40 nm they are not comparable.

Should we then apply this logic to CPUs? I'm guessing you wouldn't like that.
Go ahead, I would love to see the effect of architectural choices on the performance and power consumption of different workloads, largely isolated from the transistor count.
 
Exactly which consumer applications could benefit greatly from GPU rendering but do it in software instead because of unreliability and API complexity?
I didn't mean rendering, specifically, despite talking about software rendering in that paragraph. I was pointing out that outside of the gaming industry (where there's little alternative for now), consumer application developers in general stay clear of GPUs due to the complications. But as the CPU's graphics performance continues to improve, we'll see more developers choosing software rendering when it's adequate, even if dedicated hardware is faster.

I can tell you that Chrome blacklists quite a few GPUs and drivers that may be unreliable, making them fall back to software rendering. They don't bother/risk supporting everything out there, despite our best effort to also support SM2 hardware in ANGLE.
 
Nick's primary contention is that a single threaded program will always have better locality than a parallel program, but I don't think this is really true. While certainly TLP has bad locality, DLP is a different beast altogether.
I'm not claiming single-threaded code "always" has better locality. I wholeheartedly agree that DLP extraction through SIMD execution and non-divergent SMT typically has excellent locality for things like graphics, which is why I'm suggesting that CPUs feature 1024-bit SIMD units with long-running instructions, and that Hyper-Threading stick around.

But all good things need moderation. Sharing the caches between too many threads causes too much contention. You have to keep the balance between data reuse within a thread, and reuse across threads, which varies. Today's GPUs can't adapt to workloads that require a low thread count for best data locality, or don't have a high thread count in the first place. The CPU on the other hand can cope with just one thread and still extract fairly high performance from ILP, but it can also deal with workloads that offer lots of DLP and/or TLP, and it has lots of opportunity to be extended to become better at those.
The future I envision has only 1-2 serial latency optimized cores (since how many cores does a serial program use...?), while the rest of the die (besides caches...) is given over to throughput optimized cores.
With all due respect that's not much of a future vision, since that's pretty much today's situation. But the reality of it is that the GPU is practically only being used for graphics, despite the fact that generic code also has some DLP worth extracting. That's because migrating things back and forth between these different cores costs more than what can be gained. And this problem gets worse as the computing power increases but bandwidth and latency don't keep up.

So what I envision is a low number of unified cores (more than 2 though - TSX will improve the multi-threading potential), each having a few very wide SIMD units. This would eliminate the work migration problem.
So, to sum up, use DLP wherever you can, but only use TLP where it's really needed, like for asynchronous stuff that keeps the UI responsive.
I agree. I just don't agree that this dictates a heterogeneous architecture.

Also keep in mind that while DLP is very valuable and today's CPUs aren't taking enough advantage of it yet, you do want an architecture capable of dealing with a fair amount of TLP as well. SIMD doesn't handle divergent data accesses or divergent control flow very well, so once you've exhausted the opportunity to split things into fibers, you have to split things across threads.
 