Is everything on one die a good idea?

The Memory Wall is exacerbated by the co-location of code and data and MIMD-style, multi-core execution. In that sense, the future may be more GPU-like than CPU-like.
I agree the number of MIMD cores should be kept as low as possible. But note that 8 cores with two 1024-bit FMA units each would already amount to 3 TFLOPS at 3 GHz. A GTX 680 has the same number of SMXs and the same total computing power. So I don't think a future unified architecture has to be any more GPU-like than it has to be CPU-like. It has to have all the qualities of both.
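For reference, a quick back-of-the-envelope check of that figure (assuming 32-bit lanes and counting each FMA as two FLOPs):

8 \text{ cores} \times 2 \text{ FMA units} \times \tfrac{1024}{32} \text{ lanes} \times 2 \text{ FLOP} \times 3\,\text{GHz} \approx 3.1\,\text{TFLOPS}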

By 2024 we'll have far more MIMD cores, but we'll also have an L4 SRAM and L5 stacked DRAM, or something better altogether. Some of the data/code locality will have to be controlled through software, potentially assisted by hardware collected statistics. TSX is also a major component in allowing the core count to scale.
It isn't clear to me how serial architectures are better at dealing with higher latencies either.
First and foremost they deal with it by trying to avoid it altogether. The low thread count and large caches ensure that the average memory latency is quite low. The hit rate is also substantially improved through prefetching. And when an LLC miss does occur, most of its latency is still hidden by out-of-order execution, and other misses can hide in its shadow. As a last resort there's also Hyper-Threading.

So the CPU has a ton of weapons against high latency memory accesses. The GPU basically only has SMT, which can only hide latency when there's enough storage for thread context. This is self-defeating, because more threads means lower hit rates and thus more latency to be hidden. When this happens, performance falls off a cliff. CPUs deal with low locality far more elegantly. Execution continues on a cache miss thanks to out-of-order execution, and even when it grinds to a halt there are few other threads interfering so it can recover fast.
The reason (well, one of them) why we only have quad-core Intel CPUs is that there isn't the bandwidth on the 115X sockets [one of the reasons why I only compared the 860 to the 4770s, instead of the 9X0s, which are on a different class socket entirely]. Whatever you think about the GPUs and memory bandwidth, CPUs are already up against the wall.
That's really a quad-core CPU plus an iGPU. We could have an 8-core CPU for this bandwidth instead. And I wouldn't say CPUs are up against the wall for increasing raw bandwidth. DDR4 will increase bandwidth while lowering power consumption at the same time. But it has been delayed because there was still enough life in DDR3. In other words, the solution for more bandwidth is ready but there hasn't been a demand for it yet. Also note that when discrete GPUs die out, there won't be a need for high bandwidth PCIe and those pins can be repurposed for supplying more RAM bandwidth. So there's plenty of opportunity to increase bandwidth in cost effective and power efficient ways.

GPUs have to instead aggressively move forward with stacked DRAM. Note that this is a solution that will also be available to the CPU, once it needs it and the cost goes down.

So I don't really see what you're trying to argue for here. By 2024, today's status quo and short term trends will be irrelevant. We'll see lots of other innovation to keep moving the Memory Wall when necessary. But the fact of the matter remains that today's GPU architectures are very wasteful with bandwidth and the situation is already worse for them than it is for CPUs.

Solutions to this bandwidth hunger exist too, by adopting techniques from CPUs. So there's no need to worry, we'll still see a lot of growth in graphics computing power. It's just going to inch closer to a unified architecture, which is also desirable for many other reasons than bandwidth.
I do think we're in agreement that the issue here is memory bandwidth. I think you'll get no argument from anyone that putting a large amount of memory very near the CPU and very near the GPU would be awesome. I would love to have 100 MIMD cores and 10k SIMD cores to play with. It's less clear to me that I need to have a large number of MIMD cores co-located with my SIMD cores.
It's a necessity due to Amdahl's Law. Any kind of MapReduce task needs the Reduce to start as soon as possible after the Map to minimize total processing time. On a homogeneous architecture you can switch from vector processing to scalar processing from one cycle to the next. You can even have it overlap a bit.
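As a minimal sketch of that (plain C++, hypothetical data): the map loop below has independent iterations, so the compiler can vectorize it, and the scalar reduce can start the moment the loop finishes, with no kernel launch or bus transfer in between.

Code:
#include <cstddef>
#include <vector>

// Map then reduce on one homogeneous core: the map loop is SIMD-friendly
// (independent iterations), and the scalar reduce begins immediately after,
// with no device transfer or kernel-launch latency in between.
float map_then_reduce(const std::vector<float>& in) {
    std::vector<float> mapped(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)   // map: vectorizable
        mapped[i] = in[i] * 2.0f + 1.0f;
    float sum = 0.0f;
    for (float v : mapped)                        // reduce: scalar, starts right away
        sum += v;
    return sum;
}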

Amdahl's Law is also reflected in the sudden move towards low-overhead drivers. That's again all about the sequential performance starting to matter more than additional parallelism. But these driver models are only an intermediate stopgap that shifts the problem to the application. Eventually we need the SIMD cores to sit much closer to the scalar cores to further lower that latency. Intel is on the right track with AVX-512.
Even if I accept that there's a large crossover between those worlds, it isn't clear to me whether the world belongs to traditional CPUs with vector style extensions, or GPUs with (for example) a "real" core per SMX.
The world will belong to anyone who can combine the properties of both. Whether you call that a CPU which can act as a GPU or a GPU which can act as a CPU is irrelevant, although I think people are more inclined to use the former.
I do hope that we get a few years of CPUs with vector instructions, and GPUs with arm/whatever cores on them. That's where all the fun is! Once hardware gets homogenized, we have to put our coding straightjackets back on :(
I disagree. The next few years could get very messy and not much fun as developers try to figure out where to run their code. They'll be forced to support multiple architectures or settle for one and leave a large percentage of users in the cold. AVX-512 will be the most developer-friendly and will have the largest market share on PCs, but consoles will require some compute tasks to run on the GPU or to use narrow SIMD on AMD's excuse for an 8-core CPU. Discrete GPUs with scalar cores will only make matters worse overall.

Unified homogeneous hardware will liberate us from the straitjackets created by different heterogeneous architectures and the need to aim for the lowest common denominator, instead enabling renewed creativity in software design.
 
Are those two things really all that different? Isn't it mostly semantics at this point?

I was hoping that question would draw in more thoughts. Alas :(

So, in all honesty, I don't *know*. I mean, we're talking about theoretical hardware descended from non-existent (well, okay, I don't have any, but if someone wants to send some to me, I'm happy to provide an address) hardware. But, let me take a crack at guessing.

Yes, there will likely be differences. I do think that each evolutionary path taken to the logical extreme should meet in a common place, it's just that I don't think we'll ever get there from either side (because successful hw/sw usually stops at 'good enough'), and even if someone were to start fresh, it's highly unlikely that they'll ever be more than a niche player. I expect that the low-latency heritage of Intel and the high-throughput heritage of nvidia will dominate their designs, affecting things like cache design, apis, and optimization strategies.

I agree the number of MIMD cores should be kept as low as possible. But note that 8 cores with two 1024-bit FMA units each would already amount to 3 TFLOPS at 3 GHz. A GTX 680 has the same number of SMXs and the same total computing power.

You realize you're comparing legacy (>2yrs) shipping silicon with non-existent, theoretical parts?

So I don't think a future unified architecture has to be any more GPU-like than it has to be CPU-like. It has to have all the qualities of both.

Well, most of the qualities of both, anyway. Maybe even just 'many'.
I'll agree with you in the main :)

By 2024 we'll have far more MIMD cores

I've been waiting for more than four years now.
My impatience may well be tied to my perception of CPUs being up against the wall already.

So the CPU has a ton of weapons against high latency memory accesses. The GPU basically only has SMT

Counter-example: cache size on Kepler.
GPUs aren't against the wall, they have yet to avail themselves of all of the weapons that the CPU already requires.

That's really a quad-core CPU plus an iGPU. We could have an 8-core CPU for this bandwidth instead.

Yeah, I don't believe it. GT3 is relegated to lower-speed parts, and why do they need GT3e?

DDR4 will increase bandwidth while lowering power consumption at the same time. But it has been delayed because there was still enough life in DDR3.

I was under the impression that DDR4 was delayed because DDR3 wasn't moving in high enough volumes due to falling desktop sales.

In other words, the solution for more bandwidth is ready but there hasn't been a demand for it yet.

I demand more cores!!
No, but, you're right, there's insufficient demand for desktops period. But that's not really a technical reason, is it? Worse, slowing demand is slowing innovation, which I would argue is slowing demand.

Also note that when discrete GPUs die out, there won't be a need for high bandwidth PCIe

10Gbe, 4k video acquisition, dedicated raid cards...
I think I've got four cards in my system, only one of them is a gpu....
Oh, and I don't currently have any of the above three :shrug:

GPUs have to instead aggressively move forward with stacked DRAM. Note that this is a solution that will also be available to the CPU, once it needs it and the cost goes down.

Apparently, IGPs will use it first. It's hard for me to know whether that supports your argument or not :)

It's a necessity due to Amdahl's Law. Any kind of MapReduce task needs the Reduce to start as soon as possible after the Map to minimize total processing time. On a homogeneous architecture you can switch from vector processing to scalar processing from one cycle to the next. You can even have it overlap a bit.

Honestly, I hadn't considered MRs as a way to structure code execution between latency cores and throughput cores, but I can see how a Map could roughly equate to a throughput processor, and a Reduce to a latency processor. It has the additional benefit of keeping the two styles of code separate. But, MRs have a stage between Map and Reduce called 'shuffle', where map-output, keyed-data is bucketed and co-located so that the Reduce phase can execute. Partially this is to increase reliability (otherwise your reduce phase is subject to the whims of all map sources), but co-location of data is sort of important for low-latency operations :) For example, it is highly likely to boost cpu cache efficiency if MRs are applied in the way that you describe. It's precisely because of this intervening phase, as one example of the impedance mismatch between code designed for throughput and code designed for low-latency (if you will), that I question whether we benefit from co-location. Maps generally run co-located to data-sources, and sized according to how those data-sources are built. Reduces aren't.

[Apologies to the Hadoopers out there, as terminology isn't 100% translatable.]

It's entirely reasonable to believe, on the other hand, that some kind of coding convention NOT of the MR style (where throughput code and latency code are separately developed) could become popular. In which case co-location will be of benefit. In the absence of evidence, I retain a measure of skepticism.

Intel is on the right track with AVX-512.

I'll believe that when their IGP starts using it.

I disagree. The next few years could get very messy and not much fun as developers try to figure out where to run their code.

Yes, that's exactly what I mean by 'fun'. What fun is it when everything has a known solution? Anyone can follow a recipe. Do you want to cook, or do you want to be a chef?

Unified homogeneous hardware will liberate us from the straitjackets

Homogeneity is a straitjacket. I agree that LCD-driven solutions suck. Don't buy those :)
 
Are those two things really all that different? Isn't it mostly semantics at this point?
Yes, there will likely be differences. I do think that each evolutionary path taken to the logical extreme should meet in a common place, it's just that I don't think we'll ever get there from either side (because successful hw/sw usually stops at 'good enough'), and even if someone were to start fresh, it's highly unlikely that they'll ever be more than a niche player. I expect that the low-latency heritage of Intel and the high-throughput heritage of nvidia will dominate their designs, affecting things like cache design, apis, and optimization strategies.
"Good enough" has lead to integrated GPUs. So this is an argument against discrete GPUs, not one that will keep them around. It's also a result of discrete GPUs not being good enough in several respects. In the same vein, fixed-function is dead and buried, and non-unified is dead and buried, through good enough and not good enough on opposite sides. That's three major architectural changes in a little over a decade. I'm sorry but it's madness to think that in another decade things will still look roughly the same due to low-latency and high-throughput heritage being good enough. That's totally ignoring the problems, and ignoring the opportunities.
You realize you're comparing legacy (>2yrs) shipping silicon with non-existent, theoretical parts?
Please. We're trying to look 10 years ahead, and you're worried about looking 2 years back? Besides, I'm really comparing a hypothetical near-future unified architecture to a 2-year-old GPU and to a 2-year-old quad-core CPU with a 300 GFLOPS iGPU. My point was the theoretical possibility of having the same raw throughput as the discrete GPU without an excessive number of cores or unrealistic die area, while also fully retaining the qualities of serving as a CPU. I think that's quite phenomenal.

Furthermore, even today a 3 TFLOPS discrete GPU is above average, while a quad-core with an iGPU is about the minimum you can get. Just as another reference point the PlayStation 4 has a 1.8 TFLOP GPU and ~8-core CPU, and that's a system first and foremost aimed at gaming! The real target for a unified architecture 'today' is even lower than that. Intel just has to aim at meeting or exceeding their own integrated GPU performance, which currently stands at merely 800 GFLOPS. Of course that's a moving target, but not one that appears hard to keep up with when 3 TFLOPS already seems technically feasible in the not too distant future.

I'm fully aware that for discrete GPUs to die out it would also have to meet their performance, which is a moving target as well. But with consoles now using integrated GPUs, I don't think it will take long for games to run worse on discrete GPUs even if they have more raw processing power. This is comparable to non-unified GPUs sometimes having more total processing power but being inefficient for modern games. Also, when faced with the choice between a weak CPU plus an expensive discrete GPU, or a unified CPU with twice the cores for the same price total, the discrete GPU will really have to excel to be worth losing CPU power. So all Intel has to do is keep carving out the market from below and the 1000 $ you once spent on a 'Titan' will one day go to a 'Xeon' with more cores than the average.
I've been waiting for more than four years now.
My impatience may well be tied to my perception of CPUs being up against the wall already.
I share your impatience, but I really don't think the apparent stagnation in core count is due to running into any hardware walls. The importance of the software ecosystem cannot be overstated. When dual-core CPUs appeared, practically nothing happened. That's because nothing had to happen for existing multi-task use cases and asynchronous task completion to hugely benefit from a second core and improve the user experience. When quad-cores appeared, developers really had to rearchitect their entire application to take advantage of the extra computing power. But first there was a chicken-or-egg problem where nobody bought a quad-core because there was no software, and developers didn't put the effort in because there was no market share (and it's still not that great). Then for the longest time multi-core development has been very poorly understood and good tools are scarce. For the best results you even have to delve into lock-free algorithms, which only a handful of developers on the planet truly master.

Fortunately there is light at the end of the tunnel. C++ is adopting multi-thread awareness into the language, and TSX makes writing lock-free algorithms child's play in comparison to using CAS. Also, a growing number of frameworks and libraries have become thread-safe. So don't underestimate what's involved in making a significant majority of popular software multi-threaded, and how it has slowed down multi-core scaling. It's an intricate web of dependencies and everyone's waiting on everyone else. The greatest thing is that once things have gained critical mass, scaling from 4 to 8 and then beyond should go relatively smoothly. I think we're nearly there.
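To illustrate what TSX buys you, here's a hedged sketch (hypothetical counters, RTM intrinsics from immintrin.h, assuming a TSX-capable CPU and -mrtm): with CAS the state has to be packed into a single word to update it atomically, while with TSX ordinary code is wrapped in a transaction and falls back to a simple lock on abort.

Code:
#include <immintrin.h>   // _xbegin/_xend/_xabort (RTM); needs a TSX-capable CPU
#include <atomic>

// CAS version: to update two counters atomically they must share one word.
std::atomic<long long> packed{0};
void bump_cas() {
    long long old = packed.load();
    while (!packed.compare_exchange_weak(old, old + (1LL << 32) + 1)) {}
}

// TSX version: plain code inside a transaction, with a spinlock fallback.
long a = 0, b = 0;
std::atomic<bool> lock_taken{false};
void bump_tsx() {
    if (_xbegin() == _XBEGIN_STARTED) {
        if (lock_taken.load(std::memory_order_relaxed))
            _xabort(0xff);                        // lock held: retry on the fallback path
        ++a; ++b;                                 // ordinary code, committed atomically
        _xend();
    } else {
        while (lock_taken.exchange(true)) {}      // fallback: plain spinlock
        ++a; ++b;
        lock_taken.store(false);
    }
}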
GPUs aren't against the wall, they have yet to avail themselves of all of the weapons that the CPU already requires.
That's exactly what I said: "Solutions to this bandwidth hunger exist too, by adopting techniques from CPUs. So there's no need to worry, we'll still see a lot of growth in graphics computing power. It's just going to inch closer to a unified architecture, which is also desirable for many other reasons than bandwidth."
That's really a quad-core CPU plus an iGPU. We could have an 8-core CPU for this bandwidth instead.
Yeah, I don't believe it. GT3 is relegated to lower-speed parts, and why do they need GT3e?
Don't overthink this. Anyone buying a high-end consumer CPU is still likely to get a discrete GPU. All-in-one systems and laptops with retina screens are the most likely to benefit from a faster class of integrated GPU, and you don't want the highest-clocked CPU cores in those. Crystalwell is indeed intended to increase bandwidth for GT3e (and keep power consumption low), but it doesn't disprove my claim. A quad-core Haswell CPU can achieve close to 512 GFLOPS, while GT2 pushes 432 GFLOPS, and this combination doesn't require Crystalwell. So it's safe to say that an 8-core without an iGPU doesn't strictly need Crystalwell either. AVX-512 will demand more bandwidth to feed ~2 TFLOPS on an 8-core, but we'll have DDR4 by then, so Crystalwell still remains an option for scaling beyond that (and/or having more RAM channels and less peripheral I/O). There's no shortage of options to make it cost-effective and power efficient.
I was under the impression that DDR4 was delayed because DDR3 wasn't moving in high enough volumes due to falling desktop sales.
I don't see much correlation there, aside perhaps from mispredictions leading to overproduction, leading to small margins, leading to less investment in new innovation? Still, if DDR3 bandwidth had quickly become inadequate, they would still have reserved some investment for moving toward DDR4 production. To me this seems much more like the RDRAM story all over again. There's a new, expensive technology to aggressively increase bandwidth, but instead the current technology continues to scale for a while and remains adequate at a lower cost, ultimately because the need for more bandwidth isn't very high to begin with.

Intel is smart enough to invest into new technology long before it becomes necessary, so that it can produce it at an affordable price. This is what worries me about NVIDIA's Volta. I'm sure it can increase raw bandwidth, but it's a fairly radical departure from previous GDDR increments and it's only going to be required by the high-end parts aimed at a small market. So it will likely be expensive. They could try to offset it with Tesla parts, but that market will likely be destroyed by Xeon Phi.
I demand more cores!!
No, but, you're right, there's insufficient demand for desktops period. But that's not really a technical reason, is it? Worse, slowing demand is slowing innovation, which I would argue is slowing demand.
I still kind of wonder which came first. Did mobile appear and desktops stagnate as a result, or did desktop performance stagnate and people turned their attention elsewhere? Or is it just coincidence that this coincided? I love my superphone as much as the next guy, but I'd pay good money to upgrade my desktop to something significantly more powerful. I still think the software ecosystem's slow adjustment to the multi-core reality is central to it, even though that itself was a consequence of single-core performance/Watt stagnation, which ties in with mobile.

I have good hopes that eventually we'll pull out of this. Mobile CPUs now resort to out-of-order execution and multi-GHz frequencies, while 'desktop' CPUs are down to the single-digit Watt consumption. They'll soon clash, and after that the only way forward will be wider SIMD everywhere and more cores everywhere.
Also note that when discrete GPUs die out, there won't be a need for high bandwidth PCIe and those pins can be repurposed for supplying more RAM bandwidth.
10Gbe, 4k video acquisition, dedicated raid cards...
I think I've got four cards in my system, only one of them is a gpu....
Oh, and I don't currently have any of the above three :shrug:
That's really the exception. There will always be workstations with expansion slots, probably even for some specialized discrete GPUs, but the rest of the world is moving towards all-in-one systems and laptops. Note that today's workstations have CPU sockets with lots more pins and lots more RAM bandwidth, but you mentioned 115X sockets, and that's what I responded to.
GPUs have to instead aggressively move forward with stacked DRAM. Note that this is a solution that will also be available to the CPU, once it needs it and the cost goes down.
Apparently, IGPs will use it first. It's hard for me to know whether that supports your argument or not :)
There's a difference. Crystalwell is a 128 MB L4 cache that sits on the package PCB, while Volta aims to put all of the RAM (several GB) next to the GPU. It's a big, desperate, radical move on the part of the discrete GPU that's bound to have some cost implications, while for the CPU DDR4 still offers a direct increase in raw bandwidth and Crystalwell is probably cheaper and more power efficient than increasing pin count (which also still offers room for growth). They both obviously aim at different performance levels, but Intel is carving out the discrete GPU market from below and they've only just started. Everything is converging to the same point so integrated GPUs and unified architectures will face the same issues eventually, but not before discrete GPUs have disappeared.
Honestly, I hadn't considered MRs as a way to structure code execution between latency cores and throughput cores, but I can see how a Map could roughly equate to a throughput processor, and a Reduce to a latency processor. It has the additional benefit of keeping the two styles of code separate. But, MRs have a stage between Map and Reduce called 'shuffle', where map-output, keyed-data is bucketed and co-located so that the Reduce phase can execute. Partially this is to increase reliability (otherwise your reduce phase is subject to the whims of all map sources), but co-location of data is sort of important for low-latency operations :) For example, it is highly likely to boost cpu cache efficiency if MRs are applied in the way that you describe. It's precisely because of this intervening phase, as one example of the impedance mismatch between code designed for throughput and code designed for low-latency (if you will), that I question whether we benefit from co-location. Maps generally run co-located to data-sources, and sized according to how those data-sources are built. Reduces aren't.

[Apologies to the Hadoopers out there, as terminology isn't 100% translatable.]

It's entirely reasonable to believe, on the other hand, that some kind of coding convention NOT of the MR style (where throughput code and latency code are separately developed) could become popular. In which case co-location will be of benefit. In the absence of evidence, I retain a measure of skepticism.
In the absence of evidence, I really don't see why it would be reasonable to believe that such a coding convention could become popular.

It certainly hasn't happened so far. GPGPU is proving very messy for consumer application development, and the only thing that helps is closer hardware integration and fewer separations at the code/data level. Note that map-reduce in the general sense is extremely common in typical code: any loop with independent iterations, followed by some form of aggregating the results, is amenable to 'map' parallelization which subsequently requires a low latency 'reduce' execution to fight Amdahl's Law. Nobody really wants to rewrite that in a heterogeneous fashion and deal with the synchronization issues and attempts to hide the latency.
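For what it's worth, the homogeneous version of that everyday pattern is about as boring as it gets (a minimal OpenMP sketch, hypothetical data): the independent iterations get spread across cores and SIMD lanes, the aggregation happens in the same address space, and there's no second architecture, driver, or copy-back step to reason about.

Code:
#include <vector>

// The common "independent loop + aggregate" shape, kept on homogeneous cores.
double sum_of_squares(const std::vector<double>& xs) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)   // map across cores, reduce the partial sums
    for (long i = 0; i < static_cast<long>(xs.size()); ++i)
        sum += xs[i] * xs[i];                     // compile with -fopenmp (runs serially without it)
    return sum;
}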
Intel is on the right track with AVX-512.
I'll believe that when their IGP starts using it.
That will never happen. Being only half-way homogeneous has all the disadvantages of being heterogeneous, with none of the benefits of being homogeneous, topped with massive confusion. Instead, when they're ready to use AVX for all graphics (most probably 1024-bit by then), the IGP as we know it will disappear and any computing will happen on uniform cores. Still, it can happen more gradually than you might think.
Yes, that's exactly what I mean by 'fun'. What fun is it when everything has a known solution? Anyone can follow a recipe. Do you want to cook, or do you want to be a chef?

Homogeneity is a straitjacket. I agree that LCD-driven solutions suck. Don't buy those :)
You have an utterly wrong idea about that. The latency and bandwidth limitations for communicating between heterogeneous components impose limitations on the sort of algorithms you can efficiently implement. Lots of great ideas where high throughput and low latency are closely intertwined are simply not feasible today. Worse yet there are many variations so you have to aim down the middle and can't really achieve the best results on anything. You only have to look back at non-unified GPUs to see proof that heterogeneous processing is a big limitation on creativity. You had to conform to "known solutions" for balancing vertex and pixel complexity to get good performance, which made every game back then look much more alike, in contrast to the variety we have today thanks to unification. Unification of the CPU and GPU will only create even more possibilities, and not just for games.
 
It certainly hasn't happened so far. GPGPU is proving very messy for consumer application development, and the only thing that helps is closer hardware integration and fewer separations at the code/data level. Note that map-reduce in the general sense is extremely common in typical code: any loop with independent iterations, followed by some form of aggregating the results, is amenable to 'map' parallelization which subsequently requires a low latency 'reduce' execution to fight Amdahl's Law. Nobody really wants to rewrite that in a heterogeneous fashion and deal with the synchronization issues and attempts to hide the latency.


A bit of fun at Ocaml's prompt

Code:
# let double = List.map (fun x->2*x);;
val double : int list -> int list = <fun>
# double [1;2;4;33;14;8];;
- : int list = [2; 4; 8; 66; 28; 16]

Do you think functional programming has its place? (e.g. Haskell, OCaml, F#, Erlang). Tim Sweeney was calling for a layered boondoggle that includes a purely functional core (pictured here in page 58, talked about from page 44 or 39 - these are slides)
http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf

I can't add much more to the discussion, I'm glad I'm able to read it so far.
 
Do you think functional programming has its place? (e.g. Haskell, OCaml, F#, Erlang). Tim Sweeney was calling for a layered boondoggle that includes a purely functional core (pictured here in page 58, talked about from page 44 or 39 - these are slides)
http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf
Yes, the functional programming paradigm definitely has its place. But I don't think functional languages like the ones you listed are the right answer. You don't have to make everything functional to achieve high parallelism. There's great value in the ease of imperative programming for things that can remain single-threaded and scalar (which doesn't mean there can't be concurrency or vectorization).

I think it suffices and is far more practical to have a functional EDSL within existing popular languages like C++. Examples of these are Halide and SystemC. And while these are pretty much entire languages on their own, it's not unreasonable for something like a game engine, which is used by many game titles, to have its own EDSL. Also, it doesn't have to be an entire language. Often you just need a way to express dataflows in a functional manner. Google's Cloud Dataflow is based on similar ideas where you build a pipeline with very simple looking constructs, but it fully abstracts the reality that it can run on millions of processors. Also, you can have an imperative EDSL but use it within a concurrent functional framework. This is what I aimed for with Reactor in SwiftShader, borrowing some ideas from GRAMPS and reactive programming.
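As a toy illustration of the idea (a hypothetical sketch, not Halide's or Reactor's actual API): stages are ordinary C++ values, and composing them builds a dataflow that a runtime behind the scenes would be free to fuse, vectorize, or spread across cores.

Code:
#include <functional>
#include <vector>

// Hypothetical toy EDSL: a stage is a function over a buffer, and operator|
// composes stages into a single dataflow value.
using Buffer = std::vector<float>;
using Stage  = std::function<Buffer(const Buffer&)>;

Stage operator|(Stage a, Stage b) {
    return [a, b](const Buffer& in) { return b(a(in)); };
}

int main() {
    Stage scale  = [](const Buffer& in) -> Buffer { Buffer out(in); for (float& v : out) v *= 2.0f; return out; };
    Stage offset = [](const Buffer& in) -> Buffer { Buffer out(in); for (float& v : out) v += 1.0f; return out; };
    Stage pipeline = scale | offset;              // the dataflow, built as a value
    Buffer result = pipeline(Buffer(1024, 1.0f)); // a runtime could schedule this however it likes
    return result.empty() ? 1 : 0;
}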

Anyway, one thing's for sure: a lot of innovation has to, can, and will come from the software side. But first we need homogeneous teraflop architectures. AVX-512 will get us there pretty soon. I'm sure the IGP will stick around for quite a while before going fully unified, but that gives Intel and the software developers the time to make the transition without ending up like Larrabee.
 
"Good enough" has lead to integrated GPUs. So this is an argument against discrete GPUs, not one that will keep them around.
That all depends on the market, and what "good enough" means. The market of greatest significance is the one that's busy buying phones like candy right now. It's equally possible that we'll never get a fully unified processor, because no one will be buying them in the quantity required to invest in them. Consider http://www.marketwatch.com/story/tech-job-cuts-reflect-declining-desktop-computer-sales-2014-07-28
A look at which companies are cutting the most jobs shows that desktop computers may be going the way of the Polaroid camera.
Link-bait quote for sure (and look, it worked ;^/), but it's something to consider. Perhaps it is a pessimistic view of technological history, but at any given time, the "best" available product was rarely the most successful. You're asking for the "best" possible outcome -- the perfect integration of low latency and high throughput cores. It's not even clear to me that such a thing exists (the right balance is likely different depending on the task at hand), but even if it does, it seems unlikely that market forces are poised to deliver it to us. You'll recall that my entrance into this thread reflected a frustration with the pace of development, and a concern that the types of CPUs and GPUs that I need seem increasingly niche/expensive.

So, in my view, it's just as likely that "good enough" means the final integration never happens.

Of course, that isn't what this thread is supposed to be about -- it's supposed to be about whether it's a good idea or not :) But, I find the argument intellectually interesting, sorry!

I'm sorry but it's madness to think that in another decade things will still look roughly the same...
Sure, but that doesn't mean that I buy your argument as to what will change, or how, or even if I think it's the problem that needs to be solved. Here's a completely different problem. We've got over a billion active smart phones on the market right now. There is no sign of a slowdown in shipments, and it's likely that the embedded cameras will be 4k capable, and even more frequently used. The public shows an almost unlimited appetite for recording their lives -- consider twitch.tv. A billion phones/gaming-machines/street-cameras taking 150Mbps 4k video 24x7 isn't just a lot of data, it's a half-petabyte of data per person per year. Total harddrive sales capacity (137 million units in 1q2014) is roughly three orders of magnitude too small to store it all. So we're hoping that people will record less than .1% of their lives. Maybe 4k60p is an overestimate of where we'll be in a few years (after all, how do we upload 150mbps over cellular), but then we haven't accounted for data replication due to availability and redundancy and use of multiple social sites, etc. [and NSA :> ].
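The arithmetic behind that half-petabyte figure (assuming continuous 150 Mb/s recording):

\frac{150 \times 10^{6}\ \text{bit/s}}{8} \times 86{,}400\ \text{s/day} \times 365\ \text{days} \approx 0.59\ \text{PB per person per year}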

I'm not particularly interested in debating which of these issues is most pressing in the industry right now, I'm just trying to make the point that the technologies that we're interested in may not be the ones that drive market innovation. Maybe memristors come in to save the storage problem, and it totally changes the way code execution happens. Maybe, instead of integrating memory into the CPU, memory starts getting smarter, because more people are worried about the issues of storage than execution. Maybe we're like blind troubadours arguing about the importance of Next's postscript display, totally missing the incoming storm being brought by Mosaic.

We're trying to look 10 years ahead, and you're worried about looking 2 years back?
The "near-future" of AVX-1024 in desktop that you're considering is at least two release cycles after broadwell. I don't know that we'll get 8 cores by then -- I doubt it, but I'm willing to suspend my disbelief. We're talking at least 3 years from now, or in sum, 5 years of technological change. Am I worried that in your attempt to make a case for the near-term integration of these technologies that you're willing to use two examples that span a time-period of half as much? Yes I am.
FWIW, I think your arguments regarding the two current gaming consoles are more persuasive, while the distinction between average GPU and average CPU is negated when pricing is considered [even without considering that we're talking raw cpu vs. gpu + 4GB + etc.].

My point was the theoretical possibility of having the same raw throughput as the discrete GPU without an excessive number of cores or unrealistic die area, while also fully retaining the qualities of serving as a CPU. I think that's quite phenomenal.
I agree, it could make for quite an interesting time for sure.

Also, when faced with the choice between a weak CPU plus an expensive discrete GPU, or a unified CPU with twice the cores for the same price total, the discrete GPU will really have to excel to be worth losing CPU power. So all Intel has to do is keep carving out the market from below and the 1000 $ you once spent on a 'Titan' will one day go to a 'Xeon' with more cores than the average.
A scenario like that is almost exactly what concerns me. Some people will need the high-throughput cores. Some people will need the low-latency cores. There will be a tradeoff, and a bifurcation of the market. Even the sum total of this population is small, and likely to get smaller. Things are going to get expensive, my debating friend :(

I share your impatience, but I really don't think the apparent stagnation in core count is due to running into any hardware walls. The importance of the software ecosystem cannot be overstated.
I agree, and I admit to it being a problem that I don't usually think of, because I've been writing multithreaded code since I learned how to code back in the 80s. The idea that people don't find it a natural style of coding often sneaks up and bites me. I had a talk with my previous manager only a few months ago about just this subject, and it became apparent that I just 'see' code differently.

For the best results you even have to delve into lock-free algorithms, which only a handful of developers on the planet truly master.

In my experience, a bigger issue is recognizing where there are locks that aren't locks. If you've got a thread calling another thread synchronously, you have a lock, but most people don't see that. They teach locking pretty well at schools these days, but not so much multi-threaded programming....
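A minimal sketch of that hidden-lock shape (slow_work is a hypothetical helper): there's no mutex in sight, yet the synchronous call blocks the caller exactly like one.

Code:
#include <future>

int slow_work() { return 42; }   // hypothetical: runs on another thread

int caller() {
    auto result = std::async(std::launch::async, slow_work);
    // ... no explicit mutex anywhere, yet:
    return result.get();   // blocks until the other thread finishes --
                           // a synchronous cross-thread call is a lock in disguise
}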

That's exactly what I said: "Solutions to this bandwidth hunger exist too, by adopting techniques from CPUs.
Yes, you did. I seem to recall the context being different, and the point being argued being different, and thought myself frightfully clever to use your own ideas, but honestly, I don't remember what the points in question were, and I'm too lazy to go back and figure it out :)

It's just going to inch closer to a unified architecture, which is also desirable for many other reasons than bandwidth."
We agree. I have a sense that a perfectly reasonable alternative would be to pursue a purely stream-oriented programming model, but I don't see that happening. I think it's that whole seeing things differently -- people like the security of discrete bits of code running over data, rather than vice-versa.

I don't see much correlation there, aside perhaps from mispredictions leading to overproduction leading to small margins leading to less investment in new innovation?

This is the article sourced from the wiki: http://www.techhive.com/article/2034175/adoption-of-ddr4-memory-facing-delays.html
What I got out of that was: desktop sales fall, memory suppliers go out of business, demand for DDR3 is actually high relative to the remaining supply, and there aren't enough people looking for DDR4.

This is what worries me about NVIDIA's Volta. I'm sure it can increase raw bandwidth, but it's a fairly radical departure from previous GDDR increments and it's only going to be required by the high-end parts aimed at a small market. So it will likely be expensive.
Total nitpick -- Pascal, you mean. I have to keep looking it up as well, I keep thinking 'Parker'. Gah....
But, yes, I agree, we're ultimately talking about expensive parts.

...but I'd pay good money to upgrade my desktop to something significantly more powerful.
Yeah, I'm still waiting as well. I need to use Windows because of the software I need to run, but I'm looking forward to Windows8 in exactly the way I looked forward to Vista.... I mean, I could use the advances in SMB, and that's about all....

That's really the exception. There will always be workstations with expansion slots, probably even for some specialized discrete GPUs, but the rest of the world is moving towards all-in-one systems and laptops. Note that today's workstations have CPU sockets with lots more pins and lots more RAM bandwidth, but you mentioned 115X sockets, and that's what I responded to.

Yeah, no, I think my fear is exactly that people are moving to a world that doesn't need any of the things I want to play with. It's going to make my hobbies more expensive....

There's a difference. Crystalwell is a 128 MB L4 cache that sits on the package PCB, while Volta aims to put all of the RAM (several GB) next to the GPU.
Yes, agreed, I was being a little facile.
It's a big, desperate, radical move on the part of the discrete GPU that's bound to have some cost implications, while for the CPU DDR4 still offers a direct increase in raw bandwidth
I dunno. From what I can tell, the move to on-die memory is a power consideration move. GPUs have actually gotten less wide over time -- you don't see a lot of 512-bit buses on GPUs anymore (GK110 is 384-bit), so I don't think the motivation is bandwidth ... at least not entirely.

I also don't think it's going to be as expensive as you think. The reading I've done on DDR4 indicates that the memory array is less than a third of the area of the actual chip. Getting rid of all of the termination, impedance, and power hardware is a huge net benefit in area and power. I don't know if GDDR5 suffers the same issues. I'm also less sanguine about the improvements that DDR4 has on tap. The current crop of memory units represents a move backwards in latency (from the current high-end DDR3). That makes you even more reliant on code optimized for data location, when current OO decompositions often lead to exactly the inverse.
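A small sketch of that inversion (hypothetical particle type): the OO-friendly array-of-structs drags every field through the cache on a position-only pass, while a struct-of-arrays layout streams exactly the bytes the loop touches.

Code:
#include <cstddef>
#include <vector>

struct Particle { float x, y, z, mass, charge, lifetime; };  // OO-style array-of-structs

struct Particles {                                           // struct-of-arrays layout
    std::vector<float> x, y, z, mass, charge, lifetime;
};

void advance_aos(std::vector<Particle>& ps, float dx) {
    for (Particle& p : ps) p.x += dx;       // pulls whole 24-byte structs through the cache
}

void advance_soa(Particles& ps, float dx) {
    for (std::size_t i = 0; i < ps.x.size(); ++i)
        ps.x[i] += dx;                      // contiguous floats: prefetch- and SIMD-friendly
}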

It certainly hasn't happened so far. GPGPU is proving very messy for consumer application development, and the only thing that helps is closer hardware integration and fewer separations at the code/data level. Note that map-reduce in the general sense is extremely common in typical code: any loop with independent iterations, followed by some form of aggregating the results, is amenable to 'map' parallelization which subsequently requires a low latency 'reduce' execution to fight Amdahl's Law.
Yes, I agree, we haven't found a good programming paradigm yet. And yes, I use ~MRs in a bunch of my programming. I mean, really Flume http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf but aside from the impossible-to-parse, overly genericized java code, it's essentially the same thing. The nice thing about that style is that it preserves the sequential construction of code execution that most programmers feel comfortable with, if only we could fix the syntax that makes so many actual flumes impossible-to-follow amalgams of '<' and '>'.... We also use another technology that is functional, and designed for multi-threaded code, but which leaves the creation of the actual code execution sequence up to the runtime engine. It's nice that the computer can find the optimal execution order, but losing the sequencing is ... unsettling. I don't think we've described that in public yet, but I raise it to point out that we're still playing with different ways of expressing code execution. That's the fun bit! :)

You have an utterly wrong idea about that. The latency and bandwidth limitations for communicating between heterogeneous components impose limitations on the sort of algorithms you can efficiently implement. Lots of great ideas where high throughput and low latency are closely intertwined are simply not feasible today.

Of course. But necessity is the mother of invention. Many more ideas get investigated as these things are in flux. Consider architectures that are very wide, with only limited serial execution. I've never seen architectures with the throughput version of OOO. But wouldn't it be neat to have a streaming programming architecture where each branch became a different collection point for future SIMD execution? There are so many unexplored avenues, I'm not in a hurry to finish the quest.

Worse yet there are many variations so you have to aim down the middle and can't really achieve the best results on anything.
You're thinking like a tool user, not a tool maker. Tool makers want the messy bits -- it's way more fun to create architectures around (eg) Cell, but then, way less fun to build a game. So, it all depends on what you're in it for. (And yes, if you're trying to be compatible across all of those products, that is a complete nightmare.)

A bit of fun at Ocaml's prompt
Do you think functional programming has its place? (e.g. Haskell, OCaml, F#, Erlang).

"has its place" is open to a lot of interpretation, so I'm going to try and be more precise, but I might wind up asking a different question. If we grade languages on their ability to efficiently execute code on future processors, does the functional programming model offer a compelling decomposition of current programming problems?

I think that depends. I think there is a challenge in representing parallel code execution in any serialized format, but I'm not sure that the functional model has, as one example, many benefits over a somewhat more advanced representation of 'const' than C has. (C captures only half of immutability -- there is no way to model a promise by the caller that the passed data will not be modified by the caller while the callee is executing. Such a thing isn't possible in non-multi-threaded code, but of course is in multi-threaded situations.) But to be honest, I haven't played around a lot with it. Most of my functional programming is done in bastardized-Java, which has its own issues. The baggage of immutability and the gratuitous data copying that often occurs is particularly troubling. That said, executed in an environment where there is no permanent memory, but just streams of data flowing from one code site to another, it might be incredibly powerful. Your example works particularly well for streams, as it's all Map (no Reduce). I think there's a place for some functional concepts in a proper model of code execution for sure....
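A tiny sketch of that missing half of 'const' (hypothetical code): the callee's signature promises it won't modify the data, but nothing lets the caller promise the same while the callee runs, so this compiles cleanly and still races.

Code:
#include <numeric>
#include <thread>
#include <vector>

// The callee promises not to modify 'data' (const), but the type system has
// no way for the caller to promise the same while the callee is running.
long sum(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0L);
}

int main() {
    std::vector<int> v(1000000, 1);
    std::thread reader([&] { sum(v); });  // callee sees const
    v[0] = 42;                            // caller mutates concurrently: a data race the types can't express
    reader.join();
}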

So, yes, I think playing with different representations of code is important, but I'm concerned that doing so doesn't really address the underlying problems of data non-locality which gets particularly egregious in OO code (no matter what the significant benefits of using OO modeling are).

Tim Sweeney was calling for a layered boondoggle that includes a purely functional core (pictured here in page 58, talked about from page 44 or 39 - these are slides)
http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf
Wasn't this some really old idea concerning why discrete GPUs had to die or some such? "The reports of my death have been greatly exaggerated?"

I can't add much more to the discussion, I'm glad I'm able to read it so far.
Well, I'm glad my devil advocating is useful to someone other than myself :>

I think it suffices and is far more practical to have a functional EDSL within existing popular languages like C++. Examples of these are Halide and SystemC. And while these are pretty much entire languages on their own, it's not unreasonable for something like a game engine, which is used by many game titles, to have its own EDSL. Also, it doesn't have to be an entire language. Often you just need a way to express dataflows in a functional manner. Google's Cloud Dataflow is based on similar ideas where you build a pipeline with very simple looking constructs, but it fully abstracts the reality that it can run on millions of processors. Also, you can have an imperative EDSL but use it within a concurrent functional framework. This is what I aimed for with Reactor in SwiftShader, borrowing some ideas from GRAMPS and reactive programming.

Like this -- this was very interesting for me. It reminds me wth it is that I'm still doing coding ;^/
Yes, I have heard of Halide making it very easy to create highly performant image manipulation code, I think even separately from execution on a GPU. It turns out that writing really efficient C++ code is tricky, and it often winds up being hard to maintain. I agree with Nick that an embedded language might work well. I think the problem is to appropriately model the problem space in such a way that an efficient implementation is a fall-out of the representation. OO, functional, etc. are tools for achieving that representation, but I'd argue against making them a matter of religion.
 
Umm... where? Please find the exact post and quote it. Otherwise you are flat out lying.
Please stop putting words in my mouth.
I apologize then. I should have mentioned that you merely implied that CPUs would stagnate; your example elevated it to the level of a statement.

But with consoles now using integrated GPUs, I don't think it will take long for games to run worse on discrete GPUs even if they have more raw processing power.
The current console cycle is a bad example: it is outmatched right out of the gate, which means it will not provide many advances on the visual front, which means it will not last long and will be replaced with more powerful hardware to maintain the necessary technological edge. Otherwise sales will stall.

This is comparable to non-unified GPUs sometimes having more total processing power but being inefficient for modern games.
You are giving CPU/GPU tight integration more credit than it deserves.

Also, when faced with the choice between a weak CPU plus an expensive discrete GPU, or a unified CPU with twice the cores for the same price total, the discrete GPU will really have to excel to be worth losing CPU power. So all Intel has to do is keep carving out the market from below and the 1000 $ you once spent on a 'Titan' will one day go to a 'Xeon' with more cores than the average..
Which wouldn't work, because a unified future Xeon would need to become as fast as a future Titan plus a decent CPU. I don't see that happening for obvious practical, commercial and technological reasons; discrete GPU throughput would have to dwindle for that to happen.

We're trying to look 10 years ahead, and you're worried about looking 2 years back?
You are still basing the "10 years ahead argument" on current graphics technologies. I agree that progress on that front is beginning to show the signs of diminishing returns, but you are still ignoring the possibility of another "big bang" in graphics, which becomes more likely the slower progress gets.

I'm fully aware that for discrete GPUs to die out it would also have to meet their performance, which is a moving target as well.
Exactly.
 
DavidGraham said:
I apologize then
Thank You.

DavidGraham said:
I should have mentioned that you merely implied that CPUs would stagnate; your example elevated it to the level of a statement.
No, I did not imply that. I know I didn't because I know the point I was making. Now, if you cared to understand my perspective, perhaps you could simply ask me rather than trying to tell me what I think. I am fairly sure I am the higher authority in that regard.
 
There are a lot of good opinions in this thread. As for myself, I like the idea of going to software rendering (except for texturing, maybe). I have always thought that the only reasons we ever had hardware rendering were because:
1. Intel didn't use an ISA or many extensions geared much more heavily towards graphics.
2. They have never put a whole lot of emphasis on floating-point performance (look at the Emotion Engine vs. the Pentium III).
3. Two or more general-purpose dies in one consumer-based system were not really considered.
 
...Amdahl's Law...

It surprises me how many people use Amdahl's Law to claim limits to parallelism in general since the law is actually rather flawed.

Let me explain:

Amdahl's Law expresses the limit of parallel performance in terms of percent serial in a program, but this entire premise is flawed in that for that percentage to be constant, the big O complexity of the serial portion must be the same as the big O complexity for the parallel portion. Otherwise the percent will asymptotically approach either 0% or 100%, making the law inapplicable.

Why are we stating this in terms of big O complexity? Because quite frankly we're only interested in large problem sizes - the small ones are "fast enough".

The thing is, in practice, the big O complexity of each portion of a given algorithm is often different. Here are a couple of examples:

Matrix multiplication: Let's assume a traditional discrete GPGPU architecture. Then the serial portion is copying memory back and forth, which is O(n). The parallel portion is the actual work, which is O(n^1.5). Since n^1.5 > n, the serial portion asymptotically approaches 0% for large problem sizes. Amdahl's law is irrelevant here.
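Spelled out, with hypothetical constants c_1 for the copy cost and c_2 for the compute cost: Amdahl's Law gives the speedup on N processors as

S(N) = \frac{1}{(1 - p) + p/N},

and for the matrix multiplication above the serial fraction itself shrinks with problem size,

1 - p(n) = \frac{c_1 n}{c_1 n + c_2 n^{1.5}} \to 0 \quad \text{as } n \to \infty,

so for large matrices there is no fixed serial percentage left for the law to bound.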

Map-Reduce: This one is a bit more complicated, since both map and reduce are parallel and of O(n) in many cases (actually, it's slightly more complicated - map is O(n/k) for k processors, and reduce is O(n/k), but only O(log n) for infinite processors; thankfully, log n increases so slowly as n gets large that, practically speaking, we can treat it as constant). The serial portion is O(1) since we're reducing to a single chunk of information which is sent back to the CPU. This yields 0% serial portion in the limit, assuming we don't have to send the data out to the GPU each time the algorithm is invoked. Hence, the question is data reuse. We can state this as a separate big O function, where the data transfer is O(1) (we send the data out exactly once), with n being the number of times the Map-Reduce is invoked over the course of solving the large problem without sending in completely new data. In many cases, the Map is iterative, while the Reduce provides some metrics for each simulation step. In these cases, we get O(n) for the parallel portion, approaching 0% serial time for large problems.

Of course, Map-Reduce is such a broad subject that trying to characterize it is futile, my point is more that with a well designed algorithm, the serial portion should be of a lower order than the parallel portion.

~~~

The core idea of massively parallel programming then is to find algorithms such that the parallel portion is of a higher order of complexity than the serial portion. As long as this condition is met, the parallel speedup is unbounded.
 
"Good enough" has lead to integrated GPUs. So this is an argument against discrete GPUs, not one that will keep them around.
That all depends on the market, and what "good enough" means. The market of greatest significance is the one that's busy buying phones like candy right now.
Phones favor integrated GPUs even more strongly than the PC market does, so I don’t see how this is supposed to help your ‘good enough’ argument. Even the Tegra K1, which can be considered ultra high-end in the mobile market, is a fully integrated design. And while not necessarily ‘ultra’, the new consoles again prove that you don’t need discrete graphics for a system aimed first and foremost at the gaming market. So it’s really not all that market-sensitive. Discrete GPUs will stick around for a while due to inertia, but they’re destined to become a niche at best.
It's equally possible that we'll never get a fully unified processor, because no one will be buying them in the quantity required to invest in them. Consider http://www.marketwatch.com/story/tech-job-cuts-reflect-declining-desktop-computer-sales-2014-07-28

Link-bait quote for sure (and look, it worked ;^/), but it's something to consider. Perhaps it is a pessimistic view of technological history, but at any given time, the "best" available product was rarely the most successful. You're asking for the "best" possible outcome -- the perfect integration of low latency and high throughput cores. It's not even clear to me that such a thing exists (the right balance is likely different depending on the task at hand), but even if it does, it seems unlikely that market forces are poised to deliver it to us. You'll recall that my entrance into this thread reflected a frustration with the pace of development, and a concern that the types of CPUs and GPUs that I need seem increasingly niche/expensive.

So, in my view, it's just as likely that "good enough" means the final integration never happens.

Of course, that isn't what this thread is supposed to be about -- it's supposed to be about whether it's a good idea or not :) But, I find the argument intellectually interesting, sorry!
The computer technology industry has gone through several highs and lows before. It doesn’t signify much in the long term, especially for architectural evolutions driven by the laws of physics and practicality. Also, there’s always evolution going on somewhere. We might observe some stagnation in the PC market, but the mobile market isn’t magically isolated from the same core issues to achieve higher performance while expanding the range of applications. It took a while, but mobile CPUs now also resort to out-of-order execution and multi-core for greater speed, and mobile GPUs want to support the same compute APIs. Will evolution grind to a halt when mobile meets desktop (read: Broadwell and Tegra K1)? Of course not. We’ll just see that any technological advance won’t merely be used for mobile any more, but will benefit both.
I'm sorry but it's madness to think that in another decade things will still look roughly the same…
Sure, but that doesn't mean that I buy your argument as to what will change, or how, or even if I think it's the problem that needs to be solved. Here's a completely different problem. We've got over a billion active smart phones on the market right now. There is no sign of a slowdown in shipments, and it's likely that the embedded cameras will be 4k capable, and even more frequently used. The public shows an almost unlimited appetite for recording their lives -- consider twitch.tv. A billion phones/gaming-machines/street-cameras taking 150Mbps 4k video 24x7 isn't just a lot of data, it's a half-petabyte of data per person per year. Total harddrive sales capacity (137 million units in 1q2014) is roughly three orders of magnitude too small to store it all. So we're hoping that people will record less than .1% of their lives. Maybe 4k60p is an overestimate of where we'll be in a few years (after all, how do we upload 150mbps over cellular), but then we haven't accounted for data replication due to availability and redundancy and use of multiple social sites, etc. [and NSA :> ].

I'm not particularly interested in debating which of these issues is most pressing in the industry right now, I'm just trying to make the point that the technologies that we're interested in may not be the ones that drive market innovation. Maybe memristors come in to solve the storage problem, and it totally changes the way code execution happens. Maybe, instead of integrating memory into the CPU, memory starts getting smarter, because more people are worried about the issues of storage than execution. Maybe we're like blind troubadours arguing about the importance of NeXT's Display PostScript, totally missing the incoming storm being brought by Mosaic.
Point taken, especially since I just argued that there’s always some progress somewhere. But I can assure you that datacenter companies are on top of the future demand for cloud storage. What you might be underestimating is that for every bit that gets stored, there’s a proportional or even greater increase in demand for processing power. People want image quality improvement, format conversion, streaming with adaptive compression, face / object recognition, and last but not least we’ve only scratched the surface of ‘deep learning’ applications. There’s a lot more value in doing something useful with this massive amount of data than in just storing it. And that takes processing, lots of processing. So no matter what market we’re looking at, there’s plenty of demand for computing power.
The "near-future" of AVX-1024 on the desktop that you're considering is at least two release cycles after Broadwell. I don't know that we'll get 8 cores by then -- I doubt it, but I'm willing to suspend my disbelief. We're talking at least 3 years from now, or in sum, 5 years of technological change. Am I worried that, in your attempt to make a case for the near-term integration of these technologies, you're willing to use two examples that span a time period half as long? Yes I am.
FWIW, I think your arguments regarding the two current gaming consoles are more persuasive, while the distinction between the average GPU and average CPU is negated when pricing is considered [even without considering that we're talking raw CPU vs. GPU + 4GB + etc.].
The problem of the core count and the pricing are obviously closely related. Much of it can be blamed on AMD not having been competitive since Bulldozer. Fortunately they’ve recently assembled a new dream team led by Jim Keller to work on the brand new K12 architecture. They have abandoned Bulldozer’s module architecture and are focusing on single-threaded performance with SMT. It’s pretty obvious that to compete with Intel and their own current products, they need affordable 8-core CPUs. Intel will have no choice but to follow suit. K12 will probably be available in 2016-17, so there’s plenty of time left before 2024 to see one or more manufacturers unify the CPU and GPU.
It's a big, desperate, radical move on the part of the discrete GPU that's bound to have some cost implications, while for the CPU DDR4 still offers a direct increase in raw bandwidth.
I dunno. From what I can tell, the move to on-die memory is a power consideration move. GPUs have actually gotten less wide over time -- you don't see a lot of 512-bit buses on GPUs anymore (GK110 is 384-bit), so I don't think the motivation is bandwidth ... at least not entirely.
It’s both. Bandwidth needs to scale for GPUs to become faster, and while for Kepler they achieved that with a 50% increase in RAM clock speed, that’s not something they can repeat without serious power consumption issues. Pascal’s stacked DRAM makes it possible to increase the bus width again. So it’s all about bandwidth, and about power.

The generation after that will run into the Bandwidth Wall yet again. The solutions for raw bandwidth increases get ever more expensive, so GPUs will have to learn how to make do with less external bandwidth, like CPUs do, by using a hierarchy of caches ranging from small and fast to large.
I'm also less sanguine about the improvements that DDR4 has on tap. The current crop of memory units represents a move backwards in latency (from the current high-end DDR3). That makes you even more reliant on code optimized for data locality, when current OO decompositions often lead to exactly the inverse.
Current high-end DDR3 is the result of many years of fine-tuning this technology. DDR4 is still in early development, but it has several intrinsic advantages that will allow it to reach higher performance and lower power consumption than DDR3 will ever be able to attain. Again, you’re just not thinking long-term enough. That said, DDR4 will exceed DDR3 in every metric long before 2024. It’s really useless to talk about DDR4 vs. DDR3 when reasoning about that long a time span. I merely mentioned it to point out that CPUs can increase raw RAM bandwidth with clock increases while still reducing power consumption, while GPUs already have to resort to more exotic solutions. The CPU can also easily still have a wider bus, if/when it drops support for PCIe, which for a large class of consumer devices without a dedicated GPU would be perfectly reasonable. Lastly, they can have an L4 cache to crank up the bandwidth even more while lowering power consumption. So CPUs are nowhere near starved for bandwidth, and GPUs can learn a trick or two from that.
The latency and bandwidth limitations for communicating between heterogeneous components impose limitations on the sort of algorithms you can efficiently implement. Lots of great ideas where high throughput and low latency are closely intertwined are simply not feasible today.
Of course. But necessity is the mother of invention.
Sure, but the "necessity" isn’t faster execution on a heterogeneous architecture, it’s faster execution period. So even without further details the "invention" may very well be a homogeneous one. We already know how to make homogeneous architectures much faster, and it completely avoids the problems with heterogeneous processing. So why hope for a miracle solution in heterogeneous computing when homogeneous computing already solves it?
Many more ideas get investigated as these things are in flux. Consider architectures that are very wide, with only limited serial execution. I've never seen architectures with the throughput version of OOO.
Just because you haven’t seen them yet doesn’t mean they can’t exist. GPUs were non-unified at one point too, with stark differences between pixel and vertex processing. Now there’s no question that pixel and vertex processing can coexist. Likewise, high throughput and out-of-order execution can be combined into one CPU architecture.
But wouldn't it be neat to have a streaming programming architecture where each branch became a different collection point for future SIMD execution? There are so many unexplored avenues, I'm not in a hurry to finish the quest.
I don’t think there are that many avenues left worth exploring. Academics have investigated just about every crazy idea in one form or another. You would expect that the valuable ones have been found by now. SPMD on SIMD is one of those very clearly valuable ones worth exploiting more. I mean, there’s no need to go beyond what CPUs and GPUs do. Let’s first unify them, which already unlocks a lot of new possibilities, and maybe then start wondering where to go next.
You're thinking like a tool user, not a tool maker. Tool makers want the messy bits -- it's way more fun to create architectures around (eg) Cell, but then, way less fun to build a game. So, it all depends on what you're in it for. (And yes, if you're trying to be compatible across all of those products, that is a complete nightmare.)
I am very much a tool maker. I just think we have different ideas about “fun”. Something convoluted that nobody is going to use doesn’t give me any satisfaction. Also, if you want messy, try writing lock-free algorithms that abstract threads and locks into tasks and dependencies; you don’t need an exotic architecture like Cell to run into a lot of complexity for which good tools are needed to make it useful for application developers.
 
But with consoles now using integrated GPUs, I don't think it will take long for games to run worse on discrete GPUs even if they have more raw processing power.
The current console cycle is a bad example: it is outmatched right out of the gate, which means it will not provide many advances on the visual front, which means it will not last long and will be replaced with more powerful hardware to keep the necessary technological edge; otherwise sales will stall.
The previous generations of consoles still had a huge influence on game design long after they were outperformed by affordable PCs. So just because this generation starts out already outmatched doesn't mean it will have no effect. It's precisely because of their long life cycle that game developers invest in really understanding the architecture well and exploiting every opportunity to make the next game more compelling than the previous one. This currently means more CPU threads, and relatively low-latency compute tasks with unified memory. For a PC this means it favors more cores and higher SIMD performance in the vicinity of those cores.
You are giving CPU/GPU tight integration more credit than it deserves.
Why? Have you already seen the full effect of what that enables? All GPU manufacturers are working on more unified memory models, integrated GPUs are gaining market share, the consoles are integrated, AVX-512 brings GPU technology into CPU cores... Clearly tight integration is already given a lot of credit by the actual architects. We've only seen the tip of the iceberg of what that means for software design, but if it's anything like the GPU's unification of vertex and pixel processing, there's simply no turning back and discrete GPUs will be perceived as very limiting several years from now.
all Intel has to do is keep carving out the market from below and the $1000 you once spent on a 'Titan' will one day go to a 'Xeon' with more cores than the average.
Which wouldn't work, because a unified future Xeon would need to become as fast as a future Titan plus a decent CPU. I don't see that happening, for obvious practical, commercial and technological reasons; discrete GPU throughput would have to dwindle for that to happen.
We already have 15-core Xeon processors, and with AVX-1024 they would exceed Titan's theoretical performance. So even though I'm comparing technologies from different timeframes, it's not unimaginable for CPUs to keep up with / catch up with GPUs. And again, the effective performance of discrete GPUs will become significantly lower than their theoretical performance, due to software that needs low overhead communication between serial and parallel code. It's like the (non-unified) NV47 with 200 GFLOPS getting beaten by the (unified) G84 with 140 GFLOPS at DX10 workloads.
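Rough numbers behind that comparison, assuming a hypothetical AVX-1024 part with Haswell-like dual FMA pipes per core and a clock of around 2.8 GHz (both assumptions, not announced specs):

\[ 15\ \text{cores} \times 2\ \tfrac{\text{FMA}}{\text{cycle}} \times 32\ \tfrac{\text{SP lanes}}{\text{FMA}} \times 2\ \tfrac{\text{flops}}{\text{lane}} \times 2.8\ \text{GHz} \approx 5.4\ \text{TFLOPS}, \]

versus roughly 4.5 TFLOPS of single-precision peak for the original GTX Titan.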
You are still basing the "10 years ahead argument" on current graphics technologies, I agree that progress on that front is beginning to show the signs of diminishing returns, but you are still ignoring the possibility of another "big bang" in graphics, which is more likely to happen the slower progress becomes.
With an argument like that you might as well admit that unification will happen unless the moon falls out of the sky. Seriously, any major progress in technology that advances graphics is highly likely to advance generic computing just as much. The Memory Wall issue has been the cause of integration and provided room for programmability for multiple decades now. Heck, it's a geometrical fact that when transistor density quadruples, the edges can only have twice the number of wires (*), and the number of pins remains the same. We'd need radically different technology than what we've had in the past 60 years for these fundamental effects to no longer apply, and there's no telling it would help graphics any more than generic computing. Besides, graphics and computing are becoming more synonymous as well, so there's ever less room for specialization. Even rasterization and texture filtering are increasingly being done by the programmable cores.

(*) Except when adding more metal layers, but that's expensive and thus provides limited room for improvement.
 
it isn't clear to me whether the world belongs to traditional CPUs with vector style extensions, or GPUs with (for example) a "real" core per SMX
Are those two things really all that different? Isn't it mostly semantics at this point?
AMD GCN already has a "real core" per CU (compute unit). It's called the scalar unit. The scalar unit does pretty much the same things as the CPU scalar unit. It executes scalar (integer and float) instructions, it has a direct connection to the instruction cache, and it fetches instructions and processes branches and control flow (jumps, loops, calls). Vector units are slaves to the scalar unit (just like in a CPU): they get their instructions and scheduling from it. Each scalar unit commands four 16-wide (512-bit) vector units. Each vector unit has 10-way hyperthreading. Radeon 290X has 44 cores (CUs) like this.

Each GCN core (CU) can execute a completely different program. There is no global state or global resource tables like the old GPUs had. The scalar unit loads all the resource descriptors directly from memory to registers. It also has full general purpose L1 and L2 cache hierarchy just like a modern CPU.

A modern GPU is not much different from a modern supercomputer CPU such as the PowerPC A2 (Blue Gene/Q). The A2 has 16 cores, each core sporting in-order execution, 4-way hyperthreading (64 threads total) and two 256-bit vector units. Each A2 board has four CPUs, thus the total core (thread) count is 64 (256). The A2 also has a 32 MB eDRAM L2 cache to give it the needed bandwidth (and to provide a big transactional memory pool). At 1.6 GHz the board thus peaks at almost one teraflop of double precision performance, slightly beating Radeon 290X in double precision flops (at roughly equal power consumption).
Fixed-function GPUs have long been dead and buried. Non-unified GPUs have long been dead and buried. Both have been replaced by hugely inefficient fully programmable unified computing devices. Of course they're only inefficient in terms of power and area if they were implemented on the silicon processes used back when GPUs were fixed-function and non-unified.
Yes, if scaled up, an old fixed-function GPU would beat modern GPUs in texture blitting and shadow map rendering in both performance and power usage. However, I would personally never want to go back to the DirectX 7 era, where you had to multipass like crazy to get more advanced stuff out of the card. Basically all graphics programming felt like hacking the hardware to do something it's not supposed to do. Multipassing is not power efficient, and things had to change. There was a clear demand for programmable hardware.

In modern graphics engines, GPUs spend less than 30% of the frame running pixel and vertex shaders. Most graphics rendering steps have moved to compute shaders. Even though this new hardware is less efficient at simple tasks (such as simple ROP/depth output), the compute shader programming model, with its on-chip shared work memory and very low overhead synchronization primitives, allows much reduced memory traffic, reduces the need for latency hiding (most memory ops operate on the work memory, and it is as fast as the L1 cache) and makes more complex (work-efficient) parallel algorithms possible. The total efficiency gain is much greater than the efficiency loss from the slightly more complicated hardware.
It surprises me how many people use Amdahl's Law to claim limits to parallelism in general since the law is actually rather flawed.
Yes, and Gustafson's law explains why: http://en.wikipedia.org/wiki/Gustafson's_law

Pretty much every expensive step in our games involves large amounts of data / entities. Code that is just running once per frame (serial code without any loops) has never been a bottleneck. Thus Gustafson's law prevents Amdahl's law from mattering.
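For reference, the standard formulations of the two laws, with serial fraction \( s \) and \( N \) processors:

\[ S_{\text{Amdahl}}(N) = \frac{1}{s + (1-s)/N} \qquad \text{(fixed problem size)} \]
\[ S_{\text{Gustafson}}(N) = s + (1-s)\,N \qquad \text{(problem size grows with } N\text{)} \]

When the per-frame work grows with the entity/data count, as described above, you are in the Gustafson regime rather than the Amdahl one.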
The current console cycle is a bad example: it is outmatched right out of the gate, which means it will not provide many advances on the visual front
I strongly disagree with this statement. There aren't as many gains in raw performance as we were used to in past console generations, but both the CPU and the GPU are running much closer to their peak performance. In future games, compute shaders will change things radically. There are lots of graphics processing algorithms that get a 2x-5x boost in efficiency compared to last-generation GPU brute force processing. The GPU is just that much more programmable. You just need to let the developers get used to the new way of programming, and there will be huge advances in graphics quality.

nick said:
That's really a quad-core CPU plus an iGPU. We could have an 8-core CPU for this bandwidth instead.
Yeah, I don't believe it. GT3 is relegated to lower-speed parts, and why do they need GT3e?
If the market situation were different, Intel would be selling an 8-core CPU for the same price, as the die would be similar in size (the IGP is roughly 50% of the die). However, as most productivity software (and even games) does not scale beyond 4 cores, the IGP gives most customers more value, and Intel can segment the 8+ core CPUs for professionals. AMD losing the CPU race and dedicating its focus to APUs meant that there's no competition anymore in the high-end (multicore) CPU field. This is why Intel's business plan works, and we likely won't see high core counts in consumer devices until the programming models evolve and there's clear demand for higher core counts in consumer software as well.
So the CPU has a ton of weapons against high latency memory accesses. The GPU basically only has SMT, which can only hide latency when there's enough storage for thread context. This is self-defeating, because more threads means lower hit rates and thus more latency to be hidden. When this happens, performance falls off a cliff. CPUs deal with low locality far more elegantly. Execution continues on a cache miss thanks to out-of-order execution, and even when it grinds to a halt there are few other threads interfering so it can recover fast.
Counter-example: cache size on Kepler.
GPUs aren't against the wall, they have yet to avail themselves of all of the weapons that the CPU already requires.
The problem with running tons of threads simultaneously is that you need to split your register files and your caches between more threads. This means that a single thread effectively has fewer registers and less cache to use, meaning it needs to spill more data to memory.

However, the CPU also needs more data elements in registers and caches, since it needs to scan future code aggressively (a 192-entry ROB in Haswell), prefetch data (including some overhead), and rename registers to enable out-of-order execution. A CPU also needs a store forwarding buffer (another on-die memory block) to cope with compilers spilling registers to memory (the x86 architectural register set is quite limited in size, and compilers spill registers to the stack for various reasons, like function calls).

When creating optimal CPU code, you need to ensure that the serial execution of a single thread has high cache locality (so that most accesses come from the L1 cache). When creating GPU code, you instead need to ensure high cache locality across neighboring threads. GPU L1 caches have already been thrashed before a shader thread reaches its end, so it's better to organize the data and the processing in a way that neighboring threads' data accesses hit the same cache lines as much as possible. This is virtually a 90-degree transpose of the same problem.
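To make that transpose concrete, here's a minimal sketch (plain C++, hypothetical names; the GPU version is written as an ordinary loop over lanes just to show who touches which cache line):

    // CPU-friendly: one thread walks a contiguous range, so that thread's
    // *successive* accesses stay within the same cache lines.
    void process_cpu_chunk(const float* in, float* out, int begin, int end) {
        for (int i = begin; i < end; ++i)
            out[i] = in[i] * 2.0f;
    }

    // GPU-friendly: adjacent lanes of a wave touch adjacent elements, so the
    // wave's *simultaneous* accesses fall into the same cache lines (coalesced).
    void process_wave(const float* in, float* out, int waveBase, int laneCount) {
        for (int lane = 0; lane < laneCount; ++lane) {
            int i = waveBase + lane;   // locality across neighboring threads
            out[i] = in[i] * 2.0f;
        }
    }

Same data layout, same work; the difference is whether locality has to hold along one thread's timeline or across a wave at a single point in time.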

So in the end, the GPU isn't that much worse in memory, cache or register usage, and doesn't need significantly more live instructions to operate (compared to 192 per Haswell core). GPUs do however fall off the cliff much more easily when the programmer doesn't think about the memory access patterns or writes code that needs a huge number of registers. One of the biggest limitations of current GPUs is that they are not designed to scale the thread count up/down based on the current register pressure. The peak register usage determines the occupancy (thread count), even if that peak occurs inside a branch that is never taken with the current data set. The CPU is more flexible when executing dynamic code filled with branches.

Obviously the GPU is hopeless at executing code that doesn't have data parallelism, but not as much as it used to be. Radeon 290X could be executing 44 completely different kernels simultaneously without any problems (with the L1 and instruction caches behaving well).
GPUs will cease to exist, but only as we know them today. (...)

So the death of the GPU will be a joyous moment as well. We'll get a new breed of processors that fully supersede its functionality and will extract maximum amounts of ILP, TLP and DLP from any code you throw at it. Of course you can also think of it as a continuation of the GPU, or of the CPU for that matter, but the way we know either of them today will cease to exist.
Let me explain my views about this topic with a real world example.

Most game developers have come to the same conclusion. Automatically scaling (core count) task/job-based systems with small work items, parallel loops ("kernels") and entity/component data abstraction are both the way to extract the highest possible performance out of current multicore CPUs and to offer the best program maintenance (full separation of data and transforms and different entity properties, additive / existence-based programming).

The result is a program that doesn't have random branches, has almost linear memory access patterns, has zero raw synchronization primitives (no stalling) and has almost zero memory allocations (you can grow data tables with virtual memory tricks). This is how you want to use your CPU to get the most out of it. Most of the time, all the CPU cores are crunching the same processing step (no instruction cache stalls, reduced data stalls). Of course multiple tasks can be running simultaneously if there are no dependencies and there are free CPU cores.

You don't want to be running unique serial code on each of your CPU cores anymore. Modern engines have no dedicated "physics thread" or "graphics thread" anymore. Hard-coding serial tasks to long-running threads results in bad multicore CPU utilization. Different game scenarios have different bottlenecks (physics-heavy area, graphics-heavy scene, AI-heavy area). The monolithic serial tasks cannot be automatically parallelized to CPU cores that have free cycles, and the bottleneck dictates the frame length (fps).

You don't want unpredictable branches. Every single branch miss costs more than a 4x4 matrix multiply on a modern CPU. You don't want to touch properties that are not currently active (that is another useless memory access). Last but not least, 80%+ of a modern CPU's computational capacity (flops/iops) comes from the vector units. Modern engines need good primitives to make it easy for programmers to vectorize their processing steps (or to use an existing language extension such as Intel's SPMD program compiler, ISPC).

In the end, your efficient CPU code becomes quite close to the code you would be running on your GPU (using compute shader, CUDA or C++ AMP).
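As a minimal illustration (hypothetical names, not any particular engine's API), a data-oriented "kernel" over flat component arrays looks much the same whether a task scheduler runs it on CPU cores or a compute shader runs it on the GPU:

    #include <cstddef>
    #include <vector>

    // Flat, tightly packed component data: no pointer chasing, linear access,
    // trivially vectorizable by the compiler (or by hand with SIMD intrinsics).
    struct Positions  { std::vector<float> x, y, z; };
    struct Velocities { std::vector<float> x, y, z; };

    // One small work item: integrate a range of entities. A task scheduler would
    // split [0, count) into many such ranges and hand them to whichever cores are free.
    void integrate(Positions& p, const Velocities& v,
                   std::size_t begin, std::size_t end, float dt) {
        for (std::size_t i = begin; i < end; ++i) {
            p.x[i] += v.x[i] * dt;
            p.y[i] += v.y[i] * dt;
            p.z[i] += v.z[i] * dt;
        }
    }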

You should move your whole graphics engine (scene setup / management / animation / rendering) to the GPU. This provides nice improvements in performance per watt as well, since issuing draw calls on the CPU costs a huge number of cycles (and the generated command buffer data needs to be moved to the discrete GPU). On the other hand, on the GPU side, the same operation is simply writing a 32-bit identifier to an append buffer. There's a full thread about this topic here: http://beyond3d.com/showthread.php?t=64685&highlight=indirect+pulling.

Currently it seems that the heavy processing is moving away from the CPU to the GPU (at least in games). Developers finally have consoles that have fully programmable GPUs. GPUs can nowadays feed themselves, and this provides unique opportunities to get major performance (and efficiency) boosts by keeping all the data in the same device. You don't need the CPU to orchestrate the things anymore. Unfortunately the current PC graphics APIs are still lacking in this aspect, but the hardware is evolving rapidly (see: Kepler Dynamic Parallelism).

However, in the long run, I would definitely want a processor / programming model that allows me to write all my code in a single language (with good, modern, built-in parallel primitives) and to use a single set of debugging and analysis tools that can be used to step through (or trace back) the whole code base.
 
Most game developers have come to the same conclusion. Automatically scaling (core count) task/job-based systems with small work items, parallel loops ("kernels") and entity/component data abstraction are both the way to extract the highest possible performance out of current multicore CPUs and to offer the best program maintenance (full separation of data and transforms and different entity properties, additive / existence-based programming).
[...]
You don't want to be running unique serial code on each of your CPU cores anymore.

Interesting. You seem to be suggesting that, despite the apparent difference between task-oriented CPU cores and vectorized GPUs, CPUs are also better utilized if the code they're running is structured as if it were running on GPUs. Do you think this is because so much of the destination of the CPU output is for the GPU (ie: is this a result of your example -- you go on to describe moving the process of feeding GPUs onto the GPU), or is this fundamental to the nature of (eg) caching, lock-contention, etc?

Or have I misunderstood you completely? :)
 
AMD GCN already has a "real core" per CU (compute unit). It's called the scalar unit. The scalar unit does pretty much the same things as the CPU scalar unit.
...
The scalar unit loads all the resource descriptors directly from memory to registers. It also has full general purpose L1 and L2 cache hierarchy just like a modern CPU.
I think this is debatable for GCN, at least for now.
The scalar unit's cache is read-only, shared by up to 4 other CUs, and scalar reads are the one kind of memory traffic that cannot be counted on to always return in-order, even with respect to other scalar reads. From Southern to Sea Islands, the only solution deemed useful by the ISA docs is that a wavefront stall if a scalar read is in-flight.
The scalar cache is currently not described as taking part in the protocol that keeps the vector L1 coherent.
That seems to be fine for what the CUs are currently tasked with, but it's a bit short of what a core normally can do.

Perhaps the idea of giving the scalar unit a scalar subset of the vector ISA, espoused in the following research paper, will find itself implemented someday:
http://people.engr.ncsu.edu/hzhou/ipdps14.pdf
At least some of the names in the paper give it more credence than other proposals.

The CU might be considered holistically as some kind of degenerate core, but I'm vacillating on whether the better pop-culture metaphor for the scalar/vector split is Master Blaster from Mad Max or Kuato from the 1990 Total Recall.

At 1.6 GHz the board thus peaks at almost one teraflop of double precision performance, slightly beating Radeon 290X in double precision flops (at roughly equal power consumption).
Well, at least the non-FirePro version of it.

Yes, and Gustafson's law explains why: http://en.wikipedia.org/wiki/Gustafson's_law
I remember when Gustafson's law came up with Cell and Xenon eight years ago. I recall at least some developers on this generation being at least a little thankful for the Jaguar cores, even though their peak throughput numbers weren't a significant upgrade.

Unless the consoles abandon the fixed-platform concept, their maximum supported level of concurrency based on execution and storage has some ceiling that will be found in time, which means Gustafson's appeal to indefinite expansion will not bear out long-term.

Gustafson's law seems to be the thing that gets bandied about at the start of a generation.
It hasn't been the primary griping point when nearing the end.

Pretty much every expensive step in our games involves large amounts of data / entities. Code that is just running once per frame (serial code without any loops) has never been a bottleneck. Thus Gustafson's law prevents Amdahl's law from mattering.
At least for current game graphics. The current state of Sony's use of GPU compute for more latency-sensitive audio operations is that the neglected straight-line performance of the GPU keeps it tens of milliseconds beyond the acceptable range.

If the market situation were different, Intel would be selling an 8-core CPU for the same price, as the die would be similar in size (the IGP is roughly 50% of the die).
The uncore, memory bus, and the on-die interconnect tend to be upgraded at the higher core counts. The IGP has fewer stops on the ring bus than there would be if there were four cores in its place. The EP and EX Xeons have to scale the infrastructure to make the extra cores worthwhile.

Most game developers have come to the same conclusion. Automatically scaling (core count) task/job-based systems with small work items, parallel loops ("kernels") and entity/component data abstraction are both the way to extract the highest possible performance out of current multicore CPUs and to offer the best program maintenance (full separation of data and transforms and different entity properties, additive / existence-based programming).
This is the case for games at least.
For HPC, the real domain of Gustafson's Law, the systems employed by game engines are less than ideal.
The costs of scaling system concurrency and working set size are masked because the consoles are so far below that level of scale, much as a goldfish may think it is a better master of its fishbowl than a great white is of the Pacific.
 
Interesting. You seem to be suggesting that, despite the apparent difference between task-oriented CPU cores and vectorized GPUs, CPUs are also better utilized if the code they're running is structured as if it were running on GPUs. Do you think this is because so much of the destination of the CPU output is for the GPU (ie: is this a result of your example -- you go on to describe moving the process of feeding GPUs onto the GPU), or is this fundamental to the nature of (eg) caching, lock-contention, etc?
Predictable code is easier for any machine to run. Branches are bad for both architectures. GPUs choke on incoherent branches (threads do not follow the same path within a wave/warp) and CPUs choke on mispredicted branches, and neither likes running cold code (instruction cache stall). Haswell can sustain 4 instructions per cycle (retire rate), thus a branch miss that causes a 14-cycle stall could cost up to 56 instructions. In the worst case, among these 56 instructions there are 28 AVX2 (256-bit) fused multiply-adds (Haswell can execute two AVX2 FMAs per cycle). Thus the mispredict can cost up to 448 single-precision flops (28 FMAs x 8 lanes x 2 flops each), plus 28 other instructions such as 256-bit AVX2 loads/stores. People are saying that branchy code is bad for GPUs, but it is also bad for CPUs. You can most often transform your code in a way that you don't need branches at all. This is often called "sort by branch", because it means that the processor is executing elements with different branch conditions separately (all elements following a certain path first, and then the others following the other path, etc). This solves both the CPU's branch misprediction penalties and the GPU's incoherent branch penalties. Obviously not all algorithms can be transformed like this (but you don't need to use algorithms that don't behave well on modern processors).
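A minimal sketch of the "sort by branch" idea in C++ (a toy particle example; std::partition stands in for whatever bucketing a real engine would use):

    #include <algorithm>
    #include <vector>

    struct Particle { float life, x, vx; };

    void update(std::vector<Particle>& particles, float dt) {
        // One well-predicted partition pass groups all 'alive' elements together,
        // instead of taking an unpredictable branch per element in the hot loop.
        auto deadBegin = std::partition(particles.begin(), particles.end(),
                                        [](const Particle& p) { return p.life > 0.0f; });

        // Branch-free hot loop over the alive partition only.
        for (auto it = particles.begin(); it != deadBegin; ++it) {
            it->x    += it->vx * dt;
            it->life -= dt;
        }
        // The dead partition is handled (recycled, compacted, ...) separately.
    }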

The other important thing is memory access patterns. Modern CPUs and GPUs need good cache locality to achieve acceptable performance. If your code is not cache-optimized, all the other optimizations are pretty much useless. This is the most important thing when optimizing either for the CPU or the GPU. Some people still think that the CPU is good at traversing pointer soups (long dependency chains with random memory accesses), such as linked lists and complex tree structures. This is simply not true. Processing linear arrays with vectorized code almost always beats complex structures at element counts that matter in interactive applications (such as games). For example DICE got huge performance gains when they ditched their old octree-based viewport culling and changed to a simpler vectorized one with flat structures (http://dice.se/publications/culling-the-battlefield-data-oriented-design-in-practice/). You might want to say that the CPU's advantage is that it CAN operate on these complex structures. However, most often you don't want to use pointer soup structures at all. Generating complex structures in a massively parallel way (on the GPU) is often a difficult problem; traversing is often easy (as you can do branchless traversals with CMOV-style code if you have a depth guarantee, such as log(n)).
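As an example of the "branchless traversal with a depth guarantee" point, here is a sketch of a lower-bound search over a flat sorted array, written so the data-dependent choice is a conditional select rather than a branch (a generic illustration, not the DICE code):

    #include <cstddef>

    // Branchless lower_bound: exactly ceil(log2(n)) iterations, and the only
    // data-dependent operation is a conditional select of the base pointer,
    // which compilers typically lower to a CMOV instead of a branch.
    std::size_t lower_bound_branchless(const float* a, std::size_t n, float key) {
        if (n == 0) return 0;
        const float* base = a;
        while (n > 1) {
            std::size_t half = n / 2;
            base = (base[half - 1] < key) ? base + half : base;  // CMOV-friendly select
            n -= half;
        }
        return (base - a) + (*base < key ? 1 : 0);
    }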

Lock contention is an important thing to consider, but I feel that many programmers are just doing it wrong. I have seen too much code with ad-hoc manual lock primitives added to make something thread safe. These systems tend to bloat extensively (too many synchronization points) and cause all kinds of problems in the end. I feel that a task-based system that lets you describe resources (and read / modify accesses to them) is strictly better. Your scheduler doesn't schedule a task until it can ensure that all the necessary resources are available. This way you don't have idle waiting or preemption (each context switch causes both the L1 instruction and data caches to be completely thrashed).
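A toy sketch of that "declare resources up front" idea (all names hypothetical; a real scheduler would add dependency counters, worker threads and lock-free queues):

    #include <functional>
    #include <string>
    #include <vector>

    enum class Access { Read, Write };

    struct ResourceUse { std::string resource; Access access; };

    struct Task {
        std::function<void()> work;
        std::vector<ResourceUse> uses;   // declared up front instead of ad-hoc locks
    };

    // Two tasks conflict if they touch the same resource and at least one writes.
    // The scheduler only dispatches a task once it conflicts with nothing in flight,
    // so tasks never block mid-execution and never get preempted.
    bool conflicts(const Task& a, const Task& b) {
        for (const auto& ua : a.uses)
            for (const auto& ub : b.uses)
                if (ua.resource == ub.resource &&
                    (ua.access == Access::Write || ub.access == Access::Write))
                    return true;
        return false;
    }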

I am very pleased to see hardware transactional memory (Intel's TSX extensions) entering consumer products. This makes it much easier to implement efficient parallel data structures and task schedulers. Unfortunately not even all the high-end Haswell processors support it (the 4770K doesn't!), so I don't see wide adoption any time soon. I hope Intel (and the other GPU manufacturers) are listening now: transactional memory would be perfect for GPUs. It would make many parallel algorithms much simpler to implement and would allow completely new kinds of algorithms. Even if the transaction size were limited to two cache lines (2 x 64 bytes), it would be enough to change GPU computing (a one-cache-line transaction is obviously not enough for the most interesting cases).
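For what it's worth, here's roughly what using Intel's RTM interface looks like in practice (a toy example; a transaction can always abort, so the lock fallback is mandatory, and a production version would also have the transaction check that the fallback lock is free):

    #include <immintrin.h>   // _xbegin/_xend (RTM); build with -mrtm on a TSX-capable CPU
    #include <mutex>

    std::mutex fallback_lock;

    // Append a value to a shared array transactionally, falling back to a lock
    // when the hardware transaction aborts.
    void push_value(int* data, int* count, int value) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            data[*count] = value;   // small write set, well under the L1 limit
            *count += 1;
            _xend();                // commits both writes atomically
        } else {
            std::lock_guard<std::mutex> guard(fallback_lock);
            data[*count] = value;
            *count += 1;
        }
    }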
 
I think this is debatable for GCN, at least for now.
The scalar unit's cache is read-only, shared by up to 4 other CUs, and scalar reads are the one kind of memory traffic that cannot be counted on to always return in-order, even with respect to other scalar reads. From Southern to Sea Islands, the only solution deemed useful by the ISA docs is that a wavefront stall if a scalar read is in-flight.
The scalar cache is currently not described as taking part in the protocol that keeps the vector L1 coherent.
That seems to be fine for what the CUs are currently tasked with, but it's a bit short of what a core normally can do.

Perhaps the idea of giving the scalar unit a scalar subset of the vector ISA, espoused in the following research paper, will find itself implemented someday:
http://people.engr.ncsu.edu/hzhou/ipdps14.pdf
At least some of the names in the paper give it more credence than other proposals.
Yes, the GCN scalar unit cannot write to memory directly (it only has access to a read-only cache), and it doesn't have coherence with the rest of the system. You can of course sidestep this by splatting the scalar to a vector register and writing it out using the vector pipeline.

What I do like about the scalar units in modern GPUs is that the scalar unit helps in reducing unnecessary repetitive work. The compiler can detect some cases where the calculation sources are invariant across the whole wave/warp, and perform that calculation just once in the scalar unit. This is 32x/64x less work. NVIDIA has a paper about doing this automatically in the compiler: http://www.eecs.berkeley.edu/~yunsup/papers/scalarization-cgo2013-talk.pdf.

Unfortunately the compilers are not perfect at this yet, so if, for example, I divide my thread id by 32/64 and use that index to fetch data and later perform calculations, the compiler doesn't understand that all this data fetching and the calculations that follow could be done in the scalar unit (instead of being replicated 32/64 times across the whole wave/warp). I would personally prefer an extension to the shading languages that allows you to tell the compiler that some data is invariant over the warp/wave. However, making this portable across multiple GPU brands isn't straightforward (NVIDIA has 32-wide warps, AMD has 64-wide waves). Still, this provides such big gains (reduction in repetitive work) that it might change the way we program these massively parallel machines, and would eventually lead to more complex scalar units that resemble current CPU cores.
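Written as plain C++ standing in for a shader (laneCount plays the role of the 32/64 wave width; all names are illustrative), the redundancy and the scalarized version look like this:

    // What effectively happens today: every lane redoes the same work.
    void per_lane_naive(const int* tileData, int* out, int waveBase, int laneCount) {
        for (int lane = 0; lane < laneCount; ++lane) {
            int tile = (waveBase + lane) / laneCount;  // same result for every lane
                                                       // (waveBase is a multiple of laneCount)
            int tileInfo = tileData[tile];             // the same fetch, repeated laneCount times
            out[waveBase + lane] = tileInfo + lane;
        }
    }

    // What scalarization would do: hoist the wave-invariant part so it runs once
    // (on the scalar unit), leaving only the truly per-lane work vectorized.
    void per_lane_scalarized(const int* tileData, int* out, int waveBase, int laneCount) {
        int tile = waveBase / laneCount;   // invariant across the wave: computed once
        int tileInfo = tileData[tile];     // one scalar load instead of laneCount loads
        for (int lane = 0; lane < laneCount; ++lane)
            out[waveBase + lane] = tileInfo + lane;
    }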
 