"Good enough" has lead to integrated GPUs. So this is an argument against discrete GPUs, not one that will keep them around.
That all depends on the market, and what "good enough" means. The market of greatest significance is the one that's busy buying phones like candy right now. It's equally possible that we'll never get a fully unified processor, because no one will buy them in the quantities required to justify the investment. Consider:
http://www.marketwatch.com/story/tech-job-cuts-reflect-declining-desktop-computer-sales-2014-07-28
A look at which companies are cutting the most jobs shows that desktop computers may be going the way of the Polaroid camera.
Link-bait quote for sure (and look, it worked ;^/), but it's something to consider. Perhaps it is a pessimistic view of technological history, but at any given time, the "best" available product was rarely the most successful. You're asking for the "best" possible outcome -- the perfect integration of low latency and high throughput cores. It's not even clear to me that such a thing exists (the right balance is likely different depending on the task at hand), but even if it does, it seems unlikely that market forces are poised to deliver it to us. You'll recall that my entrance into this thread reflected a frustration with the pace of development, and a concern that the types of CPUs and GPUs that I need seem increasingly niche/expensive.
So, in my view, it's just as likely that "good enough" means the final integration never happens.
Of course, that isn't what this thread is supposed to be about -- it's supposed to be about whether it's a good idea or not.
But, I find the argument intellectually interesting, sorry!
I'm sorry but it's madness to think that in another decade things will still look roughly the same...
Sure, but that doesn't mean that I buy your argument as to what will change, or how, or even whether it's the problem that needs to be solved. Here's a completely different problem. We've got over a billion active smartphones on the market right now. There is no sign of a slowdown in shipments, and it's likely that the embedded cameras will be 4K capable, and even more frequently used. The public shows an almost unlimited appetite for recording their lives -- consider twitch.tv. A billion phones/gaming-machines/street-cameras taking 150 Mbps 4K video 24x7 isn't just a lot of data, it's half a petabyte of data per person per year. Total hard-drive sales capacity (137 million units in 1Q2014) is roughly three orders of magnitude too small to store it all. So we're hoping that people will record less than 0.1% of their lives. Maybe 4K60p is an overestimate of where we'll be in a few years (after all, how do we upload 150 Mbps over cellular?), but then we haven't accounted for data replication due to availability and redundancy and use of multiple social sites, etc. [and the NSA :> ].
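If you want to check my arithmetic, the back-of-the-envelope version looks like this (the ~1 TB average drive capacity is my assumption; the other numbers are the ones above):

    // Back-of-the-envelope check of the claim above. The ~1 TB average drive
    // capacity is my assumption; the other numbers are from the post.
    #include <cstdio>

    int main() {
        const double bits_per_sec    = 150e6;                  // 150 Mbps 4K stream
        const double bytes_per_year  = bits_per_sec / 8 * 3600 * 24 * 365;
        const double phones          = 1e9;
        const double demand_bytes    = bytes_per_year * phones;

        const double drives_per_year = 137e6 * 4;              // 137M units in 1Q2014
        const double avg_capacity    = 1e12;                   // assumed ~1 TB per drive
        const double supply_bytes    = drives_per_year * avg_capacity;

        std::printf("per person per year: %.2f PB\n", bytes_per_year / 1e15);
        std::printf("total demand:        %.0f ZB\n", demand_bytes / 1e21);
        std::printf("drive supply:        %.2f ZB\n", supply_bytes / 1e21);
        std::printf("shortfall factor:    %.0fx\n", demand_bytes / supply_bytes);
        return 0;
    }
    // Prints roughly 0.59 PB, 591 ZB, 0.55 ZB, and a ~1000x shortfall -- i.e.,
    // about three orders of magnitude.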
I'm not particularly interested in debating which of these issues is most pressing in the industry right now; I'm just trying to make the point that the technologies that we're interested in may not be the ones that drive market innovation. Maybe memristors come along to solve the storage problem, and that totally changes the way code execution happens. Maybe, instead of integrating memory into the CPU, memory starts getting smarter, because more people are worried about the issues of storage than execution. Maybe we're like blind troubadours arguing about the importance of NeXT's PostScript display, totally missing the incoming storm being brought by Mosaic.
We're trying to look 10 years ahead, and you're worried about looking 2 years back?
The "near-future" of AVX-1024 in desktop that you're considering is at least two release cycles after broadwell. I don't know that we'll get 8 cores by then -- I doubt it, but I'm willing to suspend my disbelief. We're talking at least 3 years from now, or in sum, 5 years of technological change. Am I worried that in your attempt to make a case for the near-term integration of these technologies that you're willing to use two examples that span a time-period of half as much? Yes I am.
FWIW, I think your arguments regarding the two current gaming consoles are more persuasive, though the distinction between average GPU and average CPU is negated when pricing is considered [even without considering that we're comparing a raw CPU vs. a GPU + 4GB of RAM + etc.].
My point was the theoretical possibility of having the same raw throughput as the discrete GPU without an excessive number of cores or unrealistic die area, while also fully retaining the qualities of serving as a CPU. I think that's quite phenomenal.
I agree, it could make for quite an interesting time for sure.
Also, when faced with the choice between a weak CPU plus an expensive discrete GPU, or a unified CPU with twice the cores for the same total price, the discrete GPU will really have to excel to be worth losing CPU power. So all Intel has to do is keep carving out the market from below, and the $1000 you once spent on a 'Titan' will one day go to a 'Xeon' with more cores than the average.
A scenario like that is almost exactly what concerns me. Some people will need the high-throughput cores. Some people will need the low-latency cores. There will be a tradeoff, and a bifurcation of the market. Even the sum total of this population is small, and likely to get smaller. Things are going to get expensive, my debating friend.
I share your impatience, but I really don't think the apparent stagnation in core count is due to running into any hardware walls. The importance of the software ecosystem cannot be overstated.
I agree, and I admit to it being a problem that I don't usually think of, because I've been writing multithreaded code since I learned how to code back in the 80s. The idea that people don't find it a natural style of coding often sneaks up and bites me. I had a talk with my previous manager only a few months ago about just this subject, and it became apparent that I just 'see' code differently.
For the best results you even have to delve into lock-free algorithms, which only a handful of developers on the planet truly master.
In my experience, a bigger issue is recognizing the locks that don't look like locks. If you've got a thread calling another thread synchronously, you have a lock, but most people don't see that. They teach locking pretty well at schools these days, but not so much multi-threaded programming....
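A trivial sketch of the kind of thing I mean -- no mutex in sight, but the caller is locked for the duration of the call:

    // Toy example of a lock that doesn't look like a lock: no mutex appears
    // anywhere in the caller, but the synchronous call into another thread
    // blocks it for the full duration of the work.
    #include <cstdio>
    #include <future>

    int expensive_lookup(int key) {    // runs on a worker thread
        return key * 2;                // pretend this hits a database
    }

    int main() {
        // Hand the work to another thread...
        std::future<int> result = std::async(std::launch::async, expensive_lookup, 42);

        // ...then immediately wait for the answer. get() blocks this thread
        // until the worker finishes -- functionally a lock, with no std::mutex
        // in sight.
        int value = result.get();

        std::printf("%d\n", value);
        return 0;
    }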
That's exactly what I said: "Solutions to this bandwidth hunger exist too, by adopting techniques from CPUs.
Yes, you did. I seem to recall the context being different, and the point under debate being different, and thought myself frightfully clever to use your own ideas, but honestly, I don't remember what the points in question were, and I'm too lazy to go back and figure it out.
It's just going to inch closer to a unified architecture, which is also desirable for many other reasons than bandwidth."
We agree. I have a sense that a perfectly reasonable alternative would be to pursue a purely stream-oriented programming model, but I don't see that happening. I think it's that whole seeing-things-differently issue again -- people like the security of discrete bits of code running over data, rather than vice versa.
I don't see much correlation there, aside perhaps from mispredictions leading to overproduction leading to small margins leading to less investment in new innovation?
This is the article sourced from the wiki:
http://www.techhive.com/article/2034175/adoption-of-ddr4-memory-facing-delays.html
What I got out of that was: desktop sales fall, memory suppliers go out of business, demand for DDR3 is actually high relative to the remaining supply, and there aren't enough people looking for DDR4.
This is what worries me about NVIDIA's Volta. I'm sure it can increase raw bandwidth, but it's a fairly radical departure from previous GDDR increments and it's only going to be required by the high-end parts aimed at a small market. So it will likely be expensive.
Total nitpick -- Pascal, you mean. I have to keep looking it up as well, I keep thinking 'Parker'. Gah....
But, yes, I agree, we're ultimately talking about expensive parts.
...but I'd pay good money to upgrade my desktop to something significantly more powerful.
Yeah, I'm still waiting as well. I need to use Windows because of the software I need to run, but I'm looking forward to Windows 8 in exactly the way I looked forward to Vista.... I mean, I could use the advances in SMB, and that's about all....
That's really the exception. There will always be workstations with expansion slots, probably even for some specialized discrete GPUs, but the rest of the world is moving towards all-in-one systems and laptops. Note that today's workstations have CPU sockets with lots more pins and lots more RAM bandwidth, but you mentioned 115X sockets, and that's what I responded to.
Yeah, no, I think my fear is exactly that people are moving to a world that doesn't need any of the things I want to play with. It's going to make my hobbies more expensive....
There's a difference. Crystalwell is a 128 MB L4 cache that sits on the package PCB, while Volta aims to put all of the RAM (several GB) next to the GPU.
Yes, agreed, I was being a little facile.
It's a big, desperate, radical move on the part of the discrete GPU that's bound to have some cost implications, while for the CPU, DDR4 still offers a direct increase in raw bandwidth.
I dunno. From what I can tell, the move to on-package memory is driven mainly by power considerations. GPUs have actually gotten less wide over time -- you don't see a lot of 512-bit buses on GPUs anymore (GK110 is 384-bit) -- so I don't think the motivation is bandwidth ... at least not entirely.
I also don't think it's going to be as expensive as you think. The reading I've done on DDR4 indicates that the memory array is less than a third of the area of the actual chip. Getting rid of all of the termination, impedance-matching, and power hardware is a huge net benefit in area and power. I don't know if GDDR5 suffers from the same issues. I'm also less sanguine about the improvements that DDR4 has on tap. The current crop of memory parts represents a move backwards in latency (from the current high-end DDR3). That makes you even more reliant on code optimized for data locality, when current OO decompositions often lead to exactly the opposite.
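The standard illustration of that last point, in case it helps -- nothing novel, just the array-of-structs vs. struct-of-arrays contrast:

    // Standard illustration, nothing novel: the "natural" OO record drags whole
    // objects through the cache even when only one field is needed, while a
    // layout organized around the data touches far less memory per element.
    #include <vector>

    struct Particle {                 // typical OO-style record, ~32 bytes
        float x, y, z;
        float vx, vy, vz;
        float mass, charge;
    };

    float sum_mass_aos(const std::vector<Particle>& p) {
        float total = 0.0f;
        for (const Particle& it : p)  // streams 32 bytes per element to read 4
            total += it.mass;
        return total;
    }

    struct Particles {                // data-oriented layout (struct of arrays)
        std::vector<float> x, y, z, vx, vy, vz, mass, charge;
    };

    float sum_mass_soa(const Particles& p) {
        float total = 0.0f;
        for (float m : p.mass)        // contiguous, cache- and prefetch-friendly
            total += m;
        return total;
    }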
It certainly hasn't happened so far. GPGPU is proving very messy for consumer application development, and the only thing that helps is closer hardware integration and fewer separations at the code/data level. Note that map-reduce in the general sense is extremely common in typical code: any loop with independent iterations, followed by some form of aggregating the results, is amenable to 'map' parallelization which subsequently requires a low latency 'reduce' execution to fight Amdahl's Law.
Yes, I agree, we haven't found a good programming paradigm yet. And yes, I use ~MRs in a bunch of my programming. I mean, really, Flume:
http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf -- aside from the impossible-to-parse, overly genericized Java code, it's essentially the same thing. The nice thing about that style is that it preserves the sequential construction of code execution that most programmers feel comfortable with, if only we could fix the syntax that makes so many actual flumes impossible-to-follow amalgams of '<' and '>'.... We also use another technology that is functional and designed for multi-threaded code, but which leaves the creation of the actual code execution sequence up to the runtime engine. It's nice that the computer can find the optimal execution order, but losing the sequencing is ... unsettling. I don't think we've described that in public yet, but I raise it to point out that we're still playing with different ways of expressing code execution. That's the fun bit!
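To make the map/reduce point concrete, here's the kind of loop we keep running into, recast by hand (deliberately naive chunking, just to show the shape):

    // Deliberately naive sketch: a loop with independent iterations recast as
    // an explicit map (parallel, per-chunk) followed by a reduce (the serial
    // tail that Amdahl's Law punishes). Chunk count and scheduling are arbitrary.
    #include <cstdio>
    #include <future>
    #include <vector>

    double expensive(double x) { return x * x; }   // independent per-element work

    int main() {
        std::vector<double> data(1'000'000, 1.5);
        const size_t chunks = 8;
        const size_t step = data.size() / chunks;

        // "Map": each chunk runs independently on its own thread.
        std::vector<std::future<double>> partials;
        for (size_t c = 0; c < chunks; ++c) {
            partials.push_back(std::async(std::launch::async, [&, c] {
                double sum = 0.0;
                for (size_t i = c * step; i < (c + 1) * step; ++i)
                    sum += expensive(data[i]);
                return sum;
            }));
        }

        // "Reduce": aggregate the partial results. This is the part you want
        // to be as low-latency as possible.
        double total = 0.0;
        for (auto& p : partials) total += p.get();

        std::printf("%f\n", total);
        return 0;
    }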
You have an utterly wrong idea about that. The latency and bandwidth limitations for communicating between heterogeneous components impose limitations on the sort of algorithms you can efficiently implement. Lots of great ideas where high throughput and low latency are closely intertwined are simply not feasible today.
Of course. But necessity is the mother of invention. Many more ideas get investigated while these things are in flux. Consider architectures that are very wide, with only limited serial execution. I've never seen an architecture with the throughput version of out-of-order execution. But wouldn't it be neat to have a streaming programming architecture where each branch became a different collection point for future SIMD execution? There are so many unexplored avenues, I'm not in a hurry to finish the quest.
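To make that branch-as-collection-point idea slightly less hand-wavy, here's the scalar strawman -- every name in it is invented, it's just the regrouping trick:

    // Scalar strawman of the idea -- every name here is invented. Instead of
    // letting lanes diverge at an 'if', route each work item to a per-branch
    // collection point, then run each bucket through a uniform, branch-free
    // loop that a compiler (or a SIMD runtime) can vectorize.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Ray { float t; };

    int main() {
        std::vector<Ray> rays(1024);
        for (size_t i = 0; i < rays.size(); ++i)
            rays[i].t = (i % 3 == 0) ? -1.0f : float(i);

        // The "branch": collect by outcome rather than executing divergent code.
        std::vector<Ray> hits, misses;
        for (const Ray& r : rays)
            (r.t >= 0.0f ? hits : misses).push_back(r);

        // Each bucket is now branch-free and SIMD-friendly.
        float hit_work = 0.0f;
        for (const Ray& r : hits) hit_work += r.t * 2.0f;   // "taken" path
        std::printf("hit work: %f, misses deferred: %zu\n", hit_work, misses.size());
        return 0;
    }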
Worse yet there are many variations so you have to aim down the middle and can't really achieve the best results on anything.
You're thinking like a tool user, not a tool maker. Tool makers want the messy bits -- it's way more fun to create architectures around (e.g.) Cell, but then, way less fun to build a game. So, it all depends on what you're in it for. (And yes, if you're trying to be compatible across all of those products, that is a complete nightmare.)
A bit of fun at OCaml's prompt
Do you think functional programming has its place? (e.g. Haskell, OCaml, F#, Erlang).
"has its place" is open to a lot of interpretation, so I'm going to try and be more precise, but I might wind up asking a different question. If we grade languages on their ability to efficiently execute code on future processors, does the functional programming model offer a compelling decomposition of current programming problems?
I think that depends. There is a challenge in representing parallel code execution in any serialized format, but I'm not sure that the functional model has, as one example, many benefits over a somewhat more advanced representation of 'const' than C has. (C captures only half of immutability -- there is no way to model a promise by the caller that the passed data will not be modified by the caller while the callee is executing. Such concurrent modification isn't possible in single-threaded code, but of course it is in multi-threaded situations.) But to be honest, I haven't played around a lot with it. Most of my functional programming is done in bastardized Java, which has its own issues. The baggage of immutability and the gratuitous data copying that often results is particularly troubling. That said, executed in an environment where there is no permanent memory, just streams of data flowing from one code site to another, it might be incredibly powerful. Your example works particularly well for streams, as it's all Map (no Reduce). I think there's a place for some functional concepts in a proper model of code execution for sure....
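To put the "half of immutability" complaint in concrete C++ terms:

    // The "half of immutability" point: const lets the callee promise it won't
    // write, but there is no way for the caller to promise it won't write
    // while the callee is still reading.
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Callee's promise, enforced by the language: "I will not modify v."
    double average(const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
    }

    int main() {
        std::vector<double> samples(1000, 1.0);

        // Caller's missing promise: nothing in the type system stops this
        // thread from mutating 'samples' while average() is reading it.
        std::thread t([&] { std::printf("%f\n", average(samples)); });
        samples[0] = 42.0;   // compiles fine, silently racy
        t.join();
        return 0;
    }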
So, yes, I think playing with different representations of code is important, but I'm concerned that doing so doesn't really address the underlying problem of data non-locality, which gets particularly egregious in OO code (whatever the significant benefits of OO modeling may be).
Tim Sweeney was calling for a layered boondoggle that includes a purely functional core (pictured on page 58, discussed from page 44 or 39 -- these are slides)
http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf
Wasn't this some really old idea concerning why discrete GPUs had to die or some such? "The reports of my death have been greatly exaggerated?"
I can't add much more to the discussion, but I'm glad I'm able to follow it so far.
Well, I'm glad my devil's advocacy is useful to someone other than myself :>
I think it suffices and is far more practical to have a functional EDSL within existing popular languages like C++. Examples of these are
Halide and
SystemC. And while these are pretty much entire languages in their own right, it's not unreasonable for something like a game engine, which is used by many game titles, to have its own EDSL. It also doesn't have to be an entire language: often you just need a way to express dataflows in a functional manner. Google's Cloud Dataflow is based on similar ideas: you build a pipeline with very simple-looking constructs, but it fully abstracts away the fact that it can run on millions of processors. You can also have an imperative EDSL but use it within a concurrent functional framework. This is what I aimed for with Reactor in SwiftShader, borrowing some ideas from
GRAMPS and
reactive programming.
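To give a feel for it, something with roughly this shape already buys you a lot -- and to be clear, this is a made-up toy, not Reactor's or Cloud Dataflow's actual interface:

    // Made-up toy, not Reactor's or Cloud Dataflow's actual API -- just the
    // flavor: you describe the pipeline; the framework is free to decide how
    // (and on how many cores) to actually run it. The loops below are
    // sequential stand-ins for whatever the runtime would really do.
    #include <utility>
    #include <vector>

    template <typename T>
    class Stream {
    public:
        explicit Stream(std::vector<T> data) : data_(std::move(data)) {}

        // "map": a per-element transform with no cross-element dependencies,
        // so the runtime is free to parallelize or vectorize it.
        template <typename F>
        auto map(F f) const -> Stream<decltype(f(std::declval<T>()))> {
            std::vector<decltype(f(std::declval<T>()))> out;
            out.reserve(data_.size());
            for (const T& x : data_) out.push_back(f(x));
            return Stream<decltype(f(std::declval<T>()))>(std::move(out));
        }

        // "reduce": the explicit aggregation point.
        template <typename F>
        T reduce(T init, F f) const {
            for (const T& x : data_) init = f(init, x);
            return init;
        }

    private:
        std::vector<T> data_;
    };

    // Usage reads sequentially even though nothing forces sequential execution:
    //   float energy = Stream<float>(samples)
    //                      .map([](float s) { return s * s; })
    //                      .reduce(0.0f, [](float a, float b) { return a + b; });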
Like this -- this was very interesting to me. It reminds me why the heck I'm still coding ;^/
Yes, I've heard that Halide makes it very easy to create highly performant image-manipulation code, I think even separately from execution on a GPU. It turns out that writing really efficient C++ code is tricky, and it often winds up being hard to maintain. I agree with Nick that an embedded language might work well. I think the problem is to model the problem space in such a way that an efficient implementation falls out of the representation. OO, functional, etc. are tools for achieving that representation, but I'd argue against making them a matter of religion.
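For anyone who hasn't seen it, the blur example from the Halide papers gives the flavor -- I'm reproducing it from memory, so treat the exact scheduling calls as approximate:

    // Roughly the 3x3 blur example from the Halide papers (from memory, so the
    // exact scheduling calls may be off): the algorithm says *what* to compute,
    // and the schedule below it says *how* -- tiling, vectorization,
    // parallelism -- without touching the algorithm.
    #include "Halide.h"
    using namespace Halide;

    Func make_blur(Func input) {
        Func blur_x("blur_x"), blur_y("blur_y");
        Var x("x"), y("y"), xi("xi"), yi("yi");

        // The algorithm: a separable 3x3 box blur.
        blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
        blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

        // The schedule: tile, vectorize, and parallelize -- all the "tricky,
        // hard to maintain" parts of hand-written C++ live here, separately.
        blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
        blur_x.compute_at(blur_y, x).vectorize(x, 8);

        return blur_y;
    }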