For the sake of GPGPU, is it time for an AGP-style interface again?

It is often said[who?] that PCIe latency in CPU-GPU communication is a hindrance for GPGPU applications. Would a more direct communications channel between CPU and GPU therefore be beneficial, or is PCIe about as efficient as we can expect an expansion slot to be, keeping in mind the physical distances involved and the electrical effects (impedance of long traces, connectors and so on)?
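To put the latency side of that in concrete terms, here is a minimal sketch of the usual ping-pong measurement (CUDA runtime API; the tiny buffer size and iteration count are arbitrary choices for illustration, not taken from any particular benchmark):

```cpp
// Average host<->device round-trip time for a tiny synchronous transfer over PCIe.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const size_t bytes = 4;      // deliberately tiny: we want to see latency, not bandwidth
    const int    iters = 10000;

    char *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost((void**)&h_buf, bytes);  // pinned host memory avoids an extra staging copy
    cudaMalloc((void**)&d_buf, bytes);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // synchronous copies:
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // each one waits for completion
    }
    auto end = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(end - start).count() / iters;
    std::printf("average round trip: %.1f microseconds\n", us);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```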

While it would be nice to hope that one day we'll see expansion-slot-quality GPUs integrated into our CPUs, I don't think the PC market is ready for >300 W CPUs (and probably never will be), much less multiple CPU sockets of that caliber on a mobo...
 
Another dedicated interface -- hell no!

Want low-latency communication? Just integrate the components. There's no [easy] way around the physical limits. And it will save some power, too.

By the way, the PCIe spec has a reserved x32 mode for the bandwidth freaks, though no implementation exists yet. And the new signal encoding introduced in v3.0 has already shaved off a good amount of latency and packet overhead.
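For a rough sense of what the encoding change alone buys (protocol and packet overhead not included), the effective numbers for a x16 link work out roughly as follows:

```cpp
// Per-direction raw bandwidth of a x16 link, counting only the line-code overhead.
constexpr double lanes = 16;
constexpr double gen2_gbs = lanes * 5.0 * (8.0 / 10.0)    / 8.0;  // 5 GT/s, 8b/10b    ->  8.0 GB/s
constexpr double gen3_gbs = lanes * 8.0 * (128.0 / 130.0) / 8.0;  // 8 GT/s, 128b/130b -> ~15.75 GB/s
```

So v3.0 nearly doubles the usable bandwidth while only raising the signalling rate by 60%, because 128b/130b wastes ~1.5% of the bits where 8b/10b wasted 20%.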
 
Want low-latency communication? Just integrate the components.
Yeah we're there already, and the result is we get high-performance CPUs with a very meh GPU (Intel) or very meh CPUs with a meh GPU (AMD console parts), when compared with high-end discrete multi-GPUs. ...And of course, no multi-sockets either unless you buy server mobos with server CPUs, which don't have integrated graphics at all.

There's no [easy] way around the physical limits. And it will save some power, too.
...Optronics...? :)

And the new signal encoding introduced in v3.0 has already shaved off a good amount of latency and packet overhead.
That's interesting. Do you have any numbers, benchmarks...?
 
I don't think the low latency connection between the CPU and GPU on APUs has been fully, or even close to fully exploited yet.
 
Well, call me doom-and-gloom, but when I see Intel speaking of a 3 TFLOPS (DP) socketable chip with crazy power efficiency, I wonder if a custom interface is going to cut it for GPGPU.
Though those numbers are crazy, on the verge of too good to be true (and it is just a roadmap).
 
Well, call me doom-and-gloom, but when I see Intel speaking of a 3 TFLOPS (DP) socketable chip with crazy power efficiency, I wonder if a custom interface is going to cut it for GPGPU.
Though those numbers are crazy, on the verge of too good to be true (and it is just a roadmap).

I believe you're speaking about Knights Landing... well, it's not for today... mostly 2015 and beyond...
 
I believe you're speaking about Knights Landing... well, it's not for today... mostly 2015 and beyond...
Not for today indeed; some earlier news said sometime in 2014, and now we have a roadmap with a vague timeline that says 2015.
Anyway, it's not as if a hypothetical new AGP-type interface could spawn tomorrow either: a lot of actors would have to have their say, so it would be a time-consuming process.
 
It is often said[who?] that PCIe latency in CPU-GPU communication is a hindrance for GPGPU applications.
If the rumors are true, then Skylake will feature 512-bit SIMD units. At 14 nm, octa-cores will be mainstream, so we're looking at 2 TFLOPS worth of fabulous homogeneous computing power. Why would any developer invest their time into dealing with the problems of GPGPU computing instead with little chance of widespread success?
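As a back-of-the-envelope check of where such a figure could come from (the two FMA ports per core and the ~4 GHz clock are assumptions for illustration, not confirmed Skylake specs):

```cpp
// Hypothetical peak single-precision throughput of an 8-core CPU with 512-bit SIMD.
constexpr double cores         = 8;
constexpr double fma_ports     = 2;         // assumed, per core
constexpr double sp_lanes      = 512 / 32;  // 16 floats per 512-bit register
constexpr double flops_per_fma = 2;         // multiply + add
constexpr double clock_ghz     = 4.0;       // assumed
constexpr double peak_gflops   = cores * fma_ports * sp_lanes * flops_per_fma * clock_ghz;
// = 8 * 2 * 16 * 2 * 4.0 = 2048 GFLOPS, i.e. roughly 2 TFLOPS single precision
```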

Computing power is increasing much faster than the interface can keep up with if you need to communicate back and forth. So why bother with a new interface? It is only going to drag out the death of GPGPU by a couple more years. Graphics is surviving, for now, because it's mostly one-way communication.

The era of unified computing is upon us. Don't try to live in the past.
 
Wasn't it AVX that was supposed to kill discrete graphics? Well, AVX is with us and discrete graphics isn't dead; not even pining, you know, for the fjords...
 
At 14 nm, octa-cores will be mainstream, so we're looking at 2 TFLOPS worth of fabulous homogeneous computing power. Why would any developer invest their time into dealing with the problems of GPGPU computing instead with little chance of widespread success?
Uhm, ok, so a CPU releasing - hopefully - in what, 2015 or something, will do two TF. We have GPUs right NOW at four and a half. What do you think a GPU will do in '15; six, eight? More?

Regardless, it'll be enough to squish any CPU available on the market right then, which in and of itself makes it worth targeting by developers. I know you're a big software/microprocessor fan, but discrete graphics processors aren't going to go away just because 8-core chips become mainstream.

Also, will CPU memory interfaces be able to keep up with 8-core, 512-bit SIMD units...? And in any case, high-end GPUs already have several times the bandwidth of any CPU releasing in the next few years.
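To put rough numbers on that question (the bandwidth and FLOPS figures below are ballpark assumptions for illustration, not measurements):

```cpp
// Bytes of memory bandwidth available per FLOP, CPU vs. discrete GPU (assumed figures).
constexpr double cpu_bw_gbs  = 38.4;   // dual-channel DDR4-2400 (assumed)
constexpr double cpu_tflops  = 2.0;    // the hypothetical 8-core, 512-bit SIMD part above
constexpr double gpu_bw_gbs  = 288.0;  // GTX 780-class GDDR5 (assumed)
constexpr double gpu_tflops  = 4.0;    // ~4 TFLOPS single precision (assumed)

constexpr double cpu_bytes_per_flop = cpu_bw_gbs / (cpu_tflops * 1000.0);  // ~0.019
constexpr double gpu_bytes_per_flop = gpu_bw_gbs / (gpu_tflops * 1000.0);  // ~0.072
```

On those assumptions the GPU has roughly 3-4x the bandwidth per FLOP, which is what the CPU's cache hierarchy would have to make up for.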

The era of unified computing is upon us. Don't try to live in the past.
With all due respect, but...bahahah! :)
 
Well, you have to give him credit: even if nothing much is actually moving in the realtime graphics realm, in the HPC realm it's another matter. IBM still rules in power efficiency (I would also bet that their BlueGene/Q can cope with a greater variety of workloads than a heterogeneous set-up <= personal assumption), and Intel is promising massive improvements. I see a trend here: GPGPU is becoming more relevant in the personal realm than in HPC, for some of the reasons Nick has gone through many times.
 
I don't want to create waves here, but this thread is about interconnects, not Nick's 1024-bit wide AVX. :D
 
I remember Anand Shimpi dreaming of socketable GPUs back when AMD purchased ATI. It's a shame (if only because of my unfed curiosity) that dream never came true. I guess it didn't need to though... the industry changed. APUs make more sense.

I think the biggest hindrances to GPGPU are the programming complexity and the fact that it's still a relatively new concept. Both of those are being addressed: the first with things like HSA and C++ AMP, and the second just through the passing of time, as new CS majors saturate the workforce.

The need for GPGPU is still relatively tiny. It's an emerging concept, and given what we've already developed to solve latency issues (among other things), namely APUs, spending R&D on a low-latency connection to a discrete GPU seems wasteful at this point in time.

As far as "expansion slot quality" goes, it will certainly be done as time and technology progress. Stacked memory will take care of any bandwidth issues, and the continued shrinking of transistors will give us integrated GPUs that are as fast a 7970 or 780 just 4 years from now (or two shrinks). Of course, that technology will be applied to dedicated GPUs as well -- integrated will always lag dedicated. The need for dedicated GPUs will decline, however.

I think things will look very different even as early as Intel's Skylake. The implications of integration have only begun to be realized, and the eventual merging of storage and memory, all on package or on chip, leads me to believe that dedicated units will become a small niche.

May I point out that Intel already has a very large amount of die real estate to play with (Haswell is very small, in a relative sense), and that next year they'll double the density? 10 nm will bring 4 times the density by 2017 or 2018, 7 nm will bring 8x, and 5 nm 16x. By 2020, you could shrink Haswell GT2 to less than 25 mm². Sandy Bridge-class computing performance will be so cheap, it will effectively be free.
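A quick sanity check of that shrink claim, assuming the commonly quoted ~177 mm² figure for a quad-core Haswell GT2 die:

```cpp
// Scaling the claimed density gains onto a Haswell GT2-sized die.
constexpr double haswell_gt2_mm2 = 177.0;  // assumed: published quad-core GT2 die size at 22 nm
constexpr double density_gain    = 8.0;    // 2x (14 nm) * 2x (10 nm) * 2x (7 nm), per the post
constexpr double shrunk_mm2      = haswell_gt2_mm2 / density_gain;  // ~22 mm², i.e. under 25 mm²
```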

I guess my question is, in the face of the direction the industry is headed, is that kind of low latency connection relevant?
 
Uhm, ok, so a CPU releasing - hopefully - in what, 2015 or something, will do two TF. We have GPUs right NOW at four and a half. What do you think a GPU will do in '15; six, eight? More?
The part you're missing is that only a tiny minority buys such monster GPUs, while octa-cores will enter the mainstream in 2015 (at 14 nm, such a chip would measure only about 100 mm²). To put this into context, 2 TFLOPS would be more than the PlayStation 4's entire GPU. From a CPU!

Discrete GPUs are becoming a niche product, and they're focusing on graphics again instead of GPGPU. So application developers have no interest in targeting that. Intel has increased FLOPS more than fourfold between Westmere and Haswell, and could do it again with Skylake. That's the kind of ROI developers are looking for. Not just because it could speed up high throughput workloads 16-fold just five years later, but mainly because it takes relatively little effort to achieve that. You can have any programming language vectorized, without worrying about new threads to synchronize with or shifts in data locality. Your code loops just execute faster, period.
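To illustrate that last point with a generic loop (not code from any particular application), this is the kind of thing an auto-vectorizer can widen to whatever SIMD width the target has, with no device buffers, no API calls and no extra threads to manage:

```cpp
// Compiled with e.g. g++ -O3 -march=native, this loop gets vectorized as-is;
// the source does not change when the hardware's vectors get wider.
void saxpy(float a, const float* __restrict x, float* __restrict y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```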

Trying to target NVIDIA, AMD and Intel GPUs instead is a nightmare and you're only getting a fraction of the peak performance due to the heterogeneous synchronization and data transfer overhead, the GPU's poor handling of complex workloads, and the many pitfalls of dealing with various bottlenecks among various architectures.
Regardless, it'll be enough to squish any CPU available on the market right then, which in and of itself makes it worth targeting by developers. I know you're a big software/microprocessor fan, but discrete graphics processors aren't going to go away just because 8-core chips become mainstream.
I'm not saying they're going away altogether any time soon. I'm saying GPGPU is going the way of the dodo in the consumer market. Anything that involves sending data back and forth between the CPU and a discrete GPU is doomed to fail due to the bandwidth wall. Integrated GPUs suffer less from that (but don't eliminate it), but they're far weaker to begin with and they still suffer from all the cumbersome heterogeneous programming issues.

Homogeneous computing has a much brighter future, with CPUs getting better at high throughput every generation, while offering a straightforward and well-behaved programming model.
Also, will CPU memory interfaces be able to keep up with 8-core, 512-bit SIMD units...? And in any case, high-end GPUs already have several times the bandwidth of any CPU releasing in the next few years.
You're looking at it all wrong. GPUs are wasteful with bandwidth and therefore they need a lot. This is not a good thing! Because they run thousands of threads, lots of on-die storage is required just to hold thread contexts. All these threads have to share tiny caches, so the miss rate is high and they have to reach out for data further away from the execution units. This costs a lot of power.

This problem is only getting worse. It's not the ALUs that consume the most power, it's getting data into them. And power is the limiting factor in scaling performance. CPUs are not nearly as affected by this. Haswell quadrupled FLOPS per cycle over Westmere, while increasing clock frequency and actually reducing power consumption. So we're seeing a stark convergence in FLOPS/Watt.
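For reference, the per-core arithmetic behind that "quadrupled" figure (single precision, using the commonly cited execution-port counts for the two microarchitectures):

```cpp
// Peak single-precision FLOPs per clock, per core.
constexpr int westmere_flops_per_clk = 4 + 4;      // one 128-bit SSE add port + one 128-bit mul port
constexpr int haswell_flops_per_clk  = 2 * 8 * 2;  // two 256-bit FMA ports * 8 lanes * 2 ops
static_assert(haswell_flops_per_clk == 4 * westmere_flops_per_clk, "4x FLOPS per cycle");
```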

And again, the discrete GPU market is in decline. Integrated GPUs have to share the same bandwidth as the CPU. DDR4 and on-package or on-die DRAM can increase that bandwidth, but incur a cost that is delaying them from widespread use. And because GPUs need higher bandwidth per FLOP than the CPU cores, things are evolving in favor of the CPU cores.
 
I don't think GPGPU is going the way of the dodo. To me, it seems like it's evolving. If anything, GPGPU has had a pretty big upsurge in adoption since 2010. I think the eventual idea is that you won't have to code specifically for the GPU -- the compiler will take care of it.
 
I don't think GPGPU is going the way of the dodo. To me, it seems like it's evolving. If anything, GPGPU has had a pretty big upsurge in adoption since 2010. I think the eventual idea is that you won't have to code specifically for the GPU -- the compiler will take care of it.
Good luck with that. They said the same thing about multi-core. Heterogeneous computing is a lot worse for the compiler to handle.
 
They said the same about the vector extensions. :LOL:

Are you suggesting compilers haven't perfected auto-vectorization yet!?! :D

All kidding aside, this thread is about interconnects, folks, not instruction sets or GPGPU architectures!
 