For the sake of GPGPU, is it time for an AGP-style interface again?

Using an interface that keeps the chip from functioning as a peer in the system's memory space is a barrier to usability.

This can be due to latency and bandwidth constraints related to the physical expansion bus.
There's also the higher-level abstraction of the expansion bus interface and the OS and driver layers you have to drill through.

Something more desirable would be a communication method that operates below as many layers as possible and is capable of operating as autonomously as possible, which goes back to something like the cache-coherent interconnects already used by CPUs.

This is particularly true if on-die, since you can discard all the design choices that trade off performance and latency for a plastic slot that must accept everything that might (sometimes poorly) plug into the controller. On-die, there's no need to cater to every off-spec implementation, either.

For short-distance and low-latency work, physical integration is a stronger bet. Optical interconnects seem to be the next step for traversing distances larger than the silicon package. The bandwidth numbers can be much higher, although latencies will be longer than on-die. That might sit in a cost range above consumer tech for some time, as optical seems to be finding its first uses in larger-scale server/HPC systems with storage, throughput, or system needs that physically cannot be satisfied by a consumer box.
 
It makes me think about what we were discussing a while ago about a dual-APU set-up (/a weird set-up for a next-gen console).
Actually, be it through HyperTransport or QPI, could it prove more convenient than PCI-Express or a hypothetical new type of interface?

By extension, could multi-socket mobos see a revival (assuming the bandwidth constraints strangling most APUs are alleviated one way or another)?
 
The coherent interconnects are lower overhead, but a purely on-die bus or shared memory hierarchy can go even lower.
Data still needs to be put into packets and sent through the external controllers in a manner consistent with protocol, and then there's the physical distances and the limits of the traces. On-package or interposer could simplify things or allow them to be driven faster, although there's still the conversion at each end of the process.

Transactions that rely on an internal bus or caches can be used for communication that is about as low-level as can be done by the system, and can leverage the pipelines already in place at whatever speed those units have.
 
I read that a couple of times; it is a finely crafted piece of English :)
So pretty much go as big as you can before considering any of those options, right?
 
For the consumer level, I think this is the direction it's going to take.

Increasing integration is going to get most of the benefits for consumers within the price range and form factors they want.
That's the case for tablets on down, and laptops are getting close to being satisfied outside of the gaming barely-a-laptop niche.
Integration is leaving less and less of the mid- and top-end market to discrete parts, so there's less motivation to do something special for the remainder.

The remaining enthusiasts or workstation users are increasingly expected to pay for the privilege. Even if a GPU used HT or QPI, it would proceed to pay the multisocket tax that lower-end servers are avoiding by just buying higher-core-count server chips.


I'm wondering if system design will revisit the BTX era of a high TDP socket area. It fell out of favor before insanely hot GPUs were the norm, and integration is leading to higher-power combined packages or SoCs.
 
Skylake, or whatever the codename is for Intel's generation after Broadwell, is reported to feature PCIe version 4, which allegedly doubles transmission rates to 16 Gbit/s per lane. ...So the GPU era is obviously not behind us just yet, and won't be for a number of years in the future either.
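For rough numbers (my own arithmetic, assuming PCIe 4.0 keeps 3.0's 128b/130b encoding):

    16 GT/s per lane × 128/130 ÷ 8 bits per byte ≈ 1.97 GB/s per lane per direction
    1.97 GB/s × 16 lanes ≈ 31.5 GB/s per direction for an x16 slot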
 
Well, I don't think anyone was saying dedicated GPUs would be completely gone by 2016... regardless, PCI-E is increasingly being used for storage, among other things.
 
The question of creating a new expansion bus is really one of what the market will do for the sake of two pieces of silicon actively sharing compute.
GPUs favor a particularly bandwidth-heavy and high-throughput workload that skews things a bit, but the GPGPU or just plain GPCPU nature of the devices is a question of lesser importance. Both types can be made to drive a board interface to its limit.

The answer in the client space follows from what special things vendors will do for the solution that already has such an interconnect (multisocket CPUs), and the answer is nothing.

Putting a CPU or two on the GPU sounds like a decent intermediate step that allows workloads to partition a little more favorably than the all-or-nothing approach taken today, and it doesn't rely on the outside world to do anything, which is what inertia favors.

AMD was once interested in making a coherent PCIe standard, but nothing has been stated recently. That would be an evolutionary step, if physical integration doesn't overtake it. An optical standard could take that and increase the distances and bandwidths permissible for some serious compute. It's something Intel has been demonstrating for server systems.
 
SPMD vectorization on the CPU is straightforward, just like on the GPU.

Don't try to turn the attention away from the problem of compiling for a heterogeneous architecture, with compiling for a homogeneous architecture. That's downright pitiful.
And what exactly adds so much to the complexity when running SPMD-style vectorized code on a heterogeneous architecture that it gets "downright pitiful"?
Btw., I think you misunderstood my statement about the difficulties of vectorizing compilers.
 
And what exactly adds so much to the complexity when running SPMD-style vectorized code on a heterogeneous architecture that it gets "downright pitiful"?
It's not about running it. It's about compiling it without "having to code specifically for the GPU". Compiling for a homogeneous architecture will always be much simpler in every regard. So referring to any vectorization issues on the CPU in response to my criticism about heterogeneous architectures, is like pointing at a speckle on someone's face when you're covered in dirt.
Btw., I think you misunderstood my statement about the difficulties of vectorizing compilers.
Then please make me understand it correctly.
 
It's not about running it. It's about compiling it without "having to code specifically for the GPU". Compiling for a homogeneous architecture will always be much simpler in every regard. So referring to any vectorization issues on the CPU in response to my criticism about heterogeneous architectures, is like pointing at a speckle on someone's face when you're covered in dirt.
Please consider your own post I answered to (emphasis is yours, not mine):
SPMD vectorization on the CPU is straightforward, just like on the GPU.

Don't try to turn the attention away from the problem of compiling for a heterogeneous architecture, with compiling for a homogeneous architecture. That's downright pitiful.
It was you who said that the vectorization issue is basically solved with the same ease (or not, depending on the problem), regardless of whether it is a wide homogeneous or a wide heterogeneous architecture. I just fail to see where the pitiful complications come in with SPMD-style vectorization. On Kaveri you can also use CPU pointers on the GPU and vice versa, in case you're thinking of that.
Then please make me understand it correctly.
You were claiming something is hard to solve for a compiler. I pointed out that a few years ago the same was claimed for vectorization. After that you claimed it is relatively easy, especially for stuff with sufficient data parallelism. I completely agree with you here. ;)

PS:
Somehow I feel the thread may have the wrong title for this topic.
 
It was you who said that the vectorization issue is basically solved with the same ease (or not, depending on the problem), regardless of whether it is a wide homogeneous or a wide heterogeneous architecture. I just fail to see where the pitiful complications come in with SPMD-style vectorization.
The SPMD vectorization, in isolation, is straightforward on either architecture. But that's just one small part of compiling for a heterogeneous architecture without "having to code specifically for the GPU". The compiler has to determine which parts of the code are best suited for executing on the GPU, it has to manage thread creation, synchronize tasks, balance workloads, rearrange data, etc. Basically the compiler is asked to make a heterogeneous architecture behave as if it's homogeneous. So obviously compiling for an actual homogeneous architecture with wide vectors is way easier. Hence, any reference to vectorization issues for a homogeneous architecture, is pitiful.

I don't know how many more ways I can phrase this. :)
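To make that overhead concrete, below is roughly the host-side ceremony that targeting the GPU implies today. A minimal OpenCL-style sketch in C++ (my own illustration; the kernel is made up and error handling is omitted), none of which exists for a plain vectorized loop on the CPU:

    #include <CL/cl.h>
    #include <vector>

    // Trivial kernel source, JIT-compiled by the driver at runtime.
    static const char* src =
        "__kernel void scale(__global float* d, float f) {"
        "    d[get_global_id(0)] *= f;"
        "}";

    int main() {
        std::vector<float> data(1 << 20, 1.0f);
        size_t bytes = data.size() * sizeof(float);
        size_t n = data.size();

        // 1. Discover a device and set up a context + queue (OS/driver layers).
        cl_platform_id plat;  clGetPlatformIDs(1, &plat, nullptr);
        cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

        // 2. Compile the kernel for whatever GPU happens to be present.
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
        clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "scale", nullptr);

        // 3. Rearrange/move the data into the device's address space.
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, nullptr);
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, data.data(), 0, nullptr, nullptr);

        // 4. Launch, synchronize, and copy the results back.
        float f = 2.0f;
        clSetKernelArg(k, 0, sizeof(buf), &buf);
        clSetKernelArg(k, 1, sizeof(f), &f);
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, data.data(), 0, nullptr, nullptr);
        return 0;
    }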
On Kaveri you can also use CPU pointers on the GPU and vice versa, in case you're thinking of that.
No, I'm not, but while we're on the subject I think it comes with an important caveat. It makes the data movements implicit, which is convenient, but that doesn't mean developers (or compilers for that matter) no longer have to think about it. Being unaware of it also makes you unaware of the overhead. The same is true for sharing pointers between homogeneous cores, but scalar and vector code can run on the same core.
You were claiming something is hard to solve for a compiler. I pointed out that a few years ago the same was claimed for vectorization. After that you claimed it is relatively easy, especially for stuff with sufficient data parallelism. I completely agree with you here. ;)
Just to be clear here, you agree multi-core is hard for the compiler?
You agree that (SPMD) vectorization is a solved problem?
Then why did you bring up vector extensions as if it's similar to multi-core?
Do you think multi-core compilation will soon be a solved problem?
 
The SPMD vectorization, in isolation, is straightforward on either architecture. But that's just one small part of compiling for a heterogeneous architecture without "having to code specifically for the GPU". The compiler has to determine which parts of the code are best suited for executing on the GPU, it has to manage thread creation, synchronize tasks, balance workloads, rearrange data, etc. Basically the compiler is asked to make a heterogeneous architecture behave as if it's homogeneous. So obviously compiling for an actual homogeneous architecture with wide vectors is way easier.
No. When using a SPMD scheme, the developer has done this already. He has explicitly expressed the data parallelism for certain parts of the code (which thereby identifies what can be offloaded to some throughput cores). That's my point. You linked to Intel's SPMD compiler. Look at what you have to do so that something gets executed in SPMD fashion. It's of course really close in concept to the usual GPGPU stuff (as it is also SPMD). That means compiling it for some throughput cores should be a cakewalk. In fact, for your linked example the SPMD code parts are also compiled with a different compiler (the SPMD one) and the main program calls the SPMD subroutines (which can also run asynchronously with some additional effort, as distributing them over multiple cores is not done automatically).
As I said, I fail to see where the additional complications come in, especially if we look at the case of a common (and coherent) address space (Kaveri).
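To illustrate with something close to the "simple" example that ships with that compiler (a sketch from memory, not a verbatim copy): the SPMD part lives in its own file and is compiled by the SPMD compiler, while the serial main program just calls the exported subroutine.

    // simple.ispc -- compiled separately, e.g.:
    //   ispc simple.ispc -o simple.o -h simple_ispc.h
    //
    // export void simple(uniform float vin[], uniform float vout[],
    //                    uniform int count) {
    //     foreach (i = 0 ... count) {
    //         vout[i] = vin[i] * vin[i];   // runs across the SIMD lanes
    //     }
    // }

    // main.cpp -- the serial program; link against simple.o
    #include "simple_ispc.h"   // header generated by ispc

    int main() {
        float vin[1024], vout[1024];
        for (int i = 0; i < 1024; ++i)
            vin[i] = (float)i;
        ispc::simple(vin, vout, 1024);   // call into the SPMD-compiled part
        return 0;
    }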
No, I'm not, but while we're on the subject I think it comes with an important caveat. It makes the data movements implicit, which is convenient, but that doesn't mean developers (or compilers for that matter) no longer have to think about it. Being unaware of it also makes you unaware of the overhead.
That's why it was good that you brought up Intel's SPMD compiler. Here, the developer automatically thinks about such stuff, even if he is not aware of it. ;)
Just to be clear here, you agree multi-core is hard for the compiler?
You agree that (SPMD) vectorization is a solved problem?
Then why did you bring up vector extensions as if it's similar to multi-core?
Do you think multi-core compilation will soon be a solved problem?
The vectorization comment was an ambivalent statement, if that was not clear so far. Vectorization is of course not solved in the general case. For some problems, like ones you have formulated in SPMD fashion, it gets trivial; for others, not so much.

And to close the circle to the first part of this post (and to give a hint for deciphering the middle part): an interesting side effect of SPMD is that data accesses tend to be somewhat separated between the serial and the SPMD parts, so the potential problems you mentioned above largely disappear. You really have to look for corner cases.
 
I'm wondering if system design will revisit the BTX era of a high TDP socket area. It fell out of favor before insanely hot GPUs were the norm, and integration is leading to higher-power combined packages or SoCs.

A high-TDP socket is current practice, with the two-socket 150W workstation Xeons.
The PC market adapted to high TDP in the Pentium 4 / Pentium D bread-oven era by simply using a big heatsink, an 80 to 120mm rear fan, and a PSU with a 120mm fan on almost all computers, rather than switching to BTX.
And you can have further cooling options.

BTX looked elegant with its nice straightforward airflow, but maybe it is inflexible (made for a single CPU and a single GPU?), and it was criticized for placing the DIMM slots too far away from the CPU: not a problem when the northbridge is a separate chip, but painful with an integrated memory controller.
 
AMD was once interested in making a coherent PCIe standard, but nothing has been stated recently. That would be an evolutionary step, if physical integration doesn't overtake it. An optical standard could take that and increase the distances and bandwidths permissible for some serious compute. It's something Intel has been demonstrating for server systems.

I came here to mention it. Isn't PCIe coherency pretty much what AMD is betting their ass on, in a way? I've assumed it's mandatory and is what allows HSA to work between a CPU/APU and a dedicated GPU.

I've assumed it's the motive for socket FM2+ motherboards as well, and that Kaveri will support it, as will the "Radeon 8970" or whatever the compatible big card is called.

On the Intel side, a socket-based 14nm Xeon Phi has already been announced, so we can say they intend to use QPI (maybe using the Haswell-EX socket, so you could have systems where you mix and match Haswell-EX and Xeon Phi at whim).
 
No. When using a SPMD scheme, the developer has done this already.
I wasn't implying that. The context was always "not having to code specifically for the GPU". My only point was that the CPU's vector extensions are at least as easy to use as the GPU, once the data parallel portions of the code have been identified and vectorization of them is deemed beneficial. That's still the hard part. But it's much harder with GPGPU. There's a lot more overhead to take into account, and it's far less predictable, especially when there are lots of GPU variations.
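For comparison, once a loop is known to be data parallel, asking the compiler for SPMD-style vectorization on the CPU is a one-line affair, with no device discovery, buffer management, or kernel launch involved (a sketch; compile with something like -fopenmp-simd):

    #include <cstddef>

    // Each iteration maps to a SIMD lane; the compiler picks the vector width.
    void scale(float* data, std::size_t n, float f) {
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i)
            data[i] *= f;
    }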
 
What is the additional overhead for APUs? There is no real need for more than a function call through the driver (which is handled by the compiler/some library) and, if you want it portable, some JIT compiling of the kernel (which can also be handled by the compiler). In the actual code it doesn't have to look much different from the case of using Intel's SPMD compiler.
A varying number of available cores has to be taken into consideration in either case when weighing whether it is worth it or not. One could actually argue that the (automatic) distribution over the available resources and running the tasks asynchronously may be even simpler than on a CPU, because the GPU has its own queuing, scheduling and work distribution hardware, causing no additional overhead. ;)
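OpenCL 2.0's shared virtual memory is one concrete shape this can take on such an APU. A sketch, assuming a device (like Kaveri) that reports fine-grain SVM support, and reusing the context ctx, queue q, and "scale" kernel k from the OpenCL sketch a few posts up:

    // Allocate memory that is valid under the same pointer on CPU and GPU.
    float* data = (float*)clSVMAlloc(ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, bytes, 0);

    for (size_t i = 0; i < n; ++i)
        data[i] = 1.0f;                      // CPU writes directly, no staging copy

    clSetKernelArgSVMPointer(k, 0, data);    // hand the very same pointer to the GPU
    float f = 2.0f;
    clSetKernelArg(k, 1, sizeof(f), &f);
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clFinish(q);                             // afterwards the CPU just reads data[] again
    clSVMFree(ctx, data);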
 