If ATI/AMD GPUs are VLIW, what is the acronym for NVIDIA's GPU architecture?

If the lanes cannot always behave like threads, then it is more appropriate to consider Nvidia's architecture SIMD, at least if you aren't in their marketing department.
Yes, some sanity! I consider SPMD (or if you want, "SPMD on SIMD" or something) a good compromise term too.
 
Yes, some sanity! I consider SPMD (or if you want, "SPMD on SIMD" or something) a good compromise term too.

I am not that familiar with CUDA or OpenCL, but being SPMD puts them in MPI territory. Are these video cards flexible enough to have an efficient MPI implementation?

edit: clarity
 
I consider AMD and Nvidia both to be many SIMD processors, with AMD having the further detail of instructions executing as 4- or 5-wide VLIW bundles.
 
They sit too far from an Ethernet/IB chip to have an "efficient" MPI implementation.

They need not communicate through a network. Shared memory or on-die buffers are enough.

They would need to be capable of running multiple processes, and the hardware needs to be flexible enough that a message-passing interface can be implemented.

The network would only be necessary for aggregating video cards, similar to what is done with SMPs.
 
SPMD does not require message passing; the two are typically associated but the term itself only implies that "Tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster" (from Wikipedia). That's precisely how the GPU programming models work. There's one program that is fed different input on different processors and can branch independently to execute different code based on that input (at some granularity determined by the implementation).
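
To make that concrete, here is a minimal sketch in OpenCL C (the kernel name and the clamping logic are made up purely for illustration): every work-item runs the same program, fetches its own element via get_global_id, and branches on it independently.

Code:
/* Minimal, illustrative OpenCL C kernel: one program, many work-items (SPMD).
 * Each work-item reads its own element and may take a different branch. */
__kernel void scale_or_clamp(__global const float *in,
                             __global float *out,
                             float threshold)
{
    size_t i = get_global_id(0);   /* different input per work-item */
    float x = in[i];
    if (x > threshold)             /* work-items may branch differently */
        out[i] = threshold;
    else
        out[i] = 2.0f * x;
}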

But yes, this is a distinction made for the GPU programming models. In terms of the hardware, they are pretty much pure SIMD (with a bit of VLIW on ATI).
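
As a rough picture of what that means for a branch like the one in the kernel above (assuming an 8-wide SIMD group purely for illustration; WIDTH and the function name are made up), both paths execute for the whole group and a per-lane mask decides which lanes commit results.

Code:
/* Rough C sketch of divergent-branch execution on a SIMD group.
 * WIDTH and the function name are illustrative, not real hardware/API names. */
#define WIDTH 8

static void simd_scale_or_clamp(const float *in, float *out,
                                float threshold, int base)
{
    int mask[WIDTH];

    /* Evaluate the branch condition for every lane in the group. */
    for (int lane = 0; lane < WIDTH; ++lane)
        mask[lane] = in[base + lane] > threshold;

    /* "If" path: executed for the whole group, only enabled lanes write. */
    for (int lane = 0; lane < WIDTH; ++lane)
        if (mask[lane])
            out[base + lane] = threshold;

    /* "Else" path: the remaining lanes write their result. */
    for (int lane = 0; lane < WIDTH; ++lane)
        if (!mask[lane])
            out[base + lane] = 2.0f * in[base + lane];
}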
 
Possibly to leverage MPI skills, and for easy expandability in case the code has to be scaled up in the future. I would not want to program in MPI if I could help it.

Not true.

MPI has been shown to scale better than threads and OpenMP, which depend on cache coherency, for many algorithms (on SMPs).

I have seen your position against cache coherency in these forums, so I am surprised you are relatively negative about MPI.
 
SPMD does not require message passing; the two are typically associated but the term itself only implies that "Tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster" (from Wikipedia). That's precisely how the GPU programming models work. There's one program that is fed different input on different processors and can branch independently to execute different code based on that input (at some granularity determined by the implementation).

But yes, this is a distinction made for the GPU programming models. In terms of the hardware, they are pretty much pure SIMD (with a bit of VLIW on ATI).

Thanks, I have not thought much about SPMD outside of MPI or message-passing-style programming models.
 
MPI has been shown to scale better than threads and OpenMP, which depend on cache coherency, for many algorithms (on SMPs).
I dunno about that. I dislike MPI for its programming model, the manual munging of bytes from A to B.
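
As a minimal sketch of that manual style (buffer size and tag made up), the bytes get packed on one rank and shipped explicitly to the other:

Code:
/* Minimal sketch of explicit message passing: rank 0 ships a buffer to rank 1.
 * Buffer size and tag are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 1024; ++i)
            buf[i] = i;
        MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 got %g ... %g\n", buf[0], buf[1023]);
    }

    MPI_Finalize();
    return 0;
}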

I have seen your position against cache coherency in these forums, so I am surprised you are relatively negative about MPI.
I am not against cache coherency per se. I am merely indifferent to it beyond the point of, say, Fermi's level of coherency.
 
Not true.

MPI has been shown to scale better than threads and OpenMP, which depend on cache coherency, for many algorithms (on SMPs).
If you have large resources which can be shared (read-only, but let's say they get updated regularly, for instance each timestep in a simulation, and each work item needs the complete resource), a shared-memory approach works *much* better than MPI duplicating those resources for every core and sending them around, completely redundantly, in each timestep. So for those algorithms message passing does not scale better; in fact it does much worse. Btw., this is also a class of algorithms often quite well suited to GPUs (you can spare most of the communication between your work items).
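
Roughly this pattern, as a sketch (table size and timestep count made up): every rank keeps a redundant copy of the table and the whole thing crosses the interconnect again each timestep, whereas a shared-memory version just lets all threads read the one copy that was updated.

Code:
/* Sketch of redundant per-rank copies of a large, mostly read-only resource.
 * TABLE_SIZE and TIMESTEPS are illustrative. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE (1 << 24)   /* ~16M doubles, duplicated on every rank */
#define TIMESTEPS  100

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank allocates its own full copy of the shared resource. */
    double *table = malloc(TABLE_SIZE * sizeof *table);
    if (rank == 0)
        memset(table, 0, TABLE_SIZE * sizeof *table);

    for (int step = 0; step < TIMESTEPS; ++step) {
        if (rank == 0) {
            /* ... rank 0 updates the table for this timestep ... */
        }
        /* The entire table is re-sent to every rank, every timestep. */
        MPI_Bcast(table, TABLE_SIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* ... each rank works through its items against the (read-only) table ... */
    }

    free(table);
    MPI_Finalize();
    return 0;
}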
 
If you have large resources which can be shared (read-only, but let's say they get updated regularly, for instance each timestep in a simulation, and each work item needs the complete resource), a shared-memory approach works *much* better than MPI duplicating those resources for every core and sending them around, completely redundantly, in each timestep. So for those algorithms message passing does not scale better; in fact it does much worse. Btw., this is also a class of algorithms often quite well suited to GPUs (you can spare most of the communication between your work items).

Agreed. I have seen applications where shared memory scales better. Personally, I have seen more that scale better on MPI, mainly in scientific computing. I can imagine other fields of computer science would see shared memory scaling better more often than MPI.

The issue is that it is not always possible to have a large-scale SMP, since you need dedicated hardware for hierarchical or directory-based cache coherency. If you need to scale past thousands of processes, you will most likely use a solution that fits the distributed-memory approach. So in that sense, it is not shared memory vs. distributed memory, it is what hardware you have available, and it turns out distributed memory is far more common on large-scale systems.
 
The issue is that it is not always possible to have a large-scale SMP, since you need dedicated hardware for hierarchical or directory-based cache coherency. If you need to scale past thousands of processes, you will most likely use a solution that fits the distributed-memory approach. So in that sense, it is not shared memory vs. distributed memory, it is what hardware you have available, and it turns out distributed memory is far more common on large-scale systems.
Yes, I know. Shared memory on a cluster with 100,000 nodes doesn't sound too feasible.

But taking a more general view, there is an inherent limit to the usefulness of massive cluster systems anyway. They only work well for a quite limited set of problems. At a certain size you are basically guaranteed to hit a massive wall with real problems; only very few selected problems scale to tens of thousands of nodes. A common usage scenario for such clusters is to partition them dynamically so that dozens or hundreds of users are using them in parallel, each one using only a small part of the full size (often something like 256 cores or even less). And for such sizes, shared memory isn't out of reach.
 
Yes, I know. Shared memory on a cluster with 100,000 nodes doesn't sound too feasible.

But taking a more general view, there is an inherent limit to the usefulness of massive cluster systems anyway. They only work well for a quite limited set of problems. At a certain size you are basically guaranteed to hit a massive wall with real problems; only very few selected problems scale to tens of thousands of nodes. A common usage scenario for such clusters is to partition them dynamically so that dozens or hundreds of users are using them in parallel, each one using only a small part of the full size (often something like 256 cores or even less). And for such sizes, shared memory isn't out of reach.

I have worked with 512-core SMPs. My original point is that MPI is not used just because you are reusing personnel trained in MPI, or because you are too lazy to learn other programming models, but because even on SMPs, MPI scales better for many algorithms.

Add to that, the ones I have worked with personally, mostly PDE solvers, scale better on MPI than on OpenMP, even for small numbers of processes. Agreed, it is problem-dependent, typically on memory access patterns.
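
For what it's worth, a minimal sketch (array size made up) of the OpenMP style such codes get compared against: one Jacobi sweep over a shared array, where every thread reads its neighbours through the coherent caches. An MPI version would instead give each rank its own slab and exchange only the boundary points.

Code:
/* Minimal OpenMP sketch: one Jacobi sweep over a shared array.
 * N is illustrative; compile with OpenMP enabled (e.g. -fopenmp). */
#define N 1000000

static double u[N], u_new[N];

static void jacobi_sweep(void)
{
    #pragma omp parallel for
    for (int i = 1; i < N - 1; ++i)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
    /* An MPI version would give each rank a contiguous slab of u and
     * exchange only the halo points each sweep. */
}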
 