GF100 is only 16 SIMD clusters...
Uh... yes. My bad. I double typed "32".
If the lanes cannot always behave like threads, it is more appropriate to consider Nvidia's architecture to be SIMD if you aren't in their marketing department.
Yes, some sanity! I consider SPMD (or if you want, "SPMD on SIMD" or something) a good compromise term too.
I am not that familiar with CUDA or OpenCL, but being SPMD would put them in MPI territory. Are these video cards flexible enough to have an efficient MPI implementation?
I should think not. They sit too far from an Ethernet/IB chip to have an "efficient" MPI implementation.
They need not communicate through a network. Shared memory or on-die buffers are enough.
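To make that concrete, here is a minimal CUDA sketch (a made-up kernel, not anyone's actual code) of work items in a block exchanging data through on-chip shared memory and a barrier, with no network or message passing involved:

__global__ void neighbour_exchange(const float* in, float* out, int n)
{
    __shared__ float tile[256];                    // on-die buffer shared by the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x; // assumes a launch with 256-thread blocks

    if (i < n)
        tile[threadIdx.x] = in[i];                 // every work item publishes its value
    __syncthreads();                               // a barrier stands in for send/recv

    if (i < n) {
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        out[i] = 0.5f * (tile[threadIdx.x] + tile[left]);  // read a neighbour's value directly
    }
}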
Programming shared mem systems using MPI, I dunno who would want that.
Erm... someone with a cluster of shared-memory systems?
Possibly to leverage MPI skills, and for easy expandability in case the code has to be scaled up in the future. I would not want to program in MPI if I can help it.
Not sure if you are being serious. This is quite common.
SPMD does not require message passing; the two are typically associated but the term itself only implies that "Tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster" (from Wikipedia). That's precisely how the GPU programming models work. There's one program that is fed different input on different processors and can branch independently to execute different code based on that input (at some granularity determined by the implementation).
But yes, this is a distinction made for the GPU programming models. In terms of the hardware, they are pretty much pure SIMD (with a bit of VLIW on ATI).
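For what it's worth, a minimal CUDA kernel (hypothetical, purely to illustrate the "SPMD on SIMD" point): every lane runs the same program but branches on its own input, while the hardware still executes each warp in SIMD fashion and serialises divergent paths.

__global__ void spmd_example(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each lane gets its own data element
    if (i >= n)
        return;

    if (x[i] > 0.0f)
        y[i] = sqrtf(x[i]);   // lanes with positive input take this path
    else
        y[i] = -x[i];         // the others take this one; the warp serialises the two sides
}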
On many algorithms (on SMPs), MPI has been shown to scale better than threads and OpenMP, which depend on cache coherency. I have seen your position against cache coherency in these forums, so I am surprised you are relatively negative about MPI.
I dunno about that. I dislike MPI for its programming model, the manual munging of bytes from A to B. I am not against cache coherency per se; I am merely indifferent to it beyond the point of, say, Fermi's coherency.
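As a minimal illustration of that "byte munging" (a made-up host-side example, nothing more): with MPI you describe buffers, counts and types and explicitly ship them from rank A to rank B, instead of just dereferencing shared memory.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> buf(1024);
    if (rank == 0) {
        // rank A fills the buffer and ships it, element by element, to rank B
        MPI_Send(buf.data(), (int)buf.size(), MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf.data(), (int)buf.size(), MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}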
Not true. If you have large resources which can be shared (read-only, but let's say they get updated regularly, for instance each timestep in a simulation, and each work item needs the complete resource), a shared-memory approach works *much* better than MPI duplicating those resources for every core and sending them around, completely redundantly, each timestep. So for those algorithms message passing does not scale better; it does much worse, in fact. Btw., this is also a class of algorithms often well suited to GPUs (you can spare most of the communication between your work items).
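A rough sketch of the duplication described above (assumed setup, not real code from anywhere): a large read-only table that every work item needs gets broadcast to every rank each timestep under MPI, whereas in a shared-memory or GPU setup it would live once and simply be read in place.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int table_size = 1 << 24;          // ~16M doubles, one private copy per rank
    std::vector<double> table(table_size);

    for (int step = 0; step < 100; ++step) {
        if (rank == 0) {
            // rank 0 recomputes the table for this timestep (details omitted)
        }
        // every other rank then receives a full, redundant copy, every single step
        MPI_Bcast(table.data(), table_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // ... each rank now runs its work items against its private copy ...
    }

    MPI_Finalize();
    return 0;
}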
The issue is that it is not always possible to have a large-scale SMP, since you need dedicated hardware for hierarchical or directory-based cache coherency. If you need to scale past thousands of processes, you will most likely use a solution that fits the distributed-memory approach. So in that sense it is not shared memory vs. distributed memory, it is what hardware you have available, and it turns out distributed memory is far more common on large-scale systems.
Yes, I know. Shared memory on a cluster with 100,000 nodes doesn't sound too feasible.
But in a more general view, there is an inherent limit to the usefulness of massive cluster systems anyway. They only work well for a quite limited set of problems; at a certain size you are basically guaranteed to hit a massive wall with real problems, and only a select few problems scale to tens of thousands of nodes. A common usage scenario for such clusters is to partition them dynamically, so that dozens or hundreds of users run on them in parallel, each one using only a small part of the full size (often something like 256 cores or even less). And for such sizes, shared memory isn't out of reach.