GF100 is only 16 SIMD clusters...
Uh... yes. My bad. I double typed "32".
If the lanes cannot always behave like threads, it is more appropriate to consider Nvidia's architecture to be SIMD if you aren't in their marketing department.
Yes, some sanity! I consider SPMD (or if you want, "SPMD on SIMD" or something) a good compromise term too.
I am not that familiar with CUDA or OpenCL, but being SPMD would put them in MPI territory. Are these video cards flexible enough to have an efficient MPI implementation?
I should think not. They sit too far from an Ethernet/IB chip to have an "efficient" MPI implementation.
They need not communicate through a network. Shared memory or on-die buffers are enough.
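To make that concrete, here is a minimal CUDA sketch (a made-up kernel, not anyone's actual code) of work items in a block exchanging data through on-chip shared memory and a barrier, with no network or message passing involved:

__global__ void neighbour_exchange(const float* in, float* out, int n)
{
    __shared__ float tile[256];                    // on-die buffer shared by the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x; // assumes a launch with 256-thread blocks

    if (i < n)
        tile[threadIdx.x] = in[i];                 // every work item publishes its value
    __syncthreads();                               // a barrier stands in for send/recv

    if (i < n) {
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        out[i] = 0.5f * (tile[threadIdx.x] + tile[left]);  // read a neighbour's value directly
    }
}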
Programming shared mem systems using MPI, I dunno who would want that.
Erm... someone with a cluster of shared-memory systems?
Possibly to leverage MPI skills, and for easy expandability in case the code has to be scaled up in the future. I would not want to program in MPI if I can help it.
Not sure if you are being serious. This is quite common.
SPMD does not require message passing; the two are typically associated but the term itself only implies that "Tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster" (from Wikipedia). That's precisely how the GPU programming models work. There's one program that is fed different input on different processors and can branch independently to execute different code based on that input (at some granularity determined by the implementation).
But yes, this is a distinction made for the GPU programming models. In terms of the hardware, they are pretty much pure SIMD (with a bit of VLIW on ATI).
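For what it's worth, a minimal CUDA kernel (hypothetical, purely to illustrate the "SPMD on SIMD" point): every lane runs the same program but branches on its own input, while the hardware still executes each warp in SIMD fashion and serialises divergent paths.

__global__ void spmd_example(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each lane gets its own data element
    if (i >= n)
        return;

    if (x[i] > 0.0f)
        y[i] = sqrtf(x[i]);   // lanes with positive input take this path
    else
        y[i] = -x[i];         // the others take this one; the warp serialises the two sides
}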
On many algorithms (on SMPs), MPI has been shown to scale better than threads and OpenMP, which depend on cache coherency. I have seen your position against cache coherency in these forums, so I am surprised you are relatively negative about MPI.
I dunno about that. I dislike MPI for its programming model, the manual munging of bytes from A to B. I am not against cache coherency per se; I am merely indifferent to it beyond the point of, say, Fermi's coherency.
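As a minimal illustration of that "byte munging" (a made-up host-side example, nothing more): with MPI you describe buffers, counts and types and explicitly ship them from rank A to rank B, instead of just dereferencing shared memory.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> buf(1024);
    if (rank == 0) {
        // rank A fills the buffer and ships it, element by element, to rank B
        MPI_Send(buf.data(), (int)buf.size(), MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf.data(), (int)buf.size(), MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}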
Not true. If you have large resources which can be shared (read-only, but let's say they get updated regularly, for instance each timestep in a simulation, and each work item needs the complete resource), a shared-memory approach works *much* better than MPI duplicating those resources for every core and sending them around, completely redundantly, each timestep. So for those algorithms message passing does not scale better; it does much worse, in fact. Btw., this is also a class of algorithms often well suited to GPUs (you can spare most of the communication between your work items).
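A rough sketch of the duplication described above (assumed setup, not real code from anywhere): a large read-only table that every work item needs gets broadcast to every rank each timestep under MPI, whereas in a shared-memory or GPU setup it would live once and simply be read in place.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int table_size = 1 << 24;          // ~16M doubles, one private copy per rank
    std::vector<double> table(table_size);

    for (int step = 0; step < 100; ++step) {
        if (rank == 0) {
            // rank 0 recomputes the table for this timestep (details omitted)
        }
        // every other rank then receives a full, redundant copy, every single step
        MPI_Bcast(table.data(), table_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // ... each rank now runs its work items against its private copy ...
    }

    MPI_Finalize();
    return 0;
}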
The issue is that it is not always possible to have a large-scale SMP, since you need dedicated hardware for hierarchical or directory-based cache coherency. If you need to scale past thousands of processes, you will most likely use a solution that fits the distributed-memory approach. So in that sense it is not shared memory vs. distributed memory, it is what hardware you have available, and it turns out distributed memory is far more common on large-scale systems.
Yes, I know. Shared memory on a cluster with 100,000 nodes doesn't sound too feasible.
But in a more general view, there is an inherent limit to the usefulness of massive cluster systems anyway. They only work well for a quite limited set of problems; at a certain size you are basically guaranteed to hit a massive wall with real problems, and only a select few problems scale to tens of thousands of nodes. A common usage scenario for such clusters is to partition them dynamically, so that dozens or hundreds of users run on them in parallel, each one using only a small part of the full size (often something like 256 cores or even less). And for such sizes, shared memory isn't out of reach.