If ATI/AMD GPUs are VLIW, what is NVIDIA's GPUs architecture acronym?

To OP: I'd say Nvidia's architecture is a set of wide SIMD cores. You could call it implicitly SIMD and that'd be approximately correct too.

In the GF104 and newer ones, it's actually a multi-issue core where each instruction is a SIMD one. You might call it SuperSIMD to be amusing :p The key thing to note, though, is that each instruction must be SIMD.

In contrast, I'd call ATI's architecture a SIMD core, or perhaps SVMD - Single VLIW, Multiple Data.

I think about CPU cores somewhat differently, though. They are usually out-of-order engines with SIMD, but the key distinction is that the SIMD is really optional. You don't have any branch granularity issues, etc.
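
If it helps, here's a rough C sketch of what I mean by "implicitly SIMD": you write scalar, per-lane code, and the machine runs a whole warp's worth of lanes in lockstep as one wide instruction. The 32-wide width and the names are purely illustrative, not NVIDIA's actual ISA.

/* Rough sketch of "implicit SIMD": scalar-looking per-lane code,
 * conceptually executed 32 lanes at a time (one warp-wide SIMD
 * instruction). Width and names are illustrative only. */
#include <stdio.h>

#define WARP_SIZE 32

/* One lane's view of the computation: plain scalar code. */
static void saxpy_lane(int lane, float a, const float *x, float *y)
{
    y[lane] = a * x[lane] + y[lane];
}

/* What the hardware effectively does: one SIMD instruction covering
 * all 32 lanes at once (modelled here as a loop). */
static void saxpy_warp(float a, const float *x, float *y)
{
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        saxpy_lane(lane, a, x, y);
}

int main(void)
{
    float x[WARP_SIZE], y[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; ++i) { x[i] = (float)i; y[i] = 1.0f; }
    saxpy_warp(2.0f, x, y);
    printf("y[5] = %f\n", y[5]); /* 2*5 + 1 = 11 */
    return 0;
}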


RE: Shared memory vs. MPI, there are a lot of reasons to use shared memory, e.g. if you don't know MPI or don't like it, or your algorithms aren't amenable to it. There are a few folks who use giant shared memory systems to write algorithms, which are later deployed via MPI. There are also quite a few that use MPI on shared memory (e.g. an Altix). Your message passing is really fast if it's through memory : )
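
To illustrate that last point, a minimal MPI sketch (nothing vendor-specific assumed): launch both ranks on the same shared-memory node, e.g. mpirun -np 2 ./ping, and the "message passing" ends up being a copy through memory inside the MPI library.

/* Minimal MPI ping between two ranks. Run both ranks on one
 * shared-memory node and the send/recv below is serviced through
 * memory rather than a network wire. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf = 42.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got %f\n", buf);
    }

    MPI_Finalize();
    return 0;
}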

DK
 
If you have large resources which can be shared (read-only, but let's say they get updated regularly, for instance each timestep in a simulation, and each work item needs the complete resource), a shared memory approach works *much* better than MPI duplicating those resources for every core and sending them around, completely redundantly, in each timestep. So for those algorithms message passing does not scale better; it does much worse, in fact. Btw., this is also a class of algorithms often quite well suited to GPUs (you can spare most of the communication between your work items).
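
A rough sketch of the redundancy being described (sizes and names made up for illustration): with MPI, the full resource gets re-broadcast to every rank each timestep, so every rank keeps its own copy, whereas threads on shared memory would just read the one copy that was updated in place.

/* Sketch of the per-timestep duplication: rank 0 updates the shared
 * resource, then the *entire* thing is broadcast to every rank again.
 * RESOURCE_SIZE is kept small here so the sketch actually runs; the
 * real cases in question are far larger. */
#include <mpi.h>
#include <stdlib.h>

#define RESOURCE_SIZE (1 << 20)   /* one copy per rank, every rank */

static void run_timesteps(int steps)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *resource = calloc(RESOURCE_SIZE, sizeof *resource);

    for (int t = 0; t < steps; ++t) {
        if (rank == 0)
            resource[0] = (double)t;   /* stand-in for the update */

        /* O(size * ranks) bytes moved every single timestep */
        MPI_Bcast(resource, RESOURCE_SIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* ... each rank's work items read the whole resource ... */
    }
    free(resource);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    run_timesteps(10);
    MPI_Finalize();
    return 0;
}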
Depends on the size of the dataset ... with a larger-than-cache dataset, cache coherency will make a bit of a mess here, causing more traffic and increasing latency. Now of course you can simply copy the dataset to each node to get the same performance as MPI, as long as your system has some kind of multicast DMA at least.

Cache coherency would be more clearly superior if each node needed a cache-fitting subset of the complete resource that isn't known until runtime ...
 
But aren't NVIDIA GPUs many SIMD units within a MIMD arrangement? And aren't those just architecture classes from Flynn's taxonomy?

http://en.wikipedia.org/wiki/Instruction_set#External_links

At the bottom of that page I linked there is a list of architectures:

CISC
EDGE
EPIC
MISC
OISC
RISC
VLIW
NISC
ZISC

For example, in the RISC category fall the Cell B.E. and ARM processors.
In the VLIW category fall ATI/AMD GPUs and the Itanium processors.
So, where do NVIDIA GPUs stand?
Maybe CISC?
NV has never told us about the internal instruction set of its GPUs, so it's hard to guess, I think.
 
Depends on the size of the dataset ... with a larger-than-cache dataset, cache coherency will make a bit of a mess here, causing more traffic and increasing latency. Now of course you can simply copy the dataset to each node to get the same performance as MPI, as long as your system has some kind of multicast DMA at least.

Cache coherency would be more clearly superior if each node needed a cache-fitting subset of the complete resource that isn't known until runtime ...

You realize that even using clusters you have coherency within each node, right?

Also, you need to think about what costs more power:
1. Sending data over ccHT, QPI
2. Sending data over ethernet

One is more efficient over short distances, the other is more efficient for long distances.

DK
 
10GBase-T cards have a little heatsink and fan :), or at least the earlier ones did.
But you might use some non-Ethernet networking when it comes to building a big cluster.

I find the high power use pretty surprising, and there's still no cheapo switch either (something with 16 gigabit and two 10GBase-T ports) that would let you just build an off-the-shelf rig with an SSD and a RAID 10 of the fastest hard drives and have an insanely fast network :oops:

This is off-topic, but when you run a network of crappy workstations (Pentium 4 w/ 512 MB) with /home mounted over 100 Mb Ethernet, your desktop performance is limited by the stupidly slow network.

I hope for big breakthroughs in network bandwidth, with lasers on silicon or some graphene thing if there's one down the road. Or what about a terabit link on a dual-GPU card :)
 
Isn't SIMD completely orthogonal to the question of RISC/CISC/VLIW instructions? Why are we lumping them together? Both AMD's and nVidia's architectures are SIMD yet their instruction formats are completely different.

I think Dave came closest with SuperSIMD but I don't understand the phrase "each instruction is a SIMD one". Isn't the "I" in SIMD already the instruction? SIMD to me is an execution strategy or processor architecture, not an instruction.
 
I think Dave came closest with SuperSIMD but I don't understand the phrase "each instruction is a SIMD one". Isn't the "I" in SIMD already the instruction? SIMD to me is an execution strategy or processor architecture, not an instruction.

Well, you can see it this way: there are SIMD instructions in x86, such as MMX/SSE/AVX. However, most instructions in x86 are not SIMD instructions.

On the other hand, practically every instruction in a (recent) NVIDIA GPU (or AMD GPU) is a SIMD instruction. There is no scalar instruction. Compare this to a typical vector computer, which generally has both SIMD instructions (for its vector unit) and scalar instructions (for its scalar unit).
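
As a concrete sketch of that mix in C with SSE intrinsics (my example, nothing from the posts above): the same add written as a scalar loop and as a SIMD loop. x86 gives you both instruction streams; the point is that on a recent GPU there is no scalar variant to fall back to.

/* Scalar vs. SIMD versions of the same add on x86. _mm_add_ps maps to
 * the ADDPS SIMD instruction; the scalar loop compiles to per-element
 * scalar adds (e.g. ADDSS) unless the compiler auto-vectorizes it. */
#include <stdio.h>
#include <xmmintrin.h>

void add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];                        /* one scalar add per element */
}

void add_sse(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* 4 lanes per instruction */
    }
    for (; i < n; ++i)                             /* scalar tail */
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[6] = {1, 2, 3, 4, 5, 6}, b[6] = {6, 5, 4, 3, 2, 1}, c[6];
    add_sse(a, b, c, 6);
    printf("c[0] = %f, c[5] = %f\n", c[0], c[5]);  /* both 7 */
    return 0;
}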
 
You realize that even using clusters you have coherency within each node, right?
Yeah, so? Snooping doesn't become messy with only a couple of cores.
Also, you need to think about what costs more power:
1. Sending data over ccHT, QPI
2. Sending data over ethernet
What costs more power: pushing data to a node using plain HT, or a node pulling the same data over ccHT, for a big system with snoop filters (like Larrabee 2 was going to use, for instance)?

There are scaling costs to snooping-based cache coherence ... less relevant when you have big iron at hundreds of bucks per core, with huge amounts of transistors being thrown at diminishing returns anyway ... more relevant with a CMP whose cores effectively cost a buck or less. For a GPU I'd rather see the transistors going into, say, 32- or 64-bit-ported banked caches ... and the GPU using less convenient but cheaper solutions to communication and coherence than the one-size-fits-all solution of snooping.
 