"UMA vs NUMA" or "The Great Memory Architectu

Gunhead

I'm trying to make sense of high-end memory architectures, because they may trickle down to consumer hardware some sunny day. Also because I'm fascinated, without any ulterior motive. Unfortunately, all my googling efforts turn up only corporate PR. Can you good people show me the way and discuss SGI's NUMAflex, ccNUMA, Sun's stuff, IBM's mainframe approaches, etc., whatnot? -- Because I'm sure users/experts exist and even lurk here. TIA!

(Poor start, I admit.)
 
SGI's NUMAflex is based on computing nodes built of four CPUs with shared memory accessed through a crossbar memory controller. Each node's memory controller links to the other nodes' memory controllers through NUMAlink, which is how a CPU accesses memory on a different computing node.


Compare this to a multi-CPU AMD Opteron system, where each Opteron can be seen as a single computing node and the inter-CPU HyperTransport links take the place of the NUMAlink.
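
If you want to see how "far away" each node is from the others on a real box, here's a minimal sketch using Linux's libnuma -- my assumption about the tooling, nothing SGI- or AMD-specific. It just dumps the firmware-reported node distance matrix: 10 means local memory, larger values mean more interconnect hops over NUMAlink, HyperTransport or whatever the machine uses. Build with gcc numa_topo.c -lnuma.

/* Minimal sketch: dump the NUMA node distance matrix (Linux + libnuma assumed). */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this kernel/machine\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;
    printf("%d NUMA node(s)\n\n     ", nodes);
    for (int j = 0; j < nodes; j++)
        printf("n%-3d ", j);
    printf("\n");

    for (int i = 0; i < nodes; i++) {
        printf("n%-3d ", i);
        for (int j = 0; j < nodes; j++)
            printf("%-4d ", numa_distance(i, j));  /* 10 = local, bigger = more hops */
        printf("\n");
    }
    return 0;
}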

I do believe we will see some extremely powerful systems based on the Opteron. 8-way systems are supported without any extra logic, and I guess we will see some chips making it possible to connect 8-way systems to create really large clusters.
 
CoolAsAMoose, thanks! Any idea on the delays (OK, latencies, forgot the parlance) when accessing memory outside the "local" node? SGI's Cray-based stuff is supposed to be the cat's ass, so I'm very interested to hear...
 
For the AMD Opteron:

"Latency difference between local and remote memory in an 8P system is comparable to the difference between a DRAM page hit and a DRAM page conflict".

Got it from page 44 of this presentation.

I guess latency will in many cases be lower in a 4P system than in an 8P one, as remote memory may be one hop further away with 8P. Compare the illustrations on page 42 (4P) and page 43 (8P) in the presentation.
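
For what it's worth, you can measure the local/remote gap yourself with a pointer-chasing probe. Rough sketch below, assuming a Linux machine with at least two nodes and libnuma (my guess at the tooling, not anything from the presentation): pin the thread to node 0, allocate one buffer locally and one on the farthest node, and time serialized loads through each. Build with gcc numa_lat.c -O2 -lnuma.

/* Rough local vs. remote latency probe -- assumes Linux + libnuma, >= 2 nodes. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <numa.h>

#define N (16UL * 1024 * 1024 / sizeof(void *))   /* 16 MiB worth of pointers */

static void * volatile sink;                      /* keeps the chase from being optimized away */

static double chase_ns(void **buf)
{
    /* Link the buffer into one long pointer chain (prime stride defeats streaming). */
    for (size_t i = 0; i < N; i++)
        buf[i] = &buf[(i + 4099) % N];

    struct timespec t0, t1;
    void **p = &buf[0];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        p = (void **)*p;                          /* serialized, cache-missing loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;

    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA machine with at least two nodes\n");
        return 1;
    }
    numa_run_on_node(0);                          /* run on node 0's CPUs */

    void **local  = numa_alloc_onnode(N * sizeof(void *), 0);
    void **remote = numa_alloc_onnode(N * sizeof(void *), numa_max_node());
    if (!local || !remote)
        return 1;

    printf("local  memory: ~%.1f ns per dependent load\n", chase_ns(local));
    printf("remote memory: ~%.1f ns per dependent load\n", chase_ns(remote));

    numa_free(local,  N * sizeof(void *));
    numa_free(remote, N * sizeof(void *));
    return 0;
}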


I'm not sure about the latency in a NUMAflex system, but I guess it's similar to an AMD Opteron MP system. However, the AMD Opteron (like the upcoming Hammer-based Athlon) should have a big advantage latency-wise when accessing local memory, thanks to the integrated memory controller and the direct connection to fast DDR SDRAM.
 
I'm not so sure Opteron setups with 8 or fewer processors will act like NUMA. Developers expect them to work like an SMP system, and if they take no care it seems easy to run into pathological situations where one memory pool gets hit disproportionately (a reactive system that could migrate pages would not suffer such problems to the same degree; unfortunately AMD provides very little in that regard). My guess is that most systems will run in the interleaved memory mode.
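
For the curious, the two policies look roughly like this through Linux's libnuma -- my sketch under that assumption, not anything AMD ships. Interleaved allocation stripes pages round-robin over all nodes so no single memory pool gets hammered, while node-local allocation keeps pages next to the CPU that uses them, which only pays off if the scheduler keeps the thread there. Build with gcc numa_policy.c -lnuma.

/* Sketch of interleaved vs. node-local allocation (Linux + libnuma assumed). */
#include <stdio.h>
#include <string.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    size_t sz = 64UL * 1024 * 1024;

    /* Pages striped across every node: uniform, SMP-like average latency,
     * no single memory pool gets hit disproportionately. */
    char *spread = numa_alloc_interleaved(sz);

    /* Pages placed on the node the calling thread runs on: lowest latency,
     * but only as long as the thread stays near its data. */
    char *nearby = numa_alloc_local(sz);

    memset(spread, 0, sz);   /* touching the pages actually commits them */
    memset(nearby, 0, sz);

    printf("interleaved at %p, node-local at %p\n", (void *)spread, (void *)nearby);

    numa_free(spread, sz);
    numa_free(nearby, sz);
    return 0;
}

The same interleaving can be forced on an unmodified program from the outside with numactl --interleave=all ./app.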
 
The memory allocation problem on a NUMA machine is akin to that of processor affinity. Moving a process from one MPU to another is costly even on an SMP because of the cached context; NUMA increases this cost, sometimes significantly.

Still, if the operating system is made aware of how the memory is laid out, the memory allocator should be able to act accordingly. The really interesting problem is what to do when all memory on one MPU is allocated and more is requested: do we move a process (and free up some local memory), or just allocate on the nearest neighbour?
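
On Linux the "allocate on the nearest neighbour" answer maps to a preferred-node policy; here is a small sketch with libnuma (my assumption about the OS/tooling, and the node number is just illustrative). The kernel satisfies allocations from the home node and silently overflows to other nodes when it is full, rather than failing or moving the process. Build with gcc numa_prefer.c -lnuma.

/* Sketch of "home node if possible, neighbour if not" (Linux + libnuma assumed). */
#include <stdio.h>
#include <string.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    int home = 0;                 /* hypothetical "home" node for this process */
    numa_run_on_node(home);       /* keep the thread on the home node's CPUs */
    numa_set_preferred(home);     /* prefer home memory, overflow to other nodes when full */

    size_t sz = 256UL * 1024 * 1024;
    char *buf = numa_alloc_onnode(sz, home);   /* ask for this buffer on the home node... */
    /* ...while ordinary first-touch allocations in the process now follow the
     * preferred policy: home node first, another node with free memory after. */

    memset(buf, 0, sz);
    printf("allocated %zu MiB with home node %d\n", sz >> 20, home);
    numa_free(buf, sz);
    return 0;
}

The other answer Gubbi mentions, dragging the pages along when a process moves, is what the later migrate_pages()/numa_move_pages() interfaces are for.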

But I guess Mfa is right. With the relatively low latency penalty associated with the MPU <-> MPU links, you can get away with treating it as just a big, flat, symmetric memory (not sure about 8-way, though).

Cheers
Gubbi
 