aaaaa00 said: If you want, you can separate the address spaces with the MMU, then you have things called processes. On console platforms they're not generally used, but all modern OSes on SMP architectures support that abstraction if you want.
set up separate address spaces and then use IPC? sorry, but you just blew a chunk of the potential advantage smp had over the cellular design - the oh-so-precious total memory coherency. congratulations.
This is true of SPEs too. I'm still not clear how a DMA properly synchronizes SPE execution to the PPE. If the PPE wants a job executed, how does it signal to the SPE it wants the job to be done? If the SPE is already busy, how does the PPE add it to the SPE's job queue? How does it get a signal back from the SPE that the job is complete? How does it know where to collect the results from?
i'm not aware of the actual implementation either, but i think we can safely assume hardware-run queues and interrupt notifications, can't we?
All of these operations require two threads to touch some sort of data structure, hence there has to be some sort of locking protecting those data structures, right?
yes, with locking most likely between _two_ threads running on the PPE. as opposed to many more on the xecpu.
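for what it's worth, here's roughly what that handshake looks like over the hardware mailbox channels - a sketch in the style of ibm's cell sdk (libspe2 on the ppe side, spu_mfcio.h intrinsics on the spu side), untested, with the program handle name and the 32-bit job address made up for the example:

[code]
/* ppe side - compiled with ppu-gcc, linked against libspe2.
   error handling elided; 'spu_job' is a hypothetical embedded spu binary,
   and passing the job's effective address as a 32-bit mailbox word is a
   simplification. */
#include <libspe2.h>

extern spe_program_handle_t spu_job;

int run_job(unsigned int job_ea) /* effective address of a job descriptor */
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    unsigned int entry = SPE_DEFAULT_ENTRY;
    unsigned int status;

    spe_program_load(ctx, &spu_job);

    /* "how does the ppe signal the spe": write the job's address into the
       spe's inbound mailbox - a hardware fifo, no software lock involved */
    spe_in_mbox_write(ctx, &job_ea, 1, SPE_MBOX_ALL_BLOCKING);

    /* blocks this ppe thread until the spu program stops */
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);

    /* "how does it get a signal back": the spu left a completion code
       in its outbound mailbox */
    spe_out_mbox_read(ctx, &status, 1);

    spe_context_destroy(ctx);
    return (int)status;
}

/* spu side - compiled separately with spu-gcc */
#include <spu_mfcio.h>

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    unsigned int job_ea = spu_read_in_mbox(); /* blocks until the ppe writes */

    /* dma the descriptor at job_ea into local store, do the work,
       dma the results back out ... (elided) */

    spu_write_out_mbox(0); /* completion notification back to the ppe */
    return 0;
}
[/code]

the mailboxes are hardware fifos, which is why no lock is needed just to hand one job over; the locking aaaaa00 is talking about only appears once you put a software-managed job queue in front of this.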
Consider the case in which you have 3 threads and 1 CPU.
The high priority thread is waiting on a resource held by the low priority thread. There is a medium priority thread consuming all cycles on CPU0. Every timeslice, the OS will examine the ready to run threads, notice that the medium priority thread is ready to run, and dispatch it. Because the low priority thread is never given a timeslice, it will continue to hold the lock, blocking the high priority thread from ever completing.
This is the priority inversion problem.
thanks for the lecture. now reconsider what i wrote to you in my previous post and try to comprehend it (because if you had done that the first time you wouldn't have written the above).
Now consider the case in which you have 3 threads and 2 CPUs.
The high priority thread is waiting on a resource held by the low priority thread. There is a medium priority thread consuming all cycles on CPU0. Every timeslice, the OS will examine the ready to run threads, notice that the medium priority thread is ready to run, and dispatch it onto CPU0. Then it will see that the low priority thread is ready to run, and dispatch it onto CPU1.
says who? who said the lower priority thread was cpu-agnostic? you assume too much. how about cpu affinities? or do you conveniently exclude them from the scheme?
Because the low priority thread is given a timeslice, it will run, and will eventually release the lock, unblocking the high priority thread.
no. the only condition stated in the problem was that the middle priority thread competes for cpu with the lower priority one; you cannot assume there's conveniently a spare processor where you can run the latter. please try to look at this problem more seriously.
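to make it concrete, here's that exact setup in win32 terms, with affinity masks pinning all three threads to cpu 0 so the 'spare processor' escape hatch is gone. untested sketch; note that nt's scheduler deliberately hands random priority boosts to starved threads - precisely an os-level band-aid for this very problem - so the stall below may eventually unstick itself:

[code]
#include <windows.h>
#include <stdio.h>

static CRITICAL_SECTION lock;
static volatile LONG low_done = 0;

static DWORD WINAPI low_fn(LPVOID p)
{
    EnterCriticalSection(&lock);
    for (volatile long i = 0; i < 200000000; ++i) {} /* "work" under the lock */
    LeaveCriticalSection(&lock);
    low_done = 1;
    return 0;
}

static DWORD WINAPI med_fn(LPVOID p)
{
    while (!low_done) {} /* cpu hog that never blocks */
    return 0;
}

static DWORD WINAPI high_fn(LPVOID p)
{
    EnterCriticalSection(&lock); /* stuck behind low, which med is starving */
    printf("high priority thread finally got the lock\n");
    LeaveCriticalSection(&lock);
    return 0;
}

int main(void)
{
    InitializeCriticalSection(&lock);

    HANDLE low  = CreateThread(NULL, 0, low_fn,  NULL, CREATE_SUSPENDED, NULL);
    HANDLE med  = CreateThread(NULL, 0, med_fn,  NULL, CREATE_SUSPENDED, NULL);
    HANDLE high = CreateThread(NULL, 0, high_fn, NULL, CREATE_SUSPENDED, NULL);

    /* the point under dispute: all three pinned to cpu 0, so
       "dispatch the low priority thread onto CPU1" is off the table */
    SetThreadAffinityMask(low,  1);
    SetThreadAffinityMask(med,  1);
    SetThreadAffinityMask(high, 1);

    SetThreadPriority(low,  THREAD_PRIORITY_LOWEST);
    SetThreadPriority(med,  THREAD_PRIORITY_NORMAL);
    SetThreadPriority(high, THREAD_PRIORITY_HIGHEST);

    ResumeThread(low);
    Sleep(50);          /* let low grab the lock first */
    ResumeThread(high); /* high now blocks on the lock... */
    ResumeThread(med);  /* ...and med starves low on cpu 0 */

    WaitForSingleObject(high, INFINITE);
    return 0;
}
[/code]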
If you have exactly 6 threads assigned to 6 hardware contexts which are executing concurrently, you will never get a priority inversion problem -- a thread that holds a resource will continue to execute until it releases the resource, allowing the other thread to acquire the resource. There is no starvation precisely because the OS scheduler is not involved and there is no prioritization of threads going on, they're all running concurrently.
yes, in the best possible theoretical setup. in practice, though, those 6 threads will have _some_ contention somewhere among themselves, thus some of them will be blocked now and then, thus you may want to get _some_ lower-priority work done on those hw contexts during that time (only if you care about good cpu utilization, of course), thus you will get the os scheduler involved, and thus we get back to the place we started from with this problem.
It's stupid to blow 1000s of cycles on a software context switch for a high performance thread. For high performance threads you want to dedicate a hardware thread to it. For low performance threads, just pack them all onto one hardware thread to avoid scheduling them with your high performance threads and messing them up.
alright. so it turns out you did not totally neglect thread affinities after all. that's actually very good. now you can get back to the priority inversion problem and reconsider it.
You can create 5 threads, then use SetThreadAffinityMask() to lock them to the 5 hardware threads that the system provides. From then on, those 5 threads will always execute on those 5 hardware threads and the system will never move them to other cores or cause them to preempt each other in software. Then you create all your low performance threads, lock them to the remaining hardware thread, and those get software threaded automatically by the OS scheduler.
well, good luck with parallelizing those 5 threads so that they never rendezvous. chances are they will, in which case you may want to utilize their hw contexts somehow.
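for reference, the 5+1 arrangement described above comes down to something like this (win32 sketch, untested; the worker functions and thread counts are placeholders):

[code]
#include <windows.h>

/* placeholder workers - stand-ins for real game/application code */
static DWORD WINAPI perf_worker(LPVOID arg) { /* hot loop */ return 0; }
static DWORD WINAPI misc_worker(LPVOID arg) { /* background work */ return 0; }

int main(void)
{
    HANDLE perf[5];
    for (int i = 0; i < 5; ++i) {
        perf[i] = CreateThread(NULL, 0, perf_worker, NULL,
                               CREATE_SUSPENDED, NULL);
        /* one hardware thread each: logical processors 0..4 */
        SetThreadAffinityMask(perf[i], (DWORD_PTR)1 << i);
        ResumeThread(perf[i]);
    }

    /* everything low-performance shares logical processor 5 and gets
       software-scheduled there by the os */
    for (int i = 0; i < 8; ++i) {
        HANDLE h = CreateThread(NULL, 0, misc_worker, NULL,
                                CREATE_SUSPENDED, NULL);
        SetThreadAffinityMask(h, (DWORD_PTR)1 << 5);
        SetThreadPriority(h, THREAD_PRIORITY_BELOW_NORMAL);
        ResumeThread(h);
    }

    Sleep(INFINITE);
    return 0;
}
[/code]

and this is where the contention point comes in: once pinned like this, the five performance threads never preempt each other in software, but the moment two of them rendezvous on a lock, the blocked one's hardware thread sits idle unless something else is handed to it - which is the utilization question above.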
You can vary this ratio or arrangement however you like. You can create 3 threads and lock them to the 3 cores and not use SMT or software scheduled threads at all. 4 threads, two on one, and one each on the remaining. Or 10 threads, 5 on the hardware threads, and the remaining 5 software scheduled on the last hardware thread. You have the flexibility to set it up however you want.
you have the flexibility, yes. our original argument was about 'ease' and 'correctness' and their relation, though. we never questioned the flexibility of smp multithreading - it's flexible, alright.
Regarding cache line evictions, remember that going out to main memory is going to cost 100s of CPU cycles. So the first cache eviction that thread A hits will cause it to block and allow B to execute -- it will be 100s of CPU cycles before the data comes back from memory and actually kicks out the cache line that B needs, if it kicks it out at all, which isn't likely: the cache will probably choose a colder cache line to evict than the one B just spent a hundred CPU cycles working with.
In any case, cache eviction is a performance issue, not a correctness issue. You don't have to worry about this when you're just trying to get your multithreaded code to work -- it will run just fine, but slowly.
sorry, i thought it was you who brought into this argument the many things you have to worry about with smp 'deviations' - say, accessing memory locales under numa.
You can always go back later and clean it up, optimize it, rearrange your data structures, add the prefetching and cache locking and whatever, whereas with stuff like DMAs and LS and asymmetric threads you have to get it all right up front -- all of it has to be working before your program can start doing any useful work.
aha. same with smp. try doing useful work with priority inversions ; )