darkblu said:
"overlapping address spaces with potential context violations
If you want, you can separate the address spaces with the MMU; then you have things called processes.
On console platforms, they're not generally used, but all modern OSes on SMP architectures support that abstraction if you want.
everything ever touched by more than one thread should be thread safe or you should be absolutely sure what you're doing
This is true of SPEs too. I'm still not clear how a DMA properly synchronizes SPE execution to the PPE. If the PPE wants a job executed, how does it signal to the SPE it wants the job to be done? If the SPE is already busy, how does the PPE add it to the SPE's job queue? How does it get a signal back from the SPE that the job is complete? How does it know where to collect the results from?
All of these operations require two threads to touch some sort of data structure, hence there has to be some sort of locking protecting those data structures, right?
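(To be concrete about what I mean by "some sort of data structure": below is a minimal, generic C++ sketch of a lock-protected job queue shared by two threads. This is not Cell/libspe code and the Job/JobQueue names are made up purely for illustration; it's only there to show that submitting work, queuing it, and picking it up all funnel through shared state that has to be guarded somehow.)

    // Hypothetical job queue shared by a submitting thread and a worker thread.
    // Every touch of the shared queue goes through a single mutex.
    #include <condition_variable>
    #include <mutex>
    #include <queue>

    struct Job { int id; /* payload, pointer to where the results go, etc. */ };

    class JobQueue {
    public:
        void push(Job j) {                 // called by the submitting thread
            {
                std::lock_guard<std::mutex> lock(m_);
                jobs_.push(j);
            }
            cv_.notify_one();              // wake the worker: work is available
        }
        Job pop() {                        // called by the worker thread
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !jobs_.empty(); });  // sleep until a job arrives
            Job j = jobs_.front();
            jobs_.pop();
            return j;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<Job> jobs_;
    };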
actually, the part after the AND above is totally superfluous.
Consider the case in which you have 3 threads and 1 CPU.
The high priority thread is waiting on a resource held by the low priority thread. There is a medium priority thread consuming all cycles on CPU0. Every timeslice, the OS will examine the ready to run threads, notice that the medium priority thread is ready to run, and dispatch it. Because the low priority thread is never given a timeslice, it will continue to hold the lock, blocking the high priority thread from ever completing.
This is the priority inversion problem.
(Most OSes do some sort of priority inversion detection and/or boost the priority of CPU-starved threads to help reduce the impact of priority inversion.)
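Here's a rough Win32 sketch of that 1-CPU setup, just to make it concrete. The thread functions and timings are arbitrary, and as noted above a real Windows scheduler will eventually boost the starved low priority thread, so in practice this unsticks itself rather than hanging forever:

    // Pin the whole process to CPU 0, then run a low priority thread that holds
    // a lock, a medium priority thread that spins, and a high priority thread
    // that wants the lock. The spinner starves the lock holder, so the lock is
    // not released and the high priority thread sits blocked: priority inversion.
    #include <windows.h>
    #include <cstdio>

    CRITICAL_SECTION g_lock;

    DWORD WINAPI LowPrio(LPVOID) {
        EnterCriticalSection(&g_lock);
        for (volatile int i = 0; i < 100000000; ++i) {}   // "work" done while holding the lock
        LeaveCriticalSection(&g_lock);
        return 0;
    }

    DWORD WINAPI MediumPrio(LPVOID) {
        for (volatile unsigned i = 0; ; ++i) {}            // hogs the CPU forever
    }

    DWORD WINAPI HighPrio(LPVOID) {
        EnterCriticalSection(&g_lock);                     // blocks until LowPrio releases the lock
        std::printf("high priority thread got the lock\n");
        LeaveCriticalSection(&g_lock);
        return 0;
    }

    int main() {
        InitializeCriticalSection(&g_lock);
        SetProcessAffinityMask(GetCurrentProcess(), 1);    // everything runs on CPU 0 only

        HANDLE lo = CreateThread(nullptr, 0, LowPrio,    nullptr, CREATE_SUSPENDED, nullptr);
        HANDLE md = CreateThread(nullptr, 0, MediumPrio, nullptr, CREATE_SUSPENDED, nullptr);
        HANDLE hi = CreateThread(nullptr, 0, HighPrio,   nullptr, CREATE_SUSPENDED, nullptr);

        SetThreadPriority(lo, THREAD_PRIORITY_LOWEST);
        SetThreadPriority(md, THREAD_PRIORITY_NORMAL);
        SetThreadPriority(hi, THREAD_PRIORITY_HIGHEST);

        ResumeThread(lo);
        Sleep(10);                 // give the low priority thread time to take the lock
        ResumeThread(md);
        ResumeThread(hi);

        WaitForSingleObject(hi, INFINITE);
        return 0;
    }

Change the affinity mask from 1 to 3 (two CPUs) and the low priority thread gets to run alongside the spinner, which is exactly the second case below: the lock gets released and the inversion disappears.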
Now consider the case in which you have 3 threads and 2 CPUs.
The high priority thread is waiting on a resource held by the low priority thread. There is a medium priority thread consuming all cycles on CPU0. Every timeslice, the OS will examine the ready to run threads, notice that the medium priority thread is ready to run, and dispatch it onto CPU0. Then it will see that the low priority thread is ready to run, and dispatch it onto CPU1. Because the low priority thread is given a timeslice, it will run, and will eventually release the lock, unblocking the high priority thread.
There is NO priority inversion problem in this case.
Now consider the case in which you have 3 threads and 3 CPUs.
The thread running on CPU0 is blocked waiting on a resource held by the thread on CPU2. There is a thread consuming all cycles on CPU1. The thread on CPU2 will still execute to completion and release the lock, thus allowing the thread on CPU0 to acquire the resource and continue.
There is NO priority inversion problem in this case.
If you have exactly 6 threads assigned to 6 hardware contexts that are executing concurrently, you will never get a priority inversion problem: a thread that holds a resource will continue to execute until it releases the resource, allowing the other thread to acquire it. There is no starvation precisely because the OS scheduler is not involved and there is no prioritization of threads going on; they're all running concurrently.
sorry, i missed that - why? you have N threads for the high performance parts (N = num hw threads) and an arbitrary number of other 'non-high performance' threads - and you get software threading, just not among the high-performance threads, supposedly.
It's stupid to blow 1000s of cycles on a software context switch for a high performance thread. For high performance threads you want to dedicate a hardware thread to it. For low performance threads, just pack them all onto one hardware thread to avoid scheduling them with your high performance threads and messing them up.
You can create 5 threads, then use SetThreadAffinityMask() to lock them to the 5 hardware threads that the system provides. From then on, those 5 threads will always execute on those 5 hardware threads and the system will never move them to other cores or cause them to preempt each other in software. Then you create all your low performance threads, lock them to the remaining hardware thread, and those get software threaded automatically by the OS scheduler.
You can vary this ratio or arrangement however you like. You can create 3 threads and lock them to the 3 cores and not use SMT or software scheduled threads at all. 4 threads, two on one, and one each on the remaining. Or 10 threads, 5 on the hardware threads, and the remaining 5 software scheduled on the last hardware thread. You have the flexibility to set it up however you want.
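In Win32 terms the 5-plus-1 arrangement looks roughly like this (a sketch: it assumes the 6 hardware threads show up as logical processors 0 through 5, and the work functions are obviously placeholders):

    // Pin 5 high performance threads to hardware threads 0-4, and herd all the
    // low performance threads onto hardware thread 5, where the OS scheduler
    // software-threads them among themselves.
    #include <windows.h>

    DWORD WINAPI HighPerfWork(LPVOID) { /* render, physics, audio, ... */ return 0; }
    DWORD WINAPI LowPerfWork(LPVOID)  { /* streaming, logging, ... */ return 0; }

    int main() {
        // One high performance thread per hardware thread 0..4.
        for (int i = 0; i < 5; ++i) {
            HANDLE h = CreateThread(nullptr, 0, HighPerfWork, nullptr, CREATE_SUSPENDED, nullptr);
            SetThreadAffinityMask(h, ((DWORD_PTR)1) << i);   // lock it to hardware thread i
            ResumeThread(h);
        }
        // Everything low performance piles onto hardware thread 5.
        for (int i = 0; i < 8; ++i) {
            HANDLE h = CreateThread(nullptr, 0, LowPerfWork, nullptr, CREATE_SUSPENDED, nullptr);
            SetThreadAffinityMask(h, ((DWORD_PTR)1) << 5);
            ResumeThread(h);
        }
        Sleep(INFINITE);    // stand-in for the real main loop
        return 0;
    }

Creating the threads suspended and resuming them only after the mask is set just avoids the brief window where a thread could start running on the wrong core.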
after the slight corrections above i don't see why anymore.
not only can thread (N) starve (N - Y) on a scheduling basis, but (N - Y) can be running and still have its SMT 'roommate' trash its cache so badly that you get a brand new form of 'priority inversion' - one where (N + X) cannot run because a thread of arbitrarily low priority (even lower than N - Y), to which (N + X) has no contention relations whatsoever, is cache-bullying (N + X)'s lock-keeper (N - Y).
Regarding cache line evictions, remember that going out to main memory costs hundreds of CPU cycles. So the first cache eviction that thread A hits will cause it to stall and allow B to execute. It will be hundreds of CPU cycles before the data comes back from memory and actually kicks out the cache line that B needs, if it even does, which isn't likely, because the cache will probably choose a colder cache line to evict than the one that B just spent a hundred CPU cycles working with.
In any case, cache eviction is a performance issue, not a correctness issue. You don't have to worry about this when you're just trying to get your multithreaded code to work; it will run just fine, only slowly.
You can always go back later and clean it up and optimize it: rearrange your data structures, add the prefetching, the cache locking and whatever. Whereas with stuff like DMAs and LS and asymmetric threads, you have to get it right up front, and all of it has to be working before your program can start doing any useful work.
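For instance, the prefetching pass might end up looking something like this later on (GCC/Clang-style __builtin_prefetch shown purely for illustration; the struct and function are made up, and on the PPC consoles you'd reach for the dcbt-style cache touch hints instead):

    // Sketch: while working on element i, hint the cache to start fetching
    // element i+1, so the memory latency overlaps with useful work.
    #include <cstddef>

    struct Particle { float pos[4]; float vel[4]; };

    void integrate(Particle* p, std::size_t count, float dt) {
        for (std::size_t i = 0; i < count; ++i) {
            if (i + 1 < count)
                __builtin_prefetch(&p[i + 1]);   // non-binding prefetch hint
            for (int k = 0; k < 4; ++k)
                p[i].pos[k] += p[i].vel[k] * dt;
        }
    }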