Guden Oden said:
DeanoC said:
As such, if we get a situation where each Xenon core is multi-threaded but the PUs aren't, then even with equal theoretical figures Xenon will have better real-world figures.
That's not necessarily the case, as witnessed with hyperthreaded P4s vs. Athlon 64 chips. If a multithreaded core doesn't beat a single-threaded core when both use the same ISA, imagine trying to predict the outcome when the cores use different ISAs!
Which was partly true with the Northwood core, but less so with Prescott.
The P4 isn't particularly elegant in the way it statically splits resources when Hyper-Threading is enabled. It divides its global scheduling window into two equally sized chunks, and most queues are also split in two, like the memory instruction queue, the general instruction queue etc.
Odd things like the write-combining buffers are also split.
Prescott increases the size of some of the buffers, but not the size of some very important resources like the trace cache and the global scheduler.
What this means is that instead of getting better throughput, you can end up in situations where both threads stall, each having run out of its share of the resources, instead of just one thread stalling while the other chugs along.
For example, by splitting the global scheduler, you have two threads, each with half the latency tolerance of the single thread they replaced. Another problem is that the two hyperthreads take turns fetching three uops from the trace cache on alternating cycles, so even if you have two perfectly complementary threads, one with a high number of memory dependencies and one which is uop-throughput limited, the P4 can't take advantage of that.
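To put rough numbers on the alternating-fetch point, here is a toy sketch in Python. It is not a model of the real P4 front end: the stall pattern, the 80-cycle run and the "flexible" policy (hand a stalled thread's slot to the other thread) are all made up for illustration. Only the three-uops-per-slot figure and the strict alternation come from the description above.

```python
FETCH_WIDTH = 3  # uops fetched per slot, as described in the post


def a_stalled(cycle):
    """Hypothetical stall pattern: memory-bound thread A is only
    ready to fetch on one cycle in every eight."""
    return cycle % 8 != 0


def simulate(cycles, stalled, policy):
    """Return (uops fetched by A, uops fetched by B).

    Thread A is memory-bound and stalls according to `stalled`.
    Thread B is throughput-limited and can always use a fetch slot.
    policy "alternate": slots strictly alternate, a stalled owner wastes its slot.
    policy "flexible":  hypothetical scheme where B takes over A's unused slots.
    """
    fetched_a = 0
    fetched_b = 0
    for cycle in range(cycles):
        a_owns_slot = (cycle % 2 == 0)
        a_ready = not stalled(cycle)
        if a_owns_slot and a_ready:
            fetched_a += FETCH_WIDTH
        elif not a_owns_slot:
            fetched_b += FETCH_WIDTH
        elif policy == "flexible":
            # A owns the slot but is stalled: give it to B instead.
            fetched_b += FETCH_WIDTH
        # Under "alternate", A's stalled slot is simply lost.
    return fetched_a, fetched_b


print(simulate(80, a_stalled, "alternate"))  # (30, 120)
print(simulate(80, a_stalled, "flexible"))   # (30, 210)
```

In this made-up case thread A gets the same 30 uops either way, but the throughput-limited thread B is capped at half the fetch bandwidth under strict alternation, even though A leaves most of its slots unused.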
Guden Oden said:
Besides... The more threads you have competing over the same cache, the bigger the risk they're going to be pushing each other out of it, and six threads and 1MB cache (of which some could be reserved for vertex buffers for the GPU) could create quite a mess. Even at best it's less cache per thread than what the GC offers now I might add.
Right, this is particularly a problem with the P4's tiny 12K uop trace cache; thrashing is probably not uncommon.
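For the cache-per-thread arithmetic in the quote, a quick back-of-the-envelope sketch. It assumes an even split between threads, which a real shared cache does not guarantee. The 128 KB reserved for the GPU is a purely hypothetical figure, and the 256 KB GameCube L2 size is my number, not something stated in the quote.

```python
def cache_per_thread_kb(total_kb, threads, reserved_kb=0):
    """Even-share estimate of cache per hardware thread, in KB.
    Assumes the cache divides equally, which is a simplification."""
    return (total_kb - reserved_kb) / threads


# 1 MB shared by six threads, as in the quoted post.
xenon_share = cache_per_thread_kb(1024, 6)

# Same, with a hypothetical 128 KB set aside for GPU vertex buffers.
xenon_share_reserved = cache_per_thread_kb(1024, 6, reserved_kb=128)

# GameCube CPU: single thread, 256 KB L2 (assumed figure, not from the post).
gamecube_share = cache_per_thread_kb(256, 1)

print(round(xenon_share, 1))           # 170.7
print(round(xenon_share_reserved, 1))  # 149.3
print(gamecube_share)                  # 256.0
```

So on an even-share basis the quoted claim holds: each of the six threads would see less cache than the GameCube's single thread does.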
However, IBM's Power 4/5 CPU multithreading implementations seem to be much more robust, as witnessed by their not-so-stellar single-thread performance (SPECint and SPECfp) but crazy throughput (SPECrate).
First, their resource division scheme seems to be much more flexible, and second, Xenon CPUs are looking to have lots of cache.
Cheers
Gubbi