I wish Intel would extend those cards at the lower end. A ~$500, 50-core card just for development/research would be nice. Otherwise, sooner or later, there will be more software optimized for GPUs than for Xeon Phi, and the x86 argument will be void.
A link in the comments goes to a heise online report, "Intel's supercomputer accelerators: Xeon Phi with 72 cores, integrated InfiniBand and 3 teraflops DGEMM" (original), which claims (among other specs) that Knights Landing will have up to 72 "advanced" Silvermont cores.

New article from David Kanter:
http://www.realworldtech.com/knights-landing-cpu-speculation/
Silvermont changes more than just adding OoOE, which, by the way, is for the integer pipelines only and not the FP/SIMD or memory pipelines. I expect that it's precisely because it's for the ALUs only that the OoOE die impact is so small, but that's no good at all for something like Xeon Phi. You can't use Saltwell vs Silvermont performance as an evaluation of the performance of OoOE vs SMT.
Saltwell's SMT is pretty different from what's in Xeon Phi anyway, not to mention that the typical workloads are very different. If you're in a situation where you can require that there are at least two threads loaded on the core in order to get good performance, then hiding latency and filling execution-unit occupancy with simple SMT is going to be a much better choice than doing so with OoOE. Xeon Phi falls under that category, unless Intel really does want to do a unified core like Nick thinks.
Silvermont's OoOE is for integer and memory; the FPU is in-order.
David
The queue in front of the AGEN unit itself is in-order, which is what I was referring to in my post.
According to the optimization manual:
"Out-of-order execution for integer instructions and de-coupled ordering between non-integer and memory instructions."
"Memory instructions must generate their addresses (AGEN) in-order and schedule from the scheduling queue in-order but they may complete out-of-order."
Simply saying Silvermont is OoOE for memory is not telling the whole story.
In-order issue isn't really a problem with a retry buffer and a single AGU.
David
Unless the address-generation-to-use latency is a single cycle (the manual doesn't say what it is, but I strongly doubt this), it's going to stall at least occasionally on loads depending on loads, i.e. any kind of pointer chasing. That's not just going to happen when traversing data structures but also with stuff like applying LUTs to loaded values or loading a pointer off the stack. If you have something like this in a tight loop, you could miss a lot of opportunity for hiding the latency of the loads. These stalls can be hard to hide in software with only 16 registers, and considerably worse if you're stuck with 8 registers like on Android.
The queue in front of the AGEN is fed from the global ROB/instruction queue when operands are ready. I.e. once issued to the AGEN reservation station, ops run at full tilt.
Your pointer-chasing scenario doesn't stall the AGEN unit, because the AGEN for the dependent load won't be issued to the AGEN queue until the first load has completed.
Cheers
http://tomforsyth1000.github.io/blog.wiki.html#[[Why didn't Larrabee fail?]]
Some thoughts about Larrabee in a Tom Forsyth blog post.
8 texture sampling units took 10% of the die space, and they would have wanted 16 to be competitive = 20% of the die space. I didn't realize modern texture sampling units are so large. I am wondering whether the tiny 4 KB L1 texture caches are tied to the sampling unit (they borrowed the sampling units from their GPU). IIRC it contains uncompressed texels ready for filtering. Intel GPUs also have L2 texture caches (16 KB / 24 KB), but a general-purpose cache would likely replace the bigger L2 texture cache.