Interesting, so you're thinking of vector processors like the ones described in that Wikipedia entry? By those standards even GPUs are pretty narrow vector machines. For some reason, when I think of modern vector processors I picture something closer in width to the SPUs or the SIMD units in today's CPUs.
GPUs are rather narrow compared to the long vector machines.
I'm not sure there is a hard line for vector machines, but GPUs have a few of the other facets of full vector machines that x86 does not yet have.
Am I right to think that the wider the vectors the more complicated it becomes for the hardware to support those read/write scatter/gather operations?
It can become more complicated. Part of the complication stems from how many of these individual accesses are done simultaneously, what happens when lanes access the same locations, whether the operation is interruptible, and whether the architecture even cares.
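To make the "same locations" issue concrete, here's a minimal sketch in plain C (not any particular ISA) of a 4-lane scatter; the lane ordering and the "who wins on a collision" rule are exactly the corner cases the hardware has to pin down:

[code]
/* Minimal sketch of why vector scatter needs its corner cases defined:
 * if two lanes target the same index, the result depends on which lane
 * "wins". The sequential lane order here is just one possible choice. */
#include <stdio.h>

#define LANES 4

static void scatter(float *mem, const int idx[LANES], const float val[LANES])
{
    /* A real ISA must specify whether lanes are ordered, unordered,
     * or whether overlapping indices are simply undefined. */
    for (int lane = 0; lane < LANES; ++lane)
        mem[idx[lane]] = val[lane];
}

int main(void)
{
    float mem[8] = {0};
    int   idx[LANES] = {1, 5, 1, 3};          /* lanes 0 and 2 collide */
    float val[LANES] = {10.f, 20.f, 30.f, 40.f};

    scatter(mem, idx, val);
    printf("mem[1] = %g\n", mem[1]);          /* 30 with last-lane-wins order */
    return 0;
}
[/code]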
I'm not sure I understand that properly. As I read it, it's a blend of "this only becomes an issue once you go to really wide vector machines" and "some instructions/functionalities just don't make sense" (a bit like how an SPU can't do everything a PPU does, even if you were willing to "waste" 3 of its 4 "lanes").
It can go either way. There are things like synchronization operations that involve performing a read/modify/write operation as a single atomic event.
For a scalar operation, there isn't the possibility that it could be interrupted or contend with itself. For a vector version, without carefully defining the corner cases the instruction may not work, and even if it does, it could involve enough overhead to make it prohibitive (see the sketch below).
There was a paper on atomic vector operations linked in the forum previously, I'm not sure where at the moment.
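For illustration only (this is not taken from that paper), here's a small C sketch of the classic histogram case: a naive gather/add/scatter silently drops counts when lanes hit the same bin, while per-lane scalar read/modify/write does not, and that per-lane guarantee is what a proper vector atomic would have to preserve:

[code]
/* Hypothetical example of why "vector atomics" need careful definition:
 * three lanes increment the same histogram bin. */
#include <stdio.h>

#define LANES 4

int main(void)
{
    int bins_naive[4]  = {0};
    int bins_scalar[4] = {0};
    int idx[LANES] = {2, 2, 1, 2};            /* three lanes hit bin 2 */

    /* Naive "vector" version: gather, add, scatter as separate steps. */
    int gathered[LANES], summed[LANES];
    for (int lane = 0; lane < LANES; ++lane) gathered[lane] = bins_naive[idx[lane]];
    for (int lane = 0; lane < LANES; ++lane) summed[lane]   = gathered[lane] + 1;
    for (int lane = 0; lane < LANES; ++lane) bins_naive[idx[lane]] = summed[lane];

    /* Scalar read/modify/write per lane: every increment lands. */
    for (int lane = 0; lane < LANES; ++lane) bins_scalar[idx[lane]] += 1;

    printf("bin 2: naive=%d, scalar RMW=%d\n", bins_naive[2], bins_scalar[2]);
    return 0;   /* prints naive=1, scalar RMW=3 */
}
[/code]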
I get this part better, or so I hope: utilization should not be spoken of in isolation, without regard to efficiency.
That is what has been argued. Care should be taken that everyone is using the same definitions of utilization and efficiency; sometimes we can all use the same words but be talking about very different things.
What I had in mind when I posted earlier, after reading your discussion with Nick, Rpg.314, JohnH and the others, is this: maybe what Nick calls for (something like Larrabee, I guess) won't happen, either because of real technical problems or because such a shift lacks utility, but could a "plain", "narrow" design (narrow by historic vector-processor standards) fit the bill?
That topic isn't just about vector instructions, but overall design philosophy.
The focus on vectors in my earlier comment is that there are assumptions made in their functionality that do not apply universally, so they are not as generic as scalar ops can be.
Other parts of the argument don't necessarily concern themselves with vector versus scalar. The discussion on texturing also covers the layout of the cache and how it can hinder full scatter/gather throughput in the form Larrabee most likely implemented it.
This actually has something to do with trying to shoehorn vector capability onto a memory pipeline that has a scalar design as its basis. I'll expound on this more in a bit.
I don't know what the next iterations of Larrabee will look like, but the more I read comments on the matter (and on GPUs too), the more I feel it looks way too much like a "standard" CPU (in regard to the caches especially); on the other hand, I feel it also tried too hard to look like a GPU (pretty wide, but not by historical vector standards).
This is actually a criticism leveled at SIMD extensions in general, and x86 in particular (because it tended to be the worst offender).
The short vectors, the inflexible memory operations, the clunky permute capability are "vector" extensions for a design that emphasizes low-latency scalar performance.
Scatter/gather is not simple to perform at speed on the very same memory ports that the scalar ops use, and it is not simple to make it a first-class operation when there are some pretty hefty requirements imposed by the rules of operating on the scalar side (coherence, consistency, atomicity etc.) Very frequently, the vector operations tend to be more lax, but this also means they are not as generic as the scalar side.
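As a rough illustration of the first point, here's what a 4-wide gather looked like on SSE-era x86, which had no gather instruction at all: four independent scalar loads plus a pack step, all going through the same load ports as ordinary scalar code (gather4 is just a helper name I made up):

[code]
/* Minimal sketch, assuming SSE-class x86: a "gather" built from
 * per-lane scalar loads packed into one register. */
#include <xmmintrin.h>
#include <stdio.h>

static __m128 gather4(const float *base, const int idx[4])
{
    /* Four independent scalar loads, then packed; element 0 is idx[0]. */
    return _mm_set_ps(base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
}

int main(void)
{
    float table[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int   idx[4]   = {7, 2, 5, 0};

    float out[4];
    _mm_storeu_ps(out, gather4(table, idx));
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 7 2 5 0 */
    return 0;
}
[/code]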
Could Intel, in its goal of pushing x86, have passed on an opportunity to create a de facto standard (with a matching licensable ISA) for vector processors?
Intel has had ~15 years to do it, but x86 is not a vector ISA. Its extensions were SIMD on the cheap, and they were roundly criticized for their lack of flexibility and restrictions in their use.
Each iteration has improved certain aspects, as transistor budgets expanded, but there are some strong constraints imposed by the scalar side.
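One concrete example of the sort of restriction that drew criticism, assuming an SSE-class target and GCC/Clang alignment syntax: the fast packed load demands 16-byte alignment, so code either over-aligns its data or falls back to the traditionally slower unaligned form:

[code]
/* Sketch of one SSE-era restriction: aligned vs. unaligned packed loads. */
#include <xmmintrin.h>
#include <stdio.h>

int main(void)
{
    /* Over-aligned buffer so the aligned load is legal (GCC/Clang syntax). */
    __attribute__((aligned(16))) float data[4] = {1.f, 2.f, 3.f, 4.f};

    __m128 a = _mm_load_ps(data);    /* faults if data were not 16B aligned */
    __m128 u = _mm_loadu_ps(data);   /* always legal, traditionally slower  */

    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(a, u));
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 2 4 6 8 */
    return 0;
}
[/code]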
This is also where there is a difference of opinion.
There is the position that there can be a core that can do everything as well as a throughput-oriented design while still being focused on latency-sensitive scalar performance, all while not blowing up the transistor and power budget.
This goes beyond vector versus scalar and concerns the overarching question of generality versus specialization.
I have a vague feeling that Intel approached the problem from the wrong angle; could they have forgotten that sometimes more is less? Instead of taking on the challenge of competing against high-end GPUs with an x86 front end and fairly wide (yet still narrow) SIMD, with all the hardware headaches that implies, would they have been better off entering the market from below rather than trying to enter it from above?
The market below is far more power-conscious, so it would take even longer to compete there. There are integrated designs that don't have unified shaders or full programmability because the power usage was not acceptable.
Even for the high end, the chip was massive, and no real numbers showed up to indicate it would be competitive at the time of release.
Part of the problem is that Intel needed a compelling advantage and a unified message. The design was too delayed to be compelling, and Intel's message was never coherent (and there were signs that various divisions were not trying too hard to help it).