22 nm Larrabee

My point was that gather/scatter alone isn't going to win the game.

Of course one needs less space in the instruction cache, for instance. But that probably gives you a performance benefit only in the single-digit percent range. And it occupies a lot fewer execution resources, true. But again, this won't be a game changer, even if it bought 30% or even 50% performance on average (which it probably won't). What would help is a cache/memory structure which can handle a lot of simultaneous requests to different addresses, i.e. a cache with, let's say, 8 read ports or something like that. So where is the missing performance factor of 4 to 8 relative to low-end GPUs/IGPs supposed to come from? Not from gather/scatter additions to the ISA, in my opinion. They make things simpler and also a bit faster, but not that much.

By the way, texture units are great things, especially the accompanying specialized caches. Nvidia didn't remove the separate L1 texture caches and integrate them into the general-purpose L1/local memory. And even Larrabee had texture units/cache. If I had to guess the reason ...
Well, the magic pixie dust in scatter/gather is supposed to make MS Word go faster...
 
The relatively small code expansion that resulted from transitioning from x86 to x86-64 was enough to negate the benefit of doubling the register file. The difference either way was a few percent, yes.

A scatter/gather could involve 8, 16, or 32 instructions being folded into one, so the benefits could be stronger in performance-critical code.
The benefits would probably also be stronger in an OoO processor than in the in-order architecture used in the paper.

One important consideration in an OoO implementation is that the scatter/gather should not be blocking.
The penalty if it remains blocking may be enough to make the OoO engine counterproductive in scatter/gather-heavy code. The disambiguation hardware would be negated, and the core could not generate MLP from other instructions in the stream while the scatter/gather blocks issue.


The debate about the nature of the memory subsystem has come up in other threads.
 
That's a 3-year-old paper using a software simulator. It doesn't mean much for/against a real product.
Sure, but the point was that Intel is well aware that "SIMD efficiency is compromised in the presence of irregular data access patterns", and is researching gather/scatter implementations. Given that AVX and FMA make the bottleneck worse, I'd be surprised if they haven't started giving it even more attention. The experience with Larrabee probably also helps bring it to a real product sooner rather than later.
 
So where is the missing performance factor of 4 to 8 relative to low-end GPUs/IGPs supposed to come from? Not from gather/scatter additions to the ISA, in my opinion. They make things simpler and also a bit faster, but not that much.
Let's take the example of a piecewise parabolic approximation of a function. This essentially requires two FMA operations, and three table lookups. Using SSE, this takes 28 instructions for 4 values. With AVX and support for FMA and gather/scatter, it would take 5 instructions for 8 values. So it could potentially be up to 11 times faster.

Of course it's not all that simple in practice, but I hope this illustrates that the potential goes far beyond a few tens of percent. Even a simple serial implementation of gather/scatter would be a "game changer" when combined with AVX and FMA.
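
To make the instruction count concrete, here is a rough sketch of that 8-wide kernel using AVX2/FMA intrinsics (which of course postdate this discussion; the coefficient tables a/b/c and the index vector are made up purely for illustration):

[code]
#include <immintrin.h>

/* Piecewise parabolic approximation: f(x) ~= a[i] + x*(b[i] + x*c[i]),
 * where i indexes the table segment for each lane.  The tables a/b/c and
 * the index computation are assumptions; only the instruction pattern
 * matters: three gathers plus two FMAs per 8 values.
 */
static inline __m256 approx8(__m256 x, __m256i idx,
                             const float *a, const float *b, const float *c)
{
    __m256 va = _mm256_i32gather_ps(a, idx, 4);   /* gather a[i] for 8 lanes  */
    __m256 vb = _mm256_i32gather_ps(b, idx, 4);   /* gather b[i]              */
    __m256 vc = _mm256_i32gather_ps(c, idx, 4);   /* gather c[i]              */
    __m256 t  = _mm256_fmadd_ps(vc, x, vb);       /* c[i]*x + b[i]            */
    return      _mm256_fmadd_ps(t,  x, va);       /* (c[i]*x + b[i])*x + a[i] */
}
[/code]

That's 5 instructions per 8 values versus 28 per 4 with SSE, i.e. roughly 0.6 versus 7 instructions per value, which is where the ~11x figure comes from.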
By the way, texture units are great things, especially the accompanying specialized caches. Nvidia didn't remove the separate L1 texture caches and integrate them into the general-purpose L1/local memory. And even Larrabee had texture units/cache. If I had to guess the reason ...
Like I said before, Larrabee tried to compete against high-end GPUs. Texture units were a necessity, even though they're of little or no use for anything other than graphics. As we all witnessed, the failure as a GPU meant a commercial disaster.

The situation is very different for the CPU. There's lots of available die space to deliver 'adequate' graphics. And people need a CPU anyway, so competitive graphics performance alone does not determine commercial success. There are clearly different trade-offs to be made, and personally I think it makes sense to keep it low risk and simply add some form of gather/scatter support.

Adding texture lookup instructions to the ISA seems like a bad idea to me anyway since it's very specialized and graphics is still evolving. At some point I'm expecting GPUs to also feature programmable texture filtering, and things like address calculations and perspective projection have already moved to the shader cores on some architectures. With a software renderer the majority of time is currently spent gathering texels, and besides, even with dedicated texture units there would still be a great need for gather/scatter in the rest of the graphics pipeline, not to mention for other applications.
 
Regarding gather/scatter, this paper, detailing Tarantula, a vector extension to the Alpha architecture, is a very good read.

Highlights:
8192-bit vector registers, holding 128 doubles.
Vbox loads and stores access the L2 directly (bypassing the L1).
The L2 is pseudo multi-ported, with 16 ways supporting up to 16 accesses in parallel.
Addresses from gathering loads/scattering stores are grouped to minimize way/bank conflicts (a rough sketch of this grouping follows below).
Aggregate bandwidth of 512 bytes/cycle (256 read, 256 write) for stride-1 accesses.

All this in 2002.
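
Just to illustrate the address-grouping idea, here is a toy software model, not how the paper describes the hardware; the 16 parallel accesses and 128-element vectors are from the highlights above, while the address-to-bank mapping and line size are assumptions:

[code]
#include <stdint.h>
#include <stddef.h>

#define NBANKS     16    /* 16 parallel L2 accesses per cycle, as above       */
#define LINE_SHIFT 6     /* assumed: 64-byte lines select the bank            */
#define VLEN       128   /* 128 elements per vector register                  */

/* Toy model: sort the element addresses of one gather into per-bank buckets.
 * Each cycle the L2 services at most one access per bank, so the gather takes
 * as many cycles as the fullest bucket (8 in the conflict-free case, up to
 * VLEN in the worst case).
 */
static size_t group_by_bank(const uint64_t addr[VLEN],
                            uint64_t banks[NBANKS][VLEN],
                            size_t count[NBANKS])
{
    size_t worst = 0;
    for (int b = 0; b < NBANKS; b++)
        count[b] = 0;
    for (int i = 0; i < VLEN; i++) {
        unsigned b = (unsigned)(addr[i] >> LINE_SHIFT) & (NBANKS - 1);
        banks[b][count[b]++] = addr[i];
        if (count[b] > worst)
            worst = count[b];
    }
    return worst;   /* cycles needed to issue the whole gather */
}
[/code]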

Cheers
 
It was put on paper in 2002, at least. It would have gone on a 65nm process, which was not due for some time.

It would have been interesting to see if Tarantula would have met some of its projections, two process nodes past where EV8 would have been. For one, Intel at 90nm and AMD at 65nm both had significant problems with high-performance chips in the meantime. EV8 was a power monster at 130nm, and it was not confirmed that EV8 would hit all its targets within 250W, let alone scale perfectly across two problematic nodes.

Looking at the specs, we see numbers for some things, like the registers, interconnect, and L2, that are several times higher per core than what entire chips get all the way down to 22nm.
 
One important consideration in an OoO implementation is that the scatter/gather should not be blocking.
The penalty if it remains blocking may be enough to make the OoO engine counterproductive in scatter/gather-heavy code. The disambiguation hardware would be negated, and the core could not generate MLP from other instructions in the stream while the scatter/gather blocks issue.

In that case, scatter/gather without careful (and possibly expensive) re-layout of data structures would just defeat OoO in applications that were written without the forethought of vectorization.

For LRB1, it wouldn't matter if the scatter/gather was blocking, as it had lots of threads to hide latency with and it was in-order anyway.

The cost of non-blocking scatter/gather is interesting to look at. What additional complications might arise while implementing it vis-à-vis in-order scatter/gather? I guess non-blocking gather with blocking scatter would be easier to build.

It is interesting that even ARM's T604 GPU IP doesn't implement scatter/gather; they said that making these units is hard (or something like that).
 
Blocking could negate the benefits of having gather. If no operations can reorder around it, a single gather element that drops to the L2 would force a whole memory-pipeline stall of ten or so cycles in the case of a fast L2 like Sandy Bridge's, and 20 or more for an L2 like Bulldozer's, which matters given the importance of the L/S pipes. If an optimized loop shaves off 10 instructions using gather and thus buys 2-5 cycles, those tens of stall cycles would add the penalty right back and then some.
I think it shouldn't be that restrictive for an OoO core.


A speculative memory pipeline capable of reordering loads around stores would need to behave differently with a coalescing scatter/gather.

If N addresses are coalesced into a single cache-line scatter, each of the N addresses must be broken out and individually tracked by the alias predictor, and the store queue must be populated on a per-address basis to allow proper serialization.
A coalesced gather needs per-address checks against the store queue and aliasing predictor. Forwarding may be applied here as well.

The worst- and best-case scenarios for the memory pipeline and scatter/gather are mostly opposites.
For the scatter/gather unit, an operation that coalesces into a single cycle and a single cache access is best.
That is the worst case for the speculation and forwarding hardware, which normally speculates on at most 2 addresses. This can wind up kneecapping scatter/gather at far below its peak, or forcing a multiplication of the memory pipeline's speculation resources far beyond what the scalar case needs.

Perhaps it can be changed so that the pipeline is pessimistic when dealing with coalesced ops. The cache line's base address would be flagged as belonging to a scatter/gather, and anything in that range would register as a conflict regardless of the actual address. This would add a small additional check per prediction and some amount of storage per predictor entry. There could be a separate scatter/gather alias predictor, but that's adding a second check to an already tightly timed memory pipeline.
The simplest alternative besides not speculating is to globally shut down memory speculation while a scatter/gather is in flight, but that would compromise the benefit of the operation significantly.
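
To make the pessimistic option concrete, here is a toy sketch, not any real core's logic; the store-queue layout and line size are made up. It checks a coalesced gather against the store queue at cache-line granularity:

[code]
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define LINE_MASK (~(uint64_t)63)    /* assumed 64-byte cache lines */

struct sq_entry {                    /* heavily simplified store-queue entry  */
    uint64_t addr;
    bool     is_scatter;             /* entry belongs to an in-flight scatter */
};

/* Pessimistic disambiguation for a coalesced gather: any older store to the
 * same cache line, or any in-flight scatter at all, counts as a conflict,
 * regardless of the exact element addresses.  Real hardware would track
 * per-element addresses or consult an alias predictor instead.
 */
static bool gather_conflicts(uint64_t gather_line,
                             const struct sq_entry *sq, size_t entries)
{
    for (size_t i = 0; i < entries; i++) {
        if (sq[i].is_scatter)
            return true;                                       /* pessimistic     */
        if ((sq[i].addr & LINE_MASK) == (gather_line & LINE_MASK))
            return true;                                       /* same-line store */
    }
    return false;
}
[/code]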
 
A scatter/gather implemented either with serialization or a banked L2 has far too high a latency for speculative execution of it to ever make sense, IMO. Just use vertical multithreading.
 
A speculative memory pipeline capable of reordering loads around stores would need to behave differently with a coalescing scatter/gather.

As you point out, gather/scatter doesn't gel well with modern speculative load/store units. The advantage of gather/scatter is that you generate a fuckton of parallel memory requests. If you were to dump these into your normal speculating scalar load/store pipe, the structures used to track requests would be large and therefore slow.

The solution, IMO, is to use a weak memory ordering model for gather/scatter. Alpha used a weak memory ordering model. For x86 I'd mark pages in the page table to use a weak memory ordering model (I'd make 'vector' pages bigger too). That way the vector loads and stores could bypass the normal scalar load/store unit. As long as loads and stores can't alias, gathers and scatters can proceed out of order. If your compiler can't prove that loads and stores won't alias, it would have to insert memory fences (effectively serializing accesses).
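
A hand-waved sketch of what that contract might look like from the compiler's side (the mfence below is only a stand-in for whatever fence such a vector memory model would actually define):

[code]
#include <emmintrin.h>   /* _mm_mfence(), standing in for a "vector fence" */

/* With restrict the compiler may assume out[], in[] and idx[] never alias,
 * so under a weak vector memory model the gathers and scatters for this
 * loop could be issued in any order, or all at once.
 */
void scale_indexed(float *restrict out, const float *restrict in,
                   const int *restrict idx, int n)
{
    for (int i = 0; i < n; i++)
        out[idx[i]] = 2.0f * in[idx[i]];
}

/* Without that guarantee the compiler has to serialize: conceptually a fence
 * between every gather and the scatter that may depend on it.  The x86 mfence
 * here is only a caricature of such a fence.
 */
void scale_indexed_may_alias(float *out, const float *in, const int *idx, int n)
{
    for (int i = 0; i < n; i++) {
        float v = 2.0f * in[idx[i]];
        _mm_mfence();
        out[idx[i]] = v;
    }
}
[/code]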

Cheers
 
As you point out, gather/scatter doesn't gel well with modern speculative load/store units. The advantage of gather/scatter is that you generate a fuckton of parallel memory requests. If you were to dump these into your normal speculating scalar load/store pipe, the structures used to track requests would be large and therefore slow.

The solution, IMO, is to use a weak memory ordering model for gather/scatter. Alpha used a weak memory ordering model. For x86 I'd mark pages in the page table to use a weak memory ordering model (I'd make 'vector' pages bigger too). That way the vector loads and stores could bypass the normal scalar load/store unit. As long as loads and stores can't alias, gathers and scatters can proceed out of order. If your compiler can't prove that loads and stores won't alias, it would have to insert memory fences (effectively serializing accesses).

Cheers

With a weak memory model, you can basically forget trivially autovectorizing stuff. :???:
 
With a weak memory model, you can basically forget trivially autovectorizing stuff. :???:

Huh? Do you want a powerful vector ISA with gather/scatter to run C spaghetti?

Why would you do that?

And 'trivially autovectorizing', isn't that an oxymoron? :)

Cheers
 
First, thanks for the answers that made me a bit less dumb :)

I may dare some other questions, though. Disclaimer: these questions assume that it can be relevant to use a CPU core as the basis of a design intended for high throughput and data-parallel workloads (which doesn't look like an idea a lot of people agree with).
Do I get the arguments right?

Case 1: something CPU-based has no chance of being competitive, no matter when.

Case 2: given time, there's a chance.
* Case 2.1: it's even possible within a high single-thread-performance CPU:
** From what I read it could prove really difficult.
** If it happens it would have significant costs in perf per mm²/watt, which could send us back to case 1.
* Case 2.2: simple, narrow CPUs and multithreading are the way to go.
** Still costly and non-trivial.
** As "standard" CPU perf would crumble, it lowers the interest of using a CPU core in the first place.
** Mapping wider vectors to narrower SIMD units could help a lot (hiding latencies for ops such as scatter/gather).
** Doing scatter/gather from the L2 would avoid paying an extra price on the L1 (which would otherwise be more complex, slower, more power hungry, etc.). Using the previous point plus SMT, there are ways to hide the extra latencies.
** Doing texturing work is out of the question.

Here it comes, the one-thousand-billion-dollar question: are bankers indeed banksters?
Do I get the different opinions right, given that I don't even try to read the various links, which are way too advanced for me?

A real question: is texturing really impossible (I mean with acceptable performance)? Is there nothing that could be added to an already specialized CPU core, like a Larrabee one, that could help?
Could a tiny read-only cache, akin to those found in GPUs, help?
Could the scalar pipeline do the calculations?
Or is it a lost cause no matter what you add to the design (tex units aside... :LOL: )?
 
Huh? Do you want a powerful vector ISA with gather/scatter to run C spaghetti?

Why would you do that?

And 'trivially autovectorizing', isn't that an oxymoron? :)

Cheers

Because apparently there are lots of legacy apps out there that could use scatter/gather but wouldn't fit on GPUs. :)
 
Three layers

Thanks for a great thread!

My understanding is that at one end you have CPUs with a large amount of logic to extract as much parallel execution as possible out of essentially serial code (memory prefetch, branch prediction, out-of-order execution, speculative execution and so on), and at the other end GPUs designed to stream data in and out, performing a huge number of similar operations in parallel with very simple SIMD cores.

It always seemed to me that Larrabee is somewhere in between.
Would there be problems at which Larrabee would excel, ones that fall between the characteristics of CPUs and GPUs?

Would a chip that included all three layers be the best general solution?
I.e. say 4 modules, each with 1 CPU core, 4 Larrabee-like cores (each with a 512-bit vector processing unit), and 16 SIMD cores, with suitable layers of memory and interconnect at each level.
 
A real question: is texturing really impossible (I mean with acceptable performance)? Is there nothing that could be added to an already specialized CPU core, like a Larrabee one, that could help?
Larrabee has dedicated texture samplers. Its trouble becoming a viable high-end GPU likely had nothing to do with texturing.

For the CPU, texturing at "acceptable" performance is possible. There's about a 4x performance gap between a quad-core Sandy Bridge CPU and its IGP, with the CPU only making use of 128-bit SSE. Replacing the IGP with more CPU cores, using 256-bit AVX operations, using fused multiply-add (FMA) instructions, and using gather/scatter would massively improve effective performance. And by using 1024-bit registers the power consumption could be reduced to acceptable levels.

Since such a solution lacks dedicated texture samplers and is still based on an out-of-order architecture, it would be somewhat less efficient than Larrabee. However, unlike Larrabee it's not up against dedicated high-end GPUs from the competition, but up against tiny IGPs. A homogeneous 'unified' architecture has the advantage that a lot more chip area is available, and it is valuable for all kinds of markets.
 
Would a chip that included all three layers be the best general solution?
I.e. say 4 modules, each with 1 CPU core, 4 Larrabee-like cores (each with a 512-bit vector processing unit), and 16 SIMD cores, with suitable layers of memory and interconnect at each level.
No. The problem is that bandwidth is getting harder to come by than computing density. Moving data back and forth between three types of processors creates a lot of overhead. Applications have workloads that become ever more diverse, so in time it's best to have one type of processor which can handle a variety of workloads, with minimal data movement.

Opinions vary, but I believe such an architecture is within reach. The latest CPUs have a pretty good computing density (200 GFLOPS, and FMA could double that), but they lack efficiency at accessing irregularly stored data, and they are not power efficient due to the complex out-of-order execution architecture. The first issue can be fixed with gather/scatter support, while the second could be fixed by reducing the instruction rate by a factor of four, executing 1024-bit operations in four cycles on 256-bit execution units.
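
Purely to illustrate the idea of that second fix (in reality this would be a hardware sequencer, not software; the type and helper below are hypothetical), a 1024-bit operation can be thought of as four 256-bit slices issued back to back on the same execution unit:

[code]
#include <immintrin.h>

/* Hypothetical 1024-bit "register": four 256-bit halves.  A core could accept
 * one such instruction and sequence it over four cycles on a 256-bit FMA unit,
 * cutting the decode/issue rate by 4x for the same amount of work.
 */
typedef struct { __m256 part[4]; } v1024;

static inline v1024 v1024_fmadd(v1024 a, v1024 b, v1024 c)
{
    v1024 r;
    for (int i = 0; i < 4; i++)           /* one 256-bit slice per "cycle" */
        r.part[i] = _mm256_fmadd_ps(a.part[i], b.part[i], c.part[i]);
    return r;
}
[/code]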

Implementing these things is certainly not without challenges, but in perspective they don't seem insurmountable. There are plenty of approaches to implementing gather/scatter, and they can converge toward the most balanced solution over multiple generations.
 
Larrabee has dedicated texture samplers. Its trouble becoming a viable high-end GPU likely had nothing to do with texturing.

For the CPU, texturing at "acceptable" performance is possible. There's about a 4x performance gap between a quad-core Sandy Bridge CPU and its IGP, with the CPU only making use of 128-bit SSE. Replacing the IGP with more CPU cores, using 256-bit AVX operations, using fused multiply-add (FMA) instructions, and using gather/scatter would massively improve effective performance. And by using 1024-bit registers the power consumption could be reduced to acceptable levels.

Since such a solution lacks dedicated texture samplers and is still based on an out-of-order architecture, it would be somewhat less efficient than Larrabee. However, unlike Larrabee it's not up against dedicated high-end GPUs from the competition, but up against tiny IGPs. A homogeneous 'unified' architecture has the advantage that a lot more chip area is available, and it is valuable for all kinds of markets.

Sigh....:rolleyes:
 