SSE4, future processors and GPGPU thoughts

I repeated that numerous times -- main use would be for interpolation.
Sorry, but that's still somewhat vague. One-dimensional, two-dimensional, three-dimensional interpolation? Linear, quadratic, polynomial, spline interpolation? Lots of applications actually have adequate solutions for this that don't require a gather instruction.
I mean adjacent like "one right next to the other" or sometimes even overlapping (reading the same value into two registers)...
Vector load and shuffle goes a long way for values right next to each other or overlapping.
...and in the worst case a different cache-line access would be needed, which would be prefetched by the hardware prefetcher anyway as soon as the previous one is accessed. In any case, such a gather would rarely cross a cache line, much less a page boundary. So forget cache and TLB misses.
You're right that the correlation between accesses makes it unlikely to find one value in L1, one in L2, one in RAM and another in a swapped-out page. But it can occur, so we definitely need extra addressing logic to do things in parallel if we want to gain anything. At the very least we need to check for L1 cache misses. Anyway, since fast multi-ported caches do exist, I believe it's practical. I'm just saying it's not trivial and we can't just "forget" about cache or TLB misses.
Not only that, but in cases where you have two equal pointers you could load the value once and copy the register contents, which would reduce bandwidth use.
Checking whether the pointers are equal is likely no better than just reading the same value twice. ;)
At least my 3D reconstruction code would surely have great use from it.
Could you make a rough estimate of how much faster it would be? Remember that any extra transistor spent on this means one transistor less for another potentially useful feature, that might even be easier to design...
 
Also, the scheduling apparatus would have to wait for all four loads to complete before doing any work on them, since CPU state is maintained at instruction boundaries (i.e. you can't have a partially updated register visible to the rest of the CPU). So you also get worse instruction latency out of it (and a lot worse if just one of the four loads misses the cache).
That's not really true. The alternative is loading each element separately, so you also can't start using the vector before all the loads have completed and their results are stored in the register.
 
Sorry, but that's still somewhat vague

2D linear interpolation, for example.

Lots of applications actually have adequate solutions for this that don't require a gather instruction.

Well, name a few where you have linearly ordered loads.

Vector load and shuffle goes a long way for values right next to each other or overlapping.

Yes, but you can't do that, because you don't know the value positions in advance. The pointers are calculated at run time, and shuffle only accepts a hard-coded immediate operand.

Furthermore, if you load two values from one row and the other two from the next row as two vectors, you halve your effective bandwidth by transferring 32 bytes instead of 16.

Checking whether the pointers are equal is likely no better than just reading the same value twice. ;)

That would be wrong. Even the L1 cache has higher latency than the register file or internal buffers, not to mention the needless bus utilization.

Could you make a rough estimate of how much faster it would be? Remember that any extra transistor spent on this means one transistor less for another potentially useful feature, that might even be easier to design...

Unfortunately I can't make an estimate right now because I am busy working on some compression code. But take the code example with single loads and shuffles I posted and just compare that block of code with, say, a movups instruction.
 
Well, name a few where you have linearly ordered loads.
Pretty much every image processing application (e.g. PhotoShop). Lots of (2D) filtering going on there that requires no gather instruction.
Furthermore, if you load two values from one row and the other two from the next row as two vectors, you halve your effective bandwidth by transferring 32 bytes instead of 16.
SSE has instructions to load the lower and upper halves of a register separately. No bandwidth limitation there.
That would be wrong. Even the L1 cache has higher latency than the register file or internal buffers, not to mention the needless bus utilization.
You already have the latency from loading it once; you might as well do the second load completely in parallel with no extra latency. Yes, there's higher bus utilization, but there's nothing else you can do instead, so why compare pointers if the bus is available anyway?
But take the code example with single loads and shuffles I posted and just compare that block of code with, say, a movups instruction.
Sure, that snippet of code could be over two times faster. But I'm talking about the whole application. If gather operations take 10% of total CPU time, which is already quite significant, then we could hope for maybe a 5% total application speedup with a dedicated gather instruction. That's always welcome, but we have to question whether 5% for one application is worth it for Intel/AMD to create a whole new architecture.

I'm pulling your chain here, a little. ;) I truly believe such an instruction would be very useful, but I'm just questioning whether the benefits are worth the costs. If it would force the clock frequency to be 5% lower, or make us all pay 5% more, then we should really look for other architectural advancements that benefit more applications and cost less.
 
Pretty much every image processing application (e.g. PhotoShop). Lots of (2D) filtering going on there that requires no gather instruction.

Yes, but not all applications are image processing ones. We need CPUs for other things too.

SSE has instructions to load the lower and upper halves of a register separately. No bandwidth limitation there.

True, but you still have to shuffle them. And what if you have:

Code:
|a|b|.|
|c|.|d|

Or even:

Code:
|a|.|.| <- a loaded twice
|c|.|d|

How do you load that? Note that you don't know the positions (pointer addresses) at compile time.

Yes, there's higher bus utilization, but there's nothing else you can do instead, so why compare pointers if the bus is available anyway?

No, the bus should be busy prefetching the next value instead of wasting cycles that way.

I'm pulling your chain here, a little. ;)

Yeah, I noticed that :)

No problem, we are all here to learn.
 
Yes, but not all applications are image processing ones. We need CPUs for other things too.
Certainly, but I'm just trying to identify precise uses.

I think it might really help automatic vectorization. Other uses would be raytracing and physics processing.
No, the bus should be busy prefetching the next value instead of wasting cycles that way.
The bus between the L1 cache and the register file certainly doesn't prefetch anything. And automatic prefetch typically only loads into the L2 cache, so the bus to the L1 cache is also available. That bus only transfers a whole cache line at a time anyway. So even if you gather the same 32-bit value four times, if it's in the L2 cache then first a 64-byte line is transferred to the L1 cache, and then the value is read from L1 into the register.

So I really think that comparing pointers doesn't gain you anything. The timing on this path is critical enough as it is and you really don't want to add comparators and multiplexers.
 
I think it might really help automatic vectorization. Other uses would be raytracing and physics processing.

I agree completely.

The bus between L1 cache and the register file certainly doesn't prefetch anything.

My bad. Language barrier.

So even if you gather the same 32-bit value four times, if it's in the L2 cache then first a 64-byte line is transferred to the L1 cache, and then the value is read from L1 into the register.

It is read from L1, but L1 does have its latency. Better to copy the register, IMO, instead of reading again, as long as the pointer compare doesn't have higher latency than the L1 cache.
 
It is read from L1, but L1 does have its latency. Better to copy the register, IMO, instead of reading again, as long as the pointer compare doesn't have higher latency than the L1 cache.

Hmm, either you:
1. compare the pointers for equality and fire off the L1 request in parallel, or
2. compare the pointers and, only if they're unequal, fire off the L1 request.

Nothing is gained in (1) performance-wise: you still have to start 4 loads and gain only a little in the event of an early out.

In (2) you have to wait for the pointer comparison before starting the L1 request. This adds latency on a critical path and will almost certainly be a big net loss.

Can't see this as a win.

Cheers
 
It is read from L1, but L1 does have its latency. Better to copy the register, IMO, instead of reading again, as long as the pointer compare doesn't have higher latency than the L1 cache.
Yes, it has its latency, but it's fully pipelined. Say reading one 32-bit value takes three clock cycles; then it takes only four clock cycles to complete the reading of two values, if we start the second read right after the first one. That's serial reading.

I did some searching and it looks like both the Core 2 Duo and the Athlon 64 have L1 data caches with two 8-byte read ports. This means that two read operations can be performed in three clock cycles, if we assume the same latency. That's parallel reading.

Keeping those caches with two read ports, the 4-element gather instruction might be implementable in just four clock cycles, by combining parallel and serial reading.

It would use 32 bytes of bus bandwidth instead of the 16 bytes of data actually used, but that's really no issue. Even if you read just 1 byte, it occupies one of the two 8-byte buses. So in the worst case some other read operation is delayed by one clock cycle, but that won't happen too often, and out-of-order execution likely compensates for it. And if the gather instruction is on the critical path, then that one clock cycle is masked by the one clock cycle it takes the gather instruction to do the second pair of reads.

Either way the gather instruction is quite powerful if it can be implemented like that. The alternatives (four 4 byte cache ports, comparing pointers) would likely have serious downsides.

The only remaining problem I see is that it would be hard to make the reads atomic. But I'm tempted to say that this is the software's problem. You really shouldn't be reading and writing concurrently in the same memory buffer when using a gather instruction.
 
Perhaps comparing pointers is not such a big win but I am starting to like this discussion. Hopefully someone from Intel reads it too.

Anyway, if, as you say, it takes 3 clock cycles to bring data from L1, and the optimization manual shows that a register compare (CMP, TEST) takes only 1 clock cycle with a new compare able to start every 0.5 cycles, then you could perhaps still gain something, at least in terms of saved bandwidth.

The only remaining problem I see is that it would be hard to make the reads atomic. But I'm tempted to say that this is the software's problem. You really shouldn't be reading and writing concurrently in the same memory buffer when using a gather instruction.

When I said atomic before, I only meant that no other reads come in between. Surely there is no need to write to the same memory, since gather is a read-only operation anyway; otherwise it would be called scatter.
 
Anyway, if, as you say, it takes 3 clock cycles to bring data from L1, and the optimization manual shows that a register compare (CMP, TEST) takes only 1 clock cycle with a new compare able to start every 0.5 cycles, then you could perhaps still gain something, at least in terms of saved bandwidth.

Which architecture are you counting on for that compare throughput? It's risky to rely on that for an implementation.

It would make things take about the same or even longer in the best case (all pointers equal), assuming the 0.5-cycle compare throughput.
4 pointers means 6 possible pairings, so 6 compares, or 3 cycles.
If the first load goes through to the cache, the op takes 6 cycles for just that load and the branches.
The first load is always required, so that's a given.
If the remaining loads are fired off sequentially, they take three additional cycles to complete, less if the cache can handle two loads per cycle.

It might save bandwidth, but that won't particularly matter unless the processor has SMT. Unfortunately, this instruction as you've described it is atomic, so the safest way to keep it so would be to block the other thread from issuing any instructions, and then the bandwidth would be reserved for pretty much the same amount of time.

When I said atomic before, I only meant that no other reads come in between. Surely there is no need to write to the same memory, since gather is a read-only operation anyway; otherwise it would be called scatter.
Atomic in its strictest sense means nothing interrupts the instruction. If another thread issues a write, it may go to a gather location, which it shouldn't until after the last load has completed. Atomicity means nothing can tell that the instruction was implemented sequentially. To guarantee this, some additional ops might be needed to block other threads.
 