22 nm Larrabee

Nick · Jul 7, 2011

rpg.314 said:
What happens when it causes 8 TLB misses, costing 10^3-10^4 cycles for one instruction?

Exactly the same thing as 8 TLB misses when you emulate gather with extract / insert. Nothing to worry about.

And what happens to the coherency protocol while a core is stalled on a single memory uop?

Gather is a set of load operations. They can execute in any order and some may execute simultaneously. There is no change to the coherency protocol since you're only queuing up one cache line access per cycle per port.
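For readers following along, the extract/insert emulation mentioned above can be sketched roughly as below. This is only an illustration in portable C++: `Vec8f`, `Vec8i`, and `gather_emulated` are hypothetical names, the 8-lane width is assumed from the "8 TLB misses" figure, and a real implementation would use SIMD extract/insert instructions rather than a scalar loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 8-lane vector types, standing in for a 256-bit register.
using Vec8f = std::array<float, 8>;
using Vec8i = std::array<std::int32_t, 8>;

// Emulated gather: one independent scalar load per lane, analogous to an
// extract-index / load / insert-element sequence for each element.
// Every load may land on a different page, so up to 8 TLB misses are
// possible here, the same as for a hardware gather over the same indices.
inline Vec8f gather_emulated(const float* base, const Vec8i& idx, std::size_t n) {
    Vec8f out{};
    for (std::size_t lane = 0; lane < out.size(); ++lane) {
        // Bounds check is for the sketch only; hardware would fault instead.
        assert(idx[lane] >= 0 && static_cast<std::size_t>(idx[lane]) < n);
        out[lane] = base[idx[lane]];
    }
    return out;
}
```

The loads have no dependencies on each other, which is why they can be issued in any order.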

aaronspink · Jul 7, 2011

rpg.314 said:
a) More RAM.

b) Closer to IO

c) Ability to gang together multiple chips in a cache coherent manner.

d) Close integration with CPU.

I'm not sure any of those compensate for an order of magnitude in bandwidth for the cases where GPUs actually have an advantage.

rpg.314 · Jul 7, 2011

Nick said:
Exactly the same thing as 8 TLB misses when you emulate gather with extract / insert. Nothing to worry about.

Hardly the same thing. Emulated gather is multiple instructions, and hence can be interrupted. A single x86 instruction is uninterruptible.

Gather is a set of load operations. They can execute in any order and some may execute simultaneously. There is no change to the coherency protocol since you're only queuing up one cache line access per cycle per port.

What about snoops from other cores during a stall? How should coherency snoops during a single very long uop be dealt with?

rpg.314 · Jul 7, 2011

aaronspink said:
First, there is so much that a compiler cannot know that it is basically impossible to do what you are suggesting.

The extra knowledge OoO has, afaics, is whether or not a load/store will hit L1/L2 or not. Since the L1 is so tiny for GPUs, assuming that memory ops will take a long time seems fine.

Is there anything else?

Nick · Jul 7, 2011

rpg.314 said:
Hardly the same thing. Emulated gather is multiple instructions, and hence can be interrupted. A single x86 instruction is uninterruptible.

RTFM

rpg.314 · Jul 7, 2011

aaronspink said:
I'm not sure any of those compensate for an order of magnitude in bandwidth for the cases where GPUs actually have an advantage.

Current CPUs can do ~8 flops/byte (loaded or stored). Any application with a higher arithmetic intensity should benefit. I think SW rendering would be one of them.

Less power and less cost are always welcome, of course.

But yes, APUs will need some kind of on package DRAM to truly shine.
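The flops/byte argument above can be made concrete with a simple roofline-style estimate. This is a sketch, not a measurement: the roofline model is brought in here for illustration, `Machine`, `machine_balance`, and `attainable_gflops` are hypothetical names, and the example figures in the note below are made up so that the balance point comes out at 8 flops/byte.

```cpp
#include <algorithm>

// Hypothetical machine description; the numbers are supplied by the caller.
struct Machine {
    double peak_gflops;    // peak arithmetic throughput, GFLOP/s
    double bandwidth_gbs;  // sustained memory bandwidth, GB/s
};

// Flops the machine can perform per byte moved before compute, rather
// than bandwidth, becomes the limit.
inline double machine_balance(const Machine& m) {
    return m.peak_gflops / m.bandwidth_gbs;
}

// Roofline estimate: a kernel is capped either by peak compute or by
// (arithmetic intensity * bandwidth), whichever is lower.
inline double attainable_gflops(const Machine& m, double flops_per_byte) {
    return std::min(m.peak_gflops, flops_per_byte * m.bandwidth_gbs);
}
```

With an assumed 200 GFLOP/s and 25 GB/s, the balance is 8 flops/byte: a kernel at 2 flops/byte is bandwidth-bound at 50 GFLOP/s, while one at 16 flops/byte reaches the 200 GFLOP/s peak, so extra bandwidth would not help it.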

aaronspink · Jul 7, 2011

rpg.314 said:
The extra knowledge OoO has, afaics, is whether or not a load/store will hit L1/L2 or not. Since the L1 is so tiny for GPUs, assuming that memory ops will take a long time seems fine.

Is there anything else?

Yes. Address disambiguation is one of the major ones. There are plenty of others as well. Decades of research have gone into this area, and there are a lot of things that a compiler just cannot schedule for reliably.
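A small example of why address disambiguation is hard to do statically, as a sketch only. `add_one` is a hypothetical function; the point is that whether its iterations are independent depends on pointer values known only at run time.

```cpp
#include <cstddef>

// Whether the iterations below are independent depends on how dst and src
// overlap, which the compiler generally cannot see when compiling this
// function on its own.
//  - No overlap: every load is independent of every store, so loads can be
//    hoisted and the loop reordered or vectorized freely.
//  - dst == src + 1: each store feeds the next iteration's load, so the
//    loop is a serial dependency chain.
inline void add_one(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] + 1.0f;
    }
}
```

Calling `add_one(a + 1, a, 3)` on a zeroed four-element array yields 0, 1, 2, 3, whereas the non-overlapping case writes 1 to every output. Hardware that tracks addresses at run time can tell the two cases apart per access; a static schedule has to assume the worst or add a run-time check.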

rpg.314 · Jul 7, 2011

Nick said:
RTFM

My x86 assembly is a little rusty. Would you care to tell me which instructions allow, let's say switching processes, in the middle of their execution?

Voxilla · Jul 7, 2011

rpg.314 said:
c) If emulating them is not a problem, then why is Swiftshader stuck at dx9? Why can't/doesn't it do dx11?

DX11 software emulation can be done, as WARP does, though it only supports DX11 at feature level 10.1.
Microsoft even has a use for it:

Casual Games
Existing Non-Gaming Applications
Advanced Rendering Games
Other Applications

rpg.314 · Jul 7, 2011

aaronspink said:
Yes. Address disambiguation is one of the major ones. There are plenty of others as well. Decades of research have gone into this area, and there are a lot of things that a compiler just cannot schedule for reliably.

Address disambiguation is only needed for an OoO memory pipeline. GCN shows us that an in-order memory pipe can allow getting rid of the scoreboard, so it is not necessarily a net loss.

I guess I should say that, in the context of GPUs, static scheduling with 6-8 threads/core can approach the effectiveness of CPU-style OoO in terms of aggregate throughput.

rpg.314 · Jul 7, 2011

Voxilla said:
DX11 software emulation can be done, as WARP does, though it only supports DX11 at feature level 10.1.

I meant competitive performance as well.

Voxilla · Jul 7, 2011

rpg.314 said:
I meant competitive performance as well.

It is competitive with Swiftshader; it's not the reference rasterizer.

Nick · Jul 7, 2011

rpg.314 said:
My x86 assembly is a little rusty. Would you care to tell me which instructions allow, let's say switching processes, in the middle of their execution?

vgather

3dilettante · Jul 7, 2011

Repeated string operations can also be suspended in the middle of execution, which makes me wonder.

aaronspink · Jul 7, 2011

rpg.314 said:
Address disambiguation is only needed for an OoO memory pipeline. GCN shows us that an in-order memory pipe can allow getting rid of the scoreboard, so it is not necessarily a net loss.

Actually, memory disambiguation applies to both in-order and out-of-order memory pipelines.

I guess I should say that, in the context of GPUs, static scheduling with 6-8 threads/core can approach the effectiveness of CPU-style OoO in terms of aggregate throughput.

AKA you can probably run enough things slow enough that it doesn't matter. Which is fine as long as you have enough things.
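The "enough things" point can be put into a toy model. This is a deliberately crude sketch under stated assumptions: each thread alternates a fixed burst of compute with a fixed memory stall, the core switches threads for free, and `core_utilization` is a hypothetical name. None of the numbers in the note below come from real hardware.

```cpp
#include <algorithm>

// Toy latency-hiding model. Each thread runs for `compute_cycles`, then
// stalls for `stall_cycles` on memory. With `threads` such threads
// interleaved on one in-order core, the fraction of cycles doing useful
// work is capped at 1.0 once there is enough other work to cover a stall.
inline double core_utilization(int threads, double compute_cycles,
                               double stall_cycles) {
    const double per_thread = compute_cycles / (compute_cycles + stall_cycles);
    return std::min(1.0, threads * per_thread);
}
```

With an assumed 10 cycles of compute per 50-cycle stall, three threads keep the core 50% busy and six threads saturate it. Each individual thread is no faster; only the aggregate throughput improves, and only if six independent threads actually exist.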

rpg.314 · Jul 8, 2011

3dilettante said:
Repeated string operations can also be suspended in the middle of execution, which makes me wonder.

IIRC, string ops crack to multiple uops at least some of the time.

rpg.314 · Jul 8, 2011

Nick said:
vgather

Since this instruction can throw exceptions halfway, I would expect it to crack to multiple uops, if not 3 uops per element. Cracking to a single uop is highly unlikely, at least in the first iteration.

3dilettante · Jul 8, 2011

rpg.314 said:
IIRC, string ops crack to multiple uops at least some of the time.

This is part of what makes me curious.

Nick · Jul 8, 2011

rpg.314 said:
Since this instruction can throw exceptions halfway, I would expect it to crack to multiple uops, if not 3 uops per element. Cracking to a single uop is highly unlikely, at least in the first iteration.

It just has to update the mask operand. It can restart the same instruction (with the new mask register). No need for more than one uop.

I find it quite ironic that you keep looking for excuses why it might be slow. You think they're intentionally going to make it slow after recognizing in their research papers that gather support was sorely lacking?
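The mask-update restart scheme described above can be modelled in software roughly as follows. This is a behavioural sketch only and says nothing about how many uops real hardware would use: `GatherState`, `gather_step`, and the `faults` callback (standing in for a page fault on a given index) are all hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>

constexpr std::size_t kLanes = 8;

// Architectural state of the modelled instruction: the partially filled
// destination and a mask of lanes that still need to be loaded.
struct GatherState {
    std::array<float, kLanes> dest{};
    std::array<bool, kLanes> mask{};  // true = lane not yet loaded
};

// Runs the gather until it completes or hits a fault.
// Returns true when every lane is done, false when it stopped early.
// Because completed lanes clear their mask bit, simply re-running the same
// call after the fault is handled resumes where it left off.
inline bool gather_step(GatherState& s, const float* base,
                        const std::array<std::int32_t, kLanes>& idx,
                        const std::function<bool(std::int32_t)>& faults) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        if (!s.mask[lane]) continue;          // finished on an earlier attempt
        if (faults(idx[lane])) return false;  // progress is saved in dest + mask
        s.dest[lane] = base[idx[lane]];
        s.mask[lane] = false;                 // record completion of this lane
    }
    return true;
}
```

All progress lives in the destination and mask operands, so no hidden mid-instruction state has to survive the interruption.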

rpg.314 · Jul 8, 2011

Nick said:
It just has to update the mask operand. It can restart the same instruction (with the new mask register). No need for more than one uop.

I find it quite ironic that you keep looking for excuses why it might be slow. You think they're intentionally going to make it slow after recognizing in their research papers that gather support was sorely lacking?

If Intel can implement a single uop gather, then great. It's just that there are real issues that need to be tackled. Making the execution of uops interruptible seems to add some complexity that is, afaik, not present at the moment.
