22 nm Larrabee

What happens when it causes 8 TLB misses, costing 10^3-10^4 cycles for one instruction?
Exactly the same thing as 8 TLB misses when you emulate gather with extract / insert. Nothing to worry about.
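For reference, a minimal sketch (not from the thread) of what such an emulated gather looks like with extract/insert-style scalar loads, assuming AVX and a float gather; the function name and approach are illustrative:

```c
#include <immintrin.h>

/* Emulate an 8-lane gather with independent scalar loads. Each lane
 * is its own load, so each lane can take its own TLB miss -- exactly
 * the failure mode being discussed. */
static __m256 gather_emulated(const float *base, __m256i idx)
{
    int   i[8];
    float r[8];
    _mm256_storeu_si256((__m256i *)i, idx);   /* spill the indices */
    for (int lane = 0; lane < 8; ++lane)
        r[lane] = base[i[lane]];              /* independent scalar load */
    return _mm256_loadu_ps(r);
}
```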
And what happens to the coherency protocol while a core is stalled on a single memory uop?
Gather is a set of load operations. They can execute in any order and some may execute simultaneously. There is no change to the coherency protocol since you're only queuing up one cache line access per cycle per port.
 
a) More RAM.

b) Closer to IO.

c) Ability to gang together multiple chips in a cache coherent manner.

d) Close integration with CPU.

I'm not sure any of those compensate for an order of magnitude in bandwidth for the cases where GPUs actually have an advantage.
 
Exactly the same thing as 8 TLB misses when you emulate gather with extract / insert. Nothing to worry about.
Hardly the same thing. Emulated gather is multiple instructions, and hence can be interrupted between them. A single x86 instruction is uninterruptible.

Gather is a set of load operations. They can execute in any order and some may execute simultaneously. There is no change to the coherency protocol since you're only queuing up one cache line access per cycle per port.
What about snoops from other cores during a stall? How should coherency snoops during a single very long uop be dealt with?
 
First, there is so much that a compiler cannot know that it is basically impossible to do what you are suggesting.
The extra knowledge OoO has, afaics, is whether a load/store will hit the L1/L2 or not. Since the L1 is so tiny on GPUs, assuming that memory ops will take a long time seems fine.

Is there anything else?
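To illustrate the point, here is a hedged sketch (not from the thread) of what a compiler can do once it simply assumes every memory op is slow: software-pipeline the loop so each load is issued one iteration before its value is consumed. The function and names are made up for illustration.

```c
/* Static scheduling under the "assume every load is slow" rule:
 * hoist the next iteration's load far ahead of its use so the
 * latency is hidden without any OoO hardware. */
float sum_squares(const float *a, int n)
{
    if (n <= 0)
        return 0.0f;
    float sum = 0.0f;
    float cur = a[0];                 /* prologue: first load issued early */
    for (int i = 0; i + 1 < n; ++i) {
        float next = a[i + 1];        /* issue next load well ahead of use */
        sum += cur * cur;             /* ALU work overlaps the in-flight load */
        cur = next;
    }
    return sum + cur * cur;           /* epilogue: last element */
}
```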
 
I'm not sure any of those compensate for an order of magnitude in bandwidth for the cases where GPUs actually have an advantage.

Current CPUs can do ~8 flops/byte (loaded or stored). Any application with a higher arithmetic intensity should benefit. I think SW rendering would be one of them. :) Less power and less cost are always welcome, of course.

But yes, APUs will need some kind of on package DRAM to truly shine.
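As a rough sanity check on that figure (illustrative numbers for a contemporary quad-core, not from the post):

(4 cores x 3 GHz x 16 flops/cycle) / (~25 GB/s DRAM) ~= 7.7 flops/byte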
 
The extra knowledge OoO has, afaics, is whether a load/store will hit the L1/L2 or not. Since the L1 is so tiny on GPUs, assuming that memory ops will take a long time seems fine.

Is there anything else?

Yes. Address disambiguation is one of the major ones. There are plenty of others as well. Decades of research have gone into this area, and there are a lot of things that a compiler just cannot schedule for reliably.
 
c) If emulating them is not a problem, then why is SwiftShader stuck at DX9? Why can't/doesn't it do DX11?

DX11 software emulation can be done, as WARP shows, though it only goes up to DX11 feature level 10.1.
Microsoft even has a use for it:
  • Casual Games
  • Existing Non-Gaming Applications
  • Advanced Rendering Games
  • Other Applications
 
Yes. Address disambiguation is one of the major ones. There are plenty of others as well. Decades of research have gone into this area, and there are a lot of things that a compiler just cannot schedule for reliably.

Address disambiguation is only needed for an OoO memory pipeline. GCN shows us that an in-order memory pipe allows getting rid of the scoreboard, so it is not necessarily a net loss.

I guess I should say that, in the context of GPUs, static scheduling with 6-8 threads/core can approach the effectiveness of CPU-style OoO in terms of aggregate throughput.
 
Address disambiguation is only needed for an OoO memory pipeline. GCN shows us that an in-order memory pipe allows getting rid of the scoreboard, so it is not necessarily a net loss.

Actually, memory disambiguation applies to both in-order and out-of-order memory pipelines.


I guess I should say that, in the context of GPUs, static scheduling with 6-8 threads/core can approach the effectiveness of CPU-style OoO in terms of aggregate throughput.

AKA you can probably run enough things slow enough that it doesn't matter. Which is fine as long as you have enough things.
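A rough model of why 6-8 threads can suffice (symbols and numbers are illustrative, not from the thread): if each thread has W cycles of independent work between memory operations of latency L, round-robin issue hides the latency once

N_threads >= 1 + L/W

so, say, L ~= 40 cycles and W ~= 6 cycles already lands near the 8 threads/core mentioned above.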
 
Since this instruction can throw exceptions halfway through, I would expect it to crack into multiple uops, if not 3 uops per element. Cracking into a single uop is highly unlikely, at least in the first iteration.
It just has to update the mask operand. It can restart the same instruction (with the new mask register). No need for more than one uop.

I find it quite ironic that you keep looking for excuses why it might be slow. You think they're intentionally going to make it slow after recognizing in their research papers that gather support was sorely lacking?
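For concreteness, the restart semantics being described look roughly like this (a sketch in C-style pseudocode; `load`, `base`, `idx`, `dst`, and `mask` are placeholders, not a real API):

```c
/* Per-element semantics of a maskable, restartable gather, as the
 * post describes (AVX2's VPGATHER* documents similar behavior).
 * A fault midway leaves completed lanes written and their mask bits
 * cleared, so re-executing the same instruction resumes exactly
 * where it stopped. */
for (int lane = 0; lane < 8; ++lane) {
    if (mask[lane]) {
        dst[lane]  = load(base + idx[lane]); /* may fault on this lane */
        mask[lane] = 0;                      /* lane done; skipped on restart */
    }
}
```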
 
It just has to update the mask operand. It can restart the same instruction (with the new mask register). No need for more than one uop.

I find it quite ironic that you keep looking for excuses why it might be slow. You think they're intentionally going to make it slow after recognizing in their research papers that gather support was sorely lacking?

If Intel can implement a single-uop gather, then great. It's just that there are real issues that need to be tackled. Making the execution of uops interruptible seems to add some complexity that is, afaik, not present at the moment.
 