I wonder how important double precision is. Intel generally seems to be keeping a keen eye on it, yet it makes transcendentals hairier, biasing the implementation away from the pipelined designs we see in GPUs.
Hopefully they stay pipelined, otherwise one thread's transcendental function is going to completely monopolize the one vector unit for quite some time.
LOL, with 16 lanes in a SIMD you could do a funky set of parallel terms (one per lane) for a polynomial to produce one transcendental every few clocks...
Might as well link this as I just ran into it:
http://developer.intel.com/technology/itj/q41999/pdf/transendental.pdf
Interestingly, only atan needs a polynomial of degree greater than 16 for double precision: 22.
Perhaps a microcode instruction could spit out the necessary operations.
Just for funzies, I tried by hand to fold the optimal scheduling of that polynomial evaluation in table 2 across multiple SIMD lanes.
I haven't gone too in depth, but it seems the scheduling could be done with 3 vector FMACs and one potentially scalar FMAC.
The downside is that it would require some hefty permutes between operations, and each successive operation uses fewer and fewer lanes, so utilization plummets: the last FMAC would use only one lane.
So long as each operation is pipelined, other work could be overlaid in the latency periods between each op.
I'd hope the FMAC is pipelined, but would the permutes be?
The latency would be the sum of the FMAC and permute latencies.
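Here's a rough sketch of that lane-folding in plain Python (lists standing in for SIMD lanes, coefficients made up): an Estrin-style tree where each round is one "vector FMA" across half as many live lanes, plus one squaring of the running power. The numbers and structure are my own illustration, not the scheduling from the Intel paper.

```python
def horner(coeffs, x):
    # Reference serial evaluation: one FMA per coefficient.
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def estrin(coeffs, x):
    terms = list(coeffs)        # lane i holds the coefficient of x**i
    xp = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)
        # one "vector FMA": lane k computes terms[2k] + xp * terms[2k+1]
        terms = [terms[2 * k] + xp * terms[2 * k + 1]
                 for k in range(len(terms) // 2)]
        xp *= xp                # one MUL to square the running power
    return terms[0]
```

With 16 coefficients the lane count collapses 16 → 8 → 4 → 2 → 1 over four FMA rounds, which lines up with the 3 vector FMACs plus one scalar FMAC guessed at above, and shows the utilization drop directly.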
The other option is to use one lane per transcendental, evaluating 16 different transcendentals at once, though this would involve burning a vector register for each polynomial term across the 16 elements.
Once a value is no longer needed, its register could be reused, though the register footprint would still be wider.
It does avoid the permute stuff, though.
The latency then is that of 8 FMACs and 3 MULs, though with 16 results per pass the effective throughput is that latency divided by 16.
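For a concrete feel, here's the per-lane recurrence that option implies, sketched as a sine approximation over an odd polynomial in x**2. The Taylor coefficients are a stand-in for a real minimax polynomial, and my count comes out at 7 FMAs plus 2 MULs, in the same ballpark as the 8 FMACs and 3 MULs above (range reduction would likely eat the rest).

```python
import math

# Stand-in coefficients: Taylor series for sin, terms x, x**3, ..., x**15.
SIN_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(8)]

def sin_poly(x):
    t = x * x                          # MUL: work in x**2
    acc = SIN_COEFFS[-1]
    for c in reversed(SIN_COEFFS[:-1]):
        acc = acc * t + c              # 7 serial FMAs
    return acc * x                     # MUL: odd polynomial, scale by x

# 16 "lanes", each evaluating its own input, fully serial per lane:
lanes = [sin_poly((i - 8) / 8.0) for i in range(16)]
```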
Perhaps this is all centred on the gather and scatter units? By their nature they have to do wide operations against memory (cache), so that the average bandwidth of operands gathered/scattered is in the ballpark of ALU operand bandwidth.
Whereas G80 uses an operand window (gather window) between register file and SIMD, perhaps placing the operand window between cache and register file is the solution for Larrabee?
Some buffering is already done in the load/store units of x86s, though even one or two vector registers would be more than enough to exceed their capacity.
One speculative future direction the CPU manufacturers have bandied about is an L0 operand cache.
As such hardware sits on a critical signal path, port width and buffering are used carefully.
All SIMD instructions, if they run solely from/to the register file, have guaranteed operand bandwidth and no gather/scatter headaches. I'm guessing this is more like how SSE uses its register file (I really don't have a good understanding of SSE implementations).
Since SSE can have one memory operand, the hardware can draw operands from memory, register file, or the bypass network.
SSE currently has no scatter/gather headaches because it can't do scatter/gather.
Just load multiple values and shift them around to gather, or do the reverse for scatter.
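That emulation is simple enough to sketch (table and indices made up, lists standing in for vectors and memory): one scalar load or store per lane, packed into or pulled out of the "vector".

```python
table = [1.5 * i for i in range(64)]      # stand-in for memory

def emulated_gather(mem, indices):
    # one scalar load per lane, inserted into the vector
    return [mem[i] for i in indices]

def emulated_scatter(mem, indices, vec):
    # one lane extract plus one scalar store per lane
    for i, v in zip(indices, vec):
        mem[i] = v

v = emulated_gather(table, [3, 17, 42, 8])   # -> [4.5, 25.5, 63.0, 12.0]
```

The point being that the cost scales with lane count: 16 lanes means 16 serial loads or stores, which is exactly the headache dedicated gather/scatter units would exist to hide.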
Yeah that's the kind of thing I was thinking. The register file is two-way, 8x 512-bits per hardware context, with the fetches and stores running on the "idle" hardware context. These fetch/store SSE instructions would actually be executed by the gather/scatter units.
I'd almost expect the register save/restore to be aligned, since the vectors match the cache line width.
If we assume the registers are saved sequentially, the entire save/restore wouldn't require anything more in the way of scatter/gather than a simple add to a base address.
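Quick sketch of that sequential case (register count and line size are my assumptions, matching the 8x 512-bit figure above): every save slot is just base plus a multiple of the line size, so the addresses are unit-stride and line-aligned.

```python
LINE_BYTES = 64        # 512-bit vector register == one cache line
NUM_REGS = 8           # assumed registers per hardware context

def save_addresses(base):
    # unit-stride stores: one add per register, no scatter needed
    return [base + n * LINE_BYTES for n in range(NUM_REGS)]
```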
Double-threading the SIMD is obviously going to complicate things
Not really. The SIMD can't be double-threaded, as the core can only issue one instruction to the unit.
Threads will just alternate on the issue port.
but by keeping the count of in-flight registers tiny (unlike a GPU), Intel avoids the register file explosion we see in R600, where the register file amounts to 1MB effective (and it seems to have at least 3 read ports, though I still haven't untangled whether that's 3 physical ports or 3 emulated ports).
The downside to doing this is that Intel's emulated expanded register space means even virtual shuffling of register state involves monopolizing a memory client for some time.
R600's register shenanigans frequently happen in parallel with the activity of other memory clients.
Depending on port count, the same cannot always be said for Larrabee.
If Intel goes this route, I'm wondering if I shouldn't count R600's register ports among the memory clients as well.
I also forgot in my previous post that writing out a thread's context implies reading another one in.
As such, a soft switch would involve 8*512 bits worth of writing. If I assume a physical port width of 512 bits, a single port will take 8 cycles.
To switch back in a thread held in the L1, with a load latency of 1 cycle, reading in will take 9 cycles.
So Larrabee must occupy its vector unit with other work for 17 cycles.
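Back-of-envelope check of those numbers (all of them assumptions from this discussion, not known Larrabee parameters):

```python
REGS = 8               # vector registers per hardware context
REG_BITS = 512         # vector register width
PORT_BITS = 512        # assumed physical register-file port width
L1_LATENCY = 1         # assumed L1 load latency in cycles

write_out = REGS * REG_BITS // PORT_BITS    # 8 cycles to spill a context
read_in = write_out + L1_LATENCY            # 9 cycles to fill the next
total = write_out + read_in                 # 17 cycles to cover with other work
```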
So, since D3D10 enforces virtualised state, perhaps Larrabee will dump pretty much the entire state into memory and let the caches and gather/scatter units take the strain.
The success of such a strategy depends on which is cheaper: ALU hardware or memory clients.
It also depends on just where that gather/scatter hardware is, and how it is implemented.
edit:
Just one comment on the transcendental thing: a lot of the register footprint would probably stick around in hidden scratch registers, or hopefully will, to spare the main register files.