Jawed
Sorry, I meant pipelined in the sense of being a single instruction rather than being calculated by a macro.

Hopefully they stay pipelined, otherwise one thread's transcendental function is going to completely monopolize the one vector unit for quite some time.
Blimey! Afterwards I realised that a polynomial's terms are heavily serially dependent, just because each successive power feeds off the previous one, so it prolly doesn't split across many lanes too well. Also, with double-precision computation there's a halving in effective lane count, and for single-precision the computation is prolly so quick (very few terms) that it's prolly not worth the effort.

Perhaps a microcode instruction could spit out the necessary operations.
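On that serial-dependence point, here's a plain-C sketch (my own, nothing from the paper) of why a Horner-style evaluation can't spread across lanes - every FMAC consumes the previous one's result:

```c
#include <stdio.h>

/* Horner evaluation of c0 + c1*x + ... + c7*x^7.
   Each step is acc = acc*x + c[i], so no step can start until the
   previous one has finished: a chain of 7 dependent FMACs, no
   matter how many SIMD lanes are available. */
static float horner(const float c[8], float x)
{
    float acc = c[7];
    for (int i = 6; i >= 0; --i)
        acc = acc * x + c[i];   /* serially dependent FMAC */
    return acc;
}

int main(void)
{
    /* exp() series terms 1/k! as a stand-in transcendental */
    const float c[8] = {1, 1, 0.5f, 1.f/6, 1.f/24,
                        1.f/120, 1.f/720, 1.f/5040};
    printf("%f\n", horner(c, 1.0f)); /* ~e */
    return 0;
}
```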
Just for funzies, I tried by hand to fold the optimal polynomial-evaluation schedule from table 2 across multiple SIMD lanes.
I haven't really gone into it in depth, but it seems the scheduling could be done with 3 vector FMACs and one potentially scalar FMAC.
By permute I presume you mean swizzle, though I suppose what you're getting at is swizzling across the entire 16 lanes, not just within groups of 4 as GPUs do.

The downside is that it would require some hefty permutes between each operation, and each successive operation uses fewer and fewer lanes, so utilization plummets - the last FMAC would only use one lane.
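Here's roughly what I mean, modelled in plain C with a 16-element array standing in for the 16 lanes (entirely my own sketch, not Larrabee code). Each round would be one vector FMAC plus a cross-lane permute, with the active lane count halving 8 → 4 → 2 → 1 - which is where the 3 vector FMACs plus one scalar FMAC above come from:

```c
#include <stdio.h>

/* Evaluate c[0] + c[1]*x + ... + c[15]*x^15 by pairwise folding.
   Each round does c[i] = c[2i] + c[2i+1]*xp across the active
   lanes (8, then 4, 2, 1) and squares xp. On real SIMD hardware
   each round is an FMAC plus a hefty cross-lane permute, with
   utilization halving every round. */
static float fold_poly16(float c[16], float x)
{
    float xp = x;                       /* x^(2^round) */
    for (int n = 8; n >= 1; n /= 2) {   /* active lanes this round */
        for (int i = 0; i < n; ++i)
            c[i] = c[2*i] + c[2*i+1] * xp;  /* would be one vector FMAC */
        xp *= xp;                       /* x, x^2, x^4, x^8 */
    }
    return c[0];
}

int main(void)
{
    float c[16] = {0};
    c[0] = 1; c[1] = 1; c[2] = 0.5f; c[3] = 1.f/6;  /* exp() terms */
    printf("%f\n", fold_poly16(c, 1.0f));            /* ~2.6667 */
    return 0;
}
```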
So long as each operation is pipelined, other work could be overlaid in the latency periods between each op.
I'd hope the FMAC is pipelined, but would the permutes be?
The latency would be the sum of the FMAC and permute latencies.
There might be an opportunity to use the 16 lanes to produce 4 transcendental results in fewer clocks than if they were produced separately in parallel on the four (x,y,z,w) sets of lanes. Apart from the approximation step, the reduction and reconstruction steps provide further opportunities to enhance utilisation on a wide SIMD - in effect, using the 16-lane width of the SIMD to overlap computations for a set of 4 results.
So you're saying generate 16 transcendentals in parallel? I dare say its simplicity is compelling and it could well coincide with the number of objects in a batch. The issue with running so many in parallel is the 16-way duplication of the lookup tables - though I guess they're all quite small.

The other option is to use precisely one lane each for N different transcendentals, though this would involve burning up a vector register for each term across 16 elements.
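As a sketch of that option (again plain C standing in for 16-wide vector code, and my own guesswork): each array index is a lane running its own Horner chain, so utilisation stays at 100%, but x and the accumulator each occupy a full 16-wide register:

```c
#include <stdio.h>

#define LANES 16

/* One independent transcendental per lane: every lane runs the
   same Horner recurrence on its own x. The inner loop is one
   vector FMAC across all 16 lanes; the i-loop is the serial
   dependency chain. No permutes needed, at the cost of holding
   x and acc (plus any reduction-step temporaries) in full
   16-wide registers. */
static void vec_poly(const float c[8], const float x[LANES], float out[LANES])
{
    float acc[LANES];                     /* one vector register */
    for (int l = 0; l < LANES; ++l)
        acc[l] = c[7];
    for (int i = 6; i >= 0; --i)          /* 7 dependent vector FMACs */
        for (int l = 0; l < LANES; ++l)   /* ...each across all lanes */
            acc[l] = acc[l] * x[l] + c[i];
    for (int l = 0; l < LANES; ++l)
        out[l] = acc[l];
}

int main(void)
{
    const float c[8] = {1, 1, 0.5f, 1.f/6, 1.f/24,
                        1.f/120, 1.f/720, 1.f/5040};
    float x[LANES], y[LANES];
    for (int l = 0; l < LANES; ++l)
        x[l] = l * 0.1f;
    vec_poly(c, x, y);
    printf("exp(0.5) ~ %f\n", y[5]);      /* series approximation */
    return 0;
}
```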
Once a value is no longer needed, its register could be reused, though the register footprint would still be wider.
It does avoid the permute stuff, though.
The latency then is that of 8 FMACs and 3 MULs, though the throughput would be that latency divided by 16.
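Putting made-up numbers on that (assuming a 4-cycle latency for both FMAC and MUL, which is pure guesswork on my part): (8 + 3) × 4 = 44 cycles for the dependent chain, but with 16 independent results in flight that works out at 44 ÷ 16 = 2.75 cycles per transcendental.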
Various speculative future directions the CPU manufacturers have bandied about include an L0 operand cache.

I've not heard of L0 before... Some buffering is already done in the load/store units of x86s, though even one or two vector registers would exceed their capacity.
I presume you're referring to the cache line width of current SSE implementations. We don't know this for Larrabee, do we?

I'd almost expect the register save/restore to be aligned, since the vectors match the cache line width.
I dare say sequential registers would only apply for simpler programs.

If we assume the registers are sequential, the entire save/restore wouldn't require anything more in the way of scatter/gather than a simple add to a base address.
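To make that concrete, a trivial sketch (the 8-register, 64-byte figures are just the numbers we've been assuming in this thread): with sequential registers the spill is nothing but aligned cache-line stores at base + i*64.

```c
#include <string.h>
#include <stdio.h>

#define VREGS      8   /* vector registers per thread (assumed) */
#define VREG_BYTES 64  /* 512-bit register = one cache line */

/* Saving a sequential block of vector registers needs no
   scatter/gather: each destination address is just base + i*64,
   i.e. one aligned cache-line store per register. */
static void save_vregs(const unsigned char regs[VREGS][VREG_BYTES],
                       unsigned char *base)
{
    for (int i = 0; i < VREGS; ++i)
        memcpy(base + i * VREG_BYTES, regs[i], VREG_BYTES);
}

int main(void)
{
    unsigned char regs[VREGS][VREG_BYTES] = {{0}};
    unsigned char spill[VREGS * VREG_BYTES];
    regs[3][0] = 42;
    save_vregs(regs, spill);
    printf("%d\n", spill[3 * VREG_BYTES]);  /* 42 */
    return 0;
}
```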
I'm thinking that it increases the port complexity, because both the SIMD and the gather/scatter units are fetching and storing concurrently - although in an alternating pattern. I suppose doubling the banking would be the easiest solution.

Not really. The SIMD can't be double-threaded, as the core can only issue one instruction to the unit.
Threads will just alternate on the issue port.
So we get back to the question of port widths...

The downside to doing this is that Intel's emulated expanded register space means even virtual shuffling of register state involves monopolizing a memory client for some time.
R600's register shenanigans frequently happen in parallel with the activity of other memory clients.
Depending on port count, the same cannot always be said for Larrabee.
If Intel goes this route, I'm wondering if I shouldn't count R600's register ports as well.
With the SIMD being pipelined, and with any one instruction only able to consume at most 3 operands, thread B's instructions can start issuing before thread B's register set has been fully populated. Meanwhile thread A's register set can start being written out before A has finished. OK, I know, it's hairy.

I also forgot in my previous post that writing out a thread implies writing another one in.
As such, a soft switch would involve 8 × 512 bits worth of writing. If I assume a physical port width of 512 bits, a single port will take 8 cycles.
To switch a thread back in from the L1, with a latency of 1 cycle, reading in will take 9 cycles. That's 8 + 9 = 17 cycles in total, which Larrabee must cover by occupying its vector unit with other work.
Thinking that we could be waiting 2 years to find out is, ahem, maybe not so funny.

The success of such a strategy depends on which is cheaper: ALU hardware or memory clients.
It also depends on just where that gather/scatter hardware is, and how it is implemented.
This is bloody tantalising:

http://ieeexplore.ieee.org/Xplore/l...9/04343860.pdf?tp=&isnumber=&arnumber=4343860

But it's hidden and I can't find anything else on the topic.

edit: Just one comment on the transcendental thing: a lot of the register footprint would probably stick around in hidden scratch registers, or hopefully will, to spare the main register files.
Jawed