Jawed
Sorry, I meant pipelined in the sense of being a single instruction rather than being calculated by a macro.

Hopefully they stay pipelined, otherwise one thread's transcendental function is going to completely monopolize the one vector unit for quite some time.
Blimey! Afterwards I realised that a polynomial's terms are heavily serially dependent, just because each successive power feeds off the previous one, so it prolly doesn't split across many lanes too well. Also, with double-precision computation there's a halving in effective lane count, and for single-precision the computation is prolly so quick (very few terms) that it's prolly not worth the effort.

Perhaps a microcode instruction could spit out the necessary operations.
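On that serial-dependence point, here's a plain-C sketch (my own, nothing from the paper) of why a Horner-style evaluation can't spread across lanes - every FMAC consumes the previous one's result:

```c
#include <stdio.h>

/* Horner evaluation of c0 + c1*x + ... + c7*x^7.
   Each step is acc = acc*x + c[i], so no step can start until the
   previous one has finished: a chain of 7 dependent FMACs, no
   matter how many SIMD lanes are available. */
static float horner(const float c[8], float x)
{
    float acc = c[7];
    for (int i = 6; i >= 0; --i)
        acc = acc * x + c[i];   /* serially dependent FMAC */
    return acc;
}

int main(void)
{
    /* exp() series terms 1/k! as a stand-in transcendental */
    const float c[8] = {1, 1, 0.5f, 1.f/6, 1.f/24,
                        1.f/120, 1.f/720, 1.f/5040};
    printf("%f\n", horner(c, 1.0f)); /* ~e */
    return 0;
}
```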
Just for funzies, I tried by hand to fold the optimal polynomial-evaluation schedule from table 2 across multiple SIMD lanes.
I haven't really gone into it in depth, but it seems the scheduling could be done with 3 vector FMACs and one potentially scalar FMAC.
By permute I presume you mean swizzle, though I suppose what you're getting at is swizzling across the entire 16 lanes, not just within groups of 4 as GPUs do.

The downside is that it would require some hefty permutes between each operation, and each successive operation uses fewer and fewer lanes, so utilization plummets - the last FMAC would only use one lane.
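Here's roughly what I mean, modelled in plain C with a 16-element array standing in for the 16 lanes (entirely my own sketch, not Larrabee code). Each round would be one vector FMAC plus a cross-lane permute, with the active lane count halving 8 → 4 → 2 → 1 - which is where the 3 vector FMACs plus one scalar FMAC above come from:

```c
#include <stdio.h>

/* Evaluate c[0] + c[1]*x + ... + c[15]*x^15 by pairwise folding.
   Each round does c[i] = c[2i] + c[2i+1]*xp across the active
   lanes (8, then 4, 2, 1) and squares xp. On real SIMD hardware
   each round is an FMAC plus a hefty cross-lane permute, with
   utilization halving every round. */
static float fold_poly16(float c[16], float x)
{
    float xp = x;                       /* x^(2^round) */
    for (int n = 8; n >= 1; n /= 2) {   /* active lanes this round */
        for (int i = 0; i < n; ++i)
            c[i] = c[2*i] + c[2*i+1] * xp;  /* would be one vector FMAC */
        xp *= xp;                       /* x, x^2, x^4, x^8 */
    }
    return c[0];
}

int main(void)
{
    float c[16] = {0};
    c[0] = 1; c[1] = 1; c[2] = 0.5f; c[3] = 1.f/6;  /* exp() terms */
    printf("%f\n", fold_poly16(c, 1.0f));            /* ~2.6667 */
    return 0;
}
```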
So long as each operation is pipelined, other work could be overlaid in the latency periods between each op.
I'd hope the FMAC is pipelined, but would the permutes be?
The latency would be the sum of the FMAC and permute latencies.
There might be an opportunity to use the 16 lanes to produce 4 transcendental results in fewer clocks than if they were produced separately in parallel on the four (x,y,z,w) sets of lanes. Apart from the approximation step, the reduction and reconstruction steps provide further opportunities to enhance utilisation on a wide SIMD - in effect, using the 16-lane width of the SIMD to overlap computations for a set of 4 results.
So you're saying generate 16 transcendentals in parallel? I dare say its simplicity is compelling and it could well coincide with the number of objects in a batch. The issue with running so many in parallel is the 16-way duplication of the lookup tables - though I guess they're all quite small.

The other option is to use precisely one lane each for N different transcendentals, though this would involve burning up a vector register for each term across 16 elements.
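As a sketch of that option (again plain C standing in for 16-wide vector code, and my own guesswork): each array index is a lane running its own Horner chain, so utilisation stays at 100%, but x and the accumulator each occupy a full 16-wide register:

```c
#include <stdio.h>

#define LANES 16

/* One independent transcendental per lane: every lane runs the
   same Horner recurrence on its own x. The inner loop is one
   vector FMAC across all 16 lanes; the i-loop is the serial
   dependency chain. No permutes needed, at the cost of holding
   x and acc (plus any reduction-step temporaries) in full
   16-wide registers. */
static void vec_poly(const float c[8], const float x[LANES], float out[LANES])
{
    float acc[LANES];                     /* one vector register */
    for (int l = 0; l < LANES; ++l)
        acc[l] = c[7];
    for (int i = 6; i >= 0; --i)          /* 7 dependent vector FMACs */
        for (int l = 0; l < LANES; ++l)   /* ...each across all lanes */
            acc[l] = acc[l] * x[l] + c[i];
    for (int l = 0; l < LANES; ++l)
        out[l] = acc[l];
}

int main(void)
{
    const float c[8] = {1, 1, 0.5f, 1.f/6, 1.f/24,
                        1.f/120, 1.f/720, 1.f/5040};
    float x[LANES], y[LANES];
    for (int l = 0; l < LANES; ++l)
        x[l] = l * 0.1f;
    vec_poly(c, x, y);
    printf("exp(0.5) ~ %f\n", y[5]);      /* series approximation */
    return 0;
}
```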
Once a value is no longer needed, its register could be reused, though the register footprint would still be wider.
It does avoid the permute stuff, though.
The latency then is that of 8 FMACs and 3 MULs, though the throughput would be that latency divided by 16.
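Putting made-up numbers on that (assuming a 4-cycle latency for both FMAC and MUL, which is pure guesswork on my part): (8 + 3) × 4 = 44 cycles for the dependent chain, but with 16 independent results in flight that works out at 44 ÷ 16 = 2.75 cycles per transcendental.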
Various speculative future directions the CPU manufacturers have bandied about include an L0 operand cache.

I've not heard of L0 before... Some buffering is already done in the load/store units of x86s, though even one or two vector registers would exceed their capacity.
I presume you're referring to the cache line width of current SSE implementations. We don't know this for Larrabee, do we?

I'd almost expect the register save/restore to be aligned, since the vectors match the cache line width.
I dare say sequential registers would only apply for simpler programs.

If we assume the registers are sequential, the entire save/restore wouldn't require anything more in the way of scatter/gather than a simple add to a base address.
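To make that concrete, a trivial sketch (the 8-register, 64-byte figures are just the numbers we've been assuming in this thread): with sequential registers the spill is nothing but aligned cache-line stores at base + i*64.

```c
#include <string.h>
#include <stdio.h>

#define VREGS      8   /* vector registers per thread (assumed) */
#define VREG_BYTES 64  /* 512-bit register = one cache line */

/* Saving a sequential block of vector registers needs no
   scatter/gather: each destination address is just base + i*64,
   i.e. one aligned cache-line store per register. */
static void save_vregs(const unsigned char regs[VREGS][VREG_BYTES],
                       unsigned char *base)
{
    for (int i = 0; i < VREGS; ++i)
        memcpy(base + i * VREG_BYTES, regs[i], VREG_BYTES);
}

int main(void)
{
    unsigned char regs[VREGS][VREG_BYTES] = {{0}};
    unsigned char spill[VREGS * VREG_BYTES];
    regs[3][0] = 42;
    save_vregs(regs, spill);
    printf("%d\n", spill[3 * VREG_BYTES]);  /* 42 */
    return 0;
}
```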
I'm thinking that it increases the port complexity, because both the SIMD and the gather/scatter units are fetching and storing concurrently - although in an alternating pattern. I suppose doubling the banking would be the easiest solution.

Not really. The SIMD can't be double-threaded, as the core can only issue one instruction to the unit.
Threads will just alternate on the issue port.
So we get back to the question of port widths...

The downside to doing this is that Intel's emulated expanded register space means even virtual shuffling of register state involves monopolizing a memory client for some time.
R600's register shenanigans frequently happen in parallel with the activity of other memory clients.
Depending on port count, the same cannot always be said for Larrabee.
If Intel goes this route, I'm wondering if I shouldn't count R600's register ports as well.
With the SIMD being pipelined, and with any one instruction only able to consume at most 3 operands, thread B's instructions can start issuing before thread B's register set has been fully populated. Meanwhile thread A's register set can start being written out before A has finished. OK, I know, it's hairy.

I also forgot in my previous post that writing out a thread implies writing another one in.
As such, a soft switch would involve 8 × 512 bits worth of writing. If I assume a physical port width of 512 bits, a single port will take 8 cycles.
To switch a thread back in from the L1, with a latency of 1 cycle, reading in will take 9 cycles. That's 8 + 9 = 17 cycles in total, which Larrabee must cover by occupying its vector unit with other work.
Thinking that we could be waiting 2 years to find out is, ahem, maybe not so funny.

The success of such a strategy depends on which is cheaper: ALU hardware or memory clients.
It also depends on just where that gather/scatter hardware is, and how it is implemented.
This is bloody tantalising:

http://ieeexplore.ieee.org/Xplore/l...9/04343860.pdf?tp=&isnumber=&arnumber=4343860

But it's hidden and I can't find anything else on the topic.

edit: Just one comment on the transcendental thing: a lot of the register footprint would probably stick around in hidden scratch registers, or hopefully will, to spare the main register files.
Jawed