Larrabee's Rasterisation Focus Confirmed

Jawed said:
Ever since my original discussion with Bob on the subject of instruction issue my view has been that the two warps that make up a pixel shader batch (or CUDA "warp") issue as mirror pairs.

Found the thread. Doesn't seem to support your argument though, just a bunch of people who seem to believe the same thing I do....including Bob :)

Bob's answer depends solely on co-issue, not on multiple-thread scheduling.

No it doesn't.
 
I was just trying to point out that G80 has a hierarchy of port operations and it time-splats wide reads in order to get around port-count limitations in general. I don't think a port-analysis of G80 is particularly transferable to Larrabee, though.
I'm merely trying to make an attempt at finding some form of equivalence.
They're going to be running the same workload, some kind of rough metric for the effective number of ports would be an interesting data point.

Did you read that blog entry? I know it's about unaligned accesses, but the fact is C2 had half the performance it should have had - while Opteron whizzed through with barely a bump. It just seems Intel has got away with things like this in the consumer space because it's not very demanding...
Intel's such a dominant player that if any rare app happens to run into this problem, it can usually count on performance-sensitive developers to bend over backwards to work around it.

I dare say I'm blasé simply because this whole thing is about throughput.
If it can physically fit in a rather tiny core, sure.
Each port at a minimum is already 4 times larger than a port on a standard x86.
It's not a trivial matter to suddenly decide to double things up or slather on enough ports for 4 simultaneous accesses from different threads.
 
Intel's such a dominant player that if any rare app happens to run into this problem, it can usually count on performance-sensitive developers to bend over backwards to work around it.

And Intel has already announced they are improving the performance of unaligned load instructions in future products.

But the truth is that unaligned loads really are quite rare and, except for a very limited number of cases, are avoidable.

As far as TLBs go, I don't think they should be an issue in the graphics space, where you are likely just to use either superpages or a basic 1:1 mapping.

Aaron Spink
speaking for myself inc.
 
nAo is saying that if the two triangles have near-equal derivatives, one of these quads gets merged into the other to eliminate redundant work. I'm surprised that this is true, but I presume he has good reason to make this claim.
Actually I'd be really surprised if this is true as well. In particular I've seen a noticeable difference in speed in rendering a single triangle that covers the full screen rather than two triangles in a full screen quad, due to the redundant diagonal quads issue. The G80 rasterizer research page linked earlier says the same thing. Maybe it's different with ATI hardware though as I haven't tested it fully. It does seem like a bit of a jump to infer that such an optimization can be made in the general case though, particularly with arbitrary user shaders involved at both the vertex and pixel level.
 
Actually I'd be really surprised if this is true as well. In particular I've seen a noticeable difference in speed in rendering a single triangle that covers the full screen rather than two triangles in a full screen quad, due to the redundant diagonal quads issue. The G80 rasterizer research page linked earlier says the same thing. Maybe it's different with ATI hardware though as I haven't tested it fully. It does seem like a bit of a jump to infer that such an optimization can be made in the general case though, particularly with arbitrary user shaders involved at both the vertex and pixel level.

Yeah, I've also seen results which seem to support this, i.e. framebuffer overdraw visualizations (which include fragments that fail the pixel coverage test) showing the common blocky pattern along edges. Perhaps for large triangles the fragment packing doesn't work, while for really small tris it does.
 
It's an optimization that obviously can't be done all the time, but modern hardware can do it... according to what I was told :)
 
You're ignoring the text which clearly identifies three different types of operations corresponding to the TEX, ALU and SF pipelines. I'm not sure how much more explicit it can get than this.
I agree it could work that way. With mirroring I don't think it needs to work that way.

If it isn't working this way now, it may work this way in the future - the key thing is to get "superscalar ALU+TMU" instruction issue.

This may be why MUL is/was missing; it can only work in GPUs where MAD + SF are superscalar.

G80 might be superscalar across MAD + SF but it only ever issues from the mirrored pair of warps. Which makes it no different from co-issue, because that's a trivial pairing of one instruction type from A with the other type from B.

If it's genuinely able to issue MAD + SF without being constrained by a mirrored pair then I will concede it is truly out of order and not a co-issue setup. Evidence gladly welcomed :D

If G80 or derivatives are superscalar across MAD+SF+TMU then the scoreboarding/operand-gather/instruction-issue hardware is clearly even more hairy than I thought :p

Sure, you can read it as many things, but the concepts described apply to the general case of instruction re-ordering and aren't particularly focused on anything related to the "missing MUL" co-issue.
Sorry, I didn't mean to focus on missing MUL - merely that it's a nice example of why compilation gets hairy.

If compilation is simplified by a fully superscalar ALU configuration then it's re-complicated by the asymmetry of the ALUs, the register file bandwidth (remembering to include constant cache and parallel data cache bandwidths may ameliorate this), read-after-write latency and branching.

Jawed
 
Thanks, nice explanation.

nAo is saying that if the two triangles have near-equal derivatives, one of these quads gets merged into the other to eliminate redundant work.
Woah. OK, so what does "near equal" mean. Does OGL or D3D specify precision?

And this is something that ATI does but NVidia doesn't? What would be the visual artefacts arising from this? I'm guessing the common edge might make itself seen.

Jawed
 
I'm merely trying to make an attempt at finding some form of equivalence.
They're going to be running the same workload, some kind of rough metric for the effective number of ports would be an interesting data point.
We don't really have much idea about instruction and batch throughput. e.g. I think it's prolly 4x vec4 issue per SIMD. Will there be a transcendental unit?

For all we know the 4 hardware threads are actually used to "emulate" the highly threaded operation of a GPU, e.g. each hardware thread is actually able to context switch amongst 16 soft threads. Would it be possible to soft-context-swap a set of, say 8, registers into memory (cache)? Could Larrabee successfully hide the latency of such a swap because the set of 4 hardware threads always has at least 2 that are out of hardware context?

Jawed
 
Woah. OK, so what does "near equal" mean. Does OGL or D3D specify precision?

And this is something that ATI does but NVidia doesn't? What would be the visual artefacts arising from this? I'm guessing the common edge might make itself seen.

Are we talking about one micro-poly which happens to raster to 4 fragments in different 2x2 quads being repacked into one quad? This seems possible in hardware since the plane equation is the same (the interpolation source data is the same). Not sure if the hardware can do this? Guessing yes?

Second case is two adjacent micro-polys which share an edge in one 2x2 quad with different plane equations, guessing these don't get repacked.

Or am I completely off base in this?
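To make the two cases above concrete, here's a toy sketch (pure Python, all names hypothetical; real rasterisers are nothing like this structurally) contrasting naive per-quad packing with an idealised repack for fragments that share one plane equation:

```python
# Hypothetical sketch (NOT real hardware behaviour): group a micro-poly's
# covered pixels into 2x2 quads, then "repack" fragments that share one
# plane equation into as few quads as possible to cut redundant shading.

def quad_of(x, y):
    """2x2 quad a pixel falls in (quad-aligned origin)."""
    return (x // 2 * 2, y // 2 * 2)

def pack_quads(fragments):
    """fragments: list of (x, y) pixels covered by ONE triangle.
    Naive rasteriser: one quad per distinct 2x2 block touched."""
    quads = {}
    for x, y in fragments:
        quads.setdefault(quad_of(x, y), []).append((x, y))
    return quads

def repacked_quad_count(fragments):
    """Idealised repacking: since all fragments share the triangle's
    plane equation, they could in principle be merged into ceil(n/4)
    quads regardless of screen position."""
    return (len(fragments) + 3) // 4

# A micro-poly covering 4 pixels that straddle a quad boundary:
frags = [(1, 1), (2, 1), (1, 2), (2, 2)]
naive = len(pack_quads(frags))       # 4 quads -> 16 shader invocations
ideal = repacked_quad_count(frags)   # 1 quad  ->  4 shader invocations
```

The second case in the post (two micro-polys with different plane equations) wouldn't fit this sketch at all, since merged fragments would need per-fragment interpolants.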
 
We don't really have much idea about instruction and batch throughput. e.g. I think it's prolly 4x vec4 issue per SIMD.
To rephrase: one vector instruction can be issued to a vector unit.
Since it seems some Intel figures have stated publicly that an FMAC instruction is available to Larrabee, we can map that to the 8-16 DP figure given in the earlier Larrabee slide, which--barring some flaky issue restrictions--indicates each core has one vector unit.

Will there be a transcendental unit?
Could be. None of the slides go into that.
There are a number of ways that can go.
It could be fully separate, or complex ops can share hardware with the FMAC unit.

For all we know the 4 hardware threads are actually used to "emulate" the highly threaded operation of a GPU, e.g. each hardware thread is actually able to context switch amongst 16 soft threads. Would it be possible to soft-context-swap a set of, say 8, registers into memory (cache)?
8 vector registers?
By soft-context-swap, you mean have each master thread emulate a context switch with successive writes?
The effectiveness of such a solution could depend on a lot of things, such as the physical port count.
Unless there is a form of bulk write that can write multiple registers to memory, we're talking about a soft-context switch that will take up 8 port cycles and 8 instructions out of the core's issue bandwidth.
Possibly, if there is a large load/store buffer, the successive writes can wait to commit to memory and take advantage of a wider cache port.

If we assume two fully active threads, one thread could do a switch while the other continued working, assuming an internal width of at least 2 threads-worth of resources and two data cache ports.
That would require that the other active thread keep active for 8 cycles to hide this switch, or only 4 if it doesn't use any cache bandwidth at all and sticks to within the reg file.

If we crudely link the vector registers' capacity to an equivalent number of 4-channel FP32 elements, that's between 32 and 16 elements, if we go by what you posited by having only 2 threads fully running.
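For concreteness, the arithmetic behind those figures, assuming the 512-bit registers and the 8-register soft context posited above:

```python
# Back-of-envelope for the element counts above (all figures are the
# thread's assumptions, not confirmed Larrabee specs).
REG_BITS = 512          # one posited Larrabee vector register
REGS_PER_CONTEXT = 8    # the posited soft-context size
FP32_BITS = 32
CHANNELS = 4            # one 4-channel FP32 element = 128 bits

elems_per_reg = REG_BITS // (CHANNELS * FP32_BITS)    # 4 elements/register
elems_per_context = REGS_PER_CONTEXT * elems_per_reg  # 32 elements
# With only 2 threads fully running, each holds half of that:
elems_per_thread = elems_per_context // 2             # 16 elements
```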
 
Really? ;)

Texturing is basically unaligned loads + gather. Kind of makes up a majority of the loads in most fragment shaders.

How so? Why would you have a 32-bit RGBA texture that is not aligned on 4 bytes? Actually a lot of recent hardware aligns textures on 4 KB boundaries. That's plenty of alignment.
 
How so? Why would you have a 32-bit RGBA texture that is not aligned on 4 bytes? Actually a lot of recent hardware aligns textures on 4 KB boundaries. That's plenty of alignment.
This is because you spend too much time on a console with crazy alignment requirements, try to work on the other one as well! ;)
 
To rephrase: one vector instruction can be issued to a vector unit.
Since it seems some Intel figures have stated publicly that an FMAC instruction is available to Larrabee, we can map that to the 8-16 DP figure given in the earlier Larrabee slide, which--barring some flaky issue restrictions--indicates each core has one vector unit.
Yeah, in single-precision terms:

((x,y,z,w),(x,y,z,w),(x,y,z,w),(x,y,z,w))

Could be. None of the slides go into that.
There are a number of ways that can go.
It could be fully separate, or complex ops can share hardware with the FMAC unit.
I wonder how important double precision is. Generally Intel seems to be keeping a keen eye on it yet it makes transcendentals hairier, biasing implementation away from the pipelined designs we see in GPUs.

LOL, with 16 lanes in a SIMD you could do a funky set of parallel terms (one per lane) for a polynomial to produce one transcendental every few clocks...

Might as well link this as I just ran into it:

http://developer.intel.com/technology/itj/q41999/pdf/transendental.pdf

Interestingly only atan has degree more than 16 for double-precision: 22 :p
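As a toy illustration of the parallel-terms idea - one polynomial term per lane, then a log-depth tree add - here's a sketch that uses exp(x)'s Taylor coefficients rather than a properly tuned minimax fit:

```python
# Toy "one term per lane" polynomial evaluation: compute every c_i * x^i
# in a separate lane, then sum with a pairwise (tree) reduction, the way
# a SIMD horizontal add would. Coefficients are exp's Taylor series here,
# purely for illustration - not a real minimax polynomial.
from math import factorial

def poly_eval_lanes(x, coeffs):
    lanes = [c * x**i for i, c in enumerate(coeffs)]  # parallel per-lane muls
    # log2(n)-deep pairwise reduction (requires a power-of-two lane count):
    while len(lanes) > 1:
        lanes = [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]
    return lanes[0]

coeffs = [1.0 / factorial(i) for i in range(8)]  # degree-7, fills 8 lanes
approx = poly_eval_lanes(0.5, coeffs)
# Degree-7 Taylor for e^x is accurate to roughly 1e-7 at x = 0.5.
```

The tree reduction is where the hefty permutes mentioned below come from: each reduction step needs lanes shuffled next to each other before the add.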

8 vector registers?
By soft-context-swap, you mean have each master thread emulate a context switch with successive writes?
Yeah, 8x 512-bit registers populated to form a hardware context from one of many virtualised states.

D3D10 requires support for 4096 128-bit registers per object.

Since Intel has to implement virtualised shader state then it might go one further and virtualise threads by creating a pool of software contexts.

Hmm, as far as a software-GPU is concerned this should be entirely up for grabs - what does Swiftshader do? Presumably Intel is retaining SSE functionality so it's really a matter of the most advantageous way to use soft contexts (if it makes any sense for Larrabee-as-GPU).

The effectiveness of such a solution could depend on a lot of things, such as the physical port count.
Unless there is a form of bulk write that can write multiple registers to memory, we're talking about a soft-context switch that will take up 8 port cycles and 8 instructions out of the core's issue bandwidth.
Possibly, if there is a large load/store buffer, the successive writes can wait to commit to memory and take advantage of a wider cache port.
Perhaps this is all centred on the gather and scatter units? By their nature they have to do wide operations against memory (cache) so that average bandwidth of operands gathered/scattered is in the ballpark of ALU operand bandwidth.

Whereas G80 uses an operand window (gather window) between register file and SIMD, perhaps placing the operand window between cache and register file is the solution for Larrabee? All SIMD instructions, if they run solely from/to register file, have guaranteed operand bandwidth and no gather/scatter headaches. I'm guessing this is more like how SSE uses its register file (I really don't have a good understanding of SSE implementations :oops: ).

If we assume two fully active threads, one thread could do a switch while the other continued working, assuming an internal width of at least 2 threads-worth of resources and two data cache ports.
That would require that the other active thread keep active for 8 cycles to hide this switch, or only 4 if it doesn't use any cache bandwidth at all and sticks to within the reg file.

If we crudely link the vector registers' capacity to an equivalent number of 4-channel FP32 elements, that's between 32 and 16 elements, if we go by what you posited by having only 2 threads fully running.
Yeah that's the kind of thing I was thinking. The register file is two-way, 8x 512-bits per hardware context, with the fetches and stores running on the "idle" hardware context. These fetch/store SSE instructions would actually be executed by the gather/scatter units.

With the gather and scatter units being the interface to the real world for all data, the SIMD can cosy up to a very small register file - presumably much like SSE's SIMD does. Double-threading the SIMD is obviously going to complicate things but by keeping the count of in-flight registers tiny (unlike a GPU) Intel avoids the register file explosion that we see in R600, where the register file amounts to 1MB effective (and it has at least 3 read ports it seems though I still haven't untangled whether that's 3 physical ports or 3 emulated ports).

So, since D3D10 enforces virtualised state, perhaps Larrabee will dump pretty much the entire state into memory and let the caches and gather/scatter units take the strain.

Jawed
 
So, since D3D10 enforces virtualised state, perhaps Larrabee will dump pretty much the entire state into memory and let the caches and gather/scatter units take the strain.

That would be the most logical thing to do. Current GPUs struggle with constant register indexing, something that would be trivial with a virtualized register file.
If the rumors of 1-cycle L1 reads plus reg-mem vector instructions are true, that would effectively give a 32 KB register file per core.
 
I wonder how important double precision is. Generally Intel seems to be keeping a keen eye on it yet it makes transcendentals hairier, biasing implementation away from the pipelined designs we see in GPUs.
Hopefully they stay pipelined, otherwise one thread's transcendental function is going to completely monopolize the one vector unit for quite some time.

LOL, with 16 lanes in a SIMD you could do a funky set of parallel terms (one per lane) for a polynomial to produce one transcendental every few clocks...

Might as well link this as I just ran into it:

http://developer.intel.com/technology/itj/q41999/pdf/transendental.pdf

Interestingly only atan has degree more than 16 for double-precision: 22 :p
Perhaps a microcode instruction could spit out the necessary operations.
Just for funzies, I tried by hand to fold that table used for the optimal scheduling of that polynomial evaluation in table 2 across multiple SIMD lanes.
I haven't really gone too in depth, but it seems that the scheduling could be done with 3 vector FMACs and one potentially scalar FMAC.
The downside is that it would require some hefty permutes between each operation, and each successive operation uses fewer and fewer lanes, so utilization plummets; the last FMAC would only use one lane.
So long as each operation is pipelined, other work could be overlaid in the latency periods between each op.
I'd hope the FMAC is pipelined, but would the permutes?
The latency would be the sum of the FMAC and permute latencies.

The other option is to use precisely one lane for N different transcendentals, though this would involve burning up a vector register for each term across 16 elements.
Once a value is no longer needed, its register could be reused, though the register footprint would still be wider.
It does avoid the permute stuff, though.
The latency then is that of 8 FMACs and 3 MULs, though the throughput would be that latency divided by 16.
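Plugging placeholder latencies into that accounting (the 4-cycles-per-FMAC/MUL figure is purely an assumption, chosen only to make the arithmetic concrete):

```python
# Rough cycle accounting for the "one transcendental per lane" option.
# Latencies below are assumed placeholders, not known Larrabee figures.
FMAC_LAT = 4   # assumed pipelined FMAC latency, cycles
MUL_LAT = 4    # assumed MUL latency, cycles
LANES = 16

# The serial dependency chain per evaluation: 8 FMACs and 3 MULs.
latency = 8 * FMAC_LAT + 3 * MUL_LAT   # cycles until all lanes finish
# All 16 lanes each produce one transcendental, so amortised:
throughput = latency / LANES           # cycles per transcendental result
```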

Perhaps this is all centred on the gather and scatter units? By their nature they have to do wide operations against memory (cache) so that average bandwidth of operands gathered/scattered is in the ballpark of ALU operand bandwidth.

Whereas G80 uses an operand window (gather window) between register file and SIMD, perhaps placing the operand window between cache and register file is the solution for Larrabee?
Some buffering is already done in the load/store units of x86s, though even one or two vector registers would be more than enough to exceed their capacity.
One speculative future direction the CPU manufacturers have bandied about is an L0 operand cache.

As such hardware is on a critical signal path, port width and buffering are allocated carefully.

All SIMD instructions, if they run solely from/to register file, have guaranteed operand bandwidth and no gather/scatter headaches. I'm guessing this is more like how SSE uses its register file (I really don't have a good understanding of SSE implementations :oops: ).
Since SSE can have one memory operand, the hardware can draw operands from memory, register file, or the bypass network.
SSE currently has no scatter/gather headaches because it can't do scatter/gather.
Just load multiple values and shift them around to gather, or do the reverse for scatter.
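In other words, gather has to be emulated lane by lane. A sketch of what that amounts to, with plain Python standing in for the per-lane load+shuffle/insert sequence:

```python
# Emulated gather: without a gather instruction, each lane's value has
# to be fetched with its own scalar load and then inserted into the
# vector (on SSE, via loads plus shuffles/insertions).
def emulated_gather(memory, indices):
    vec = []
    for i in indices:            # one load + one insert per lane
        vec.append(memory[i])    # stands in for load + shuffle/insert
    return vec

mem = [10, 20, 30, 40, 50, 60, 70, 80]
result = emulated_gather(mem, [7, 0, 3, 3])   # -> [80, 10, 40, 40]
```

The cost is linear in the lane count, which is exactly why a wide SIMD like Larrabee's would want dedicated gather/scatter hardware rather than this sequence.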

Yeah that's the kind of thing I was thinking. The register file is two-way, 8x 512-bits per hardware context, with the fetches and stores running on the "idle" hardware context. These fetch/store SSE instructions would actually be executed by the gather/scatter units.
I'd almost expect the register save/restore to be aligned, since the vectors match the cache line width.
If we assume the registers are sequential, the entire save/restore wouldn't require much in the way of scatter/gather beyond a simple add to a base address.

Double-threading the SIMD is obviously going to complicate things
Not really. The SIMD can't be double-threaded, as the core can only issue one instruction to the unit.
Threads will just alternate on the issue port.

but by keeping the count of in-flight registers tiny (unlike a GPU) Intel avoids the register file explosion that we see in R600, where the register file amounts to 1MB effective (and it has at least 3 read ports it seems though I still haven't untangled whether that's 3 physical ports or 3 emulated ports).
The downside to doing this is that Intel's emulated expanded register space means even virtual shuffling of register state involves monopolizing a memory client for some time.
R600's register shenanigans frequently happen in parallel with the activity of other memory clients.
Depending on port count, the same cannot always be said for Larrabee.
If Intel goes this route, I'm wondering if I shouldn't count R600's register ports as well.

I also forgot in my previous post that writing out a thread implies writing another one in.
As such a soft switch would involve 8*512 bits worth of writing. If I assume a physical port width of 512 bits, a single port will take 8 cycles.
To switch back in a thread from the L1 with a latency of 1 cycle, reading in will take 9 cycles.
Larrabee must occupy its vector unit for 17 cycles with other work.
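The cycle count above, spelled out under the same assumptions (8 registers of 512 bits, a 512-bit physical port, 1-cycle L1 latency):

```python
# Reproducing the soft-context-switch accounting above. All figures are
# the thread's assumptions about Larrabee, not confirmed specs.
REGS = 8
REG_BITS = 512
PORT_BITS = 512     # assumed physical port width
L1_LATENCY = 1      # assumed L1 read latency, cycles

write_cycles = REGS * REG_BITS // PORT_BITS   # write old context out
read_cycles = write_cycles + L1_LATENCY       # read new context in
total = write_cycles + read_cycles            # cycles the switch must hide
```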

So, since D3D10 enforces virtualised state, perhaps Larrabee will dump pretty much the entire state into memory and let the caches and gather/scatter units take the strain.
The success of such a strategy depends on which is cheaper: ALU hardware or memory clients.
It also depends on just where that gather/scatter hardware is, and how it is implemented.


edit:
Just one comment on the transcendental thing: a lot of the register footprint would probably stick around in hidden scratch registers, or hopefully will to spare the main register files.
 
How so? Why would you have a 32-bit RGBA texture that is not aligned on 4 bytes? Actually a lot of recent hardware aligns textures on 4 KB boundaries. That's plenty of alignment.

Unaligned loads not in terms of PC alignment, but in terms of main-memory granularity, texture-cache-line granularity, vector granularity, and the fact that compressed textures don't technically have pixels aligned.

So if you have a vector unit which can only do SIMD aligned loads (like say the cell), texture fetch obviously needs to break vector alignment and do general non-vector aligned gather to fetch texture samples.
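A small sketch of why a texture fetch is "unaligned" in this sense: assuming 64-byte cache lines, 4-byte RGBA8 texels, and a naive row-major layout (real hardware tiles/swizzles textures, which helps but doesn't eliminate this), a single bilinear footprint routinely touches multiple cache lines:

```python
# Hypothetical layout parameters, just to illustrate footprint spread.
LINE = 64     # cache line, bytes
TEXEL = 4     # RGBA8 texel, bytes
WIDTH = 256   # texels per row (row-major; real HW would tile/swizzle)

def texel_addr(x, y):
    return (y * WIDTH + x) * TEXEL

def lines_touched(x, y):
    """Cache lines hit by the 2x2 bilinear footprint at (x, y)."""
    addrs = [texel_addr(x + dx, y + dy) for dy in (0, 1) for dx in (0, 1)]
    return {a // LINE for a in addrs}

two = lines_touched(10, 10)   # footprint spans two rows -> 2 lines
four = lines_touched(15, 0)   # also straddles a line within a row -> 4 lines
```

So even perfectly "aligned" texture base addresses leave the four samples scattered across lines, which is why texturing amounts to a gather.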

Interesting side question, not sure if compressed textures get kept in the texture cache compressed or uncompressed?
 