Larrabee's Rasterisation Focus Confirmed

Discussion in 'Rendering Technology and APIs' started by B3D News, Apr 24, 2008.

  1. trinibwoy

    trinibwoy Meh Legend

    Found the thread. Doesn't seem to support your argument though, just a bunch of people who seem to believe the same thing I do....including Bob :)

     
  2. 3dilettante

    3dilettante Legend Alpha

    I'm merely trying to find some form of equivalence.
    Since they're going to be running the same workload, some kind of rough metric for the effective number of ports would be an interesting data point.

    Intel's such a dominant player that if any rare app happens to run into this problem, it can usually count on performance-sensitive developers to bend over backwards to work around it.

    If it can physically fit in a rather tiny core, sure.
    Each port at a minimum is already 4 times larger than a port on a standard x86.
    It's not a trivial matter to suddenly decide to double things up or slather on enough ports for 4 simultaneous accesses from different threads.
     
  3. aaronspink

    aaronspink Veteran

    And Intel has already announced they are improving the performance of unaligned loads in future products.

    But the truth is that unaligned loads really are quite rare and, except for a very limited number of cases, avoidable.

    As far as TLBs go, I don't think they should be an issue in the graphics space, where you are likely to use either superpages or a basic 1:1 mapping.

    Aaron Spink
    speaking for myself inc.
     
  4. Andrew Lauritzen

    Andrew Lauritzen Moderator Moderator Veteran

    Actually I'd be really surprised if this is true as well. In particular I've seen a noticeable difference in speed in rendering a single triangle that covers the full screen rather than two triangles in a full screen quad, due to the redundant diagonal quads issue. The G80 rasterizer research page linked earlier says the same thing. Maybe it's different with ATI hardware though as I haven't tested it fully. It does seem like a bit of a jump to infer that such an optimization can be made in the general case though, particularly with arbitrary user shaders involved at both the vertex and pixel level.
     
  5. TimothyFarrar

    TimothyFarrar Regular

    Really? ;)

    Texturing is basically unaligned loads + gather. Kind of makes up a majority of the loads in most fragment shaders.
     
  6. TimothyFarrar

    TimothyFarrar Regular

    Yeah, I've also seen results which seem to support this, i.e. framebuffer overdraw visualizations (which include fragments that fail the pixel coverage test) showing the common blocky pattern along edges. Perhaps for large triangles the fragment packing doesn't work, while for really small tris it does.
     
  7. nAo

    nAo Nutella Nutellae Veteran

    It's an optimization that obviously can't be done all the time, but modern hardware can do it.. according to what I was told :)
     
  8. Jawed

    Jawed Legend

    I agree it could work that way. With mirroring I don't think it needs to work that way.

    If it isn't working this way now, it may work this way in the future - the key thing is to get "superscalar ALU+TMU" instruction issue.

    This may be why MUL is/was missing, it can only work in GPUs where MAD + SF are superscalar.

    G80 might be superscalar across MAD + SF but it only ever issues from the mirrored pair of warps. Which makes it no different from a co-issue, because that's a trivial pairing of one instruction type from A with the other type from B.

    If it's genuinely able to issue MAD + SF without being constrained by a mirrored pair then I will concede it is truly out of order and not a co-issue setup. Evidence gladly welcomed :grin:

    If G80 or derivatives are superscalar across MAD+SF+TMU then the scoreboarding/operand-gather/instruction-issue hardware is clearly even more hairy than I thought :razz:

    Sorry, I didn't mean to focus on missing MUL - merely that it's a nice example of why compilation gets hairy.

    If compilation is simplified by a fully superscalar ALU configuration then it's re-complicated by the asymmetry of the ALUs, the register file bandwidth (remembering to include constant cache and parallel data cache bandwidths may ameliorate this), read-after-write latency and branching.

    Jawed
     
  9. Jawed

    Jawed Legend

    Thanks, nice explanation.

    Woah. OK, so what does "near equal" mean? Does OGL or D3D specify precision?

    And this is something that ATI does but NVidia doesn't? What would be the visual artefacts arising from this? I'm guessing the common edge might make itself seen.

    Jawed
     
  10. Jawed

    Jawed Legend

    We don't really have much idea about instruction and batch throughput. e.g. I think it's prolly 4x vec4 issue per SIMD. Will there be a transcendental unit?

    For all we know the 4 hardware threads are actually used to "emulate" the highly threaded operation of a GPU, e.g. each hardware thread is actually able to context switch amongst 16 soft threads. Would it be possible to soft-context-swap a set of, say 8, registers into memory (cache)? Could Larrabee successfully hide the latency of such a swap because the set of 4 hardware threads always has at least 2 that are out of hardware context?

    Jawed
     
  11. Jawed

    Jawed Legend

  12. TimothyFarrar

    TimothyFarrar Regular

    Are we talking about one micro-poly which happens to rasterise to 4 fragments in different 2x2 quads being repacked into one quad? This seems possible in hardware since the plane equation (the interpolation source data) is the same. Not sure if the hardware can do this? Guessing yes?

    Second case is two adjacent micro-polys which share an edge in one 2x2 quad with different plane equations, guessing these don't get repacked.

    Or am I completely off base in this?
     
  13. 3dilettante

    3dilettante Legend Alpha

    To rephrase: one vector instruction can be issued to a vector unit.
    Since it seems some Intel figures have stated publicly that an FMAC instruction is available on Larrabee, we can map that to the 8-16 DP figure given in the earlier Larrabee slide, which--barring some flaky issue restrictions--indicates each core has one vector unit.

    Could be. None of the slides go into that.
    There are a number of ways that can go.
    It could be fully separate, or complex ops can share hardware with the FMAC unit.

    8 vector registers?
    By soft-context-swap, you mean have each master thread emulate a context switch with successive writes?
    The effectiveness of such a solution could depend on a lot of things, such as the physical port count.
    Unless there is a form of bulk write that can write multiple registers to memory, we're talking about a soft-context switch that will take up 8 port cycles and 8 instructions out of the core's issue bandwidth.
    Possibly, if there is a large load/store buffer, the successive writes can wait to commit to memory and take advantage of a wider cache port.

    If we assume two fully active threads, one thread could do a switch while the other continued working, assuming an internal width of at least 2 threads-worth of resources and two data cache ports.
    That would require that the other active thread keep active for 8 cycles to hide this switch, or only 4 if it doesn't use any cache bandwidth at all and sticks to within the reg file.

    If we crudely link the vector registers' capacity to an equivalent number of 4-channel FP32 elements, that's between 32 and 16 elements, going by what you posited with only 2 threads fully running.
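The register-capacity figure above can be sanity-checked with some quick arithmetic. A minimal sketch, assuming the numbers speculated in this thread (8 vector registers of 512 bits per hardware context), which are not confirmed Larrabee specs:

```python
# Back-of-envelope check of the register-capacity claim above.
# Assumed (from the thread, not confirmed specs): 8 vector registers
# per hardware context, each 512 bits wide.
REG_BITS = 512
REGS_PER_CONTEXT = 8
FP32_BITS = 32

lanes_per_reg = REG_BITS // FP32_BITS                # 16 FP32 lanes per register
fp32_per_context = REGS_PER_CONTEXT * lanes_per_reg  # 128 FP32 scalars
vec4_per_context = fp32_per_context // 4             # 32 vec4 elements

print(vec4_per_context)  # 32 with all 8 registers, 16 with half of them
```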
     
  14. Barbarian

    Barbarian Regular

    How so? Why would you have a 32-bit RGBA texture that is not aligned on 4 bytes? Actually a lot of recent hardware aligns textures on 4KB boundaries. That's plenty of alignment.
     
  15. nAo

    nAo Nutella Nutellae Veteran

    This is because you spend too much time on a console with crazy alignment requirements, try to work on the other one as well! ;)
     
  16. Jawed

    Jawed Legend

    Yeah, in single-precision terms:

    ((x,y,z,w),(x,y,z,w),(x,y,z,w),(x,y,z,w))

    I wonder how important double precision is. Generally Intel seems to be keeping a keen eye on it yet it makes transcendentals hairier, biasing implementation away from the pipelined designs we see in GPUs.

    LOL, with 16 lanes in a SIMD you could do a funky set of parallel terms (one per lane) for a polynomial to produce one transcendental every few clocks...

    Might as well link this as I just ran into it:

    http://developer.intel.com/technology/itj/q41999/pdf/transendental.pdf

    Interestingly only atan has degree more than 16 for double-precision: 22 :razz:
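The "one term per lane" idea above can be made concrete with a toy sketch. This is purely illustrative: it uses Taylor coefficients for sin (real hardware would use minimax polynomials), with each lane computing one term c_k * x^k and a horizontal reduction summing the lanes:

```python
import math

# Toy illustration of evaluating one polynomial term per SIMD lane:
# 16 lanes each compute c_k * x**k, then a horizontal add reduces them.
# Taylor coefficients of sin are used here just for illustration.
def sin_parallel_terms(x, lanes=16):
    coeffs = [0.0] * lanes
    for m in range((lanes + 1) // 2):
        k = 2 * m + 1                      # sin has only odd-degree terms
        if k < lanes:
            coeffs[k] = (-1.0) ** m / math.factorial(k)
    terms = [c * x ** k for k, c in enumerate(coeffs)]  # one term per "lane"
    return math.fsum(terms)                             # horizontal reduction

print(sin_parallel_terms(0.5), math.sin(0.5))
```

With degree up to 15, the approximation error near zero is far below FP32 precision, so the scheme trades lane utilization for low latency per result, much as the post suggests.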

    Yeah, 8x 512-bit registers populated to form a hardware context from one of many virtualised states.

    D3D10 requires support for 4096 128-bit registers per object.

    Since Intel has to implement virtualised shader state then it might go one further and virtualise threads by creating a pool of software contexts.

    Hmm, as far as a software-GPU is concerned this should be entirely up for grabs - what does Swiftshader do? Presumably Intel is retaining SSE functionality so it's really a matter of the most advantageous way to use soft contexts (if it makes any sense for Larrabee-as-GPU).

    Perhaps this is all centred on the gather and scatter units? By their nature they have to do wide operations against memory (cache) so that average bandwidth of operands gathered/scattered is in the ballpark of ALU operand bandwidth.

    Whereas G80 uses an operand window (gather window) between register file and SIMD, perhaps placing the operand window between cache and register file is the solution for Larrabee? All SIMD instructions, if they run solely from/to register file, have guaranteed operand bandwidth and no gather/scatter headaches. I'm guessing this is more like how SSE uses its register file (I really don't have a good understanding of SSE implementations :oops: ).

    Yeah that's the kind of thing I was thinking. The register file is two-way, 8x 512-bits per hardware context, with the fetches and stores running on the "idle" hardware context. These fetch/store SSE instructions would actually be executed by the gather/scatter units.

    With the gather and scatter units being the interface to the real world for all data, the SIMD can cosy up to a very small register file - presumably much like SSE's SIMD does. Double-threading the SIMD is obviously going to complicate things but by keeping the count of in-flight registers tiny (unlike a GPU) Intel avoids the register file explosion that we see in R600, where the register file amounts to 1MB effective (and it has at least 3 read ports it seems though I still haven't untangled whether that's 3 physical ports or 3 emulated ports).

    So, since D3D10 enforces virtualised state, perhaps Larrabee will dump pretty much the entire state into memory and let the caches and gather/scatter units take the strain.

    Jawed
     
  17. Barbarian

    Barbarian Regular

    That would be the most logical thing to do. Current GPUs struggle with constant register indexing, something that would be trivial with a virtualized register file.
    If the rumours of 1-cycle L1 reads plus reg-mem vector instructions are true, that would effectively give a 32KB register file per core.
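To spell out the figure above: if reg-mem vector operands can come straight from a 1-cycle L1, the whole 32KB L1 behaves like an extended register file. A quick check of how many 512-bit "registers" that amounts to (the 512-bit vector width is an assumption carried over from this thread):

```python
# How many vector-register-sized slots a 32KB L1 holds, assuming
# 512-bit vectors (a figure speculated earlier in this thread).
L1_BYTES = 32 * 1024
VEC_BYTES = 512 // 8   # 64 bytes, one cache line per vector

slots = L1_BYTES // VEC_BYTES
print(slots)  # 512 vector-sized slots vs. a handful of architectural registers
```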
     
  18. MfA

    MfA Legend

    I kinda doubt it can sustain 1-cycle reads for vectorized accesses.
     
  19. 3dilettante

    3dilettante Legend Alpha

    Hopefully they stay pipelined, otherwise one thread's transcendental function is going to completely monopolize the one vector unit for quite some time.

    Perhaps a microcode instruction could spit out the necessary operations.
    Just for funzies, I tried by hand to fold the optimal scheduling of the polynomial evaluation in table 2 across multiple SIMD lanes.
    I haven't really gone too in-depth, but it seems the scheduling could be done with 3 vector FMACs and one potentially scalar FMAC.
    The downside is that it would require some hefty permutes between each operation, and each successive operation uses fewer and fewer lanes, so utilization plummets; the last FMAC would only use one lane.
    So long as each operation is pipelined, other work could be overlaid in the latency periods between each op.
    I'd hope the FMAC is pipelined, but would the permutes?
    The latency would be the sum of the FMAC and permute latencies.

    The other option is to use precisely one lane for N different transcendentals, though this would involve burning up a vector register for each term across 16 elements.
    Once a value is no longer needed, its register could be reused, though the register footprint would still be wider.
    It does avoid the permute stuff, though.
    The latency then is that of 8 FMACs and 3 MULs, though the throughput would be that latency divided by 16.
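The second option above - one independent transcendental per lane, each lane running the full FMA chain - can be sketched as a lane-wise Horner evaluation. The exp Taylor coefficients here are illustrative only; the FMA-chain structure is the point:

```python
import math

# Sketch of "one transcendental per lane": 16 lanes each evaluate the
# same polynomial on an independent input via a Horner chain of FMAs.
# Latency is one chain; throughput is 16 results per chain.
COEFFS = [1.0 / math.factorial(n) for n in range(12, -1, -1)]  # degree-12 exp Taylor

def horner_lanewise(xs):
    accs = [COEFFS[0]] * len(xs)     # broadcast leading coefficient
    for c in COEFFS[1:]:
        # one vector FMA per step: acc = acc * x + c across all lanes
        accs = [a * x + c for a, x in zip(accs, xs)]
    return accs

xs = [i / 16 for i in range(16)]     # 16 independent inputs, one per lane
ys = horner_lanewise(xs)
print(max(abs(y - math.exp(x)) for x, y in zip(xs, ys)))
```

Note this form needs no permutes at all, matching the post's observation, at the cost of holding a full vector's worth of state per polynomial term in flight.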

    Some buffering is already done in the load/store units of x86s, though even one or two vector registers would be more than enough to exceed their capacity.
    One speculative future direction the CPU manufacturers have bandied about is an L0 operand cache.

    As such hardware is on a critical signal path, port width and buffering is used carefully.

    Since SSE can have one memory operand, the hardware can draw operands from memory, register file, or the bypass network.
    SSE currently has no scatter/gather headaches because it can't do scatter/gather.
    Just load multiple values and shift them around to gather, or do the reverse for scatter.
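The load-and-shuffle emulation described above can be modelled in a few lines. This is a toy sketch, not real SSE code: plain lists stand in for memory, and the aligned-load plus element-select pair stands in for the load/shuffle sequence:

```python
# Toy model of software gather: with no gather instruction, each element
# is fetched by an aligned "line" load followed by a select/shuffle.
LINE = 4  # elements per notional cache line

def soft_gather(memory, indices):
    out = []
    for i in indices:
        base = (i // LINE) * LINE
        line = memory[base : base + LINE]  # aligned load of the whole line
        out.append(line[i % LINE])         # shuffle/select the wanted element
    return out

memory = list(range(100, 132))
print(soft_gather(memory, [3, 17, 4, 30]))  # [103, 117, 104, 130]
```

The cost structure is visible even in the toy: one full-width load plus one permute per gathered element, which is exactly why a hardware gather unit is attractive.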

    I'd almost expect the register save/restore to be aligned, since the vectors match the cache line width.
    If we assume the registers are sequential, the entire save/restore wouldn't even require any real scatter/gather, just a simple add to a base address.

    Not really. The SIMD can't be double-threaded, as the core can only issue one instruction to the unit.
    Threads will just alternate on the issue port.

    The downside to doing this is that Intel's emulated expanded register space means even virtual shuffling of register state involves monopolizing a memory client for some time.
    R600's register shenanigans frequently happen in parallel with the activity of other memory clients.
    Depending on port count, the same cannot always be said for Larrabee.
    If Intel goes this route, I'm wondering if I shouldn't count R600's register ports in such a comparison as well.

    I also forgot in my previous post that writing out a thread implies reading another one in.
    As such, a soft switch would involve 8*512 bits worth of writing. If I assume a physical port width of 512 bits, a single port will take 8 cycles.
    To switch back in a thread from the L1 with a latency of 1 cycle, reading in will take 9 cycles.
    Larrabee must occupy its vector unit for 17 cycles with other work.
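The cycle count above, spelled out as arithmetic (using the post's own assumptions: 8 vector registers, one 512-bit port, 1-cycle L1 latency):

```python
# Soft context-switch cost under the assumptions stated in the post:
# 8 registers of 512 bits, one 512-bit port, 1-cycle L1 read latency.
REGS = 8
WRITE_CYCLES = REGS       # write the old context out, one register per cycle
READ_CYCLES = REGS + 1    # read the new context in, plus 1 cycle of L1 latency

total = WRITE_CYCLES + READ_CYCLES
print(total)  # 17 cycles the vector unit must cover with other work
```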

    The success of such a strategy depends on which is cheaper: ALU hardware or memory clients.
    It also depends on just where that gather/scatter hardware is, and how it is implemented.


    edit:
    Just one comment on the transcendental thing: a lot of the register footprint would probably stick around in hidden scratch registers, or hopefully would, to spare the main register files.
     
    Last edited by a moderator: May 7, 2008
  20. TimothyFarrar

    TimothyFarrar Regular

    Unaligned loads not in terms of PC alignment, but in terms of main memory granularity, texture cache line granularity, vector granularity, and the fact that compressed textures don't technically have pixels aligned.

    So if you have a vector unit which can only do SIMD-aligned loads (like, say, Cell), texture fetch obviously needs to break vector alignment and do a general non-vector-aligned gather to fetch texture samples.
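The point above is easy to see in even the simplest filtered fetch. A minimal sketch of a bilinear lookup on a single-channel texture: the four texels land on different rows (and often different cache lines), so the access pattern is a general gather, not one aligned vector load:

```python
# Sketch: a bilinear texture fetch gathers 4 texels at unrelated
# addresses (two rows apart), then blends them by the fractional coords.
def bilinear_fetch(texture, width, u, v):
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0
    # four gathered loads at addresses that need not be contiguous
    t00 = texture[y0 * width + x0]
    t10 = texture[y0 * width + x0 + 1]
    t01 = texture[(y0 + 1) * width + x0]
    t11 = texture[(y0 + 1) * width + x0 + 1]
    top = t00 * (1 - fx) + t10 * fx
    bot = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bot * fy

tex = [float(i) for i in range(64)]   # toy 8x8 gradient texture
print(bilinear_fetch(tex, 8, 2.5, 3.5))  # 30.5
```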

    Interesting side question: are compressed textures kept in the texture cache compressed or uncompressed?
     