Larrabee's Rasterisation Focus Confirmed

Discussion in 'Rendering Technology and APIs' started by B3D News, Apr 24, 2008.

  1. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I never said it was SMT across SIMD units, perhaps you should reread what I said, it might make more sense.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Your description pretty much stated the expectation that there would be 4 16-wide SIMD instructions executing simultaneously.
    Unless there were 4 such units, this would not be the case.

    In the case of Larrabee's vector code, it's not significantly different.
    The behavior of the individual units in the GPU is like an SSE unit running a multi-cycle instruction with a throughput less than one per cycle, except that in this case the throughput is higher for the GPU.

    It misses the point.
    Pipelines on separate issue ports can share hardware.
    Complex operations can monopolize enough hardware to prevent all FP issue, and possibly even some complex scalar operations, depending on how much is shared.

    If it is anything like other CPUs, there will be at most one issue port per unit, and likely multiple units per port.
    The G80's clusters have a narrower issue width than Larrabee's might be, though we don't yet know what Larrabee's is.
    The comparison is also unclear because we don't know how much of G80's other hardware runs in parallel with the shader core, performing work that the standard ALUs would have to emulate in Larrabee.
     
  3. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    From my understanding, you are wrong here. For one thing, NVidia's patents clearly cover cases where multiple (2 or perhaps more) programs are run concurrently on the same microprocessor (i.e. 8-wide SIMD unit). In fact it only makes sense to interleave execution of at least 2 programs (vertex and fragment, at a minimum) to ensure you don't stall the fixed-function hardware; otherwise it would have to buffer all the intermediate data to memory (which is a really, really bad idea and surely doesn't happen). It probably even overlaps execution of different batches where they can be run in parallel (i.e. when one batch ends and there are no serialization requirements from state changes, it probably overlaps execution of the next batch).

    Second, if you look at the CUDA 2.0 beta, they have added a "streams" interface which provides a full API for running multiple kernels at the same time (even if for now this is simply overlapping execution: as one kernel starts to run out of warps, warps of the second kernel get scheduled). No doubt this got high priority, as the PhysX port to CUDA probably requires it, along with very quick DX-to-CUDA-to-DX workflow transitions. So I think it is safe to say that NVidia was simply a little behind in driver/software work to expose this functionality to developers.

    But in the general case, excluding the obvious need to overlap vertex and fragment work to keep the FF hardware busy, in many cases you would only want to overlap execution of different shaders as one program runs out of work (starts to finish), because that way you maximize your locality of data and texture-cache effectiveness. The exception to this rule would be if you could pair execution of an ALU-limited shader with a TEX/memory-bandwidth-limited shader; that case probably doesn't happen very often. There are also still some unresolved issues here as to how best to offer this to a programmer (automatically via the driver, through software hints, or under explicit software control).
     
  4. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Jawed, interesting idea, but since swizzles and in-vector reordering are not free on SSE, your vector performance would be very low programming in this AOS style.
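    The layout difference Timothy is pointing at can be sketched in plain C. This is my own toy example, not code from the thread: four dot products done "vertically" on SoA data map directly to cheap SIMD multiplies and adds, while the AOS version needs a horizontal reduction inside each vector, which is where SSE pays for shuffles.

    ```c
    #include <assert.h>

    /* Toy sketch: four 4-component vectors stored SoA (all x's, all y's, ...)
     * versus AoS ({x,y,z,w} per vector). With SoA, four dot products are pure
     * vertical math -- one multiply and one add per lane, exactly what a SIMD
     * unit does cheaply. With AoS, each dot product is a horizontal reduction
     * across one register, which on SSE costs extra shuffle instructions. */

    #define N 4

    /* SoA: dots of vectors a and b, all four results at once, no reordering.
     * The inner arithmetic maps 1:1 to vertical mulps/addps. */
    static void dot4_soa(const float ax[N], const float ay[N],
                         const float az[N], const float aw[N],
                         const float bx[N], const float by[N],
                         const float bz[N], const float bw[N],
                         float out[N])
    {
        for (int i = 0; i < N; ++i)
            out[i] = ax[i]*bx[i] + ay[i]*by[i] + az[i]*bz[i] + aw[i]*bw[i];
    }

    /* AoS: one dot product; the loop over components is the horizontal
     * reduction that SSE has to emulate with shuffles and adds. */
    static float dot4_aos(const float a[4], const float b[4])
    {
        float sum = 0.0f;
        for (int c = 0; c < 4; ++c)
            sum += a[c] * b[c];
        return sum;
    }

    int main(void)
    {
        float ax[N]={1,0,0,1}, ay[N]={0,1,0,1}, az[N]={0,0,1,1}, aw[N]={0,0,0,1};
        float bx[N]={2,2,2,2}, by[N]={3,3,3,3}, bz[N]={4,4,4,4}, bw[N]={5,5,5,5};
        float out[N];
        dot4_soa(ax, ay, az, aw, bx, by, bz, bw, out);

        float a3[4]={1,1,1,1}, b3[4]={2,3,4,5};
        assert(out[3] == dot4_aos(a3, b3)); /* same math, different layout */
        assert(out[0] == 2.0f && out[3] == 14.0f);
        return 0;
    }
    ```

    Both routines compute the same values; the point is that only the SoA version vectorizes without data movement, which is why AOS shading on SSE-style hardware tends to be slow.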
     
  5. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Uhhh, what in particular made you think that? Because I never said there would be 4 SIMD units per core or anything even remotely like that. I just said it would have 4-way SMT, and that a SIMD unit would be 16 wide.
     
  6. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I got the mention of one program counter per microprocessor directly from an nVidia employee... unless I misunderstood him, the hardware really cannot do what you're saying here. Of course it could do some kind of 'context switch' at a higher level to run multiple kernels 'at the same time'.
     
  7. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    It will be interesting to see how Larrabee hyperthreads.

    I think it is safe to assume that the end of the pipeline is a single in-order ALU.

    So at some point stalls in the single merged instruction stream will probably block the entire pipeline, which is a common problem for classical hyperthreading. Things like L2 or perhaps TLB misses will probably flush (i.e. mark as skipped) instructions from the stream for that (classic) thread (you won't know you have a miss until much later in the pipeline), leaving bubbles. To re-fill the pipeline when the other three threads get flushed, you want to be able to issue at least 2x as fast as you can retire, so I'm guessing the pipeline picks a non-stalled thread and then does a dual issue (probably even better than dual issue) each clock... this is what fills the single instruction stream flowing through the in-order processor.

    I'm thinking that Larrabee cannot afford to know about cache misses early enough in the pipeline to issue only from threads which won't stall (though it obviously can avoid issuing from threads which are already stalled). It would simply add too much latency, and there simply aren't enough threads to schedule from to hide the latency. Combine this with the lack of re-order ability, and you have a problem.

    This is one key advantage of NVidia's design: it can accept high per-thread latency and schedule only threads which are stall-free.

    Having an idle processor on an L2 cache miss is a common problem on CPUs, one which hyperthreading doesn't really solve (the cache is usually 1/hyperthreads as effective). Combine this with the problem of prefetching across page boundaries, and you cannot really hide this latency; you simply have to minimize your cache misses and eat the miss cost with an idle core.
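    The "issue from whichever thread isn't already stalled" policy in the post can be modeled with a few lines of C. This is purely my own toy round-robin sketch (nothing here is Larrabee's actual scheduler): one thread takes a long miss, the picker skips it until the miss resolves, and the remaining threads absorb the bubble.

    ```c
    /* Toy SMT issue model: each cycle, round-robin to the next thread that
     * is not stalled and issue from it; a thread waiting on a miss is
     * skipped until its stall_until cycle. All parameters are invented. */
    #include <assert.h>

    #define NTHREADS 4

    typedef struct { int stall_until; int issued; } Thread;

    /* Returns the thread chosen this cycle, or -1 if every thread is
     * stalled (the core sits idle -- the case the post worries about). */
    static int pick_thread(Thread t[NTHREADS], int cycle, int last)
    {
        for (int i = 1; i <= NTHREADS; ++i) {
            int cand = (last + i) % NTHREADS;
            if (t[cand].stall_until <= cycle)
                return cand;
        }
        return -1;
    }

    int main(void)
    {
        Thread t[NTHREADS] = {{0,0},{0,0},{0,0},{0,0}};
        t[1].stall_until = 8;            /* thread 1 took a cache miss */
        int last = NTHREADS - 1, idle = 0;

        for (int cycle = 0; cycle < 16; ++cycle) {
            int who = pick_thread(t, cycle, last);
            if (who < 0) { ++idle; continue; }
            t[who].issued++;             /* one issue slot filled */
            last = who;
        }
        assert(t[1].issued < t[0].issued); /* the stalled thread fell behind */
        assert(idle == 0);                 /* the other three covered the bubble */
        return 0;
    }
    ```

    With only four threads, one long miss already skews issue noticeably; that scarcity of threads is exactly why the post argues Larrabee may struggle to hide L2 miss latency this way.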
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I've quoted this before. Perhaps there's been some miscommunication, but I don't see how I could have misread this.

     
  9. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    The thing is, that same post also says:
    "However, a single logical instruction for Larrabee can be up to 16 operands wide".
    I wanted to say something like "where each unit is up to 16 operands wide" or "where each SIMD unit is 16 wide", but apparently that particular sentence was a bit too terse.
    From the context and all my other posts, however, it should be clear that I never once was speaking about multiple SIMD units per core. So it's really annoying that you keep bothering me about this single sentence, which perhaps could have been a tad clearer... but alas, I didn't see it at the time, and by now it's too late to edit it. You will just have to live with it and accept that you did indeed misunderstand what I have been saying all along, and that I don't think this one sentence is a good enough excuse, because that would mean you ignored the rest of that post, as well as all my other posts.
     
  10. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Maybe they meant that all the "threads" in a warp on a microprocessor have to have the same program counter, which is obviously true...

    Or maybe all warps of a given program/shader/kernel have to have the same program counter?
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Presumably it'll get fixed.

    http://www.intel.com/technology/itj/2007/v11i4/7-future-proof/8-action.htm

    Jawed
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    If you take the total workload of a current GPU, the ALU code written by the developer represents, say, 20%. This is imagining that all the fixed-function units were transformed into ALU code + fetches/stores. Hell, it could be 10%.

    The way I see it, Larrabee only needs to evenly distribute "fixed function" tasks across its cores in order to be able to hide memory-fetch (or branching) latencies. Ideally these fixed-function tasks need to be running on-die, so that they don't also stall (or don't stall very often).

    We might see two threads per core as "fixed-function" with another two operating as "shader-code" slaves. Each pair might act as a "hyperthread-pair" (filling the scalar path and the SIMD path) and the pairs will then time-slice against each other.

    Jawed
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I read the entire post, and grammatically there was nothing that truly contradicted that initial statement or the overall tenor of your argument, which I believe oversells the differences between the possible implementations of Larrabee and G80's clusters; the descriptions I've seen in earlier reviews indicate the clusters are more flexible when it comes to in-flight threads than you imply.

    Forgive me if one point I've made out of many offends you. I do not believe your annoyance is justified, since the last time I quoted that passage was in direct response to a question you asked me.
     
  14. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Without far more threads, I just don't see any of this working. The only real alternative to managing outstanding memory requests implicitly through threads is handling them explicitly through software prefetch/pipelining (you need dozens of outstanding prefetches per core, far more than present CPUs support).

    It will make life hard on the compiler, though. Kinda funny: an x86-based design which ends up harder to program for.
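    The explicit-prefetch alternative MfA describes looks something like the sketch below. This is an assumption-laden illustration, not anything from the thread: `__builtin_prefetch` is a real GCC/Clang builtin, and the prefetch distance of 16 elements is an arbitrary tuning choice.

    ```c
    /* Sketch of explicit software prefetching: instead of extra hardware
     * threads hiding memory latency, the loop issues prefetches a fixed
     * distance ahead of use. The distance (16 elements) is a made-up
     * tuning parameter; real code would size it to memory latency. */
    #include <assert.h>
    #include <stddef.h>

    #define PREFETCH_DIST 16

    static long sum_with_prefetch(const int *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; ++i) {
    #if defined(__GNUC__)
            /* Hint: read access (0), low temporal locality (1).
             * A no-op hint on compilers without the builtin. */
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
    #endif
            sum += a[i];  /* by now a[i] should already be in cache */
        }
        return sum;
    }

    int main(void)
    {
        int a[1000];
        for (int i = 0; i < 1000; ++i) a[i] = i;
        assert(sum_with_prefetch(a, 1000) == 999L * 1000 / 2);
        return 0;
    }
    ```

    The compiler burden MfA mentions is visible even here: the prefetch distance, the guard against prefetching past the array, and the pipelining of loads against arithmetic are all things a thread-rich GPU gets for free from its scheduler.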
     
  15. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    You cannot be suggesting that graphics programmers' shader code will only represent 10-20% of ALU and memory-bandwidth usage on Larrabee?

    IMO, for Larrabee to be successful as a GPU, the overhead of software fixed function (FF) had better be awfully tiny, i.e. under 20%. In that case 80% or more of the time will be spent running vertex and fragment shaders (i.e. doing real work). So if you cannot do pure texture-fetch latency hiding while running shaders alone, something is dreadfully wrong. I just cannot see the fixed-function workload being all that important for latency hiding (i.e. it should only capture a small fraction of the latency hiding).
     
  16. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    I think there's definitely confusion when people talk of program counters. Take G80 for example. It's possible for it to have multiple warps in flight, each with its own program counter, yet there is only one program counter active for each pipe stage. New warps are swapped in as needed. This still fits with what Scali has been told by Nvidia. For a given pipe stage, all threads (as Nvidia defines them) operate in a SIMD fashion with a single program counter.
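    The distinction here — many warps in flight, each with its own PC, but only one PC active per stage per cycle — can be made concrete with a small model. This is entirely my own sketch, not G80's real scheduler: one warp is swapped in per cycle, its private PC advances, and everything else waits.

    ```c
    /* Toy warp-scheduler model: NWARPS warps each keep a private program
     * counter, but exactly one warp (hence one PC) is active per cycle.
     * Returns the number of cycles until every warp has finished its
     * (invented) program length under naive round-robin scheduling. */
    #include <assert.h>

    #define NWARPS 4

    typedef struct { int pc; int done; } Warp;

    static int run_warps(const int program_len[NWARPS])
    {
        Warp w[NWARPS] = {{0,0},{0,0},{0,0},{0,0}};
        int remaining = NWARPS;
        int cycle = 0;
        while (remaining > 0) {
            int chosen = cycle % NWARPS;   /* one warp swapped in per cycle */
            if (!w[chosen].done) {
                w[chosen].pc++;            /* only this warp's PC advances */
                if (w[chosen].pc == program_len[chosen]) {
                    w[chosen].done = 1;
                    --remaining;
                }
            }
            ++cycle;
        }
        return cycle;
    }

    int main(void)
    {
        int lens[NWARPS] = {3, 5, 2, 4};   /* instructions per warp */
        /* 14 instructions total; round-robin wastes the slots of warps
         * that already finished, so completion takes 18 cycles here. */
        assert(run_warps(lens) == 18);
        return 0;
    }
    ```

    From the outside, each warp behaves like an independent thread of control with its own PC; inside a given pipe stage, only one PC ever exists at a time — which reconciles the two descriptions in the thread.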
     
  17. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    From a software point of view GPUs would have multiple logical contexts. These contexts just won't occupy the same pipe stage. To software the differences in how the hardware handles the contexts could be hidden.
     
  18. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Getting back to software rendering on Larrabee, I think I have a better idea of how they are planning to handle small polygons efficiently, at least from the ROP side.

    Given that we are assuming here that Larrabee is a tiled renderer, and that vector load/store goes directly to L2, I'm going to also assume that the ROP/OM tile is fully L2-cached. There is then probably no sense in keeping the framebuffer tile in anything other than floating point (again, pack/unpack would be too expensive in software), with the final pack to a smaller format done at resolve time. In that case it might make sense to store the cached framebuffer tile as 2x2 pixel quads per vector, {RGBA RGBA RGBA RGBA}, which would give us proper 2x2 granularity with vector-aligned memory operations in the ROP. If the shaders are 16-pixel SOA style, I'm guessing the shader-SOA to ROP-AOS reorder would take only 2 instructions per 2x2 quad.

    Jawed,

    Perhaps you are right; maybe Larrabee's new vector ISA is going to provide for better AOS-style operations. If that were true then perhaps shaders will also run in 2x2-pixel-quad {aaaa, bbbb, cccc, dddd} style. This is about the only way I can see them getting good performance out of small triangles; otherwise the overhead of re-grouping 2x2 quads into vectors could easily exceed the work done in the shader itself.

    Does Jawed or anyone else have any ideas on how this type of opcode encoding could be done efficiently?
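    The repack Timothy is describing can be written out in plain C to make the data movement visible. Everything here is an assumption from the post, not a known Larrabee format: a 16-pixel shader block held as four SoA vectors (all R, all G, all B, all A) is rewritten as four AoS vectors, one per 2x2 quad, each holding {RGBA RGBA RGBA RGBA}. The scalar loops stand in for the vector shuffles that would do this on real hardware.

    ```c
    /* Sketch of the SoA -> quad-AoS reorder: 16 pixels of SoA color data
     * become four 16-float vectors, one per 2x2 quad, RGBA interleaved.
     * Assumes the 16 SoA lanes are already grouped quad-major (lanes 0-3
     * are quad 0, lanes 4-7 quad 1, ...), which is itself a guess. */
    #include <assert.h>

    #define PIXELS 16  /* one 16-wide SIMD block = four 2x2 quads */

    static void soa_to_quad_aos(const float r[PIXELS], const float g[PIXELS],
                                const float b[PIXELS], const float a[PIXELS],
                                float out[4][PIXELS]) /* 4 quads x 16 floats */
    {
        for (int q = 0; q < 4; ++q)         /* one output vector per quad */
            for (int p = 0; p < 4; ++p) {   /* 4 pixels per 2x2 quad */
                int src = q * 4 + p;        /* lane index in the SoA block */
                out[q][p * 4 + 0] = r[src];
                out[q][p * 4 + 1] = g[src];
                out[q][p * 4 + 2] = b[src];
                out[q][p * 4 + 3] = a[src];
            }
    }

    int main(void)
    {
        float r[PIXELS], g[PIXELS], b[PIXELS], a[PIXELS], out[4][PIXELS];
        for (int i = 0; i < PIXELS; ++i) {
            r[i] = i + 0.0f;  g[i] = i + 0.25f;
            b[i] = i + 0.5f;  a[i] = i + 0.75f;
        }
        soa_to_quad_aos(r, g, b, a, out);

        /* Pixel 5 lands in quad 1, slot 1: its RGBA must be contiguous. */
        assert(out[1][4] == 5.0f  && out[1][5] == 5.25f);
        assert(out[1][6] == 5.5f  && out[1][7] == 5.75f);
        return 0;
    }
    ```

    Written as scalar loops this is 64 moves, which is exactly why the post's estimate of "2 instructions per 2x2 quad" would require the vector ISA to provide wide cross-lane shuffle or interleave operations.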
     
  19. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    246
    Likes Received:
    61
    That was pretty much the first thing I wondered about when Larrabee specs first started coming out. From the PPT posted above, G80 has 24 (CUDA or PS; I guess 48 vertex) warps, and "a minimal of 13 Warps are needed to fully tolerate 200-cycle memory latency". It's true that Larrabee looks to have a far larger / more fully featured cache hierarchy than any current GPU, but I just can't see that being too much of a help for graphics loads. The main devil is textures, where caches will help you greatly within a triangle, and do nothing / hurt you across triangles. I guess that also gets into whether (and what sort of) fixed-function texturing unit Larrabee will have.

    GPUs' main design point is having lots of threads (/warps) in flight so that you can always be doing useful work while waiting for a full round trip from main memory. If you don't have enough threads, or they're not doing enough work between memory accesses, you're screwed.
    CPUs' main design point is having lots of cache so you never have to touch main memory. If you don't have enough cache, or your cache isn't prefetching the data you need, you're screwed.

    Larrabee looks to be fairly close to the CPU design point, so the question is whether Intel will be able to structure the programs such that caching beats massive multithreading.
     
  20. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,563
    Likes Received:
    171
    Location:
    In the Island of Sodor, where the steam trains lie
    There's usually some coherence between triangles (or else you aren't making efficient use of vertex caching), so there's still likely to be some re-use of cached texture data from triangle to triangle. Of course, there's more to gain on a TBDR, but even traditional renderers will get some benefit.
     