Larrabee's Rasterisation Focus Confirmed

Discussion in 'Rendering Technology and APIs' started by B3D News, Apr 24, 2008.

  1. Jawed

    Jawed Legend

    No, that's marketing bullshit. A thread in G80 is a group of 16 primitives, vertices or fragments. G80 joins two threads to make a "warp" when doing pixel shading, hence batch size (and dynamic branching coherency) of 32 during pixel shading.

    It looks very likely to be 16 objects in a thread. The SIMDs, in single-precision, appear likely to simultaneously process 4 objects, each vec4, per clock.

    Jawed
     
  2. Scali

    Scali Regular

    You might want to read the whole thread first, Jawed... *sigh*
     
  3. pcchen

    pcchen Moderator Moderator Veteran Subscriber

Well, actually it's not that bad. The only place where different "threads" need to be serialized is when control flow goes different ways. Since it has gather/scatter support for its shared memory, there's no serialization for accessing shared memory as long as there's no bank conflict.

    Anyway, from the perspective of programming, it's reasonable to consider them individual "threads." It's just like normal threads in an operating system: an OS can have hundreds or thousands of threads "running at once," but of course they are not. Just don't confuse "programming threads" with "hardware threads" and one will be fine.

    Actually, NVIDIA's programming guides state this quite explicitly. Otherwise we wouldn't know it's a SIMD processor.
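    The bank-conflict serialization pcchen mentions can be sketched with a toy model. This is illustrative Python, not vendor code; it assumes the G80-era layout of 16 banks interleaved on 32-bit words, with a half-warp of 16 threads accessing shared memory in one step, and all reads of the same address counting as a broadcast.

    ```python
    # Toy model (assumed parameters, not documented hardware): 16 shared-memory
    # banks interleaved on 32-bit words; a half-warp of 16 threads accesses them
    # together. Distinct addresses landing in the same bank serialize the access.

    NUM_BANKS = 16
    WORD_BYTES = 4  # banks are interleaved on 32-bit words

    def access_steps(byte_addresses):
        """Return how many serialized steps a half-warp's accesses take."""
        assert len(byte_addresses) <= 16
        banks = {}
        for addr in byte_addresses:
            bank = (addr // WORD_BYTES) % NUM_BANKS
            banks.setdefault(bank, set()).add(addr)
        # One step per distinct address within the most-contended bank;
        # all threads reading one address is a broadcast (a single step).
        return max(len(addrs) for addrs in banks.values())

    print(access_steps([4 * i for i in range(16)]))  # 1: thread i reads word i
    print(access_steps([8 * i for i in range(16)]))  # 2: stride-2 words, 2-way conflict
    print(access_steps([0] * 16))                    # 1: broadcast of one word
    ```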
     
  4. aaronspink

    aaronspink Veteran

    not that bad when they are all running different instructions?

    quite honestly, I don't have a big issue calling them blue gooses as far as a programming interface goes, my only issue comes when discussing the actual hardware.

    Aaron Spink
    speaking for myself inc.
     
  5. Jawed

    Jawed Legend

    For the sake of utilisation, SoA makes sense, and any SSE instruction could be translated to SoA.

    Having said that, though, wouldn't Intel make Larrabee as close to "no-change" from SSE as possible? i.e. a vec4 (single-precision) SIMD? Larrabee, presumably, extends the SIMD from 1 vec4 resultant per clock to 4 vec4s per clock.

    Is it reasonable to suppose that SSE code will run unchanged on Larrabee?

    Jawed
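    The AoS-versus-SoA distinction Jawed raises can be shown in a few lines. This is an illustrative Python sketch (not SSE, and the function names are mine): in AoS one SIMD register holds one vertex's (x, y, z, w), while in SoA each register holds the same component from several vertices, so a scalar-style instruction stream operates on all lanes at once.

    ```python
    # Sketch of AoS vs SoA layouts. In SoA, one "instruction" per component
    # touches every vertex in the batch, which is how wide SIMD stays utilised
    # even when the per-vertex maths doesn't use all four components.

    def aos_to_soa(vertices):
        """vertices: list of (x, y, z, w) tuples -> dict of component lanes."""
        xs, ys, zs, ws = zip(*vertices)
        return {"x": list(xs), "y": list(ys), "z": list(zs), "w": list(ws)}

    def soa_add(a, b):
        """Component-wise add across all lanes, one 'instruction' per component."""
        return {k: [u + v for u, v in zip(a[k], b[k])] for k in a}

    a = aos_to_soa([(1, 2, 3, 4), (5, 6, 7, 8)])
    b = aos_to_soa([(10, 20, 30, 40), (50, 60, 70, 80)])
    print(soa_add(a, b)["x"])  # [11, 55]
    ```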
     
  6. Jawed

    Jawed Legend

    In CUDA we know that there's a limit of 24 warps per multiprocessor - but I'm not aware of documentation that states the maximum number of threads each multiprocessor supports.

    Jawed
     
  7. aaronspink

    aaronspink Veteran

    from the presentation triniboy linked:

    max of 8 blocks, min of 1 block
    max of 768 "threads", min of 1 "thread"

    but if we are talking actual hardware threads, it appears a warp is a hardware thread. So the maximum number of threads is 24.
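    The limits quoted above fit together arithmetically. A quick sketch (assuming the 32-wide CUDA warp; constant names are mine):

    ```python
    # Back-of-envelope check of the per-multiprocessor limits quoted above.
    MAX_THREADS_PER_MP = 768   # CUDA "threads" (lanes) per multiprocessor
    WARP_SIZE = 32             # lanes per warp in CUDA
    MAX_BLOCKS_PER_MP = 8

    max_warps = MAX_THREADS_PER_MP // WARP_SIZE
    print(max_warps)  # 24 -- matching the 24-warp limit Jawed mentioned

    # e.g. with blocks of 256 threads, 3 blocks fill the multiprocessor:
    block_threads = 256
    blocks = min(MAX_BLOCKS_PER_MP, MAX_THREADS_PER_MP // block_threads)
    print(blocks, blocks * block_threads // WARP_SIZE)  # 3 24
    ```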
     
  8. Jawed

    Jawed Legend

    An instruction runs for 2 clocks.

    Here an instruction runs for 4 clocks.

    In both architectures, the duration of an instruction allows "wide reads" on the register file to feed into an operand shuffler. A wide read appears to consist of reading instruction-count * SIMD-width operands at the same time, e.g. in G80 this is 2 * 8 = 16 operands per clock.

    Jawed
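    The "wide read" arithmetic in the post above is just instruction duration times SIMD width. A minimal sketch (parameter names are mine, not NVIDIA's):

    ```python
    # Wide operand reads: the register file can be read across the whole
    # duration of an instruction, so the operands gathered per clock scale as
    # instruction-duration * SIMD-width.
    simd_width = 8     # G80 multiprocessor: 8 ALU lanes
    instr_clocks = 2   # an instruction occupies the ALUs for 2 clocks

    operands_per_clock = instr_clocks * simd_width
    print(operands_per_clock)  # 16, as in the G80 example above

    # For the 4-clock case mentioned first:
    print(4 * simd_width)  # 32
    ```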
     
  9. trinibwoy

    trinibwoy Meh Legend

    True, there may be a throughput penalty for some Physics/GPGPU stuff that doesn't map well to the SIMD setup but how often will that be an issue for graphics workloads? Agreed on the use of proper terminology when referring to the hardware.

    BTW, what are some of the workloads the x86 MIMD co-processor in Larrabee would help with in terms of the current graphics APIs that you wouldn't otherwise do on the main CPU? Are there any obvious fixed-function processes it could replace?
     
  10. Jawed

    Jawed Legend

    That's a compiled, i.e. static, co-issue. There's nothing "dynamic" about it at all. The clever bit is in operand fetching/shuffling, not in ALU instruction issue. Since G80 is "windowing" operands, the ALUs can't tell where data is coming from or going to.

    Jawed
     
  11. trinibwoy

    trinibwoy Meh Legend

    You've mentioned this several times since G80 launched. Is there any documentation or evidence that points to SFU being statically co-issued? The related G80 patent points to exactly the opposite.
     
  12. Jawed

    Jawed Legend

    This is the case under CUDA. But a warp in CUDA is actually 2 hardware threads joined together. Under graphics, a vertex thread is 16-wide (not 32-wide like a CUDA warp). Perhaps G80 can support 48 vertex threads per multiprocessor. Then again, prolly not :smile:

    Jawed
     
  13. Jawed

    Jawed Legend


    See slide 13 of the presentation you linked earlier:
    • Fetch one warp instruction/cycle
    • Issue one "ready-to-go" warp instruction/cycle
    Jawed
     
  14. pcchen

    pcchen Moderator Moderator Veteran Subscriber

    To my understanding, under CUDA the basic unit is still 16-wide (i.e. a half warp). Serialization if necessary happens on a half warp basis. I'm not completely sure about this though.
     
  15. Arun

    Arun Unknown. Legend

    Jawed, no, that's not static nor compiled. See my answer below:
    No, G80 allows it to be from different threads. It is easy to see why this is possible from a hardware perspective:
    - Vertex Shader/Geometry Shader: 16-wide batch size; each scheduler cycle, you issue one instruction from one thread that feeds the ALU for one scheduler clock cycle (i.e. 2 ALU clock cycles), OR you issue one instruction from one thread that feeds the SFU for 4 scheduler clock cycles (i.e. 8 ALU clock cycles)
    - Pixel Shader/CUDA: 32-wide batch size; one scheduler cycle, you issue one instruction from one thread that feeds the ALU for two scheduler clock cycles (i.e. 4 ALU clock cycles). The next cycle, you issue one instruction from one thread that feeds the Interpolator (i.e. SFU unit too) for two scheduler clock cycles (i.e. 4 ALU clock cycles), OR you issue one instruction from one thread that feeds the SFU for 8 scheduler clock cycles (i.e. 16 ALU clock cycles).

    It should be easy to see that the thread selection hardware is able to deliver one 16-wide group every clock cycle, while the ALU only needs one 16-wide group every 2 clock cycles in the PS/CUDA case. As such, if you want to support a batch size of 16 for the VS, it is easy to support sending instructions from independent threads (or the same one if more desirable) to the two units with no performance penalty if you allow your batch size to grow to 32. This higher batch size also makes it possible to support more simultaneous pixels for a given number of warp/thread slots, thus improving real-world latency tolerance if you are not register limited. This is assuming there are 24 slots, not 48; the latter is extremely unlikely however given the lower latency tolerance requirements of the VS.
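    Arun's dual-issue pattern for the PS/CUDA case can be modelled with a toy simulation. This is my simplification of his description, not documented hardware: the scheduler issues one warp instruction per scheduler cycle, an ALU op busies the ALU pipe for two scheduler cycles, and an interpolator/SFU op busies the SFU pipe for two, so alternating issue keeps both pipes fully occupied with no idle cycles.

    ```python
    # Toy scheduler model: one issue per scheduler cycle, two pipes, each
    # issued instruction keeps its pipe busy for 2 scheduler cycles.
    ALU_OCCUPANCY = 2
    SFU_OCCUPANCY = 2

    def simulate(cycles):
        busy = {"alu": 0, "sfu": 0}
        issued = []
        for _ in range(cycles):
            # Issue to whichever pipe is free this cycle (at most one issue/cycle).
            if busy["alu"] == 0:
                busy["alu"] = ALU_OCCUPANCY
                issued.append("alu")
            elif busy["sfu"] == 0:
                busy["sfu"] = SFU_OCCUPANCY
                issued.append("sfu")
            else:
                issued.append("-")
            for pipe in busy:
                busy[pipe] = max(0, busy[pipe] - 1)
        return issued

    print(simulate(8))  # alternates 'alu', 'sfu' with no idle ('-') cycles
    ```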

    Fair enough, all I was saying is that there's a small register scoreboard that gets checked to see if a thread can execute or not. As you point out, it's nothing incredibly advanced or unusual, but it's still worth pointing out since I'm pretty sure most if not all DX9 GPUs didn't support such a thing.

    Well, as I said, it works via a register scoreboard, and seems to allow checking two instructions per thread. The obvious advantage of such a scheme is that it allows you to run several loads at the same time while still running ALU instructions from the same thread quite often even if the loads return out of order.

    Since it works with a register scoreboard, my expectation is that the scheme is generalized for both loads and ALU instructions; it basically keeps executing until the scoreboard tells it it needs to wait. Then, whenever a result is returned either from a load or from the ALU pipeline, it checks at the next opportunity whether it can execute an instruction from this thread by checking the register scoreboard again. Does this seem reasonable?
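    The scoreboard scheme Arun describes can be sketched minimally. All names here are mine, and this is a guess at the mechanism, not documented hardware: each in-flight instruction marks its destination register pending, and a thread may issue only if none of its next instruction's registers are pending.

    ```python
    # Minimal register-scoreboard sketch: outstanding results block dependent
    # instructions; independent instructions from the same thread keep issuing.
    class Scoreboard:
        def __init__(self):
            self.pending = set()   # registers with an outstanding result

        def can_issue(self, regs):
            return self.pending.isdisjoint(regs)

        def issue(self, dest):
            self.pending.add(dest)

        def writeback(self, dest):
            self.pending.discard(dest)

    sb = Scoreboard()
    sb.issue("r0")                     # long-latency load into r0 in flight
    print(sb.can_issue({"r1", "r2"}))  # True  -- independent ALU op can run
    print(sb.can_issue({"r0", "r3"}))  # False -- consumer of r0 must wait
    sb.writeback("r0")                 # load result returns
    print(sb.can_issue({"r0", "r3"}))  # True
    ```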

    pcchen: I'm pretty sure branch granularity is 32-wide in CUDA; shared memory bank conflicts (ugh!) are 16-wide though, that's probably why you were confused... :)
     
  16. Jawed

    Jawed Legend

    These effects are caused by banking (in the RF, L1 constant cache and parallel data cache) as well as coalescing for writes to memory.

    Jawed
     
  17. 3dilettante

    3dilettante Legend Alpha

    I do not believe I am confused in this regard.
    Your description of Larrabee's function, however, does indicate some confusion.

    You say you are using "logical cores" in your description, but then you describe things in terms of physical units (wrongly). We will not be expecting Larrabee to have 16-wide units for everything. From the sounds of it we may only get one such unit per core.


    G80's clusters have multiple threads in waiting, with each ready thread getting multiple cycles on the SIMD before cycling out for another instruction.
    Larrabee has multiple threads, whose execution policy has not been publicly defined as of yet.
    It may be SMT, or some form of hybrid round-robin.
    Both cases are not wildly different.

    Only if the pipelines don't share hardware.
    One of the quick and dirty ways to save die space is to use the FMAC unit in the more complex functions.
    If Larrabee is much like other SSE implementations, the vector unit will run both vector integer and vector float operations, and we won't have both going at the same time.
     
  18. 3dcgi

    3dcgi Veteran Subscriber

    Actually, I think running over 2 clock cycles is entirely different in implementation than running 2 threads in the same clock cycle. If Intel's HT doesn't run 2 threads in the same pipe stage then I need to review my terminology, as I thought that's what HT is.
     
  19. 3dcgi

    3dcgi Veteran Subscriber

    This statement doesn't convince me of anything, as Sun says the same about Niagara.

    I didn't remember this quote, but obviously others including Arun think it's SMT. I still doubt it's SMT across the SIMD units so 4 way doesn't make sense to me, but it's all speculation at this point so we'll see later this year.
     
  20. Scali

    Scali Regular

    I was obviously referring to the SIMD-width there. There may or may not be other units... but that wasn't relevant in that particular context... I was just clarifying that the width of these is the same as the G80. Unlike G80, Larrabee will however likely have x86-like ALU units alongside the SIMD unit.

    All G80's threads are identical! At least per multiprocessor. It's SPMD, only one program runs at a time.
    SMT doesn't have such a restriction, in fact, for SMT to be efficient, you don't even WANT them to be identical.

    I already referred to the Pentium Pro and its concept of 'execution ports' and I am going to refer to that once again.

    And if Larrabee's x86 implementation is much like other x86-implementations, we will have a regular ALU alongside the FPU/SIMD unit (and possibly have multiple execution ports per unit).
     