Nehalem SMT Implementation

Discussion in 'PC Industry' started by ShaidarHaran, Dec 3, 2008.

  1. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,026
    Likes Received:
    88
    I have yet to see an analysis of this. Does anyone know the difference between Nehalem's SMT implementation and that of Netburst?
     
  2. suryad

    Veteran

    Joined:
    Aug 20, 2004
    Messages:
    2,479
    Likes Received:
    16
  3. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,026
    Likes Received:
    88
    Thanks but I've read all that before. I was wondering what is the method of SMT utilized by Nehalem, specifically. Are there implementation differences compared to Netburst or is it really just HyperThreading all over again, with a larger ROB and more RS entries?
     
  4. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,613
    Likes Received:
    1,044
    It's the same static partitioning of the ROB again (2x64 entries) as with the P4. I would expect store buffers to be partitioned as well.
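
    A rough sketch of what that static partitioning means for each context's instruction window (illustrative Python; the 128-entry total is Nehalem's ROB size, the rest is a toy model):

    ```python
    # Toy model of static ROB partitioning: with SMT on, each hardware
    # context gets a fixed half of the reorder buffer; with SMT off, a
    # single context owns the whole structure.

    ROB_ENTRIES = 128  # Nehalem's reorder buffer

    def rob_window(total_entries: int, active_contexts: int) -> int:
        """Entries visible to each context under static partitioning."""
        return total_entries // active_contexts

    print(rob_window(ROB_ENTRIES, 1))  # 128 -> SMT off
    print(rob_window(ROB_ENTRIES, 2))  # 64  -> 2x64, as above
    ```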

    In the server space it is going to be a huge win, since it effectively halves all memory latencies for each context. It isn't as clear a win on the desktop, where close to zero of the applications people actually use (including games) are multithreaded, at least to the point where they utilize more than four contexts.

    Cheers
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,548
    Likes Received:
    4,700
    Location:
    Well within 3d
    I don't see 2-way SMT as having that effect.

    The per-thread performance and perceived latency won't really improve.
    Utilization of the hardware improves, but the contexts themselves won't see significant improvement.
     
  6. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,613
    Likes Received:
    1,044
    No, not for the individual contexts of course. The perceived latency (distance between producing and consuming instructions) for each context running on a core is lower. The perceived resources (execution units x cycles) are fewer too, but the summed work of the two SMT contexts is higher than the work of a single context on a core.

    Cheers
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,548
    Likes Received:
    4,700
    Location:
    Well within 3d
    Perhaps there is a perceived reduction in time between producing and consuming threads, if they are both running and writing to a shared memory.
    The instructions within a thread won't see any such benefit.
    The SMT pipeline is wired for full-speed bypassing in a single-threaded case.

    Because both threads are active, they are constantly competing for resources, and each context will notice when it stalls: from a stalled thread's point of view, a cycle spent issuing the other thread's instruction is no different from a NOP.

    If a multiply with 3 cycles of latency is issued, the apparent latency to the thread's next dependent instruction is going to be 3 cycles plus some statistical average of the cycles in which the other thread takes a needed resource. The stalled thread will be actively trying to issue its next instruction the entire time.
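
    A toy model of that effect (illustrative Python; the slot-steal probability is a made-up knob, not a measured number):

    ```python
    import random

    def apparent_latency(base_cycles: int, steal_prob: float,
                         trials: int = 100_000) -> float:
        """Average cycles until the dependent instruction issues, if the
        other thread steals its issue slot with probability `steal_prob`
        on each attempt (assumed distribution, purely illustrative)."""
        total = 0
        for _ in range(trials):
            cycles = base_cycles
            while random.random() < steal_prob:
                cycles += 1  # passed over this cycle; retry next cycle
            total += cycles
        return total / trials

    print(apparent_latency(3, 0.0))  # ~3.0: single-threaded, no contention
    print(apparent_latency(3, 0.5))  # ~4.0: on average one extra stall cycle
    ```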
     
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,613
    Likes Received:
    1,044
    I'm not talking about threads, but instructions. Your mul is the producer; the subsequent instruction that uses the result is the consumer.

    Let's say that a core is actually executing 4 instructions, on average, every cycle (limited by decode/retire).

    Your 3 cycle mul thus implies that the scheduler/ROB needs to find 12 instructions to execute before the result from the mul can be used. If you have 2 contexts running, you will, on average, get 2 instructions executed per cycle per context and thus only need to schedule 6 instructions before the result can be used.

    The result is that the perceived latency for each individual context is halved: you have fewer stalls and higher aggregate performance.
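
    Spelled out (illustrative Python; the 4-wide issue width and 3-cycle mul are just the numbers from this example):

    ```python
    def insts_to_cover(latency_cycles: int, per_context_ipc: int) -> int:
        """Independent instructions the scheduler must find from one
        context to hide a result latency at a given issue rate."""
        return latency_cycles * per_context_ipc

    MUL_LATENCY = 3  # cycles

    print(insts_to_cover(MUL_LATENCY, 4))  # 12: one context filling all 4 slots
    print(insts_to_cover(MUL_LATENCY, 2))  # 6:  two contexts at ~2 IPC each
    ```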

    Cheers
     
    #8 Gubbi, Dec 5, 2008
    Last edited by a moderator: Dec 5, 2008
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,548
    Likes Received:
    4,700
    Location:
    Well within 3d
    Your description is a closer match to either throughput or utilization than latency.
    Latency is the time from event A to event B.

    A 3-cycle latency is still at least 3 cycles from the POV of the hardware in all forms of threading.
    Fine-grained threading hides this in part because the thread itself cannot see the other cycles a round-robin scheduler devotes to other threads.
    SMT has all threads active all the time, so their respective schedulers know when they are passed over.

    If we have a 4-wide OoO core with two SMT threads of the same philosophy as Nehalem, the scheduler for each thread is designed to handle a peak throughput of 4 instructions per clock, and it will try to issue that many. If one thread stalls, the active thread still has to try issuing 4 instructions per clock.
    From the point of view of one of the threads, it will have 4 instructions ready to go but find that various resources are mysteriously taken, so from that thread's point of view it actually does stall.

    To turn your example around with a 4-wide processor with two threads:

    If instruction issue is evenly shared and gives 2 issues per thread per cycle, we can have cases where latency increases.
    If one thread has a single-cycle ADD whose results are to be consumed by 3 later instructions, the other thread's issues will block full issue for the consumers of the ADD.
    In this case, SMT has effectively increased the perceived latency of the ADD.
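
    A toy version of that scenario (illustrative Python; the 4-wide core, the even two-slot split, and the three-consumer ADD are the assumptions above):

    ```python
    import math

    def cycles_to_drain_consumers(consumers: int, slots_per_thread: int) -> int:
        """Cycles from the ADD's result being ready until all of its
        consumers have issued, given a fixed per-thread issue share."""
        return math.ceil(consumers / slots_per_thread)

    # Three consumers of a single-cycle ADD:
    print(cycles_to_drain_consumers(3, 4))  # 1: all three issue together
    print(cycles_to_drain_consumers(3, 2))  # 2: the shared width splits them up
    ```

    So the last consumer sees the ADD's effective latency grow from one cycle to two.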
     
  10. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    How does the "thread throughput" of a Nehalem compare with that of a Sun Niagara?
     
  11. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,532
    Likes Received:
    485
    Location:
    Varna, Bulgaria
    Niagara (and Itanium-2), if I'm not mistaken, don't use SMT but a more exotic form of TMT (temporal multithreading), where at any given time only one thread per core is actively executing; on an event like a load crossing a latency threshold, the stalled thread is "stunned" and the next one is issued to the pipeline.
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,548
    Likes Received:
    4,700
    Location:
    Well within 3d
    Niagara's cores are basically barrel processors, with a modified variant of fine-grained multithreading.
    Each core switches threads every cycle, rotating through however many contexts it supports, but it can switch a thread out of the active set if it hits a long-latency event.

    Itanium 2 is switch on event.
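
    Roughly like this (illustrative Python; four contexts per core matches Niagara, but the scheduler itself is a toy model):

    ```python
    from collections import deque

    class BarrelCore:
        """Toy Niagara-style core: round-robin through ready contexts
        each cycle, parking any context that hits a long-latency event."""

        def __init__(self, contexts: int = 4):
            self.ready = deque(range(contexts))  # contexts eligible to issue
            self.parked = {}                     # context -> cycles remaining

        def tick(self):
            # Wake parked contexts whose long-latency event has finished.
            for ctx in list(self.parked):
                self.parked[ctx] -= 1
                if self.parked[ctx] == 0:
                    del self.parked[ctx]
                    self.ready.append(ctx)
            if not self.ready:
                return None  # every context is stalled this cycle
            ctx = self.ready.popleft()
            self.ready.append(ctx)  # rotate: next cycle, next context
            return ctx

        def long_latency_event(self, ctx, cycles):
            """Switch a context out of the active rotation (e.g. a miss)."""
            self.ready.remove(ctx)
            self.parked[ctx] = cycles

    core = BarrelCore()
    print([core.tick() for _ in range(6)])  # [0, 1, 2, 3, 0, 1]
    core.long_latency_event(2, 3)           # context 2 parks for 3 cycles
    print([core.tick() for _ in range(4)])  # [3, 0, 1, 3] - 2 sits out while parked
    ```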
     
  13. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,026
    Likes Received:
    88
    Sounds like a fancy name for time slices, which isn't really true SMT at all...
     