Nehalem SMT Implementation

ShaidarHaran

I have yet to see an analysis of this. Does anyone know the difference between Nehalem's SMT implementation and that of Netburst?
 

Thanks, but I've read all that before. I was wondering specifically what method of SMT Nehalem uses. Are there implementation differences compared to Netburst, or is it really just Hyper-Threading all over again, with a larger ROB and more RS entries?
 
It's the same static partitioning of the ROB again (2x64 entries) as with the P4. I would expect store buffers to be partitioned as well.

In the server space it is going to be a huge win, since it effectively halves the memory latency as perceived by each context. It isn't as clear a win on the desktop, as there are close to zero applications people actually use (including games) that are multithreaded, at least not to the point where they utilize more than four contexts.

Cheers
 
I don't see 2-way SMT as having that effect.

The per-thread performance and perceived latency won't really improve.
Utilization of the hardware improves, but the contexts themselves won't see significant improvement.
 
The per-thread performance and perceived latency won't really improve.
Utilization of the hardware improves, but the contexts themselves won't see significant improvement.

No, not for the individual contexts, of course. The perceived latency (the distance between producing and consuming instructions) for each context running on a core is lower. The perceived resources (execution units x cycles) are fewer too, but the summed work of the two SMT contexts is higher than the work of a single context on a core.

Cheers
 
Perhaps there is a perceived reduction in the time between producing and consuming threads, if they are both running and writing to shared memory.
The instructions within a thread won't see any such benefit.
The SMT pipeline is wired for full-speed bypassing in a single-threaded case.

Because both threads are active, they are actively competing for resources, and each context will notice when it stalls: from its point of view, a cycle spent on the other thread's instruction is no different than a NOP.

If a multiply with a 3-cycle latency (in single-threaded terms) is issued, the apparent latency to the thread's next dependent instruction is going to be 3 cycles plus some statistical average of the times the other thread arbitrarily takes a resource. The stalled thread will be actively trying to issue its next instruction the entire time.
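Roughly what I mean, as a toy model (the 0.5 contention probability and everything else here are made-up numbers for illustration, not measurements of any real CPU):

Code:
import random

# Apparent latency = the mul's 3-cycle result latency, plus extra cycles
# whenever the other thread happens to take the issue slot we wanted.
def apparent_latency(result_latency, p_other_thread_takes_slot, trials=100_000):
    total = 0
    for _ in range(trials):
        cycles = result_latency
        while random.random() < p_other_thread_takes_slot:
            cycles += 1  # passed over this cycle; try again next cycle
        total += cycles
    return total / trials

print(apparent_latency(3, 0.0))  # ~3.0 cycles: no competing thread
print(apparent_latency(3, 0.5))  # ~4.0 cycles on average: slot lost half the time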
 
I'm not talking about threads, but instructions. Your mul is the producer, the subsequent instruction that uses the result is the consumer.

Let's say that a core is actually executing 4 instructions, on average, every cycle (limited by decode/retire).

Your 3-cycle mul thus implies that the scheduler/ROB needs to find 12 instructions to execute before the result from the mul can be used. If you have 2 contexts running, you will, on average, get 2 instructions executed per cycle per context and thus only need to schedule 6 instructions before the result can be used.

The result is that the perceived latency for each individual context is halved: you have fewer stalls and higher aggregate performance.
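As a back-of-the-envelope sketch of that arithmetic (the 4-wide figure, the 3-cycle mul and the perfectly even 2+2 issue split are just the numbers from this example, not a claim about actual Nehalem scheduling):

Code:
# Independent instructions the scheduler must find to cover a producer's
# latency, assuming a steady per-context issue rate (illustrative model only).
def insts_needed_to_hide(latency_cycles, issue_rate_for_this_context):
    return latency_cycles * issue_rate_for_this_context

core_width = 4    # instructions the core executes per cycle on average
mul_latency = 3   # cycles from issuing the mul to its result being usable

print(insts_needed_to_hide(mul_latency, core_width))       # 12 for one context using the whole core
print(insts_needed_to_hide(mul_latency, core_width // 2))  # 6 per context with an even SMT split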

Cheers
 
Your description is a closer match to either throughput or utilization than latency.
Latency is the time from event A to event B.

A 3-cycle latency is still at least 3 cycles from the POV of the hardware in all forms of threading.
Fine-grained threading hides this in part because the thread itself cannot see the other cycles a round-robin scheduler devotes to other threads.
SMT has all threads active all the time, so their respective schedulers know when they are passed over.

If we have a 4-wide OoO core with two SMT threads of the same philosophy as Nehalem, the scheduler for each thread is designed to handle a peak throughput of 4 instructions per clock, and it will try to issue that many. If one thread stalls, the active thread still has to try issuing 4 instructions per clock.
From the point of view of one of the threads, it will have 4 instructions ready to go but find that various resources are mysteriously taken, so from its perspective it actually does stall.

To turn your example around with a 4-wide processor with two threads:

If instruction issue is evenly shared and gives 2 issues per thread per cycle, we can have cases where latency increases.
If one thread has a single-cycle ADD whose result is to be consumed by 3 later instructions, the other thread's issues will block full issue for the consumers of the ADD.
In this case, SMT has effectively increased the perceived latency of the ADD.
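A concrete way to see it, under the same assumptions (1-cycle ADD, 3 dependent consumers, and a perfectly even 2+2 issue split, all made up for this example):

Code:
import math

# Cycles from issuing the ADD until the last of its dependent consumers has
# issued, given how many issue slots this context gets per cycle.
def cycles_to_last_consumer(add_latency, n_consumers, slots_per_cycle):
    return add_latency + math.ceil(n_consumers / slots_per_cycle)

print(cycles_to_last_consumer(1, 3, 4))  # single thread on a 4-wide core: 2 cycles
print(cycles_to_last_consumer(1, 3, 2))  # SMT with an even 2+2 split:     3 cycles

The last consumer issues a cycle later purely because the other thread owns half the issue slots.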
 
Niagara (and Itanium 2), if I'm not mistaken, don't use SMT, but a more exotic form of TMT (temporal multithreading), where at any given time only one thread per core is actively executing, and if there is an event like a load's latency crossing a threshold, the stalled thread is "stunned" and the next one is issued to the pipeline.
 
Niagara's cores are basically barrel processors, with a modified variant of fine-grained multithreading.
Each core switches threads every cycle, rotating through the contexts it supports, but it can switch a thread out of the active set if it hits a long-latency event.

Itanium 2 is switch-on-event.
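For anyone unfamiliar with the distinction, here's a minimal sketch of that Niagara-style rotation (the miss pattern and the 5-cycle stall are invented purely for illustration):

Code:
from collections import deque

# Toy model of fine-grained multithreading with switch-out on long-latency
# events: issue from a different thread each cycle, and park a thread while
# it waits on (say) a cache miss.
def run(thread_ids, cycles):
    ready = deque(thread_ids)   # rotation of runnable threads
    stalled = {}                # thread id -> cycles until it becomes runnable
    for cycle in range(cycles):
        for tid in list(stalled):           # count down outstanding misses
            stalled[tid] -= 1
            if stalled[tid] == 0:
                del stalled[tid]
                ready.append(tid)           # thread rejoins the rotation
        if not ready:
            print(f"cycle {cycle:2}: all threads stalled")
            continue
        tid = ready.popleft()               # next thread in round-robin order
        if cycle % 5 == 3:                  # pretend this instruction misses
            stalled[tid] = 5                # park the thread for the miss
            print(f"cycle {cycle:2}: thread {tid} issues, then stalls on a miss")
        else:
            ready.append(tid)               # back to the end of the rotation
            print(f"cycle {cycle:2}: thread {tid} issues")

run([0, 1, 2, 3], cycles=12)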
 
Niagara (and Itanium 2), if I'm not mistaken, don't use SMT, but a more exotic form of TMT (temporal multithreading), where at any given time only one thread per core is actively executing, and if there is an event like a load's latency crossing a threshold, the stalled thread is "stunned" and the next one is issued to the pipeline.

Sounds like a fancy name for time slices, which isn't really true SMT at all...
 