22 nm Larrabee

Is the core going to handle re-renaming its rename registers and migrating values based on use data? There would be no software way to control this.

It should be completely in hardware. If the thread count goes up, you want to make as effective use of your lower-level fast registers as possible, and not statically partition them.

Result forwarding supplies a significant share of operands, though per Agner the register-read port restriction was still noticeable on many preceding Intel cores. With SB, this seems to have been alleviated or significantly reduced, though I haven't found a port count. With a physical register file, the ROB no longer contains data.

The point is that the Pentium Pro made do with just three ports to its register file while being competitive with other OOO implementations that had up to twelve ports. It is of course in part a result of the small number of registers in x86. With just 8 architected registers, the chance of accessing a register that hasn't been written to in the last 32 instructions, and thus isn't available from the ROB, is a lot lower than if you had 32 registers.
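To put a rough number on that, here's a back-of-envelope sketch under the purely illustrative assumption that every instruction writes one uniformly random architected register; real register usage is nothing like uniform, so treat the percentages only as a feel for the effect:

```c
#include <math.h>
#include <stdio.h>

/* Probability that a given architected register was NOT written by any of
   the last `rob_entries` instructions, i.e. that its value has to come from
   the retired register file instead of the ROB. */
static double prob_not_in_rob(int arch_regs, int rob_entries)
{
    return pow(1.0 - 1.0 / arch_regs, rob_entries);
}

int main(void)
{
    /* ~1.4% with x86's 8 registers vs ~36% with 32 registers. */
    printf(" 8 arch regs, 32-entry ROB: %.1f%%\n", 100.0 * prob_not_in_rob(8, 32));
    printf("32 arch regs, 32-entry ROB: %.1f%%\n", 100.0 * prob_not_in_rob(32, 32));
    return 0;
}
```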

I suggested a different domain. Sandy Bridge has several, and they are out of order.

Alright, I misunderstood. I thought you wanted to offload AVX from the core entirely, thus not tracking dependencies.

Cheers
 
It should be completely in hardware. If the thread count goes up, you want to make as effective use of your lower-level fast registers as possible, and not statically partition them.
I think having the register file pointers updated by the renamer, retire unit, and a register history/reallocation engine may eat into the power savings of AVX 1024, just a tad.

The point is that the Pentium Pro made do with just three ports to its register file while being competitive with other OOO implementations that had up to twelve ports.
Mem/reg ops also meant it was possible that up to half of the sources wouldn't come from the register file, which also helped.
I'm curious what Intel has changed for SB, since the ROB no longer serves as a source.
 
I think having the register file pointers updated by the renamer, retire unit, and a register history/reallocation engine may eat into the power savings of AVX 1024, just a tad.

There is no reason to let the entire instruction tracking apparatus be aware of where register values reside.

The hybrid register file acts as a black box, with the fast file acting as a cache for the back file. You access it after instruction scheduling with your renamed register identifier, the same way you do in any physical-register-file OOO machine. Either it hits in the fast file and you get the value quickly, or it doesn't and you wait a few cycles more.

This will add some latency, but since extending the register resources is already a throughput-oriented optimization, it might be a tradeoff worth making.
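To make the black-box idea concrete, here is a minimal sketch of a two-level register file in which a small fast file caches a larger back file. Everything here is an illustrative assumption of mine (the sizes, the 1- vs 4-cycle latencies, the round-robin replacement, the names), not anything Intel has described.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_PHYS_REGS 256   /* full physical register file ("back file")  */
#define FAST_ENTRIES   32   /* small, heavily ported "fast file"          */
#define FAST_LATENCY    1   /* cycles on a fast-file hit                  */
#define BACK_LATENCY    4   /* cycles when the value comes from the back  */

typedef struct {
    uint64_t back[NUM_PHYS_REGS];     /* always holds every physical register         */
    int      fast_tag[FAST_ENTRIES];  /* which preg each fast slot caches, -1 = empty */
    uint64_t fast_val[FAST_ENTRIES];
    unsigned victim;                  /* dumb round-robin replacement pointer         */
} HybridRF;

static void rf_init(HybridRF *rf)
{
    memset(rf, 0, sizeof *rf);
    for (int i = 0; i < FAST_ENTRIES; i++)
        rf->fast_tag[i] = -1;
}

/* Read a renamed physical register.  The caller just gets a value and a
   latency; it never learns which level the value actually lived in. */
static int rf_read(HybridRF *rf, int preg, uint64_t *val)
{
    for (int i = 0; i < FAST_ENTRIES; i++) {
        if (rf->fast_tag[i] == preg) {        /* hit in the fast file */
            *val = rf->fast_val[i];
            return FAST_LATENCY;
        }
    }
    /* Miss: fetch from the back file and install a copy in the fast file. */
    *val = rf->back[preg];
    unsigned slot = rf->victim++ % FAST_ENTRIES;
    rf->fast_tag[slot] = preg;
    rf->fast_val[slot] = *val;
    return BACK_LATENCY;
}

/* Writes always update the back file, plus any cached copy in the fast file. */
static void rf_write(HybridRF *rf, int preg, uint64_t val)
{
    rf->back[preg] = val;
    for (int i = 0; i < FAST_ENTRIES; i++)
        if (rf->fast_tag[i] == preg)
            rf->fast_val[i] = val;
}

int main(void)
{
    HybridRF rf;
    uint64_t v;
    rf_init(&rf);

    rf_write(&rf, 17, 42);
    printf("cold read of p200:   %d cycles\n", rf_read(&rf, 200, &v));
    printf("second read of p200: %d cycles\n", rf_read(&rf, 200, &v));
    printf("read of p17:         %d cycles, value %llu\n",
           rf_read(&rf, 17, &v), (unsigned long long)v);
    return 0;
}
```

The point being that the scheduler only ever sees one of two read latencies; nothing above the register file needs to know which level actually held the value.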


Mem/reg ops also meant it was possible that up to half of the sources wouldn't come from the register file, which also helped.
I'm curious what Intel has changed for SB, since the ROB no longer serves as a source.

Since the scheduling machinery is very similar to P4, I'd say they massively multiported the register file.

Also, they might schedule an instruction to a reservation station before its data is ready, if the scheduler can determine that an instruction in the reservation station queue will produce the needed value; the instruction can then grab the value from the result bus.

Pentium 4 did this. One of the problems of P4 is that they did it speculatively for load-dependent instructions too, expecting an L1 hit. If the load missed L1, they had to reschedule from just after the load, causing pathological behaviour. However, for instructions that have a low, constant latency, the concept seems valid.
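Just to spell the mechanism out, here's a toy model of that speculative wakeup and replay; the latencies and the single dependent are illustrative assumptions, and none of this reflects the actual P4 replay loop:

```c
#include <stdbool.h>
#include <stdio.h>

#define L1_HIT_LATENCY  3   /* assumed cycles from load issue to data on a hit */
#define MISS_LATENCY   12   /* assumed cycles when the load misses L1          */

typedef struct {
    const char *name;
    int         issue_cycle;   /* cycle the dependent uop issues */
    bool        replayed;
} Uop;

int main(void)
{
    bool load_hits_l1 = false;  /* flip to true to see the optimistic case */
    int  load_issue   = 0;

    /* The scheduler wakes the dependent as if the load will hit L1. */
    Uop dep = { "add r1, r0", load_issue + L1_HIT_LATENCY, false };
    printf("%s speculatively issued at cycle %d\n", dep.name, dep.issue_cycle);

    if (!load_hits_l1) {
        /* The hit assumption was wrong: squash the dependent and replay it
           when the data actually arrives.  On P4 the replayed uop could in
           turn drag its own dependents back through the scheduler, which is
           the pathological behaviour described above. */
        dep.replayed    = true;
        dep.issue_cycle = load_issue + MISS_LATENCY;
        printf("L1 miss: %s replayed, now issues at cycle %d\n",
               dep.name, dep.issue_cycle);
    }
    return 0;
}
```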

Cheers
 
It provides a transparent lock elision function with prefixes that can be safely ignored by legacy hardware.

The HTM part can't run on older hardware.
The general idea is similar between Haswell's instructions and AMD's barely remembered synchronization facility, though that is probably because the desired ends only allow certain means.
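For the lock-elision half, here's a minimal sketch using GCC's HLE-flavoured atomic builtins (as far as I know these need -mhle and a reasonably recent GCC); the XACQUIRE/XRELEASE prefixes they emit are exactly the "safely ignored by legacy hardware" prefixes mentioned above, so the same binary degrades to an ordinary spinlock on older parts:

```c
#include <immintrin.h>   /* _mm_pause */

static int lockvar;
static int shared_counter;

static void elided_lock(void)
{
    /* Emits an XACQUIRE-prefixed locked exchange: on TSX hardware this
       starts an elided (transactional) critical section, on anything older
       the prefix is ignored and it is a plain spinlock acquire. */
    while (__atomic_exchange_n(&lockvar, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();     /* also aborts a failed elision attempt */
}

static void elided_unlock(void)
{
    /* XRELEASE-prefixed store: commits the elided critical section. */
    __atomic_store_n(&lockvar, 0,
                     __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}

void bump_counter(void)
{
    elided_lock();
    shared_counter++;    /* runs transactionally when elision succeeds */
    elided_unlock();
}
```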

Intel's description is vague on certain details, such as minimum capacity guarantees, which right now may be undisclosed for competitive reasons. AMD's plan was to promise at least 4 cache lines.

Intel's scheme is potentially simpler, because it does not provide a mechanism for adding or removing cache lines from speculative monitoring, though this might mean that longer or more complex transactions may fail earlier due to incidental memory operations taking up buffer space.
On the other hand, Intel's implementation is much tighter with regard to side effects, whereas AMD's leakier scheme did not roll back a fair amount of CPU state, nor the state in memory or the TLB.

The downside to that tighter rollback is that certain things, like updating TLB status bits, may lead to an abort on some implementations.
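And a hedged sketch of the explicit-transaction (RTM) side, which is where the capacity question and those implementation-specific aborts bite; the intrinsics are the standard immintrin.h ones (compile with -mrtm), while the fallback policy and the names are my own illustration:

```c
#include <immintrin.h>   /* _xbegin/_xend/_xabort, _mm_pause */

static volatile int fallback_lock;

static void take_fallback_lock(void)
{
    while (__sync_lock_test_and_set(&fallback_lock, 1))
        while (fallback_lock)
            _mm_pause();
}

static void drop_fallback_lock(void)
{
    __sync_lock_release(&fallback_lock);
}

void update(int *data, int n)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (fallback_lock)      /* put the lock into our read set so a   */
            _xabort(0xff);      /* real lock holder forces us to abort   */
        for (int i = 0; i < n; i++)
            data[i]++;          /* a big enough n overflows the speculative
                                   buffering and comes back as a capacity
                                   abort, the concern raised above        */
        _xend();
        return;
    }
    /* Aborted.  A set _XABORT_CAPACITY bit in status would be the footprint
       problem discussed above and is not worth retrying; this sketch simply
       falls back to the real lock for any abort cause. */
    take_fallback_lock();
    for (int i = 0; i < n; i++)
        data[i]++;
    drop_fallback_lock();
}
```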

I'm curious whether AMD has much of a chance at bringing out something compatible with this, and whether my earlier speculation about Bulldozer's wonky memory performance means they had already tried to do something along these lines and won't be forever and a day behind.
 
Release date

Any news on when Knights Corner will ship?
I'm guessing it won't be until after all the Ivy Bridge products are shipping.
Say July?
 
Wait, is Intel really starting to sell Knights Corner as a commercial product?
So it seems, though I'm not sure if you can actually buy them from your local shop.

http://www.tacc.utexas.edu/news/press-releases/2011/stampede
When completed, Stampede will comprise several thousand Dell "Zeus" servers with each server having dual 8-core processors from the forthcoming Intel® Xeon® Processor E5 Family (formerly codenamed "Sandy Bridge-EP") and each server with 32 gigabytes of memory. This production system will offer almost 2 petaflops of peak performance, which is double the current top system in XD, and the real performance of scientific applications will see an even greater performance boost due to the newer processor and interconnect technologies. The cluster will also include a new innovative capability: Intel® Many Integrated Core (MIC) co-processors codenamed "Knights Corner," providing an additional 8 petaflops of performance.
They are also talking about upgrading to future versions of MIC as they come along.
 
They have also talked about selling them to universities and researchers so they could figure out how to actually use them efficiently. I would guess it would be about as easy to get for Joe Average as the highest-end Teslas are.
 
Haswell and MIC

Could transactional memory in Haswell be an indication of MIC in Haswell?
Will MIC support transactional memory? It would make sense, yes?
 
Seems pretty power efficient; it's beating the Tesla 2090 systems in perf/watt. Only BlueGene/Q systems are beating it on perf/watt at the large scale. I suspect this is the 45nm 32-core early version as well; it would be nice to know for sure. If it does that well at 45nm, the 22nm product should be very impressive and competitive.
 
Seems pretty power efficient; it's beating the Tesla 2090 systems in perf/watt. Only BlueGene/Q systems are beating it on perf/watt at the large scale. I suspect this is the 45nm 32-core early version as well; it would be nice to know for sure. If it does that well at 45nm, the 22nm product should be very impressive and competitive.

It appears to be... not.
As well, we can derive the theoretical GFlops per Xeon Phi, as the E5-2670 is a known ~166 GFlops per CPU: (180990 GFlops - 166 GFlops * 236) / 118 = 1201.8 GFlops per Xeon Phi

From Khato@Anandtech
 
It appears to be... not.

Doesn't that line right up with a theoretical 32-core 1.2 GHz 45nm MIC product? 1.2 TFlops DP peak, right? Or am I off somewhere?

Edit: I'm off, that would be SP flops; DP would be half that. It is ballpark DP for a 50+ core 1.6 GHz 22nm product though, and it's an actual number, not a theoretical one, so yeah. Slightly better than the Tesla 2090 systems, not bad for a first showing.
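For what it's worth, here's Khato's derivation and the peak numbers side by side; the 180990/166/236/118 figures come from the quoted post, while the 16-wide-SP-FMA-per-core peak at 1.2 GHz is my own assumption about the early silicon, not a spec:

```c
#include <stdio.h>

int main(void)
{
    /* Figures from the quoted derivation. */
    double rmax    = 180990.0;  /* GFlops, whole-system Linpack number */
    double per_cpu = 166.0;     /* GFlops per Xeon E5-2670             */
    int    cpus    = 236;
    int    cards   = 118;

    double per_card = (rmax - per_cpu * cpus) / cards;
    printf("Derived GFlops per Xeon Phi: %.1f\n", per_card);   /* ~1201.8 */

    /* Assumed peak for a 32-core, 1.2 GHz part with 16-wide SP FMA per core. */
    double sp_peak = 32 * 1.2 * 16 * 2;   /* 1228.8 GFlops SP */
    double dp_peak = sp_peak / 2;         /*  614.4 GFlops DP */
    printf("32c @ 1.2 GHz: %.1f SP / %.1f DP GFlops peak\n", sp_peak, dp_peak);
    return 0;
}
```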
 