22 nm Larrabee

Is the core going to handle re-renaming its rename registers and migrating values based on use data? There would be no software way to control this.

It should be completely in hardware. If the thread count goes up, you want to make as effective use of your lower-level fast registers as possible, and not statically partition them.

Result forwarding supplies a significant share of operands, though per Agner the register-read port restriction was still noticeable on many preceding Intel cores. With SB, this seems to have been alleviated or significantly reduced, though I haven't found a port count. With a physical register file, the ROB no longer contains data.

The point is that the Pentium Pro made do with just three ports to its register file while being competitive with other OOO implementations that had up to twelve ports. It is of course in part a result of the small number of registers in x86. With just 8 architected registers, the chance of accessing a register that hasn't been written to in the last 32 instructions, and thus isn't available from the ROB, is a lot lower than if you had 32 registers.
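To put a rough number on that, here's a back-of-envelope sketch under the purely illustrative assumption that every instruction writes one uniformly random architected register; real register usage is nothing like uniform, so treat the percentages only as a feel for the effect:

```c
#include <math.h>
#include <stdio.h>

/* Probability that a given architected register was NOT written by any of
   the last `rob_entries` instructions, i.e. that its value has to come from
   the retired register file instead of the ROB. */
static double prob_not_in_rob(int arch_regs, int rob_entries)
{
    return pow(1.0 - 1.0 / arch_regs, rob_entries);
}

int main(void)
{
    /* ~1.4% with x86's 8 registers vs ~36% with 32 registers. */
    printf(" 8 arch regs, 32-entry ROB: %.1f%%\n", 100.0 * prob_not_in_rob(8, 32));
    printf("32 arch regs, 32-entry ROB: %.1f%%\n", 100.0 * prob_not_in_rob(32, 32));
    return 0;
}
```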

I suggested a different domain. Sandy Bridge has several, and they are out of order.

Alright, I misunderstood. I thought you wanted to offload AVX from the core entirely, thus not tracking dependencies.

Cheers
 
It should be completely in hardware. If the thread count goes up, you want to make as effective use of your lower-level fast registers as possible, and not statically partition them.
I think having the register file pointers updated by the renamer, retire unit, and a register history/reallocation engine may eat into the power savings of AVX 1024, just a tad.

The point is that the Pentium Pro made do with just three ports to its register file while being competitive with other OOO implementations that had up to twelve ports.
Mem/reg ops also meant it was possible that up to half of the sources wouldn't come from the register file, which also helped.
I'm curious what Intel has changed for SB, since the ROB no longer serves as a source.
 
I think having the register file pointers updated by the renamer, retire unit, and a register history/reallocation engine may eat into the power savings of AVX 1024, just a tad.

There is no reason to let the entire instruction tracking apparatus be aware of where register values reside.

The hybrid register file acts as a black box, with the fast file acting as a cache for the back file. You access it after instruction scheduling with your renamed register identifier, the same way you do in any physical-register-file OOO machine. Either it hits in the fast file and you get the value quickly, or it doesn't and you wait a few cycles more.

This will add some latency, but since extending the register resources is already a throughput-oriented optimization, it might be a tradeoff worth making.
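To make the black-box idea concrete, here is a minimal sketch of a two-level register file in which a small fast file caches a larger back file. Everything here is an illustrative assumption of mine (the sizes, the 1- vs 4-cycle latencies, the round-robin replacement, the names), not anything Intel has described.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_PHYS_REGS 256   /* full physical register file ("back file")  */
#define FAST_ENTRIES   32   /* small, heavily ported "fast file"          */
#define FAST_LATENCY    1   /* cycles on a fast-file hit                  */
#define BACK_LATENCY    4   /* cycles when the value comes from the back  */

typedef struct {
    uint64_t back[NUM_PHYS_REGS];     /* always holds every physical register         */
    int      fast_tag[FAST_ENTRIES];  /* which preg each fast slot caches, -1 = empty */
    uint64_t fast_val[FAST_ENTRIES];
    unsigned victim;                  /* dumb round-robin replacement pointer         */
} HybridRF;

static void rf_init(HybridRF *rf)
{
    memset(rf, 0, sizeof *rf);
    for (int i = 0; i < FAST_ENTRIES; i++)
        rf->fast_tag[i] = -1;
}

/* Read a renamed physical register.  The caller just gets a value and a
   latency; it never learns which level the value actually lived in. */
static int rf_read(HybridRF *rf, int preg, uint64_t *val)
{
    for (int i = 0; i < FAST_ENTRIES; i++) {
        if (rf->fast_tag[i] == preg) {        /* hit in the fast file */
            *val = rf->fast_val[i];
            return FAST_LATENCY;
        }
    }
    /* Miss: fetch from the back file and install a copy in the fast file. */
    *val = rf->back[preg];
    unsigned slot = rf->victim++ % FAST_ENTRIES;
    rf->fast_tag[slot] = preg;
    rf->fast_val[slot] = *val;
    return BACK_LATENCY;
}

/* Writes always update the back file, plus any cached copy in the fast file. */
static void rf_write(HybridRF *rf, int preg, uint64_t val)
{
    rf->back[preg] = val;
    for (int i = 0; i < FAST_ENTRIES; i++)
        if (rf->fast_tag[i] == preg)
            rf->fast_val[i] = val;
}

int main(void)
{
    HybridRF rf;
    uint64_t v;
    rf_init(&rf);

    rf_write(&rf, 17, 42);
    printf("cold read of p200:   %d cycles\n", rf_read(&rf, 200, &v));
    printf("second read of p200: %d cycles\n", rf_read(&rf, 200, &v));
    printf("read of p17:         %d cycles, value %llu\n",
           rf_read(&rf, 17, &v), (unsigned long long)v);
    return 0;
}
```

The point being that the scheduler only ever sees one of two read latencies; nothing above the register file needs to know which level actually held the value.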


Mem/reg ops also meant it was possible that up to half of the sources wouldn't come from the register file, which also helped.
I'm curious what Intel has changed for SB, since the ROB no longer serves as a source.

Since the scheduling machinery is very similar to P4, I'd say they massively multiported the register file.

Also, they might schedule an instruction to a reservation station before its data is ready, if the scheduler can determine that an instruction in the reservation station queue will produce the needed value; the instruction can then grab the value from the result bus.

Pentium 4 did this. One of the problems of P4 is that they did it speculatively for load-dependent instructions too, expecting an L1 hit. If the load missed L1, they had to reschedule from just after the load, causing pathological behaviour. However, for instructions that have a low, constant latency, the concept seems valid.
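Just to spell the mechanism out, here's a toy model of that speculative wakeup and replay; the latencies and the single dependent are illustrative assumptions, and none of this reflects the actual P4 replay loop:

```c
#include <stdbool.h>
#include <stdio.h>

#define L1_HIT_LATENCY  3   /* assumed cycles from load issue to data on a hit */
#define MISS_LATENCY   12   /* assumed cycles when the load misses L1          */

typedef struct {
    const char *name;
    int         issue_cycle;   /* cycle the dependent uop issues */
    bool        replayed;
} Uop;

int main(void)
{
    bool load_hits_l1 = false;  /* flip to true to see the optimistic case */
    int  load_issue   = 0;

    /* The scheduler wakes the dependent as if the load will hit L1. */
    Uop dep = { "add r1, r0", load_issue + L1_HIT_LATENCY, false };
    printf("%s speculatively issued at cycle %d\n", dep.name, dep.issue_cycle);

    if (!load_hits_l1) {
        /* The hit assumption was wrong: squash the dependent and replay it
           when the data actually arrives.  On P4 the replayed uop could in
           turn drag its own dependents back through the scheduler, which is
           the pathological behaviour described above. */
        dep.replayed    = true;
        dep.issue_cycle = load_issue + MISS_LATENCY;
        printf("L1 miss: %s replayed, now issues at cycle %d\n",
               dep.name, dep.issue_cycle);
    }
    return 0;
}
```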

Cheers
 
It provides a transparent lock elision function with prefixes that can be safely ignored by legacy hardware.

The HTM part can't run on older hardware.
The general idea is similar between Haswell's instructions and AMD's barely remembered synchronization facility, though that is probably because the desired ends only allow certain means.
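For the lock-elision half, here's a minimal sketch using GCC's HLE-flavoured atomic builtins (as far as I know these need -mhle and a reasonably recent GCC); the XACQUIRE/XRELEASE prefixes they emit are exactly the "safely ignored by legacy hardware" prefixes mentioned above, so the same binary degrades to an ordinary spinlock on older parts:

```c
#include <immintrin.h>   /* _mm_pause */

static int lockvar;
static int shared_counter;

static void elided_lock(void)
{
    /* Emits an XACQUIRE-prefixed locked exchange: on TSX hardware this
       starts an elided (transactional) critical section, on anything older
       the prefix is ignored and it is a plain spinlock acquire. */
    while (__atomic_exchange_n(&lockvar, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();     /* also aborts a failed elision attempt */
}

static void elided_unlock(void)
{
    /* XRELEASE-prefixed store: commits the elided critical section. */
    __atomic_store_n(&lockvar, 0,
                     __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}

void bump_counter(void)
{
    elided_lock();
    shared_counter++;    /* runs transactionally when elision succeeds */
    elided_unlock();
}
```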

Intel's description is vague on certain details, such as minimum capacity guarantees, which right now may be undisclosed for competitive reasons. AMD's plan was to promise at least 4 cache lines.

Intel's scheme is potentially simpler, because it does not provide a mechanism for adding or removing cache lines from speculative monitoring, though this might mean that longer or more complex transactions may fail earlier due to incidental memory operations taking up buffer space.
On the other hand, Intel's implementation is much tighter with regard to side effects, whereas AMD's leakier scheme did not roll back a fair amount of CPU state, nor the state in memory or the TLB.

The downside to that tighter rollback is that certain things, like updating TLB status bits, may lead to an abort on some implementations.
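And a hedged sketch of the explicit-transaction (RTM) side, which is where the capacity question and those implementation-specific aborts bite; the intrinsics are the standard immintrin.h ones (compile with -mrtm), while the fallback policy and the names are my own illustration:

```c
#include <immintrin.h>   /* _xbegin/_xend/_xabort, _mm_pause */

static volatile int fallback_lock;

static void take_fallback_lock(void)
{
    while (__sync_lock_test_and_set(&fallback_lock, 1))
        while (fallback_lock)
            _mm_pause();
}

static void drop_fallback_lock(void)
{
    __sync_lock_release(&fallback_lock);
}

void update(int *data, int n)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (fallback_lock)      /* put the lock into our read set so a   */
            _xabort(0xff);      /* real lock holder forces us to abort   */
        for (int i = 0; i < n; i++)
            data[i]++;          /* a big enough n overflows the speculative
                                   buffering and comes back as a capacity
                                   abort, the concern raised above        */
        _xend();
        return;
    }
    /* Aborted.  A set _XABORT_CAPACITY bit in status would be the footprint
       problem discussed above and is not worth retrying; this sketch simply
       falls back to the real lock for any abort cause. */
    take_fallback_lock();
    for (int i = 0; i < n; i++)
        data[i]++;
    drop_fallback_lock();
}
```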

I'm curious whether AMD has much of a chance at bringing out something compatible with this, and whether my earlier speculation about Bulldozer's wonky memory performance means they had already tried to do something along these lines and won't be forever and a day behind.
 
Release date

Any news on when Knights Corner will ship?
I'm guessing it won't be until after all the Ivy Bridge products are shipping.
Say July?
 
Wait, is Intel really starting to sell Knights Corner as a commercial product?
So it seems, though I'm not sure if you can actually buy them from your local shop.

http://www.tacc.utexas.edu/news/press-releases/2011/stampede
When completed, Stampede will comprise several thousand Dell "Zeus" servers with each server having dual 8-core processors from the forthcoming Intel® Xeon® Processor E5 Family (formerly codenamed "Sandy Bridge-EP") and each server with 32 gigabytes of memory. This production system will offer almost 2 petaflops of peak performance, which is double the current top system in XD, and the real performance of scientific applications will see an even greater performance boost due to the newer processor and interconnect technologies. The cluster will also include a new innovative capability: Intel® Many Integrated Core (MIC) co-processors codenamed "Knights Corner," providing an additional 8 petaflops of performance.
They are also talking about upgrading to future versions of MIC as they come along.
 
They have also talked about selling them to universities and researchers so they could figure out how to actually use them efficiently. I would guess it would be about as easy to get for Joe Average as the highest-end Teslas are.
 
Haswell and MIC

Could transactional memory in Haswell be an indication of MIC in Haswell?
Will MIC support transactional memory? It would make sense, yes?
 
Seems pretty power efficient; it's beating the Tesla 2090 systems in perf/watt. Only BlueGene/Q systems are beating it on perf/watt at the large scale. I suspect this is the 45nm 32-core early version as well; it would be nice to know for sure. If it does that well at 45nm, the 22nm product should be very impressive and competitive.
 
Seems pretty power efficient; it's beating the Tesla 2090 systems in perf/watt. Only BlueGene/Q systems are beating it on perf/watt at the large scale. I suspect this is the 45nm 32-core early version as well; it would be nice to know for sure. If it does that well at 45nm, the 22nm product should be very impressive and competitive.

It appears to be... not.
As well, we can derive the theoretical GFlops per Xeon Phi, as the E5-2670 is a known ~166 GFlops per CPU: (180990 GFlops - 166 GFlops * 236) / 118 = 1201.8 GFlops per Xeon Phi

From Khato@Anandtech
 
It appears to be... not.

Doesn't that line right up with a theoretical 32-core 1.2 GHz 45nm MIC product? 1.2 TFlops DP peak, right? Or am I off somewhere?

Edit: I'm off, that would be SP flops; DP would be half that. It is ballpark DP for a 50+ core 1.6 GHz 22nm product though, and it's an actual number, not a theoretical one, so yeah. Slightly better than the Tesla 2090 systems, not bad for a first showing.
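For what it's worth, here's Khato's derivation and the peak numbers side by side; the 180990/166/236/118 figures come from the quoted post, while the 16-wide-SP-FMA-per-core peak at 1.2 GHz is my own assumption about the early silicon, not a spec:

```c
#include <stdio.h>

int main(void)
{
    /* Figures from the quoted derivation. */
    double rmax    = 180990.0;  /* GFlops, whole-system Linpack number */
    double per_cpu = 166.0;     /* GFlops per Xeon E5-2670             */
    int    cpus    = 236;
    int    cards   = 118;

    double per_card = (rmax - per_cpu * cpus) / cards;
    printf("Derived GFlops per Xeon Phi: %.1f\n", per_card);   /* ~1201.8 */

    /* Assumed peak for a 32-core, 1.2 GHz part with 16-wide SP FMA per core. */
    double sp_peak = 32 * 1.2 * 16 * 2;   /* 1228.8 GFlops SP */
    double dp_peak = sp_peak / 2;         /*  614.4 GFlops DP */
    printf("32c @ 1.2 GHz: %.1f SP / %.1f DP GFlops peak\n", sp_peak, dp_peak);
    return 0;
}
```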
 