22 nm Larrabee

Okay, I had no idea you could use a vector field as a memory offset for a contiguous load. Could you show where in the reference manual this is specified? Is there a way to pick which field or is it the first one or something like that?
I don't think such an instruction exists. I believe Gipsel was describing the functionality from a conceptual point of view. That is, if such an instruction existed, it would mainly be a matter of collecting all the other elements in the same cache line, and updating the mask register, to implement Knights Corner's gather.
 
@Exophase:
I was basically saying that a gather in the worst case is probably not much faster than manually and individually loading all elements (though with gather you save the loop, which runs 16 iterations in the worst case). The reason it would not work that well is simply that there is no efficient way to manipulate the write masks (you can't use an immediate or a scalar register as a write mask, nor manipulate a write mask directly with a bit shift).
But what you called the major hindrance (individually moving the offsets of each element from the vector registers to scalar registers) isn't that much of a problem in my opinion. It's a single vector move to memory (the L1-D$), followed by loading each offset to use as (part of) the address for the broadcast instruction. It is highly probable that one vector load op (the broadcast instructions run through the vector pipe) and one scalar load can execute in parallel. This would have been pretty fast if KC allowed immediates as masks.
 
Okay, I think I understand why I was confused - I didn't realize broadcasts came from memory instead of scalar registers. So your proposed loop is like this:

Code:
vmov(offsets_array, vector_offsets);
for(i = 0; i < 8; i++)
{
  int offset = offsets_array[i];
  vbroadcastss(dest_vector, &(base[offset]), vector_mask_field(i));
}
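For clarity, here is what that loop computes, written as plain scalar C (a sketch only: the lane count follows the pseudocode above, and the function name is made up — real KC code would issue the broadcasts through the vector pipe rather than a scalar loop):

```c
#include <stdint.h>

#define LANES 8  /* matches the loop above; a KC vector register holds 16 floats */

/* Scalar model of the masked-broadcast trick: each iteration loads one
 * element and writes it into exactly one lane of the destination, which
 * is what vbroadcastss with a one-bit write mask would accomplish. */
static void gather_by_broadcast(float dest[LANES], const float *base,
                                const int32_t offsets[LANES])
{
    for (int i = 0; i < LANES; i++)
        dest[i] = base[offsets[i]];  /* lane i <- base[offsets[i]] */
}
```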

But I'm not convinced that these two loads can actually be performed in parallel. Do you have any information that the cache can perform two line loads, one vector and one scalar, in one cycle? Even then, we'd have to know that the broadcast uses the vector load path.

Even if these do co-issue, I strongly doubt you'll be able to issue them back to back like that w/o a stall; you'd eat the full latency of that scalar load, which is probably substantial when the result is used in the effective address of the next instruction. You'd also need to hope that store-to-load forwarding from a vector store to a scalar load is any good. You'd probably need a fair number of independent instructions at the start of the loop to get things rolling, and you'd easily blow through all your scalar registers.
 
Okay, I think I understand why I was confused - I didn't realize broadcasts came from memory instead of scalar registers. So your proposed loop is like this:

Code:
vmov(offsets_array, vector_offsets);
for(i = 0; i < 8; i++)
{
  int offset = offsets_array[i];
  vbroadcastss(dest_vector, &(base[offset]), vector_mask_field(i));
}
I actually proposed to completely unroll it, but otherwise yes.
But I'm not convinced that these two loads can actually be performed in parallel. Do you have any information that the cache can perform two line loads, one vector and one scalar, in one cycle? Even then, we'd have to know that the broadcast uses the vector load path.
The latter is stated in the ISA manual, IIRC. But you are probably right on the former point: although the vector memory operations have their own address generation (mainly to enable scatter/gather and the different meaning of offsets) and the L1-D$ allows two simultaneous accesses, it appears to have only a single read port and a single write port. So everything else doesn't matter much, after all.
Even if these do co-issue, I strongly doubt you'll be able to issue them back to back like that w/o a stall; you'd eat the full latency of that scalar load, which is probably substantial when the result is used in the effective address of the next instruction.
You'd also need to hope that store-to-load forwarding from a vector store to a scalar load is any good. You'd probably need a fair number of independent instructions at the start of the loop to get things rolling, and you'd easily blow through all your scalar registers.
I somewhat doubt the load-to-use latency of the L1 is higher than 4 cycles (and a store won't be much slower). That Knights Corner supports 4-way SMT looks like a natural fit to hide the L1 latency almost completely. I guess there is also a reasoning behind the possibility to use memory operands directly in the vector operations: the L1 has to work as an extension of the (compared to GPUs) pretty small register file. It has to be quite fast and low latency to accomplish that (and as Intel builds 4-cycle-latency, 32 kB L1-D$s running at ~4 GHz in their CPUs, a 32 kB L1-D$ running at ~1.3 GHz does not appear to be a very hard problem). You can almost think of the vector registers as a slightly larger register file cache, similar to what nV likes to talk about in connection with Maxwell/Einstein/Project Denver; it is just explicitly managed by the running program.
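As a back-of-the-envelope model of that latency-hiding argument (my toy model, not anything from Intel's documentation): under strict round-robin SMT a thread re-enters the pipe every `num_threads` cycles, so a dependent use only stalls for whatever part of the load latency the other threads' turns don't already cover:

```c
/* Toy model of latency hiding under strict round-robin SMT: a thread
 * issues a load on one turn and the dependent use on its next turn,
 * num_threads cycles later. The stall is the uncovered remainder. */
static int stall_cycles(int num_threads, int load_to_use_latency)
{
    int uncovered = load_to_use_latency - num_threads;
    return uncovered > 0 ? uncovered : 0;
}
```

With 4 threads and a 4-cycle L1, `stall_cycles(4, 4)` is 0, i.e. the latency is fully hidden, while a single thread would eat 3 stall cycles per dependent load.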
 
Having a gather instruction that returns immediately instead of looping/stalling internally can have some nice advantages (if they prefetch the remaining loads).

You could XOR off the returned mask and do your work with only the returned lanes.
I imagine something like a kd-tree traversal could be really efficient: the nearer a node is to the root, the higher the chance it's in L2 or even L1, but if it's a slow fetch from RAM, all the other lanes could continue working. It would be a 'finer threading' than GPUs do with wavefronts/warps. With a very divergent access pattern, e.g. secondary rays in a path tracer, I'd think it's way more efficient on LRB, as you could work on at least something instead of keeping 31 lanes (or threads) on hold because one is still not loaded.
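A rough C sketch of that idea (everything here is hypothetical: `gather_partial` stands in for an immediately returning gather, stubbed to complete one lane per call to model worst-case divergence — real hardware would finish whole cache lines at once):

```c
#include <stdint.h>

typedef uint16_t mask16;  /* one bit per lane of a 16-wide vector */

/* Hypothetical primitive standing in for an immediately returning
 * gather: fills some subset of the still-pending lanes and returns a
 * bitmask of the lanes completed this call. Stubbed to finish exactly
 * one pending lane per call. */
static mask16 gather_partial(float dest[16], const float *base,
                             const int32_t idx[16], mask16 pending)
{
    for (int i = 0; i < 16; i++)
        if (pending & (mask16)(1u << i)) {
            dest[i] = base[idx[i]];
            return (mask16)(1u << i);
        }
    return 0;
}

/* The loop sketched in the post: XOR the returned mask off the pending
 * set and work on just the freshly returned lanes each pass. Returns
 * the number of passes taken until every lane was served. */
static int gather_and_process(float dest[16], const float *base,
                              const int32_t idx[16])
{
    mask16 pending = 0xFFFF;  /* all 16 lanes outstanding */
    int passes = 0;
    while (pending) {
        mask16 ready = gather_partial(dest, base, idx, pending);
        /* ... do the real work with just the lanes set in `ready` ... */
        pending ^= ready;     /* clear the lanes we handled */
        passes++;
    }
    return passes;
}
```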
 
I'd think it's way more efficient on LRB, as you could work on at least something instead of keeping 31 lanes (or threads) on hold because one is still not loaded.
As you also have to loop over the vector instructions processing the returned data until everything has been returned (at least in the worst case), it looks like it could be an excellent way of wasting energy for no, or even a negative, performance gain. But I agree that one may be able to construct cases where it could help. As you said, tree traversals are a candidate, but one has to look carefully: one needs a likely early-out of the loop, otherwise it will be slower. If you need all of the returned data anyway, it's not an option to process some of it earlier and some of it later (it's just SIMD with lane masking, like GPUs, after all).
 
As you also have to loop over the vector instructions processing the returned data until everything has been returned (at least in the worst case), it looks like it could be an excellent way of wasting energy for no, or even a negative, performance gain.
Spinning on a condition (a condition mask from the load units) isn't beneficial either, unless Intel does some smart energy saving in those tight loops, like they have done in their Core CPUs since Nehalem.
I'm not sure how many gather instructions you could have in the pipe simultaneously to unroll the spinning and start working on the first gather that completes, but it sounds like quite an overhead to emulate the warp/wavefront scheduling of GPUs. Yet there would still be some spinning. It might have been smarter to have a stalling gather, if the other 'hyper-threads' could continue working and get the time slices of the waiting 'hyper-thread'.

I also wonder what causes a return with a non-cleared mask: whether it's a long-latency memory fetch, or even just an access crossing cache lines (in that case I hope those are 128 or 256 bytes long).

But I agree that one may be able to construct cases where it could help. As you said, tree traversals are a candidate, but one has to look carefully: one needs a likely early-out of the loop, otherwise it will be slower. If you need all of the returned data anyway, it's not an option to process some of it earlier and some of it later (it's just SIMD with lane masking, like GPUs, after all).
Yeah, it only works if you have some kind of dynamic data consumer; a regular matrix × matrix multiplication would only see disadvantages from it.
 
Do we have any independent performance data on either Knights Ferry or Knights Corner? It's a bit strange that Intel is so quiet on that front. I know there have been a few preliminary papers focused on KF scaling, but I haven't seen anything with absolute performance numbers.
 
Do we have any independent performance data on either Knights Ferry or Knights Corner? It's a bit strange that Intel is so quiet on that front. I know there have been a few preliminary papers focused on KF scaling, but I haven't seen anything with absolute performance numbers.

I'm not surprised to see Intel really tight-lipped about it; I'm sure they don't want to reveal too much data before the official launch.
 
I'm not surprised to see Intel really tight-lipped about it; I'm sure they don't want to reveal too much data before the official launch.

Then that is a change for them.

They have been pumping Xeon Phi aka Knights Corner aka Larrabee non-stop.

For Intel to be this silent seems to mean either something is wrong or that maybe they are embarrassed by the K20 from Nvidia with its 1.3 TFLOPS DP.
 
Do we have any independent performance data on either Knights Ferry or Knights Corner? It's a bit strange that Intel is so quiet on that front. I know there have been a few preliminary papers focused on KF scaling, but I haven't seen anything with absolute performance numbers.

AFAIK, everything so far is under NDA.
 
Then that is a change for them.

They have been pumping Xeon Phi aka Knights Corner aka Larrabee non-stop.

For Intel to be this silent seems to mean either something is wrong or that maybe they are embarrassed by the K20 from Nvidia with its 1.3 TFLOPS DP.

I'm guessing their power figures might not be so good?

And does anyone know when exactly the official launch date is :?:
 
Really? Does Intel regularly disclose performance numbers before launching a new processor?

They've made some public claims about MIC performance from runs behind closed doors. However, they're not allowing independent researchers who have hardware in hand to say anything beyond "it works".

The best guess I've seen is this from the Register.

http://www.theregister.co.uk/2012/06/18/intel_mic_xeon_phi_cray/

Intel showed off the Xeon Phi chip pushing 1 teraflops at double-precision math running the DGEMM matrix math benchmark. That's only one card running one benchmark, and the real test is how a server cluster equipped with hundreds or thousands of Xeon Phi coprocessors will do running Linpack and other benchmarks, and then real workloads. Intel showed off a single Knights Corner coprocessor running Linpack at 1 teraflops peak (not sustained) performance at ISC on Monday.

To counter the MIC skeptics, Intel slapped together a cluster called "Discovery", which came in at 150 on the latest Top 500 rankings. This machine uses eight-core Xeon E5-2670 processors running at 2.6GHz in its two-socket server nodes. The nodes are lashed together with 56Gbs FDR InfiniBand cards and switches, and have Knights Corner coprocessors dropped into the servers as well. The exact feeds and speeds of the Discovery cluster were not divulged, but the machine has a total of 9,800 processor cores and a source at Intel tells El Reg it is "significantly lower than 100 nodes."

If you play around with some numbers (and El Reg can't resist) and assume you have two Knights Corner coprocessors per server node with 54 cores activated and two Xeon E5-2670s, you can get 9,796 cores across 79 server nodes. That would be 158 teraflops of raw peak Linpack performance from the aggregate MIC cards, and another 26.3 teraflops peak from the 1,264 Xeon cores.
This jibes almost perfectly with Intel's own peak performance with the Discovery cluster, which came in at 180.99 teraflops peak and 118.6 teraflops sustained on the Linpack test.

The important thing is that with what we presume were 158 MIC cards, Intel was able to get a computational efficiency of 65.5 per cent (meaning that share of cycles that could do work across the ceepie-phibie did work) and only burned 100.8 kilowatts.
 
Do a Google Search on the above and see the hype from Intel including expected performance numbers.

lol I don't think Marco has to google anything wrt larrabee. :p

And fwiw he's correct; Intel has remained pretty consistent on how it releases *juicy* information regarding larrabee. I know they have been a bit more open with CPUs, but that really hasn't translated to larrabee. The secrecy certainly doesn't seem new to me and can hardly be called a "change" in strategy.
 
Really? Does Intel regularly disclose performance numbers before launching a new processor?

There are often "leaks" or previews of upcoming processors weeks or months before launch, usually through big websites like Anandtech.

Granted, that may not make as much sense for an HPC product.
 
However, they're not allowing independent researchers who have hardware in hand to say anything beyond "it works"
How is this any different from any other Intel processor launch? It's not like IVB machines were parachuted into selected ISV hands the day of the official launch.
 
I am looking forward to using those little beasts for HPC... I already broke the O(10^5) processor mark with my simulations... now I want the million... bring it on and build a real HPC system!
 
Yes, when it is Xeon Phi aka Knights Corner aka Larrabee.

Do a Google Search on the above and see the hype from Intel including expected performance numbers.

Everybody (minus Apple) discloses hype before the next shiny is about to launch. Real world numbers, tested by unbiased parties without the aid of cripple-everything-but-my-chips tools with real world code are another matter.
 