That seems reasonable. The description appears to be at a somewhat higher level, where the particular implementation details of the DX11 pipeline would not be material.

Looking at slide 22, the only way I can interpret tessellation being done in the back-end (along with GS) is if TS is synonymous with VS->HS->TS->DS (i.e. it is not a reference purely to the TS stage).
The reduced amount of data passed between the front and back ends would lead to bandwidth savings. Each core could potentially try to read from the same set of attributes, but this should be forwarded as needed within the cache hierarchy relatively quickly, and there is hopefully no write traffic to those locations in this phase.

The advantages of delaying some DS work would include reduced storage in global memory and re-distribution of workload (e.g. later DS might lead to better load-balancing).
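To make the storage saving concrete, here is a toy sketch of the deferred-DS idea: the binner evaluates only enough of the domain shader to place each point in a tile, and stores a compact (patch, u, v) reference instead of the full post-DS vertex; the back end re-runs the full DS per bin at pickup. All function names and the patch layout are my own illustrative assumptions, not anything from the slides.

```python
# Hypothetical sketch of deferring domain-shader (DS) work until bin pickup.
# Bins store compact (patch_id, u, v) references rather than full post-DS
# vertices, which is where the global-memory/bandwidth saving comes from.

def domain_shade(patch, u, v):
    # Stand-in DS: bilinear position plus a bundle of scaled attributes.
    px = patch["x0"] + u * patch["dx"]
    py = patch["y0"] + v * patch["dy"]
    return {"pos": (px, py), "attrs": [u * a for a in patch["attrs"]]}

def bin_front_end(patches, tile_size, grid_w, grid_h):
    """Front end: tessellate and bin, storing only domain references."""
    bins = {(tx, ty): [] for tx in range(grid_w) for ty in range(grid_h)}
    for pid, patch in enumerate(patches):
        for u, v in patch["domain_points"]:
            # Evaluate position only, to find the covering tile; the full
            # attribute set is NOT computed or stored at this stage.
            pos = domain_shade(patch, u, v)["pos"]
            tile = (int(pos[0] // tile_size), int(pos[1] // tile_size))
            if tile in bins:
                bins[tile].append((pid, u, v))   # compact reference
    return bins

def back_end(bins, patches, tile):
    """Back end: re-run the full DS per bin, at pickup time."""
    return [domain_shade(patches[pid], u, v) for pid, u, v in bins[tile]]
```

In this toy version each binned vertex costs three small values instead of a position plus every interpolated attribute; the trade is that position evaluation happens twice, once for binning and once at pickup.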
There may be ordering and atomicity constraints for GS. Perhaps if the scheduler can determine that there is no interaction between invocations, they can be allowed to persist over the non-deterministic delay between binning and bin pickup.

GS can do a variety of things. If GS is used merely to delete vertices/triangles then in theory it can be delayed until after binning - again this is a load-balancing question, I think, i.e. run GS across lots of cores as they pick up bins, rather than on a few cores while creating bins.
Maybe there are some other usages of GS that are amenable to delayed execution (e.g. generating attributes)?
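As a toy illustration of the cull-only case: when the GS does nothing but delete primitives and its invocations are independent (no cross-invocation ordering or atomics, per the caveat above), running it per bin at pickup yields the same surviving set as running it before binning. The names and triangle representation below are my own illustrative assumptions.

```python
# Hypothetical sketch: a GS that only culls primitives can be deferred from
# the binning phase to the bin-pickup phase, provided its invocations are
# independent. The deferred schedule spreads GS work across the many cores
# that pick up bins, instead of concentrating it in the binner.

def gs_cull(tri):
    # Stand-in geometry shader that only deletes triangles (e.g. culling
    # degenerate/zero-area ones); it emits the input unchanged or nothing.
    return tri if tri["area"] > 0 else None

def eager_schedule(tris, bin_of):
    """Run GS before binning: GS load lands on the cores creating bins."""
    bins = {}
    for tri in tris:
        kept = gs_cull(tri)
        if kept is not None:
            bins.setdefault(bin_of(kept), []).append(kept)
    return bins

def deferred_schedule(tris, bin_of, tile):
    """Bin everything first, then run GS per bin at pickup."""
    bins = {}
    for tri in tris:
        bins.setdefault(bin_of(tri), []).append(tri)
    return [t for t in bins.get(tile, []) if gs_cull(t) is not None]
```

The cost of deferral in this toy model is that culled triangles still occupy bin storage; the benefit is that GS execution parallelises across bin-pickup cores.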
This appears possible. Intel claimed earlier that the rasteriser part of the pipeline wasn't the hard part.

By the way, the term "rasteriser" is often used to describe all of these stages: setup->rasterisation->pixel shading->output merger (ROP). So it's possible to interpret the statement about the lack of a fixed-function rasteriser as actually describing the lack of a fixed-function "setup->rasterisation->pixel shading->output merger". To be honest I think this is very likely the correct interpretation.
Larrabee may have been near the top end of what is possible for a PCIe graphics accelerator, at about 2/3 of the die. The rest of the die had IO, controllers, UVD, texture blocks, and miscellaneous logic.

I pretty much always thought it would be years before Intel was competitive at the enthusiast end, but process would eventually allow it to catch up. A major question for the other IHVs is what proportion of die space ends up being programmable compute, and the higher that rises the more competitive Intel becomes.
A good amount of the uncore would need to scale as well, otherwise the compute portion would be strangled.
x86 penalty aside, the decision to use full cores for that 2/3 of the die was also a contributing factor to the size and power concerns.
There can be programmable processing units either way, but past a certain number of fully-fledged CPU cores the utility of adding even more would have diminished. There was a lot of front-end and support silicon for the amount of vector resources one got per core.