Larrabee at GDC 09

I wonder why the Windows people complain so much about Intel IGP drivers. Their Linux drivers, at least, are about as good as drivers get.

Bugs linger on in the Windows drivers.

With a dual-screen setup, for instance, the Intel driver keeps resetting the monitor placement, so you're almost forced to put the screen on the side the Intel driver wants it on.

The overlay hardly ever works (or it resets the color values to 0,0,0 on QuickTime playback). Things like that. They're utter, utter crap.
 
The second one is still Jasper Forest, not a Larrabee-based product.
You're right. Perhaps I don't understand correctly, because this link says this:

"The first encounter with the photograph of silicic plate, which contains larrabee video chip, confirmed that the core area of these chips will not exceed 300 sq. mm. However, later in the network appeared a clearer photograph , which allowed some to estimate the crystal area Of larrabee.

Associate note that the core area of these chips is within the limits of the expected values..

larrabee plate 45 nm

On the silicic plate with 300 mm diameter are placed approximately 86 chips larrabee, although some sources call another number - 64 piece.
larrabee plate 45 nm

One way or another, the core area of elder version larrabee in this case can reach 600 sq mm. Wolfdale processors with the core area 107 sq mm have 410 ml transistors. Almost six times larger chip can place about 2 billion transistors. It would be desirable to hope that this entire speed potential will be used effectively. As explained Intel representatives , series products larrabee will appear in the beginning of the following year. "
 
On a second look at the wafer shot, considering that there is approximately 2 mm of edge clearance intended for handling, alignment, and safety marks, the estimated die size actually comes out pretty close to 617 mm².
 
According to this:

http://portal.acm.org/citation.cfm?id=1413409

which I can no longer access (luckily I managed to grab the PDF a while back), the FMAC in the 80-core Terascale chip has a latency of 9 cycles.

3.2 SGEMM

SGEMM is the BLAS 3 subroutine used to multiply two single precision general dense matrices. It is the key building block for many dense linear algebra algorithms. The BLAS 3 routine includes many variations (e.g. scaling, transposition, etc), but we consider only the simplest case: C = A x B, where A, B, and C are NxP, PxM, and NxM matrices, respectively.

The SGEMM routine can be defined by the following C code fragment:

Code:
for (i=0; i<N;i++) 
  for (j=0;j<M;j++) 
    C[i][j] = 0.0; 
    for(k=0;k<P;k++) 
      C[i][j] +=  
          A[i][k] * B[k][j];

Modern algorithms for SGEMM are block-based and decompose the problem into smaller SGEMM problems. For matrices of order N these algorithms have read/write operations that scale as O(N²) while computational effort is O(N³). By selecting large blocks that fit in cache, the read/write operations can be overlapped with computations, allowing SGEMM to execute near the peak floating point capacity of a chip.
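
(Not from the paper — just a minimal C sketch of the blocking idea, assuming square matrices of order N and a block size BS that divides N evenly; all names are illustrative.)

Code:
#define N  512
#define BS 64   /* chosen so a BS x BS block of each matrix fits in cache */

static float A[N][N], B[N][N], C[N][N];

/* Blocked SGEMM: O(BS^3) arithmetic is done per O(BS^2) block loads,
   so the loads can be overlapped with computation. */
void sgemm_blocked(void)
{
    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < N; jb += BS) {
            for (int i = ib; i < ib + BS; i++)      /* zero the C block */
                for (int j = jb; j < jb + BS; j++)
                    C[i][j] = 0.0f;
            for (int kb = 0; kb < N; kb += BS)      /* accumulate block products */
                for (int i = ib; i < ib + BS; i++)
                    for (int k = kb; k < kb + BS; k++)
                        for (int j = jb; j < jb + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
        }
}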

The 80-core Terascale processor does not have a cache, but blocking could still be useful for overlapping communication and computation as well as register blocking. However, block-structured algorithms require triple loops over the blocks of the matrices. Since the Terascale Processor does not support nested loops, we have to use a different and less efficient algorithm based on dot products. We mapped the cores onto a ring topology. The A and C matrices were decomposed into rows with one row per core. B was decomposed by columns with one column per core. The algorithm proceeded as follows:

Code:
On core number i
  Loop over j = 1, M
  {
    C[i][j] = dot_product(row A[i], column B[j])
    Circular shift column B[j] to the neighboring core
  }
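
(Again not from the paper: a serial C sketch that simulates the ring algorithm above, one "core" per matrix row; NC and the helper arrays are illustrative.)

Code:
#define NC 8    /* number of simulated cores; the real chip has 80 */

static float A[NC][NC], B[NC][NC], C[NC][NC];

void sgemm_ring(void)
{
    /* col[i] is the column of B currently held by core i;
       core i starts with column i. */
    float col[NC][NC];
    for (int i = 0; i < NC; i++)
        for (int k = 0; k < NC; k++)
            col[i][k] = B[k][i];

    for (int s = 0; s < NC; s++) {
        for (int i = 0; i < NC; i++) {   /* all cores run in parallel on hw */
            int j = (i + s) % NC;        /* the column core i holds at step s */
            float dot = 0.0f;
            for (int k = 0; k < NC; k++)
                dot += A[i][k] * col[i][k];
            C[i][j] = dot;
        }
        /* circular shift: each core hands its column to its neighbour */
        float tmp[NC];
        for (int k = 0; k < NC; k++) tmp[k] = col[0][k];
        for (int i = 0; i < NC - 1; i++)
            for (int k = 0; k < NC; k++)
                col[i][k] = col[i + 1][k];
        for (int k = 0; k < NC; k++) col[NC - 1][k] = tmp[k];
    }
}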

Optimization and Performance Analysis:

Additional optimizations used in the final algorithm:
  • Used diagonal wrapped storage, where the array holding each row of C started with the diagonal element and wrapped around, thereby letting us use a single indexing scheme into C for each core.
  • Unrolled the dot product loop, so we could use a single loop to run over a column of C.
  • Organized code so elements of B were shifted immediately after they were used, thereby overlapping computation and communication.

We selected the matrix dimensions N, M and P to fill the available memory: N=P=80 and M=206. Each element of C required a dot product of length M, resulting in M multiplies, M-1 adds, and 2M loads.

Each FPU can retire one FMAC per cycle. The FPU is a 9-stage pipeline, so loading registers to feed the pipelines for a dot product required 2×9 registers for the operands and 1 register for the accumulator, or 38 registers for both FMAC units. Obviously, there were not enough registers to keep both FPUs running at 100% for a dot product. The fundamental performance bottleneck, however, was the fact that only 2 operands can be loaded from data memory per cycle. Since the FMACs need 2 operands from memory each, the loads limited peak dot product performance to 50% of the total available floating point performance.
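
(One illustrative reading of the diagonal wrapped storage bullet, not the paper's code: on core i, local slot s of its C row holds global element C[i][(i + s) % M], so at step s every core writes local slot s and needs no per-core index arithmetic.)

Code:
/* Map a core's local C slot to its global column under diagonal
   wrapped storage: slot 0 is the diagonal element, then wrap. */
int wrapped_col(int core, int slot, int M)
{
    return (core + slot) % M;
}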
Jawed
 
Who knows. They haven't said much about core size, or ring-network die space, or which/how many MCs there will be. So write the numbers on a piece of paper and throw a dart; that's as good a way of guessing as any. :LOL:
 
On this 600 mm²–617 mm² Larrabee chip, how many cores are we talking about: 12? 16? 24? 32? 48? 64?

Using Intel's estimate of around 10 mm² per core, 48 at best, or 16 if it's just a test chip. So it's probably 32, or perhaps 24 with some cores disabled.

I doubt it's 64. Even a Cell with 64 SPUs would be a tight fit on that die size, and the Larrabee cores are probably at least twice as large as those SPUs, or even more.
 
So a Larrabee core could be well above 40 million transistors if we split all the extra logic (ring bus, texture units, memory controller, etc.) evenly among the cores.
Intel could still manage ~20 cores within the same transistor budget as an RV740 and be in the same ballpark as ATI with regard to the theoretical peak figure (~1 TFLOPS). The problem for Intel is transistor density: it looks like they are at a disadvantage against TSMC's 40 nm process (someone brought up the figures earlier) and would end up with a bigger chip. Is that right?

I have some questions now that more stuff about Larrabee is public.
It looks like the Larrabee cores have ended up pretty big, so what do you think about some of the design choices? (I'm calling on the armchair experts :LOL: ).
Randomly (off the top of my head; I don't know much, so don't be mean if some questions are stupid):
What do you think about the VPU width? Would 8-wide or less have been better in some regard? (cache line size vs. vector width, better use of gather? // impact on clock speed? // could narrower units be made cleverer?)
Would it make sense to have less than 256 KB of L2 per core, or do you think it's pretty optimal?
Intel stated that L2 latency will be around 10 cycles; is this critical for the kind of workloads Larrabee will run (especially as hyperthreading may hide some latencies)? Could Intel save on power budget here with a slower L2 cache // clock the chip higher?
Could the inclusion of an L0 cache help the design by removing pressure on the L1 cache (I mean, Intel could design the L1 with higher latency and save on power consumption // clock the chip higher)?
Overall, might Intel have gone with narrower but more power-efficient cores and a higher clock?

I'm not implying that Intel made some wrong choices or that I have enough knowledge to criticize their choices, but I've seen very few criticisms of the design, so I wonder what you guys would have changed.
 
The problem for Intel is transistor density: it looks like they are at a disadvantage against TSMC's 40 nm process (someone brought up the figures earlier) and would end up with a bigger chip. Is that right?

Could be, but it is hard to tell.
The process size (40 nm or 45 nm) refers to the smallest possible feature.
This may or may not reflect how the average feature size compares. For example, if you compare Intel and AMD processors on the same process size, the average transistor count per area is not the same (nor are other aspects, for that matter, such as leakage, power dissipation, clock speeds, etc.).
This partly has to do with the architecture itself (e.g. cache can be very 'compact', where more complex logic may have a lower transistor count per area), but also partly with tweaking the design to get the best possible yields from the manufacturing process, or perhaps to have more favourable power efficiency, etc.

Intel's process is obviously very mature (it has been in mass production for quite a while, with very good yields), and Intel also uses new materials like hafnium, which I don't think TSMC will use.
I think it may be very close.

I'm not implying that Intel made some wrong choices or that I have enough knowledge to criticize their choices, but I've seen very few criticisms of the design, so I wonder what you guys would have changed.

I personally want to wait until I see some actual software running on the thing.
Since the whole rendering algorithm is different from what nVidia and ATi use, it's really hard to compare performance by just looking at some vague specs.
 
The 80-core Terascale/Polaris uses 100M transistors to implement a cache-less, grid-communicating array of 160 FMACs that can run at multiple GHz. Each core is only 2 FMACs, which means there's practically no SIMD-wise saving.

The 16 VPU lanes of each Larrabee core could cost no more than 0.5M transistors each. The design appears to be deliberately cheap, with the only concessions being L1 routing and 4 threads (and, consequently, a large register file, by Intel standards).

I guess we're looking at cores costing about 25M transistors (excluding L2).

Jawed
 
What do you think about the VPU width? Would 8-wide or less have been better in some regard? (cache line size vs. vector width, better use of gather? // impact on clock speed? // could narrower units be made cleverer?)
Larrabee's starting point is an x86 core, with standard caches, extra vector units, and some claim its RTL comes directly from an already existing core.

Given the context of the x86-based design, the cost of narrowing VPU width is that the peak numbers can only be brought up by expanding some rather expensive things.

Clock speed: probably already pretty high for an x86 core with a short pipeline.

Extra SIMD units: extra wide instruction decoding and wider superscalar issue = additional transistor cost and further modification of the base design
More cache ports would be needed, more resources would be needed outside of the FP ALUs to handle more address calculations and emulation loop overhead.
If we go back to the premise that Larrabee's cores are leveraging an already existing design, these are not going to be added.

Extra cores = extra cores and all the corresponding hardware resources, extra cache tiles, and heavier demands on the ring bus

The width of 16 probably sits near an optimum point of average utilization and hardware cost.

Would it make sense to have less than 256 KB of L2 per core, or do you think it's pretty optimal?
Intel stated that L2 latency will be around 10 cycles; is this critical for the kind of workloads Larrabee will run (especially as hyperthreading may hide some latencies)? Could Intel save on power budget here with a slower L2 cache // clock the chip higher?
There are probably loads that might want more, and others that want less.
The competition can't match Larrabee when it comes to capacity in the L2.
L1 and register file capacity on Larrabee is inferior in comparison to the register files of GPUs, though.
I suppose testing in the real world will tell, but the software infrastructure should be able to tune itself to the constraints of the L2's capacity.

Could the inclusion of an L0 cache help the design by removing pressure on the L1 cache (I mean, Intel could design the L1 with higher latency and save on power consumption // clock the chip higher)?
I'd wonder if the cache structures are the limit in either timing or power.
The vector and texture units would probably dominate.

Given that Larrabee leverages an existing design, injecting random new cache levels would not be an option.

Overall, might Intel have gone with narrower but more power-efficient cores and a higher clock?
Given the constraints imposed on the design, Larrabee probably ended up where it had to: a heavier per-core cost and a more flexible but more expensive memory subsystem.
 
I find it puzzling that this algorithm wasn't created before chip design even started. It may not be the cornerstone of the architecture, and it doesn't seem like it was much of a sweat to draw up, either.

How was the choice of 16-wide VPU made and was rasterisation, per se, an important factor? I suppose if you want "square" powers of two then between 4, 16 and 64 there really isn't much choice!

And it leads smack into the divergence penalty. Looking at fig 29:

http://www.ddj.com/hpc-high-perform...ionid=CTL4XDMAKYILUQSNDLOSKHSCJUNN2JVN?pgno=6

that triangle covers 5 tiles, 15 pixels and 7 quads. It's shading 80 pixels to produce 15 results: 19% utilisation. If the quads were packed (i.e. with a bit of conditional routing), it'd be shading 32 pixels for 15 results: 47% utilisation.

Alternatively, if the rasteriser works on strips/patches of triangles, then the tile masks will naturally tend towards being fully set as multiple triangles per tile are shaded concurrently. Or rather, if the setup engine takes tile masks for contiguous triangles, it can create shading tiles that are maximally populated.

That's then a sorting problem: find all the triangles in the bin set that are contiguous (or at least non-overlapping). Or it could be just a naive "does this triangle share an edge with the prior triangle?" test.
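
Something like this would do for the naive test, assuming indexed triangles where sharing two vertex indices implies sharing an edge (the function name is mine):

Code:
/* Two triangles, each given as three indices into a shared vertex
   buffer, share an edge iff they have two vertex indices in common. */
int shares_edge(const int a[3], const int b[3])
{
    int common = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            if (a[i] == b[j])
                common++;
    return common >= 2;
}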

Jawed
 
I find it hard to believe the algorithm wasn't already well known by the Larrabee developers, given Intel's history with tilers; it's a rather trivial variation of the way PowerVR works, for instance (and probably all tilers).
 
I find it hard to believe the algorithm wasn't already well known by the Larrabee developers, given Intel's history with tilers

Does TBR even work on their IGPs to date? AFAIK it's even a hardware implementation (which sucks).

***edit: and I forgot to note that the chipset team doesn't have much to do with the LRB team, AFAIK.

it's a rather trivial variation of the way PowerVR works, for instance (and probably all tilers).
Obviously there will always be similarities, as well as differences that aren't always visible at first sight.

Dumb question: is there any indication that LRB's driver could sort triangles? (No, not PowerVR-related, LOL; Olick mentioned something about triangle sorting in his SIGGRAPH presentation and I'm just curious...)
 
I find it puzzling that this algorithm wasn't created before chip design even started. It may not be the cornerstone of the architecture, and it doesn't seem like it was much of a sweat to draw up, either.

How was the choice of 16-wide VPU made and was rasterisation, per se, an important factor? I suppose if you want "square" powers of two then between 4, 16 and 64 there really isn't much choice!

And it leads smack into the divergence penalty. Looking at fig 29:

http://www.ddj.com/hpc-high-perform...ionid=CTL4XDMAKYILUQSNDLOSKHSCJUNN2JVN?pgno=6

that triangle covers 5 tiles, 15 pixels and 7 quads. It's shading 80 pixels to produce 15 results: 19% utilisation. If the quads were packed (i.e. with a bit of conditional routing), it'd be shading 32 pixels for 15 results: 47% utilisation.
Actually, this is quite simple to solve in software: you just have to pack 4 non-empty quads into a qquad.
 
Yeah, but by the time you've figured out which quads are non-empty, you've already paid the price.
The rasterization algorithm already determines which quads are empty and which are not :) So it's just a matter of filling a qquad with non-empty quads (as much as possible) and gathering/scattering (or performing multiple loads/stores, one per quad, whichever is faster) the quads into the proper place. It could potentially be a big win if you have lots of small triangles.
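
A minimal sketch of that packing step, assuming the rasteriser hands over a coverage mask per quad; shade_qquad and the 4-quads-per-qquad grouping are illustrative stand-ins, not Larrabee's actual pipeline:

Code:
#define QUADS_PER_QQUAD 4   /* 4 quads = 16 lanes of the VPU */

extern void shade_qquad(const int quad_ids[], int count);

void pack_and_shade(const unsigned cover[], const int quad_ids[], int nquads)
{
    int pending[QUADS_PER_QQUAD], npending = 0;

    for (int q = 0; q < nquads; q++) {
        if (cover[q] == 0)
            continue;                       /* skip fully empty quads */
        pending[npending++] = quad_ids[q];  /* gather a non-empty quad */
        if (npending == QUADS_PER_QQUAD) {
            shade_qquad(pending, npending); /* fully packed: 16 live lanes */
            npending = 0;
        }
    }
    if (npending)                           /* flush the partial remainder */
        shade_qquad(pending, npending);
}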
 
Michael Abrash said:
For example, edge evaluations have to be done with 48 bits in the worst case. In those cases, being software, we have to use 64 bits because there is no 48-bit integer support in Larrabee. However, we don't have to do that at all for the 90+% of all triangles that fit in a 128×128 bounding box, because in those cases 32 bits is enough.

Maybe someone can shed light on whether this would also affect multisampling, since the render targets have to be sampled at a much finer granularity, in which case 32 bits might not prove enough. Or would the max tile size just have to be reduced by a factor according to the MSAA level?
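
As a rough illustration of the precision split Abrash describes (my sketch, assuming fixed-point snapped coordinates; not Intel's code), the general path uses 64-bit math while small-bounding-box triangles can stay in 32 bits:

Code:
#include <stdint.h>

/* Half-space edge function: the sign tells which side of edge
   (x0,y0)-(x1,y1) the sample (x,y) lies on.  Full-screen coordinate
   ranges can need up to 48 bits of intermediate precision, hence the
   64-bit path. */
int64_t edge_eval_64(int64_t x, int64_t y,
                     int64_t x0, int64_t y0, int64_t x1, int64_t y1)
{
    return (x - x0) * (y1 - y0) - (y - y0) * (x1 - x0);
}

/* Safe only when the triangle's bounding box is small (e.g. 128x128),
   so the deltas are bounded and the products fit in 32 bits. */
int32_t edge_eval_32(int32_t x, int32_t y,
                     int32_t x0, int32_t y0, int32_t x1, int32_t y1)
{
    return (x - x0) * (y1 - y0) - (y - y0) * (x1 - x0);
}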
 
I thought Jawed was still referring to the rasterization process when he said "shading", since the 80 pixels in figure 29 obviously won't all be sent on to the pixel shading stage regardless of the rasterization algorithm employed. There should only be 7 quads, or 2 qquads, sent to the pixel shading stage for this triangle.
 