Larrabee at Siggraph

That was my point. Cell has a contiguous area of the die devoted to the ring bus and its logic. I'm not privy to the details of the design, but I have a hard time accepting it is wholly made up of repeater blocks.
It's a centrally managed bus, so it's definitely more than repeaters.

http://www.ibm.com/developerworks/power/library/pa-expert9/

The dominant factor there is area, and latency roughly scales with the square root of the physical area of the cache.
If the SRAMs shrank, we'd expect better latency.
If the cache capacity were expanded to give roughly equivalent area, we'd have the same latency with more capacity.
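Those two cases fall straight out of a toy model in which the wire-delay component of access time scales with the cache's linear dimension, i.e. sqrt(area). This is a minimal sketch of that assumption; the 16 mm^2 figure and the constant k are purely illustrative, not real cache parameters:

```python
import math

def wire_delay(area_mm2, k=1.0):
    """Toy model: the wire-delay component of cache access time
    scales with the cache's linear dimension, i.e. sqrt(area).
    k is an arbitrary technology constant, purely illustrative."""
    return k * math.sqrt(area_mm2)

base = wire_delay(16.0)          # hypothetical 16 mm^2 cache
shrunk = wire_delay(16.0 * 0.5)  # same capacity, SRAM cells shrunk to half the area
iso_area = wire_delay(16.0)      # capacity grown to fill the original area

print(base / shrunk)     # shrinking buys ~sqrt(2), i.e. ~1.41x lower latency
print(iso_area == base)  # iso-area capacity expansion leaves latency unchanged
```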
Did Core 2's L2 latency improve 65nm->45nm?

http://www.extremetech.com/article2/0,2845,2208245,00.asp

Steve Fischer said:
The latency for accessing the L2 cache increased by 1 core clock cycle (from 14 to 15 clocks) due to the increase in size.

Though those L2s are so big compared with what we're talking about for Larrabee (or what's in Nehalem). Nehalem's 256KB L2 is only 2 cycles faster than Conroe's 4MB L2, which isn't much of an improvement considering it's 1/16th the size and on a smaller process. Obviously it's fiddly comparing these, as other parameters have been adjusted at the same time.
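To sanity-check the sqrt rule against those numbers (the cycle counts are the ones quoted above; treating capacity as a proxy for area and ignoring the process difference are my assumptions):

```python
import math

# Cycle counts quoted in the thread: Conroe's 4MB L2 at 14 cycles,
# Nehalem's 256KB L2 two cycles faster.
conroe_kb, conroe_cycles = 4096, 14
nehalem_kb, nehalem_cycles = 256, 14 - 2

# If latency were pure wire delay scaling with sqrt(area), and area
# scaled with capacity (ignoring the process shrink), a 1/16th-capacity
# cache would be about 4x faster:
pure_wire_prediction = conroe_cycles * math.sqrt(nehalem_kb / conroe_kb)
print(pure_wire_prediction)  # 3.5 cycles, far below the observed 12,
                             # suggesting fixed costs dominate at this size
```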

It might require the redesign or rerouting of all the logic it flies over, possibly at the expense of poorer density in logic that already scales worse than SRAM.
Depending on how large an L2 tile is compared to its directly linked compute core, the penalty may be worse if the logic expands.

The SRAMs might not require too many additional layers for their signalling; the more complex logic of the cores might have uses for interconnect at the altitude of the ring bus, plus whatever margin of safety is needed to keep the two layers from interfering with one another.
Agreed with all that. I just suspect that whether interconnects fly over non-interconnect logic isn't a binary design decision.

If you look at the Cell die shot, the 2MB of SPE LS covers considerably more area than the EIB; I reckon the EIB is about 17% of the area of that 2MB of memory.

This could indicate that the interconnects consume a tiny proportion of the area of L2 in Larrabee, particularly as Larrabee's ring bus almost has "no protocol" and so has little control logic associated with it.

But, obviously, we can't see the ring interconnect fabric itself on Cell, so who knows, maybe it covers 8x the area of the EIB logic.

Overall it seems the scaling question isn't a big deal. Famous last words.

Jawed
 
Jawed said:
Did Core 2's L2 latency improve 65nm->45nm?

Though those L2s are so big compared with what we're talking about for Larrabee (or what's in Nehalem). Nehalem's 256KB L2 is only 2 cycles faster than Conroe's 4MB L2, which isn't much of an improvement considering it's 1/16th the size and on a smaller process. Obviously it's fiddly comparing these, as other parameters have been adjusted at the same time.

I admit the rule of thumb is very simplistic, correlating the cross section of a cache with its response time, neglecting fixed costs of implementation, tag checking, and access order. The base assumption is that when all else is equal, the time it takes for a signal to reach the core from the furthest part of the cache sets a floor value for the cache's response time.

Penryn's cache is 25% smaller in area, but it is not correspondingly faster.

There are a number of possible reasons: Penryn's cache is nearly as long as Conroe's; its sleep transistors add latency; the fraction of the access time taken up by the L1 miss and the L2 tag checks is not reduced; associativity was increased by 50%; and Penryn's pipeline targeted a higher clock rate.
Even if the wall-clock time for signals crossing the cache were reduced, the other fixed costs would not go away and the differing cycle target only roughly maps to wall-clock time.
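A crude way to see this is to split the rule of thumb into a fixed term plus a wire term. The constants below are invented for illustration (chosen so the baseline lands on Conroe's 14 cycles), not measurements:

```python
import math

def access_cycles(rel_area, fixed=8.0, wire=6.0):
    """Toy decomposition: L2 latency = fixed costs (L1 miss
    detection, tag checks, access ordering) + a wire-delay term
    proportional to sqrt(relative area). The fixed/wire split is
    invented for illustration only."""
    return fixed + wire * math.sqrt(rel_area)

conroe = access_cycles(1.0)   # baseline comes out at 14.0
penryn = access_cycles(0.75)  # 25% smaller area

# The wire term shrinks by only sqrt(0.75), about 13%, and the fixed
# term doesn't move at all, so total latency barely improves. That is
# before even counting sleep transistors, the 50% associativity
# increase, or the higher clock target.
print(conroe, penryn)
```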

Larrabee's more modest clocks might give it more slack to play with when it comes to fiddling with cache capacity.
 
How significant are Intel's claims about off-chip bandwidth? Do they have any real bearing on GPGPU applications, or just on computer graphics, and more importantly on console applications, which are limited to much smaller buses than their PC counterparts?
 
Up until now everyone, including me, has assumed that the on-package IGP on low-end Nehalems will be based on the current IGP architecture, until I randomly stumbled over this tidbit from April 2007:

# The integrated graphics will be DirectX 10 and offer GPGPU functions if the software used is able to address it -- Intel is currently working to make the software available.
# Intel is basing the graphics core on a derivative of Intel Architecture (IA) that it uses on its CPUs since the general purpose processing now done on a unified shader GPU core is very similar to that of a CPU. Intel says it’s capable of not only DX10 but OpenGL and GPGPU as well, although performance information isn’t yet available, but expect it to be still quite basic compared to a discrete solution.
http://www.bit-tech.net/news/2007/04/17/further_details_on_nehalem_idf_spring_2007/1

Eh? Obviously it'll be a Larrabee derivative of some sort; they're not going to design two x86 graphics architectures, are they? So, one 16-wide Larrabee core then?
 