Larrabee delayed to 2011?

Discussion in 'Architecture and Products' started by rpg.314, Sep 22, 2009.

  1. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    I know what the demo was too, SGEMM ... how is that relevant?
     
  2. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    What I'm saying is that I know what the demo hardware was. It wasn't an 80 core R&D experiment.

    DK
     
  3. keritto

    Newcomer

    Joined:
    Apr 3, 2009
    Messages:
    143
    Likes Received:
    0
I believe you when you say it isn't smart or easy to mess with the engine itself, but it also isn't smart to just push transistor counts up without looking at what could be done inside a core. And ATi knows it, with the recent push to bring a native tessellation engine into the core. We all saw how building around an old chip design ends up: a several-times-renamed G300 performer now known as :lol: Fermi.

ATi has pretty much exhausted the original idea behind R600, with the once-rudimentary tessellator (which never really lived on R600) finally upgraded into a native tessellation engine for DX11. So for further advancement beyond FMA they should widen their SIMD core to 256-bit (288), and 4 S-units + 4 T-units could handle 4x 32b FP + 4x 32b INT, similar to nV. A reduction to only 4 SPs per SIMD (with only 2 T-units, "SFU") doesn't seem appealing. Btw, all developers need upgrades from time to time; I know R600 is still rather young and hasn't paid off yet, but I believe that widening the 160-bit bus to 288-bit wouldn't hurt them much. Except for the well-known hard-to-develop-a-proper-compiler problems AMD had when it first came out. But at least they'd have some development time until Larrabee is out.

Maybe the simpler 5D idea is better, but then again it also needs a wider 200-bit bus instead of today's 168-bit, and maybe 5 T-units is utilization overkill; perhaps my 4T idea, or the original R800 idea with 2 T-units per SP, is just the sweet spot. I just go along with the nV idea of having the same number of INT- and FP-capable units :oops:


So you think the HPC market battle is not enough :mrgreen: I just hope they won't kill nVidia on that ground and later acquire it. I don't recall whether they still claim that LB will support VT-x? And I still don't see any new Itaniums on the horizon, looking back a few years. So, as I already stated somewhere, LB could be the next great engine itself. They used to bring the good Pentium/D stuff over to Itanium 2, but since Core 2 every part of that work has vanished from public view.


Yep, but I find it hard to believe that a delayed Fusion at 32nm will have only basic DX10 support in 2011 :mrgreen: I'd bet on DX11 based on today's R800 (not R600 as AMD originally claimed when they promised Fusion chips for the beginning of 2009, never mind that they're basically the same), with some 800 SPs like the RV840/Juniper XT has, just on 32nm; and based on 32nm high-k it could consume only 40W at the same clocks as today's JXT (850MHz). And we might see some DX11.1 (11+) compliance when R900 (+FMA) arrives, probably next year, not too long after Fermi's introduction. If AMD is really AssetSmart. They have the knowledge, provided the R&D on Cypress pays off quickly enough, because they're in no rush considering roughly equal performance for their HD5950 vs. its Fermi counterparts.

So I'd say AMD, if they're smart enough, have just the right system requirements with a 4-core Bulldozer and some 800 SPs by its side on 32nm. The other question is whether programmers will avoid them as they usually did when AMD first introduced 3DNow! (2 years before SSE in the P3) and x86-64 (in reality 3 years ... before a real 64-bit part came as Conroe, aka Core 2).
     
  4. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    "This Puppy I have here, this is Larrabee. Oh - wait. No. It's Polaris reloaded, which also had 80 cores." *SCNR*
     
  5. keritto

    Newcomer

    Joined:
    Apr 3, 2009
    Messages:
    143
    Likes Received:
    0
Aren't voxelized objects worse when you need to render their interactions, while with simple pixelized objects we have simple, well-developed techniques to easily add a real-life look to their interactions? Aren't voxels just a quick patch for rendering complex 3D objects on hardware that is really doing it in 2D space?

And Larrabee is running cold as polar ice and not wasting any energy? :mrgreen: It seems not only hype about proposed LB performance is flying around, but also hype about how green LB's performance per watt is. Is that why the estimated LB TDP was 300W, without it being clearly stated whether that's for the 45nm revisions or for older 65nm ones, if any existed? And is that excellent TDP the reason Intel is waiting for a straight-to-32nm Larrabee release while cleaning out some performance bugs?


And the real reason for it is ... G80's quasi-DX10 support, or rather that they supported it as far as they could on the pretty advanced and still relatively new G70 architecture (which was 9.0c, btw). So the lack of features, and all that mumbling about DX10.1 numbers (missing from nV), meant that not everything is in the numbers themselves, as Charlie tries to remind us, but in nVidia's ability to release their G80 chip 6 months before the competition and blackmail MegaCorps into kicking what was supposed to be real DX10 support out of their future Vista release, putting in something that only goes by the name DX10 in the much-anticipated, badly performing future MegaCorps OS.

So in fact we were deprived of all that DX10 had to offer on the R600 architecture in favor of nV's DX10 deal, which was nothing more than eye-candied DX9.0c, as you state. In fact we should have gotten something much like today's DX11, only 2.5 years earlier. And DX11, well, it did finally bring a new tessellation engine after all.


You couldn't just pass up pointing out how nothing really new is there in DX11 :grin:

But you forgot to emphasize that even on DX10 hardware you can program tessellation through Compute Shaders on ATi-based cards, and DX10, as I mentioned in the reply above, was exactly what should have introduced the new hardware tessellation capability itself. Meanwhile DX11 and the R800 series brought some texture compression improvements, with no performance drop for the old compression methods. And tessellation itself heavily reduces the memory bandwidth needed to produce the same rasterization setup. So I'd say these are pretty big things ATi introduced in its DX11 engine. I see you're still looking for a good reason to tell your friends to upgrade to DX11 :D


It's not nice to see how people wear polarized glasses for different weather. The reason DX11 was so lightweight for ATi to implement is that rudimentary tessellation had pre-existed on the ATi VLIW engine since their first DX10 chip (R600), so all the basic tessellation setup was there; all they needed was extra R&D to improve it, implement it in Compute Shader algorithms, and get it supported by Microsoft in DX11. It's ATi's architectural design, maybe even too much in-situ, and that couldn't be provided by the DX11 API from MegaCorps itself or any third party.

I just hope the same VLIW engine can carry the additional capability to cope with the next thing on the horizon, FMA. I hope ATi won't turn a lack of FMA into a marketing war, as Huang has been doing for the last year and a half amid constant R&D troubles with GT200 and now G300. And not to forget, all that indirection capability of G300 (Fermi, renamed after Larrabee->Cypress) doesn't look extremely appealing unless it's as inexpensive in silicon and research as further tessellation or compression advancements are.
     
  6. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
Why do you keep harping on FMA? Single-cycle-throughput FMA doesn't matter for FLOPS, because it doesn't go any faster than the single-cycle-throughput MAD that GPUs have been using forever ... also, ATI has said R800 supports FMA anyway.
     
  7. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    I've tried to look for a single true sentence in keritto's post, but I just can't find one. Can someone help me out?
     
  8. PeterT

    Regular

    Joined:
    May 14, 2002
    Messages:
    702
    Likes Received:
    14
    Location:
    Austria
    You're ahead of me, I can't even find an intelligible sentence.
     
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
Larrabee at ISSCC 2010?

    Apologies if this has been posted before.

    From the ISSCC 2010 program:

    5.7 A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS

    567mm^2 in 45nm, 125W TDP.

    It says 48 cores in the headline, but in the text it is said to be organized as a 6x4 mesh. What gives?

    Cheers
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
No, it is unrelated to Larrabee.
     
  11. Groo The Wanderer

    Regular

    Joined:
    Jan 23, 2007
    Messages:
    334
    Likes Received:
    2
    It is almost assuredly Polaris II, a research project into multi-cores.

    -Charlie
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
  13. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
  14. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    I like it, pity it's too late for Larrabee.
     
  15. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
I never liked full hw cache coherency, especially when you have O(50) cores on a die. For >100 cores, forget it. Shared mutable state is the bane of parallel sw, so why keep support for it in hw?
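A toy software sketch of the alternative being argued for here: each core owns its data outright and communicates only by messages, so correctness never depends on a coherency protocol. Goroutines and channels stand in for cores and an on-die mesh (the worker/routing names are mine, purely illustrative):

```go
package main

import "fmt"

func main() {
	const cores = 4
	in := make([]chan int, cores)
	for i := range in {
		in[i] = make(chan int, 1) // per-core inbox, the only transport
	}
	done := make(chan int)

	// Each "core" accumulates into private state; nothing is shared.
	for i := 0; i < cores; i++ {
		go func(id int) {
			sum := 0
			for v := range in[id] {
				sum += v // owned exclusively by this goroutine
			}
			done <- sum
		}(i)
	}

	// Route work by message, not by writes to shared memory.
	for v := 0; v < 8; v++ {
		in[v%cores] <- v
	}
	for i := range in {
		close(in[i])
	}

	total := 0
	for i := 0; i < cores; i++ {
		total += <-done
	}
	fmt.Println(total) // prints 28 (0+1+...+7)
}
```

No shared mutable state exists here, so there is nothing for hardware coherency to track; the trade-off is that any genuinely shared structure must be partitioned or copied explicitly.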
     
  16. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
2 IA cores share each mesh node/routing logic, so the 6x4 mesh gives 6 x 4 x 2 = 48 cores; see Intel's press release material.
     
  17. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,563
    Likes Received:
    171
    Location:
    In the Island of Sodor, where the steam trains lie
    So, the Transputer/Occam model of 20 years ago was probably the correct one.
     
  18. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
Because in certain cases it is fantastically useful: think big shared data structures where reads/queries vastly outnumber updates.

The problem, of course, is when cache coherency is used as the sole mode of data transport between cores; many-core systems will choke on coherency probes and invalidate broadcasts.

    We need both, IMHO.

    Cheers
     
  19. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
Has anybody noticed that Larrabee is no longer branded as a GPU, but as a
"computational co-processor for the Intel Xeon and Core families"?
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Sigh, if only...

    I wonder how many transputer cores you could fit in 3 billion transistors :razz:

    Jawed
     