Tegra 3 officially announced; in tablets by August, smartphones by Christmas

Good point. Actually that table is at 1.1v despite the nominal voltage for both 28nm-SLP at GF and 28nm-HPL at TSMC being 1.0v, but then again 40G nominal voltage is 0.9v and Tegra 2 needs 1.0v for 1GHz iirc. And OMAP3630 runs at 1.26v on a 1.1v process (OMAP4 can run even higher).

It would be a bit weird for NVIDIA to go from 1.0v to 1.1v but then again maybe Kal-El is already higher than 1.0v and what matters in the end is total power... Talking of total power, is it just me or does that Global Foundries table not make any sense? Clock frequency multiplied by Dynamic Power per MHz, added to Static Power, is not at all the same number as Total Power. I assume they just copy-pasted the wrong numbers somewhere?
GF specifically mentions "overdrive", i.e. running the chip at 10% higher voltage, as a feature, as they claim TSMC's gate-last process has problems with that.

And you are completely right about the coherency of the numbers. They don't make much sense as written there. Definitely, some information is missing.
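
The sanity check described above (frequency times dynamic power per MHz, plus static power, compared against the quoted total) can be sketched in a few lines of Python. The function names, the milliwatt units and the 5% tolerance are assumptions made for illustration, and the sample figures are made up; they are not taken from the Global Foundries table.

```python
import math

def expected_total_power_mw(freq_mhz, dynamic_mw_per_mhz, static_mw):
    """Total power implied by a row: f * (dynamic power per MHz) + static power."""
    return freq_mhz * dynamic_mw_per_mhz + static_mw

def row_is_consistent(freq_mhz, dynamic_mw_per_mhz, static_mw,
                      reported_total_mw, rel_tol=0.05):
    """True if the reported total is within rel_tol of the implied total.

    rel_tol is an arbitrary illustrative tolerance, not a published figure.
    """
    implied = expected_total_power_mw(freq_mhz, dynamic_mw_per_mhz, static_mw)
    return math.isclose(implied, reported_total_mw, rel_tol=rel_tol)

# Hypothetical row: 1000 MHz, 0.5 mW/MHz dynamic, 100 mW static.
print(expected_total_power_mw(1000, 0.5, 100))
print(row_is_consistent(1000, 0.5, 100, reported_total_mw=900))
```

A row that fails this check suggests either a copy-paste error or that the columns were measured under different conditions (voltage, temperature, activity factor) than the table states.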
 
As per recent reports, the August date has slid to September for tablets. I'm guessing the Xoom 2 will be among the first Kal El tablets. Asus is also working on a successor to the Transformer in time for a Q4 release.

However there have been no leaks about anyone working on Kal El smartphones so far (unless I've missed them entirely). Possibly MSM8960 is close enough that a Kal El smartphone doesn't make sense, so let's see what happens.
 

40nm quad-A9 with NEON plus a 50% size increase for the GPU over Tegra 2. Honestly, I would expect them to make a slimmer version for smartphones, possibly dual-core.
 

It's still quite a bit smaller than A5. On that note, do you think Apple will make a smaller variation of A5 for phones?
 
The A5 would fit fine the way it is (actually, I suspect part of the reason the cores were implemented that large was to accommodate the heat/power balance of its eventual use in the phone); Exynos in Galaxy S2 phones is almost the same size.

If/when Apple changes fab partners, that would be a good opportunity to revisit the implementation, though.
 
40nm quad-A9 with NEON plus a 50% size increase for the GPU over Tegra 2. Honestly, I would expect them to make a slimmer version for smartphones, possibly dual-core.

Well the die size is still only ~80mm2. As stated A5/Exynos are quite a bit larger.

Or did you mean lower power consumption when you meant slimmer? I doubt they'd be developing another dual core 40nm chip, maybe they'll sell Kal El with two cores disabled for smartphones (but there will still be loss due to leakage right?)
 
It's still quite a bit smaller than A5. On that note, do you think Apple will make a smaller variation of A5 for phones?

I've been wondering that. Apple has the luxury to eat the cost of a large die area but individual chip vendors don't. Mayhaps nVidia will try to push Kal-El as is (with PoP memory) but I think it'd be a smarter move to go dual core.
 

If they power gate each core, which just about everyone does, it shouldn't be worse for power. But I was referring to the chip size. I know Apple pulled it off with tablets and Samsung with their Galaxy S2. Still, it feels like a waste of area to have a quad.
 
I think wasting <2.5mm² on a >80mm² SoC isn't really a big deal - it's certainly not worth another tape-out. The only real problem with a quad-core is peak power and it remains to be seen how they deal with that in smartphones. I'm sure they'll have a dual-core SKU based on the same chip as well but I have no idea how prevalent it will be.
 

How in the world do you get < 2.5mm^2? The Tegra 2 floorplan makes the two A9 cores combined look at least 5mm^2, and that's without NEON. I can't really estimate NEON's impact given no other die shots of Cortex-A9s at TSMC's very dense 40nm process, but on the old Falcon floorplan it takes roughly 33% of core. But this presumably combines VFP, and I'd like to think that at least some of the execution resources between the two are shared. So let's say that VFP-alone is half that size. At 5mm^2 that'd make the NEON contribute an extra 0.83mm^2, or in two over 1.67mm^2. (in reality I expect NEON + VFP to be more than 2x the size of VFP alone, but I don't really know, maybe someone else has some numbers)

I'd expect the whole contribution to be around ~6-8mm^2 or so.

In the mobile space it seems more typical to let last-generation's parts be the lower tier rather than doing lower-end dies or even SKUs. But I can see how nVidia might want to encourage replacing the NEON-less Tegra 2.
 
How in the world do you get < 2.5mm^2?
[...]
I'd expect the whole contribution to be around ~6-8mm^2 or so.
Looking again now and my number was definitely too low, but yours is definitely too high. If you look only at the part that is copy-pasted two times on the die shot, that only takes 2.75mm². Add NEON and some extra overhead on the other parts and I guess you'd need about 5mm².
 

Okay, I think I must be counting the L2 tags or something as part of the Cortex-A9 area. So looking at two smaller rectangles of 129x129 + 112x93 over a whole die area of 580*585 pixels I get 8%, or about 3.9mm^2. I must not be seeing what area you're using for the A9s. Should I not be counting the area directly underneath the left side of the L2 cache and directly to the right of the right A9 core? If that's not part of the A9s do you know what it belongs to?
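
For anyone who wants to redo the pixel arithmetic above, here is a minimal sketch. The rectangle and die dimensions are the ones quoted in the post; the 49mm^2 die area is an assumption, chosen because it is what the quoted 8% and ~3.9mm^2 figures imply together. The function name is hypothetical.

```python
def block_area_mm2(blocks_px, die_px, die_area_mm2):
    """Estimate the silicon area of some blocks from die-shot pixel sizes.

    blocks_px: list of (width, height) rectangles in pixels.
    die_px: (width, height) of the whole die in pixels.
    die_area_mm2: total die area; assumed here, not measured.
    Returns (fraction_of_die, area_in_mm2).
    """
    block_pixels = sum(w * h for w, h in blocks_px)
    die_pixels = die_px[0] * die_px[1]
    fraction = block_pixels / die_pixels
    return fraction, fraction * die_area_mm2

fraction, area = block_area_mm2(
    blocks_px=[(129, 129), (112, 93)],  # the two rectangles from the post
    die_px=(580, 585),
    die_area_mm2=49.0,                  # assumption implied by 8% ~ 3.9mm^2
)
print(f"{fraction:.1%} of the die, about {area:.1f} mm^2")
```

The estimate is only as good as the rectangle boundaries, which is exactly what the disagreement in this thread is about: including or excluding the strip next to the L2 changes the result by a sizeable amount.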
 
How in the world do you get < 2.5mm^2? The Tegra 2 floorplan makes the two A9 cores combined look at least 5mm^2, and that's without NEON. I can't really estimate NEON's impact given no other die shots of Cortex-A9s at TSMC's very dense 40nm process, but on the old Falcon floorplan it takes roughly 33% of core. But this presumably combines VFP, and I'd like to think that at least some of the execution resources between the two are shared. So let's say that VFP-alone is half that size. At 5mm^2 that'd make the NEON contribute an extra 0.83mm^2, or in two over 1.67mm^2. (in reality I expect NEON + VFP to be more than 2x the size of VFP alone, but I don't really know, maybe someone else has some numbers)

Depends on the implementation. The biggest area-hog would be the DP MAC pipeline and that can be reused (somewhat) for SP quad MAC. Given the modular nature of the A9's NEON, I would wager the NEON arithmetic pipes are not shared. Integer multiply and quad SP MAC would probably equal a DP MAC, if not exceed it.

In the mobile space it seems more typical to let last-generation's parts be the lower tier rather than doing lower-end dies or even SKUs. But I can see how nVidia might want to encourage replacing the NEON-less Tegra 2.

I'm not sure how true this is. I suppose the market created by high-end smartphones has brought a PC-like "hand-me-down" effect. But traditionally, each generation of chips was brought out with various "tiers". For instance, in the future, there will be MSM7227A, MSM8x30 and MSM8x60, 8x70, etc. all for different markets and most of them based on Krait (with the exception of 7227A, Krait's too large for that).

This kinda makes sense when you consider the sheer volume and how creating a new chip on a smaller process node will make sense even for the low-end (or perhaps especially for the low-end) for profitability.

Should low and mid-tier smartphones begin to adopt the 6-month upgrade cycle that PCs have, however, we would likely see that model change to have the older chips serve as the mid and low tier.
 
Should I not be counting the area directly underneath the left side of the L2 cache and directly to the right of the right A9 core? If that's not part of the A9s do you know what it belongs to?
My assumption is that's part of the L2 cache interface which obviously wouldn't double in die size from dual-core to quad-core. Hmm, I suppose the test/debug blocks could be somewhere in there too, and that would presumably scale? Either way I don't think you could get up to 6mm².
 
Depends on the implementation. The biggest area-hog would be the DP MAC pipeline and that can be reused (somewhat) for SP quad MAC. Given the modular nature of the A9's NEON, I would wager the NEON arithmetic pipes are not shared. Integer multiply and quad SP MAC would probably equal a DP MAC, if not exceed it.

There's no quad SP MAC, just dual FPMUL and FPADD pipelines which can be chained. Double precision is chained as well, of course.

For integer there are 8 8x16bit MACs, and I don't think that they're shared with the floating point multipliers, or at least I can't think of a very good way to do this since you'd need to use 6 of those just to get one requisite 24x32 out of it.

Double precision VMUL actually has a throughput of two cycles on Cortex-A9 so I'd imagine the multiply part is split as two trips through two single-precision multipliers which have been extended from 23x23 to 27x27. It seemed to me that the modular approach would discourage this as well, but at the same time NEON always comes with VFP, while VFP itself is not required in any form for A9. You'd think if it were 100% separate they'd offer them separately, with maybe an A8-like VFP-lite option as well. Instead my guess is that if you get NEON you get VFP practically for free, while the opposite isn't nearly as true due to all of the integer stuff NEON needs.
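
The "6 of those" figure above follows from tiling the wide product with narrow multipliers. A rough sketch, assuming a plain rectangular decomposition with no partial-product tricks (the function name and the tiling model are illustrative assumptions, not a description of how ARM actually built the unit):

```python
def multipliers_needed(a_bits, b_bits, tile_a_bits=8, tile_b_bits=16):
    """Count tile_a x tile_b multipliers needed to cover an a x b product.

    Uses ceiling division on each operand; ignores the adders needed to
    sum the partial products.
    """
    tiles_a = -(-a_bits // tile_a_bits)  # ceiling division
    tiles_b = -(-b_bits // tile_b_bits)
    return tiles_a * tiles_b

# 24-bit operand split over the 8-bit side, 32-bit operand over the 16-bit side.
print(multipliers_needed(24, 32))  # 3 * 2 tiles
```

Swapping the operands (32 bits on the 8-bit side) needs more tiles under the same model, so the count of six is the better of the two orientations.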

My assumption is that's part of the L2 cache interface which obviously wouldn't double in die size from dual-core to quad-core. Hmm, I suppose the test/debug blocks could be somewhere in there too, and that would presumably scale? Either way I don't think you could get up to 6mm².

Okay, maybe we should agree on 4-5mm^2.
 
The VFP that comes with A9 with NEON is a separate unit, and is the same as the one that comes with A9 without NEON. And the NEON unit is the same as the A8 one (except for the memory interface). There's no functional unit sharing.
 

Thanks, that's good to know. Makes me wonder why ARM only sells the NEON unit with VFP. Can they really be 100% separate, with the shared register file?

I guess this would make it stand to reason that the NEON unit is over twice the size of VFP, since it includes VFP and the vector part is surely larger.
 
The VFP that comes with A9 with NEON is a separate unit, and is the same as the one that comes with A9 without NEON. And the NEON unit is the same as the A8 one (except for the memory interface). There's no functional unit sharing.
Hmm, but the RF has to be shared, doesn't it? Not that it's that big a part mind you.
 
Makes me wonder why ARM only sells the NEON unit with VFP. Can they really be 100% separate, with the shared register file?
FP performance is one of the largest improvements of A9 over A8, so not having it would be a shame. As far as the register file goes, it's D16 for VFP alone while it's D32 when you have VFP+NEON, which gives a hint about sharing I guess ;)

I guess this would make it stand to reason that the NEON unit is over twice the size of VFP, since it includes VFP and the vector part is surely larger.
Hmm, NEON doesn't have VFP per se, it supports a subset of IEEE-754 FP and that's it.
 