So uhh, I haven't replied in a while and I see a lot of very good arguments and at least as many incredibly dubious ones (I'd say probably a bit more, but that wouldn't be nice) - anyway, this took quite some time to write and I don't expect anyone to read it, but here goes!
BTW, for what it's worth once again: I don't know for sure whether this Tegra2 chip has NEON or not. I'm pretty sure it doesn't, but what I heard from multiple people might theoretically only have applied to the other chip (I'm not sure about the codenames, but I assume AP20 vs T20).
What is your estimate of the area of NEON (for a single core, ofc) at 40nm?
0.45-0.55mm² depending on the target frequency etc., going by a recent ARM presentation with an Osprey floorplan that I found a few weeks ago - and yes, that's not very big; it seems to have gotten strangely more efficient since the A8, where I've seen figures as high as >3mm² on 65nm. We're literally talking 2mm² on 40nm for a full implementation versus 9mm² for the A8 in OMAP3 on what is, frankly, a surprisingly low-density 65nm process (but still!)
I really need to poke around a bit to figure out how such a massive gap is possible. Maybe the new NEON, unlike the old one, has far fewer truly separate paths (i.e. multi-purpose ALUs versus many specialized ones), which on a low-leakage process would make it a bit less power efficient, but given the huge area savings that would clearly be worth the cost.
Isn't 256K of cache for an embedded GPU too much, especially one that is an IMR? AFAIK, even beefy desktop GPUs don't have that much cache to go around.
It's not a cache per se AFAIK; think of it more as a buffer. I've always speculated that it's dual-mode: it either owns Hier-Z info for a tile or, if the compression ratio is good enough, the entire Z data, which would allow the chip to not even *write* it to external memory. That's just how I'd use such a buffer, which probably means it doesn't work that way at all, though.
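Just to put the 256K figure in perspective, here's a quick back-of-the-envelope sketch; the WVGA resolution and 32-bit Z/stencil are purely my own assumptions for illustration, not anything confirmed about the chip:

```c
#include <stdio.h>

int main(void)
{
    /* Purely hypothetical numbers: WVGA screen, 32-bit Z/stencil per pixel. */
    const double z_bytes   = 800.0 * 480.0 * 4.0;  /* raw Z data, ~1.5 MB     */
    const double buf_bytes = 256.0 * 1024.0;       /* the 256K on-chip buffer */

    printf("Raw Z data:          %.2f MB\n", z_bytes / (1024.0 * 1024.0));
    printf("On-chip buffer:      %.2f MB\n", buf_bytes / (1024.0 * 1024.0));
    printf("Compression to fit: ~%.1f:1\n", z_bytes / buf_bytes);
    return 0;
}
```

So under those assumptions you'd need roughly a 6:1 Z compression ratio for the whole thing to stay on-chip; anything less and you'd presumably fall back to the per-tile Hier-Z mode.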
Why do you think that? 32-bit LPDDR2-667 should offer 2.7GB/s of bandwidth. That might be "enough" for the CPU, but it looks to me like it could be pretty limiting for a (non-TBDR, at least) renderer. I think the last NVIDIA GPUs with such low memory bandwidth on the desktop were about 5 years ago (GeForce 6100/6200), and the versions which only had that little memory bandwidth (the normal ones had about twice as much) were very limited by the lack of it.
There is still this bizarre misconception that Tim Sweeney's claims are anything more than marketing... Apparently this is a 240MHz 2xTMU chip; that's less raw texel rate than a GeForce2 MX! (2x2 TMUs @ 175MHz) - which also had 2.7GB/s of bandwidth for the top SKU, but with no framebuffer compression or other bandwidth-saving techniques.
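For anyone who wants to check the arithmetic behind both the bandwidth figure and the fillrate comparison, a quick sketch (clocks and TMU counts are simply the ones quoted in this thread):

```c
#include <stdio.h>

int main(void)
{
    /* Raw texel rate = TMUs * core clock, using the figures quoted above. */
    const double tegra2_mtex = 2.0 * 240.0;  /* 2 TMUs @ 240MHz   -> 480 Mtexels/s */
    const double gf2mx_mtex  = 4.0 * 175.0;  /* 2x2 TMUs @ 175MHz -> 700 Mtexels/s */

    /* 32-bit LPDDR2-667: 667 MT/s * 4 bytes per transfer. */
    const double lpddr2_gbps = 667e6 * 4.0 / 1e9;

    printf("Tegra2:            %.0f Mtexels/s\n", tegra2_mtex);
    printf("GeForce2 MX:       %.0f Mtexels/s\n", gf2mx_mtex);
    printf("32-bit LPDDR2-667: %.2f GB/s\n", lpddr2_gbps);
    return 0;
}
```

So ~480 versus ~700 Mtexels/s, on roughly the same ~2.7GB/s of memory bandwidth.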
Ailuros said:
Best in class in terms of what exactly? I'm not very good at estimating die area from die shots for the graphics part, but if my rather dumb estimate is close to reality and the GPU on T2 is roughly over 10mm² @ 40LP while it's still only 2PS/2VS/2TMUs @ 240MHz, then the perf/mm² isn't exactly what I'd call ideal.
Wow Ail, I guess you answered yourself there...
I really don't know how you got to 10mm² for the GPU part. There's very little to base any estimate on there except our own subjectivity. My own biases led me to three possible partitionings for the GPU, with die size going from 3.5 to 6mm² IIRC. 10mm² really seems like a massive stretch to me!
I suspect the less efficient blocks are video decode & encode, personally. Plus there's some weird I/O on the chip (e.g. the vertical bar on the left) whose purpose I can't figure out at all, and I'm sure some things like LVDS and IDE didn't scale from 65 to 40nm - both of which are on the chopping block for the smaller chip. We'll see how that one turns out area-wise, since it'll be the high-volume part for phones, in theory.
BTW, when is 28LP coming up? Late this year, mid next year, late next year....?
28LP/LPT has had advanced test chips taped out from lead partners (i.e. Qualcomm, NVIDIA, etc.) and should have real tape-outs in Q2 or Q3.
Ailuros said:
The iPhone 3GS is a smartphone; what you're seeing as Tegra2 for the moment is the highest-end variant, which isn't necessarily targeted at smartphones unless you intend to run around with a battery backpack for them.
Given that it's in some ways more smartphone-like than OMAP4 (e.g. 32-bit memory vs 64-bit), good to know that OMAP4 won't compete in that market either, then.
More seriously, it's obviously a flagship product for phones, which will attract various design wins but maybe not that much volume. The hope is probably that customers who have experience with their driver/software stack will decide to use the smaller chip in more mainstream models.
There's also the fact that baseband/connectivity/touchscreen/etc. costs have gone down quite a bit, so for a given price segment it's possible to afford a more expensive application processor. In terms of power consumption, there's little reason for this to take a lot more power for doing *the same task* (e.g. VGA video decoding) than a much smaller chip. Certainly the peak power would be high, but that's not unprecedented; OMAP3 HD video encode via DSP+A8, anyone?
unless of course you're willing to believe the fairy tales that Tegra3 might be a GF100 grandchild just as Tegra2 was supposed to be a GF9 grandchild. The reality check for the latter will tell you that T2 might not even reach GF6 capabilities in the end.
Only a complete fool would have believed that NV would throw an entire GPU architecture in the trash after a single generation; this would be unprecedented in the company's entire history, if not the industry's. We both know that's not a fair basis to judge things on...
from what's known so far, tegra2 does not seem to be any more OCL/CUDA friendly than tegra1 is, or, IOW, not friendly at all. combined with no SIMD, that would position tegra2 as one of the very few (the only?) pocket computing platforms without viable provision for running custom number-crunching software, whereas virtually all the competition has some means to do some heavy lifting, be that OCL, SIMD, proprietary vector engines or arbitrary combinations thereof. being the only one who can't do something which everybody else on the market can is not an advantageous position - there's no guarantee that tomorrow a killer number-crunching app for handhelds won't come out (say, hypothetically, a new AV codec). what will nv do then - come up with 'CUDA for tegra'? perhaps release a NEON-enabled SKU?
Okay, I think that's a very good point, and my understanding of the answer should also clear up a lot of other things.
NV's strategy is velocity, rapidly moving from generation to generation. That means they don't feel they need to be exceedingly future-proof; they can be just in time (no pun intended!) - so the answer (assuming once again that Tegra2 does not have NEON, which I'm pretty sure but not entirely certain is the case) is that they will have CUDA for their next-gen embedded GPU arch, probably for Tegra3. Obviously, given their background, they'd much rather push that to devs than NEON.
Of course there's the problem that design-win cycles are long and customers don't change their phones every month, so during the chip's lifecycle in users' hands it's likely to become a limitation. They probably believe it'll remain a niche thing and practically nobody will notice, so it doesn't matter; clearly many people here don't believe that. Both positions are a bit too extreme for my taste, but heh.
If you narrow it all down to media playback and web browsing, I don't think even a GPU was necessary after all.
I think even IMG would gladly set you straight on the idea that a GPU isn't useful for web browsing. Of course, if you literally meant 'necessary', I guess maybe you don't need more than an old microcontroller...
I wonder if SIMD can't be put to good use for various small graphics tasks for which sending commands to some 2d/3d hardware would be more expensive. That'd make Web browsing faster.
Intriguing; I don't know. My guess is it wouldn't save much given that the GPU is so close to the CPU on a SoC, especially on the newer ones (Tegra1 was the first) that have an L2 cache controller which allows you to bypass external memory when sending GPU commands. On SGX the context switch would be very cheap; on Tegra probably not so much, although I don't know the specifics. Either way, I doubt it'd be worth the software complexity & possible bugs.
the network stack
image decode
font drawing / antialiasing
and many standard C lib functions (memcpy, sort, search)
see
http://www.freevec.org/
I've seen a few times on B3D people who appear to be labouring under the assumption that vector units are only good for media processing. This is untrue; you can use them for all sorts of things.
Nice examples, cheers. The standard C lib functions are something I don't remember as often as I should! For image decode, I guess on the web that's mostly JPEG & PNG. The former has a dedicated block on any modern SoC, and the latter is mostly zlib I think; out of curiosity, is there actually any worthwhile SIMD implementation of that out there? Also, I'm curious what you mean by 'network stack' - I assume processing the IP packets themselves? The entire MAC for not only 3G but also WiFi (unlike on PCs) is done in the connectivity chips for handhelds/smartbooks, so I assume you mean IP; is that really still significant nowadays? (I'd assume not, but I'm genuinely ignorant.)
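On the C lib point, this is roughly the kind of thing I have in mind - a toy NEON-style copy loop (on a chip that actually has NEON, obviously), purely illustrative; the function name is made up, and a real SIMD memcpy would also deal with alignment, overlap rules, small-copy fast paths and so on:

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Toy example only: copy 'len' bytes 16 at a time using 128-bit NEON
   loads/stores, with a scalar loop for the leftover tail. */
static void neon_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);  /* load 16 bytes  */
        vst1q_u8(dst + i, v);              /* store 16 bytes */
    }
    for (; i < len; i++)                   /* remaining bytes */
        dst[i] = src[i];
}
```

The same pattern extends to things like memchr-style search or byte comparisons, which is presumably what the freevec stuff does in a much more polished way.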
Also photo stitching and pixman are two nice/worthwhile ideas, cheers Simon/Laurent.
What differentiates NVIDIA from Qualcomm or PowerVR is the software support I expect for the GPU side, i.e. what differentiates them in the PC market too.
The HTC thing is a fair bit more complicated than what even some industry insiders seem to believe; either way, hopefully Qualcomm GPU drivers will improve now that they've fully integrated the AMD Handheld GPU team.
Tegra 3 may turn out to have a single-precision 32-SP Fermi and dual or quad ARM cores (probably two variants). In essence, something similar to an AMD Fusion or dual-core Sandy Bridge, but in the low-power market segment.
32 SP Fermi clearly seems too optimistic to me, but we'll see!