NVidia Tegra ULP GeForce Speculation

Anyone have any speculation as to NVidia Tegra ULP GeForce architecture and how this will compare in power and performance to PowerVR SGX?
 
Ask me that on wednesday/thursday and hopefully I'll be able to answer a bit better than I can right now... :) I don't think it's possible to get truly objective *and* comparable power consumption data from handheld companies in practice, so for the sake of objectivity I don't think I'd ever want to comment on that though.

Here's what one of their presentation says architecture-wise:
Early-Z and fragment caching
- These are big computation and bandwidth savers
Ultra Efficient 5x Coverage Sampling Anti-Aliasing Scheme
- Mobile version of CSAA technology from GeForce
Not a tiling architecture
- Tiling works reasonably well for DX7-style content
- For DX9-style content the increased vertex and state traffic was a net loss
Not a unified architecture
- Unified hardware is a win for DX10 and compute
- For DX9-style graphics, however, non-unified is more efficient
And another page for performance:
Tegra APX can achieve:
- Over 40M triangles/sec
- Up to 600M pixels/sec
- Texture 240M pixels/sec
Run Quake 3 Arena
- 45+ fps WVGA (800 x 480)
- 8x Aniso Texture Filtering
- 5x Coverage Sampling AA
i.e. it's 2 TMUs @ 120MHz with single-cycle 5xCSAA (which is 2xMSAA with 3 extra coverage samples).
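A quick sanity check of that reading of the slide, as a rough sketch: the 2-TMU count and the "one 5-sample pixel per clock" interpretation come from the post above, not from the slide itself, and the variable names are made up for illustration.

```python
# Figures quoted from the slide.
TEXEL_RATE_PER_SEC = 240e6   # "Texture 240M pixels/sec"
SAMPLE_RATE_PER_SEC = 600e6  # "Up to 600M pixels/sec"

# Assumptions taken from the post, not confirmed by the slide.
ASSUMED_TMUS = 2             # texels filtered per clock
ASSUMED_CSAA_SAMPLES = 5     # coverage samples written per clock


def implied_clock_hz(rate_per_sec, units_per_clock):
    """Clock needed to reach a peak rate with a given per-clock throughput."""
    return rate_per_sec / units_per_clock


clock_from_texels = implied_clock_hz(TEXEL_RATE_PER_SEC, ASSUMED_TMUS)
clock_from_samples = implied_clock_hz(SAMPLE_RATE_PER_SEC, ASSUMED_CSAA_SAMPLES)

print(f"clock implied by texel rate:  {clock_from_texels / 1e6:.0f} MHz")
print(f"clock implied by sample rate: {clock_from_samples / 1e6:.0f} MHz")
```

Both figures land on the same clock under those assumptions, which is why the two numbers on the slide look consistent with a 120MHz part.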
 
Is that highest end variant or will there be another one?

Anyway 5xCSAA is just fine for the screen sizes it's aimed for. The most interesting part is the 8x AF bit.

Before anyone says it, games on mobile devices hardly ever enable AA or AF (I think the latest Q3A mobile version has an option for enabling AA though). I wonder if Tegra's TMUs are strong enough to handle AF or if it's just a bandwidth-constrained scenario, as Q3A on mobile devices typically is.

As for the critical comments in the first quote: I've heard better excuses in my lifetime than those LOL :D
 
Would 1x MSAA +4 Coverage-Samples save any significant portion of the ROPs - apart from being some kind of "lame"?
 
Is that highest end variant or will there be another one?
That's the APX 2500, but it looks like the SKUs for the chip are being shuffled around a little bit (type 'APX 2600' in Google, look at the cached entry, and go to 'Specifications' to see what I mean) and I have no idea what the clock speeds for all of them will turn out to be, especially not for the 3D part.

Chip-wise, this will likely be the highest-end 65nm chip before they go to 40nm. There was a lower-end chip in the pipeline according to what I heard several times, but I don't know what's happening to that. If it still exists, you'd expect it to have started sampling some time ago.

As for 1xMSAA + 4xCSAA, I don't think that makes theoretical sense given how CSAA works... :) The CSAA samples need to be able to choose between at least two colours, unless you're thinking of doing some weird semi-but-not-really-Quincunx stuff that uses the data of adjacent pixels? That'd seem very complex to me, and I hope nobody ever gets the idea of doing something so crazy...
 
The AA and AF seem a little insane to me given the relatively ultra tiny pixels of these portable screens, unless that feature set is for those looking to run regular desktop sized displays from portable devices.
 
The main sticking point for current mobile phone games is that textures are usually of abysmally low quality due to memory footprint constraints. Anything over bilinear will improve things slightly, but it won't make a huge difference either.

I've played Q3A mobile on a 320*240 screen and I can't make out what the texture sizes are in that one, yet they're definitely higher than in most other mobile games. I still noticed the lack of AF, though.

With AA it gets a wee bit trickier, since one might say that it's not absolutely necessary for games on a small form factor device (which is highly debatable IMO too), but you will need it for stuff like advanced UIs, and of course for things like OpenVG, where you'd need something around 16x sample AA to get the content to a decent anti-aliased level. A menu with rolling icons is about enough to show you that AA can make a significant difference even on a < 3" screen.
 
So any idea how it will perform against PowerVR SGX?

Looking at that feature table, it looks like a stripped-down version of old-tech GeForce...
 
So any idea how it will perform against PowerVR SGX?

If their performance rates are close to reality, and since Arun says it's an APX 2500 and their largest core under 65nm, you could eventually compare it against SGX540.
 
SGX 530 in the OMAP3530/Pandora is a 110MHz 1xTMU GPU, GoForce in the 65nm Tegra is a 120MHz 2xTMU GPU. The 45nm OMAP36xx seems to have a 200MHz SGX 530, FWIW, while the OGL ES 2.0 Imageon in the Qualcomm & Freescale platforms seems to be a 133MHz 1xTMU GPU.
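To put those clock and TMU figures side by side, here is a small sketch that multiplies them out into peak texel rates. The numbers are the ones quoted in the post (several of them hedged with "seems"), the labels are just for illustration, and peak rate says nothing about real-world efficiency.

```python
# (label, clock in MHz, TMU count) as quoted in the post above.
GPUS = [
    ("SGX 530 (OMAP3530)", 110, 1),
    ("Tegra 65nm", 120, 2),
    ("SGX 530 (OMAP36xx)", 200, 1),
    ("Imageon OGL ES 2.0", 133, 1),
]


def peak_mtexels_per_sec(clock_mhz, tmus):
    """Theoretical peak: one bilinear texel per TMU per clock (an assumption)."""
    return clock_mhz * tmus


peak_rates = {name: peak_mtexels_per_sec(mhz, tmus) for name, mhz, tmus in GPUS}

# Print the list sorted from highest to lowest theoretical texel rate.
for name, rate in sorted(peak_rates.items(), key=lambda item: -item[1]):
    print(f"{name:<22} {rate:>4} Mtexels/s")
```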

Of course, in the lot, the SGX 530 should be the one with the highest real-world efficiency out of that fillrate since it's a TBDR, doesn't need a Z-Pass, etc. - in theory it's plausible that it might handle very small triangles a bit better too, but that's not a given.

Things get more complicated when you consider AA & AF, since you'd assume that a TBDR could also be more efficient at AA. But then again, NV has 5xCSAA (i.e. 2xMSAA + 3 coverage samples); I asked for a demo of Quake3 at MWC, and at WVGA on NV's prototype I really couldn't see much if any aliasing with 5xCSAA. The demo moved a tad too fast to have a very clear idea of the AF, although it seemed OK and didn't shimmer. Talking of AF, obviously that's one of the cases where Tegra should theoretically do best against a SGX 530. It isn't clear to me whether they'll increase the number of TMUs in the 40nm chip; if not, obviously SGX 540 is pretty much superior in every way.

As always for handheld architectures, we know basically nothing about ALU performance (besides the fact Tegra obviously isn't unified and it's scalable further downwards in both VS & PS)... So in fact talking about performance like I did above doesn't make a whole lot of sense, but you gotta make do with what you have. And of course it'd be interesting if we could have a clear die size number, but there isn't a single '3D block' in the Tegra die shot; it's at least 3 blocks (maybe more), and the display pipeline might be included in one of the three. So pretty damn hard to get any comparable figure, although I did estimate it in a pretty credible way once.

BTW, semi-OT: I know we had some discussions in the past about the performance of the Imageon core in the STMicro STn8820, and well, I'm still not 100% sure but I can safely claim it doesn't matter because it will probably never sell a single unit... ST-Ericsson didn't even show it as part of their line-up/roadmap at MWC despite including the STn8815 in it. I was pretty happy about the little that ST-Ericsson did show though, FWIW. Definitely a big player in the handheld world and worth watching for...
 
So in the end we have to wait for some real life testing on devices based on those h/w platforms and running the same OS. It will be the only fair way of telling how they stand against each other.

For now we only know about the upcoming Toshiba TG01 (Snapdragon's Imageon Z430).
The Palm Pre (SGX530) is running WebOS, so comparing it against WinMo will probably be impossible (lack of benchmarking tools).
And unfortunately not even 1 announced tegra based smartphone. It's not encouraging...

I think the models that were supposed to be based on WinMo7 weren't moved to WinMo6.5 because of the older WinCE version underneath (WinCE 5 on WM6/6.1/6.5, WinCE 6 on the Tegra prototypes and the future WM7).
But I'd be happy to be proven wrong.
 
So in the end we have to wait for some real life testing on devices based on those h/w platforms and running the same OS. It will be the only fair way of telling how they stand against each other.
Don't hold your breath...

And unfortunately not even 1 announced tegra based smartphone. It's not encouraging...
Hmm? These are three ODM phones coming out in 2H09: http://www.engadget.com/photos/nvidias-tegra-in-the-flesh/1365027/
One slide mentioning those: http://www.engadget.com/2009/02/16/nvidias-tegra-jumps-on-the-android-bandwagon/

OEM phones are still slated for 2H09, although I guess at this point we're talking mid/late Q4 for availability for most (all?) of those. No idea if we'll see announcements at CTIA. MID/Netbooks are slated for Q3, should see announcements at Computex. WinMob7 delays probably hurt them, although they claim what hurt them the most was just the economic crisis in general reducing expenditures at their customers and project delays throughout the industry. I've heard Wolfson Micro (audio codecs & analogue etc.) basically say in a CC that the number of delays in the industry is substantially above average right now, so at least part of the problem isn't NV's fault.
 
Is that highest end variant or will there be another one?

Anyway 5xCSAA is just fine for the screen sizes it's aimed for. The most interesting part is the 8x AF bit.

Before anyone says it, games on mobile devices hardly ever enable AA or AF (I think the latest Q3A mobile version has an option for enabling AA though). I wonder if Tegra's TMUs are strong enough to handle AF or if it's just a bandwidth-constrained scenario, as Q3A on mobile devices typically is.

As for the critical comments in the first quote: I've heard better excuses in my lifetime than those LOL :D

I remember the Tegra being said to be based on the GeForce 6 architecture; filtering is cheap in transistors and fast on GeForce 6/7, though not 100% clean, which annoys me for old games on my desktop PC; but that should look better on a mobile screen.

the texture rate disappoints me, 240Mtexels.. only 33% better than a voodoo2 :p
I'm not interested in smartphones though, obviously they would clock it higher on a netbook?
 
Of course, in the lot, the SGX 530 should be the one with the highest real-world efficiency out of that fillrate since it's a TBDR, doesn't need a Z-Pass, etc. - in theory it's plausible that it might handle very small triangles a bit better too, but that's not a given.
Why do you think it might have an advantage with small triangles?
 
I remember the Tegra being said to be based on the GeForce 6 architecture; filtering is cheap in transistors and fast on GeForce 6/7, though not 100% clean, which annoys me for old games on my desktop PC; but that should look better on a mobile screen.

"Based" is quite a vague term; your die area is as limited as it can be on chips for the handheld/mobile market and each IHV has to set its priorities. Filtering is anything but "cheap" for such minuscule cores, and if you look at any mobile chip's floorplan it's quite obvious how much die area 1 TMU alone can take up.

That said, when you're bandwidth limited, AF can end up coming essentially for free.

the texture rate disappoints me, 240Mtexels.. only 33% better than a voodoo2 :p
I'm not interested in smartphones though, obviously they would clock it higher on a netbook?

They can scale frequencies as well as unit counts for a higher-end core. That said, considering Ion already has a 9400 in it, I'd think the next-generation Ion2 would come from the future IGP corner rather than from the handheld chip area.
 
I remember the Tegra being said to be based on the GeForce 6 architecture
Yup, as Ailuros said though, 'based' doesn't mean all that much in the handheld world.

the texture rate disappoints me, 240Mtexels.. only 33% better than a voodoo2 :p
But with Early-Z, and I'm not sure how much bigger than a Voodoo2 the die size actually is. Let's take a random number and say it's 8mm² (I estimated it once, but that's not even the right number because I forgot about it; at least this way it also applies to SGX) - that's on 65nm, and on 350nm that'd become 256mm². Voodoo2 was implemented as 3 chips on 350nm, each with a 64-bit memory bus IIRC. Surely that couldn't be less than 128mm², and probably more. Given the difference in programmability, I don't think it's all that surprising sadly.
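The node-scaling argument can be sketched as follows, assuming ideal scaling where area grows with the square of the feature size (real designs rarely scale that cleanly). The 8mm² starting point is the post's admittedly arbitrary number, and the function name is made up for illustration.

```python
def scale_area_mm2(area_mm2, from_node_nm, to_node_nm):
    """Ideal (quadratic) area scaling between two process nodes."""
    linear_factor = to_node_nm / from_node_nm
    return area_mm2 * linear_factor ** 2


ASSUMED_AREA_65NM_MM2 = 8.0  # arbitrary figure from the post, not a measurement

area_at_350nm = scale_area_mm2(ASSUMED_AREA_65NM_MM2, from_node_nm=65, to_node_nm=350)
print(f"8 mm² at 65nm is about {area_at_350nm:.0f} mm² at 350nm under ideal scaling")
```

With strictly quadratic scaling this comes out a little under the 256mm² quoted above (which corresponds to a flat 32x area factor), but it is the same ballpark and doesn't change the comparison against a three-chip Voodoo2.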

I'm not interested in smartphones though, obviously they would clock it higher on a netbook?
In ARM-based netbooks? Theoretically they could. I'm not sure it really matters though; 240MPixels/s is enough for a pretty 3D user interface even at 1280x1024, but overclocking it by 20% isn't magically going to give you enough performance to do anything useful gaming-wise at that resolution.
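To make the 1280x1024 point concrete, here is a rough fill-budget sketch. It treats the 240MPixels/s figure as a peak fill rate and assumes a 60Hz refresh target, which is not something stated in the thread.

```python
FILL_RATE_PER_SEC = 240e6  # figure quoted in the thread
WIDTH, HEIGHT = 1280, 1024
ASSUMED_REFRESH_HZ = 60    # assumption, not from the thread


def full_screen_layers_per_frame(fill_rate, width, height, refresh_hz):
    """How many times every pixel can be touched per frame at peak fill rate."""
    return fill_rate / (width * height * refresh_hz)


stock_layers = full_screen_layers_per_frame(
    FILL_RATE_PER_SEC, WIDTH, HEIGHT, ASSUMED_REFRESH_HZ)
overclocked_layers = full_screen_layers_per_frame(
    FILL_RATE_PER_SEC * 1.2, WIDTH, HEIGHT, ASSUMED_REFRESH_HZ)

print(f"stock:      {stock_layers:.2f} full-screen layers per frame")
print(f"20% faster: {overclocked_layers:.2f} full-screen layers per frame")
```

Roughly three full-screen layers per frame is plenty for a composited UI, and a 20% clock bump only moves that to under four, which fits the point that it wouldn't change much for games.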

3dcgi said:
Why do you think it might have an advantage with small triangles?
Well, TBH I was only thinking about pixel shading intensive cases (where you're probably going to be the most GPU-limited anyway) because of the shader's MIMD nature, but I couldn't get any info about whether they do anything interesting there, so I have no idea. They keep most of their MIMD marketing centered around branching, obviously.
 
But none of them seems to be running WM. Instead they will probably use android and that means that we won't be able to test them thoroughly due to lack of benchmarking tools.
I think the Compal is probably WM, but I'm not sure. Ironically, I know which basebands are in them but I don't know the OS - oops? :D Either way you'll have WM/WinCE devices coming out in 2H09...
 
Well, TBH I was only thinking about pixel shading intensive cases (where you're probably going to be the most GPU-limited anyway) because of the shader's MIMD nature, but I couldn't get any info about whether they do anything interesting there, so I have no idea. They keep most of their MIMD marketing centered around branching, obviously.
I'm not sure a MIMD advantage becomes any greater with small triangles. I guess it depends on the shader. A chip like the one Qualcomm bought from AMD probably branches on a quad granularity anyway so the point is probably moot. Though I don't know for sure what the branch width is.

If someone packs pixels from different quads they're likely to have an advantage.
 