NVIDIA Tegra Architecture

Discussion in 'Mobile Graphics Architectures and IP' started by french toast, Jan 17, 2012.

  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    With TK1 as is, they have something that nobody else has. A TK0.5 with dumbed down CPU and GPU would be an also ran among many others where you compete only on price. How would that be more successful?
     
  2. Picao84

    Veteran Regular

    Joined:
    Feb 15, 2010
    Messages:
    2,003
    Likes Received:
    1,053
    And what exactly did they achieve in the mobile phone market by having "something that no one else has"? Almost nothing. As far as I know, nVIDIA is retreating completely from mobile phones precisely because they cannot compete on price with TK1 :roll: Competition in the mainstream mobile phone market is mostly about pricing; features are barely relevant there, outside niches like photography.
     
  3. ams

    ams
    Regular

    Joined:
    Jul 14, 2012
    Messages:
    914
    Likes Received:
    0
    The mainstream "smartphone" market has commoditized to the point where low pricing for SoC+LTE modem matters more than anything else. It makes zero sense for NVIDIA to pursue this and devote engineering resources to it. Note that Tegra K1 is still expected to find its way into some high-end smartphones. The "tablet" market, on the other hand, has not commoditized in the same way, nor is an LTE modem needed or desired for base tablet models. The Xiaomi Mi Pad alone will sell millions of units per year, and I'm sure various other resourceful companies will use Tegra K1 too. In the markets and segments that Tegra K1 targets (i.e. tablets, high-end smartphones, micro gaming devices, automotive, high-res monitors and TVs), the GK20A GPU's performance and feature set are most certainly not overkill. In fact, the opposite is true: its stellar performance, feature set and perf/watt are very desirable in these target areas. And having the Tegra GPU architecture in step with the main desktop architecture is huge, because resources can be leveraged much better throughout the company. NVIDIA absolutely made the right decision here.
     
    #2503 ams, Jun 18, 2014
    Last edited by a moderator: Jun 18, 2014
  4. silent_guy

    Did I deny that?

    With TK1 they have something to differentiate other than price. With TK0.5 they have nothing. So why bother if you don't want to participate in this race to the bottom?

    It's too early to tell whether TK1 will be popular in the mobile space, but at least it has a chance in the automotive space (though we'll only know five years from now). With TK0.5, much less so.

    Clearly Nvidia hasn't found its profitable niche yet, but flogging the dead horse of mid-range SoCs didn't work, so why would they keep trying?
     
  5. wco81

    Legend

    Joined:
    Mar 20, 2004
    Messages:
    6,694
    Likes Received:
    544
    Location:
    West Coast
    Really? The iPhone and the Samsung Galaxy and Note are not commoditized, and those models move well over 100 million units a year.

    Of course the lower-priced phones are commoditized, but the sheer volume of the phone market dwarfs the tablet market.

    Meanwhile, tablets start at $99.
     
  6. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,469
    Likes Received:
    187
    Location:
    Chania
    You know the world doesn't spin around 3D only, especially in that market. Now do the same exercise again with something compute-oriented, for instance: you'd have to fiddle endlessly with weird compiler optimisations just to keep half of those Vec4 ALUs in the Malis busy, while GK20A, thanks to its scalar ALUs, will come as close as possible to its peak FLOP values.

    Besides, it was just a case example; no IHV would be so idiotic as to develop a GPU that wide only to clock it at just 150MHz on 28HPM. The likelier case would be a fraction of the 192 SPs the GK20A currently has, at a frequency of =/>700MHz, which of course doesn't change the above one bit.

    For the record's sake, Xiaomi seems to be experimenting with different frequencies, since the results keep bouncing up and down; look at the ones here: http://gfxbench.com/subtest_results_of_device.jsp?D=Xiaomi+MiPad&id=555&benchmark=gfx30

    At a theoretical 150MHz, GK20A delivers 57.6 GFLOPS FP32; how many GFLOPS does the Mali-T628MP6 in the Galaxy S5 have, exactly?

    http://community.arm.com/thread/5688 (no it isn't clocked at "just" 533MHz afaik)
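    The 57.6 figure above is just the standard peak-throughput arithmetic. A quick sketch (the 192-SP count and the clocks are the ones quoted in this thread; the 2-FLOPs-per-lane-per-clock factor assumes one FMA issued per scalar lane per clock, and both function names are made up for illustration):

```python
def peak_gflops_fp32(alu_lanes, clock_mhz, flops_per_lane_per_clock=2):
    """Theoretical peak FP32 throughput in GFLOPS.

    flops_per_lane_per_clock=2 assumes one fused multiply-add
    (2 FLOPs) issued per lane per clock.
    """
    return alu_lanes * flops_per_lane_per_clock * clock_mhz / 1000.0

def effective_gflops(peak_gflops, avg_lane_utilisation):
    """Scalar ALUs can approach utilisation ~1.0 on scalar-heavy
    compute code; a Vec4 design only hits its peak when the compiler
    keeps all four lanes of every ALU busy."""
    return peak_gflops * avg_lane_utilisation

# GK20A: 192 scalar SPs at the hypothetical 150MHz -> 57.6 GFLOPS
print(peak_gflops_fp32(192, 150))
# The same ALU count at the likelier =/>700MHz -> 268.8 GFLOPS
print(peak_gflops_fp32(192, 700))
# A Vec4 machine keeping only half its lanes busy loses half its peak
print(effective_gflops(peak_gflops_fp32(192, 150), 0.5))
```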

    Melodramatic? How about you give me a viable percentage of how many smartphones Samsung actually sells with Exynos in total versus how many with Qualcomm's SoCs. They're not using fewer S8xx SoCs lately but increasingly more, and it sure must be because of those engineering marvels that no one wants to have, not even Samsung Mobile.


    There's a comment from JohnH about it here in the forum, and as an experienced hw engineer I trust him more than you, for reasons I hopefully don't have to explain.

    ***edit: to save you from searching

    http://forum.beyond3d.com/showpost.php?p=1834992&postcount=23
    http://forum.beyond3d.com/showpost.php?p=1835124&postcount=26

    Besides that, I mentioned the die area for the tessellation unit alone, which obviously flew completely over your head.

    Good for them. For Windows I'll personally take a real Windows machine with a proper desktop GPU (even a low-end one) instead. The S805 doesn't sound like a solution I'd want for such a case.
     
  7. Ailuros

    Even if T4i had been successful, its GPU would have been criticised in the long run for its lack of compute capabilities.

    You have to think that between taking the Aurora/T4 GPU, changing the entire thing to USC ALUs, giving it the extensive compute capabilities that OGL_ES3.1 and future relevant APIs would require, etc. etc., versus taking an existing design like Kepler and shrinking it down to ULP mobile SoC levels, the latter sounds many times easier and cheaper to me. One could argue that they could have removed tessellation from the Kepler design, but I as a layman suspect it's integrated too deeply into the pipeline, and removing it would cost more resources than leaving it be.

    Besides, they won't have to change much in the feature set in future iterations and can concentrate mostly on efficiency increases; eventually there's no escaping DX11 for anyone either.
     
  8. Picao84

    Ailuros and silent_guy, my point is not about what TK1 is or should have been. My point is that they bet the house on only one design per generation (the sole exception being the T4i), all the while shooting for the roof. I'm probably simplifying, but they mostly used off-the-shelf ARM designs if we don't count the companion core. Would it have been so hard to make a SoC without it, with a less beefy GPU, while still supporting Tegra-enhanced titles?

    In any case, they should have known from the start that phones are a low-price market compared to desktop. What were they expecting? Mobile phones for 1K dollars/euros, selling like hot cakes? The mobile phone market could never command the sort of prices nVidia is used to, and it was blatantly foolish if they believed otherwise. That's why I stand by my opinion that if they expected to make a dent in the market, they needed a lower-cost, more humble option. Don't blame me, I'm just pointing out the obvious flaw in their initial strategy :p

    ams, of course it doesn't make sense now, and they did the right thing in admitting defeat. I'm talking about years ago, when they started Tegra. Their vision of the future was too optimistic and they underestimated how the market would evolve.
     
  9. ams

    Keep in mind that the only area in ultra mobile that NVIDIA is not pursuing right now is mainstream smartphones (for a variety of reasons including price sensitivity, time to market for certifications, bundled modems available from competitors). Tegra's focus and breadth has shifted and evolved as the ultra mobile market has evolved. A very low cost "mainstream" SoC such as T4i was envisioned many years ago (and the existence of that SoC was likely due in part to Samsung and Apple absolutely dominating the higher end smartphone space), but that strategy just didn't pan out for the reasons mentioned above. Anyway, Tegra is about much more than simply smartphones, and Tegra as a whole is growing again after the lull with the T4 generation.

    And FWIW, just because Tegra K1 is an ambitious SoC design (with 32-bit Cortex A15 R3 and 64-bit fully custom Denver CPU variants to boot!) doesn't mean that it is not a cost-effective SoC design (remember that the SoC die size is no bigger than Snapdragon 800, and there is no additional cost associated with a built-in LTE modem, and the fabrication is done on a now mature and high yielding 28nm HPM process). The Xiaomi Mi Pad will be selling in China for less than $300 USD. Tegra is obviously not in any way targeted at $1000 USD devices (other than some super premium and limited edition 128GB Google 3D Dev Tablet with advanced sensors on board).

    Anyway, creating two or more totally different SoCs solely for use in the ultra mobile space is easier said than done; it is a very resource- and time-intensive task that is only justifiable if there is a high probability that volumes will be consistently high.
     
    #2509 ams, Jun 18, 2014
    Last edited by a moderator: Jun 18, 2014
  10. Ailuros

    Yes, it would have been possible; however, I was and am among those who spent all those years up to Logan pushing for NV to concentrate more on the GPU side of things.

    When they started with Tegra 1 around half a decade ago they could have tried a Mediatek-like business scheme from the get-go, that much is true, and they might have been successful with it. Tegra 4i was way too late for that, while in the meantime low-cost SoC manufacturers like Rockchip, Allwinner, Mediatek and Samsung, amongst others, established themselves above all in the Chinese market.

    Now, the recipes for possible past success, or a bunch of "might have beens", could be many; the reality is that here and today they're still struggling for market share. If something like GK20A cannot turn any heads in any of the consumer markets, then I'm not so sure they'll ever manage to gain any worthwhile market share there.

    Take the MiPad as just one example: if it ends up at =/>$240 as is rumored, then you have next-generation tablet performance at a much more affordable price than from Apple or any other tablet vendor.

    By the way, in case part of your reasoning concerns the GK20A frequencies: NV never stated that we'll get a 950 or 850MHz GPU in an ultra-thin tablet form factor. Under the right conditions even 950MHz should be reachable, in markets where extensive active cooling measures aren't taboo.
     
  11. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,492
    Likes Received:
    471
    You shouldn't assume QCOM added tessellation for phones. They thought WinRT would be more successful, and they were not happy when Microsoft released the Surface. Depending on the implementation, supporting tessellation isn't a huge area adder. It's something, but a small percentage of the overall GPU.
     
  12. Ailuros

    Compared to DX10 or even DX9L3? I severely doubt that, especially in the latter case. Apart from Qualcomm and NVIDIA, I'm not aware of any other ULP mobile DX11 GPU shipping yet. Considering Qcom managed to give the Adreno 420 barely 40% more arithmetic efficiency than the Adreno 330, I have to wonder where all the additional die area went.

    Speaking of which, Microsoft should skip the nonsense and lighten up a bit on whatever GPU requirements they have in mind for the future of the ULP SoC space; Tegras were able to make it into WinRT devices with barely DX9L1, after all. It's Microsoft that needs a ticket to establish itself in that market, not the other way around.

    That said, Windows mobile is an excellent mobile OS; I've tried it and like it as a sw platform. If it had more extensive support from third parties I wouldn't mind going for it at all.
     
  13. 3dcgi

    Compared to DX10 or even other DX11 features. If you don't have LDS and the ability to run sophisticated shaders, you shouldn't bother with DX11 tessellation yet. A hardware tessellator is minimal additional area for a GPU larger than 50 mm^2.
     
  14. Ailuros

    Where do we have any GPU blocks in SoCs yet that account for even as much as 50 square millimetres? And no, I still can't believe that the difference between a DX10.0 and a DX11.x ALU is anywhere close to "minuscule".

    Or else IMG is simply lying when it says that even "just" improved rounding support has a significant hw cost: http://blog.imgtec.com/powervr/powervr-gpu-the-mobile-architecture-for-compute

    As a layman I understand "significant" to mean at least 10%. Now, if my original estimate of at least +50% from 10.0 to 11.x counts as "minimal" additional area in your book, so be it.
     
  15. silent_guy

    I always thought that tessellation in DX11 was mostly a shader affair and that the pure tessellation hardware functionality is actually very minimal: you have all these hull and domain shaders with a HW expander in the middle that creates indices and some attributes. Basically, HW that may be quite tricky from a design point of view, but not in terms of area.
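    That split can be sketched in a toy model (entirely a simplification for illustration, not any real GPU's algorithm: `expand_quad_domain` stands in for the fixed-function expander, which is counter/divider-style logic, while `domain_shader` stands in for the per-vertex math that runs on the regular shader ALUs):

```python
def expand_quad_domain(tess_factor):
    """Toy fixed-function tessellator stage: given a tess factor,
    emit normalized (u, v) domain points over a quad patch. Real
    hardware also emits connectivity (indices); the point is that
    this is cheap index-generation hardware, not ALU area."""
    n = tess_factor
    return [(i / n, j / n) for j in range(n + 1) for i in range(n + 1)]

def domain_shader(patch_corners, uv):
    """The per-point work runs on the normal shader ALUs: here just
    bilinear interpolation of four patch corners at (u, v); real
    domain shaders evaluate arbitrary surfaces such as Bezier patches."""
    (p00, p10, p01, p11), (u, v) = patch_corners, uv
    lerp = lambda a, b, t: tuple(x + (y - x) * t for x, y in zip(a, b))
    return lerp(lerp(p00, p10, u), lerp(p01, p11, u), v)

corners = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
verts = [domain_shader(corners, uv) for uv in expand_quad_domain(4)]
print(len(verts))  # 25 vertices for tess factor 4
```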
     
  16. Ailuros

    And who says that hull & domain shaders and all the other logic you need for programmable tessellation come for free?
     
  17. 3dcgi

    If you support compute shaders, the additional area for tessellation is a fraction of a percent. I was not referring to modifying the ALUs, which is why I said "if you already support compute shaders". If you're talking about a 5 mm^2 GPU then anything can be considered significant.
     
  18. Ailuros

    With most high-end ULP SoCs ranging from 100 to 130mm2, how can the GPU blocks in them be anywhere near 50mm2? Not that it's impossible, but I wouldn't imagine the worst-case share for a GPU block being higher than a quarter of the entire SoC die area, even on K1.
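    The arithmetic behind that upper bound, as a quick sketch (the 100-130mm2 die sizes are the ones quoted in the thread; the one-quarter GPU share is the worst-case guess from the post, not a measured figure):

```python
def max_gpu_area_mm2(soc_area_mm2, gpu_share=0.25):
    """Upper bound on the GPU block's area, assuming it occupies at
    most `gpu_share` of the whole SoC die."""
    return soc_area_mm2 * gpu_share

# High-end ULP SoCs quoted at 100-130 mm^2: the GPU block then tops
# out at 25-32.5 mm^2, well short of the 50 mm^2 figure above.
for soc_area in (100, 130):
    print(soc_area, max_gpu_area_mm2(soc_area))
```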

    As for the rest, I wish one of the ULP GPU hw engineers would clarify that story, but that's hardly likely, since they'd reveal too much they probably shouldn't.
     
  19. 3dcgi

    I just threw 50 mm^2 out there thinking about SoCs that are over 100 mm^2. I'm not going to dig for numbers since I'm on a phone in an airport, but if someone is interested in a comparison they can look at the ATI RV610 for an example of when ATI added a fixed-function tessellator.

    There have been other AMD designs of tablet size which contain multiple tessellation blocks. It's hard to remove features once they've been added...

    Anyway, my point in entering the conversation was to state my understanding of Qualcomm's mindset in supporting DX11, as learned from talking to Qualcomm employees.
     
  20. Ailuros

    Not in the least relevant, since it was never designed for a ULP SoC.

    That's what I told Picao a few posts above about the GK20A in K1.

    Well, ask them off the record, then, what the hypothetical result would have looked like if they had designed an Adreno 330 successor that was only OGL_ES3.1 compliant but had exactly the same die area as the Adreno 420. Something tells me the difference between those two wouldn't have been just 40% peak arithmetic efficiency.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.