NVIDIA Tegra Architecture

Going ARM and being custom designs are not mutually exclusive, see Qualcomm, Apple, etc....

They're already ARM. The internal instruction format they use isn't really any more material to the CPU architecture than the format of the uops in Intel's uop cache. Besides, they can already decode and execute ARM code without caching it as something else first.
 
Going ARM and being custom designs are not mutually exclusive, see Qualcomm, Apple, etc....

And it is axiomatic that they would be better served by using a reference design, if that reference design is just better in every way.

If they used only custom CPU cores based on the ARM ISA (Denver-whatever) in their SoCs, then I would better understand the point of going through the quite high added cost of a custom core. Apple has its own custom cores based on the ARM ISA, but it uses them exclusively in its SoCs (no mix & match with garden-variety ARM CPU IP) and it also has its own OS. I'm not sure, but I've heard that Qualcomm might plan to abandon KRYO down the line again....I wouldn't be in the least surprised if that's true....

For the record (and that's only hearsay) the original Denver core supposedly could not get ISO 26262 & ASIL certifications, hence my earlier question whether the most important changes in Denver2 are in that direction.

Overall I agree with you: there's no SOUND reason for NV to have a unique custom CPU core if they're going to use it only for the minority of demanding CPU tasks, and since there's no unique underlying software platform present either. And before someone says it: yes, NV's diagrams show Parker to be roughly 35% ahead in efficiency in one synthetic benchmark against other SoCs, but fortunately for marketing they didn't have to include future SoCs that will appear in devices at the same time as Parker will in automotive implementations.

Overall this setup is very close to big.LITTLE, except with the Denver cores seemingly encompassing parts of both "big" and "little" depending on the task. With all of that said however, it should be noted that NVIDIA has not had great luck with multiple CPU clusters; Tegra X1 featured cluster migration, but it never seemed to use its A53 CPU cores at all. So without having had a chance to see Parker's HMP in action, I have some skepticism on how well HMP is working in Parker.

http://www.anandtech.com/show/10596/hot-chips-2016-nvidia-discloses-tegra-parker-details

I don't see why HMP shouldn't work as advertised, but I can also understand the scepticism.

Good news: the GPU actually clocks at 1.5GHz, meaning ≥1.5 TFLOPS FP16 or ≥750 GFLOPS FP32.
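A quick back-of-the-envelope check of those figures, assuming the widely reported 256 CUDA cores for Parker's GPU (the core count is an assumption, not something stated in this thread):

```python
# Rough peak-throughput estimate for Parker's GPU at 1.5 GHz.
# The 256-core count is an assumption based on public reports.
cuda_cores = 256
clock_hz = 1.5e9
flops_per_core_per_cycle = 2          # one FMA counts as 2 FLOPs

fp32_gflops = cuda_cores * flops_per_core_per_cycle * clock_hz / 1e9
fp16_tflops = 2 * fp32_gflops / 1e3   # double-rate FP16, as on TX1's Maxwell

print(f"FP32: {fp32_gflops:.0f} GFLOPS")   # 768 GFLOPS, i.e. >= 750
print(f"FP16: {fp16_tflops:.2f} TFLOPS")   # 1.54 TFLOPS, i.e. >= 1.5
```

Which lines up with the ">=750 GFLOPS FP32 / >=1.5 TFLOPS FP16" figures above.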

Same writeup at anandtech for the GPU efficiency:

As the unique Maxwell implementation in TX1 was already closer to Pascal than any NVIDIA dGPU - in particular, it supported double-rate FP16 when no other Maxwell did - the change from Maxwell to Pascal isn't as dramatic here. However, some of Pascal's other changes, such as fine-grained context switching for CUDA applications, seem to play into Parker's other features such as hardware virtualization. So Pascal should still be a notable improvement over Maxwell for the purposes of Parker.
 
Overall I agree with you: there's no SOUND reason for NV to have a unique custom CPU core if they're going to use it only for the minority of demanding CPU tasks, and since there's no unique underlying software platform present either. And before someone says it: yes, NV's diagrams show Parker to be roughly 35% ahead in efficiency in one synthetic benchmark against other SoCs, but fortunately for marketing they didn't have to include future SoCs that will appear in devices at the same time as Parker will in automotive implementations.

Denver 2 should be a less expensive product for Nvidia in ARM licensing. Also it should not have cost that much to port Denver 1 from 20nm.

The DCO of Denver 2 should really work well with the static code base that would be in automotive.
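A minimal sketch of why a static code base favors a dynamic code optimizer: translation is paid once per hot trace and then amortized over many executions. All names and the cost model below are invented for illustration and have nothing to do with Denver's actual DCO internals:

```python
# Toy model of a DCO's translation cache: a hot trace is translated
# once (the expensive step) and every later execution is served from
# the cache. A static code base, as in automotive, means a small,
# fixed working set of traces and a near-100% hit rate after warm-up.
class TraceCache:
    def __init__(self):
        self.cache = {}
        self.translations = 0

    def execute(self, trace_id):
        if trace_id not in self.cache:
            self.translations += 1              # one-time optimization cost
            self.cache[trace_id] = f"optimized({trace_id})"
        return self.cache[trace_id]             # cheap cached dispatch

dco = TraceCache()
# a fixed set of 4 hot traces, each executed 1000 times
for _ in range(1000):
    for t in ("t0", "t1", "t2", "t3"):
        dco.execute(t)

print(dco.translations, "translations for 4000 executions")
# -> 4 translations for 4000 executions
```

With a churning consumer workload the working set would keep evicting and retranslating; with a fixed automotive image the translation cost effectively disappears after warm-up.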
 
They're already ARM.
Eh..... It is more like "they're already ARM and anything else they could potentially want to be"...

Besides, they can already decode and execute ARM code without caching it as something else first.
Yes, and performance goes down when it does that. I mean if the ARM decoder was already optimal in the first place then....

Denver 2 should be a less expensive product for Nvidia in ARM licensing.
Well, if you ignore R&D costs maybe....
 
Eh..... It is more like "they're already ARM and anything else they could potentially want to be"...

A couple things:

1) The ARM decoders are an important part of the design and can't just target something else.
2) They almost certainly steered many aspects of the design to allow as lightweight a transformation to ARM as possible.

For example, there are a lot of features in NEON that are not in various other SIMD instruction sets that they absolutely need to support at a low level in order to provide a sane mapping.

This is the difference between binary translation targeting some random other ISA and designing a translation platform with one architecture in mind. That's why something like Houdini tanks performance vs native code, while Denver can at least pull off respectable performance most of the time. I know they're said to have targeted x86 originally, but they had a lot of time to redesign things to be better suited for ARM. If they tried to target something else now it wouldn't work very well, even ignoring the lack of decoders.

The whole translation design is integral to many parts of the uarch, there are good reasons why they don't just run ARM. You can't really extract that much ILP from an in-order uarch without doing run-time translation that heavily reorders and renames the code, while also providing low-cost speculation and assertions. It also helps that they can expose the uarch more in the ISA and evolve it with time.
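A toy illustration of the renaming part of that argument: rewriting architectural destinations to fresh physical registers removes the WAW/WAR hazards that would otherwise serialize a trace on an in-order machine. The instruction tuples and register names are invented for illustration and reflect nothing about Denver's actual internal format:

```python
# Toy register renamer. An instruction is (dest, src1, src2).
# Each destination gets a fresh physical register, so later reuse of
# an architectural register no longer forces ordering.
def rename(trace):
    mapping = {}          # architectural -> physical register
    next_phys = 0
    out = []
    for dest, s1, s2 in trace:
        # read sources through the current mapping
        s1 = mapping.get(s1, s1)
        s2 = mapping.get(s2, s2)
        # allocate a fresh physical register for the destination
        phys = f"p{next_phys}"
        next_phys += 1
        mapping[dest] = phys
        out.append((phys, s1, s2))
    return out

# r0 is written twice: a WAW hazard. After renaming, the third op
# (independent of the first two) is free to be scheduled alongside them.
trace = [("r0", "r1", "r2"),
         ("r3", "r0", "r4"),
         ("r0", "r5", "r6")]
print(rename(trace))
# -> [('p0', 'r1', 'r2'), ('p1', 'p0', 'r4'), ('p2', 'r5', 'r6')]
```

A software optimizer doing this at run time (plus reordering and speculation) is how an in-order pipeline can still see a wide window of independent work.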
 
1) ????

2) Ok.... but real-world performance results are real-world performance results

You can't really extract that much ILP from an in-order uarch without doing run-time translation that heavily reorders and renames the code, while also providing low-cost speculation and assertions.
I understand that, and that was more or less what I was getting at. If they want to do a custom core, fine. But just go ARM only and OoO. I mean I'm sure that DenverX is very good and perhaps unbeatable some of the time, but that is generally not what one wants or needs out of a CPU. Basically, you want the highest IPC + lowest latency arch possible. And Denver (in its current form) gives you that sometimes.
 
Denver 2 should be a less expensive product for Nvidia in ARM licensing. Also it should not have cost that much to port Denver 1 from 20nm.

The DCO of Denver 2 should really work well with the static code base that would be in automotive.

Must be the reason why they hired additional engineers for Denver2 development.

--------------------------------------------------------------------------------------------------------------------------------

Exophase,

That's all probably true for Parker/PX2/Denver2 and onwards only; if the story about Denver1 failing certain certifications is true, the Denver cores won't be used in any Tegra K1 64-bit automotive module for much more than infotainment. That raises the question whether Denver is really a necessity for today's automotive needs, or whether they just kept investing in it because they had no other choice. Because if anyone tries to convince me that Denver was meant for automotive from the get-go, don't expect me to believe it. If the story of Denver1 failing certifications is even true, then it was originally meant for anything BUT automotive.

That said, I would figure (and I'd be happy to stand corrected) that margins for Parker or Parker + Pascal modules should be high enough to justify the investment in a custom CPU core. I don't know what they're selling the 2+2 module for, but I can easily imagine it's a quite obscene price per module.

---------------------------------------------------------------------------------------------------------------------------------

OT but since I just read it, a few interesting indicative perf/W figures per SoC and the according manufacturing process used:

http://www.anandtech.com/show/10545/the-meizu-pro-6-review/7
 
1) ????

2) Ok.... but real-world performance results are real-world performance results

I understand that, and that was more or less what I was getting at. If they want to do a custom core, fine. But just go ARM only and OoO. I mean I'm sure that DenverX is very good and perhaps unbeatable some of the time, but that is generally not what one wants or needs out of a CPU. Basically, you want the highest IPC + lowest latency arch possible. And Denver (in its current form) gives you that sometimes.

Basically what you're saying is their approach failed, scrap everything and start over doing a core that's more like the other custom ARM cores.

But I don't think the verdict is in yet as to whether or not their approach makes sense. There isn't good data on what the actual power consumption is like, and we don't really know how far they can improve what they have. I would say that as far as performance is concerned Denver (in the, like, single device we got it in) did very well almost all of the time. When it didn't do as well it still did pretty well. It's certainly not the only custom ARM core you could describe this way; I would say it tended to do better than Krait, which didn't do so hot on a variety of things.

Most benches Denver didn't do as well on were well threaded, so it suffered from having only two cores. The only single threaded one it did badly on that I can remember was Sunspider, but that's a useless set of microbenches.
 
Basically what you're saying is their approach failed, scrap everything and start over doing a core that's more like the other custom ARM cores.
What the actual fuck? No. I did not call Denver 1 a failure. Please do not put words in my mouth or engage in strawmen. If you do that, I will not have a conversation with you.

For the record, I think Denver is/was pretty good. Not outstanding, but certainly not bad. In fact, it is/was pretty unique and a remarkable achievement, IMHO. It is certainly no P4 or Bulldozer. Maybe a good comparison would be Core2. But as good as Core2 was, Nehalem was that much better. Usually in the tech industry you are either moving forward or being left behind. And in business in general, the bottom line comes before ideals, or else you won't be in business very long to practice your ideals.

If Nvidia is able to improve upon Denver and make an in-order CPU competitive or even better than out-of-order ones, then great. But would you really bet on that being the case? Zen is coming, whateverLake from Intel is coming, super-mega-typhoon from Apple is coming, the A72 is already pretty damn solid and I doubt ARM is going to stop there. Winter is coming. And I don't have faith in in-order to keep me warm. But that is just my opinion. You don't have to like it or agree with it.
 
What the actual fuck? No. I did not call Denver 1 a failure. Please do not put words in my mouth or engage in strawmen. If you do that, I will not have a conversation with you.

For the record, I think Denver is/was pretty good. Not outstanding, but certainly not bad. In fact, it is/was pretty unique and a remarkable achievement, IMHO. It is certainly no P4 or Bulldozer. Maybe a good comparison would be Core2. But as good as Core2 was, Nehalem was that much better. Usually in the tech industry you are either moving forward or being left behind. And in business in general, the bottom line comes before ideals, or else you won't be in business very long to practice your ideals.

If Nvidia is able to improve upon Denver and make an in-order CPU competitive or even better than out-of-order ones, then great. But would you really bet on that being the case? Zen is coming, whateverLake from Intel is coming, super-mega-typhoon from Apple is coming, the A72 is already pretty damn solid and I doubt ARM is going to stop there. Winter is coming. And I don't have faith in in-order to keep me warm. But that is just my opinion. You don't have to like it or agree with it.

You said that they need to start executing ARM directly all of the time (what I guess you must mean by "ARM only") and go OoO.

You might not be aware of this, but that means totally scrapping the design they have and practically starting over from scratch. It's nothing like the difference between Core 2 and Nehalem; it is not such an evolutionary, incremental change. It is closer to the difference between Netburst and Core 2, although I'd say probably greater. That was a total scrapping of a uarch family, and probably not something they should do for anything less than a failed approach, something I would also say applied to Netburst.

So while you weren't actually saying that they need to scrap the design because it failed, those are the implications of your recommendations, which is why I said it's basically or effectively what you're saying.
 
You might not be aware of this but that means a total scrapping of the design of they have and practically starting over from scratch.
And you continue to be condescending, and repeat the same strawman. We are done. Bye.
 
I see you forgot to include a link.

I don't need one. You don't have any obligation to believe others either; you wouldn't anyway.

--------------------------------------------------------------------------------------------------------------------------

On the flip side, irrespective of whether Denver is a masterpiece or just another garden-variety design on its own merits, it should be noted that Tegra X1 was already quite a bit ahead of many competing solutions for automotive, so there's really no BIG need to over-design anything. NVIDIA now has the luxury to slow down its Tegra roadmap and make it less aggressive, because updates aren't as frequent or cut-throat as in the consumer markets.

Need more FLOPs? Add another SoC on the module, or go up to four chips, two of which are dedicated GPU chips. NVIDIA is well positioned for the high-end automotive market, and that primarily thanks to its GPU IP; the competitiveness of their solutions doesn't change dramatically whether they have "just" A57 cores in there or Denver cores on top of those.
 
No, I didn't make anything up, but as I said you are free to believe anything you want, always within your highly biased perspective. And since I supposedly "make things up": if my fantasy serves me well, Parker doubles Manhattan 3.0 and T-Rex scores compared to X1, and while in Geekbench it's >20% ahead of the Exynos 8 in single-threaded, the tables turn by a much higher percentage in the latter's favor for multi-core.
 
I might be wrong, but given the data provided by the author it sounds more like a Shield TV successor.
 