NVIDIA Tegra Architecture

These days Nvidia die shots are highly implausible; pasting 192 little squares next to each other is really not how a Kepler gets made. It is like an artist's vision of an exoplanet.
 
These days Nvidia die shots are highly implausible; pasting 192 little squares next to each other is really not how a Kepler gets made. It is like an artist's vision of an exoplanet.

I don't think that's fair. Artistic depictions of exoplanets tend to be based on as much concrete information as is available.
 
Erinyes is usually pretty spot on about these things, but it seems it's only you and I who notice ;)

;) I guess I was right about most things regarding Erista..even the stuff I didn't have info on and speculated. But I was wrong about one thing..I said I did not think we would see a 16FF Denver SoC this year. But what I gather from the AT article, combined with my own info, leads me to believe that we will (i.e. what appears to be the delayed Parker).

Oh and we were speculating 8 ROPs for Erista but they've gone straight for 16, which is the same as GM107! That's quite an increase, but given the big 4K push it's not a bad move IMHO. The FP16 stuff was a bit of a surprise as well. I'm sure Rys is really happy about that :D
Erista is on TSMC 20SoC from what I've heard, and I haven't seen anything planned for 16FF in 2015, no matter what. As for the rather interesting power measurements they have in that link, is that with or without throttling?

Erista is most definitely on 20SoC. However, as stated above, it seems like there could be a Denver-based SoC on 16FF by the end of the year.
I need time to study that article, since the CPU integration especially sounds quite interesting. As for FP16-related optimisations, let's hear what the usual suspects have to say about it NOW. :rolleyes:

Nvidia always seems to have done things differently with regard to their CPU configuration..though I don't know if that's always been to their benefit. How much of an effort is it to design your own interconnect as opposed to using ARM's?
Eventually someone will have a die shot and we'll find out. Without knowing the exact transistor density, though, die area is only half the useful information. If they've used a density comparable to the A8X's (an estimated 24M transistors/mm²) and if your estimate is true, it could be a healthy bit over 2B transistors.
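As a rough sanity check (assuming, purely for illustration, an A8-sized die of around 90 mm², which nobody has confirmed): 24M transistors/mm² × 90 mm² ≈ 2.16B transistors, i.e. indeed a healthy bit over 2B.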

Well I would have expected a die similar in size to the A8, so yeah, a transistor count around 2B seems about right. I'm surprised Nvidia haven't mentioned it in the marketing slides.
You could drink shots of coloured paints and then hover over a sheet of paper and wait for nature to take its course, and you'd end up with a more accurate die shot than what's in NV Tegra marketing.

:LOL:
 
Nvidia always seems to have done things differently with regard to their CPU configuration..though I don't know if that's always been to their benefit. How much of an effort is it to design your own interconnect as opposed to using ARM's?
TK1-32 and TK1-64 also have a custom interconnect. I imagine that there is quite a sizeable engineering investment, especially since they claim it's "cache coherent". The fact that they say it's cache coherent but it still remains a cluster-migration SoC is extremely eyebrow-raising. I hope we'll find out more in the coming months.
 
Saw a press release where the X1 was touted as advanced car tech, recognizing signs and other road objects.

But Nvidia isn't saying they're introducing a self-driving car platform, are they?
 
I am glad to see that first AMD and now Nvidia have joined the ranks. Now everybody has FP16 ALU support in their forthcoming chips.
I haven't seen much detail on what AMD is doing about FP16, unfortunately.
The packed operation method Nvidia is using is at least conceptually similar to what was discussed for Nvidia's future Echelon design. The increasingly software-scheduled architectures of late show shades of this as well.
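For anyone who wants to picture what the packed/vec2 FP16 model looks like from the software side, here's a minimal CUDA sketch, assuming the cuda_fp16.h intrinsics and compute capability 5.3 (the GM20B in X1; compile with nvcc -arch=sm_53). It only illustrates the concept of two FP16 lanes sharing one 32-bit register and one instruction; it says nothing about NVIDIA's internal implementation.

Code:
#include <cstdio>
#include <cuda_fp16.h>

// One thread packs two FP16 values into a single 32-bit __half2 register
// and multiplies both lanes with a single instruction.
__global__ void mul_fp16x2(float* out)
{
    __half2 a = __floats2half2_rn(1.5f, 2.0f);   // lanes: 1.5 and 2.0
    __half2 b = __floats2half2_rn(4.0f, 0.5f);   // lanes: 4.0 and 0.5
    __half2 r = __hmul2(a, b);                   // both lanes in one op
    out[0] = __low2float(r);                     // expect 6.0
    out[1] = __high2float(r);                    // expect 1.0
}

int main()
{
    float* out;
    cudaMallocManaged(&out, 2 * sizeof(float));
    mul_fp16x2<<<1, 1>>>(out);
    cudaDeviceSynchronize();
    printf("lane0 = %f, lane1 = %f\n", out[0], out[1]);
    cudaFree(out);
    return 0;
}

The whole point of the vec2 packing is that an instruction like that occupies one FP32 lane while retiring two FP16 results, which is where the doubled FP16 rate comes from.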

TK1-32 and TK1-64 also have a custom interconnect. I imagine that there is quite a sizeable engineering investment, especially since they claim it's "cache coherent". The fact that they say it's cache coherent but it still remains a cluster-migration SoC is extremely eyebrow-raising. I hope we'll find out more in the coming months.
It looks to be a major undertaking, going by how infrequently major shifts happen for coherent interconnects.

AMD, for example, has been sporting variations of the same interconnect since Llano, with the coherent northbridge interconnect being a crossbar/system queue arrangement that dates back to the K8. Assuming AMD manages to finally replace it, Jim Keller would have been there at the birth of the interconnect and present for its sunset.

Intel notably shifted from a crossbar with Nehalem to a ring bus setup for Sandy Bridge and multiple generations afterward. That might not be replaced for a generation or two more, with the mesh network being introduced with Knights Landing.
 
So nVidia says that Tegra X1 is using Cortex-A57 + A53 instead of Denver because this was faster and simpler to implement in 20nm. But Anandtech says that they have a completely custom physical implementation. In that case, how would "hardening" A57 and A53 - two CPUs they've never used before, especially the latter - be faster or simpler than using Denver, which they already have a custom implementation of and which would require a more straightforward shrink? The only way this makes sense is if A57 was ready long before Denver, but the release timescales of devices using each respective CPU make this seem unlikely. So I'm skeptical that both of these claims are completely accurate.
 
Nvidia always seems to have done things differently with regard to their CPU configuration..though I don't know if that's always been to their benefit. How much of an effort is it to design your own interconnect as opposed to using ARM's?

Doing it their way instead of ARM's way has benefits:

However, rather than a somewhat standard big.LITTLE configuration as one might expect, NVIDIA continues to use their own unique system. This includes a custom interconnect rather than ARM’s CCI-400, and cluster migration rather than global task scheduling which exposes all eight cores to userspace applications. It’s important to note that NVIDIA’s solution is cache coherent, so this system won't suffer from the power/performance penalties that one might expect given experience with previous SoCs that use cluster migration.

http://www.anandtech.com/show/8811/nvidia-tegra-x1-preview
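To make the distinction in that quote concrete, here's a deliberately crude toy model of the two policies (it has nothing to do with NVIDIA's or ARM's actual schedulers, and the threshold is made up): cluster migration exposes only one four-core cluster at a time and swaps the whole thing when total load crosses a threshold, while global task scheduling exposes all eight cores and places each task on a big or little core individually.

Code:
#include <cstdio>
#include <vector>

enum Cluster { A53, A57 };

// Cluster migration: the OS only ever sees one four-core cluster; once the
// total load crosses a threshold, everything moves to the other cluster.
Cluster cluster_migration(double total_load)
{
    const double threshold = 0.6;   // purely illustrative value
    return total_load > threshold ? A57 : A53;
}

// Global task scheduling (HMP): all eight cores are visible and each task is
// placed on a big or little core on its own merits.
std::vector<Cluster> global_task_scheduling(const std::vector<double>& task_loads)
{
    std::vector<Cluster> placement;
    for (double load : task_loads)
        placement.push_back(load > 0.6 ? A57 : A53);
    return placement;
}

int main()
{
    std::vector<double> tasks = { 0.9, 0.1, 0.05, 0.05 };   // one heavy task, three light ones
    double total_load = 1.1;

    printf("cluster migration: everything runs on the %s cluster\n",
           cluster_migration(total_load) == A57 ? "A57" : "A53");

    std::vector<Cluster> placement = global_task_scheduling(tasks);
    for (size_t i = 0; i < placement.size(); ++i)
        printf("GTS: task %zu -> %s\n", i, placement[i] == A57 ? "A57" : "A53");
    return 0;
}

The toy output shows the usual objection to cluster migration: the light tasks get dragged onto the A57s along with the heavy one, whereas GTS can leave them on the A53s. The penalty the quote refers to is a separate issue, namely the cost of the handover itself.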
 
Cache coherency between the two clusters only means anything if they're both running at the same time. Which should mean that they're capable of HMP, even if the normal use model is to only keep one on to let it wind down for a while during migration. Or maybe only the cache is kept on.

An advantage to having only one cluster on at a time could be that it can mux the same voltage and clock domains between both clusters, assuming the former has enough dynamic range. But this again doesn't make sense if they're cache coherent.

Maybe the cache coherency really refers to communication between the CPU clusters and the GPU, which is the only reason nVidia would have had to develop it for last generation.
 
According to Damien from HardWare.fr, Tegra X1 does not support HMP, but features a simpler and more energy-efficient interconnect.

So nVidia says that Tegra X1 is using Cortex-A57 + A53 instead of Denver because this was faster and simpler to implement in 20nm. But Anandtech says that they have a completely custom physical implementation. In that case, how would "hardening" A57 and A53 - two CPUs they've never used before, especially the latter - be faster or simpler than using Denver, which they already have a custom implementation of and which would require a more straightforward shrink? The only way this makes sense is if A57 was ready long before Denver, but the release timescales of devices using each respective CPU make this seem unlikely. So I'm skeptical that both of these claims are completely accurate.

I had similar thoughts when I read NVIDIA's explanation for using standard ARM cores. Combined with Nebu's comments about Denver, I suspect the latter just wasn't good enough.

That said, I think that NVIDIA is finally on to something. They seem to have a good implementation of a 4+4 A57/A53 setup (i.e. pretty much the best available IP at the moment) plus a big Maxwell GPU, which we know to be very efficient, plus support for fast FP16. There's no reason for TX1 to have power-efficiency issues anymore.

The problem is that I'm not sure there are many use cases where TX1 would be all that preferable to, say, a Snapdragon 810 with integrated LTE, or a cheaper Mediatek design.
 
I'll add my name to the list of people looking forward to Anandtech's Nexus 9 review and (I expect) Denver dive.

From what I've seen so far, it's indisputable that Denver does very, very well on quite a few benchmarks. Maybe it also does very poorly on enough benchmarks I don't know about - not counting useless tiny ones like Sunspider that don't matter (or at least shouldn't matter, but get thrown around anyway).

But it does seem to also come at an area premium, which is direct dollars lost for nVidia if they want to match on core count, and so far they don't want to even come close. And who knows what perf/W is really like, outside of nVidia's claims anyway.
 
Cache coherency between the two clusters only means anything if they're both running at the same time. Which should mean that they're capable of HMP, even if the normal use model is to only keep one on to let it wind down for a while during migration. Or maybe only the cache is kept on.
Cache coherency, per the response we got when we questioned Nvidia, means that a cluster migration is done without any DRAM intervention. Again, I fail to see how this could be more efficient than just migrating via ARM's CCI, even if it's just limited to cluster migration. I really think most of their power-efficiency claims just come from the process advantage and probably better physical libraries compared to the 5433 (I have an article on that one with very extensive power measurements... the 5433 is a comparatively bad A57 implementation compared to what Samsung has now achieved with the A15s on the 5430; many would be surprised).

I think their interconnect just can't do HMP and this is PR spinning.

An advantage to having only one cluster on at a time could be that it can mux the same voltage and clock domains between both clusters, assuming the former has enough dynamic range. But this again doesn't make sense if they're cache coherent.
That's very far-fetched.
 
Cache coherency, per the response we got when we questioned Nvidia, means that a cluster migration is done without any DRAM intervention. Again, I fail to see how this could be more efficient than just migrating via ARM's CCI, even if it's just limited to cluster migration. I really think most of their power-efficiency claims just come from the process advantage and probably better physical libraries compared to the 5433 (I have an article on that one with very extensive power measurements... the 5433 is a comparatively bad A57 implementation compared to what Samsung has now achieved with the A15s on the 5430; many would be surprised).

I think their interconnect just can't do HMP and this is PR spinning.

But then why wouldn't they just use ARM's CCI?
 
But then why wouldn't they just use ARM's CCI?
¯\_(ツ)_/¯

Maybe they wanted to continue using what they had instead of switching over to a new IP. There's no real use for ARM's CCI in the context of Denver, and keeping vanilla and custom cores on the same interconnect would be less effort, I imagine. Both TK1 variants use the same SoC architecture, for example.
 
¯\_(ツ)_/¯

Maybe they wanted to continue using what they had instead of switching over to a new IP. There's no real use for ARM's CCI in the context of Denver, and keeping vanilla and custom cores on the same interconnect would be less effort, I imagine. Both TK1 variants use the same SoC architecture, for example.

This seems plausible to me. If the system were laid out early enough, ARM's interface might not have been ready.
If Nvidia's plans were as aggressive as some suggest, it's possible that they were designing and implementing earlier.
Once that happens, switching interconnects mostly just adds risk and delay if the one they had already laid down was compatible, and it would be prudent to design a compatible interface anyway, just in case core designs need to swap. CCI, on the other hand, doesn't need to care about swapping in someone else's cores.

If the allegations of past non-ARM emulation were accurate, then using CCI wouldn't be useful either.
 
ARM's way is HMP, not old style cluster migration as on the 5410/5420. I really doubt Nvidia's claims on any kind of benefit of their own CM.
What is HMP? Running both A57 and A53 cores simultaneously? If so, I doubt ARM's claims on any kind of benefit to HMP. If not, enlighten me. =)
 
That's very far-fetched.

IF the clusters couldn't run simultaneously (and I'm not saying they can't, so this is rhetorical), why would it be far-fetched for them to share a VRM? There are already CPU cores with very broad voltage ranges to support sleep modes with low-power retention. Denver itself had such a mode. I doubt it used a different VRM for this purpose. If there's enough dynamic range to support nothing but keeping retention in SRAMs, then there's enough dynamic range to support everything from the low clock on the Cortex-A53s through the high clock on the Cortex-A57s. So why have multiple VRMs?

It goes without saying that the same PLL/DLL could support both as well.
 
What is HMP? Running both A57 and A53 cores simultaneously? If so, I doubt ARM's claims on any kind of benefit to HMP. If not, enlighten me. =)
Let me twist that one: if there aren't any benefits, why not stick with a 4+1 config in the first place?
 