NVIDIA Tegra Architecture

Some guys at Ars noticed that Nvidia has filed multiple patents in the past few years related to "runahead" operation of a microprocessor. It sounds like voodoo, but it's probably a good fit for a wide in-order processor like Denver.

In the event of a cache miss, the processor doesn't stall but keeps executing instructions. The results of those "future" instructions are thrown away once the original miss is resolved, but the benefit is that the cache is now warmed up for when those instructions are executed for real.
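
To make the idea concrete, here's a toy model of that behavior in C. The cache, the trace, and the replay policy are all made up for illustration; nothing here is Denver's actual mechanism.

[code]
/* Toy model of runahead (names and policy are mine, not Nvidia's):
 * on a miss, keep "executing" just to touch the lines that later
 * instructions would need, throw the results away, then continue
 * from the miss point against a warm cache. */
#include <stdbool.h>
#include <stdio.h>

#define LINES 16
static bool cache[LINES];                 /* line-present bits */

static bool access_line(int line)         /* returns true on a hit */
{
    bool hit = cache[line % LINES];
    cache[line % LINES] = true;           /* a miss fills the line */
    return hit;
}

int main(void)
{
    int trace[] = {0, 4, 8, 12, 4, 8};    /* lines touched by the program */
    int n = sizeof trace / sizeof *trace;

    for (int pc = 0; pc < n; pc++) {
        if (!access_line(trace[pc])) {
            /* Miss: instead of stalling, run ahead and prefetch what
             * the following instructions would touch. */
            for (int ra = pc + 1; ra < n; ra++)
                access_line(trace[ra]);
            printf("miss at %d: ran ahead, cache is now warm\n", pc);
        }
    }
    return 0;
}
[/code]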

Related paper => users.ece.cmu.edu/~omutlu/pub/mutlu_hpca03.pdf

Patent => http://www.google.com/patents/US20140189313

In some microprocessors, a stall may trigger entrance into a runahead mode of operation configured to detect other potential stalls. In other words, a microprocessor may detect a long-latency event that could cause the microprocessor to stall. While attempting to resolve that long-latency event (e.g., the runahead-triggering event), the microprocessor may speculatively execute additional instructions to attempt to uncover other possible stalls. By uncovering other possible stalls, the microprocessor may begin resolving the long-latency events underlying those possible stalls while resolving the runahead-triggering event, potentially saving time.

Once a runahead-triggering event is detected, the state of the microprocessor (e.g., the register values and other suitable states) may be checkpointed so that the microprocessor may return to a pre-runahead state once the runahead-triggering event has been resolved and runahead operation ends. Checkpointing saves the current state of the microprocessor, allowing that state to be restored later. Checkpointing may include, for example, copying the contents of registers to duplicate registers. During runahead operation, the microprocessor executes in a working state, but does not commit the results of instructions to avoid altering the microprocessor's state.
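
A minimal sketch of that checkpoint/restore dance, with the register file as a plain struct; the layout and register count are illustrative, not from the patent:

[code]
/* Checkpoint on entry to runahead, restore on exit, per the patent's
 * "copy the contents of registers to duplicate registers". */
#include <stdio.h>

#define NREGS 31                          /* ARMv8 general-purpose registers */

struct regfile { long r[NREGS]; };

static struct regfile live;               /* working state */
static struct regfile shadow;             /* checkpoint copy */

static void enter_runahead(void) { shadow = live; }  /* duplicate registers */
static void exit_runahead(void)  { live = shadow; }  /* discard speculation */

int main(void)
{
    live.r[0] = 1;
    enter_runahead();
    live.r[0] = 999;                      /* speculative result, never committed */
    exit_runahead();
    printf("r0 = %ld\n", live.r[0]);      /* prints 1: pre-runahead state */
    return 0;
}
[/code]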
 
The commenters on the Tech Report article seem to be really pessimistic about the code translation.
The skepticism of those with some familiarity with past attempts at wonder-translators coupled with simple hardware, reminiscent of Transmeta's efforts, isn't without justification. It would be a pleasant surprise to see something like this take off, but the disclosures thus far are not very in-depth at all.

On that note, what's with AMD, Nvidia, and ARM disclosures coming out as branded Tirias papers?
I'm not familiar with this particular site, and a big chunk of what they advertise is marketing services. It seems odd to me, and the public stuff doesn't seem that in-depth.

However, most of them don't seem to understand that all modern CPUs perform some kind of code translation (especially x86 CPUs, since x86 instructions are variable-length and thus inefficient to execute directly). Denver translates the code once.
I agree on the description of standard decode and uop sequencing, but I'm not certain about Denver.
There's already an unknown threshold for dynamic analysis and reoptimization, and no significant details on the internal format or method of dispersal for code fetched from the translation cache.
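
For what it's worth, here's the translate-once idea as a toy direct-mapped translation cache. The lookup/translate split is the point; the slot count, indexing, and stub "translation" are all invented, and per the disclosures a real Denver apparently falls back to its ARM hardware decoder on a miss rather than translating immediately:

[code]
/* Toy translation cache: decode/optimize a region once, reuse forever. */
#include <stdio.h>

#define TCACHE_SLOTS 64

struct trace { unsigned long pc; int valid; /* ... optimized uops ... */ };
static struct trace tcache[TCACHE_SLOTS];

static struct trace *lookup(unsigned long pc)
{
    struct trace *t = &tcache[pc % TCACHE_SLOTS];
    return (t->valid && t->pc == pc) ? t : NULL;
}

static struct trace *translate(unsigned long pc)
{
    struct trace *t = &tcache[pc % TCACHE_SLOTS];
    t->pc = pc;
    t->valid = 1;                         /* renaming/reordering happens once, here */
    return t;
}

int main(void)
{
    unsigned long hot_loop = 0x1000;
    for (int i = 0; i < 3; i++) {
        if (lookup(hot_loop))
            printf("iteration %d: reusing the cached trace\n", i);
        else {
            printf("iteration %d: decoding and translating %#lx\n", i, hot_loop);
            translate(hot_loop);
        }
    }
    return 0;
}
[/code]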

Denver software does the register renaming and reordering once (likely adjusting the result slightly all the time based on CPU feedback), while a traditional CPU does it again and again for the same code.
This can very likely cause the dynamic behavior of execution traces that persist in one form for thousands of iterations to stray from the ideal on a per-iteration basis.
The cost of this is likely to rely on a number of architectural features and real-world system behaviors not discussed.

The allocation of register IDs in loop unrolling makes it sound like the uops expose to the translation software a full set of registers larger than what is visible in ARMv8. I'd be curious how this is handled in terms of where registers used by instructions in the standard path are stored, and whether the IDs are static, added to a base register, or subject to a stack engine of the sort some architectures like Itanium put in place with uncertain levels of success.
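
The software analogue would be something like manual unrolling with one accumulator per copy of the loop body, each standing in for a distinct physical register so the copies don't serialize on a single architectural name. Toy C, assuming my reading of the disclosure is right:

[code]
/* Unrolling against a larger register set: s0..s3 play the role of
 * four distinct physical registers allocated by the translator. */
#include <stddef.h>

double dot(const double *a, const double *b, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;    /* four "physical registers" */
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)                        /* remainder */
        s += a[i] * b[i];
    return s;
}
[/code]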

Denver has a bigger instruction cache, perhaps to counter code density concerns, and maybe to handle the fact that there are now two dueling instruction streams for the same program hitting the I-cache.
The balance in optimization versus holding onto a somewhat less ideal trace is an interesting one, and knowing when to eat the cost of reoptimization from a power and performance standpoint would be a challenge.

I think a mild OoO engine could help in some ways, such as in the case of loop unrolling and its instruction cache footprint. An optimized subroutine for an OoO Denver might be significantly more compact than the one for an in-order wide core.
Another potential concern is how this might influence later architectural design. The approach allows for a simpler core, but at the same time it can serve as a bound on complexity even when more complexity is warranted. Wider designs and more complex unit behaviors might lead to more dynamic variance in traces, which mild reordering could compensate for.

Denver tackles the right problem: code that runs frequently, again and again.
Reentrant code is a particular case the Pentium 4's trace cache was optimized for, and one for which it provided significant benefits.
The glass jaws for that approach were significant, and while Nvidia's scheme should have more margin because of its software control and cache size, there are probably corner cases that have not been discussed.
Being software and being held in memory, the corner cases may be fewer but more expensive depending on architectural decisions not discussed.
(A lot of things aren't discussed. :cry: )

I've gone back to read some of the following to see if there are similarities:
http://www.realworldtech.com/crusoe-intro/
http://www.realworldtech.com/crusoe-exposed/

The presence of a standard ARM decoder indicates that Denver should be more robust internally than Crusoe, which at a low level was incapable of executing code outside of the translated region and lacked significant protections if anything delved below the level of the CMS.



Random questions, and a rose-tinted remembrance of days past when architectures were actually disclosed, to follow:


I have a lot of questions, although I worry that the one thing Denver could inherit is the lack of transparency that Transmeta employed, which is all too common for a lot of mobile products as well.

Are the 7-wide superscalar issue width and the 7 execution units more than coincidence in how the optimized internal ISA is mapped to the hardware?

These days, the emphasis on security has hopefully plugged a likely security hole through digital signing of the loaded CMS image; that exposure is one drawback of the software method versus a hardware core. I'm curious if there are possible exploits and what sorts of defenses Nvidia has created to prevent the compromise of its software engine or the integrity of the encrypted code store.

The handling of exceptions is another area of interest to me. A hardware core with speculative state rolls all its optimizations and precise exception state into one package (for integer work, at least), which is not the case for CMS or potentially Denver. Transmeta had a forced step-by-step mode in the case of an exception, while Denver may be able to flip over to the standard ARM front end and set the IP to the last known committed value before trying again.

The memory pipeline needs to be significantly better, actually closer to an OoO core's pipeline than most in-order designs'. For example, Itanium's in-order cores had software-driven load/store speculation as well as a very expansive and out-of-order memory pipeline. Denver would likely need such a memory pipeline if the cache subsystem is going to handle OoO-like traffic patterns.
The fact that the chip must also be an ARMv8 design does mean that at least that part of the chip needs to behave in a certain manner; does that leave room for the more exotic software speculation measures?
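
For reference, the Itanium-style software load speculation I mean looks roughly like this behaviorally, with a one-entry stand-in for the ALAT. Nothing here is claimed about Denver's actual memory pipeline:

[code]
/* Behavioral sketch of advanced loads (ld.a / chk.a): hoist the load
 * above a possibly-aliasing store, track its address, and redo the
 * load if an intervening store clobbered it. */
#include <stdbool.h>

static const int *alat_addr;              /* address tracked by the "ALAT" */

static int adv_load(const int *p)         /* ld.a: load early, record addr */
{
    alat_addr = p;
    return *p;
}

static void checked_store(int *p, int v)  /* a store invalidates a match */
{
    if (p == alat_addr)
        alat_addr = 0;
    *p = v;
}

static bool still_valid(const int *p)     /* chk.a: was the load clobbered? */
{
    return alat_addr == p;
}

int hoisted(int *a, int *b)
{
    int v = adv_load(b);                  /* load hoisted above the store */
    checked_store(a, 42);                 /* may or may not alias b */
    if (!still_valid(b))
        v = *b;                           /* recovery: redo the load */
    return v;
}

int main(void)
{
    int x = 7;
    return hoisted(&x, &x) == 42 ? 0 : 1; /* aliasing case: redo sees 42 */
}
[/code]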

There are some additional resources compared to Efficeon and especially Crusoe. Perhaps one key element is the presence of a second core that can be leaned on for running the translation software, although that actually brings up questions of how such a design handles higher core counts.

What's the throughput of the ARM decoder?
What are the policies of the translated code cache (length of an entry, privileged instructions, self-modifying code, etc.)?
What are the policies for allocating optimized traces, and can the same set of ARM instructions wind up in more than one microcoded routine? Can the CPU directly branch from within one of these routines to another routine in the optimization cache?
Can one routine with its set of renamed registers be called while a different routine's renamed registers are in-flight?
CMS had translations dependent on the privilege level of the process at the time its code was being translated; what does Denver do?

Etc.
 
the benefit is that the cache is now warmed up for when those instructions are executed for real.


Yes, the cache will very likely be filled with the data needed by the instructions after the one that just stalled. Very clever indeed.

I guess you might call this some sort of prefetching?
 
I wonder if DCO or sleep states are entered during cache stalls.

Given that it's able to enter the CC4 power saving state in ~150 us (compared to 10s of ms for similar states on other CPUs) it's very likely that it could do that during a major cache stall.

CC4 has the advantage of not flushing the caches or the register files.
 
Given that it's able to enter the CC4 power saving state in ~150 us (compared to 10s of ms for similar states on other CPUs) it's very likely that it could do that during a major cache stall.

Unlikely. 150 us is 375,000 cycles at 2.5 GHz.

Cheers
 
Given that it's able to enter the CC4 power saving state in ~150 us (compared to 10s of ms for similar states on other CPUs) it's very likely that it could do that during a major cache stall.

CC4 has the advantage of not flushing the caches or the register files.
CC4 is not meant to be a low-latency idle state; it's supposed to be a state much like ARM's AFTR without the limitations.
 
On that note, what's with AMD, Nvidia, and ARM disclosures coming out as branded Tirias papers?

I'm not familiar with this particular site, and a big chunk of what they advertise is marketing services. It seems odd to me, and the public stuff doesn't seem that in-depth.


I wondered that too, and decided it might just be a random person attending Hot Chips and throwing up hastily written summaries of the talks in order to advertise a PR/consulting business. I doubt NVIDIA paid for that white paper.
 
So it uses a real ARM HW decoder when it can't find a piece of optimized code in its cache? I don't think Transmeta had such a thing, and if it's reasonably fast, it could be quite a big help for perf/W in that it avoids optimizing code that gets executed only once.
 
Transmeta's solution was incapable of running non-VLIW instructions, and even beyond that was incapable of executing anything that was not in the code morphing cache.

My next question is what beyond the ARM decoder block is conventional in that core, or if the architecture is more exotic behind it.
The front end part sounds like it can at least fetch from outside the translation cache, which is a departure from Transmeta already.
 

Another question is what's the max achievable ILP for instructions coming through the hardware decoder. Given that ARM is already RISC, extracting ILP might require some sort of buffering in front of the scheduler.
 
On that note, what's with AMD, Nvidia, and ARM disclosures coming out as branded Tirias papers?
I'm not familiar with this particular site, and a big chunk of what they advertise is marketing services. It seems odd to me, and the public stuff doesn't seem that in-depth.
I noticed that too. They have 5 papers available: 2 AMD, 2 nVidia, and 1 ARM. The AMD and ARM papers explicitly state they are sponsored by AMD or ARM in the headers, but the nVidia papers do not. Should we take that to mean the 2 nVidia papers are their first independent work or that they are sponsored by nVidia but not disclosed?

http://www.tiriasresearch.com/research/
The following is a list of recent research from the staff at TIRIAS Research. There are reports for sale and free sponsored reports and white papers.
Their overall description of the papers seems to indicate all the free content is sponsored.
 
Another question is what's the max achievable ILP for instructions coming through the hardware decoder. Given that ARM is already RISC, extracting ILP might require some sort of buffering in front of the scheduler.

The standard ARM path may also have an influence on how quickly it can process exceptions. If an exception was raised for Transmeta's architecture, it would roll back and emulate one instruction at a time.
If an exception occurs in Nvidia's optimized code, a uop derived from an ARM instruction that was hoisted or replaced may not have a direct relationship to where it would be in the standard instruction stream.
Nvidia could roll back to the last known good point and then have the standard decoder run through until it hits the exception for real. The time it takes to handle it would be based on how performant the non-optimized execution path is.
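
Roughly the recovery loop I'm imagining, with longjmp standing in for the imprecise fault inside an optimized trace and a per-instruction loop standing in for the standard decode path. Entirely speculative:

[code]
/* Rollback-and-replay sketch: fault somewhere in a trace, return to
 * the last committed IP, then step precisely until the exception
 * recurs at an architecturally exact boundary. */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf recover;
static int last_committed_pc = 100;

static void run_optimized_trace(int pc)
{
    (void)pc;                             /* hoisted/fused uops execute... */
    longjmp(recover, 1);                  /* ...and one faults mid-trace */
}

static void step_arm(int pc)              /* precise, one ARM instr at a time */
{
    printf("replaying ARM instruction at pc=%d\n", pc);
}

int main(void)
{
    if (setjmp(recover) == 0) {
        run_optimized_trace(last_committed_pc);
    } else {
        /* roll back to the last committed IP and step until the
         * exception recurs at a precise boundary (3 steps, say) */
        for (int pc = last_committed_pc; pc < last_committed_pc + 3; pc++)
            step_arm(pc);
    }
    return 0;
}
[/code]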
 
No, I don't think it has anything to do with heat, because many people with those issues have used the tablet very lightly. There are four very small seams in each of the four corners of the tablet, and for some people, very small cracks have developed near the seams in one or more corners (typically close to where the stylus is). I recently received my tablet and have no issues with any cracks developing even after extended periods of Trine 2 game play. Anyway, for anyone who does have the issue, NVIDIA will naturally replace their unit.
 
No, I don't think it has anything to do with heat, because many people with those issues have used the tablet very lightly. There are four very small seams in each of the four corners of the tablet, and for some people, very small cracks have developed near the seams in one or more corners (typically close to where the stylus is). I recently received my tablet and have no issues with any cracks developing even after extended periods of Trine 2 game play. Anyway, for anyone who does have the issue, NVIDIA will naturally replace their unit.

I was skeptical that it was heat related as well, but one member said his tablet had two cracks, so he launched a game, put his tablet down on a table and left it there for two hours. When he came back, the two cracks had become bigger and a third one had formed.

My guess would be that it's heat related but doesn't affect all units. Something wrong with the plastic casing on some units, or a faulty heat shield, perhaps.
 