nVidia Denver discussion

The Nexus 9 is due soon, and its processor is an interesting one. It looks like a repurposed core, originally targeted at x86, from an acquisition nVidia made in 2006, and its techniques resemble those used by Transmeta's CPUs.

A blog post:

http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/

Link to white paper:

http://www.tianna1121.com/wp-content/uploads/2014/08/NVIDIA-Project-Denver-White-Paper.pdf

Dynamic Code Optimization

The most distinctive aspect of Denver is its dynamic code optimization. The core microarchitecture
is unusual in that it has an in-order pipeline but uses special software to reorder and
optimize instruction traces. During repetitive code sequences, the Denver CPU collects dynamic
runtime information during execution and passes it to the dynamic code
optimizer, enabling the optimizer to find better ways to execute the code.
The CPU uses hidden time slices to run the optimizer, or it can use the second core to perform
optimizations for the active core.
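
For a rough mental model of the profiling step described above, here's a toy sketch in Python. All the names and the threshold are made up for illustration; Denver's real profiler works on hardware counters and microcode, nothing like this:

```python
# Toy sketch of hot-trace detection: count executions of each trace entry
# point and flag the trace for optimization once it crosses a hotness
# threshold. Purely illustrative; not Denver's documented mechanism.

HOT_THRESHOLD = 50  # executions before a trace is considered "hot" (invented)

class TraceProfiler:
    def __init__(self):
        self.exec_counts = {}   # trace entry address -> execution count
        self.optimized = set()  # addresses already handed to the optimizer

    def on_trace_executed(self, entry_addr):
        """Called each time an instruction trace starting at entry_addr runs."""
        self.exec_counts[entry_addr] = self.exec_counts.get(entry_addr, 0) + 1
        if (self.exec_counts[entry_addr] >= HOT_THRESHOLD
                and entry_addr not in self.optimized):
            self.optimized.add(entry_addr)
            return True  # signal: schedule this trace for optimization
        return False

profiler = TraceProfiler()
signals = [profiler.on_trace_executed(0x4000) for _ in range(60)]
print(signals.count(True))  # 1: the signal fires once, at the 50th run
```

The point of the threshold is that optimization is only worth its cost for code that runs repeatedly, which matches the "repetitive code sequences" wording in the white paper.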

The dynamic optimizer runs in its own private, protected state and is not visible to the
operating system or any user code. The signed and encrypted dynamic optimizer code loads at
boot into a protected part of main memory. By performing the reordering and register renaming
in software, Denver eliminates the power-hungry out-of-order control logic yet can achieve
comparable results.

The profiler gathers information on program flow, such as branch results (taken, not taken,
strongly taken, and strongly not taken), along with other hardware statistics from tables and counters. The
optimizer (Figure 1) recognizes opportunities to improve execution and can then rename
registers, reorder loads and stores, improve control flow, remove redundant code, hoist redundant
computations, perform loop unrolling, and apply other common optimizations. Because the run-time
software performs the optimization, the profiler can look over a much larger instruction window than
is typical of hardware out-of-order (OoO) designs: Denver can optimize over a window of more than 1,000
instructions, while most OoO hardware is limited to a 192-instruction window or smaller.
The dynamic code optimizer continues to evaluate profile data and can perform additional
optimizations on the fly.
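
The four branch states listed above map naturally onto a classic two-bit saturating counter. This is the textbook scheme, sketched here for context, not a description of Denver's actual predictor internals:

```python
# Minimal two-bit saturating counter covering the four branch states
# mentioned above: strongly not taken, not taken, taken, strongly taken.

STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

def update(state, taken):
    """Move the two-bit counter toward the observed outcome, saturating."""
    if taken:
        return min(state + 1, STRONG_T)
    return max(state - 1, STRONG_NT)

def predict_taken(state):
    return state >= WEAK_T

# A loop branch that is taken 9 times, then falls through on exit:
state = WEAK_NT
outcomes = [True] * 9 + [False]
predictions = []
for taken in outcomes:
    predictions.append(predict_taken(state))
    state = update(state, taken)
print(predictions)  # mispredicts only the first taken and the loop exit
```

The "strongly" states are what make this useful for loops: one stray not-taken outcome (the loop exit) only degrades the counter to "taken" rather than flipping the prediction outright.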

Once an ARM code sequence is optimized, the new microinstruction sequence is stored in an
optimization cache in main memory and can be recalled when the ARM code's branch-target
address is recognized in a special on-chip optimization lookup table. A branch-target hit in the
table provides a pointer into the optimization cache for the optimized sequence, which
is then substituted for the original ARM code (Figure 2). The optimization lookup table is a 1K-entry, 4-way
memory holding ARM-branch-to-optimization-target pointer pairs, and it is looked up in
parallel with the instruction cache. The optimization cache is stored in a private 128MB main
memory buffer, a minor set-aside in systems with 4GB (or more) of main memory. The
optimization cache also does not contain any pre-canned code substitutions designed to accelerate
benchmarks or other applications.
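
The lookup flow above can be sketched as a map from ARM branch-target addresses to optimized microinstruction sequences, falling back to the original code on a miss. Here a Python dict stands in for the 1K-entry 4-way hardware table, and all names and "micro-ops" are illustrative:

```python
# Sketch of the optimization-lookup flow: an on-chip table maps an ARM
# branch-target address to a pointer into the optimization cache; on a hit,
# the optimized sequence runs in place of the original ARM code.

class OptimizationCache:
    def __init__(self):
        self.lookup_table = {}  # ARM branch target -> optimized sequence

    def install(self, arm_target, optimized_seq):
        """Store a newly optimized sequence under its ARM entry address."""
        self.lookup_table[arm_target] = optimized_seq

    def fetch(self, arm_target, original_seq):
        # Hit: substitute the optimized microinstruction sequence.
        # Miss: fall back to executing the original ARM code in order.
        return self.lookup_table.get(arm_target, original_seq)

cache = OptimizationCache()
cache.install(0x8000, ["uop_fused_ld", "uop_unrolled_add"])
print(cache.fetch(0x8000, ["ld", "add"]))  # hit: optimized sequence
print(cache.fetch(0x9000, ["ld", "add"]))  # miss: original ARM code
```

The key property, as the white paper describes it, is that the lookup happens in parallel with the instruction-cache fetch, so substituting optimized code costs nothing on the fetch path.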

It sounds like you could also do things like cache-line compression during downtime with a rewrite of the optimizer (provided you have some kind of fast hardware decoder). I wonder what the time frame of the optimization is in CPU cycles, which cases are hardest to optimize, and how poorly those perform. Upcoming benchmarks for this thing should be interesting.

EDIT Early benchmarks:

http://www.phonearena.com/news/Nexu...gra-K1-outperforms-Apple-iPhone-6s-A8_id61825

These look good for single-threaded tests, but the results don't scale as well to multi-threaded tests as those of its competitors; perhaps that's from the overhead of using the non-executing core for code optimization.
 

It depends on which data point you look at. There is one other HTC Volantis Nexus 9 data point ( http://browser.primatelabs.com/geekbench3/1014788 ) that lists a single-core score of 1807 and a multi-core score of 3220, which is very similar multi-core scaling [1.782x] to the enhanced Cyclone's [1.796x]. That said, you could be right about the overhead (although the dynamic code optimization can occur in hidden time slices rather than on the extra core if need be).
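
As a sanity check on the scaling figure quoted above (simple arithmetic, nothing more):

```python
# Multi-core scaling is just multi-core score / single-core score.
single, multi = 1807, 3220  # Geekbench 3 scores from the Volantis data point
scaling = multi / single
print(round(scaling, 3))  # 1.782
```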

FWIW, as noted on the forum, these are AArch32 results (compared to AArch64 results for iPhone 6/6+). The floating point numbers for Nexus 9 in particular should receive a nice boost with AArch64.
 
Exciting scores, given their (I presume) domination on the graphics side as well. Can't wait to see what a less power-constrained version will be able to do on a dGPU, and more importantly what workloads it will be running in that space.
 
An interesting comparison pulled from realworldtech of a pair of benchmarked Denver systems:

http://browser.primatelabs.com/geekbench3/compare/1014788?baseline=1014854

Looks like its optimization creates substantial fluctuation in performance:

It looks very strange. Denver's Geekbench 3 multi-core score is more than twice its single-core score, and the performance fluctuates dramatically.
I guess some optimizations, such as profiling, are taking effect... Denver looks like a dedicated benchmarking processor :) real-world application performance is another story.

http://browser.primatelabs.com/geekbench3/compare/1014788?baseline=1014854

Benchmark                  System 1   System 2
AES                            1343       3097
AES Multicore                  6357       2391
Twofish                        3200       2550
Twofish Multicore              6374       6371
...
JPEG Compress                  1612       2145
JPEG Compress Multicore        3930       4304
PNG Compress                   1686       2243
PNG Compress Multicore         4254       4498
 
There are only three data points available so far for Nexus 9, so hard to say exactly what the numbers would end up being and how much fluctuation there will be (and we still don't have AArch64 data yet for Denver).

I have seen dual-core Intel processors that have multi-core scores that are more than 2x higher than single core: http://browser.primatelabs.com/geekbench3/922000
 

A difference of that size is probably due to noise unless the algorithm is very bizarre...
 
It really sounds too good to be true, but why not? (It is an extravagant claim to have the highest single-thread IPC in the world, isn't it?)

I wonder about licensing: is the ISA licensed as software or as hardware? It would be fun to run semi-modern x86 and pretend it's "emulation" or "virtual", so it's fine. Or perhaps the product doesn't include x86-on-VISC microcode/firmware, but the user can supply it unofficially.
Also fun would be to run the main OS on ARMv8 or MIPS or something else, and have Wine or QEMU use a virtual x86 core. Might be complicated..

Without resorting to speculation, unproven hardware, and magical things: what about transparently running Wine and x86 on a Denver Tegra? (There's already free-software tech for Wine + CPU emulation; I'm not sure if it's easy to run or used by many people.)
If the performance is remotely good, that would be enough to run ancient games and undemanding apps, without virtual-core astrophysics.
Come to think of it, I would need a good, graphical, easy front-end that works for both DOSBox and Wine games and is cross-platform (even Windows, lol, simply launching stuff instead of using Wine). A comprehensive solution at last to run legacy games and software (with an internet database of "metadata", i.e. pre-hashed configurations for programs and games so they just work, and why not artwork, title, company, and year, though that's less critical).
 
There seems to be a resurrection of Transmeta-style, emulation-based CPUs:

http://www.softmachines.com/

I think it's an interesting trend as performance per watt matters more than ever.

It likely wouldn't be the first resurrection, if we consider a prior startup: Montalvo Systems. There were indications of a core-virtualization component and probably a code-translation component.
Every now and again we see a CPU architecture startup, but then the reality that this isn't a cheap business to get into comes into play.
This one seems to be numerically better funded than Montalvo. Maybe Soft Machines is trying to license the architecture? I'm fuzzy on those details.

There are elements of the concept that I think are going to be found more often in the future, but there's plenty of historical precedent for cotton-candy performance estimates and carefully crafted benchmarking.
There's heavy use of buzzwords in their presentation. There are canards concerning OoO and silicon scaling that are generally true, if lacking full context, but they have been rote for so long that it's hard to tell whether the philosophy is derived from an awareness of the present's particular challenges or was fossilized in the previous decade, when those challenges became obvious.

Not knowing their methodology, I also worry that their low clock speed is inflating the apparent effectiveness of the scheme, on top of heavy reliance on what looks to be a media-converter benchmark.
A lot of architectures can get higher IPC when they're clocked roughly the same as their DRAM; similarly, going that low can yield dividends in power consumption as well.
That doesn't rule out the design's utility in the spheres where this does work, but that's not the performance, workload, and power regime its PR is targeting.
Not knowing the details, I don't know if I can trust the various portions where they asterisk poor outcomes with a "we're totally holding back, honest".
 
I'm generally pleased that the iPhone liberated the world of CPU design, and because of ARM licensing, we're now seeing very interesting and diverse CPUs from many companies: Qualcomm, Apple, nVidia just to name a few.

Of course, the downside is that many owners of these new CPU dynasties (Apple most notably) are substantially less transparent than Intel was about divulging technical details, as the focus is now much more on the end product, usually a smartphone or a tablet, than on making any kind of margin from the CPU itself.
 
Since when did Qualcomm and the others start working for free? They have some ties with phone makers, but their bread and butter is still selling SoCs.

BTW, how exactly did Apple liberate CPU design?
 
I doubt Qualcomm's margins are nearly as nice as Intel's were in the '90s and early 2000s, and the CPU itself is a small part of the SoCs they sell for a profit; they also bring to the table a modem, DSP, and GPU in addition to the CPU on that die.

Apple started things off by popularizing ARM as a hardware platform, which doesn't require the same licensing terms as x86. Intel is an extremely technically competent corporation, but throughout their history they were also very intentionally anti-competitive in ways that had nothing to do with superior engineering.

There was also a massive change in how chips are made: they are now pretty much all manufactured at third-party fabs, partly due to the extreme costs of owning fabs and partly due to the nature of ARM licensing.
 
Lots of devices used ARM-based CPUs before Apple did. Apple didn't popularize or liberate ARM.
Apple was critical to the history of ARM although probably not in the way Raqia was thinking of:
http://en.wikipedia.org/wiki/ARM_Holdings said:
The company was founded in November 1990 as Advanced RISC Machines Ltd and structured as a joint venture between Acorn Computers, Apple Computer (now Apple Inc.) and VLSI Technology.[22][23][24] The new company intended to further the development of the Acorn RISC Machine processor, which was originally used in the Acorn Archimedes and had been selected by Apple for their Newton project.
Given the failure of the Newton, though, I would argue that after 1993 it was Texas Instruments which played the biggest role in popularising ARM. Also note the irony that Intel's XScale helped popularise ARM to the detriment of MIPS in the early 2000s ;)

For my view of the relevant history, see http://www.beyond3d.com/content/articles/111/

One day hopefully I'll get to update that article not just with the latest ARM stuff but also more exotic architectures like Denver and Soft Machines - the latter is especially interesting, although I'll believe their performance claims when I can see it with my own eyes...
 
The last five posts have either Apple or Qualcomm as the main subject.

This thread is supposed to be about nVidia's Denver. :cry:
Agreed, although there's not much more to be said until NVIDIA reveals more (unlikely) or devices are publicly available (soon! :D)

Soft Machines is the closest thing to Denver, although still very different. It's interesting to ponder in the context of what crazy things could be done with Transmeta-like dynamic recompilation in general. It might be worth opening a new thread for Soft Machines, but I fear there's not enough info to keep it alive either...
 
My point was about the diversity of CPU designs in production, especially architectural features aimed at higher performance, not ARM shipment volume. (Point taken about Texas Instruments, though.) We might not have seen as much design diversity without Apple, since they introduced the demand for higher-performance ARM chips (ARM being relatively open compared to x86) coupled with lots of RAM and powerful GPUs.

I don't believe there were out-of-order designs or sophisticated cache hierarchies for ARM prior to the performance demands of touch devices, nor could we have expected a software-recompilation design to ship more than a handful of units or offer as much performance as Denver.
 
The abundance of ARM-compatible designs and vendors now somehow reminds me of the situation in the mid-'90s with x86. It was a much more lively market back then, with half a dozen IHVs going around with interesting architectural concepts and a sense of healthy competition.
 
Are there any (many?) Android games etc that use this performance, or is having the fastest hardware on that platform just a theoretical exercise?
 