The Nexus 9 is due soon, and its processor is an interesting one. It looks like a repurposed, originally x86-targeted core from an acquisition NVIDIA made in 2006, and it resembles the techniques used by Transmeta's CPUs.
A blog post:
http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/
Link to white paper:
http://www.tianna1121.com/wp-content/uploads/2014/08/NVIDIA-Project-Denver-White-Paper.pdf
Dynamic Code Optimization
The most unique aspect of Denver is the dynamic code optimization. The core microarchitecture of the CPU is unique in that it has an in-order pipeline but uses special software to reorder and optimize instruction traces. During repetitive code sequences, the Denver CPU collects dynamic runtime information during code execution and passes it to the dynamic code optimizer, enabling the optimizer to assess more optimized ways for the code to be executed. The CPU uses hidden time slices to run the optimizer, or it can use the second core to perform optimizations for the active core.
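The profile-then-optimize loop described above can be sketched as a toy hot-trace detector. This is a hypothetical illustration, not NVIDIA's implementation; the class names, threshold, and trace format are all invented:

```python
# Toy sketch of profile-guided hot-trace detection: count how often
# each branch target is reached, and hand traces that cross a hotness
# threshold to a (stubbed-out) optimizer. All names and the threshold
# value are illustrative, not from the white paper.
from collections import Counter

HOT_THRESHOLD = 50  # arbitrary illustrative value


class TraceProfiler:
    def __init__(self):
        self.exec_counts = Counter()
        self.optimized = {}  # branch target -> optimized trace

    def on_branch(self, target, trace):
        """Called each time execution reaches a branch target."""
        if target in self.optimized:
            return self.optimized[target]  # run the optimized version
        self.exec_counts[target] += 1
        if self.exec_counts[target] >= HOT_THRESHOLD:
            # In Denver this work would happen in hidden time slices
            # or on the second core, not inline like this.
            self.optimized[target] = self.optimize(trace)
        return trace

    def optimize(self, trace):
        return tuple(trace)  # placeholder for real optimization passes


profiler = TraceProfiler()
for _ in range(60):
    profiler.on_branch(0x1000, ["ldr", "add", "str"])
```

After 60 visits the target has crossed the threshold, so subsequent calls return the cached optimized trace instead of profiling further.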
The dynamic optimizer runs in its own private and protected state and is not visible to the operating system or any user code. The signed and encrypted dynamic optimizer code loads at boot into a protected part of main memory. By performing the reordering and register renaming in software, Denver eliminates the power-hungry out-of-order control logic and yet can achieve comparable results.
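Register renaming in software, as mentioned above, amounts to giving every new write a fresh physical register so false dependences disappear. A minimal sketch, with an invented (op, dest, sources) instruction format that is not Denver's actual encoding:

```python
# Toy sketch of software register renaming: rewrite a sequence so
# each write gets a fresh physical register, removing write-after-write
# and write-after-read hazards. Instruction format is invented.
def rename(instrs, n_arch_regs=4):
    # Initial identity mapping for architectural registers r0..rN.
    mapping = {f"r{i}": f"r{i}" for i in range(n_arch_regs)}
    fresh = iter(f"p{i}" for i in range(1000))  # physical register pool
    out = []
    for op, dest, *srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]  # read current names
        mapping[dest] = next(fresh)               # fresh name per write
        out.append((op, mapping[dest], *srcs))
    return out


prog = [
    ("add", "r1", "r2", "r3"),
    ("mul", "r1", "r1", "r2"),  # write-after-write on r1 with the add
]
renamed = rename(prog)
# The two writes now target distinct physical registers (p0 and p1),
# so the hazard between them is gone and they can be reordered.
```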
The profiler gathers information on program flow, such as branch results (taken, not taken, strongly taken, and strongly not taken), along with other hardware statistics tables and counters. The optimizer (Figure 1) recognizes opportunities to improve execution and can then rename registers, reorder loads and stores, improve control flow, remove redundant code, hoist redundant computations, perform loop unrolling, and apply other common optimizations. Because the optimization is performed by run-time software, the profiler can look over a much larger instruction window than is typically found in hardware out-of-order (OoO) designs. Denver can optimize over a 1,000-instruction window, while most OoO hardware is limited to a 192-instruction window or smaller. The dynamic code optimizer continues to evaluate profile data and can perform additional optimizations on the fly.
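Loop unrolling, one of the optimizations listed above, can be shown on a toy text-based IR. This is a simplified sketch under invented assumptions (known trip count divisible by the unroll factor), not how Denver's optimizer actually represents code:

```python
# Toy sketch of loop unrolling: replicate the loop body and fold the
# induction variable, so fewer backward branches are executed. The
# mini-IR (instruction strings with an {i} placeholder) is invented.
def unroll(body, trip_count, factor):
    """body: list of instruction strings containing an {i} placeholder."""
    assert trip_count % factor == 0  # simplifying assumption
    unrolled = []
    for i in range(0, trip_count, factor):
        for k in range(factor):
            unrolled.extend(ins.format(i=i + k) for ins in body)
        # One backward branch per unrolled group instead of per iteration.
        unrolled.append(f"cmp i, {i + factor}; branch loop_top")
    return unrolled


loop = ["load a[{i}]", "add sum, a[{i}]"]
code = unroll(loop, trip_count=8, factor=4)
# 8 iterations now execute only 2 backward branches instead of 8.
```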
Once the ARM code sequence is optimized, the new microinstruction sequence is stored in an optimization cache in main memory and can be recalled when the ARM code branch target address is recognized in a special on-chip optimization lookup table. A branch target hit in the table provides a pointer into the optimization cache for the optimized sequence, and that sequence is substituted for the original ARM code (Figure 2). This optimization lookup table is a 1K 4-way memory which holds ARM-branch-to-optimization-target pointer pairs, and it is looked up in parallel with the instruction cache. The optimization cache is stored in a private 128 MB main memory buffer, a minor set-aside in systems with 4 GB (or more) of main memory. The optimization cache also does not contain any pre-canned code substitutions designed to accelerate benchmarks or other applications.
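The lookup table described above behaves like a small set-associative cache keyed by branch target address. A minimal sketch, assuming an invented set count, index function, and LRU eviction policy (the white paper specifies only "1K 4-way"):

```python
# Toy sketch of the on-chip optimization lookup table: a 4-way
# set-associative table mapping an ARM branch target address to a
# pointer into the optimization cache. Set count, index function,
# and eviction policy are illustrative assumptions.
N_SETS = 256   # 256 sets x 4 ways = 1K entries
N_WAYS = 4


class OptLookupTable:
    def __init__(self):
        # Each set is an ordered list of (branch_target, opt_ptr) pairs,
        # most recently inserted last.
        self.sets = [[] for _ in range(N_SETS)]

    def _set_for(self, addr):
        return self.sets[(addr >> 2) % N_SETS]  # word-aligned index

    def insert(self, branch_target, opt_ptr):
        s = self._set_for(branch_target)
        s[:] = [e for e in s if e[0] != branch_target]  # drop stale entry
        if len(s) >= N_WAYS:
            s.pop(0)            # evict the oldest entry in the set
        s.append((branch_target, opt_ptr))

    def lookup(self, branch_target):
        s = self._set_for(branch_target)
        for tgt, ptr in s:
            if tgt == branch_target:
                return ptr      # hit: redirect into the optimization cache
        return None             # miss: fall through to the ARM code


table = OptLookupTable()
table.insert(0x8000, 0xDEAD0000)
```

On a hit the pipeline would fetch the optimized microcode sequence at the returned pointer; on a miss it simply executes the original ARM code, which is what makes the scheme transparent.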
It sounds like you could also do things like cache-line compression during downtime (provided you have some kind of fast hardware decoder) with a rewrite of the optimizer. I wonder what the time frame of the optimization is in CPU cycles, what the hardest-to-optimize cases are, and how poorly they perform. Upcoming benchmarks for this thing should be interesting.
EDIT: Early benchmarks:
http://www.phonearena.com/news/Nexu...gra-K1-outperforms-Apple-iPhone-6s-A8_id61825
These look good in single-threaded tests, but the results don't scale to multi-threaded tests as well as its competitors'; perhaps that's from the overhead of using the non-executing core for code optimization.