The commenters on the Tech Report article seem really pessimistic about the code translation.
The skepticism of those with some familiarity with attempts at wonder-translators coupled with simple hardware, reminiscent of Transmeta's efforts, wouldn't be without some justification. It would be a pleasant surprise to see something like this take off, but the disclosures thus far are not very in-depth at all.
On that note, what's with AMD, Nvidia, and ARM disclosures coming out as branded Tirias papers?
I'm not familiar with this particular site, and a big chunk of what they advertise is marketing services. It seems odd to me, and the public stuff doesn't seem that in-depth.
However, most of them don't seem to understand that all modern CPUs perform some kind of code translation (especially x86 CPUs, since x86 instructions are variable-length and thus inefficient to execute directly). Denver translates the code once.
I agree on the description of standard decode and uop sequencing, but I'm not certain about Denver.
There's already an unknown threshold for dynamic analysis and reoptimization, and no significant details on the internal format or method of dispersal for code fetched from the translation cache.
Denver's software does the register renaming and reordering once (likely adjusting the results slightly, all the time, based on CPU feedback), while a traditional CPU does it again and again for the same code.
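To make that concrete, here's a minimal C sketch of translate-once semantics gated by a hotness threshold; the cache layout, the threshold value, and every name here are my assumptions, not anything Nvidia has disclosed:

```c
#include <stdint.h>
#include <stddef.h>

#define TCACHE_SLOTS   4096
#define HOT_THRESHOLD  1000   /* the real threshold is unknown */

typedef struct {
    uint64_t guest_pc;     /* ARM address the trace starts at */
    uint32_t exec_count;   /* profile counter feeding the threshold */
    void    *native_code;  /* renamed/reordered internal-ISA code, or NULL */
} trace_entry;

static trace_entry tcache[TCACHE_SLOTS];

extern void *translate_and_optimize(uint64_t guest_pc);  /* hypothetical */

/* Renaming/reordering cost is paid once, when the counter crosses the
 * threshold; a hardware OoO core re-renames the same instructions on
 * every pass through the loop. */
void *lookup_or_translate(uint64_t guest_pc)
{
    trace_entry *e = &tcache[guest_pc % TCACHE_SLOTS];
    if (e->guest_pc != guest_pc) {
        e->guest_pc    = guest_pc;
        e->exec_count  = 0;
        e->native_code = NULL;
    }
    if (e->native_code == NULL && ++e->exec_count >= HOT_THRESHOLD)
        e->native_code = translate_and_optimize(guest_pc);
    return e->native_code;   /* NULL: keep using the plain ARM decoder */
}
```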
This can very likely cause the dynamic behavior of execution traces that persist in one form for thousands of iterations to stray from the ideal on a per-iteration basis.
The cost of this likely depends on a number of architectural features and real-world system behaviors not discussed.
The allocation of register IDs in loop unrolling makes it sound like the uops expose to the translation software a register set larger than what is visible in ARMv8. I'd be curious how this is handled in terms of where registers used by instructions in the standard path are stored, and whether the IDs are static, added to a base register, or subject to a stack engine, which some architectures like Itanium put in place with uncertain success.
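As a toy illustration of what static ID allocation across unrolled iterations could look like, assuming an internal register file of 128 entries (a number I'm inventing; only the 31 ARMv8 GPRs are architectural):

```c
/* Toy model: give each unrolled copy of a loop body its own bank of
 * internal registers so temporaries never collide. The 128-entry file
 * and this banking scheme are my inventions, not Nvidia's design. */
#define ARCH_REGS     31
#define INTERNAL_REGS 128

static inline int internal_reg(int arch_reg, int unroll_iter)
{
    /* 4x unrolling peaks at reg 30 + 3*31 = 123, still under 128 */
    return arch_reg + unroll_iter * ARCH_REGS;
}
```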
Denver has a bigger instruction cache, perhaps to counter code density concerns, and maybe to handle that there are now two dueling instruction streams for the same program hitting the Icache.
The balance in optimization versus holding onto a somewhat less ideal trace is an interesting one, and knowing when to eat the cost of reoptimization from a power and performance standpoint would be a challenge.
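One way to frame that decision is as an amortization check; this sketch, with entirely invented constants, is only meant to show the shape of the tradeoff, not Nvidia's actual heuristic:

```c
#include <stdbool.h>
#include <stdint.h>

#define RETRANSLATE_COST_CYCLES 50000ULL /* invented one-time software cost */
#define EXPECTED_REMAINING_RUNS 10000ULL /* invented reuse estimate */

/* Reoptimize only when the projected per-iteration savings, over the
 * trace's expected remaining lifetime, beat the one-time retranslation
 * cost; otherwise keep the somewhat stale trace. */
bool should_reoptimize(uint32_t baseline_cycles, uint32_t current_cycles)
{
    uint64_t drift = current_cycles > baseline_cycles
                   ? (uint64_t)(current_cycles - baseline_cycles) : 0;
    return drift * EXPECTED_REMAINING_RUNS > RETRANSLATE_COST_CYCLES;
}
```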
I think a mild OoO engine could help in some ways, such as in the case of loop unrolling and its instruction cache footprint. An optimized subroutine for an OoO Denver might be significantly more compact than the one for an in-order wide core.
Another potential concern is how this might influence later architectural design. The solution allows for a simpler core, but at the same time this can serve as a bound on complexity even when it is warranted. Wider designs and more complex unit behaviors might lead to more dynamic variance in traces, which mild reordering could compensate for.
Denver tackles the right problem: code that runs again and again.
Reentrant code is a particular case the Pentium 4's trace cache was optimized for, and one where it provided significant benefits.
The glass jaws for that approach were significant, and while Nvidia's scheme should have more margin because of its software control and cache size, there are probably corner cases that have not been discussed.
Being software and being held in memory, the corner cases may be fewer but more expensive depending on architectural decisions not discussed.
(A lot of things aren't discussed.)
I've gone back to read some of the following to see if there are similarities:
http://www.realworldtech.com/crusoe-intro/
http://www.realworldtech.com/crusoe-exposed/
The presence of a standard ARM decoder indicates that Denver should be more robust internally than Crusoe, which at a low level was incapable of executing code outside of the translated region and lacked significant protections if anything delved below the level of the CMS.
Random questions and a rose-tinted remembrance of days past when architectures were actually disclosed to follow:
I have a lot of questions, although I worry that the one thing that Denver could inherit is the lack of transparency that Transmeta employed and which is all too common for a lot of mobile products as well.
Are the 7-wide superscalar width and the 7 execution units more than a coincidence in how the optimized internal ISA maps to the hardware?
These days, the emphasis on security has hopefully plugged a likely security hole via digital signing of the loaded CMS image, which is one drawback of the software method versus a hardware core. I'm curious what exploits are possible and what defenses Nvidia has created to prevent the compromise of its software engine or the integrity of the encrypted code store.
The handling of exceptions is another area of interest to me. A hardware core with speculative state rolls all its optimizations and precise exception state into one package (for integer work, at least), which is not the case for CMS or potentially Denver. Transmeta had a forced step by step mode in the case of an exception, while Denver may be able to flip over to the standard ARM front end and set the IP to the last known committed value before trying again.
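If the fallback works the way I'm guessing, the recovery path might look something like this sketch (all names hypothetical; Nvidia hasn't described the actual mechanism):

```c
#include <stdint.h>

enum exec_mode { OPTIMIZED_TRACE, ARM_DECODER };

struct cpu_state {
    uint64_t committed_pc;   /* last ARM instruction known to have retired */
    enum exec_mode mode;
};

extern void discard_speculative_state(struct cpu_state *cpu); /* hypothetical */
extern void set_fetch_pc(uint64_t pc);                        /* hypothetical */

/* On a fault inside an optimized trace: throw away speculative work, drop
 * to the in-order hardware decoder, and replay from the last committed ARM
 * PC so any real exception recurs at a precise instruction boundary. */
void on_fault_in_trace(struct cpu_state *cpu)
{
    discard_speculative_state(cpu);
    cpu->mode = ARM_DECODER;
    set_fetch_pc(cpu->committed_pc);
}
```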
The memory pipeline needs to be significantly better, and actually closer to an OoO core's, than most in-order designs'. For example, Itanium's in-order cores had software-driven load/store speculation as well as a very expansive, out-of-order memory pipeline. Denver would likely need such a memory pipeline if the cache subsystem is going to handle OoO-like traffic patterns.
Since the chip must also be an ARMv8 design, at least that part of it needs to behave in a certain manner; does that leave room for the more exotic software speculation measures?
There are some additional resources compared to Efficeon and especially Crusoe. Perhaps one key element is the presence of a second core that can be leaned on for running the translation software, although that actually brings up questions of how such a design handles higher core counts.
What's the throughput of the ARM decoder?
What are the policies of the translated code cache (length of an entry, privileged instructions, self-modifying code, etc.)?
What are the policies for allocating optimized traces, and can the same set of ARM instructions wind up in more than one microcoded routine? Can the CPU directly branch from within one of these routines to another routine in the optimization cache?
Can one routine with its set of renamed registers be called while a different routine's renamed registers are in-flight?
CMS had translations dependent on the privilege level of the process at the time its code was translated; what does Denver do?
Etc.