NVIDIA's x86-to-ARM Project Discussion

neliz

GIGABYTE Man
Veteran
http://www.semiaccurate.com/2010/08/17/details-emerge-about-nvidias-x86-cpu/

So despite nV demo'ing Linux running on a GF100, we shouldn't expect nV to release any x86-compatible product any time soon.
One big clue about how badly Nvidia lost is in the FTC settlement, under Section I. F., Other Definitions. It defines "Compatible x86 Microprocessor", in part iii., as one "that is substantially binary compatible with an Intel x86 Microprocessor without using non-native execution such as emulation".
 
http://www.semiaccurate.com/2010/08/17/details-emerge-about-nvidias-x86-cpu/

So despite nV demo'ing Linux running on a GF100, we shouldn't expect nV to release any x86-compatible product any time soon.

I doubt NVIDIA will be able to make an x86-compatible CPU without obtaining a license from Intel (and possibly AMD). There are just too many patents related to x86. It looks to me like the only possibility is to buy (or "merge" with) VIA to get their license (which extends to 2018).

Of course, they can always make an x86 emulator. Transmeta did that to get around the patent problem, and the Business Week article suggests NVIDIA is doing it the same way. However, it'd be very difficult to make a high-performance x86 CPU with an emulator.
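For illustration, software emulation in its most basic form is just an interpreter loop over guest instructions. Here's a minimal sketch in C, handling only a couple of opcodes and nothing like what Transmeta (or NVIDIA) would actually ship:

Code:
/* Minimal sketch of an interpreter-style x86 emulator loop (illustrative only).
   Real emulators translate whole blocks of code instead of interpreting one
   instruction at a time, precisely because this is slow. */
#include <stdint.h>

typedef struct {
    uint32_t eip;          /* guest instruction pointer */
    uint32_t regs[8];      /* EAX..EDI */
    uint32_t eflags;
    uint8_t  *mem;         /* flat guest memory for this toy example */
} X86State;

static void emulate(X86State *s)
{
    for (;;) {
        uint8_t opcode = s->mem[s->eip++];
        switch (opcode) {
        case 0x40:                      /* INC EAX (toy subset) */
            s->regs[0] += 1;
            /* real x86 also updates OF/SF/ZF/AF/PF here */
            break;
        case 0xF4:                      /* HLT - stop the toy loop */
            return;
        default:
            /* every one of the hundreds of remaining opcodes,
               prefixes and addressing modes goes here... */
            return;
        }
    }
}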
 
I doubt NVIDIA will be able to make an x86-compatible CPU without obtaining a license from Intel (and possibly AMD). There are just too many patents related to x86. It looks to me like the only possibility is to buy (or "merge" with) VIA to get their license (which extends to 2018).
As pointed out earlier in this thread, it's VIA who could buy nVidia if one were to buy the other, not the other way around (due to the fact that VIA is part of a really, really huge company)

edit: scratch that, it was another thread:
you mean a takeover of Nvidia by VIA? VIA is part of the Formosa Plastics Group. They are quite big.
 
So wait - NVIDIA is investing significant resources in an optimized x86-to-ARM hardware translation engine, and ARM would care so little as to not give them very preferential access to Eagle's architecture? It would be funny if it wasn't so absurd. Keep in mind also that T50 isn't the first Eagle-based Tegra: that's T40, which is an ARM-only quad-core on TSMC 28HPM.

I also completely fail to understand how this would not fit under the "without using non-native execution" clause. There's one very critical difference between NVIDIA's approach and Transmeta's: in the latter's case, it was the CPU core itself which handled the translation mechanism. In NVIDIA's case, it will be a discrete component that could easily be power-gated off when running ARM binaries (in order to be competitive with OMAP and the like).

From a legal perspective, NVIDIA could rightfully argue that what they have implemented is an x86 hardware emulator connected to an ARM microprocessor. This is clearly even more defensible than Transmeta's approach, and seems to fit very easily within the wording of the FTC settlement even if Intel disagreed. And an (ideally configurable/programmable but highly specialised) off-core block might even mean slightly better efficiency than Transmeta's approach.

But do not think I am cheerleading here. I am very skeptical that they will achieve their goals from a technical (performance and power) perspective. I suspect Eagle will have comparable or slightly higher performance relative to Atom's successor in that timeframe. Combined with the translation mechanism, I can't really see how they will beat Intel on performance, and they'll lose a fair bit of the power advantage unless their design is really incredible. Just making it work with decent performance would be a technological tour-de-force, but actually making it attractive overall in an ultra-competitive market would rightfully blow a lot of people's minds so I understand Charlie's skepticism even if I don't agree with several of his reasons.

My understanding of NVIDIA's process evolution for Tegra is as follows: T30=28LPT, T40=28HPM, T50=20HPM (LPT=SiON & HPM = High-K, all triple gate oxide). So the one consolation is that if both they and TSMC don't suffer any significant delays (well, we'll see about that) they might not really be at a substantial process disadvantage versus Intel. They do suffer execution risk from being that aggressive though.

EDIT: Assuming Charlie is correct, I'd like to point out I correctly speculated about this in March without any insider info - who needs sources when you've got a brain? :) I didn't touch on the technical challenges then, but my point about it having a better risk-reward ratio than a from-the-ground-up x86 core stands: http://forum.beyond3d.com/showpost.php?p=1409915&postcount=2170
 
So wait - NVIDIA is investing significant resources in an optimized x86-to-ARM hardware translation engine, and ARM would care so little as to not give them very preferential access to Eagle's architecture? [...]

Personally, I think a hw x86 translation engine would be cool, even if it doesn't win benchmarks. ;)
 
It's interesting that many people feel that x86-to-ARM translation, even if performed in hw, will be uncompetitive, when x86 CPUs themselves work by cracking x86 instructions into simpler ones. :p

A possible factor to consider is that Atom is in-order, while its ISA-sake (the translated x86 code) will be running on an OoO CPU here. So I would not dismiss its practicality without seeing some hard numbers.
 
The DEC Alpha had FX!32, a translator that actually kept translated versions of the x86 code it ran, which presumably represents the ceiling of what a pure software solution can do. At the time, and this was some time ago, it achieved between 1/2 and 3/4 of native performance.

Itanium now has software emulation for x86 as well, which allowed the later generations to ditch the built-in x86 decode unit. As a cautionary example for the idea of implementing an x86 hardware translation unit, the company that brought us x86 found the realities of implementing it in a performant and debugged fashion to be more trouble than it was worth.
The x86 block was a rather significant swath of transistors relative to the Itanium core, it needed engineering resources for validation, it could not be as flexible with instruction packing, and Intel's benchmarks showed that the software method turned out to be better.
Granted, the jump between x86 and EPIC would be more of a stretch than x86 to ARM.

Implementing a hardware x86 unit on an Eagle core may potentially be worse patent-wise than Transmeta's scheme.
The Transmeta core kept a lot of the emulation process in software, including significant portions of the memory access pipeline. The core itself had a very pared down memory access model.

Bolting an x86 front end to an ARM core means there will be physical linkages to the ARM core's TLB and decode paths, and while I do not have details on the minutiae, I have seen discussion that some of the likely areas covered by the x86 patents are things like the physical implementation of the TLB and the decode process.
Properly stitching in an x86 hardware front end may inadvertently trample on one of these.
 
It's interesting that many people feel that x86-to-ARM translation, even if performed in hw, will be uncompetitive, when x86 CPUs themselves work by cracking x86 instructions into simpler ones. :p
Good point - I did think about the fact that x86 does x86-to-microcode while ARM does not need any such conversion, so the number of translation steps is actually the same when you add NVIDIA's hardware translation block.

The way I look at it is that this microcode conversion is internal to the core and the microcode is optimised to facilitate the conversion. Of course with functionality like micro-op fusion and macro-op fusion, it can get quite expensive and very complicated anyway. But most importantly, the translation is done in real-time.

NVIDIA/Transmeta's approach is more comparable to a JIT. The translation can be done in advance or as needed, re-used, and optimised over time. This has both advantages and disadvantages. Certainly the power cost is lower than the traditional approach for a loop running many iterations, and higher for code that isn't run often. The area cost is very complicated to estimate but probably also higher, although very interestingly, if you allow the block to be a bottleneck in multi-core scenarios with little code reuse, the area cost might be *lower* than an x86 multicore where everything needs to be duplicated.
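To make the JIT comparison concrete, the re-use idea boils down to something like this sketch (hypothetical C, assuming a simple block-level translation cache; it has nothing to do with NVIDIA's actual design):

Code:
/* Hypothetical block-level translation cache, JIT-style (illustration only).
   Translate an x86 basic block once, cache the resulting ARM code, and jump
   straight to the cached version on every later visit. */
#include <stdint.h>
#include <stddef.h>

#define CACHE_SLOTS 4096

typedef void (*arm_code_t)(void);       /* entry point of translated ARM code */

static struct {
    uint32_t   x86_pc;                  /* guest address of the block */
    arm_code_t arm_entry;               /* translated code, or NULL if empty */
} tcache[CACHE_SLOTS];

/* Expensive step: done by the translation block/core (or software). */
extern arm_code_t translate_block(uint32_t x86_pc);

static void run_block(uint32_t x86_pc)
{
    size_t slot = x86_pc % CACHE_SLOTS;

    if (tcache[slot].arm_entry == NULL || tcache[slot].x86_pc != x86_pc) {
        tcache[slot].x86_pc    = x86_pc;        /* translate once... */
        tcache[slot].arm_entry = translate_block(x86_pc);
    }
    tcache[slot].arm_entry();                   /* ...reuse many times */
}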

Finally, for performance: in pure theory you could actually exceed native Eagle performance through JIT-like tricks; in practice that's unlikely unless the original binary's compiler was written by an undergraduate in the 1980s and expected to compile fast on a 286. There is one very intriguing optimisation I can think of, however, which is that ARM is capable of very fast predication iirc - that's certainly a perfect example where a JIT might help above and beyond the compiler that produced the original binary.
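As a toy example of the kind of transform I mean, here is the same operation written branchy and branch-free in C; the second form maps nicely onto ARM's predicated/conditional-select instructions, and it's the sort of rewrite a JIT could apply that the original compiler didn't:

Code:
/* Toy example: the same clamp written branchy vs. branch-free.
   A translator/JIT that spots the first form can emit the second,
   which suits ARM conditional execution instead of a branch. */
int clamp_branchy(int x, int limit)
{
    if (x > limit)          /* naive translation keeps the branch */
        x = limit;
    return x;
}

int clamp_predicated(int x, int limit)
{
    return (x > limit) ? limit : x;   /* select instead of branch */
}
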
A possible factor to consider is that Atom is in-order, while its ISA-sake (the translated x86 code) will be running on an OoO CPU here. So I would not dismiss its practicality without seeing some hard numbers.
I believe Intel has hinted that Atom will evolve to OoO over time, although I can't remember where I read that. If true, it remains to be seen how that affects area/power efficiency of course.
 
It's interesting that many people feel that x86-to-ARM translation, even if performed in hw, will be uncompetitive, when x86 CPUs themselves work by cracking x86 instructions into simpler ones. :p
There is a difference between cracking x86 instructions into an internal representation that is purpose-made for executing x86, where the format's rules and semantics are purpose-built to match what x86 should do (or does for the vast majority of cases),

and having hardware crack x86 instructions and move them onto an ISA with its own rules, semantics, and corner cases, none of which is "do what x86 does". ARM has been around a long time and it has its own weirdness.

Properly bridging the gap may be very hard to do, and part of the problem is that fully emulating x86 in this fashion means Nvidia's engineers might have to do a large amount of research into current x86 implementations and then verify a very complex system.

Maybe, hypothetically, it would be easier if Nvidia had decoders that cracked ARM and x86 into the same internal format (which would involve a lot of competencies Nvidia may not have), but bolting x86 on top of ARM via hardware emulation may be a lot more trouble than it is worth.
 
Itanium now has software emulation for x86 as well, which allowed the later generations to ditch the built-in x86 decode unit. As a cautionary example for the idea of implementing an x86 hardware translation unit, the company that brought us x86 found the realities of implementing it in a performant and debugged fashion to be more trouble than it was worth.
[...]
Implementing a hardware x86 unit on an Eagle core may potentially be worse patent-wise than Transmeta's scheme.
[...]
Bolting an x86 front end to an ARM core means there will be physical linkages to the ARM core's TLB and decode paths, and while I do not have details on the minutiae, I have seen discussion that some of the likely areas covered by the x86 patents are things like the physical implementation of the TLB and the decode process.
Properly stitching in an x86 hardware front end may inadvertently trample on one of these.
Hmm, makes a lot of sense and very good points, although what I was thinking of wouldn't actually suffer from any of these problems :)

I figured that debugging alone made a true hardware solution a la Itanium's absurd. What I'm thinking of as most probable instead is a programmable solution with its own ISA and specialised functionality to accelerate x86-to-ARM translation. It seems very comparable to Transmeta in ideology, but it's a separate core that can be a bottleneck on its own in extreme cases. Just like Transmeta, and unlike x86-to-microcode (for all intents and purposes), there is no one-to-one link between the translation and the execution.

An intriguing approach for that would be basing it on Tensilica's Xtensa architecture (which the Tegra team used for the GoForce 5500/5300 but definitely not in Tegra, so that's beside the point). That's an implementation detail though.
 
From a legal perspective, NVIDIA could rightfully argue that what they have implemented is an x86 hardware emulator connected to an ARM microprocessor. This is clearly even more defensible than Transmeta's approach, and seems to fit very easily within the wording of the FTC settlement even if Intel disagreed. And an (ideally configurable/programmable but highly specialised) off-core block might even mean slightly better efficiency than Transmeta's approach.

Licensing/getting around Intel's patents might not be enough. In that time frame, they'll prolly also need AMD's patents. What do you think AMD will negotiate for? Or negotiate for disbanding? :???:
 
Licensing/getting around Intel's patents might not be enough. In that time frame, they'll prolly also need AMD's patents. What do you think AMD will negotiate for? Or not for? :???:
I'm not sure I understand: my point is that this gets around x86 patents and licensing generally *if* they are using the approach I'm thinking of *and* they pull it off correctly (easier said than done!) - there could be some difficulty in implementing memory-related things without infringing various patents as 3dilettante implied, but if it's a software-centric approach (despite conceptually being a bolted-on blackbox) they could try changing the implementation a bit following a threat from Intel or AMD.

Once again, none of this is enough to truly convince me on the performance front in that timeframe. But I agree with you that it's technologically very exciting either way.

neliz: Keep in mind there were Transmeta engineers at NVIDIA before they even seriously started any x86 project; they just worked on GPU stuff along with everyone else. It's impossible to estimate how many of those would work on the CPU project and how many on completely unrelated ventures.
 
As pointed out earlier in this thread, it's VIA who could buy nVidia if one were to buy the other, not the other way around (due to the fact that VIA is part of a really, really huge company)

Actually, VIA is a public corporation. Its chairwoman owns around 11% of its public stock, and her husband owns around 2%. She is also very rich thanks to HTC. Compared to HTC, VIA is in pretty bad shape, so it's not unimaginable that she may decide VIA is a lost cause and sell it to the highest bidder. VIA's market cap is only around US$600 million, well within reach for NVIDIA.
 
I figured that debugging alone made a true hardware solution a la Itanium's absurd. What I'm thinking of as most probable instead is a programmable solution with its own ISA and specialised functionality to accelerate x86-to-ARM translation. It seems very comparable to Transmeta in ideology, but it's a separate core that can be a bottleneck on its own in extreme cases. Just like Transmeta, and unlike x86-to-microcode (for all intents and purposes), there is no one-to-one link between the translation and the execution.
The translation process is intensive enough that I'd expect that the translation core would have at least a full integer and memory pipeline, which is enough that I'd wonder why not run the translation routine on a single core and then switch contexts to run the translated code.

There would be three general instruction groups that would be passed to the second core:
Instructions that do what the x86 ops intend.
Instructions that do all the side effects x86 ops require (update the virtual x86 state, flags, weird things like x87 extended precision, the x86 kitchen sink).
Instructions that suppress any ARM-specific behaviors that x86 ops cannot permit.
The latter two would be significantly simpler if the core had been designed with x86 emulation in mind instead of being an independent ISA with no intention of emulating x86. If we are using ARM, there are no savings to be had, and anything further only adds to overhead and power consumption.
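To make the overhead concrete, here is a hypothetical C sketch of what a single x86 "add [mem], reg" could expand into once the architecturally required flag updates are included (the third group, suppressing ARM-specific behavior, isn't even shown):

Code:
/* Hypothetical expansion of a single x86 "add [mem], reg" on the ARM side
   (illustrative C, not real translator output). Group 1: the intended work.
   Group 2: the x86 side effects (flags). */
#include <stdint.h>

typedef struct {
    uint32_t eax;
    uint32_t eflags;        /* CF=bit0, PF=bit2, AF=bit4, ZF=bit6, SF=bit7, OF=bit11 */
} X86Virt;

static void emu_add_mem_eax(X86Virt *v, uint32_t *mem)
{
    uint32_t a = *mem;                       /* group 1: load             */
    uint32_t r = a + v->eax;                 /* group 1: the actual add   */
    *mem = r;                                /* group 1: store            */

    /* group 2: rebuild the x86 flags this op is architecturally required to set */
    uint32_t cf = (r < a);                                   /* carry      */
    uint32_t zf = (r == 0);                                  /* zero       */
    uint32_t sf = (r >> 31);                                 /* sign       */
    uint32_t of = (~(a ^ v->eax) & (a ^ r)) >> 31;           /* overflow   */
    uint32_t af = ((a ^ v->eax ^ r) >> 4) & 1;               /* aux carry  */
    uint32_t pf = 1;                                         /* parity of low byte */
    for (uint32_t b = r & 0xff; b; b &= b - 1) pf ^= 1;

    v->eflags = (cf << 0) | (pf << 2) | (af << 4) |
                (zf << 6) | (sf << 7) | (of << 11);
}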

Separate cores would also mean back-and-forth traffic, since both cores would be maintaining the same x86 virtual core. This would also come up during branches to untranslated code pages and during any indirect branches that had not been optimized to use ARM-appropriate destinations, which may not be possible in all cases.
Then there are exceptions or interrupts, whose proper servicing may require that the second core be running an emulation mode all its own just to get proper behaviors. At that point, why involve a third ISA and another core?

An x86 black box might be better off as an x86 "summarizer" where it is fed an instruction and outputs a blob of bits that indicates the particulars of the instruction, and then an ARM translator loop figures out the rest.
This is very close to running an interpreter on what amounts to x86 micro or macro ops.
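Purely as speculation about what that division of labour could look like, a summarizer interface might be something like this (every field and name here is invented):

Code:
/* Hypothetical "summarizer" output: hardware cracks the raw x86 bytes into a
   fixed-format descriptor, and a software loop on the ARM core does the rest.
   Field layout is invented for illustration. */
#include <stdint.h>

typedef struct {
    uint8_t  length;        /* bytes consumed, 1..15 */
    uint8_t  opclass;       /* broad class: ALU, load/store, branch, x87, ... */
    uint8_t  dst_reg;       /* decoded register numbers */
    uint8_t  src_reg;
    uint32_t displacement;  /* memory operand details, if any */
    uint32_t immediate;
    uint16_t flags_written; /* which EFLAGS bits this op defines */
    uint16_t attributes;    /* lock prefix, rep prefix, mem operand, etc. */
} X86Summary;

/* Provided by the hardware block (or a software model of it). */
extern int summarize_x86(const uint8_t *code, X86Summary *out);

/* The software side: an interpreter/translator loop over the summaries. */
void translate_stream(const uint8_t *code, uint32_t len)
{
    uint32_t off = 0;
    while (off < len) {
        X86Summary s;
        if (summarize_x86(code + off, &s) != 0)
            break;                      /* unknown/illegal encoding */
        /* ...emit or interpret ARM ops for s.opclass here... */
        off += s.length;
    }
}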
 
Instructions that do all the side effects x86 ops require (update the virtual x86 state, flags, weird things like x87 extended precision, the x86 kitchen sink).
Instructions that suppress any ARM-specific behaviors that x86 ops cannot permit.
The latter two would be significantly simpler if the core had been designed with x86 emulation in mind instead of being an independent ISA with no intention of emulating x86. If we are using ARM, there are no savings to be had, and anything further only adds to overhead and power consumption.
Hmm, yes. I did ponder that, but now that I really stop to think about it, I realise the cost in either raw performance or area/power/latency would be greater than I thought.

An x86 black box might be better off as an x86 "summarizer" where it is fed an instruction and outputs a blob of bits that indicates the particulars of the instruction, and then an ARM translator loop figures out the rest.
This is very close to running an interpreter on what amounts to x86 micro or macro ops.
The basic issue is that a high-performance OoO core has lower perf/area and perf/watt than a simpler one, and doing that stuff straight on the ARM core without even any dedicated instructions can't be cheap.

It's definitely a very complex problem, which only makes me more skeptical on the technical side...

One small point I'd like to add: NVIDIA implemented their own L2 cache controller for Tegra 1 & 2 instead of licensing ARM's like everyone else. I wonder if ARM could add an instruction in Eagle that forces a memory load to bypass L1...
 
I'm not sure I understand: my point is that this gets around x86 patents and licensing generally *if* they are using the approach I'm thinking of *and* they pull it off correctly (easier said than done!) - there could be some difficulty in implementing memory-related things without infringing various patents as 3dilettante implied, but if it's a software-centric approach (despite conceptually being a bolted-on blackbox) they could try changing the implementation a bit following a threat from Intel or AMD.
Aah, so you were suggesting this _might_ get around both Intel's and AMD's patents, not just Intel's.
 
The basic issue is that a high-performance OoO core has lower perf/area and perf/watt than a simpler one, and doing that stuff straight on the ARM core without even any dedicated instructions can't be cheap.
Transmeta's code morphing and FX!32 included optimization steps. Transmeta's core was designed to help x86 emulation from the outset, so it had a leg up in that there wasn't a raft of instructions needed to capture every detail, such as the fact that it had 80-bit FP regs.

FX!32 saved its translations, created native DLL calls, and successively optimized them. It helped a lot, especially since Alpha was not designed to emulate x86.

There are legal and resource implications that possibly make this untenable in the market, and the saved translations would be difficult to justify given the limited storage at the lower end of mobile devices.


In a way, an x86 black box would look like a weird sort of microcode engine, which is given an address to decode and then an address to write the output to. It might only be single-issue and rely on the software interpreter to cache morphed code like Transmeta did.
There is going to be a notable amount of bloat when going from x86 to ARM, which would saturate the ARM issue width. A single reg/mem op would be at least a load and an ALU op, for example, never mind any possible extra status checks.
There may be outstanding patents on high-performance superscalar x86 decode anyway.
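Speculating further, the driver-side view of such a microcode-engine-style black box could be as simple as this (register layout and names are entirely made up):

Code:
/* Hypothetical driver-side view of the black box: point it at x86 code,
   point it at an output buffer, get translated ARM code back.
   The register layout and function names are invented. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    volatile uint64_t src_addr;    /* x86 code to decode                */
    volatile uint64_t dst_addr;    /* where to write translated ARM ops */
    volatile uint32_t max_bytes;
    volatile uint32_t control;     /* bit 0: go; hardware clears when done */
    volatile uint32_t bytes_out;   /* ARM bytes produced                */
} XlateBlock;

static size_t translate_region(XlateBlock *hw,
                               const void *x86_code, size_t len,
                               void *arm_buf)
{
    hw->src_addr  = (uintptr_t)x86_code;
    hw->dst_addr  = (uintptr_t)arm_buf;
    hw->max_bytes = (uint32_t)len;
    hw->control   = 1;                      /* kick it off                  */
    while (hw->control & 1)                 /* spin until the block is done */
        ;
    return hw->bytes_out;                   /* caller caches/optimizes this */
}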

One small point I'd like to add: NVIDIA implemented their own L2 cache controller for Tegra 1 & 2 instead of licensing ARM's like everyone else. I wonder if ARM could add an instruction in Eagle that forces a memory load to bypass L1...
Possibly a prefetch could do this, or at least the idea of a stream buffer has been mooted. I can't recall if it has been used in a current chip.
I'm not sure about a demand load, and prefetches usually at least go into the L1. There may be corner cases with cache coherence or TLB updates that are served by hitting at least one level of the hierarchy before hitting a register. A more exotic custom memory setup may be theoretically possible, though it sounds like it could be expensive and prone to problems. A coherence or TLB problem would be a game-over.
 
The DEC Alpha had FX!32, a translator that actually kept translated versions of the x86 code it ran, which presumably represents the ceiling of what a pure software solution can do. At the time, and this was some time ago, it achieved between 1/2 and 3/4 of native performance.

Itanium now has software emulation for x86 as well, which allowed the later generations to ditch the built-in x86 decode unit. As a cautionary example for the idea of implementing an x86 hardware translation unit, the company that brought us x86 found the realities of implementing it in a performant and debugged fashion to be more trouble than it was worth.
The x86 block was a rather significant swath of transistors relative to the Itanium core, it needed engineering resources for validation, it could not be as flexible with instruction packing, and Intel's benchmarks showed that the software method turned out to be better.
Granted, the jump between x86 and EPIC would be more of a stretch than x86 to ARM.

a) I think, off-hand, it might be reasonable to assume that Eagle's OoOE will cancel out the translation overhead. So even if it can't reach that level of performance, it could certainly do it in about equal area. And if it can come close enough in power, I don't think it will matter that much (see c).

b) And then there is always the option of making the CLR/JVM/V8/whatever JIT directly to (suitably sandboxed?) ARM, even if the rest of the system is stuck in x86. Close cooperation with sw vendors will obviously be needed, but considering the magnitude of the challenge, this is definitely going to be the least of their worries. :smile:

c) x86 is needed only where you need Windows. So in the netbook/low-end notebook market, the last 500-1000 mW may not be make-or-break.

d) And last but not least, they need it to revive their low-end market, so any product is better than none, I'd say.
 