NVIDIA's x86-to-ARM Project Discussion

Eagle is still targeted at a pretty power-constrained environment.
Even a balls-to-the-wall OoOE implementation would be hard-pressed to hide the translation overhead. I'd point to FX!32 on Alpha as the ceiling, at a theoretical 50-75% of native, and that was with the ability to store the translated binaries.

OoOE also cannot fix some of the stressors that emulation or translation involve.
ALU load is higher, as operations the hardware used to provide built-in must now explicitly go through the ALUs.
OoOE does not increase ALU density; usually it is the opposite.
Transmeta went VLIW, possibly in part because they knew there would be enough redundant and statically detectable work involved in the emulation process.
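
To put the ALU-load point in concrete terms, here is a rough C sketch of interpreting a single x86 add (the state layout and helper names are made up for illustration; the point is the flag arithmetic that x86 gets for free):

```c
#include <stdint.h>

/* Hypothetical emulated-CPU state; field names are made up. */
typedef struct {
    uint32_t regs[8];              /* EAX..EDI               */
    uint32_t cf, zf, sf, of;       /* a few x86 status flags */
} X86State;

/* Emulate "add regs[dst], regs[src]". On x86 the flag updates are
 * built into the hardware; here each one is an extra ALU operation. */
static void emulate_add(X86State *s, int dst, int src)
{
    uint32_t a = s->regs[dst], b = s->regs[src];
    uint32_t r = a + b;
    s->regs[dst] = r;
    s->cf = r < a;                      /* unsigned carry-out */
    s->zf = (r == 0);                   /* zero               */
    s->sf = r >> 31;                    /* sign               */
    s->of = ((a ^ r) & (b ^ r)) >> 31;  /* signed overflow    */
}
```

One x86 ALU op fans out into half a dozen, and that is before any address arithmetic for memory operands.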

OoOE does help hide short-latency events, such as hiccups in the L1 and possibly L2 data caches.
It does not help with hiccups in the instruction delivery pipeline, and OoO chips are typically limited by instruction throughput, which emulation or translation worsen through code bloat and cleanup work, and which can spill into memory in ways too long-latency to hide anyway.

The likely x86 competitor is probably going to have similar clocks, so there is no fallback to brute-force clocking for the inevitably longer straight-line code.

Maybe it is possible to shoot for good enough and hope the GPU handles the graphical glitz well enough.
 
I meant that x86 perf on an OoO Eagle would be roughly that of an in-order Eagle equivalent. That would be a steep hit, no doubt. Their best bet is relying on user-space software JITing to ARM and having a kick-ass GPU. All JVM/.NET/JavaScript code could go that way, relieving LOTS of translation overhead.
 
TMTA's VLIW was custom designed to be very close to x86. ARM is by definition not very close to x86 - totally different memory model, totally different semantics for flags, etc.

Translating x86-->Alpha or x86-->IA64 isn't so bad, since Alpha and IA64 both have a lot more registers than x86. Unfortunately, ARM has the same number of registers as x86-64... so you're kind of in a bad situation there. Not only that, but you also need to be able to emulate SSE, and I don't know if ARM has 128b SIMD yet (they very well might, but I am not sure).

The bottom line is that ARM lacks many of the features that are needed for targeting x86 - look at the Chinese MIPS clone (Loongson) for a good example of something that kind of does have them. It's possible that NV could do it, but it would probably involve a lot of extra work.

Also, anyone who thinks that x86-->uop translation is anything like x86-->ARM needs to learn more about ARM and more about uops : ) Not even remotely comparable.

DK
 
TMTA's VLIW was custom designed to be very close to x86. ARM is by definition not very close to x86 - totally different memory model, totally different semantics for flags, etc.
What do you mean by memory model?

Translating x86-->Alpha or x86-->IA64 isn't so bad, since Alpha and IA64 both have a lot more registers than x86. Unfortunately, ARM has the same number of registers as x86-64... so you're kind of in a bad situation there. Not only that, but you also need to be able to emulate SSE, and I don't know if ARM has 128b SIMD yet (they very well might, but I am not sure).
AFAIK, NEON does both 64-bit and 128-bit SIMD.

The bottom line is that ARM lacks many of the features that are needed for targeting x86 - look at the Chinese MIPS clone (Loongson) for a good example of something that kind of does have them. It's possible that NV could do it, but it would probably involve a lot of extra work.
Then, as 3d suggested, maybe crack both x86 and ARM into a common representation. But that would suck for both ARM and x86. ;)

But like I said earlier, getting JVM/CLR/V8 to JIT directly to ARM would solve a lot more of their headaches, with less work, than emulating x86. They could add an instruction to switch from x86 to ARM and vice versa, just like Jazelle, or the endianness switch in ARM.
 
Translating x86-->Alpha or x86-->IA64 isn't so bad, since Alpha and IA64 both have a lot more registers than x86. Unfortunately, ARM has the same number of registers as x86-64... so you're kind of in a bad situation there. Not only that, but you also need to be able to emulate SSE, and I don't know if ARM has 128b SIMD yet (they very well might, but I am not sure).
I obviously agree with your other points, David (sorry if you thought my handwaving about uops was a bit sloppy; it wasn't meant to be rigorous :)), and while I don't think they are necessarily show-stoppers, assuming they do things differently from TMTA or Loongson, they do contribute to my skepticism about performance.

But here I think you're forgetting something: this isn't about x86-64->ARM. It's about x86-64->ARM-64 (to be introduced in Eagle), so while we don't know the number of registers on the latter, it's likely to be higher than before, which should help with translation. ARM also supports 128b SIMD (with multiply-add), although on the A8 and A9 it's executed over two cycles on a 64b unit. But Qualcomm has native 128b in Snapdragon, and presumably Eagle will have the same.
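
For reference, this is what 128b SIMD with multiply-add looks like at the NEON intrinsics level in C (a trivial sketch; per the above, the A8/A9 would crack each 128b op into two passes through the 64b unit, while a native 128b implementation would not):

```c
#include <arm_neon.h>

/* y[i] += a * x[i], four floats at a time in the 128-bit "q" registers.
 * For brevity, n is assumed to be a multiple of 4. */
void saxpy_neon(float *y, const float *x, float a, int n)
{
    float32x4_t va = vdupq_n_f32(a);         /* broadcast a         */
    for (int i = 0; i < n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);   /* load 4 floats       */
        float32x4_t vy = vld1q_f32(y + i);
        vy = vmlaq_f32(vy, vx, va);          /* multiply-accumulate */
        vst1q_f32(y + i, vy);                /* store 4 floats      */
    }
}
```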

EDIT: By the way, Charlie's information about TI being the 'lead licensee' for Eagle is false. From ARM's Q2 earnings release: "Major semiconductor company becomes the third lead-licensee for the “Eagle” Cortex-A™ class processor" - so it's nearly certain that both TI and NVIDIA are on that list. I don't know who the third company might be; Samsung perhaps, as they've already had Eagle on one of their old roadmaps?
 
I think Qualcomm is much more likely to be on that list than NVIDIA.
Qualcomm is an architectural licensee. They are much more likely to come up with a new OoOE core of their own to replace Scorpion than to license Eagle. And assuming NVIDIA isn't on that list *if* Charlie's right is downright absurd - of course, Charlie could be wrong.
 
Qualcomm is an architectural licensee. They are much more likely to come up with a new OoOE core of their own to replace Scorpion than to license Eagle. And assuming NVIDIA isn't on that list *if* Charlie's right is downright absurd - of course, Charlie could be wrong.

Hypothetically, what would be easier: licensing Eagle and then coming up with an OoO derivative of Scorpion, or making an OoO Scorpion all by themselves?
 
Hypothetically, what would be easier: licensing Eagle and then coming up with an OoO derivative of Scorpion, or making an OoO Scorpion all by themselves?
Qualcomm couldn't just license it and base their OoO design on tricks invented by ARM for the A9 or Eagle. ARM sells completed blocks and architectural licenses, not mixes of the two. They're a corporation, not a research organisation like IMEC ;)

Snapdragon 1 taped out in 1H07 if I remember correctly, and they've got a substantial team working on this stuff. I can't see why they couldn't come up with an OoO core 4-5 years after Scorpion.
 
I thought Scorpion was a derivative of Cortex-A8, was I mistaken? It does seem to make more sense for Qualcomm to buy an architectural license plus the complete Eagle block, and then customize it as they see fit. I mean, why replicate all the engineering effort that ARM will put into Eagle?

Plus, I don't see why the idea that NVIDIA wouldn't be on ARM's VIP list should be so ludicrous. After all, NVIDIA is a pretty small player in the ARM world (where you have companies like TI, Samsung, Qualcomm, Apple, Freescale, etc.).

Actually, now that I think about it, this 3-member list is likely to be TI + Apple + Qualcomm or TI + Apple + Samsung, or something like that.
 
Qualcomm couldn't just license it and base their OoO design on tricks invented by ARM for the A9 or Eagle. ARM sells completed blocks and architectural licenses, not mixes of the two. They're a corporation, not a research organisation like IMEC ;)
So that means, if you take an architectural license, you don't get to peek inside their RTL? And if you don't take an architectural license, you can't modify anything inside?
 
So that means, if you take an architectural license, you don't get to peek inside their RTL? And if you don't take an architectural license, you can't modify anything inside?

Yet I think this is what Apple did with the A4…
 
I thought Scorpion was a derivative of Cortex-A8, was I mistaken?
You are mistaken - it's similar to the A8 architecturally and shares the same NEON ISA version, but it's a from-the-ground-up design.

It does seem to make more sense for Qualcomm to buy an architectural license plus the complete Eagle block, and then customize it as they see fit. I mean, why replicate all the engineering effort that ARM will put into Eagle?
Because ARM doesn't allow that; it's just asking for trouble to let everyone customise everything. Or if they do allow it, it's certainly not part of their normal license and would be substantially more expensive.

Plus, I don't see why the idea that NVIDIA wouldn't be on ARM's VIP list should be so ludicrous. After all, NVIDIA is a pretty small player in the ARM world (where you have companies like TI, Samsung, Qualcomm, Apple, Freescale, etc.).
They were on the lead licensee list for the A9 along with TI and Samsung. I don't see what's ludicrous at all - they might not be a big player share-wise, but they're investing a lot of money in it. You could argue it's ludicrous they are investing so much money into it given how difficult it has been and still will be for them to get a leading position in the market, but that's a separate conversation.

Actually, now that I think about it, this 3-member list is likely to be TI + Apple + Qualcomm or TI + Apple + Samsung, or something like that.
TI + Apple + Samsung is indeed plausible. Apple has an architectural license though... there have been whispers they lost too many PA Semi engineers to finish that project, but I think it's more likely it has simply been delayed and they'll never license Eagle.

rpg.314 said:
So that means, if you take an architectural license, you don't get to peek inside their RTL? And if you don't take an architectural license, you can't modify anything inside?
Well obviously you get the RTL in order to implement it, but I'd *assume* there are fewer comments and explanations than in ARM's own version of it. Either way the normal license does not allow you to change the core.

Alexko said:
Yet I think this is what Apple did with the A4…
Yes and no. Here's the important bit:
http://www.samsung.com/global/business/semiconductor/newsView.do?news_id=1030 said:
Intrinsity's Cortex-A8 processor-based FastCore embedded core is cycle-accurate and Boolean equivalent to the original Cortex-A8 RTL specification.
I don't know whether Intrinsity had to get special rights from ARM to do it, but obviously there's no reason for ARM not to let them do that kind of thing, or to charge anything extra for it. It's very, very different from what you are all saying Qualcomm should do.

The only precedent I am aware of where someone modified an existing ARM core substantially is Handshake Solutions (a Philips subsidiary) which created the first commercial clockless processor, the ARM996HS. But that was a joint collaboration where they together licensed it to a third party, both getting licensing and royalty fees. It's obviously not comparable.
 
What do you mean by memory model?

I did not remember this, but x86 is usually the more strongly ordered architecture in terms of memory consistency compared to many other ISAs.
In a multiprocessor environment on x86, a single processor's writes are seen in the order they were made (there are options that can relax this; I think certain complex ops and vector instructions are more relaxed). Reads can be reordered in some situations relative to other operations.
ARM is weakly ordered, which means that without a barrier instruction, the order in which one CPU's reads and writes become visible may not match program order from the POV of other cores.
That can lead to failures for code that assumes stronger consistency and lacks the needed barriers.
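
To make that concrete, here is a minimal C11 sketch of the classic message-passing idiom; the relaxed atomics stand in for code written (or translated) with x86's ordering in mind and no barriers:

```c
#include <stdatomic.h>

atomic_int data  = 0;
atomic_int ready = 0;

/* CPU 0: write the payload, then raise the flag. */
void producer(void)
{
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    /* On x86, other processors see these stores in program order
     * anyway. On ARM they may become visible in either order unless
     * the flag store is upgraded to a release (i.e. a barrier). */
}

/* CPU 1: wait for the flag, then read the payload. */
int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;  /* spin */
    /* Without an acquire here (another barrier on ARM), this can
     * still return 0 even though ready was observed as 1. */
    return atomic_load_explicit(&data, memory_order_relaxed);
}
```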

I cannot find a reference at this time, but a while back there was talk of some RISC adding a mode that would make its consistency model closer to that of x86.
 
So by memory model, you mean memory coherency analogous to cache coherency.
 
I meant the architecture's defined memory consistency model, which is related to coherence.
Coherence is concerned with how and when different caches and different cores see updates to the same location.
Consistency is concerned with how and when different caches and different cores see updates to different locations.
It can even matter for one core; I think there were some surprises on single cores with weak consistency. And then there are GPUs, which are weak enough that I'm not sure they even count as being consistent.
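
A toy litmus test (illustrative only, using C11 relaxed atomics) that separates the two concepts:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x = 0, y = 0;

/* Coherence: every core must agree on the order of writes to the ONE
 * location x (nobody may observe x go 0 -> 2 -> 1). Consistency: whether
 * the writes to the DIFFERENT locations x and y may be seen out of order. */
void *writer(void *arg)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_store_explicit(&x, 2, memory_order_relaxed);
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    return NULL;
}

void *reader(void *arg)
{
    int ry = atomic_load_explicit(&y, memory_order_relaxed);
    int rx = atomic_load_explicit(&x, memory_order_relaxed);
    /* ry == 1 && rx == 0 is possible on a weakly consistent machine and
     * forbidden under x86's model, yet each location stays coherent. */
    printf("ry=%d rx=%d\n", ry, rx);
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer, NULL);
    pthread_create(&t1, NULL, reader, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```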
 
You are mistaken - it's similar to the A8 architecturally and shares the same NEON ISA version, but it's a from-the-ground-up design.

Because ARM doesn't allow that; it's just asking for trouble to let everyone customise everything. Or if they do allow it, it's certainly not part of their normal license and would be substantially more expensive.

I stand corrected, then.

They were on the lead licensee list for the A9 along with TI and Samsung. I don't see what's ludicrous at all - they might not be a big player share-wise, but they're investing a lot of money in it. You could argue it's ludicrous they are investing so much money into it given how difficult it has been and still will be for them to get a leading position in the market, but that's a separate conversation.

TI + Apple + Samsung is indeed plausible. Apple has an architectural license though... there have been whispers they lost too many PA Semi engineers to finish that project, but I think it's more likely it has simply been delayed and they'll never license Eagle.

I'm not saying anything's ludicrous; it's just that you said it was ridiculous to think that NVIDIA wasn't on that list, while I think there are more plausible candidates, so I don't find it ridiculous.

As for investment, well, NVIDIA is spending a lot of money, but all that money goes into their own R&D alone, right? Why should ARM offer them preferential treatment for that?

Yes and no. Here's the important bit:
I don't know whether Intrinsity had to get special rights from ARM to do it, but obviously there's no reason for ARM not to let them do that kind of thing, or to charge anything extra for it. It's very, very different from what you are all saying Qualcomm should do.

I don't really get what "cycle-accurate and Boolean equivalent" means. Just what did they change, exactly?
 
So by memory model, you mean memory coherency analogous to cache coherency.

There is coherence and there is consistency; the two are parallel concepts. Coherency deals with how various blocks see updates to a single memory location, while consistency deals with the order in which updates to different locations become visible.
 