XBox One Backwards Compatibility and Xbox One X Enhancements for X360 and OG (XO XOX BC)

I don't think installing the game and running from the HDD on 360 had a Massive Effect (see what I did there?) on pop-in. It did reduce pop-in in a lot of cases, but not significantly. Pretty much all UE games still had it even after installing.

In that case, I'm going to guess it really is down to RAM caching with aggressive prefetch. While I don't think they could quite fit the entire game they can get most of it and could be spending most idle HDD time prefetching. XB1 does have a faster HDD than XB360, but it appears to be only moderately so.

I wonder whether it could have anything to do with decompression. We know the X1 has decompression hardware, and as far as I know the 360 did not. Could this account for the difference?

If an OS or library routine for decompression is being used in conjunction with high level emulation then it's possible. But I doubt this alone would make such a dramatic performance difference.
 
And as 3dilettante said, it's rather straightforward to translate opcodes from one ISA to another. It's what emulators do at runtime, but instead of executing the instruction, you'd store the emulation code. I've written a tiny converter that way myself, going from C++ to MIPS to C# to .NET. The idea is based on http://nestedvm.ibex.org/ and it actually shows how simple the idea is.
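
Just to show how simple the idea is, here's a toy sketch of the "store the emulation code instead of executing it" approach, using a made-up three-instruction guest ISA (purely illustrative, nothing to do with any real converter):
Code:
// An interpreter would *execute* each decoded instruction; a translator
// *emits* the code it would have run, so it can be compiled ahead of time.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

enum Op : uint8_t { LOADI, ADD, STORE };              // imaginary guest opcodes
struct Insn { Op op; uint8_t rd, rs; uint32_t imm; };

std::string translate(const std::vector<Insn>& prog) {
    std::string out = "void guest_entry(uint32_t* r, uint8_t* mem) {\n";
    for (const Insn& i : prog) {
        char line[128];
        switch (i.op) {
        case LOADI: std::snprintf(line, sizeof line, "  r[%u] = %uu;\n",
                                  (unsigned)i.rd, (unsigned)i.imm); break;
        case ADD:   std::snprintf(line, sizeof line, "  r[%u] += r[%u];\n",
                                  (unsigned)i.rd, (unsigned)i.rs); break;
        case STORE: std::snprintf(line, sizeof line, "  *(uint32_t*)&mem[%uu] = r[%u];\n",
                                  (unsigned)i.imm, (unsigned)i.rd); break;
        }
        out += line;
    }
    return out + "}\n";
}

int main() {
    std::vector<Insn> prog = { {LOADI, 0, 0, 42}, {LOADI, 1, 0, 8},
                               {ADD, 0, 1, 0},    {STORE, 0, 0, 0x100} };
    std::puts(translate(prog).c_str());   // prints C source instead of running the guest code
}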

It's straightforward if you don't care about performance. Optimizing the code to perform as well as possible on the target platform is a very deep problem.

Non-optimized code will probably run faster on x86 than on the XBox CPU. When the PS3 came out (which is similar on the CPU core side), there were benchmarks of PPU vs x86, and most of the time vanilla C++ code ran on the PPU about as fast as it would run on a 700MHz - 1300MHz x86 Core 2 (you might still find those results on Google). Modern Jaguar cores @ 1.86GHz might be a good match for vanilla PowerPC code. On top of that, the X360 runs two threads per core, which effectively means 1.6GHz per thread, and that's even less raw MHz power, eh?

There are some problems with this comparison:

1) You're looking at native code performance. You'll be lucky to average 50% of that with translated code, and that's assuming some fairly generous properties of the code and emulator. The binary translator simply doesn't have access to the same level of structure and information that the compiler would have, and it pays for it, especially when the original code was for an arch with twice as many registers. It also pays for having to manage some branch targets that aren't known statically (there's a rough sketch of that after this list).
2) That performance comparison was probably done using GCC, vs what production code would have used, which was IBM's compiler, which matured over time.
3) The common 2x1.6GHz claim is very misleading, first of all you get full access to the CPU if only one thread is running. Second, it's like SMT - there's some contention but it's not nearly like halving performance, that would defeat the point. Nonetheless, the performance per thread is nowhere close to 100% what it is when only one thread is running so there's definitely some reprieve for code that heavily multithreads the cores. BUT this is assuming that they don't rely on a high degree of synchronization for performance or correctness, one that the emulator would probably not be able to provide while running the threads on separate cores (and if they have to constantly switch threads on the same core to get the same effect performance will tank). That's kind of the thing here, with so few games supported right now we don't know what kind of potential compatibility it has.
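
To make point 1) concrete, here's roughly what an indirect guest branch (a bctr/blr, i.e. a jump through a register) turns into after translation; every name here is made up, it's just the shape of the problem:
Code:
#include <cstdint>
#include <cstdio>
#include <unordered_map>

using HostBlock = void (*)();                       // translated host code for one guest block
std::unordered_map<uint32_t, HostBlock> block_map;  // guest PC -> translated code

void translate_block(uint32_t guest_pc) {           // stand-in for the actual translator
    std::printf("translating block at 0x%08x\n", (unsigned)guest_pc);
    block_map[guest_pc] = [] {};                    // pretend we emitted something runnable
}

// A direct branch can be wired up at translation time; a register-indirect one
// pays for a lookup like this on every single execution.
void dispatch_indirect(uint32_t guest_pc) {
    auto it = block_map.find(guest_pc);
    if (it == block_map.end()) {                    // target never seen before
        translate_block(guest_pc);
        it = block_map.find(guest_pc);
    }
    it->second();
}

int main() { dispatch_indirect(0x82000000); dispatch_indirect(0x82000000); }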

But the other side of this is that games don't have to be using 100% CPU time on XB360 and many probably remain GPU limited or even frame time limited (especially the XBLA games) despite the CPU being so weak.

Yet Altivec code was always famous for wiping the floor with SSE code: not only does it support multiply-add, it also has tons of registers, and its ops are non-destructive, naming more registers per op. If you give your code some Altivec love, you can really gain a lot of performance.
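
To illustrate with a rough sketch (assuming pre-FMA SSE, which as far as I know is all the XB1's Jaguar cores have for packed floats): Altivec's vmaddfp gives you d = a*b + c in one non-destructive op with three distinct sources, while the SSE equivalent is two ops and the destination doubles as a source.
Code:
#include <xmmintrin.h>

__m128 madd_sse(__m128 a, __m128 b, __m128 c) {
    // Ends up as something like: mulps xmm0, xmm1 / addps xmm0, xmm2
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

int main() {
    __m128 r = madd_sse(_mm_set1_ps(2.0f), _mm_set1_ps(3.0f), _mm_set1_ps(1.0f));
    float out[4];
    _mm_storeu_ps(out, r);
    return out[0] == 7.0f ? 0 : 1;   // 2*3 + 1
}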

I think if developers optimized for Altivec, they optimized the parts that were really time critical (e.g. some physics or AI raycasts). Hence those hard-to-emulate Altivec ops will be issued when it's most critical (e.g. combat), and that would hurt emulation twice.

Heavily optimized Altivec code will indeed be hard to deal with especially because it has so many registers. You're going to get inner loops that routinely blow the 16 XMM register budget. The emulator will probably have to heavily access registers in RAM to make up for this.
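
As a sketch of what that spilling looks like (not a claim about how Microsoft's emulator actually works): the 32-entry guest VMX register file ends up living in memory, and every translated vector op loads its sources from there and stores its result back.
Code:
#include <xmmintrin.h>

struct GuestVmxContext {
    alignas(16) float vr[32][4];   // the 32 128-bit Altivec registers, kept in RAM
};

// Translated form of a single guest "vaddfp vD, vA, vB":
inline void emulate_vaddfp(GuestVmxContext& ctx, int vd, int va, int vb) {
    __m128 a = _mm_load_ps(ctx.vr[va]);            // reload guest sources from memory
    __m128 b = _mm_load_ps(ctx.vr[vb]);
    _mm_store_ps(ctx.vr[vd], _mm_add_ps(a, b));    // spill the result straight back
    // Natively compiled code would have kept all three values in registers.
}

int main() {
    GuestVmxContext ctx = {};
    ctx.vr[7][0] = 1.0f;
    ctx.vr[12][0] = 2.0f;
    emulate_vaddfp(ctx, 3, 7, 12);
    return ctx.vr[3][0] == 3.0f ? 0 : 1;
}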

But there's been tons of presentation material on optimizing XB360 and PS3 CPU code that go far beyond just using well scheduled Altivec, so it's pretty safe to say that a lot of major games were heavily optimized throughout and would do a lot better than the vanilla C++ comparison you gave.
 
It's straightforward if you don't care about performance. Optimizing the code to perform as well as possible on the target platform is a very deep problem.
I did not mean to imply that translation or cached translation would lead to native performance, just that it would be needed to avoid the order of magnitude hit that prior discussions about emulating brought up. The set of games that might tolerate weaker performance that is still within a generous range of "good enough" should be broader.
 
It's straightforward if you don't care about performance. Optimizing the code to perform as well as possible on the target platform is a very deep problem.
1) You're looking at native code performance. You'll be lucky to average 50% of that with translated code, and that's assuming some fairly generous properties of the code and emulator. The binary translator simply doesn't have access to the same level of structure and information that the compiler would have, and it pays for it, especially when the original code was for an arch with twice as many registers. It also pays for having to manage some branch targets that aren't known statically.
Microsoft wrote the compiler that generated the X360 opcodes; they are probably the best placed to generate something AST-like from the opcodes and produce an optimized binary for a new target platform. That's what drivers do with shader bytecode (which is nowadays optimized for a totally different "imaginary" processor). That's also what .NET does with its bytecode. That's what NVidia's "Denver" does as well. That's why I'd call it 'state of the art' in the high-end software compiler business.
I agree that some information is missing, but at the same time that information would not be of much help, because the data is already laid out with that information in mind, e.g. alignment, padding of struct members, endianness, etc.

2) That performance comparison was probably done using GCC, vs what production code would have used, which was IBM's compiler, which matured over time.
if you're referring to the PPU vs x86 benchmarks, your assumptions are not correct.
e.g. http://web.archive.org/web/20100531...kpatrol.ca/2006/11/playstation-3-performance/
(that's not the comparison I recall, just random 1min of google)

You now have the exact same situation, Microsoft VC++ vs Microsoft VC++; I'd doubt the X360 compiler would be more advanced than the optimizers they use for the transcoder now.

3) The common 2x1.6GHz claim is very misleading, first of all you get full access to the CPU if only one thread is running. Second, it's like SMT - there's some contention but it's not nearly like halving performance, that would defeat the point.
there are several things to consider
a) It's an in-order CPU. On an OoO core, both threads can try to fill the CPU's execution units while some memory fetches are stalling, while on an in-order design a fetch will completely stall that one pipeline. Running both hardware threads will give better occupation on the instruction side, but per SMT thread it will cause more contention and friction, e.g. on the L1D side, which is actually what's critical for the stalls.
b) If a game does not utilize 6 threads, it's likely there was no need to do so, thus I'd strongly assume it's not a critical code path. If it was critical, then it's likely spread across more cores for more throughput overall, but less throughput per core. That favors the real cores of the XBOne.

I have no hard numbers to back my assumption, tho.

Nonetheless, the performance per thread is nowhere close to 100% what it is when only one thread is running so there's definitely some reprieve for code that heavily multithreads the cores. BUT this is assuming that they don't rely on a high degree of synchronization for performance or correctness, one that the emulator would probably not be able to provide while running the threads on separate cores (and if they have to constantly switch threads on the same core to get the same effect performance will tank). That's kind of the thing here, with so few games supported right now we don't know what kind of potential compatibility it has.
I agree. And yes, I'm wildly guessing here. I don't claim it's the way I say; from my experience it's just what I'd assume most likely. MS is a great [edit: not OS, I meant:] COMPILER company, and emulating opcodes at runtime or even 1:1 translation doesn't sound to me like it would perform as well as a transcoded binary.

But the other side of this is that games don't have to be using 100% CPU time on XB360 and many probably remain GPU limited or even frame time limited (especially the XBLA games) despite the CPU being so weak.
makes me curious to see some more recent games that made the XB360 sweat :)
Is there something?

Heavily optimized Altivec code will indeed be hard to deal with especially because it has so many registers. You're going to get inner loops that routinely blow the 16 XMM register budget. The emulator will probably have to heavily access registers in RAM to make up for this.
Which again makes it more likely they'd go the complex way of parsing the opcodes into an AST and running the full VC backend for x86. It's not just register renaming and instruction translation; quite some code would be done in a different way (e.g. some load, modify, store sequence on the XB360 could end up as one instruction like "inc memory").

But there's been tons of presentation material on optimizing XB360 and PS3 CPU code that go far beyond just using well scheduled Altivec, so it's pretty safe to say that a lot of major games were heavily optimized throughout and would do a lot better than the vanilla C++ comparison you gave.
I agree. That's what I wanted to point out as well. There will be some 10% of the code, e.g. physics, AI, etc., that is really critical, heavily optimized, and used in time-critical cases. So in those moments (e.g. combat) not only is there way more pressure, but the code is also harder to translate. This might explain why in non-critical cases (e.g. maybe cutscenes) the emulated version might run way better, while (as some claim) in action moments the FPS drops to 10fps. (Again, my wild guess :) )

Makes me wonder whether MS might have a farm of programmers profiling critical code bits and re-writing those to x86 (at least C/C++ code with SSE intrinsics), and we'll get patches further improving game performance by some 2x, 3x, 4x in those low-fps situations.
 
Please release OG Xbox back compat now :). Would love to play some Soul calibur 2 again
 
Microsoft wrote the compiler that generated the X360 opcodes; they are probably the best placed to generate something AST-like from the opcodes and produce an optimized binary for a new target platform. That's what drivers do with shader bytecode (which is nowadays optimized for a totally different "imaginary" processor). That's also what .NET does with its bytecode. That's what NVidia's "Denver" does as well. That's why I'd call it 'state of the art' in the high-end software compiler business.
I agree that some information is missing, but at the same time that information would not be of much help, because the data is already laid out with that information in mind, e.g. alignment, padding of struct members, endianness, etc.

This isn't anything like VM bytecodes which are designed with JIT in mind, and it's nothing like Denver which has many architectural decisions that lend towards being useful for translating ARM (lots of registers, transactional memory, very efficient asserts, hardware branch mapping). x86 is not designed at all to be an efficient target for PowerPC translation.

The missing information I'm referring to is not low level stuff that you're talking about but variable and control flow graphs which allow the compiler to do better register allocation and other optimizations than they can working with unannotated machine code.

I've written multiple reasonably high performance emulators employing binary translation so I do have some experience with this.

if you're referring to the PPU vs x86 benchmarks, your assumptions are not correct.
e.g. http://web.archive.org/web/20100531...kpatrol.ca/2006/11/playstation-3-performance/
(that's not the comparison I recall, just random 1min of google)

You now have the exact same situation, Microsoft VC++ vs Microsoft VC++; I'd doubt the X360 compiler would be more advanced than the optimizers they use for the transcoder now.

So 2006 Visual Studio, which I guess you're assuming is the state of the art in XBox 360 compilation. I doubt that. I've heard multiple sources claim that IBM's compilers were the best which would make sense.

At any rate, where does this comparison say VC++ was used? For binaries running on Linux no less?

there are several things to consider
a) It's an in-order CPU. On an OoO core, both threads can try to fill the CPU's execution units while some memory fetches are stalling, while on an in-order design a fetch will completely stall that one pipeline. Running both hardware threads will give better occupation on the instruction side, but per SMT thread it will cause more contention and friction, e.g. on the L1D side, which is actually what's critical for the stalls.

Actually, despite being in-order the L1 and L2 caches in Xenon/Cell PPE are non-blocking and support hit-under-miss so multiple threads increases MLP to the caches.

b) If a game does not utilize 6 threads, it's likely there was no need to do so, thus I'd strongly assume it's not a critical code path. If it was critical, then it's likely spread across more cores for more throughput overall, but less throughput per core. That favors the real cores of the XBOne.

I have no hard numbers to back my assumption, tho.

So you're saying if it doesn't peg six threads then it probably won't peg one thread? There's tons of software that's poorly threaded but has heavy single threaded requirements. Games have gradually been becoming less so, but that was a hard lesson for developers over the lifetime of the XBox 360. Parallelizing code isn't always easy (and sometimes doesn't give enough benefit even when you do it).

I agree. And yes, I'm wildly guessing here. I don't claim it's the way I say; from my experience it's just what I'd assume most likely. MS is a great [edit: not OS, I meant:] COMPILER company, and emulating opcodes at runtime or even 1:1 translation doesn't sound to me like it would perform as well as a transcoded binary.

Visual Studio's generated code quality is actually not that incredible these days (there's a reason people use ICC), but even if their binary translation was the best in the world there are still hard practical limits.

Clearly they've done a very very good job just based on what we're seeing so far. I'm not saying their emulator sucks, just that I'm not so confident that it'll be able to handle whatever is thrown at it. We don't really know yet how hard games will push it.

makes me curious to see some more recent games that made the XB360 sweat :)
Is there something?

I'm sure there are games that push the system a lot harder than Mass Effect did, let alone Kameo, Perfect Dark Zero, or the variety of XBLA games currently available.

Which again makes it more likely they'd go the complex way of parsing the opcodes into an AST and running the full VC backend for x86. It's not just register renaming and instruction translation; quite some code would be done in a different way (e.g. some load, modify, store sequence on the XB360 could end up as one instruction like "inc memory").

No recompiling emulators out there use a C compiler as a back end for generating code (people talk about using LLVM for this now and again, but I'm not aware of anyone actually doing so). For one thing, code is generated dynamically (you can't generate all of it statically), which places somewhat of a limit on how quickly they can work, and these compilers aren't optimized for compilation speed; they also perform many optimizations that are generally not applicable to code that was already compiled. And a compiler backend really isn't that well suited to optimizing what looks like machine code. You can get the useful peephole-style optimizations you're talking about without throwing an entire compiler backend at it.
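
For what it's worth, here's roughly what I mean by a peephole-style pass, over an invented decoded-instruction format (a real pass would also have to prove the register isn't live afterwards):
Code:
#include <cstdint>
#include <cstdio>
#include <vector>

enum class Kind { Load, AddImm, Store, Other };
struct GuestOp { Kind kind; int reg; uint32_t addr; int32_t imm; };

void emit(const std::vector<GuestOp>& ops) {
    for (size_t i = 0; i < ops.size(); ++i) {
        // Pattern: load rX,[addr] ; addi rX,rX,imm ; store rX,[addr]
        if (i + 2 < ops.size() &&
            ops[i].kind     == Kind::Load &&
            ops[i + 1].kind == Kind::AddImm && ops[i + 1].reg == ops[i].reg &&
            ops[i + 2].kind == Kind::Store  && ops[i + 2].reg == ops[i].reg &&
            ops[i + 2].addr == ops[i].addr) {
            std::printf("add dword [0x%08x], %d\n", (unsigned)ops[i].addr, ops[i + 1].imm);
            i += 2;                               // three guest ops folded into one host op
            continue;
        }
        std::printf("<translate op %zu normally>\n", i);
    }
}

int main() {
    // lwz r5,[0x1000] ; addi r5,r5,1 ; stw r5,[0x1000]  ->  add dword [0x1000], 1
    emit({ {Kind::Load, 5, 0x1000, 0}, {Kind::AddImm, 5, 0, 1}, {Kind::Store, 5, 0x1000, 0} });
}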

I agree. That's what I wanted to point out as well. There will be some 10% of the code, e.g. physics, AI, etc., that is really critical, heavily optimized, and used in time-critical cases. So in those moments (e.g. combat) not only is there way more pressure, but the code is also harder to translate. This might explain why in non-critical cases (e.g. maybe cutscenes) the emulated version might run way better, while (as some claim) in action moments the FPS drops to 10fps. (Again, my wild guess :) )

Yeah, but there could be other speed traps that are less obvious. I don't know how XBox 360 games are written, but if they're allowed to map hardware I/O into the address space and then proceed to hammer it, this will be very slow because of how it'll have to trap to the hypervisor (I'm assuming that's how they're doing it). On the other hand, if this is all abstracted by OS syscalls to begin with, it's not necessarily a problem.
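
For illustration, the general shape of guest MMIO handling in an emulator looks something like this (a generic sketch, nothing to do with Microsoft's actual implementation; the address range and device are made up). The point is that the MMIO path costs a fault or hypervisor exit per access, while ordinary RAM accesses stay on the fast path:
Code:
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <functional>
#include <map>

struct MmioRange {
    uint64_t base, size;
    std::function<void(uint64_t off, uint32_t val)> write;  // emulated device callback
};

std::map<uint64_t, MmioRange> mmio;   // keyed by base address
uint8_t ram[64 * 1024];               // stand-in for normal guest RAM

void guest_store32(uint64_t addr, uint32_t val) {
    auto it = mmio.upper_bound(addr);
    if (it != mmio.begin() && (--it, addr - it->second.base < it->second.size)) {
        it->second.write(addr - it->second.base, val);   // slow, trapped path
        return;
    }
    std::memcpy(&ram[addr % (sizeof ram - sizeof val)], &val, sizeof val);  // fast path
}

int main() {
    mmio[0xEC800000] = { 0xEC800000, 0x1000,              // made-up device range
        [](uint64_t off, uint32_t v) { std::printf("device reg +0x%llx = %u\n",
                                                   (unsigned long long)off, v); } };
    guest_store32(0xEC800000 + 0x10, 42);   // hits the emulated device
    guest_store32(0x2000, 7);               // ordinary RAM write
}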

Makes me wonder whether MS might have a farm of programmers profiling critical code bits and re-writing those to x86 (at least C/C++ code with SSE intrinsics), and we'll get patches further improving game performance by some 2x, 3x, 4x in those low-fps situations.

That very well could be the plan for some games if they haven't already done it. They already admit that they're packaging a unique binary of the emulator with every game so they can tweak it on a per-game basis, that could easily include custom HLE for specific games.

Where it matters, it's probably going to be way more work than just a few functions though. The 90/10 rule is very often overstated. At some point it has to make more sense to lean on some kind of library that uses parts of the emulator to help port the game, and to get the company to let them do a real port.
 
In that case, I'm going to guess it really is down to RAM caching with aggressive prefetch. While I don't think they could quite fit the entire game they can get most of it and could be spending most idle HDD time prefetching. XB1 does have a faster HDD than XB360, but it appears to be only moderately so.

How much is reused or dropped from memory and then reused quickly?
They have the space to just leave data in memory so the next read is quicker, evicting last-used-first-out once the allocated space is full, which would be a massive cache.

More like hybrid SSD caching: basic, but probably quite effective.
 
How much is reused or dropped from memory and then reused quickly?

I don't know; in CPU caches the answer is a lot, which is why LRU tends to be viewed as the best replacement policy. Although even there, there are enough exceptions that it can be worth switching to other replacement strategies adaptively: http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ But not necessarily MRU like you describe.

Disk access patterns in games may be different. If an entire area or level is loaded there's a good chance you won't be loading it again for a while or ever. But if pieces of a level are streamed in that may not be the case. Disk accesses are so slow that it can make sense to invest some CPU time in coming up with a decent adaptive algorithm here too.
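
For reference, the LRU policy being talked about is basically this (a toy sketch; the block size and the "disk read" are placeholders, and nothing here reflects what the XB1 OS actually does):
Code:
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class BlockCache {
    size_t capacity_;
    std::list<uint64_t> lru_;   // front = most recently used block id
    std::unordered_map<uint64_t,
        std::pair<std::vector<uint8_t>, std::list<uint64_t>::iterator>> map_;

    std::vector<uint8_t> read_from_disk(uint64_t block) {
        return std::vector<uint8_t>(4096, uint8_t(block));   // stand-in for real I/O
    }

public:
    explicit BlockCache(size_t capacity) : capacity_(capacity) {}

    const std::vector<uint8_t>& read(uint64_t block) {
        auto it = map_.find(block);
        if (it != map_.end()) {                      // hit: bump to most-recently-used
            lru_.splice(lru_.begin(), lru_, it->second.second);
            return it->second.first;
        }
        if (map_.size() == capacity_) {              // full: evict the least recently used block
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(block);
        auto res = map_.emplace(block, std::make_pair(read_from_disk(block), lru_.begin()));
        return res.first->second.first;
    }
};

int main() {
    BlockCache cache(2);
    cache.read(1); cache.read(2); cache.read(1); cache.read(3);   // block 2 gets evicted
}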
 
Burnout Paradise was 60fps (but sadly not a Burnout game in all the ways that count, IMO).

Burnout 3 (the ultimate, complete, unlikely-to-be-repeated king of Burnouts) may have been 60fps on the original Xbox, but I don't think it runs at 60fps on Xbox 360. Doesn't feel like it anyway.

Burnout 3 and Flatout:Ultimate Carnage are the main reasons I'm still holding on to my 360.

MS really need to get on with that original Xbox emulator. I have my Kung-fu Chaos disc ready and waiting! (my favourite local multiplayer game ever)
 
Is FlatOut any good? It seems so, if you're keeping your 360 for it.
Do you know if I would be better getting FlatOut 3: Chaos And Destruction instead?
Soz for the OT.

Edit: read some reviews, stay away from FlatOut 3: Chaos and Destruction...
 
Yeah I hear Flatout:CnD was a stinker.

The minigames in Ultimate Carnage were what had me hooked. In a couch multiplayer situation it was a riot. Pure ridiculous awesomeness.

Stone skipping (think: on a beach with flat pebbles skipping on the surf) with a driver ejecting from a jet car? Oh yes. Yes indeed.
 
If I read that correctly, the trouble with racing games is that they no longer have the rights for the songs, or no longer have a partnership with the publisher, so it may take longer to get permission.

Depending on how the songs were packed initially, they might not be able to replace them. It would be nice to allow the users to select their own songs to be used, but that would take an external app to do the manipulation.
 
I would look at the logos as being the big one, since there are about a billion of them per mm2 in some cases. I haven't played racing games enough to know if it's the attempt to simulate real venues or if there is some other reason why songs would affect that genre more.
 