Predict: The Next Generation Console Tech

I don't think that's how it works in this case. AMD doesn't do a fully custom design for Sony (or MS) and then sell it to the competitor. Sony (or MS) don't own the design, they license it from AMD (just as you can license ARM designs).
AMD offers some basic building blocks (CPU cores, coherency interconnect, memory controllers, southbridge functionality, GPU cores and so on) which a customer can mix and match according to their needs. They offer some limited customization or integration of customer IP on top of that. Sony (or MS) can never obtain exclusive rights to the basic building blocks (it's genuinely AMD's IP); they just obtain a license for a certain combination of AMD-designed components. AMD cannot give some of Sony's IP they may have integrated to MS, of course. But they can still use the same basic building blocks (they are owned by AMD) and design a semi-custom APU according to MS's wishes (which may call for different customizations). I am very sure everyone is aware of this when signing such a contract.
There is no reason for anybody to be angry. It only shows that both MS and Sony probably judged AMD as offering the best overall solution this time around.

Edit:
Part of the reason is probably that AMD can offer such semi-custom designs a lot cheaper than fully custom ones. The non-exclusiveness of the design components lowers the price tremendously, just as Helmore mentioned. ;)

Maybe, but legally defensible and ethically defensible are two different things, and many were skeptical that AMD could pull off the dual contract, so the onus is on AMD to be above any hint of impropriety.

But hey, if Sony is totally cool with the end result, then yeah, no problem.
 
  • Microsoft and possibly Sony are really expecting game developers to move several tasks from the CPU to the GPU
That's the best explanation, but it's not exactly like that. Hardware needs to provide developers with resources to be used in making their games. Back around the year 2000, Sony, like others, was looking for very programmable, fast throughput that was flexible enough to render graphics, encode and decode video, crunch through AI datasets, etc. So they set about developing a high-SIMD chip. Whether MS really wanted a high-SIMD chip themselves in the initial design, or just reacted to the Cell noise, we might never know.

Once those chips found their way into the consoles, the developers used them to calculate whatever functions they needed, which more often than not were graphics tasks. During that time, GPUs have developed to be more flexible with their compute power, meaning the development of high-SIMD CPUs has taken a step back once again.

It's not that workloads have changed, or that designers overestimated anything. You can never have too much power! It's just that the economics of providing flexible compute power have shifted once again. Maybe there's a recognition that all that compute power gets spent on graphics anyway, so it can be moved into the GPU, but I dare say that it's a simple price/performance balance this time around.
 
Interesting fact: when the Xbox 360 launched, Xenon had a theoretical peak floating point throughput of 115.2 GFLOPS, while a contemporary state-of-the-art desktop CPU, like the Athlon 64 X2 4800+, had a peak of about 19.2 GFLOPS. If the new consoles really have 8 unmodified Jaguar cores running at 1.6 GHz, the peak floating point throughput would be 102.4 GFLOPS, while a contemporary 4-core Haswell processor will peak at about twice that number (204.8 GFLOPS at 3.2 GHz). I guess that one of the following statements must be true:

A 4-core Haswell would be twice that: 409.6 GFLOPS via AVX2.

Current Sandy Bridge and Ivy Bridge processors peak at 204.8 GFLOPS with 4 cores at 3.2 GHz.
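For reference, here's a minimal Python sketch of the back-of-the-envelope arithmetic behind these peak numbers; the per-cycle flop counts are the commonly quoted assumptions for each design, not measured figures:

# Theoretical peak = cores * single-precision flops per core per cycle * clock (GHz).
# The per-cycle figures below are assumptions for illustration.
def peak_gflops(cores, flops_per_cycle, clock_ghz):
    return cores * flops_per_cycle * clock_ghz

print(round(peak_gflops(3, 12, 3.2), 1))   # Xenon: 115.2 (including non-arithmetic ops, see below)
print(round(peak_gflops(8, 8, 1.6), 1))    # 8 Jaguar cores @ 1.6 GHz: 102.4
print(round(peak_gflops(4, 16, 3.2), 1))   # 4-core Sandy/Ivy Bridge (AVX) @ 3.2 GHz: 204.8
print(round(peak_gflops(4, 32, 3.2), 1))   # 4-core Haswell (AVX2, dual FMA) @ 3.2 GHz: 409.6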
 
It does not have 64-bit vector units.

I know that. My point is that a core with 128-bit vector units and 0.2 IPC can easily be outperformed by a smarter core (even by 2005 standards) with 64-bit vector units. I wonder whether the Xbox 360 design team expected game code to achieve such a low IPC. The actual IPC achieved by game code this generation might be one of the reasons to go with Jaguar cores, which are clearly lacking in terms of peak floating point throughput (by 2013 standards) this time around, rather than, for instance, PowerPC A2 cores.
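To illustrate that point with made-up but plausible numbers (not measurements): effective vector throughput is roughly vector instructions retired per cycle × SIMD lanes × clock, so a low-IPC wide-vector core can indeed lose to a higher-IPC narrow-vector one.

# Illustrative only: hypothetical IPC and clock figures, not measurements.
# effective GFLOPS ~ vector instructions per cycle * SP lanes * clock (GHz)
def effective_gflops(vec_ipc, lanes, clock_ghz):
    return vec_ipc * lanes * clock_ghz

print(round(effective_gflops(0.2, 4, 3.2), 2))  # wide 128-bit core at 0.2 IPC: ~2.56 GFLOPS sustained
print(round(effective_gflops(0.8, 2, 2.4), 2))  # smarter 64-bit-vector core at 0.8 IPC: ~3.84 GFLOPS sustained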
 
I know that. My point is that a core with 128-bit vector units and 0.2 IPC can easily be outperformed by a smarter core (even by 2005 standards) with 64-bit vector units. I wonder whether the Xbox 360 design team expected game code to achieve such a low IPC. The actual IPC achieved by game code this generation might be one of the reasons to go with Jaguar cores, which are clearly lacking in terms of peak floating point throughput (by 2013 standards) this time around, rather than, for instance, PowerPC A2 cores.
The apocryphal history states that MS actually wanted an out-of-order CPU for Xenon, but IBM could not deliver in the required timeframe. In terms of actual workload, a Jaguar core is more than a match for a Xenon core, even with the dual issue. The only place you'll have problems would be highly optimized streaming vector calculations, used extensively by most audio engines. With very few jumps and a nice large data set, a Xenon core would almost double the performance of a Jaguar one.
 
I guess that one of the following statements must be true:
  • the designers of the Xbox 360 and PS3 greatly overestimated the need for high peak floating point throughput;
  • the requirements of game engines changed throughout the generation, hinting at different priorities for the next generation;
  • Microsoft and possibly Sony are really expecting game developers to move several tasks from the CPU to the GPU;
  • the budget for R&D was not high enough to get a custom CPU this time around and no CPU with the target specifications was available, so they went with Jaguar cores.

I would be curious what kind of flop throughput is actually sustainable on the 360's CPU. My guess is that it's probably below 50% (if not a lot less) of the peak. I just can't imagine the memory and caches being capable of sustaining data for that kind of performance.
 
I would be curious what kind of flop throughput is actually sustainable on the 360's CPU. My guess is that it's probably below 50% (if not a lot less) of the peak. I just can't imagine the memory and caches being capable of sustaining data for that kind of performance.

It's got enough registers that for the right workload, where you can actually accommodate the memory latency, you could hit close to 100%.

But that's not really indicative of real workloads.

I've always maintained that both the 360 CPU and Cell were focused on the wrong things, designed by marketing.

Ten years ago in-order CPUs were back in vogue, and many people thought that compilers could solve the scheduling problem. It didn't happen.
 
I would be curious what kind of flop throughput is actually sustainable on the 360's CPU. My guess is that it's probably below 50% (if not a lot less) of the peak. I just can't imagine the memory and caches being capable of sustaining data for that kind of performance.
No high-performance system can provide sustained peak flop throughput. A 32-bit calculation requires 4 bytes (let's take it as 4 bytes in addition to what's in registers, rather than having to load two 32-bit values to operate on). 1 GFLOPS would thus require 4 gigabytes per second of bandwidth. 100 GFLOPS would need 400 GB/s, outstripping every bus available by a mile. A 1 TFLOPS GPU would consume 4 terabytes of data a second at sustained peak... :oops:

If your code can operate out of caches/registers, recycling the data being consumed (a set of serial operations applied to the initial data), you can get better flop throughput, but that's a limited set of code.
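A minimal sketch of that arithmetic, under the same simplifying assumption that every single-precision operand (4 bytes) has to come from memory rather than from registers or cache:

# Assumes each flop reads one fresh 4-byte operand from memory (worst case).
def required_bandwidth_gb_s(gflops, bytes_per_flop=4):
    return gflops * bytes_per_flop

print(required_bandwidth_gb_s(1))     # 1 GFLOPS   ->    4 GB/s
print(required_bandwidth_gb_s(100))   # 100 GFLOPS ->  400 GB/s
print(required_bandwidth_gb_s(1000))  # 1 TFLOPS   -> 4000 GB/s (4 TB/s)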
 
It's got enough registers that for the right workload, where you can actually accommodate the memory latency, you could hit close to 100%.
I think Xenon's often quoted 115.2 GFLOPS is really a bit optimistic.
It assumes issuing one 128-bit FMA/dot product (8 flops) in the VMX unit plus another 4 flops from another instruction, for 12 flops per cycle per core.
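Spelling that assumption out (just a restatement of the numbers above, not a new measurement):

# Xenon: 3 cores at 3.2 GHz
print(round(12 * 3 * 3.2, 1))  # 115.2 GFLOPS if you count 8 VMX flops + 4 more per cycle per core
print(round(8 * 3 * 3.2, 1))   # 76.8 GFLOPS counting only the VMX FMA/dot product, close to the ~77 GFLOPS cited below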
 
Interesting fact: when the Xbox 360 launched, Xenon had a theoretical peak floating point throughput of 115.2 GFLOPS, while a contemporary state-of-the-art desktop CPU, like the Athlon 64 X2 4800+, had a peak of about 19.2 GFLOPS. If the new consoles really have 8 unmodified Jaguar cores running at 1.6 GHz, the peak floating point throughput would be 102.4 GFLOPS, while a contemporary 4-core Haswell processor will peak at about twice that number (204.8 GFLOPS at 3.2 GHz). I guess that one of the following statements must be true:
  • the designers of the Xbox 360 and PS3 greatly overestimated the need for high peak floating point throughput;
  • the requirements of game engines changed throughout the generation, hinting at different priorities for the next generation;
  • Microsoft and possibly Sony are really expecting game developers to move several tasks from the CPU to the GPU;
  • the budget for R&D was not high enough to get a custom CPU this time around and no CPU with the target specifications was available, so they went with Jaguar cores.
The first option is quite possible, since the average efficiency of Xbox 360 code is so low, at 0.2 instructions per cycle, that the usefulness of such a high peak floating point throughput is quite hard to judge. A wider, out-of-order CPU design could probably reach a much higher IPC and deliver the same actual performance even with 64-bit vector units. The third option would be the most interesting one, since it implies that we might see either a relatively big APU, which probably doesn't make sense on the PC due to lack of main memory bandwidth, or a really fast interconnect between the CPU and the GPU.

It was 77 GFLOPS for Xenon's theoretical performance, according to this IBM graphic reference (within the Forbes link).

EDIT: "The 115.2 figure is the theoretical peak if you include non-arithmetic instructions such as permute. These are not normally included in any measure of FLOPs."
 
It seems to have served its purpose perfectly, and led the way to the accelerated processing revolution years later. I would have thought that gaming devs would prefer a very powerful OoO PPE combined with lots of SPEs instead of having to do GPGPU code. But what do I know.
If the heavy lifting code on the SPEs is vertex culling, compute shader workloads for complex illumination, mesh skinning, post-processing... those are tasks ideally suited for GPUs; no GPGPU about it. SPEs are AFAIK mostly used for graphics tasks that the RSX isn't great at.
 
Just as an aside, how long does it take to design a custom CPU from the ground up, and what size of an engineering team are we talking? Be it IBM, Intel, whatever.
 
If the heavy lifting code on the SPEs is vertex culling, compute shader workloads for complex illumination, mesh skinning, post-processing... those are tasks ideally suited for GPUs; no GPGPU about it. SPEs are AFAIK mostly used for graphics tasks that the RSX isn't great at.
But there's a lot of image decoding and processing for the Kinect-2 or Move (or even the rumored Kinect-like EyeToy), there's decryption and decompression from the disc media (with an SSD it's going to need ten times more processing), RT compression for a remote display (WiiU or PS4->Vita, 720->phone/tablet), and more complex physics, particles and water become possible. Isn't that all much easier on a real CPU with gobs of FP performance? I mean, processing a point cloud and comparing it to a database for the Kinect can't be easy to do on a GPU.
 
But there's a lot of image decoding and processing for the Kinect-2 or Move (or even the rumored Kinect-like EyeToy), there's decryption and decompression from the disc media (with an SSD it's going to need ten times more processing), RT compression for a remote display (WiiU or PS4->Vita, 720->phone/tablet), and more complex physics, particles and water become possible. Isn't that all much easier on a real CPU with gobs of FP performance? I mean, processing a point cloud and comparing it to a database for the Kinect can't be easy to do on a GPU.
Image decoding and processing is probably better suited to a GPU. Video encoding is more efficient with custom hardware. Decryption and decompression don't need gobs of vector maths. More complex physics and water - PhysX on nVidia GPUs says hello. That leaves AI, audio, and housekeeping, which a multicore general-purpose processor will be fine for.
 
A 4-core Haswell would be twice that: 409.6 GFLOPS via AVX2.

Current Sandy Bridge and Ivy Bridge processors peak at 204.8 GFLOPS with 4 cores at 3.2 GHz.
That thing is going to kill anything "next gen" about "next gen" systems... :LOL:
It's a monster.
Devs in general hate Cell. I personally hate it with a burning passion. Ever meet some old-timer PS3 devs, buy the one with prematurely gray hair a beer, and you get to listen to vitriol-filled horror stories about trying to fit normal processing into the programming model of the Cell. Sure, we'll use it if it's the only game in town, and we'll even grumpily admit that it's actually really good at a narrow set of tasks, but I don't think I've ever heard anyone who has actually programmed on it say they'd prefer it over anything other than having their fingers amputated. And even that would probably make a lot of us pause and think.
What makes me kind of sad is that, even though it never shipped, when I read the few here who had the chance to play with Larrabee, I get the (maybe wrong, possibly reading too much into it) sense that they really enjoyed working on it.
 
Well we had mostly heard about the older Orbis target specs with 2GB of GDDR5...

As for the "next-gen" game that "we're" already playing, I do recall a Skyrim interview saying as much.
 