There is indeed no Cell 2.0, end of the story no matter how I look at it.
Cell 2.0 is here, only it's not really like Cell at all, and it's way, way, way better.
Basically, they're making a poor man's Cell, only easier and x86 style...
I kind of like Ars Technica, but that article is bad and doesn't mean much.
Norden started by focusing on the chips, including the 64-bit x86 CPU that he stressed provided low power consumption and heat. The eight cores are capable of running eight hardware threads, with each core using a 32KiB L1 I-cache and D-cache, and each four-core group sharing 2MiB of L2 cache. The processor will be able to handle things like atomics, threads, fibers, and ULTs, with out-of-order execution and an advanced ISA.
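Nothing in that feature list is exotic from a programmer's point of view. As a minimal illustration (plain portable C++, nothing PS4-specific, and fibers/ULTs are left out since they're just cooperatively scheduled user-level tasks), "eight hardware threads plus atomics" looks like this:

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        // On an 8-core/8-thread part this would typically report 8.
        unsigned n = std::thread::hardware_concurrency();
        std::atomic<long> counter{0};   // atomic shared by all worker threads

        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([&counter] {
                for (int j = 0; j < 1000000; ++j)
                    counter.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto &t : workers) t.join();

        std::printf("%u threads, counter = %ld\n", n, counter.load());
    }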
Sony is building its GPU on what it's calling an extended DirectX 11.1+ feature set, including extra debugging support that is not available on PC platforms. This system will also give developers more direct access to the shader pipeline than they had on the PS3 or through DirectX itself. "This is access you're not used to getting on the PC, and as a result you can do a lot more cool things and have a lot more access to the power of the system," Norden said. A low-level API will also let coders talk directly with the hardware in a way that's "much lower-level than DirectX and OpenGL," but still not quite at the driver level.
The system is also set up to run graphics and computational code synchronously, without suspending one to run the other. Norden says that Sony has worked to carefully balance the two processors to provide maximum graphics power of 1.843 teraFLOPS at an 800MHz clock speed while still leaving enough room for computational tasks. The GPU will also be able to run arbitrary code, allowing developers to run hundreds or thousands of parallelized tasks with full access to the system's 8GB of unified memory.
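For what it's worth, the 1.843 teraFLOPS figure falls straight out of counting every ALU on every cycle, if you assume the widely reported 18 GCN compute units (the CU count isn't in Norden's quote, so treat that part as an assumption):

    18 CUs × 64 ALUs/CU × 2 FLOPs per ALU per cycle (fused multiply-add) × 0.8 GHz = 1,843.2 GFLOPS ≈ 1.84 TFLOPS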
The translation clears up a few things for me... after reading the Google translation I somehow thought they were referring to CPU cache-bypass operations, but apparently it's for the GPU side (others seem to have already picked this up). I wonder if it'll be done by address/via the MMU or if it'll involve special instructions.
Still not totally clear on what this VOLATILE tag means. Something that's part of the cache wouldn't go direct to memory by definition, unless they're wasting cache lines instead of TLB attributes to mark memory as uncacheable. I think it means that the writeback will take the non-coherent bus and won't snoop the CPU's caches, and/or won't be updated by modifications from the CPU. Or maybe it is coherent with changes to main RAM and can be updated, hence actually volatile...
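To make the two readings concrete, here's a purely hypothetical sketch (none of these names are a real PS4 or AMD API; they're made up for illustration) of how a driver could expose the choice between a CPU-snooping writeback path and a non-coherent one for a GPU-visible buffer:

    #include <cstddef>

    // Hypothetical memory hint, just to illustrate the two interpretations above.
    enum class GpuWriteBehavior {
        CoherentSnoop,   // writebacks snoop the CPU caches; CPU sees GPU writes without explicit flushes
        NonCoherent      // writebacks take the non-coherent bus; CPU must flush/invalidate its own copies
    };

    // Hypothetical descriptor: real hardware might express the same thing via
    // TLB/page attributes (uncached or write-combined pages) rather than a per-buffer flag.
    struct GpuBufferDesc {
        std::size_t      size;
        GpuWriteBehavior writes;
    };

Whether that choice ends up expressed per buffer, per page (TLB attributes), or per cache line is exactly the open question.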
So it's nothing at all like the Cell paradigm. Liverpool is in fact a Toaster Oven, while the Cell was a CPU with accelerating cores.
The existence of an APU gives us the ability to come close to the results obtained from the SPU.
We’re trying to replicate the SPU Runtime System (SPURS) of the PS3 by heavily customizing the cache and bus. SPURS is designed to virtualize and independently manage SPU resources.
So it's nothing at all like the Cell paradigm. Liverpool is in fact a Toaster Oven, while the Cell was a CPU with accelerating cores.
The way Cerny is presenting the link between the Cell and Liverpool, while at the same time saying they had to ditch BC, seems to be an attempt to help the porting of PS3 games, as long as they can recompile them and modify them to work on the PS4. They need to avoid a bottleneck that would change the paradigms used to code against the SPUs. I think he's saying the Cell had some strong advantages in this area (the close coupling, and the SPU memory model that is micro-managed and doesn't pollute the cache), so maybe they wanted to modify the APU to make it at least on par with the SPUs in every respect, which would make porting much easier. (Could this help software BC too?)
As for "BC":
Re-coding the SPU "applets" or whatever they're called to something equivalent (that can run on the GPU, possibly, or else on a CPU thread) can't be too horrific a task, can it? Each SPU only has 256KiB of local store available to it, less for actual code. *shrug* I'm not a coder, especially not a PS3 coder, so maybe it would be a huge deal that would require re-engineering most if not all of the rendering code. Who knows.
Maybe ERP or another of our experienced gearheads can offer insights?
SPUs also sat behind a very fat but extremely high-latency bus, AFAIK without direct access to GPU RAM, or at least extremely slow access (something like 25MB/s or such). So to do graphics work on GPU assets, you first had to get the GPU to DMA the data over to XDR memory to avoid the ridiculous speed penalty (with all the overhead and latency that entails, hits to bandwidth to both pools of RAM, etc.), have the SPUs fetch it via DMA (more latency), do the actual processing at 3.2GHz, and then transfer everything back in reverse order. Probably not a net gain compared to running a compute shader on a modern GPU @800MHz...
The SPUs also run @3.2GHz, twice Jaguar's clock and four times the CUs'. OTOH, the CUs pack more power and features.
Realizing Energy Efficiency and Smoothness using a Second Custom Chip with Embedded CPU
Cerny: The second custom chip is essentially the Southbridge, but it also has an embedded CPU. It is always powered, and even when the PS4 is powered off it monitors all I/O systems. The embedded CPU and Southbridge manage download processes and all HDD access, even with the power off.
Yes, but SPUs weren't exactly your average run-of-the-mill general-purpose processors, so maybe you could trade one set of Cell SPU complications for GPGPU complications? (Yeah yeah... programming doesn't work like that, I know.) In any case, how many types of SPU jobs do PS3 games include these days, in general? If it's just a couple you could perhaps brute-force it, since rendering a PS3 game on hardware more than half a decade more recent shouldn't be too taxing, one would think.
Porting anything that ran on a general-purpose processor, even one with limited memory, to GPU code so that it runs with any degree of efficiency is at best hard and sometimes impossible.
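A made-up illustration of one reason why (generic C++, nothing PS3-specific): data-dependent branching like this is cheap on a scalar core such as an SPU, but on a GPU, whenever lanes within a 64-wide wavefront disagree on the condition, the hardware runs both paths with lanes masked off, so you pay for both:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Trivial on a scalar/in-order core: each element takes one of the two paths.
    // On a GPU, divergent lanes force the wavefront through both paths.
    std::vector<float> process(const std::vector<float>& in, float threshold) {
        std::vector<float> out(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {
            if (in[i] < threshold)
                out[i] = in[i] * 2.0f;                              // cheap path
            else
                out[i] = std::sqrt(in[i]) + std::log(in[i] + 1.0f); // expensive path
        }
        return out;
    }

And that's only one of the pitfalls; irregular memory access and small local scratch memory per work-group are others.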
Maybe Sony's additional debug hardware they purportedly included in the PS4 will help with that. Surely there have to be performance counters exposed to developers so they can properly examine how their code actually runs on the hardware...?
It's also often difficult to determine what the actual bottleneck is in a compute shader, and intuitions tend not to be correct.
Yes, the Japanese interview has a lot more tech detail that clears up a few things.
First and foremost, it talks about Cerny's technical ambition: To create a "seamless" programming model + platform for programming the CPU and GPU together, like how they programmed PPU and SPU together.
I took a quick look at the leaked SPURS doc.
The SPUs and PPU are programmed in C/C++. They load the SPURS kernel onto the SPUs (and the PPU) so that they manage themselves without PPU intervention. There is also some sort of real-time scheduling system in place to hang tasks/jobs on work queues. The system knows about deadlines and schedules tasks around a set schedule (and whether they have already missed it).
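A very rough sketch of that idea in plain C++ (just to illustrate the "workers pull deadline-ordered jobs themselves, without a central core handing them out" pattern; this is not the SPURS API and all the names are made up):

    #include <chrono>
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    struct Job {
        Clock::time_point     deadline;   // earliest-deadline-first ordering
        std::function<void()> work;
        bool operator>(const Job& o) const { return deadline > o.deadline; }
    };

    class JobQueue {
    public:
        void push(Job j) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(j)); }
            cv_.notify_one();
        }
        // Workers (the "SPUs") pull jobs themselves; no central core intervenes per job.
        bool pop(Job& out) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty() || done_; });
            if (q_.empty()) return false;
            out = q_.top(); q_.pop();
            return true;
        }
        void shutdown() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
        }
    private:
        std::priority_queue<Job, std::vector<Job>, std::greater<Job>> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
    };

    // Each worker loops on its own, checking deadlines as it goes.
    void worker(JobQueue& q) {
        Job j;
        while (q.pop(j)) {
            if (Clock::now() > j.deadline) continue;  // already missed: skip (or log)
            j.work();
        }
    }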
You debug the PPUs and SPUs in an integrated graphics debugger.
This is Cerny's goal but they are not there yet. It's not like OpenCL because the GPU doesn't need to switch compute/graphics mode per se. They probably want to schedule the CUs like the SPUs in SPURS in one consistent framework.
[... and I have to go fetch my kid now ]
EDIT:
I think the gist of it is they are trying to hammer away any low level obstacles, and at the same time put in the necessary tweaks to rope the CUs closer to the CPU.
Gotta run now !
Really, I just quoted your quote of what Chris Norden said.
For the GPU you cannot get 100% = 100% + something greater than 0%. It isn't going to happen no matter how much you wish it were so. There is a maximum of 100%: if 100% is being used for graphics, that leaves 0% for compute.
And you are STILL ignoring that he specifically said two processors. Let me repeat that for you again. Two processors. And in case you forgot already. Two processors.
Regards,
SB
Just so long as you understand that the 1.84TF number is arrived at by using every ALU on every cycle.
So yes, you could gain efficiency, but just because you're running code that isn't particularly graphics-intensive at certain times doesn't mean those are the times when you'll need extra compute, just because the GPU happens to be available.