PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

Norden started by focusing on the chips, including the 64-bit x86 CPU that he stressed provided low power consumption and heat. The eight cores are capable of running eight hardware threads, with each core using a 32KiB L1 I-cache and D-cache, and each four-core group sharing 2MiB of L2 Cache. The processor will be able to handle things like atomics, threads, fibers, and ULTs, with out-of-order execution and advanced ISA.
Sony is building its CPU on what it's calling an extended DirectX 11.1+ feature set, including extra debugging support that is not available on PC platforms. This system will also give developers more direct access to the shader pipeline than they had on the PS3 or through DirectX itself. "This is access you're not used to getting on the PC, and as a result you can do a lot more cool things and have a lot more access to the power of the system," Norden said. A low-level API will also let coders talk directly with the hardware in a way that's "much lower-level than DirectX and OpenGL," but still not quite at the driver level.
The system is also set up to run graphics and computational code synchronously, without suspending one to run the other. Norden says that Sony has worked to carefully balance the two processors to provide maximum graphics power of 1.843 teraFLOPS at an 800Mhz clock speed while still leaving enough room for computational tasks. The GPU will also be able to run arbitrary code, allowing developers to run hundreds or thousands of parallelized tasks with full access to the system's 8GB of unified memory.
I kind of like Ars Technica, but that article is bad and doesn't say much.
There are errors to begin with: "Sony is building its CPU on... a DirectX 11.1+ feature set"? They meant the GPU.

Anyway, I would prefer to go by the spec sheet or the Japanese interview (now properly translated; even the Google translation made more sense than that "piece" from Ars Technica).
 
Yes, the Japanese interview has many more technical details that clear up a few things.

First and foremost, it talks about Cerny's technical ambition: to create a "seamless" programming model + platform for programming the CPU and GPU together, the way the PPU and SPUs were programmed together.

I took a quick look at the leaked SPURS doc.

The SPUs and PPU are programmed in C/C++. They load the SPURS kernel onto the SPUs (and PPU) so that the SPUs manage themselves without PPU intervention. There is also a real-time scheduling system for hanging tasks/jobs on work queues. The system knows about deadlines and schedules tasks around a set schedule (and knows if they have already been missed).
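
To make the model concrete, here is a rough C++ sketch of that kind of self-scheduling work queue, with made-up names (Job, WorkQueue, worker_main); it is only an illustration of workers pulling prioritized jobs themselves, not the actual libspurs API, which I have only skimmed.

// Illustrative only: a SPURS-like self-managed work queue. The names are
// invented for this sketch and are not the real libspurs interfaces.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

struct Job {
    int priority;                  // analogue of a SPURS workload priority
    std::function<void()> run;     // the job body (a "SPUlet" stand-in)
    bool operator<(const Job& o) const { return priority < o.priority; }
};

class WorkQueue {
public:
    void push(Job j) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(j)); }
        cv_.notify_one();
    }
    // Workers pull jobs themselves -- the host never dispatches to a specific
    // worker, which is the property Cerny seems to want for the CUs as well.
    bool pop(Job& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !jobs_.empty() || done_; });
        if (jobs_.empty()) return false;
        out = jobs_.top(); jobs_.pop();
        return true;
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
private:
    std::priority_queue<Job> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

// Analogue of the resident kernel each SPU runs: loop, grab work, execute.
void worker_main(WorkQueue& q) {
    Job j;
    while (q.pop(j)) j.run();
}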

You debug the PPU and SPUs in an integrated graphics debugger.

This is Cerny's goal, but they are not there yet. It's not like OpenCL, because the GPU doesn't need to switch between compute and graphics modes per se. They probably want to schedule the CUs like the SPUs in SPURS, within one consistent framework.

[... and I have to go fetch my kid now :devilish:]

EDIT:
The translation clears up a few things for me. After reading the Google translation I somehow thought they were referring to CPU cache-bypass operations, but apparently it's on the GPU side (others seem to have picked this up already). I wonder whether it will be done by address/via the MMU, or whether it will involve special instructions.

Still not totally clear on what this VOLATILE tag means. Something that's part of the cache wouldn't go direct to memory by definition, unless they're wasting cache lines instead of TLB attributes to mark memory as uncacheable. I think it means that the writeback will take the non-coherent bus and won't snoop the CPU's caches, and/or won't be updated by modifications from the CPU. Or maybe it is coherent with changes to main RAM and can be updated, hence actually volatile...

I think the gist of it is that they are trying to hammer away any low-level obstacles and, at the same time, put in the tweaks needed to rope the CUs closer to the CPU.

Gotta run now !
 
the existence of an APU gives us the ability to come close to the results obtained from the SPU.

We’re trying to replicate the SPU Runtime System (SPURS) of the PS3 by heavily customizing the cache and bus. SPURS is designed to virtualize and independently manage SPU resources.
So it's nothing at all like the Cell paradigm. Liverpool is in fact a Toaster Oven, while the Cell was a CPU with accelerating cores.

The way Cerny is presenting the link between the Cell and Liverpool, while at the same time saying they had to ditch BC, seems to be an attempt to help the porting of PS3 games, as long as they can be recompiled and modified to work on the PS4. They need to avoid any bottleneck that would change the paradigms used to code against the SPUs. I think he's saying the Cell had some strong advantages in this area (the close coupling, and an SPU memory model that is micro-managed and doesn't pollute the cache), so maybe they wanted to modify the APU to make it at least on par with the SPUs in every respect, so porting would be much easier. (Could this help software BC too?)
 
Ever since we looked at the early leaks showing 18 CUs with 4 of them dedicated to CPU assistance, it has seemed that there may be an attempt at software BC for PS3 games, or at least at enhanced ports.
 
There is at least one obvious business motive for reworking PS3 games to run on PS4: it could be a joint effort to also get them running in the cloud.
 
Well, I wouldn't say they are aiming for BC at this point. They are shooting for abstraction and general use of the CUs first. The CUs and the SPUs have pretty different characteristics, and they may not be able to emulate the SPUs adequately. Presumably they can load their SPURS kernel onto the CPU cores and ACEs so the CUs can manage themselves like the SPUs?

Made a mistake above. Instead of bringing the CUs closer to the CPU, it's probably more accurate in spirit to say they want to bring the CUs closer to the developers. Hence they need to chisel away some of the default h/w behaviors that may be too restrictive or difficult (because of legacy/GPU assumptions) when a dev tries to use the CUs at that abstract level.
 
As for "BC":
Re-coding the SPU "applets", or whatever they're called, into something equivalent (that can run on the GPU, possibly, or else on a CPU thread) can't be too horrific a task, can it? Each SPU only has 256KB of local store available to it, less for actual code. *shrug* I'm not a coder, especially not a PS3 coder, so maybe it would be a huge deal that would require re-engineering most if not all of the rendering code. Who knows.

Maybe ERP or another of our experienced gearheads can offer insights?
 
By and large, the SPUs are used for graphics-related compute jobs. It's probably easier to recode the SPUlets in the CUs' native language and unified memory environment than to follow the SPUs' way of working in a split-pool setup.

The SPUs also run at 3.2GHz, twice Jaguar's clock (1.6GHz) and 4 times the CUs' (800MHz). OTOH, the CUs pack more power and features.

It sounds more like they are improving their practices rather than targeting BC this way.
 
As for "BC":
Re-coding the SPU "applets" or whatever they're called to something equivalent (that can run on the GPU, possibly, or else on a CPU thread) can't be too horrific a task can it? Each SPU only has 256k RAM available to it, less for actual code. *shrug* I'm not a coder, especially not a PS3 coder, so maybe it would be a huge deal that would require re-engineering most if not all of the rendering code. Who knows.

Maybe ERP or another of our experienced gearheads can offer insights?

Porting anything that ran on a general-purpose processor, even one with limited memory, to GPU code so that it runs with any degree of efficiency is at best hard and sometimes impossible.
GPGPU programming is IME counter-intuitive, especially if you've spent any time writing optimal CPU code. Where on a CPU you avoid touching or copying memory at all costs, on a GPU the most efficient thing to do is often to copy and reformat megabytes of memory so you can get some degree of compute efficiency.
It's also often difficult to determine what the actual bottleneck is in a compute shader, and intuitions tend not to be correct.
It is not uncommon for someone writing their first non-trivial compute shader to end up with something slower than a single-threaded CPU implementation, even though the CPU solution is FP-bound and the compute shader has hundreds of times the FP performance.
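
A minimal sketch of what "copy and reformat" means in practice: the typical case is transposing an array-of-structs (what a CPU engine keeps around) into structure-of-arrays (what the GPU wants, so adjacent lanes read adjacent floats). The types and names below are invented purely for illustration.

// AoS (CPU-friendly: one object per cache line) reformatted into SoA
// (GPU-friendly: coalesced per-lane reads). This pass is pure overhead on a
// CPU, but on a GPU it is often what makes the ALUs reachable at all.
#include <vector>

struct Particle { float x, y, z, mass; };    // array-of-structs layout

struct ParticlesSoA {                        // structure-of-arrays layout
    std::vector<float> x, y, z, mass;
};

ParticlesSoA reformat(const std::vector<Particle>& in) {
    ParticlesSoA out;
    out.x.reserve(in.size()); out.y.reserve(in.size());
    out.z.reserve(in.size()); out.mass.reserve(in.size());
    for (const Particle& p : in) {           // the megabytes of copying ERP mentions
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.mass.push_back(p.mass);
    }
    return out;
}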
 
The SPUs also run at 3.2GHz, twice Jaguar's clock (1.6GHz) and 4 times the CUs' (800MHz). OTOH, the CUs pack more power and features.
SPUs also sat behind a very fat, but extremely high-latency bus, AFAIK without direct access to GPU RAM, or at least extremely slow access (something like 25MB/s or such). So to do graphics on GPU assets, you first had to get the GPU to DMA over the stuff to XDR memory to avoid the ridiculous speed penalty (with all the overhead and latency that entails, hits to bandwidth to both pools of RAM etc), have the SPUs fetch it via DMA (more latency), do the actual processing at 3.2GHz, and then transfer back in reverse order. Probably not a net gain, compared to running a compute shader on a modern GPU @800MHz... ;)
 
Realizing Energy Efficiency and Smoothness using a Second Custom Chip with Embedded CPU

Cerny: The second custom chip is essentially the Southbridge. However, this also has an embedded CPU. This will always be powered, and even when the PS4 is powered off, it is monitoring all IO systems. The embedded CPU and Southbridge manages download processes and all HDD access. Of course, even with the power off.

- Does anyone have an idea what kind of CPU will be put into the Southbridge? A single-module Temash or some ARM core?
- Who will make it? AMD?
- Will that CPU have its own memory pool, or will it access the power-hungry GDDR5?
- If it has access to its own memory pool [one or two 512MB DDR3 chips], is there a chance that a beefy Southbridge CPU [~5 watts] will be used for running the complete OS/UI, leaving the central APU and GDDR5 [almost] free for gaming?
 
Porting anything that ran on a general-purpose processor, even one with limited memory, to GPU code so that it runs with any degree of efficiency is at best hard and sometimes impossible.
Yes, but SPUs weren't exactly your average run-of-the-mill general-purpose processors, so maybe you could trade one set of Cell SPU complications for GPGPU complications? :D (Yeah yeah... programming doesn't work like that, I know. :)) In any case, how many types of SPU jobs do PS3 games include these days, in general? If it's just a couple you could perhaps brute-force it, since rendering a PS3 game on hardware more than half a decade newer shouldn't be too taxing, one would think.

Then again, some games allegedly massage geometry with the SPUs, culling or even outright transforming instead of using GPU vertex shaders, as I seem to recall reading here on B3D. All that SPU code could of course just be dropped without any problems; you'd just let the GPU handle that work like you ordinarily would on any other (decently) modern platform.

All that would remain would be the actual graphics processing, like, say, DoF, glow effects and stuff like that, and those things should run quite well in a compute shader, wouldn't they?

It's also often difficult to determine what the actual bottleneck is in a compute shader, and intuitions tend not to be correct.
Maybe the additional debug hardware Sony purportedly included in the PS4 will help with that. Surely there have to be performance counters exposed to developers so they can properly examine how their code actually runs on the hardware...?
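
I don't know what Sony actually exposes, but on PC the crudest version of this is just a GPU timer query wrapped around the dispatch; it only tells you how long the shader took, not why, which is ERP's point about bottlenecks. A rough sketch assuming OpenGL 4.3 and GLEW:

// Time a compute dispatch with an OpenGL timer query. Assumes a current GL
// context and an already-bound compute program; measures elapsed GPU time only.
#include <GL/glew.h>
#include <cstdio>

double time_dispatch_ms(GLuint groupsX, GLuint groupsY) {
    GLuint query = 0;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    glDispatchCompute(groupsX, groupsY, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    glEndQuery(GL_TIME_ELAPSED);

    GLuint64 ns = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);   // stalls until the GPU finishes
    glDeleteQueries(1, &query);

    std::printf("dispatch took %.3f ms\n", ns / 1.0e6);
    return ns / 1.0e6;
}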
 
Welcome to the mind of a mad man. :cool: http://forum.beyond3d.com/showpost.php?p=1715734&postcount=9

http://forum.beyond3d.com/showpost.php?p=1722235&postcount=908


http://forum.beyond3d.com/showpost.php?p=1723418&postcount=1107


& it's kinda scary to think how amazing this could be if it all works out as planned, using the APU as a powerful CPU & GPU at the same time. I guess this is why they called it a 'Supercharged PC Architecture', but I think it will take some time for them to take full advantage of it, & most developers will probably stick to what they know & use the CPU & GPU the same way they would in a PC game.
 
SPUs also sat behind a very fat, but extremely high-latency bus, AFAIK without direct access to GPU RAM, or at least extremely slow access (something like 25MB/s or such). So to do graphics on GPU assets, you first had to get the GPU to DMA over the stuff to XDR memory to avoid the ridiculous speed penalty (with all the overhead and latency that entails, hits to bandwidth to both pools of RAM etc), have the SPUs fetch it via DMA (more latency), do the actual processing at 3.2GHz, and then transfer back in reverse order. Probably not a net gain, compared to running a compute shader on a modern GPU @800MHz... ;)

It's rather different.

The SPUs use local store, which offers L1-level access speed and latency. They use DMA to fill the local store in a double-buffered manner (e.g., the next chunk of data is loaded while the current one is being computed on). The memory access latency is hidden this way. If the SPUs hit their local stores regularly, they can be hard to beat.
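
For anyone who never touched Cell, a hedged C-style sketch of that double-buffer pattern (dma_get/dma_wait/compute are placeholder declarations, not the real MFC intrinsics): the SPU kicks off the fetch of chunk n+1, works on chunk n already sitting in local store, and the DMA latency disappears underneath the compute.

#include <stddef.h>
#include <stdint.h>

// Placeholders standing in for the real MFC DMA intrinsics and the job body.
void dma_get(void* ls_dst, uint64_t ea_src, size_t bytes, int tag);
void dma_wait(int tag);
void compute(float* data, size_t count);

enum { CHUNK = 16 * 1024 };                 // fits comfortably in the 256KB local store
static float bufA[CHUNK / sizeof(float)];
static float bufB[CHUNK / sizeof(float)];

void process_stream(uint64_t src_ea, int chunks) {
    float* buf[2] = { bufA, bufB };
    dma_get(buf[0], src_ea, CHUNK, /*tag=*/0);            // prefetch chunk 0
    for (int i = 0; i < chunks; ++i) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < chunks)                               // start fetching the next chunk...
            dma_get(buf[nxt], src_ea + (uint64_t)(i + 1) * CHUNK, CHUNK, /*tag=*/nxt);
        dma_wait(/*tag=*/cur);                            // ...while waiting only for this one
        compute(buf[cur], CHUNK / sizeof(float));         // the next fetch hides under this work
    }
}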

Access to XDR is around 16-22GB/s. Read access to GDDR3 is 16MB/s, and write is 4GB/s. Presumably, the SPUs will rely on XDR access more.


The CUs have much more computational power and a more restrictive instruction set. They also have more direct/"closer" access to the other graphics h/w and to the bulk of the graphics data, and they don't need to shuffle data between two pools of memory unnecessarily.

Going for dinner now !
 
Really, I just quoted your quote of what Chris Norden said.

For the GPU you cannot get 100% = 100% + something greater than 0%. It isn't going to happen no matter how much you wish it were so. There is a maximum of 100%. If 100% is being used for graphics that leaves 0% for compute.

And you are STILL ignoring that he specifically said two processors. Let me repeat that for you again. Two processors. And in case you forgot already. Two processors.

Regards,
SB

I remember a B3D discussion about how GPUs don't/can't use all of their clock cycles. It had something to do with all the extra power eSRAM could possibly bring through "efficiency". :)

Knowing a GPU can't make full use of ALL of its clock cycles, why couldn't that be an opportunity to slot some compute tasks into those windows? Wouldn't that still leave its normal graphics tasks unaffected?

Didn't Cerny say they could bypass the GPU's L2 and L1 caches?
 
Just so long as you understand that the 1.84TF number is arrived at by using every ALU on every cycle.

So yes, you could gain efficiency, but just because you're running code that isn't particularly graphics-intensive at certain times doesn't mean that's when you will need the extra compute, just because the GPU happens to be available.
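
For reference, that peak figure is just ALU count times clock: 18 CUs × 64 ALUs per CU × 2 FLOPs per ALU per cycle (a fused multiply-add) × 800 MHz = 1,843.2 GFLOPS ≈ 1.84 TFLOPS. In other words, it assumes every lane retires an FMA on every single cycle, which no real workload does.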
 

If it's impossible to use all of the clock cycles for graphics, and GPGPU tasks could fill those empty clock cycles, which of these statements would still be accurate?

- GPU and GPGPU tasks can be handled simultaneously, without affecting graphics.
- GPU compute and graphics tasks can co-exist with no penalty to graphics.

It all still means the same thing onQ was saying: you have all the graphics power plus whatever can be pulled from the remaining clock cycles. That should be a HUGE amount of processing capability.

Question: if GPU commands are issued in waves, wouldn't that mean some cycles are naturally not being used? When FLOPS tests are performed, are they still using this "wave" technique?
 