I remember a while back someone mentioned that 32 ALUs were perhaps overkill and would never be used to their fullest. With this information, does it now make sense?
A bit of extra headroom so compute can always have some ALU resources to play with?
There are several possibilities. The amount of work could be too low to really fill all ALUs (that means getting full occupancy of 40 wavefronts [or whatever register usage allows] per CU). The throughput could be bottlenecked by another pipeline stage, leaving only a relatively low number of wavefronts in flight at a given time. Why doesn't that consume all the ALUs for a very short time then? I'd have thought, in my naivety, that vertex work saturates the ALUs for a brief moment to churn through those jobs, and then the ALUs sit on the pixel work for much of the scene, churning through that work, both pixels and then post-processing. In my understanding of unified-shader GPUs, there's not going to be a moment when a few ALU pipes are active and the rest are idle, or when the whole set of ALUs is idle waiting for something to do.
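To put rough numbers on the "40 wavefronts [or whatever register usage allows] per CU" remark, here is a minimal sketch of how VGPR usage caps occupancy on GCN. It assumes the commonly cited GCN limits (4 SIMDs per CU, 10 wavefront slots per SIMD, 256 VGPRs per SIMD, allocation in blocks of 4) and ignores SGPR and LDS limits, so treat it as an illustration rather than a precise tool:

```python
# Rough GCN occupancy estimate from vector register (VGPR) usage.
# Assumed limits: 4 SIMDs per CU, 10 wavefront slots per SIMD,
# 256 VGPRs per SIMD, VGPRs allocated in blocks of 4.
# SGPR and LDS limits are ignored for simplicity.

def wavefronts_per_cu(vgprs_per_wavefront: int) -> int:
    granularity = 4
    allocated = -(-vgprs_per_wavefront // granularity) * granularity  # round up to 4
    vgpr_limited = 256 // allocated
    return 4 * min(10, vgpr_limited)

for vgprs in (24, 48, 84, 128):
    print(f"{vgprs:3d} VGPRs -> {wavefronts_per_cu(vgprs)} wavefronts per CU")
# 24 VGPRs -> 40, 48 -> 20, 84 -> 12, 128 -> 8
```

A lightweight shader reaches the full 40 wavefronts per CU, while a register-hungry one can only keep a handful in flight, which is exactly the "too little work to hide latency" situation described above.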
What L2 cache tweak? The volatile tag exists in all GCN-based GPUs sold over the last 16 months. It's a GCN feature that finally gets used. Their modifications are mostly to prevent parallel jobs from stepping on each other's toes. The graphics compute jobs will probably be "consistent" w.r.t. other graphics jobs. OTOH, some non-graphics compute jobs may interfere with whatever is going on if not handled correctly (the article mentioned the L2 cache tweak).
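For anyone wondering what the volatile tag buys you, here is a toy model (purely illustrative, not the real GCN cache controller or any driver API): compute data can be marked volatile so that only those lines are dropped when a compute job needs coherent data, instead of flushing the whole L2 and evicting the graphics working set.

```python
# Toy model of volatile-tagged cache lines. The class and its behaviour are
# made up for the example; they are not the actual hardware mechanism.

class ToyL2:
    def __init__(self):
        self.lines = {}                      # address -> (data, volatile flag)

    def fill(self, addr, data, volatile=False):
        self.lines[addr] = (data, volatile)

    def invalidate_all(self):
        self.lines.clear()                   # blunt approach: graphics data gone too

    def invalidate_volatile(self):
        self.lines = {a: v for a, v in self.lines.items() if not v[1]}

l2 = ToyL2()
l2.fill(0x1000, "texture tile")                   # graphics working set
l2.fill(0x2000, "compute buffer", volatile=True)  # fine-grained compute data
l2.invalidate_volatile()                          # compute lines dropped...
print([hex(a) for a in l2.lines])                 # ...texture tile stays resident
```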
And the GCN hardware is capable of handling large amounts of different shaders/kernels with or without direct dependencies within the rendering pipeline (the latter would be asynchronous compute stuff). Wavefronts can be assigned different priorities for instance (for example, all wavefronts of a certain shader/kernel could be assigned a higher priority or some asynchronous background tasks can be assigned a lower priority). What Sony will probably add is a possibility for the devs to exert some influence on the work distribution and prioritization. Currently, there is no such possibility on PC GPUs (one can only change the priority in shader code once it got scheduled to a CU [and only when writing the shader in the native ISA, there is no possibility to do it through some higher level API], one can't set the base priority assigned to a wavefront upon creation), it is handled by game profiles in the driver. But that doesn't necessitate hardware changes, it's an API and firmware issue.
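As a purely conceptual sketch of the prioritization being described (hypothetical data structures, not Sony's or AMD's actual scheduler), the idea is simply that every wavefront carries a base priority assigned at creation, and the arbiter issues from the highest-priority ready wavefront, using age as a tie-breaker:

```python
# Conceptual model of priority-based wavefront arbitration. The dictionaries
# and the selection rule are illustrative assumptions, not real hardware.

ready_wavefronts = [
    {"name": "pixel shader wave",          "priority": 2, "age": 5},
    {"name": "async background compute",   "priority": 0, "age": 9},
    {"name": "high-priority compute wave", "priority": 3, "age": 1},
]

def pick_next(waves):
    # Highest base priority wins; among equal priorities, the oldest wave goes first.
    return max(waves, key=lambda w: (w["priority"], w["age"]))

print(pick_next(ready_wavefronts)["name"])   # -> high-priority compute wave
```

The point of the post is that today only the in-shader priority can be touched (and only in native ISA); exposing the base priority at wavefront creation is an API/firmware change, not a hardware one.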
Consider a part like the PS4 with 18 CUs, 72 SIMDs and 2 triangles/clock throughput. In this example there's one unique vertex per triangle. Each SIMD works on 16 vertices/clock, so to get work onto every SIMD it will take 16*72/2 = 576 clocks. Why doesn't that consume all the ALUs for a very short time then? I'd have thought, in my naivety, that vertex work saturates the ALUs for a brief moment to churn through those jobs, and then the ALUs sit on the pixel work for much of the scene, churning through that work, both pixels and then post-processing. In my understanding of unified-shader GPUs, there's not going to be a moment when a few ALU pipes are active and the rest are idle, or when the whole set of ALUs is idle waiting for something to do.
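Reproducing the 16*72/2 arithmetic as a quick calculation (the only added assumption is that the 72 SIMDs come from 18 CUs with 4 SIMDs each):

```python
# Clocks needed before every SIMD has vertex work, per the numbers above:
# 18 CUs x 4 SIMDs = 72 SIMDs, 2 triangles (= 2 unique vertices) per clock,
# and each SIMD being handed 16 vertices.

simds             = 18 * 4        # 72
vertices_per_simd = 16
vertices_per_clk  = 2

clocks_to_fill = simds * vertices_per_simd // vertices_per_clk
print(clocks_to_fill)             # 576
```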
First look at AMD G-series APUs [Jaguar]
According to the chart, the quad-core part running at 1.6 GHz with the onboard GPU disabled uses 15 W. The most power-hungry version draws 25 W [quad-core 2 GHz, GPU activated].
The highest performing SKU is a 25 watt TDP part running four cores at 2 GHz.
Jaguar Vanilla
* 1.8GHz LC Clocks (can be under-clocked for specific low-powered battery device needs - tablets, etc...).
* 2MB shared L2 cache per CU
* 1-4 CUs can be outfitted per chip. (i.e. 4-16 logical cores)
* 5-25 watts depending on the device/product. (45 watts is achievable under proper conditions)
PS4 Jaguar with chocolate syrup.
* 2GHz is correct as of now.
* 4MB of total L2 cache (2MB L2 x 2 CUs)
* 2 CUs (8 Logical cores).
* idles around 7 watts during non-gaming operations and around 12 watts during Blu-ray movie operations. Gaming is a mixed bag...
What would be nice is a fully loaded Jaguar chip.
Quit bringing up platforms beyond the scope of the thread.
Then why the f does Cerny sell it as one of the three compute customizations? In the end it all seems to be standard AMD GCN 1.1.
Okay. With this and 3dcgi's post, I think I understand. I was thinking of a wavefront occupying the ALUs for 100% of the time during its resolution, and then another wavefront following behind, so there was no idle time. I hadn't made the connection with delays in the other pipes. Although I'd want to hear how much of a frame can really be lost on a GPU by such stalls. Is there really a lot of spare GPU power going to waste that can be repurposed to compute? There are several possibilities. The amount of work could be too low to really fill all ALUs (that means getting full occupancy of 40 wavefronts [or whatever register usage allows] per CU). The throughput could be bottlenecked by another pipeline stage, leaving only a relatively low number of wavefronts in flight at a given time.
Thanks for the explanation. Consider a part like the PS4 with 18 CUs, 72 SIMDs and 2 triangles/clock throughput. In this example there's one unique vertex per triangle. Each SIMD works on 16 vertices/clock, so to get work onto every SIMD it will take 16*72/2 = 576 clocks.
If the oldest VS finishes in < 576 clocks, at least one SIMD is idle. A significant part of the VS's clocks are spent fetching data from the vertex buffer, constant memory, etc., so the ALUs aren't busy the entire time, leaving some idle time for compute.
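To make that concrete with some assumed numbers (the vertex-shader lifetime and the ALU fraction below are invented purely for illustration), the idle ALU time comes from two places: SIMDs that retire their vertex wavefronts before the 576-clock fill window closes, and fetch stalls inside the wavefront's lifetime:

```python
# Illustrative only: vs_lifetime and alu_fraction are assumed numbers, not
# measurements. The 576-clock fill window comes from the calculation above.

fill_window  = 576    # clocks to get vertex work onto every SIMD
vs_lifetime  = 400    # assumed clocks for a vertex-shader wavefront to retire
alu_fraction = 0.6    # assumed share of that lifetime spent doing ALU work

idle_after_finish = fill_window - vs_lifetime          # SIMD done, no new work yet
idle_during_vs    = vs_lifetime * (1 - alu_fraction)   # stalled on vertex/constant fetches

print(idle_after_finish, round(idle_during_vs))        # 176 and 160 clocks of slack
```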
I really like this guy.
The main reason I feel like he really knows what he is talking about is that not only is he designing the hardware, but he's also making the games to run on that hardware. Talk about a really close relationship with the hardware, literally and figuratively!
If anyone should know what direction PS4 is trying to go for anything, this is your guy!
That would really depend on the game and how well compute work complements graphics work. If compute needs whatever is bottlenecking the graphics shader, it will slow down graphics work. The more ALU-heavy the compute work is, the more effective this will be, as it's easier to become fetch bound than ALU bound. Although I'd want to hear how much of a frame can really be lost on a GPU by such stalls. Is there really a lot of spare GPU power going to waste that can be repurposed to compute?
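One way to reason about "ALU heavy" versus "fetch bound" is to compare a kernel's arithmetic intensity against the machine balance of a PS4-class part (roughly 1.84 TFLOPS against 176 GB/s). The kernel figure below is an assumption for illustration:

```python
# Machine balance of a PS4-class GPU vs an example kernel's arithmetic intensity.
# The kernel's flops-per-byte value is made up for the example.

peak_flops      = 18 * 64 * 2 * 0.8e9   # 18 CUs x 64 lanes x FMA x 800 MHz ~= 1.84 TFLOPS
peak_bandwidth  = 176e9                 # GDDR5 bytes per second
machine_balance = peak_flops / peak_bandwidth    # ~= 10.5 flops per byte

kernel_flops_per_byte = 4.0             # assumed: a fairly fetch-heavy compute job

if kernel_flops_per_byte < machine_balance:
    print("fetch bound: it will compete with graphics for bandwidth")
else:
    print("ALU bound: a good candidate for soaking up idle ALU cycles")
```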
Good that he picked UE4 as the platform for his game. It will translate better to third party work.