PlayStation 4 (codename Orbis) technical hardware investigation (news and rumours)

If Knack doesn't use UE4, then it may be able to explore these custom features more thoroughly. They can architect the code from the ground up to prepare for better (more aggressive) optimization opportunities. Some of the tech presentations they do next should be on Knack.

KZ:SF seems to have rather humble/traditional tech goals so far.
 
If you're rendering a bunch of simple triangles into shadow maps it's possible 50% of compute is idle, but if there's heavy pixel shading the number will be much less.
Yes, I appreciate the various opportunities. I was just wondering at a ball-park figure. Even if 50% could be idle during a shadow pass, if the shadow pass is 10% of frame time, and the GPU is 100% occupied the rest of the time, compute would have effectively 5% free GPU time to work without impacting the graphics workload of the game. That in turn would equate to 90 GFlops free compute. I suppose the approach to this question would be to break down a typical game (let's say UE3 as a very widespread basis) into typical passes/parts, identifying how much frame time is spent on each part, and then identifying how much free ALU space there is at each phase.
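
For reference, the back-of-envelope arithmetic behind that 90 GFLOPS figure, assuming the commonly quoted 1.8 TFLOPS peak for the PS4 GPU (18 CUs at 800 MHz):

```python
# Back-of-envelope check of the "90 GFLOPS free" figure.
# Assumed inputs (from the post above and the commonly quoted PS4 specs):
#   peak ALU throughput ~1.8 TFLOPS (18 CUs x 64 lanes x 2 FLOPs/FMA x 800 MHz),
#   shadow pass = 10% of frame time, 50% of the ALUs idle during it.

peak_gflops = 18 * 64 * 2 * 0.8        # ~1843 GFLOPS, usually rounded to 1.8 TFLOPS

shadow_pass_share = 0.10               # fraction of frame time spent in the shadow pass
idle_alu_share    = 0.50               # fraction of ALUs idle during that pass

free_share  = shadow_pass_share * idle_alu_share   # 5% of total GPU ALU time
free_gflops = free_share * peak_gflops             # ~92 GFLOPS (~90 with the rounded 1.8 TFLOPS)

print(f"free GPU time: {free_share:.0%} -> spare compute: ~{free_gflops:.0f} GFLOPS (GFLOPS, not TFLOPS)")
```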
 
Yes, I appreciate the various opportunities. I was just wondering at a ball-park figure. Even if 50% could be idle during a shadow pass, if the shadow pass is 10% of frame time, and the GPU is 100% occupied the rest of the time, compute would have effectively 5% free GPU time to work without impacting the graphics workload of the game. That in turn would equate to 90 TFlops free compute. I suppose the approach to this question would be to break down a typical game (let's say UE3 as a very widespread basis) into typical passes/parts, identifying how much frame time is spent on each part, and then identifying how much free ALU space there is at each phase.

I think you made a typo.
 
90 TFlops magic sauce! I fully expect this to appear on GAF any day now. :LOL:

BTW, do we have any consensus on the pricing of sixteen 512MB GDDR5 chips?
 

I don't remember reading the part below in the gamasutra article but I could be wrong:


The PS4 SoC has three major enhancements not found on today’s PCs (it is unclear whether at least some of those technologies will eventually make it to the PC, but it is highly likely that they will):

The graphics processor can write directly to system memory, bypassing its own L1 and L2 caches, which greatly simplifies data sync between graphics processor and central processor. Mr. Cerny claims that a special bus with around 20GB/s bandwidth is used for such direct reads and writes. To support the case where a developer wants to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, Sony has added a “volatile” bit in the tags of the cache lines. Developers can then selectively mark all accesses by compute as “volatile”, and when it is time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time. In general, the technique radically reduces the overhead of running compute and graphics together on the GPU. The original AMD GCN architecture allows one source of graphics commands, and two sources of compute commands. For PS4, Sony worked with AMD to increase the limit to 64 sources of compute commands. If a game developer has some asynchronous compute they want to perform, they should put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that is in the system. Sony believes that not only games, but also various middleware will use GPU computing, which is why requests from different software clients need to be properly blended and then properly prioritized.
“If you look at the portion of the GPU available to compute throughout the frame, it varies dramatically from instant to instant. For example, something like opaque shadow map rendering doesn't even use a pixel shader, it is entirely done by vertex shaders and the rasterization hardware – so graphics are not using most of the 1.8TFLOPS of ALU available in the CUs. Times like that during the game frame are an opportunity to say, 'Okay, all that compute you wanted to do, turn it up to 11 now',” said Mr. Cerny.

While the PlayStation 4 is very powerful already, there were ways to further boost its performance, but at the cost of increased complexity for game developers. The company decided that minimizing hassles for game designers is more important than additional performance and chose to stick with the current architectural solutions. The benefits of Sony’s architectural decisions will be seen in the PlayStation 4's launch games.

"The launch lineup for PlayStation 4 – though I unfortunately cannot give the title count – is going to be stronger than any prior PlayStation hardware,” said Mr. Cerny.
 
I just noticed that the Blu-ray drive specs changed from PCAV 3.3x-6x (vgleaks) to plain 6x CAV (so 2.5x-6x). That's an interesting change, because the early PCAV version would have had the inner tracks spin up to 6600 RPM, while the normal CAV would be 5000 RPM constant (rough numbers in the sketch below). This 5000 RPM happens to be the exact rotational speed limit for slim drives. Maybe that's a hint about the form factor, or future-proofing.

It could also be simply to take care of the noise. A PS3 drive will spin up to 4000 RPM while playing a 3D bluray film (for example), and it's silent at that speed, so I suppose 5000 RPM would still be fine.
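
The rough numbers behind those RPM figures, assuming nominal BD-ROM parameters (1x linear velocity of about 4.92 m/s for a 25 GB layer, and a data zone spanning roughly 24 mm to 58 mm in radius; both values are my assumptions, not quoted specs):

```python
import math

# Assumed nominal BD-ROM parameters (my assumptions, not official figures):
V_1X    = 4.92      # m/s, approx. linear velocity at 1x for a 25 GB layer
R_INNER = 0.024     # m, start of the data zone
R_OUTER = 0.058     # m, end of the data zone

def rpm_for(speed_multiple, radius):
    """Spindle RPM needed so the track at `radius` passes the head at `speed_multiple` x 1x."""
    return speed_multiple * V_1X / (2 * math.pi * radius) * 60

def speed_at(rpm, radius):
    """Read speed (in multiples of 1x) at `radius` for a fixed spindle RPM (CAV)."""
    return rpm * 2 * math.pi * radius / 60 / V_1X

# Plain 6x CAV: constant RPM chosen so the OUTER edge reads at 6x.
cav_rpm = rpm_for(6, R_OUTER)                       # ~4860 RPM, i.e. roughly 5000 RPM
print(f"6x CAV: {cav_rpm:.0f} RPM, inner tracks read at ~{speed_at(cav_rpm, R_INNER):.1f}x")

# PCAV 3.3x-6x: inner tracks starting at 3.3x need a much higher spindle speed.
pcav_rpm = rpm_for(3.3, R_INNER)                    # ~6460 RPM, i.e. roughly 6600 RPM
print(f"PCAV 3.3x inner: {pcav_rpm:.0f} RPM")
```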
 
The graphics processor can write directly to system memory, bypassing its own L1 and L2 caches, which greatly simplifies data sync between graphics processor and central processor. Mr. Cerny claims that a special bus with around 20GB/s bandwidth is used for such direct reads and writes. To support the case where a developer wants to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, Sony has added a “volatile” bit in the tags of the cache lines. Developers can then selectively mark all accesses by compute as “volatile”, and when it is time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time. In general, the technique radically reduces the overhead of running compute and graphics together on the GPU. The original AMD GCN architecture allows one source of graphics commands, and two sources of compute commands. For PS4, Sony worked with AMD to increase the limit to 64 sources of compute commands. If a game developer has some asynchronous compute they want to perform, they should put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that is in the system. Sony believes that not only games, but also various middleware will use GPU computing, which is why requests from different software clients need to be properly blended and then properly prioritized.
“If you look at the portion of the GPU available to compute throughout the frame, it varies dramatically from instant to instant. For example, something like opaque shadow map rendering doesn't even use a pixel shader, it is entirely done by vertex shaders and the rasterization hardware – so graphics are not using most of the 1.8TFLOPS of ALU available in the CUs. Times like that during the game frame are an opportunity to say, 'Okay, all that compute you wanted to do, turn it up to 11 now',” said Mr. Cerny.

It's certainly in the article, although written differently - the interesting part is that Sony appears to have intended middleware vendors to use it... removing much of the complexity from game devs.

On another note, if you consider that this design was "intended", then I'd suggest that the 14/4 setup was emulating it - which may give some idea of how much relative performance they expect to be available.
 
Yeah, 14/4 in early dev kits may be because the priority scheme customization was not ready yet.

It may be hard to change the utilization pattern by retrofitting. If Knack is not written in UE4, then I think PS4 optimization in UE4 will be opportunistic, as opposed to "aggressive".

A UE4 game may have to do a fair bit of tailoring to exploit PS4 GPU-specific features (but 8GB unified RAM is probably easy to exploit or abuse). Does UE3 take advantage of SPURS inherently for rendering this gen? It may be more common to keep them separate (SPURS handling carved out "large" SPU jobs, UE3 for overall rendering).
 
Along those same lines, I'm still wondering how robust the video encoder is. PS4 Eye video, Remote Play, and all of the Share functions will likely rely on it. Can the encoder handle these functions simultaneously? E.g. playing Sports Championship 3, with a live video chat over PSN, while uStreaming the whole thing? That's at least 2 completely separate encodes (the PS Eye RAW/YUV video feed and the game's frame buffer both being encoded).

To answer my own question, I probably should have started by looking at the current Radeon encode/decode hardware, as that's likely the building block used in the PS4 APU (if not the very same chipset). Unless they completely threw out the AMD components and built their own custom units. Maybe a couple of SPUs built into the APU (VGLeaks exclusive)!?! :p If Watch Impress' speculation that the encode/decode hardware is off the APU and on the Southbridge's custom low-power chip is right, I suppose a custom solution might make more sense. But even then, using AMD's design would seem to be more cost-efficient. Anyway, this seems to be the common feature set across the Radeon line:

AMD HD Media Accelerator

Unified Video Decoder (UVD)
H.264
VC-1
MPEG-2 (SD & HD)
MVC (Blu-ray 3D)
MPEG-4 Part 2 (DivX/Xvid)
Adobe Flash
DXVA 1.0 & 2.0 support
WMV HD

Video Codec Engine (VCE)
Multi-stream hardware H.264 encoder
Full-fixed mode: 1080p @ 60 FPS encoding
Hybrid mode: Stream Processor-assisted encoding

The decoder is a bit more robust than I thought (in terms of codec support). For some reason, I was under the impression only H.264 was hardware accelerated, but that doesn't appear to be the case (for decode, at least). The VCE, on the other hand, is quite a bit more clear-cut: pure hardware encoding (Full Mode, which operates without the assistance of the CUs) seems to be H.264 only and capped at 1080p60, along with a hybrid encoding mode which uses the fixed-function encode hardware with help from the CUs (stream processors). What really caught my eye, though, was the multi-stream support, which would answer my question about how many of the functions dependent on the encode hardware can actually be used at the same time. It would appear that may not be much of an issue.
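
One crude way to sanity-check the multi-stream question is in pixels per second against the 1080p60 fixed-function ceiling; the individual stream resolutions and framerates below are just guesses for the sake of the arithmetic:

```python
# Crude pixel-rate budget for concurrent encodes against a 1080p60 ceiling.
# The 1080p60 figure comes from the VCE "full-fixed mode" spec quoted above;
# the individual stream resolutions/framerates are hypothetical examples.

def pix_per_sec(width, height, fps):
    return width * height * fps

budget = pix_per_sec(1920, 1080, 60)          # ~124 Mpixels/s of fixed-function encode

streams = {
    "Share / uStream of the game framebuffer": pix_per_sec(1280, 720, 30),
    "PS4 Eye video chat":                      pix_per_sec(640, 480, 30),
}

used = sum(streams.values())
for name, rate in streams.items():
    print(f"{name}: {rate/1e6:.1f} Mpix/s")
print(f"total: {used/1e6:.1f} Mpix/s of ~{budget/1e6:.0f} Mpix/s available "
      f"({used/budget:.0%} of the 1080p60 budget)")
```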

EDIT

I was wondering what would handle the audio stream for a game session stream, shared video, and video chat. The dedicated audio hardware, or the video encode unit? Looks like it will probably be the VCE plus the audio unit:

The VCE hardware supports multiple compression and quality levels, and it can multiplex inputs from various sources for the audio and video tracks to be encoded. Interestingly, the video card's frame buffer can act as an input source, allowing for a hardware-accelerated HD video capture of a gaming session.

This is all assuming the PS4 uses AMD's UVD and VCE, of course.
 
Yeah I read up on the VCE and UVD stuff.

Cerny mentioned that the audio unit can decode a large number of MP3 streams. Is the AMD APU audio unit known for its performance? I could find a few reviews for older APUs, but it doesn't quite stand out. ^_^
 
Yeah I read up on the VCE and UVD stuff.

Cerny mentioned that the audio unit can decode a large number of MP3 streams. Is the AMD APU audio unit known for its performance? I could find a few reviews for older APUs, but it doesn't quite stand out. ^_^

IMHO the sound chip is the only part in there with Sony-exclusive tech (well, non-AMD tech at least, as it could very well be a Yamaha chip), and maybe also the zlib decompressor.
 
Is the AMD APU audio unit known for its performance? I could find a few reviews for older APUs, but it doesn't quite stand out. ^_^

I don't believe it offers any kind of hardware acceleration. It seems to just be a standard codec as might be found on a PC motherboard. That said, with their latest generation of video cards (77XX-79XX) they added a feature that they branded as "Discrete Digital Multi-Point Audio" which added the ability to route up to 6 independent multi-channel (up to 8 channels) audio streams to different video outputs, which is unique AFAIK.
 
Hmm... Rumors and Cerny only called out the audio performance. Haven't read anything about discrete multi-point audio per se.


IMHO the sound chip is the only part in there with Sony-exclusive tech (well, non-AMD tech at least, as it could very well be a Yamaha chip), and maybe also the zlib decompressor.

It's a glaring gap given Sony's expertise in this area. Short of official info, it's probably best to wait for a teardown.
 