PlayStation 5 [PS5] [Release November 12, 2020]

Thanks, I was getting tired of dismissive posts that said “patents aren’t products.” You mixed it up on this one, so good on you.
Well, sorry man. This wasn't meant to be dismissive of Sony; everyone has the same problem.

Unless Nvidia wants to start handing out models, everyone is stuck unless they want to build their own, which can be very expensive.

To put things into perspective, Google, Amazon, and MS run the largest cloud platforms for AI processing, and none of them has a DLSS-class model. Facebook is trying, but as I understand it, what it has is inferior to Nvidia's. Even running on RTX AI hardware, it's orders of magnitude away from DLSS performance.

MS can tout ML capabilities on the console, but without a model it's pointless. The real technology in AI is the model; the hardware to run it is the trivial part.

Further explanation on this front: a trained model is the product of three things: the network, the data, and the processing. Even if you have the neural network to train with, and let's say it's open source, you still need the data, and then you need the processing power.

To put things into perspective, BERT is a transformer network built for natural language processing: it reads sentences both forwards and backwards to understand context. The BERT network itself is open source. The data is not. The data source is Wikipedia (all of Wikipedia is fed into BERT for training), but you'd still have to preprocess that data before it can be used for training. Assuming you had a setup capable of handling that much data, you then get to the compute part of the equation. Simply put, only a handful of companies in the world can train a proper BERT model. So while there are all sorts of white papers on BERT, small teams can't verify the results or keep up, because the compute requirements are so high.

For a single training:
How long does it take to pre-train BERT?
BERT-base was trained on 4 cloud TPUs for 4 days and BERT-large was trained on 16 TPUs for 4 days. There is a recent paper that talks about bringing down BERT pre-training time – Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

If you make any change to it, any change to the network or data set, that's another 4 days of training before you can see the result. Iteration time is very slow on these complex networks without more horsepower.

***

Google BERT — estimated total training cost: US$6,912
Released last year by Google Research, BERT is a bidirectional transformer model that redefined the state of the art for 11 natural language processing tasks.

From the Google research paper: “training of BERT-Large was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.” Assuming the training device was Cloud TPU v2, the total price of one-time pretraining should be 16 (devices) * 4 (days) * 24 (hours) * 4.5 (US$ per hour) = US$6,912. Google suggests researchers with tight budgets could pretrain a smaller BERT-Base model on a single preemptible Cloud TPU v2, which takes about two weeks with a cost of about US$500.

...

What may surprise many is the staggering cost of training an XLNet model. A recent tweet from Elliot Turner — the serial entrepreneur and AI expert who is now the CEO and Co-Founder of Hologram AI — has prompted heated discussion on social media. Turner wrote “it costs $245,000 to train the XLNet model (the one that’s beating BERT on NLP tasks).” His calculation is based on a resource breakdown provided in the paper: “We train XLNet-Large on 512 TPU v3 chips for 500K steps with an Adam optimizer, linear learning rate decay and a batch size of 2048, which takes about 2.5 days.”

***
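Quick sanity check on the arithmetic in those quotes (the device counts and the US$4.5/hour rate are the ones quoted above, not official pricing; the XLNet rate is just what the tweeted total implies):

Code:
# Back-of-the-envelope check of the quoted training costs (not official pricing).
tpu_v2_usd_per_hour = 4.5                      # rate quoted above per Cloud TPU v2 device
bert_cost = 16 * 4 * 24 * tpu_v2_usd_per_hour  # 16 devices x 4 days x 24 hours
print(f"BERT-Large one-time pretraining: ${bert_cost:,.0f}")   # -> $6,912

# XLNet: 512 TPU v3 chips for ~2.5 days; the $245,000 figure implies roughly this
# per-chip-hour rate (an inference from the tweet, not a published price).
xlnet_chip_hours = 512 * 2.5 * 24
print(f"Implied XLNet rate: ${245_000 / xlnet_chip_hours:.2f} per chip-hour")  # ~$7.98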

None of these costs account for the R&D, i.e. how many times they had to run training just to get the result they wanted, or the labour and education required from the researchers. The above is just the cost of running the hardware.

Nvidia has been in this business since the beginning, sucking up a ton of AI researcher talent. They have the hardware, the resources, and the subject matter expertise from a long legacy of graphics to make it happen. It's understandable how they were able to create the models for DLSS.

I frankly can't see anyone else being able to pull this off. Not nearly as effectively. At least not anytime soon.
 
SpeedStep is slow compared to Renoir, and this seems to be much faster than it.
https://www.anandtech.com/show/1570...k-business-with-the-ryzen-9-4900hs-a-review/2

Nice, I had no idea it was that slow from idle. No wonder gamers reported stuttering from SpeedStep.

"Here our Ryzen 9 4900 HS idles at 1.4 GHz, and within the request to go up to speed, it gets to 4.4 GHz (which is actually +100 MHz above specification) in 16 ms. At 16 ms to get from idle to full frequency, we’re looking at about a frame on a standard 60 Hz monitor, so responsiveness should be very good here."
"Technically this CPU can get to 4.5 GHz, it even says so on the sticker on the laptop, however we see another small bump up to 4.3 GHz at around 83 milliseconds. So either way, it takes 2-4x longer to hit the high turbo for Intel’s 9th Gen (as setup by Razer in the Blade) than it does for AMD’s Ryzen 4000 (as setup by ASUS)."

Idle to turbo here is of course the worst case; smaller bumps would be faster. But with the PS5 it needs to be deterministic, so it has additional challenges.

This is the lag I was talking about. The PLL can certainly change very quickly (probably microseconds), and we see that granularity in the graphs, but if the next step is a higher voltage, it needs to wait for the "spool up" to charge and stabilize the input capacitance. And if the next step is a lower voltage, sure, it can drop the frequency instantly, but there will also be a lag before the voltage comes down, so it will continue to waste power until it settles down.

So I guess from that perspective 2ms is very good.
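To put those numbers in frame terms, here's a trivial conversion (assuming a 60 Hz display; the 2 ms value is the step granularity mentioned above, not a figure from the article):

Code:
# Express clock-ramp latencies as 60 Hz frame times (1 frame ~= 16.7 ms).
FRAME_MS = 1000.0 / 60.0

ramps_ms = {"Ryzen 4000 idle -> full turbo": 16,
            "Intel 9th Gen idle -> full turbo": 83,
            "~2 ms step granularity": 2}

for label, ms in ramps_ms.items():
    print(f"{label}: {ms} ms = {ms / FRAME_MS:.2f} frames at 60 Hz")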
 
I don't know if it's a side effect of Oodle Texture being made by the same group, or if it would have applied just as much to any other RDO tool, but it seems Oodle Texture combines particularly well with Kraken, giving it an even higher multiplying factor. So compared to the PS4's zlib plus RDO, we're seeing 3.99x versus 3.36x on this dataset. That's more than the ~10% average improvement we'd have expected from Kraken versus zlib. And that's still without BC7prep.

From the blog...

"The compression improvement factor from Oodle Texture is similar and good for all the compressors, but stronger compressors like Oodle Kraken are able to get even more benefit from the entropy reduction of Oodle Texture. Not only do they start out with more compression on baseline non-RDO data, they also improve by a larger multiplier on RDO data."


Without RDO:
1.89:1 ooLeviathan8
1.88:1 lzma_def9
1.85:1 ooKraken8
1.77:1 ooMermaid8
1.76:1 zstd22
1.69:1 zlib9
1.60:1 lz4hc1
1.60:1 ooSelkie8

RDO lambda 40:
4.06:1 lzma_def9
4.05:1 ooLeviathan8
3.99:1 ooKraken8
3.69:1 ooMermaid8
3.65:1 zstd22
3.36:1 zlib9
2.93:1 ooSelkie8
2.80:1 lz4hc1

RDO Improvement factors:
2.154 ooLeviathan8
2.157 ooKraken8
2.085 ooMermaid8
1.831 ooSelkie8
2.148 lzma_def9
2.074 zstd22
1.988 zlib9
1.750 lz4hc1
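The improvement factors are just the ratio of the two lists above; a quick sketch reproducing them from the posted numbers (small mismatches, e.g. for lzma and Leviathan, come from rounding in the displayed ratios):

Code:
# RDO improvement factor = (RDO lambda 40 ratio) / (non-RDO ratio), per codec.
non_rdo = {"ooLeviathan8": 1.89, "lzma_def9": 1.88, "ooKraken8": 1.85,
           "ooMermaid8": 1.77, "zstd22": 1.76, "zlib9": 1.69,
           "lz4hc1": 1.60, "ooSelkie8": 1.60}
rdo_40  = {"ooLeviathan8": 4.05, "lzma_def9": 4.06, "ooKraken8": 3.99,
           "ooMermaid8": 3.69, "zstd22": 3.65, "zlib9": 3.36,
           "lz4hc1": 2.80, "ooSelkie8": 2.93}

for codec, base in non_rdo.items():
    print(f"{codec}: {rdo_40[codec] / base:.3f}")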
 
So are we now saying that we could see data transfers of between 15GB/s and ~20GB/s more often than we thought? That's a lot of data.
Well, I hope not. That would have a huge impact on the main memory bandwidth. Faster delivery is a good thing, but the GPU still needs most of the memory bandwidth to read and write things. If you write 15-20GB/s into memory, you still have to read it back, which would be 30-40GB/s of "lost" memory bandwidth. This is OK for loading screens if the GPU has nothing to do, but not so OK for live action.
All in all, memory bandwidth is something the next-gen consoles are a bit short on. Not only are the CPU & GPU much stronger (and need more memory bandwidth, e.g. faster clocks, new features like ray tracing, ...), we now also have a fast SSD for streaming that "steals" bandwidth on top.
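Just to make the doubling explicit (streamed data is written into RAM once and read back once), using the range from that post:

Code:
# Data streamed from the SSD is written into RAM and then read back by the GPU/CPU,
# so it touches main memory bandwidth twice.
for ssd_gb_per_s in (15, 20):
    print(f"{ssd_gb_per_s} GB/s streamed -> ~{2 * ssd_gb_per_s} GB/s of memory traffic")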
 
Well, I hope not. That would have a huge impact on the main memory bandwidth. Faster delivery is a good thing, but the GPU still needs most of the memory bandwidth to read and write things. If you write 15-20GB/s into memory, you still have to read it back, which would be 30-40GB/s of "lost" memory bandwidth. This is OK for loading screens if the GPU has nothing to do, but not so OK for live action.
All in all, memory bandwidth is something the next-gen consoles are a bit short on. Not only are the CPU & GPU much stronger (and need more memory bandwidth, e.g. faster clocks, new features like ray tracing, ...), we now also have a fast SSD for streaming that "steals" bandwidth on top.

Being able to read that much data doesn't mean it's the average streaming speed. It means that when it matters, for loading or, for example, for the portals in R&C Rift Apart, you can do it faster.

We can imagine massive destruction scenes or set pieces with this.
 
Well, I hope not. That would have a huge impact on the main memory bandwidth. Faster delivery is a good thing, but the GPU still needs most of the memory bandwidth to read and write things. If you write 15-20GB/s into memory, you still have to read it back, which would be 30-40GB/s of "lost" memory bandwidth. This is OK for loading screens if the GPU has nothing to do, but not so OK for live action.
All in all, memory bandwidth is something the next-gen consoles are a bit short on. Not only are the CPU & GPU much stronger (and need more memory bandwidth, e.g. faster clocks, new features like ray tracing, ...), we now also have a fast SSD for streaming that "steals" bandwidth on top.

You are confusing GPU memory bandwidth with how fast the SSD can load assets into memory. They are on different buses.

Thanks, I was getting tired of dismissive posts that said “patents aren’t products.” You mixed it up on this one, so good on you.

Sadly it happens a lot in this thread: drive-by posting from people who are not even interested in buying it.

Related to that patent, EA's Chief Studio Officer did mention some type of ML capability supported by the PS5. They'll probably need to sacrifice some shader cores to perform the operations, like MS is doing, instead of having dedicated hardware like Nvidia. But in his latest video, Dictator showed it's totally worth it.
 
You are confusing GPU memory bandwidth with how fast the SSD can load assets into memory. They are on different buses.
...

No. What you read from the SSD goes into main memory, and so it reduces the main memory bandwidth (available to the other components at that time) while the SSD delivers the data. Then the GPU must read the data (well, the part the GPU needs), which also costs bandwidth. All the data must always go through main memory.
So if you overdeliver with >10GB/s, the memory bandwidth cost is >20GB/s (+ memory contention), because the GPU (or whatever component) reads the data back (well, maybe not all of it; it depends on how much "wasted" data still gets loaded into memory).
There is no direct SSD -> GPU bus. It is always SSD -> (compression logic ->) RAM -> GPU, or SSD -> (compression logic ->) RAM -> CPU, ...

But just because you can reach x GB/s does not mean you will actually stream that much data. In most cases you just spend less time streaming it. You really don't want to be using the entire available SSD bandwidth all the time.

edit: ah, I see, maybe 'reduces' is the wrong term to use. So here's a short example so we both understand each other ^^
If you have 350GB/s of main memory bandwidth and the SSD delivers 10GB/s into main memory, the other components have just 340GB/s to share (minus memory contention and other things).
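A minimal sketch of that accounting, using the example's numbers (the 350GB/s is the hypothetical figure above, not the PS5's actual memory bandwidth):

Code:
# How much main memory bandwidth is left while the SSD streams data in.
total_bw = 350.0   # GB/s, hypothetical figure from the example above
ssd_in   = 10.0    # GB/s delivered by the SSD into RAM

left_while_writing = total_bw - ssd_in
print(f"While the SSD writes: {left_while_writing:.0f} GB/s left to share")   # 340

# If the GPU/CPU also reads that data back in the same window, it costs
# bandwidth a second time (the >20GB/s point above).
left_if_read_back = left_while_writing - ssd_in
print(f"If it's read back at the same rate: {left_if_read_back:.0f} GB/s")    # 330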
 
No. What you read from the SSD goes into main memory, and so it reduces the main memory bandwidth (available to the other components at that time) while the SSD delivers the data. Then the GPU must read the data (well, the part the GPU needs), which also costs bandwidth. All the data must always go through main memory.
So if you overdeliver with >10GB/s, the memory bandwidth cost is >20GB/s (+ memory contention), because the GPU (or whatever component) reads the data back (well, maybe not all of it; it depends on how much "wasted" data still gets loaded into memory).
There is no direct SSD -> GPU bus. It is always SSD -> (compression logic ->) RAM -> GPU, or SSD -> (compression logic ->) RAM -> CPU, ...

But just because you can reach x GB/s does not mean you will actually stream that much data. In most cases you just spend less time streaming it. You really don't want to be using the entire available SSD bandwidth all the time.

edit: ah, I see, maybe 'reduces' is the wrong term to use. So here's a short example so we both understand each other ^^
If you have 350GB/s of main memory bandwidth and the SSD delivers 10GB/s into main memory, the other components have just 340GB/s to share (minus memory contention and other things).

Yeah, you are totally right; I wasn't even thinking about contention, or what happens when both try to use the pool of RAM at the same time. I thought you were saying the APU and the SSD shared the same memory interface.

Probably some of the features Cerny talked about, like the cache scrubbers, are aimed precisely at a more efficient use of the available resources.

I'm trying to think of any situation where you would not want the maximum speed you can possibly get from the SSD?

But is 20GB/s necessary? Cerny was talking about loading the next second of gameplay into RAM, instead of the next 30 seconds like on the PS4. Do you really need (for a 60fps game) 330MB of assets per frame?
 
But is 20GB/s necessary? Cerny was talking about loading the next second of gameplay into RAM, instead of the next 30 seconds like on the PS4. Do you really need (for a 60fps game) 330MB of assets per frame?
You may not need 330MB of data to be loaded in 1 frame (20GB in one second), but you may need, say, 80MB for the frame, as in 5GB in a quarter of a second?

That's the point I'm trying to make.

If you have 80-ish MB to get - for example - why wouldn't you want to get that data in a quarter of a second, or as fast as you possibly can?
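For reference, the per-frame figures being thrown around are just a rate-to-frame-budget conversion (assuming a 60fps target):

Code:
# Convert a sustained transfer rate (GB/s) into a per-frame budget (MB) at 60fps.
def mb_per_frame(rate_gb_per_s, fps=60):
    return rate_gb_per_s * 1000.0 / fps

for rate in (15, 20):
    print(f"{rate} GB/s = ~{mb_per_frame(rate):.0f} MB per frame at 60fps")   # 250 / 333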
 
I guess it would be very, very short peaks for worst-case situations, like the player dropping from a flying mount or fast travelling. The average streaming rate would remain below 100MB/s, but level design wouldn't have any limitations, since it can fetch anything on demand. Cerny said that's the number one reason he wanted this storage pipeline: so that levels no longer have to be designed around streaming speed.
 
I guess it would be very, very short peaks for worst-case situations, like the player dropping from a flying mount or fast travelling. The average streaming rate would remain below 100MB/s, but level design wouldn't have any limitations, since it can fetch anything on demand. Cerny said that's the number one reason he wanted this storage pipeline: so that levels no longer have to be designed around streaming speed.
Exactly. I just can't think why anyone would ever want data transferred more slowly than it could be. The fact that we may not need 300MB of data per frame is irrelevant when we may need any amount of data transferred at the highest speed possible.
 
I suppose it depends on how fixed your budget is for b/w otherwise. If all of those have relatively static demands on b/w and I/O can mop up the excess, then why not. What I don't know is how much variable b/w cost non-I/O tasks might have frame to frame. For example, Sony is making a lot of noise about Tempest audio: if my avatar steps into a cave, does that cause Tempest to consume a lot of b/w running occlusion tests for audio filtering? Genuinely no idea, but years of devs talking about careful RAM b/w management make me sceptical that there aren't devs with non-I/O tasks who saw the 448GB/s RAM b/w and saw a chance to implement their own b/w-intensive ideas that were impossible before.
 
Sony should/could upgrade the memory to, say, around 600GB/s perhaps? At 448GB/s, are you not constrained at high(er) resolutions? No idea if that applies to consoles nowadays, but on PC, bandwidth ties to resolution a lot.
 
I suppose it depends on how fixed your budget is for b/w otherwise. If all of those have relatively static demands on b/w and I/O can mop up the excess, then why not. What I don't know is how much variable b/w cost non-I/O tasks might have frame to frame. For example, Sony is making a lot of noise about Tempest audio: if my avatar steps into a cave, does that cause Tempest to consume a lot of b/w running occlusion tests for audio filtering? Genuinely no idea, but years of devs talking about careful RAM b/w management make me sceptical that there aren't devs with non-I/O tasks who saw the 448GB/s RAM b/w and saw a chance to implement their own b/w-intensive ideas that were impossible before.

That's also a very good point! We keep talking about this magical "data needed per frame" number, but we never talk about other aspects such as sound data that may be needed as fast as possible, and this can get quite large.

Again, I don't see why we would not want the highest transfer speeds that we can get!
 
That's also a very good point! We keep talking about this magical "data needed per frame" number, but we never talk about other aspects such as sound data that may be needed as fast as possible, and this can get quite large.
Worst-case scenario, dynamic loading of vast amounts of data mid-game is carefully stage-managed (much as asset loading is sneakily concealed now, just less so), and loading times are still effectively eliminated.
 