Next Generation Hardware Speculation with a Technical Spin [pre E3 2019]

So you could set any QLC drive (controller permitting) to work in MLC mode? And would that bring with it MLC-like write endurance of 50~100k writes...?

Yes, like some QLC SSDs have QLC cells set to SLC mode, but then overall capacity takes a large hit since each cell only stores one bit instead of four.
 
Yes, like some QLC SSDs have QLC cells set to SLC mode, but then overall capacity takes a large hit since each cell only stores one bit instead of four.

Ah, so that explains variable size SLC caches ... :)

That'd give you a lot of options as to how you balance durability and capacity on a single drive. Amazing no one is selling 256 GB SLC SSDs then. A 10+ PB write drive would seem to have at least some uses.
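A rough check of that figure, assuming ~50k P/E cycles and ideal wear levelling (both assumptions, not vendor specs):

Code:
# Endurance estimate for a hypothetical 256 GB SLC drive.
capacity_bytes = 256e9
pe_cycles = 50_000                              # assumed SLC endurance
print(capacity_bytes * pe_cycles / 1e15, "PB")  # -> 12.8 PB of writes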

If you were to format a 1TB QLC drive down to, say, 240GB, could you force it to always act as an SLC drive with the corresponding write endurance?
 
New math for console APUs.

Usually only ~40% of a discrete GPU is the CUs + ROPs. For the 60-CU Radeon VII on 7nm, the CUs and ROPs would be ~140mm2. The majority of the chip is taken up by memory controllers, audio processing blocks, video encode/decode hardware, the PCIe 3.0 bus interface, and a number of other low-level silicon blocks.

Zen 2 on 7nm should be about the same size as the 28nm Jaguars. The 28nm PS4 dedicated 88mm2 of its die to the GPU, and the PS4 APU is 328mm2 in total.
Given a 350mm2 die, that leaves an extra 22mm2 for the GPU.
88mm2 + 22mm2 = 110mm2.
That translates to 48-50 CUs and 64 ROPs. I don't expect Navi to be a denser design than the Radeon VII (in fact, I expect the opposite), so 50 CUs should be the expected maximum.

48-50 CUs top out at 11-11.5 teraflops at 1.8 GHz.

I expect 11-11.5 teraflops to be the absolute max in an APU design.
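For reference, a quick back-of-the-envelope check of those numbers, assuming the usual 64 shaders per CU and 2 FLOPs per shader per clock (the area figures are the ones above):

Code:
# Area-derived CU count from the ~140 mm^2 for 60 CUs + ROPs figure.
mm2_per_cu = 140 / 60             # ~2.33 mm^2 per CU with ROPs amortised in
gpu_budget = 88 + 22              # PS4's GPU share plus the extra 22 mm^2
print(gpu_budget / mm2_per_cu)    # -> ~47 CUs at Radeon VII density

# Peak FP32 throughput at 1.8 GHz (64 shaders/CU, 2 FLOPs/shader/clock assumed).
for cus in (48, 50):
    print(cus, "CUs ->", round(cus * 64 * 2 * 1.8e9 / 1e12, 2), "TF")  # 11.06 / 11.52 TF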
 
I expect 11-11.5 teraflops to be the absolute max in an APU design.

Why would it be an APU design when AMD are already buying wafers for 7nm Zen 2 chiplets en masse? Wouldn't it be financially sensible for AMD, Sony, and probably MS to dip into that pre-existing pool of product and glue ;-) it to a custom GPU?
 
You don't know what waiting is until you've tried to install Win95 (beta) from a big stack of floppies; I think it took the entire day. It was my first PC experience, after using an Amiga for 6 years, where installing an OS was basically copying a few files onto a blank disk or HDD and making it bootable. Installing a driver was copying a single file into /devices. I was not amused.

The great thing with Windows 95 was when I found out about batch files... being able to stick batch files on the HDD and install from them was like going from a completely manual PITB process to fully automatic (including installing drivers etc.). But yes, the old days were fun... I remember the piles of Amiga floppy disks for some games!

Whoops - sorry for the OT post!
 
Why would it be an APU design when AMD are already buying wafers for 7nm Zen 2 chiplets en masse? Wouldn't it be financially sensible for AMD, Sony, and probably MS to dip into that pre-existing pool of product and glue ;-) it to a custom GPU?

I don't know how these contracts work, but I would guess if AMD, Sony and MS all use the same Zen 2 chiplet then it's probably easier to acquire Gigafab capacity from TSMC which would bring down the cost.

There is something I've been thinking about recently: consoles usually don't use the highest-binning silicon from what I've read - in fact, more the opposite. MS also uses AMD Epyc in their datacenters. Could they negotiate a special deal with AMD for cheaper prices because of that? Along the lines of:

"Hey AMD, we will not customized the Zen 2 cores for our Scarlet familiy like we did with Jaguar in the previous consoles but instead we want to order normal Zen 2 chiplets. Due to the sheer volume of our Zen 2 order this will increase your possibility of binning immensely. We will use the lower binning dies for our consoles and give you the good binning dies from our order for your Epyc and Threadripper CPUs. But in turn we get a really good deal for our console silicon (including the GPU) and for our Azure order as well."
 
I'm not sure how many affordable client drives (any?) can sustain 300K IOPS. Depending on how random (versus linear) you'd characterize the access pattern as being, a 4K transfer size is one of the worst sources of degradation for anything that isn't Optane.

Only for writes; running games is effectively read-only.

The consumer NVMe QLC drives I've seen reviewed on Anandtech can have random 4K burst rates of ~75MB/s on an empty drive (a large number of QLC cells set to SLC mode) down to ~33 MB/s on a full drive.

Don't know if that is down to the software Anandtech uses for testing random performance. Storagereview.com uses VDBench, and here the Crucial P1 (QLC) goes above 300k IOPS for 4K reads.

The consumer-oriented WD Black SN750 TB also hits 300K IOPS, but with worse latencies.


Edit: Anandtech seems to test random performance with one queue and one thread. The sustained random performance is the maximum of one, two or four queues, but still with one thread. VDBench uses 32 queues, Storagereview tests with 1 and 8 threads, so that's a maximum of 256 ops in flight for SR vs just four for Anandtech.

Given the complexity of the problem space, I'm not sure Sony can roll its own NAND SSD in a console budget to that level of sustained performance without some bug or inconsistency that might make the results generally not worth the investment.

I do not think they will develop the entire storage solution themselves; I think they will partner with someone, WD or Micron, to develop it for them. I do think it will be soldered onto the main board rather than being an M.2 solution.

Cheers
 
I don't get the Stadia comparison... It's a cloud thing, and they can "stack them", no? Or when it's not powerful enough, I guess, upgrade them...
 
The 100K+ IOPS numbers you often see quoted for NAND are at very high queue depths. Not a gaming workload. NAND random read performance falls off a cliff at low queue depths, and contrary to popular belief reads are more expensive than writes. If you want to augment DRAM with a non-volatile medium, you really need Optane.

Devs will still have to arrange game data for sequential access and large block sizes to achieve 1GB/s+ throughput. That said, it's not hard to do, and unlike with hard drives the potential gains are enormous.
 
Given low enough latency and a high enough number of IOPS, I wonder if the game engine and even the HW could treat the SSD as RAM. Just mmap the game files into RAM and access blocks, let hits and misses happen. HBCC on steroids. Of course you want to optimize access patterns, but I wonder if such a simplistic approach (mmap, SSD as RAM) could be the base from which to start? Maybe have an additional metadata file so mmap can optimize what to never cache, what to preload to RAM, etc.

Console games should already be fairly optimized for streaming in a sequential manner. Seeks on existing console HW are brutally slow.
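A minimal Python sketch of that mmap-as-RAM idea (the pack file name, offset and madvise hint are illustrative assumptions; a console runtime would obviously do this natively, and madvise() needs Python 3.8+ on a POSIX system):

Code:
import mmap
import os

# Hypothetical read-only asset pack; any large file works for the sketch.
fd = os.open("assets.pak", os.O_RDONLY)
size = os.fstat(fd).st_size

# Map the whole pack; 4K pages fault in from the SSD on first touch and the
# OS page cache acts as the "RAM tier".
view = mmap.mmap(fd, size, prot=mmap.PROT_READ)

# Hint that access is random/demand-driven; a per-asset metadata file could
# drive smarter madvise()/WILLNEED calls, as suggested above.
view.madvise(mmap.MADV_RANDOM)

# Touching a mip level simply reads through the mapping; only the pages
# actually accessed get pulled off the drive.
offset = 64 * 1024 * 1024             # hypothetical location of a mip chunk
mip = view[offset:offset + 65536]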
 
The 100K+ IOPS numbers you often see quoted for NAND are at very high queue depths. Not a gaming workload.
It is not a gaming workload because games have been loading in large chunks since the beginning of time, to circumvent the atrocious random seek latency of legacy media (optical and HDDs).

What I'm talking about is demand loading missing texture mipmaps at 4K page boundaries. Such a workload could use as many outstanding transactions as the NAND controller supports.
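As a hedged illustration of that kind of workload, many small reads kept in flight at once (the file name, page indices and the worker count of 32 are all assumptions, not anything confirmed):

Code:
import os
from concurrent.futures import ThreadPoolExecutor

PAGE = 4096
fd = os.open("textures.pak", os.O_RDONLY)      # hypothetical texture pack

def fetch_page(index):
    # One 4K demand read; keeping many of these outstanding is what lets the
    # NAND controller approach its rated random-read IOPS.
    return os.pread(fd, PAGE, index * PAGE)

missing_pages = [15, 77, 1024, 2048, 4093]     # mip pages the renderer found missing

with ThreadPoolExecutor(max_workers=32) as pool:   # ~32 requests outstanding
    pages = list(pool.map(fetch_page, missing_pages))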

manux said:
Given low enough latency and a high enough number of IOPS, I wonder if the game engine and even the HW could treat the SSD as RAM. Just mmap the game files into RAM
Exactly!

Cheers
 
Only for writes; running games is effectively read-only.



Don't know if that is down to the software Anandtech uses for testing random performance. Storagereview.com uses VDBench, and here the Crucial P1 (QLC) goes above 300k IOPS for 4K reads.

The consumer-oriented WD Black SN750 TB also hits 300K IOPS, but with worse latencies.


Edit: Anandtech seems to test random performance with one queue and one thread. The sustained random performance is the maximum of one, two or four queues, but still with one thread. VDBench uses 32 queues, Storagereview tests with 1 and 8 threads, so that's a maximum of 256 ops in flight for SR vs just four for Anandtech.
I missed that the numbers I was using were at lower queue depths.
There is a selectable section that goes to 32-deep, but it doesn't seem to me that the throughput at a depth of 32 for the Intel and Crucial QLC drives matches Storagereview.
The Crucial drive's random read throughput at Anandtech seems far too low for the IOPs given by Storagereview. The Intel drive fares better than Crucial though still notably slower at Anandtech. A significant caveat is that greater performance doesn't manifest if the drive is full.

I'm not sure what to attribute this to. Perhaps a testing suite quirk, or a platform difference between a setup matching a desktop PC versus a Dell PowerEdge?

There are other examples in Anandtech's list, like a WD TLC drive review that has lower performance than Storagereview, but they're still in the ballpark.
 
I think we will just have RAM, RAM and RAM... RAM of the cheapest, slowest type of DDR that buffers the HDD... with an OS (also running mostly out of it) that manages things so that games just see it as a really quick HDD

Given low enough latency and a high enough number of IOPS, I wonder if the game engine and even the HW could treat the SSD as RAM. Just mmap the game files into RAM and access blocks, let hits and misses happen. HBCC on steroids. Of course you want to optimize access patterns, but I wonder if such a simplistic approach (mmap, SSD as RAM) could be the base from which to start? Maybe have an additional metadata file so mmap can optimize what to never cache, what to preload to RAM, etc.

Console games should already be fairly optimized for streaming in a sequential manner. Seeks on existing console HW are brutally slow.
(also for backwards compatibility issues)....
 
Can't remember the CPU used in Stadia.

It is: "Custom 2.7GHz hyper-threaded x86 CPU with AVX2 SIMD and 9.5MB L2+L3 cache." The cache amount implies to me each instance only gets three cores on a Xeon many core processor. Compared to an 8 core, 16 thread Zen 2 based PS5, the Sony console would be significantly more powerful.
 
There is a selectable section that goes to 32-deep, but it doesn't seem to me that the throughput at a depth of 32 for the Intel and Crucial QLC drives matches Storagereview.

SR also runs 8 threads in parallel.

Anyway, we can deduce the minimum access latency from Anandtech's numbers: NVMe SSDs seem to average ~45MB/s for 4K random accesses issued one at a time, which is ~11k IOPS and an access latency of ~90 µs.

For 300k IOPS we would need (at least) 30 simultaneous transactions in flight, so game engine developers will have to rework some of their asset-loading subsystems after decades of minimizing seeks/accesses.
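The same arithmetic spelled out (4K reads; the 45 MB/s and 300k IOPS figures are the ones quoted above):

Code:
block = 4096                       # bytes per random read
qd1_bw = 45e6                      # ~45 MB/s with one request in flight
qd1_iops = qd1_bw / block          # -> ~11k IOPS
latency = 1 / qd1_iops             # -> ~90 us per access

target_iops = 300_000
in_flight = target_iops * latency  # Little's law: ~27-30 requests outstanding
print(round(qd1_iops), round(latency * 1e6), round(in_flight, 1))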

Cheers
 
Am I missing something here, or are you really inferring the max bandwidth from how many 512B/4K blocks the SSD can process in parallel? :)
 
Am I missing something here, or are you really inferring the max bandwidth from how many 512B/4K blocks the SSD can process in parallel?

No, I derived how many transactions would need to be served in parallel to hit 300k IOPS. 300k IOPS, each reading 4K, only totals ~1.2GB/s, a lot less than the internal >6 GB/s bandwidth of eight ONFI 4 channels running at 800MT/s.
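In numbers (using the figures from the reply above):

Code:
iops, block = 300_000, 4096
host_bw = iops * block / 1e9   # -> ~1.23 GB/s of 4K reads at 300k IOPS
nand_bw = 8 * 800e6 / 1e9      # eight ONFI 4 channels at 800 MT/s, 1 byte/transfer -> 6.4 GB/s
print(host_bw, nand_bw)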

Cheers
 