If I have understood it correctly, NAND chips consist of blocks, and a row in a block is a page. Only a small number of blocks can be active at the same time, and only pages from active blocks can be read. A page could be, for example, 4 kB, and a block could be hundreds of kB to a few MB. When doing a random read we would first find the right block, prime the block, and then read the page. There is a finite number of blocks we can keep active, so if the block we want to read a page from is not active, the read is going to be "slow".
Part of the reason the random test is so slow is hitting random blocks and having to prime each block before being able to read a page. If we could do even simple reordering of reads so that a read becomes more likely to hit an already active block, that would be a nice perf win. One would assume a test for random reads would try to create a load that hits random blocks in the way that is most challenging for the drive.
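A minimal sketch of that kind of reordering in Python. Everything here is illustrative, not a real driver API: `block_of()` is a hypothetical logical-address-to-block mapping, and the block/page sizes are just the example figures from above.

```python
import random
from collections import defaultdict

BLOCK_SIZE = 1024 * 1024   # hypothetical 1 MB block
PAGE_SIZE = 4096           # 4 kB page, as in the example above

def block_of(addr):
    """Map a logical address to its (hypothetical) block."""
    return addr // BLOCK_SIZE

def reorder_reads(addresses):
    """Group reads by block: prime each block once, then drain its pages."""
    by_block = defaultdict(list)
    for addr in addresses:
        by_block[block_of(addr)].append(addr)
    return [a for group in by_block.values() for a in group]

def block_switches(addresses):
    """How often consecutive reads land in different blocks, i.e. how
    often we pay the 'prime the block' cost."""
    return sum(block_of(a) != block_of(b)
               for a, b in zip(addresses, addresses[1:]))

reads = [random.randrange(256 * BLOCK_SIZE // PAGE_SIZE) * PAGE_SIZE
         for _ in range(10_000)]
print(block_switches(reads), block_switches(reorder_reads(reads)))
```

With 10,000 random 4 kB reads spread over 256 blocks, the reordered stream switches blocks at most 255 times instead of roughly once per read.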
If one thought random access to small pages were the most important thing, the NAND could be designed to favor small random accesses. Or one could just go with Optane.
Yes, it's similar to banks in DRAM. The exact hierarchy is:
Many channels per chip (everything scales per channel)
Many planes per channel (this is what sets random read IOPS)
Many blocks per plane (the block is the erase unit, so it impacts writes the most)
Many pages per block (the page is the minimum read size per plane)
The smallest concurrent pieces are the planes. Each plane can have a request happening at the same time (with caveats: it depends on the architecture of the chip). Multiply everything together by the number of requests each plane can complete per second (one over the average latency per request), and that gives the total IOPS. It only works if every plane can have its next request waiting in the command buffer on time.
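That multiplication, as a quick Python sketch (the function name and signature are just for illustration):

```python
def total_iops(channels, planes_per_channel, avg_latency_s):
    """Peak random-read IOPS, assuming every plane always has its next
    request queued: each plane completes 1/latency requests per second,
    and the planes on every channel work concurrently."""
    return (1.0 / avg_latency_s) * planes_per_channel * channels
```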
The channel interface is always 8 bits wide, so a single 1200 MT/s channel is 1.2 GB/s, but in practice only a fraction of that is effective bandwidth: there's overhead from error correction, commands, timings, polling, etc. And 1200 MT/s is the top speed available. If that speed were the only thing that mattered, the PS5's 12 channels would give 14.4 GB/s. So 5.5 GB/s implies slower, inexpensive chips, and the final figure includes all the reasonable overheads.
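The raw bandwidth arithmetic as a sanity check (the 1200 MT/s and 12-channel figures are from above; nothing here models the actual overheads):

```python
MT_PER_S = 1200e6        # 1200 MT/s, the top channel speed mentioned above
BYTES_PER_TRANSFER = 1   # the channel interface is 8 bits wide
CHANNELS = 12

raw_per_channel = MT_PER_S * BYTES_PER_TRANSFER   # 1.2 GB/s per channel
raw_total = raw_per_channel * CHANNELS            # 14.4 GB/s raw
print(raw_total / 1e9)  # 14.4 -- the PS5's 5.5 GB/s effective rate implies
                        # slower chips plus ECC/command/polling overhead
```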
Each channel gets a request ready for each plane, changes page and block on each, reads the page into the page register, and transmits it on the channel, starting immediately on the next request in the queue during the transfer, because there are two page registers per plane.
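A toy model of that double buffering, under the assumption (illustrative only) that array reads and channel transfers are the two pipeline stages; the point is just that the second page register lets the next array read start while the previous page is still being clocked out:

```python
def pipelined_stream_time(n_pages, t_read, t_xfer):
    """Time to stream n pages from one plane with two page registers:
    the array read of page i+1 overlaps the channel transfer of page i,
    so steady state costs max(t_read, t_xfer) per page."""
    if n_pages == 0:
        return 0.0
    return t_read + (n_pages - 1) * max(t_read, t_xfer) + t_xfer

# ~20 us array read vs ~3.4 us to clock a 4 kB page over a 1.2 GB/s channel:
print(pipelined_stream_time(100, 20e-6, 4096 / 1.2e9))  # ~0.002 s, read-bound
```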
The physical limit is the number of requests per plane, per channel, per chip, up to the limit of the channel interface.
NAND chips have 2 planes per channel, and we are starting to see 4 planes to double the IOPS. Future versions of NAND might double this again to 8 planes. We need more width because the latency is not improving much.
Random read latency is about 20 microseconds on average, but this depends on the chip: it gets worse with TLC and QLC, and it improves whenever a vendor launches a better architecture, so it's always moving.
Hypothetical 12ch controller:
1 / 20 us = 50,000 completed transactions per second per plane
50k * 2 planes * 12 channels = 1.2M IOPS
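The same arithmetic in code, plus one line showing how sensitive it is to latency (the 50 us case is just an illustrative slower part, not a measured QLC figure):

```python
for latency_us in (20, 50):
    per_plane = 1 / (latency_us * 1e-6)   # completed requests/s per plane
    print(latency_us, "us ->", per_plane * 2 * 12 / 1e6, "M IOPS")
# 20 us -> 1.2 M IOPS; 50 us -> 0.48 M IOPS
```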
That's the absolute hardware limit. The rest of the steps are every single layer of the pipeline outlined in the Cerny presentation at GDC. There are a lot of steps that can go wrong, and it looks like all of them are taken care of by the I/O processor at full bandwidth, until the data is in memory directly usable by the GPU.