Your method would require reading everything in a block and rewriting it, so at 100 MB/s any write would take a minimum of about 10 ms (roughly 5 ms to read a 512 KB block and another 5 ms to write it back). The exact figure shifts with the read, write, and erase throughputs, but it is in the vicinity of HDD access times. Writing can be done one page at a time, though, so with garbage collection it improves by a factor of 128 (the number of pages in a block), and that is where the big performance benefit comes from.
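To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. The 4 KB page size is my assumption (it gives 512 KB blocks at 128 pages per block), the function names are made up for illustration, and the model only counts transfer time at a flat 100 MB/s, so it ignores erase time and controller overhead.

```python
PAGE_BYTES = 4 * 1024        # assumed page size, not stated in the post
PAGES_PER_BLOCK = 128        # pages erased together, as in the post
THROUGHPUT = 100 * 10**6     # bytes per second (the 100 MB/s figure)


def transfer_ms(num_bytes, throughput=THROUGHPUT):
    """Time in milliseconds to move num_bytes at the given throughput."""
    return num_bytes / throughput * 1000


def block_rewrite_ms():
    """Read the whole block, then write the whole block back."""
    block_bytes = PAGE_BYTES * PAGES_PER_BLOCK
    return 2 * transfer_ms(block_bytes)


def page_write_ms():
    """Write a single page into an already-erased location."""
    return transfer_ms(PAGE_BYTES)


def write_only_speedup():
    """Bytes written per request: whole block versus one page."""
    block_bytes = PAGE_BYTES * PAGES_PER_BLOCK
    return transfer_ms(block_bytes) / page_write_ms()


if __name__ == "__main__":
    print(f"block read + rewrite: {block_rewrite_ms():.2f} ms")
    print(f"single page write:    {page_write_ms():.3f} ms")
    print(f"write-only speedup:   {write_only_speedup():.0f}x")
```

In this simple model the factor of 128 is the ratio of data written per request. If you also count the read-back of the block, the gap in transfer time is twice that.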
In theory, if write wear were not a problem, the flash could be designed with a copy-on-write feature that performs this entirely in the chip. The chip would keep a certain number of spare blocks for consecutive writes. If you have to erase 128 pages at once, which may require, say, 900 ms, then you can allocate, say, 90 spare blocks (i.e. 45 MB) to reduce the worst-case write latency to 10 ms, since the erase cost is spread over 90 writes (900 ms / 90 = 10 ms).
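The sizing of the spare pool can be sketched the same way. This is only the amortization arithmetic from the paragraph above, not a model of any real controller. The 512 KB block size is an assumption that matches 90 blocks = 45 MB, and the function names are hypothetical.

```python
import math

BLOCK_BYTES = 512 * 1024     # assumed block size (128 pages of 4 KB)


def spare_blocks_needed(erase_ms, target_latency_ms):
    """Spare blocks needed so the erase time, spread evenly over
    that many writes, stays within the target latency."""
    return math.ceil(erase_ms / target_latency_ms)


def pool_size_mb(blocks, block_bytes=BLOCK_BYTES):
    """Total capacity of the spare pool in MB (1 MB = 1024 * 1024 bytes)."""
    return blocks * block_bytes / (1024 * 1024)


if __name__ == "__main__":
    blocks = spare_blocks_needed(erase_ms=900, target_latency_ms=10)
    print(blocks)                 # 90
    print(pool_size_mb(blocks))   # 45.0
```

Halving the target latency to 5 ms would double the pool to 180 blocks, so the cost of this scheme is spare capacity rather than time.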
Of course, since wear leveling is an important issue, this is just theoretical and not useful at all.