What's needed in the API for better streaming?
Lazy devs!
The problem I found trying to do that with D3D11 is that you can't map a default buffer; you need a staging one. You can't keep a staging buffer mapped while the GPU copies from it, which means you need (worst case, most likely never met) as many staging buffers as default buffers.
(Since you have multi-threaded data streaming you need to keep your buffers mapped for a while before you can use their data, and you want to "upload" as soon as possible.)
(Whatever happens you get an extra copy; you can't just read from disk straight into the destination area.)
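To make the staging constraint concrete, here is a minimal sketch of a fixed pool of staging slots shared by loader threads. It does not call D3D11 at all: `StagingPool`, `StagingSlot` and the three states are hypothetical names, and a plain byte vector stands in for a buffer created with D3D11_USAGE_STAGING. The point it shows is that the pool only needs as many slots as there are loads in flight at once, not one per default buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a staging buffer. In a real D3D11 renderer each
// slot would wrap an ID3D11Buffer created with D3D11_USAGE_STAGING.
enum class SlotState { Free, Mapped, InFlight };

struct StagingSlot {
    std::vector<std::uint8_t> bytes;   // CPU-visible storage a loader fills
    SlotState state = SlotState::Free;
};

class StagingPool {
public:
    StagingPool(std::size_t slotCount, std::size_t slotBytes) : slots_(slotCount) {
        for (auto& s : slots_) s.bytes.resize(slotBytes);
    }

    // Loader thread: take a free slot and treat it as mapped.
    // Returns false when every slot is busy; the caller waits or retries.
    bool acquire(std::size_t& outSlot) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (slots_[i].state == SlotState::Free) {
                slots_[i].state = SlotState::Mapped;
                outSlot = i;
                return true;
            }
        }
        return false;
    }

    // Memory the loader writes into while the slot is mapped.
    std::uint8_t* data(std::size_t slot) { return slots_[slot].bytes.data(); }

    // Render thread: the slot is unmapped and the copy into the default
    // buffer is queued. The CPU must not touch it again until retire().
    void submit(std::size_t slot) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(slots_[slot].state == SlotState::Mapped);
        slots_[slot].state = SlotState::InFlight;
    }

    // Called once the copy is known to have finished; the slot is reusable.
    void retire(std::size_t slot) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(slots_[slot].state == SlotState::InFlight);
        slots_[slot].state = SlotState::Free;
    }

    std::size_t freeCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::size_t n = 0;
        for (const auto& s : slots_) {
            if (s.state == SlotState::Free) ++n;
        }
        return n;
    }

private:
    std::mutex mutex_;
    std::vector<StagingSlot> slots_;
};
```

A loader would call `acquire`, read from disk into `data(slot)`, then hand the slot to the render thread, which calls `submit` and later `retire`. The extra copy is still there; the pool only bounds how many staging buffers you have to keep around.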
You always have to go through the API for memory management. That can be justified for textures (their layout is GPU-specific), but for a raw buffer it's a bit lame...
[An alternative would be to explicitly call an API transcoding function.]
So you can't manage memory yourself, which cuts out a lot of the clever stuff you could do, including using memory as one giant cache driven by a cache algorithm.
In theory all the API cares about is the set of states for a draw command and the source/format of the data.
If you could just create headers/descriptors and modify their data pointers, that would be nice. Having access to the command queue would also allow copying/pasting parts of it for reuse.
After that, there's the option to control the MMU, specifically to decide the virtual-to-physical memory mapping. At that point all you have to do is divide memory into pages and manage those pages with a cache algorithm.
That's it. With that you can stream any kind of data and always get rid of the least relevant pages. (So it's only as good as the cache algorithm you use.)
(I don't know how expensive switching pages would be. You could also have software page faults for resources; that would be reactive instead of proactive loading, but much easier to handle.)
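As a rough illustration of "divide memory into pages and manage them with a cache algorithm", here is a small sketch that uses plain LRU as the cache algorithm. Everything in it is hypothetical: `PageCache`, `touch` and the `fault` flag are made-up names, the class only does bookkeeping, and no real MMU or GPU API is involved. A returned `fault` of true plays the role of the software page fault, i.e. the caller would have to stream the page's data into the returned physical page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Bookkeeping only: maps virtual page numbers to a fixed set of physical
// pages and evicts the least recently used page when all are taken.
class PageCache {
public:
    struct Result {
        std::size_t physicalPage;  // where the virtual page now lives
        bool fault;                // true if the page data must be (re)loaded
    };

    explicit PageCache(std::size_t physicalPages) : capacity_(physicalPages) {
        assert(physicalPages > 0);
    }

    // Request a virtual page, making it the most recently used one.
    Result touch(std::uint64_t virtualPage) {
        auto it = table_.find(virtualPage);
        if (it != table_.end()) {
            // Hit: move the entry to the front of the recency list.
            lru_.splice(lru_.begin(), lru_, it->second);
            return {it->second->physicalPage, false};
        }

        std::size_t physical = 0;
        if (lru_.size() < capacity_) {
            // Some physical pages have never been used; take the next one.
            physical = lru_.size();
        } else {
            // Everything is in use: evict the least recently used page
            // and reuse its physical page.
            const Entry& victim = lru_.back();
            physical = victim.physicalPage;
            table_.erase(victim.virtualPage);
            lru_.pop_back();
        }

        lru_.push_front(Entry{virtualPage, physical});
        table_[virtualPage] = lru_.begin();
        return {physical, true};
    }

    bool resident(std::uint64_t virtualPage) const {
        return table_.count(virtualPage) != 0;
    }

private:
    struct Entry {
        std::uint64_t virtualPage;
        std::size_t physicalPage;
    };

    std::size_t capacity_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::uint64_t, std::list<Entry>::iterator> table_;
};
```

With two physical pages, touching virtual pages 10, 20, 10, 30 in that order evicts page 20 and reuses its physical page for 30. Swapping LRU for something smarter (distance to camera, predicted reuse) only changes which entry gets picked as the victim.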
We aren't lazy; we just have a limited amount of time, and most often code bases are inflating dangerously because many people don't seem to understand that more code is bad.
(Longer compile times, way more reading before modifying...)
-*EDIT*-
Forgot to mention that I wouldn't mind GPUs getting a common ISA, just like CPUs, with a fixed and known texture memory layout, command stream and things like that.
(And I don't care if you need a massive decoder somewhere, x86 does it.)