All this makes me wonder how the GPU sees the two pools of RAM (main RAM and the scratchpad). I was assuming (as were some others) that the move engines' role was to let the GPU see those two pools of RAM as one (/fake it).
The data provided by VGleaks seem to imply that the GPU by itself is able to deal with those two pools of RAM without having to resort to the DMEs.
The data are a bit confusing to me, as they use "shader" for the GPU; it is unclear how much bandwidth the "shader cores" (/SIMDs, or CUs) or the ROPs have to the scratchpad memory.
I have to say I'm a bit lost: can the CPU access the scratchpad, and would that even help in some way?
Instead of pre-fetching data using the CPU cores, you set the DMEs to gather (or scatter) data from main RAM. I would think that the latency for the CPU to read or write that pool of memory would be (too) high, but if you can have the CPU pre-fetch from the scratchpad, could it "work"?
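To make the idea concrete, here is a minimal software sketch of what such a gather would look like. Everything here is an assumption: the `dme_chunk` descriptor and `dme_gather` function are hypothetical names I made up, and a real move engine would do this in hardware, asynchronously, not as a `memcpy` loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DME gather descriptor: one scattered chunk of main RAM. */
typedef struct {
    const uint8_t *src;   /* source address in main RAM */
    size_t         len;   /* bytes to copy              */
} dme_chunk;

/* Software stand-in for a hardware gather: pack scattered chunks into
 * one contiguous scratchpad region, so the CPU can then read them back
 * with (hopefully) lower latency than from main RAM. */
static size_t dme_gather(uint8_t *scratchpad, size_t cap,
                         const dme_chunk *chunks, size_t n)
{
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        if (off + chunks[i].len > cap)
            break;                 /* scratchpad region full, stop */
        memcpy(scratchpad + off, chunks[i].src, chunks[i].len);
        off += chunks[i].len;
    }
    return off;                    /* bytes actually gathered */
}
```

The point of the sketch is only the access pattern: the CPU (or GPU) sees one linear buffer in the scratchpad instead of chasing pointers through main RAM.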
I'm thinking of those big data structures used, for example, by Epic in UE4: could it be possible to keep them compressed in RAM, load the relevant parts into the scratchpad memory (still compressed), and then, on request from the CPU, stream and decompress the data (on the fly) into the CPU cache?
(I wonder the same about virtual texturing, or the data structures used by Kinect.)
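A rough sketch of that "compressed in the scratchpad, decompressed on demand" flow, using a trivial run-length codec as a stand-in for whatever codec the DMEs actually implement (which is pure assumption on my part); the compressed stream here is just (count, value) byte pairs:

```c
#include <stddef.h>
#include <stdint.h>

/* Decompress a run-length-encoded stream (pairs of count, value),
 * as a stand-in for streaming compressed data out of the scratchpad
 * into a CPU-side working buffer. Returns bytes written. */
static size_t rle_decompress(const uint8_t *comp, size_t comp_len,
                             uint8_t *out, size_t out_cap)
{
    size_t written = 0;
    for (size_t i = 0; i + 1 < comp_len; i += 2) {
        uint8_t count = comp[i];
        uint8_t value = comp[i + 1];
        for (uint8_t k = 0; k < count && written < out_cap; k++)
            out[written++] = value;
    }
    return written;
}
```

The win, if there is one, is that the scratchpad only has to hold the compressed form, and the decompression cost is paid per request rather than up front.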
If the scratchpad is used by the CPU (too), the 25.6GB/s of bandwidth is less of an issue, because no matter how the system is put together (1 or 2 chips) I don't expect the CPU to be able to suck much more bandwidth than that.
Another idea: could the CPU write commands for the GPU into the scratchpad, with the data compressed on the fly and decompressed on the fly by the GPU when it reads them?
Another thing is tessellation. I remember reading something about previous AMD GPUs: the GPU had to dump data to RAM (depending on the level of geometry amplification). Could it be a win if the GPU could dump that data (compressed on the fly) into the scratchpad?
It would be interesting if the scratchpad is not used "overall" by the whole system (and mostly the GPU) as a monolithic place used mostly to "render" (/deal with bandwidth-intensive operations like blending), but is instead used in many different ways as a buffer by the units within the system (I mean a general-purpose scratchpad memory).
What compression ratio can we expect? Or rather, "how big" could the 32MB of memory effectively be made?
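Back of the envelope, assuming some (entirely speculative) lossless compression ratios: a 2:1 ratio would make the 32MB behave like 64MB of uncompressed data, 1.5:1 like 48MB, and so on. As a trivial helper:

```c
/* Effective capacity of a scratchpad at an assumed compression ratio.
 * The ratios themselves are guesses; this is just the arithmetic. */
static unsigned effective_mb(unsigned physical_mb, double ratio)
{
    return (unsigned)(physical_mb * ratio);
}
```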
EDIT
Another question that pops into my head is the amount of RAM and the number of CPU cores supposedly reserved to run the "OS"; it sounds like quite a lot to me.
I wonder if part of that reservation could be used to run the "API" (/system-level driver) Edge spoke about: could a core or more (as well as some memory) be used to manage the DMEs, and to arbitrate which "system" in your program (/game) uses the scratchpad, and in what amount?
I've no idea here, just wondering (though my wording is unclear) if, out of these resources (a lot, going by the rumors), much is used to make the scratchpad (holding, more often than not, only compressed content) act pretty transparently as what could be a big L3 for the whole system (CPU and GPU). It could stream stuff in advance into the reserved RAM (be it virtual textures, data structures holding voxels, Kinect data, and/or whatnot) and then try to put that stuff into the scratchpad in time for either the GPU or the CPU to use?