Embedded Memory & Novel Techniques for IQ and Rendering

Acert93

**This thread is intended for those with an understanding of rendering pipelines and the implications of using embedded, high bandwidth memory on-chip. It is designed to explore the potential benefits of such a setup and the novel approaches it allows for tackling traditional problems from new angles. This is NOT an Orbis vs. Durango thread; this is an embedded memory vs. traditional memory configuration thread, about the consequential benefits and detriments of such choices. Please leave the uneducated bickering to the other VS. threads. Thank you.**

As this is the console forum and Durango is rumored to have embedded memory on the SoC, the general flow of points and discussion should be aimed in that direction as much as possible. Some basic talking points regarding recent rumors:

* Durango is rumored to have 32MB of embedded memory (sounds like a form of SRAM memory space accessible by the GPU and NOT like the Xenos daughter die with the ROPs/eDRAM).

* Bandwidth is not confirmed, but rumored to be about 102GB/s (the figure function used to predict 16 ROPs @ 800MHz; see the arithmetic sketch after this list).

* Latency is not confirmed, but is rumored to be low.

* ERP has noted repeatedly that FLOPs are not a measure of a console's performance, and that the memory hierarchies of the new consoles will be a significant factor in their performance.

* ERP has heard through the grapevine that MS is indicating the ESRAM should not necessarily be used like the eDRAM in the Xbox 360.

* Different architectures and designs may get similar performance even when the system bandwidth and peak flops are substantially different.
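
For reference, a back-of-envelope sketch of the arithmetic behind function's ROP prediction. The 8-bytes-per-ROP-per-clock figure (32-bit color plus 32-bit Z) is my assumption for the sketch, not anything from the leaks:

```python
# Back-of-envelope check of function's ROP prediction (assumption: each
# ROP writes 8 bytes per clock -- 32-bit color plus 32-bit Z).
clock_hz = 800e6            # rumored GPU clock
bytes_per_rop = 4 + 4       # 32-bit color + 32-bit Z per pixel, per clock
rops = 16

bandwidth = rops * bytes_per_rop * clock_hz
print(bandwidth / 1e9)      # -> 102.4, matching the ~102GB/s rumor
```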

A few lines of thought I can think of to explore; developers may offer others.


#1. What specific efficiency gains could the Durango engineers have envisioned by moving a block of small, fast, and low latency memory onto the SoC?

Are there specific cases?

Are these corner cases? e.g. taking a 1.0ms render task that has a lot of stalls and cleaning it up to 0.3ms?

Or are there real bottlenecks--ones that consume larger amounts of render time (e.g. multiple ms)--that could see significant improvements? e.g. a drop from 5ms down to 1ms because the issue is one of latency and not of pure bandwidth/compute?


#2. What novel techniques does 32MB of embedded memory open the door to?

Embedded memory need not be all about "making up" for lower peak compute/system bandwidth. The other way to skin the cat is to envision approaches tailored to the unique properties of a low latency, high bandwidth memory space that is accessible by many parts of the GPU pipeline and fairly close to the CPU as well.

So does memory embedded on the GPU open some doors to techniques that could substantially impact the end-product IQ? I am only a novice, but it sounds interesting if, say with MegaTexture, your entire texture budget for a frame could be tossed into the 32MB of memory; that would be very, very low latency compared to a traditional GPU. (Maybe Ptex wouldn't be so slow?) Could some novel post processing be applied as well? It sounds like this would be a great way to do advanced post processing, where the frame (or binned segments of it) and the Z-buffer could be quickly traversed for all sorts of effects.
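
To put 32MB in perspective, a rough footprint check, assuming plain 32-bit render target formats (real target layouts and any compression will differ):

```python
# Rough footprint check: what fits in 32MB at 1080p? (Assumes plain
# 32-bit formats; real target layouts and compression will differ.)
MB = 1024 * 1024
budget = 32 * MB

w, h = 1920, 1080
color = w * h * 4          # RGBA8 back buffer, ~7.9MB
depth = w * h * 4          # D24S8 depth/stencil, ~7.9MB
gbuffer = 3 * w * h * 4    # three extra 32-bit deferred targets, ~23.7MB

print((color + depth) / MB)            # ~15.8 -> fits with room to spare
print((color + depth + gbuffer) / MB)  # ~39.6 -> a fat G-buffer overflows
```

A color + depth pair fits with plenty left over for texture or post-processing scratch space; a fat deferred G-buffer does not, which is where binning or dynamic resolution would come in.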

Are there examples from SIGGRAPH of unique rendering techniques or approaches that are not compute limited but more so latency limited? Are any of these friendly to small memory spaces? Are there examples of big IQ wins that see a multi-factor performance increase when paired with low latency memory?


#3. Developer Simplicity.

So far, to my ears, Durango's embedded memory sounds a lot like MS's version of Sony's Cell: it requires a crapload of man-hours to get similar performance, offers corner cases for improvement, and has a lot of ways to break. For a company like MS that has pretty much prioritized developer ease, what am I missing?

Explain some ways such a memory space is going to make developers' lives easier, both in development AND in getting the best IQ results. The Xbox 360's eDRAM had the benefit of allowing the ROPs to run at full speed without eating into system bandwidth. So, all the niggles about tiling and size aside (see the sketch below), there were developer benefits that helped developers tune their games.
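
As a reminder of what those tiling niggles looked like, a quick sketch, assuming 32-bit color and 32-bit depth/stencil per sample:

```python
import math

# Why tiling was a "niggle" on 360: 10MB of eDRAM vs. a 720p 4xMSAA
# target (assumes 32-bit color and 32-bit depth/stencil per sample).
MB = 1024 * 1024
edram = 10 * MB

pixels = 1280 * 720
samples = 4
bytes_per_pixel = (4 + 4) * samples    # color + depth, per sample

target = pixels * bytes_per_pixel      # ~28.1MB
print(math.ceil(target / edram))       # -> 3 tiles, each re-submitting geometry
```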

Is ESRAM going to offer the same benefits, or are we looking at a case where the ESRAM (a) offers BC with the Xbox 360, (b) was a cheap solution to the bandwidth issue (expensive GDDR5, wide buses, and the limited process shrinks that wide buses allow), and (c) fits well into the lower bandwidth (68GB/s) and 12CU "ball park" of a more OS/App-oriented approach, with it simply being an "unfortunate" situation that Sony aimed a bit higher? In that case Durango, for its performance target (i.e. roughly 33% less than Orbis; rough math below), is a solid design, and attempting to force the ESRAM to solve the performance deficit is asking the wrong question. ESRAM may make developers' lives easier as long as they are not attempting to match exact parity: the design will remain simple and efficient as long as framerate, framebuffer size and dynamic resolution, LOD, letterboxing, and IQ adjustments are taken into consideration.
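
For where that "33% less" shorthand comes from, here are the rumored CU counts run through standard GCN arithmetic; the 800MHz clock for both machines is assumed, and every number here is a rumor rather than a spec:

```python
# "33% less than Orbis" from rumored CU counts and standard GCN math
# (both clocks assumed to be 800MHz; rumors, not specs).
flops_per_cu = 64 * 2      # 64 lanes x 2 ops (FMA) per clock
clock_hz = 800e6

durango = 12 * flops_per_cu * clock_hz   # ~1.23 TFLOPS
orbis = 18 * flops_per_cu * clock_hz     # ~1.84 TFLOPS
print(1 - durango / orbis)               # -> 0.333..., i.e. one third less
```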
 
The more I look at the vgleaks diagram and function's math, the more this seems to make sense. The vgleaks Durango diagram clearly (intentionally?) shows a secondary memory interface to the ESRAM, in addition to the 360-esque dedicated GPU bus, implying that the CPU can store/access data there directly. If that is the case, would special logic be required to maintain coherency? So, to add to Acert's line of questioning: what challenges are introduced if this is the case, and what role might the mysterious DMEs play in supporting that?

EDIT: Feel free to move this to one of the other threads. It just seemed that, in addition to the possibilities embedded memory makes available, it would be good to know the challenges it introduces as well. More developer-centric views on this angle would certainly interest me too.
 
I am wondering whether it will open up some of the GPGPU prowess of the APU or make PPU-like functionality more robust. Didn't the PhysX PPU use scratchpad RAM?

It has to be for something new, because isn't eDRAM something like four times denser than SRAM? Could MS have gone with at least 64 or 128MB if it had used eDRAM instead?

The GPU in the GC used 2MB of embedded SRAM (1T-SRAM) for its framebuffer to alleviate its lack of bandwidth, so maybe MS is attempting the same thing, with the amount of RAM mattering more than raw bandwidth for the 720's feature set.
 

Could the ESRAM act as an L3 or L2 CPU/GPU cache?
 