The pros and cons of eDRAM/ESRAM in next-gen

I was reading a paper on eDRAM, and there are a variety of solutions available at various power draws. The low-power forms have several times lower bandwidth than the full-power solutions; the full-power designs are ideal for high-power GPUs, while the low-power designs are ideal for integrated GPUs and mobile devices.

IBM's shared 96MB on-die L3 eDRAM design on the 22nm POWER8 architecture offers approximately 3TB/s of bandwidth. Much like the ESRAM, which is broken into 4x8MB blocks, each with its own memory controller, whose bandwidths add up to the total of 204GB/s, the eDRAM is broken into 8MB banks; however, all 96MB is shared and accessible by all the cores.
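As a back-of-the-envelope illustration of how those aggregate figures come together from per-bank numbers (the even per-bank split below is an assumption for illustration only, not an official spec):

```cpp
#include <cstdio>

int main() {
    // Assumed illustrative split: the XB1 ESRAM's quoted 204 GB/s aggregate
    // divided evenly across its 4 x 8MB blocks, each with its own controller.
    const int    esramLanes    = 4;
    const double esramLaneGBs  = 204.0 / esramLanes;    // ~51 GB/s per block (assumption)

    // POWER8: 96MB of shared L3 eDRAM quoted at ~3 TB/s aggregate,
    // organised as 8MB banks (12 banks for 96MB).
    const int    power8Banks   = 96 / 8;
    const double power8BankGBs = 3000.0 / power8Banks;  // ~250 GB/s per bank (assumption)

    printf("ESRAM aggregate : %.0f GB/s\n", esramLaneGBs * esramLanes);    // 204 GB/s
    printf("POWER8 aggregate: %.0f GB/s\n", power8BankGBs * power8Banks);  // ~3000 GB/s
    return 0;
}
```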

The Intel Iris Pro, on the other hand, is designed with low power in mind and targets laptops; its eDRAM operates off only a single, narrow, double-pumped memory controller and interface. Its bandwidth may be more of a design choice given power consumption, the fact that it only acts as an eviction cache for the CPU's L3, and that it assists an integrated GPU with limited bandwidth needs. Even low-end discrete GPUs need only a limited amount of memory bandwidth to be satisfied, as we see with something like AMD's narrow-bus 7730 cards, which outperform Intel's best Iris Pro while having only 77GB/s of bandwidth to their 1GB of GDDR5.
 
Scaling the Power Wall: A Path to Exascale

To further improve upon the energy efficiency of our register file and caching sub-system, we recognize that future HPC applications are likely to have varying needs for register file versus cache capacity to achieve optimal efficiency. We implement a malleable memory system proposed by Gebhart et al. that allows flexible use of on-chip SRAM to optimize energy efficiency [25]. Rather than having a fixed pool of registers per thread and cache capacity per thread or compute cluster, malleable memory allows the compiler to identify and expose the number of registers that will be needed for any given kernel execution. If the number of registers is small, the remaining SRAM capacity can be used to expand the reach of the data caches. If the compiler identifies that a large number of registers are needed for a thread’s working set, the processor can decrease the capacity of the caches accordingly. By flexibly moving on-chip SRAM resources between data caches and register file usage, malleable memory helps ensure that resources are not being wasted due to a fixed provisioning of capacity.

https://www.cs.utexas.edu/users/skeckler/pubs/SC_2014_Exascale.pdf
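To make the partitioning idea concrete, here is a minimal sketch of how it might look; the function name, the per-cluster sizes, and the simple clamp policy are my own illustrative assumptions, not something taken from the paper:

```cpp
#include <cstdio>
#include <cstddef>

struct SramPartition {
    size_t registerFileBytes;
    size_t dataCacheBytes;
};

// Hypothetical sketch of the malleable-memory idea from Gebhart et al.:
// the compiler reports how many registers a kernel's threads need, and
// whatever is left of the per-cluster SRAM budget goes to the data caches.
SramPartition partitionSram(size_t totalSramBytes,
                            size_t threadsPerCluster,
                            size_t registersPerThread,   // compiler-identified
                            size_t bytesPerRegister = 4)
{
    size_t rfBytes = threadsPerCluster * registersPerThread * bytesPerRegister;
    if (rfBytes > totalSramBytes) rfBytes = totalSramBytes;   // clamp: registers win
    return { rfBytes, totalSramBytes - rfBytes };
}

int main() {
    // Illustrative numbers only: 128 KB of SRAM, 512 threads per cluster.
    SramPartition lean  = partitionSram(128 * 1024, 512, 16); // few registers -> big cache
    SramPartition heavy = partitionSram(128 * 1024, 512, 48); // many registers -> small cache
    printf("16 regs/thread: RF %zu B, cache %zu B\n", lean.registerFileBytes,  lean.dataCacheBytes);
    printf("48 regs/thread: RF %zu B, cache %zu B\n", heavy.registerFileBytes, heavy.dataCacheBytes);
    return 0;
}
```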

I want to know: are there similarities between the SRAM on the XOne and the malleable memory in the article?
 
Hard to interpret, as the article is rather more a general overview of 'this is how we get to Exascale' than it is an exploration of how best to use SRAM. In particular, they are using SRAM as 'malleable memory', which they define on page 6 as applying to the L1 and registers rather than L3 or lower, which is where the ESRAM on XB1 lives. Most of all, though, for me it's hamstrung by this sentence:

If the compiler identifies that a large number of registers are needed for a thread’s working set, the processor can decrease the capacity of the caches accordingly.

Every time I hear of an advance that will offer great improvements while relying on a compiler identifying anything, I get itchy. Now, for the space this paper addresses, HPC, there are a lot of undergrads and postgrads you can throw at the very specific, narrow problems they are evaluating. For a more 'general use' compiler, such as those used to compile commercial software, this is much harder; it's a large part of why most EPIC and VLIW designs have a hard time outside of niche areas where hiring dedicated teams to write bespoke software is standard practice.
 
In particular, they are using SRAM as 'malleable memory', which they define on page 6 as applying to the L1 and registers rather than L3 or lower, which is where the ESRAM on XB1 lives.

Thanks for your help.
 
I want to know: are there similarities between the SRAM on the XOne and the malleable memory in the article?
No. They're talking about cache very close to the CPU, at the L1, with very low latencies. There's no way XB1's ESRAM is operating that fast given both bus connection and capacity. It actually reminds me more of Cell's SPE local store.
 
No. They're talking about cache very close to the CPU, at the L1, with very low latencies. There's no way XB1's ESRAM is operating that fast given both bus connection and capacity. It actually reminds me more of Cell's SPE local store.

Thanks to you.
 

https://twitter.com/draginol/status/573268559654072320

https://twitter.com/draginol/status/573267363308564480
 
"New tool called Pix." Ha ha ha ha ha! Has he been coding on XB without looking at the software suite all this time?
 
Most probably PIX made its debut on the Xbox One just recently. Either that or someone caught him in a fib.

What happened to the Anisotropic Filtering thread, btw?
PIX for Durango has been part of the XDK since the May 2012 release...
 
PIX for Durango has been part of the XDK since the May 2012 release...

Per http://www.eurogamer.net/articles/digitalfoundry-2015-evolution-of-xbox-one-as-told-by-sdk-leak , ESRAM viewer has only been included since September.

Eurogamer said:
Development tools: Microsoft's profiler, PIX, continues to receive updates and in September, Microsoft introduces an ESRAM viewer into the system designed to help developers maximise usage of the ultra-fast scratchpad.

This is probably what he meant.

Twitter and 140 chars are the enemy of precise language.
 
I guess we were a little too soon to judge him.

This article has a bit more info on the new SDK that was released with respect to esram management.
http://www.redgamingtech.com/esram-performance-improves-15-dx12-info-xbox-one-analysis/

Brad Wardell says in a blog post “This is where DirectX 12 comes in. First, they’re redoing the APIs to deal with eSRAM based on developer feedback. Second, they have a wonderful tool called “Pix” that now has a feature (or will very soon) that lets developers try out different strategies in using eSRAM and see how it performs without having to rebuild the project (i.e. they can simulate it in the tool). This too is huge.”

I guess whenever I read that you can make changes without a recompile/relink, I see that as a huge gain in productivity. I'm not sure how bad it was before though.

IIRC someone asked a long time ago what would be held in esram, and what would be held in DDR for splitting.
The article also includes a picture of what that looks like.
 
I was trying to find out more about split render targets on the Xbox One to see how that's accomplished, and I came across this in the leaked SDK:
The difference in throughput between ESRAM and main RAM is moderate. 102.4GB/s versus 68 GB/s. The advantages of ESRAM are lower latency and lack of contention from other memory clients -- for instance CPU, I/O, and display output. Low latency is particularly important for sustaining peak performance of the color blocks (CBs) and depth blocks (DBs)
There appear to be 4 Depth Blocks and 4 Colour Blocks. Could anyone help in explaining what these items are and what their purpose is in creating the targets? I've been searching Google but have come up empty-handed so far.
 
I was trying to find out more about split render targets on the Xbox One to see how that's accomplished, and I came across this in the leaked SDK:

There appear to be 4 Depth Blocks and 4 Colour Blocks. Could anyone help in explaining what these items are and what their purpose is in creating the targets? I've been searching Google but have come up empty-handed so far.
Don't know if it will help much, but the depth and color blocks are the ROPs.
 
Hey guys, newbie question coming up, but this is important for me to resolve any remaining ignorance I have about the memory system on the XBO.

If I understand it correctly, the XBO GPU has access to two pools of memory:
esram, which it can read/write to @ 1024 bits
DDR, which it can read/write to @ 256 bits
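For reference, those bus widths line up with the commonly quoted bandwidth figures roughly as follows; the 853MHz clock, the 2133MT/s DDR3 rate, and the read+write overlap factor are assumptions based on publicly reported XB1 specs (the post above only gives the bus widths):

```cpp
#include <cstdio>

// Peak bandwidth in GB/s for a bus of the given width, one transfer per cycle.
double peakGBs(int busBits, double gigaTransfersPerSec) {
    return (busBits / 8.0) * gigaTransfersPerSec;  // bytes/transfer * 10^9 transfers/s
}

int main() {
    // Assumed rates: 853 MHz GPU/ESRAM clock, DDR3 at 2133 MT/s effective.
    double esramOneWay = peakGBs(1024, 0.853);          // ~109 GB/s read or write
    double esramPeak   = esramOneWay * (1.0 + 7.0/8.0); // assumed read+write overlap on
                                                        // ~7 of 8 cycles -> ~204 GB/s figure
    double ddr3        = peakGBs(256, 2.133);           // ~68 GB/s

    printf("ESRAM one-way : %.1f GB/s\n", esramOneWay);
    printf("ESRAM peak    : %.1f GB/s\n", esramPeak);
    printf("DDR3          : %.1f GB/s\n", ddr3);
    return 0;
}
```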

The DME engines can perform the following operations (if I understand correctly):
copy DDR to esram
esram to DDR
esram to GPU
DDR to GPU
GPU to esram
GPU to DDR

ESRAM is capable of simultaneous read & write at a maximum theoretical bandwidth of 204GB/s, but only if you are not reading and writing to the same location. When modifying values, this number will drop to approximately 140GB/s. DDR will get its bandwidth chopped really badly under heavy simultaneous read & write operations.

Having said that, the typical memory-related optimization pattern for games, to avoid heavy bandwidth penalties, is to perform large blocks of reads followed by large blocks of writes. This effectively maximizes your bandwidth, and if you have a large number of ALUs you will effectively load them all up, wait for them all to finish, and write the results all back. This type of memory pattern benefits GPUs with a larger number of CUs, but because you are performing all reads, followed by all writes, followed by all reads, if I understand this correctly, this is bound to create a lot of idle time for your ALUs: while the GPU is writing all its results off to RAM and waiting for the next batch of work to come in, it's doing nothing (enter async compute).
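A rough CPU-side analogy of that pattern (purely illustrative, not actual GPU code): stage a large block of reads into local storage, do all the work on it, then flush a large block of writes in one burst:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Illustrative sketch of the "big block of reads, then big block of writes"
// pattern: stage a whole tile into fast local storage, do all the ALU work,
// then stream the results back out in one burst.
// 'output' is assumed to be pre-sized to input.size().
void processTiled(const std::vector<float>& input, std::vector<float>& output,
                  std::size_t tileSize)
{
    std::vector<float> tile(tileSize);
    for (std::size_t base = 0; base < input.size(); base += tileSize) {
        std::size_t n = std::min(tileSize, input.size() - base);

        // 1) Large block of reads (think: DDR -> local/ESRAM staging).
        for (std::size_t i = 0; i < n; ++i)
            tile[i] = input[base + i];

        // 2) Compute on the staged data while no external traffic is needed.
        for (std::size_t i = 0; i < n; ++i)
            tile[i] = tile[i] * tile[i] + 1.0f;   // stand-in for real ALU work

        // 3) Large block of writes back out in one burst.
        for (std::size_t i = 0; i < n; ++i)
            output[base + i] = tile[i];
    }
}
```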

With the XBO memory setup, I can imagine this to be detrimental, since (a) it has fewer ALUs to load up with large batches of work, and (b) the memory is not as wide as GDDR, so it could take additional trips to completely move its set of data.
However, if the GPU is not writing back to DDR at all, and it's writing its results straight into ESRAM, then shouldn't information be constantly streamed from DDR straight into the GPU? Results would be written into ESRAM as required, and additional free ALUs could read and write results back into ESRAM (essentially no bandwidth penalty) while the GPU waits for DDR to pull in more resources (I assume textures).
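One way to picture the placement strategy being asked about, as a purely hypothetical plan (the names and sizes below are illustrative, not the actual XDK API): keep the frequently rewritten, bandwidth-hungry targets in ESRAM and leave read-mostly data in DDR:

```cpp
#include <cstdio>

// Hypothetical illustration of the split being discussed: render targets
// that the ROPs hammer every frame live in ESRAM; large, read-mostly
// resources (textures, geometry) stay in DDR and are streamed in.
enum class Pool { Esram, Ddr };

struct Resource {
    const char* name;
    Pool        pool;
    int         sizeMB;
};

int main() {
    const Resource plan[] = {
        { "depth buffer (1080p, 32-bit)", Pool::Esram,   8 },
        { "color / light accumulation",   Pool::Esram,  16 },
        { "texture pool (streamed)",      Pool::Ddr,   512 },
        { "vertex/index buffers",         Pool::Ddr,   128 },
    };

    int esramUsed = 0;
    for (const Resource& r : plan) {
        if (r.pool == Pool::Esram) esramUsed += r.sizeMB;
        printf("%-32s -> %s (%d MB)\n", r.name,
               r.pool == Pool::Esram ? "ESRAM" : "DDR", r.sizeMB);
    }
    printf("ESRAM budget used: %d / 32 MB\n", esramUsed);
    return 0;
}
```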

Aside from some very particular textures that are being used all over the place, what's the point of copying values from DDR into esram before doing work?
Is this right?
 