Memory bandwidth vs memory amount *spin off*

On a second note, I'd like to emphasize the need for ECC as memory amount and bandwidth increase.

No, it doesn't. Everyone keeps ignoring that in the Google report they found bit errors to scale linearly with the physical volume of the RAM, not with the number of bits. This agrees with the theory that nearly all bit errors are caused by radiation, either from the materials the chips are made from or from outside sources. As you scale the RAM to smaller processes, the bit error rates go down. Eight chips of RAM have roughly the same number of errors, regardless of whether they are 2 GB of DDR4 or 512 MB of GDDR3.
 
I meant it was related to the amount of data going through the chips, and to having more DIMMs/chips.
(We usually have more DIMMs when we get more memory, but you are correct.)
I'm not sure whether the Google report covers what I meant, though...

Also we may be going off-topic.
 
In the case in question, we are talking about a max memory bandwidth in the range of 75-100 GB/s and a realistic bandwidth on the order of 70-80% of peak, for a range of 56-75 GB/s. Which gives a per-frame bandwidth of roughly 1 GB or a bit higher at 60 FPS.
I thought we were talking about known current systems and current unified memory architectures (Trinity, Sandy/Ivy Bridge, maybe even Xbox 360), not some rumored future ones. We only have actual facts and benchmark data of existing systems (everything else is pure speculation, and as I said earlier, I am not interested in participating in next gen speculation).

Please don't tell me you think a 70-100 GB/s unified memory architecture is considered "slow" by today's standards. Not even Intel's highest end 12-thread Sandy Bridge E and the fully enabled 16-thread Xeon server CPU versions are equipped with a memory system that fast. Quad channel DDR3-1600 is the fastest officially supported, and it provides 51 GB/s of theoretical bandwidth (37 GB/s in benchmarks, not far from AMD's utilization percentages: http://www.anandtech.com/show/5091/...-bridge-e-review-keeping-the-high-end-alive/4). These chips cost $1000+, and the motherboards supporting quad channel memory aren't cheap either.

Let's look at the highest end desktop APUs available with unified memory. Dual channel DDR3-1600 is the maximum officially supported memory for Intel's flagship desktop APU (Ivy Bridge). Dual channel DDR3-1866 is the maximum officially supported memory for AMD's flagship desktop APU (Trinity). Memory bandwidths are 25.6 GB/s and 29.9 GB/s respectively. These figures match perfectly with my calculations for the "slow" memory system (common DDR3 memory at the highest commonly available clocks).
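For reference, those theoretical figures fall straight out of the usual formula, transfer rate x channel width x channels (a quick sketch using the officially supported configurations above):

```cpp
#include <cstdio>

// Theoretical DDR bandwidth = transfer rate (MT/s) * channel width (bytes) * channels.
int main()
{
    const double channelBytes = 8.0; // 64-bit DDR3 channel

    double ivyBridge = 1600e6    * channelBytes * 2 / 1e9; // dual channel DDR3-1600 ~25.6 GB/s
    double trinity   = 1866.67e6 * channelBytes * 2 / 1e9; // dual channel DDR3-1866 ~29.9 GB/s
    double snbE      = 1600e6    * channelBytes * 4 / 1e9; // quad channel DDR3-1600 ~51.2 GB/s

    std::printf("Ivy Bridge %.1f GB/s, Trinity %.1f GB/s, SNB-E %.1f GB/s\n",
                ivyBridge, trinity, snbE);
    return 0;
}
```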

Of course you can find memory kits designed for CPU overclockers. I actually bought these kinds of premium memory sticks for my old Q6600-based desktop. The problem with these enthusiast kits is that they are produced in very low quantities (cherry picked parts), and thus the price is very high. For example, the cheapest DDR3-2400 kit (2 x 4 GB) I found on newegg.com was a G.SKILL Ripjaws Z series at $96.99. In comparison you will find standard DDR3-1600 kits (2 x 4 GB) for $40.99. As DDR3-1600 is the highest officially supported on Intel platforms, it is commonly used in brand new high end gaming desktops, and thus is the most relevant high volume product that we can still somehow qualify as "slow and cheap".
Then if the design has a high speed temporary buffer of reasonable size (32 MB+), this also reduces the amount of non-static texture data that must be stored and read, further increasing the relative share of texture bandwidth and therefore the streaming texture cache space required.
Relatively large manual high speed "caches" such as the Xbox 360 EDRAM are very good for reducing redundant bandwidth usage (especially for GPU rendering). EDRAM removes all the memory bandwidth waste you get from blending, overdraw, MSAA and z-buffering. Basically you get all of these for free. The bandwidth-free overdraw of course helps with shadow maps as well, but since the Xbox 360 cannot sample from EDRAM, you have to eventually copy the shadow map to main memory (consuming memory bandwidth) and sample it from there (consuming memory bandwidth just like any static texture). The same is true for g-buffer rendering and sampling (it must eventually be copied to main memory and sampled from there, consuming memory bandwidth).

However, no matter how excellent EDRAM is, it cannot increase the maximum total accessible unique memory per frame. It can "only" (drastically) reduce the waste from double (or even higher) access counts to the same memory regions, and thus get us closer to the theoretical maximum (= 200 MB of unique memory per frame, assuming we still use the current highest end desktop APU unified memory systems as our "system of choice"). I have already stated in many threads how much I like the EDRAM in Xbox 360, so I won't do that again :)
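The 200 MB figure is just usable bandwidth divided by frame rate (a rough sketch; the 50% utilization for mixed CPU+GPU traffic is an assumption, not a measurement):

```cpp
#include <cstdio>

// Unique memory per frame = usable bandwidth / frame rate, assuming every
// byte is touched exactly once (no overdraw, no repeated fetches).
int main()
{
    const double theoreticalGBs = 25.6; // dual channel DDR3-1600 (desktop APU above)
    const double utilization    = 0.5;  // assumed realistic mixed CPU+GPU utilization
    const double fps            = 60.0;

    double gbPerFrame = theoreticalGBs * utilization / fps; // ~0.21 GB
    std::printf("~%.0f MB of unique data per frame\n", gbPerFrame * 1000.0);
    return 0;
}
```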
The available memory bandwidth per frame is only interesting as it applies to total memory if you can predict with some degree of certainty which 200MB or so you're going to touch, and you can actually get it off disk before you will need it.
Of course. Without exact knowledge of your access patterns, excellent prediction algorithms and good fallback plans (stalling doesn't count :) ) you need to keep considerable extra overhead data in memory (just in case).
So while, above some threshold, more memory doesn't help you with higher res textures, that doesn't make it useless.
Extra memory is of course always a good thing to have. It allows you to keep some (hard to predict) data components permanently in memory. And it saves you development time as well. That's not an insignificant gain. More is always better, unless it means we have to compromise somewhere else. Aaronspink stated he would prefer to have 2 GB of extra memory instead of a 3-4x faster GPU, and that's something I cannot agree with (especially if that GPU is 3-4x slower because of bandwidth limitations that, on the other hand, also limit the usability of the extra 2 GB I would get in the trade).
What most would refer to as procedural content: at some level you can consider parametric content to be data compression with extreme compression ratios.
Really the memory is still just a cache, but it's a cache for computation rather than disk reads. It's one area I'd be seriously looking at going forwards.
Parametric content (artist controlled) will be very important in the future. However I also see it as a way to reduce memory accesses. Why would you store the results in memory if you can recalculate them every time into the L1 cache instead and waste no bandwidth at all? ALU is basically free (compared to memory accesses), and it will become even more free in the future (while memory accesses will remain expensive).

However, if the parametric generation consumes more bandwidth than accessing the generated data, then I am a huge supporter of caching it. For example in our virtual texturing system, the terrain texture is generated (blended with a complex formula) from a huge number of artist-placed decals. In the worst case areas there are almost 10 layers of decals on top of each other, but we burn that data once to the virtual texture cache, and during terrain rendering a single texture lookup is enough (the generated data gets repeatedly reused 60 times per second, just like data loaded from the HDD).
As an aside, the Sony paper is interesting but hasn't aged well; you can still kill yourself with virtual function calls.
That's not the main point of the paper. Yes it's nice that you can evade some branches and virtual calls, but the main point (and main performance gain) is the improved memory access pattern. Component model is a good approach, and many developers are using it in their newest engines.
 
That's not the main point of the paper. Yes it's nice that you can evade some branches and virtual calls, but the main point (and main performance gain) is the improved memory access pattern. Component model is a good approach, and many developers are using it in their newest engines.

That's certainly a valid point, but IME, outside of a few small blocks of code that transform streams of data, modern CPUs rarely suffer L2 cache misses, and almost never miss the ICache.
Most of the win in the streaming case is not polluting the cache with data you will never read.

As an aside, one of the things that irritates me about new college grads is the lack of understanding of basic memory architecture and behavior. None of this stuff is rocket science.
 
Wasn't the Jon Olick demo of the ID Tech6 stuff over 1 gig with just a single model on screen?
 
Please don't tell me you think a 70-100 GB/s unified memory architecture is considered "slow" by today's standards.

It is only reasonably aggressive but it is where we will be within a year with DDR4.

Not even Intel's highest end 12-thread Sandy Bridge E and the fully enabled 16-thread Xeon server CPU versions are equipped with a memory system that fast. Quad channel DDR3-1600 is the fastest officially supported, and it provides 51 GB/s of theoretical bandwidth (37 GB/s in benchmarks, not far from AMD's utilization percentages: http://www.anandtech.com/show/5091/...-bridge-e-review-keeping-the-high-end-alive/4). These chips cost $1000+, and the motherboards supporting quad channel memory aren't cheap either.

Xeons play in a very different envelope of the design space than something like a gaming focused console. They make a lot of trade-offs to support larger memory capacities, large being defined as approaching the TB range, which requires multiple DIMMs per channel and advanced ECC capabilities. The advanced ECC capabilities in turn push the memory interfaces into wide 128-bit channels and burst-chop mode, lowering efficiency.

Also it is important to recognize that there can be little correlation between cost and price.


Relatively large manual high speed "caches" such as the Xbox 360 EDRAM are very good for reducing redundant bandwidth usage (especially for GPU rendering). EDRAM removes all the memory bandwidth waste you get from blending, overdraw, MSAA and z-buffering. Basically you get all of these for free. The bandwidth-free overdraw of course helps with shadow maps as well, but since the Xbox 360 cannot sample from EDRAM, you have to eventually copy the shadow map to main memory (consuming memory bandwidth) and sample it from there (consuming memory bandwidth just like any static texture). The same is true for g-buffer rendering and sampling (it must eventually be copied to main memory and sampled from there, consuming memory bandwidth).

A larger edram without integrated ROPs would allow sampling from edram.
 
The embedded GPUs really aren't aimed at gamers. They are primarily designed for cost/packaging/board/thermal reasons. If you want to game, you will always be better off with a discrete GPU until such a time as moderate sized on-package memories become viable/standard. Though we are getting closer: a single wide I/O DRAM will be able to provide in the range of 100-200 GB/s of bandwidth and between 512-1024 MB of capacity. Combined with a main memory in the range of 50 GB/s, the integrated GPUs will finally be able to stretch their legs. Realistically, that is all about 3-5 years out for mainstream computers at the front edge. Lots of other markets, though, would prefer if PCs got there sooner so they could leverage off of them.

Is that a considered estimate or a generic "3-5 years away" statement that people often use when they don't know how long it is going to take?
 
Is that a considered estimate or a generic "3-5 years away" statement that people often use when they don't know how long it is going to take?

That's my realistic estimate of when things like wide I/O on package and stacked memory will start to hit the mainstream.
 
That's my realistic estimate of when things like wide I/O on package and stacked memory will start to hit the mainstream.

Do you consider interposer based solutions to be inadequate/immature/limited-by-something/too-narrow/too-unstacked or are you defining mainstream, as say >50% marketshare of all PCs?
 
Do you consider interposer based solutions to be inadequate/immature/limited-by-something/too-narrow/too-unstacked or are you defining mainstream, as say >50% marketshare of all PCs?

My definition of mainstream is on the order of 100-300M parts per year.
 
As an aside, one of the things that irritates me about new college grads is the lack of understanding of basic memory architecture and behavior. None of this stuff is rocket science.

They are probably too busy learning some massively verbose programming language, and don't have time for computer architecture or algorithms... :(
 
A larger edram without integrated ROPs would allow sampling from edram.
That's true, but none of the current PC or console GPUs work like that.

GPU-accessible read/write EDRAM would practically nullify all the bandwidth costs of deferred g-buffer generation/sampling and post process rendering (and other full screen effects that are consumed later in the pipeline). And it would be great for GPU compute. However it wouldn't nullify the bandwidth cost of shadow maps, unless you had a huge amount of EDRAM. A single 4096x4096 shadow map atlas takes 64 MB, and even that isn't enough if you want to have above-720p rendering with good shadow map quality.
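(The 64 MB is simply width x height x bytes per texel; a one-line check, assuming a 32-bit depth format:)

```cpp
#include <cstddef>

// 4096 x 4096 texels * 4 bytes (32-bit depth) = 64 MiB for the shadow map atlas.
constexpr std::size_t kShadowAtlasBytes = 4096ull * 4096ull * 4ull;
static_assert(kShadowAtlasBytes == 64ull * 1024ull * 1024ull,
              "a 4K x 4K 32-bit atlas is exactly 64 MiB");
```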

---

Last weekend I bumped into an article about Sequoia. It's the new #1 supercomputer on the TOP500 list. It doubles the performance of the previous champ, and consumes almost 40% less power. The most interesting thing is that it uses EDRAM to reach high memory bandwidth, and the PowerPC A2 CPU is basically a spiritual successor to Xenon. It has in-order execution, powerful vector units, lots of cores (Xenon had the highest core and thread count when it was released) and SMT/hyperthreading (four way this time).

16 cores, 4 threads per core = 64 threads per CPU. Each CPU has a dual channel DDR3-1333 memory bus and 32 MB of EDRAM. This is an interesting design if we analyze its memory performance. The large chunk of EDRAM gives it very fast local work memory. Compared to the Cell SPU local stores (256 KB) the EDRAM is 128x larger. That's a huge deal, and allows you to run a much wider selection of algorithms inside the fast local work memory. The main memory bus isn't wide, but the four-way SMT provides the chip with good memory latency hiding capacity. The low 1.6 GHz CPU clock also means that memory latency (in cycles) remains low. Put 1.6 million of these processing cores in the same room, and you get a nice chunk of processing power (and a nice amount of combined EDRAM bandwidth) :)
 
sebbbi said:
Last weekend I bumped into an article about Sequoia. It's the new #1 supercomputer on the TOP500 list. It doubles the performance of the previous champ, and consumes almost 40% less power. The most interesting thing is that it uses EDRAM to reach high memory bandwidth, and the PowerPC A2 CPU is basically a spiritual successor to Xenon. It has in-order execution, powerful vector units, lots of cores (Xenon had the highest core and thread count when it was released) and SMT/hyperthreading (four way this time).

16 cores, 4 threads per core = 64 threads per CPU. Each CPU has a dual channel DDR3-1333 memory bus and 32 MB of EDRAM. This is an interesting design if we analyze its memory performance. The large chunk of EDRAM gives it very fast local work memory. Compared to the Cell SPU local stores (256 KB) the EDRAM is 128x larger. That's a huge deal, and allows you to run a much wider selection of algorithms inside the fast local work memory. The main memory bus isn't wide, but the four-way SMT provides the chip with good memory latency hiding capacity. The low 1.6 GHz CPU clock also means that memory latency (in cycles) remains low. Put 1.6 million of these processing cores in the same room, and you get a nice chunk of processing power (and a nice amount of combined EDRAM bandwidth) :)

Now I just hope that they let me use this pretty lady for my computations ;-)
Damn, one can dream... right?
Fortunately Jülich gets a similar machine... not that big, but still nice :)

Gimme gimme gimme!!!!!
 
They are probably too busy learning some massively verbose programming language, and don't have time for computer architecture or algorithms... :(

Being a recent college grad myself, I'd say it's more that they teach you to code by reinventing the wheel, with very little large project coding. In most of my coding classes the projects could be done in a day or two. Only 2-3 classes really had any large scale coding that took tons of time and planning, and even then they weren't performance driven but output-correctness driven.
 
Being a recent college grad myself, I'd say it's more that they teach you to code by reinventing the wheel, with very little large project coding. In most of my coding classes the projects could be done in a day or two. Only 2-3 classes really had any large scale coding that took tons of time and planning, and even then they weren't performance driven but output-correctness driven.

To be fair, it makes sense to teach how to do it right before teaching how to do it fast.
(But I'd expect at least a couple of classes about performance/bottlenecks on modern computers.)
We are off-topic ^^
 
To be fair, it makes sense to teach how to do it right before teaching how to do it fast.
(But I'd expect at least a couple of classes about performance/bottlenecks on modern computers.)
We are off-topic ^^

Only slightly. To bring it back around: which offers the better optimization/performance gains, and which is easier (especially for young programmers) to optimize for, more memory or more bandwidth?
 
Being a recent college grad myself, I'd say it's more that they teach you to code by reinventing the wheel, with very little large project coding. In most of my coding classes the projects could be done in a day or two. Only 2-3 classes really had any large scale coding that took tons of time and planning, and even then they weren't performance driven but output-correctness driven.

I can second this; hell, my C++ classes were a complete drama IMO.
It was like: here, you guys know C# and Java, here is how you make an array in C++, here are some assignments, go make them and totally don't make use of pointers. :rolleyes:
It was like getting programming 101 all over again.
 
That's certainly a valid point, but IME, outside of a few small blocks of code that transform streams of data, modern CPUs rarely suffer L2 cache misses, and almost never miss the ICache.
Most of the win in the streaming case is not polluting the cache with data you will never read.
(Finally got time to write an answer for this one)

Instruction cache is not a concern, I can agree with that one, but modern CPUs miss L2 very frequently.

As I said earlier in my post, a 4 MB L2 gets fully evicted around 50 times every frame (if no memory bandwidth is wasted). You can't count on having data in L2 for a long time. If you access the same cache line at the start of the frame, in the middle of the frame and at the end of the frame, you will pay for three memory fetches (and likely also three L2 misses if your structures are not cache optimized).
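Spelling that out with the per-frame numbers from earlier in the thread (a rough sketch, not a measurement):

```cpp
// ~200 MB of data streaming through the chip per frame / 4 MB of L2
// means the whole cache turns over roughly 50 times every frame.
constexpr double kUniqueMBPerFrame    = 200.0; // from the bandwidth math above
constexpr double kL2SizeMB            = 4.0;
constexpr double kL2TurnoversPerFrame = kUniqueMBPerFrame / kL2SizeMB; // = 50
```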

Some time ago we did extensive benchmarking of different styles of data structures and access patterns (on different processors ranging from mobile chips to Sandy Bridge). Even Sandy Bridge benefits hugely from predictable access patterns. A pointer list, for example, is up to 4x slower in our benchmarks compared to a cache line aligned bucketed list (on modern PC CPUs). The bucketed list generates more instructions (both more ALU and more memory instructions), but the predictable access pattern (combined with manual cache prefetching) makes it considerably faster. All the ALU instructions get masked out (by L2 stalls) and are thus practically free.
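A minimal sketch of the two layouts (illustrative only; the names and bucket size are made up, not our actual containers):

```cpp
#include <cstdint>
#include <vector>

// Pointer-chasing layout: every step is a dependent load, so the prefetcher
// cannot run ahead and nearly every node is a potential L2/memory miss.
struct ListNode
{
    std::uint32_t value;
    ListNode*     next;
};

std::uint64_t sumList(const ListNode* node)
{
    std::uint64_t sum = 0;
    for (; node; node = node->next) // latency bound: one dependent fetch per node
        sum += node->value;
    return sum;
}

// Cache line aligned bucketed layout: values are packed into contiguous
// 64-byte buckets, giving a predictable linear stride that the hardware
// prefetcher (or an explicit prefetch) can hide.
struct alignas(64) Bucket
{
    static constexpr std::uint32_t kCapacity = 15; // 15 * 4 B values + 4 B count = 64 B
    std::uint32_t count;
    std::uint32_t values[kCapacity];
};

std::uint64_t sumBuckets(const std::vector<Bucket>& buckets)
{
    std::uint64_t sum = 0;
    for (const Bucket& b : buckets)                // bandwidth bound, predictable stride
        for (std::uint32_t i = 0; i < b.count; ++i)
            sum += b.values[i];
    return sum;
}
```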

Similar performance can be seen if you compare a (balanced) tree based search structure with (open address) hashing. The search operation has similar instruction counts for both structures (with moderately large data sets), but hashing is usually 5x-10x faster, because it (often) needs just a single memory access (and a single L2 miss). Trees have very slow pointer->pointer->...->pointer style traversal (lots of L2 cache misses).
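To illustrate the difference in access counts, here is a toy open addressing lookup (a sketch, not our actual hash table; it assumes a power-of-two table that is never completely full):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot
{
    std::uint64_t key;
    std::uint32_t value;
    bool          used;
};

// Open addressing with linear probing: the hash jumps directly to the home
// slot (usually the only cache miss), and any extra probes walk linearly
// through memory. A balanced tree instead does log2(N) dependent pointer
// loads, each one a potential L2 miss.
const Slot* find(const std::vector<Slot>& table, std::uint64_t key)
{
    const std::size_t mask = table.size() - 1; // table size is a power of two
    for (std::size_t i = (key * 0x9E3779B97F4A7C15ull) & mask;
         table[i].used;
         i = (i + 1) & mask)
    {
        if (table[i].key == key)
            return &table[i];
    }
    return nullptr; // hit an empty slot: the key is not in the table
}
```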

The most important thing (where performance is concerned) about a component based architecture (in comparison to inheritance based large objects) is that it slices objects into smaller pieces so that memory accessing becomes more efficient.

Example:

You have an inheritance based object structure, and for example you want an object to have both physics and graphics behavior, and you want it to have a transform (and children hierarchy) as well. Let's say all this data makes the object 250 bytes long (matrices alone take 64 bytes each, so this is a realistic estimate).

Now you want to determine the visibility of all your potentially visible objects. Say you have 25000 potentially visible objects in total and 5000 of them will be visible (20%). Now you iterate through them, and the visibility determination algorithm reads the position and bounding radius from each object. The position is a 3d vector and thus takes 12 bytes (3 x float32). The bounding radius is 4 bytes (a single float32). Sandy Bridge has 64 byte cache lines. Each object spans 4 cache lines, but the code only accesses a single cache line per object, and only 16 bytes of it. A good modern automatic prefetcher detects the correct stride quickly, so it only reads one cache line per object from memory. So 64 bytes per object get read. However the code only utilizes 16 bytes per object (position + radius = 16 bytes), so 75% of the memory bandwidth gets wasted.

If you have a component based architecture, the transform components (containing position and radius = 16 bytes) are stored separately in a linear array. Now a single cache line contains only data you need (four transform components), and 100% of the bandwidth gets utilized. As a nice bonus, this kind of linear batch processing is very well suited for SOA vector processing (8-wide AVX is very efficient at processing it). Around 70-80% of the raw ALU performance of modern CPUs comes from the vector execution units. Performance critical parts of a game engine should be designed to exploit vector execution as much as possible.
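Roughly what the two layouts look like in code (a simplified sketch; the struct contents and the single-plane test are made up for illustration):

```cpp
#include <cstddef>
#include <vector>

// Inheritance-style "fat" object: ~250 bytes, spread over 4 cache lines,
// yet the culling pass only needs 16 of those bytes.
struct FatObject
{
    float         worldMatrix[16];      // 64 B
    float         prevWorldMatrix[16];  // 64 B
    float         position[3];          // 12 B
    float         boundingRadius;       //  4 B
    unsigned char physicsAndGraphicsState[104]; // the rest of the ~250 bytes
};

// Component layout: only the culling-relevant 16 bytes, stored linearly,
// so every 64-byte cache line holds four fully used entries. For AVX you
// would go one step further and split this into separate x/y/z/r arrays (SOA).
struct BoundingSphere
{
    float x, y, z, radius;
};

void cullAgainstPlane(const std::vector<BoundingSphere>& spheres,
                      const float normal[3], float planeD,
                      std::vector<std::size_t>& visibleOut)
{
    for (std::size_t i = 0; i < spheres.size(); ++i) // linear stride, 100% of each line used
    {
        const BoundingSphere& s = spheres[i];
        float dist = s.x * normal[0] + s.y * normal[1] + s.z * normal[2] + planeD;
        if (dist > -s.radius) // single-plane test only, to keep the sketch short
            visibleOut.push_back(i);
    }
}
```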

A link to how Battlefield 3 does its vectorized culling (data driven / component based architecture):
http://publications.dice.se/attachments/CullingTheBattlefield.pdf

They even use 16 bit floats (halfs) to optimize the memory accesses (even if that means some extra ALU usage for decompression). Xbox 360 has vector instructions for float16 <-> float32 conversion, and so do Ivy Bridge, Bulldozer and Piledriver (and soon also Haswell). Packing data as small as possible is now more important than ever.
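On the PC side that conversion maps to the F16C instructions; a minimal sketch of packing and unpacking a 16-byte bounding sphere into 8 bytes (assumes an F16C-capable CPU and building with -mf16c or /arch:AVX2):

```cpp
#include <immintrin.h>
#include <cstdint>

// Pack 4 floats (e.g. x, y, z, radius) into 4 halfs: 16 bytes -> 8 bytes,
// halving the memory traffic at the cost of a little extra ALU work.
inline void packHalf4(const float* in, std::uint16_t* out)
{
    __m128  f = _mm_loadu_ps(in);
    __m128i h = _mm_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT); // VCVTPS2PH
    _mm_storel_epi64(reinterpret_cast<__m128i*>(out), h);   // store the low 64 bits
}

// Unpack back to full floats right before doing the math.
inline void unpackHalf4(const std::uint16_t* in, float* out)
{
    __m128i h = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_ps(out, _mm_cvtph_ps(h));                     // VCVTPH2PS
}
```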
 
Wasn't the Jon Olick demo of the ID Tech6 stuff over 1 gig with just a single model on screen?
That's true. However... SVO renderers are highly memory bandwidth bound. GPU based SVO renderers are significantly faster because of GDDR5 and because of the GPU's excellent memory latency hiding (thread slackness).

We need both memory bandwidth and memory amount to make SVO renderers viable. Data compression is one of the main areas of research in voxel rendering. We have lots of extra ALU to burn to decompress the data (as voxel renderers are memory bound).

SVO streaming isn't much harder than virtual texture streaming. It's basically pretty much the same. Viewport changes aren't actually much more critical than they are for virtual texturing. The data gets gradually sharper if the streaming bandwidth hits a cap. Fortunately human brains take a lot of time to process completely new scenes, and details can be kept blurry for tens of frames without any problems (we cannot "see" it). Look around a corner (into an unknown scene), and you will notice that you cannot instantly focus your eyes on small details.
 
(Finally got time to write an answer for this one)

Instruction cache is not a concern, I can agree with that one, but modern CPUs miss L2 very frequently.

Yep, it really comes down to the access pattern. Pointer chains are simply evil. Also, data layout has huge impacts, as you said. You want to organize your data structures based on what will be accessed together as blocks, not what is necessarily easiest for the programmer.
 