The next gen speculation thread started to have an interesting debate about memory bandwidth vs memory amount. I don't personally want to contribute to the next gen speculation, but the "memory bandwidth vs memory amount" topic is very interesting in its own right. So I decided to make a thread for this topic, as I have personally been doing a lot of memory access and bandwidth analysis lately for our console technology, and I have programmed our virtual texturing system (and many other JIT data streaming components).
Historically memory performance has improved linearly (very slowly) compared to the exponential (Moore's law) growth of CPU performance. Relative memory access times (latencies) have grown to be over 400x higher (in clock cycles) compared to the first PC computers, and there are no signs that this development will slow down in the future, unless we invent some radically new ways of storing data. None of the currently known future technologies is going to solve the problem, just provide some band-aid. So we need to adapt.
Some links to background information first:
1. Presentation by Sony R&D. Targeted at game technology programmers. Has a very good real-life example of how improving your memory access pattern can improve your performance by almost 10x. Also has nice charts (slides 17 and 18) showing how memory speed has increased historically compared to ALU:
http://harmful.cat-v.org/software/O...ls_of_Object_Oriented_Programming_GCAP_09.pdf
2. Benchmark results of a brand new x86 chip with a unified memory architecture (CPU & GPU share the same memory & memory controller). The benchmark shows system performance with all available DDR3 speeds from DDR3-800 to DDR3-1866. All other system settings are identical; only the memory bus bandwidth is scaled up/down. We can see an 80% performance (fps) improvement in the gaming benchmark just by increasing the DDR3 memory clock:
http://www.tomshardware.com/reviews/a10-5800k-a8-5600k-a6-5400k,3224-5.html
3. A GPU benchmark comparing the old GeForce GTS 450 (1 GB, GDDR5) card to the brand new Kepler-based GeForce GT 640 (2 GB, DDR3). The new Kepler-based card has twice the memory amount and twice the ALU performance, but only half of the memory bandwidth (because of DDR3). Despite the much faster theoretical shader performance and twice the memory amount, it loses pretty badly in the benchmarks because of its slower memory bus:
http://www.anandtech.com/show/5969/zotac-geforce-gt-640-review-
--- ---
I will use the x86 based Trinity APU [link 2] as my example system, as it has close enough performance and memory bandwidth compared to current generation consoles (it's only around 2x-4x faster overall) and it has unified memory (a single memory bus shared between CPU & GPU). It's much easier to talk about a well known system, with lots of public benchmark results around the net.
Let's assume we are developing a vsync locked 60 fps game, so each frame must complete in 16.6 ms. Let's assume our Trinity system is equipped with the fastest DDR3 it supports (DDR3-1866). According to Tom's Hardware's synthetic bandwidth benchmark, this configuration gives us 14 GB of bandwidth per second. Divide that by 60, and we get 233 MB of bandwidth per frame. Let's round that down to an even 200 MB per frame to simplify our calculations. A real game never utilizes memory bandwidth as well as a synthetic benchmark, so even the 200 MB per frame figure is optimistic.
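To make the per-frame budget concrete, here is a minimal sketch of the arithmetic (using the synthetic 14 GB/s figure from the benchmark above):

```cpp
#include <cstdio>

int main() {
    // Synthetic bandwidth measured by Tom's Hardware for DDR3-1866 on Trinity.
    const double bandwidth_mb_per_sec = 14.0 * 1000.0;  // 14 GB/s expressed in MB/s
    const double target_fps = 60.0;

    // Raw per-frame budget: how much data we can move during one 16.6 ms frame.
    const double mb_per_frame = bandwidth_mb_per_sec / target_fps;
    std::printf("Per-frame bandwidth budget: %.0f MB\n", mb_per_frame);  // ~233 MB

    // A real game never reaches synthetic benchmark efficiency,
    // so we round down to an even 200 MB per frame for the analysis.
    return 0;
}
```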
Now I know that my game should never access more than 200 MB of unique memory per frame if I want to reach my vsync locked 60 fps. If I access more memory, my frame rate dips as the memory subsystem cannot give me enough data, and my CPU & GPU start stalling.
How about CPU & GPU caches? Caches only help with repeated accesses to the same data. Caches do not allow us to access any more unique data per frame. It's also worth noting that if you access the same memory for example at the beginning of your frame, in the middle of your frame and at the end of your frame, you will pay as much as if you did three unique memory accesses. Caches are very small, and old data gets replaced very fast. Our Trinity CPU has 4 MB of L2 cache and we move 200 MB of data through the cache every frame. Our cache gets fully replaced by new data (200/4 =) 50 times every frame. Data only stays in the cache for 0.33 ms. If we access it again after this period, we must fetch it from memory again (wasting our valuable 200 MB per frame bandwidth). It's not uncommon that a real game accesses every piece of data in the current working set (on average) twice per frame, leaving us with 100 MB per frame of unique accessible memory. Examples: Shadow maps are first rendered (to textures in memory) and sampled later during the lighting pass. Physics simulation moves objects (positions & rotations), and later in the frame those same objects are rendered (accessing that same position and rotation data again).
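The cache turnover numbers above come straight from a back-of-the-envelope calculation; here is the same arithmetic as a small sketch:

```cpp
#include <cstdio>

int main() {
    const double l2_cache_mb = 4.0;               // Trinity L2 cache size
    const double traffic_mb_per_frame = 200.0;    // memory moved per frame
    const double frame_time_ms = 1000.0 / 60.0;   // 16.6 ms

    // How many times the whole L2 is overwritten per frame, and how long a
    // piece of data survives in the cache on average before being evicted.
    const double replacements_per_frame = traffic_mb_per_frame / l2_cache_mb;  // 50
    const double residency_ms = frame_time_ms / replacements_per_frame;        // ~0.33 ms

    std::printf("L2 replaced %.0f times per frame, data lives ~%.2f ms\n",
                replacements_per_frame, residency_ms);
    return 0;
}
```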
However let's keep the theoretical 200 MB per frame number, as engines differ, and access patterns differ (and we do not really want to go that far in the analysis). In a real game you can likely access only around 100 MB - 150 MB of unique memory per frame, so the forthcoming analysis is optimistic. A real game would likely access less memory and thus have a smaller working set.
So far we know that the processing and rendering of a single frame never requires more than 200 MB of memory (we can't reach 60 fps otherwise). If your game has a static scene, you will not need more memory than that. However static scenes are not much fun, and thus this scenario is highly unlikely in real games (except for maybe a chess game with a fixed camera). So the billion dollar question becomes, how much does the working set (memory accesses) change from frame to frame in a 60 fps game?
In a computer game, objects and cameras do not really "move" around, they get repositioned every frame. In order for this repositioning to look like smooth movement, we can only change the positions very slightly from frame to frame. This basically means that our working set can only change slightly from frame to frame. According to my analysis (for our game), our working set changes around 1%-2% per frame in the general case, and peaks at around 10%. An especially notable fact is that our virtual texturing system's working set never changes more than 2% per frame (textures are the biggest memory user in most games).
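Measuring this is engine specific, but the core of it is just a set difference between the resources (pages, in our virtual texturing case) touched on consecutive frames. A minimal sketch, with hypothetical page IDs and made-up tracking sets:

```cpp
#include <cstdio>
#include <unordered_set>

using PageId = unsigned int;

// Fraction of this frame's working set that was not touched last frame.
// The sets would be filled by the engine's own residency/feedback tracking.
double WorkingSetDelta(const std::unordered_set<PageId>& previous,
                       const std::unordered_set<PageId>& current) {
    if (current.empty()) return 0.0;
    size_t newPages = 0;
    for (PageId id : current)
        if (previous.find(id) == previous.end()) ++newPages;
    return static_cast<double>(newPages) / static_cast<double>(current.size());
}

int main() {
    std::unordered_set<PageId> lastFrame = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::unordered_set<PageId> thisFrame = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
    std::printf("Working set changed %.0f%%\n",
                100.0 * WorkingSetDelta(lastFrame, thisFrame));  // 10%
    return 0;
}
```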
Let's assume that a game with a similar memory access pattern (a similarly changing working set from frame to frame) is running on our Trinity example platform. Basically this means that in the average case our working set changes by 2 MB to 4 MB per frame, and it peaks at around 20 MB per frame. We can stream this much data from a standard HDD. However HDDs have long latencies and long seek times, so we must stream data in advance and bundle data in slightly bigger chunks than we would like to combat the slow seek time. Both streaming in advance (prefetching) and loading in bigger chunks (loading a slightly wider working set) require extra memory. The question becomes: how much larger does the memory cache need to be than our working set?
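One of the simplest ways to load in bigger chunks is to align every disk request to a fixed chunk size, so one seek pulls in a slightly wider slice of the working set. A sketch under assumed numbers (the 256 KB chunk size is illustrative, not our actual value):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative streaming chunk size; bigger chunks mean fewer seeks but
// more extra memory, since we always load a slightly wider working set.
constexpr uint64_t kChunkSize = 256 * 1024;

struct ByteRange { uint64_t begin, end; };

// Round a requested byte range outwards to whole chunks.
ByteRange ChunkAlign(uint64_t offset, uint64_t size) {
    ByteRange r;
    r.begin = (offset / kChunkSize) * kChunkSize;
    r.end   = ((offset + size + kChunkSize - 1) / kChunkSize) * kChunkSize;
    return r;
}

int main() {
    ByteRange r = ChunkAlign(1300000, 40000);  // ask for 40 KB somewhere in a file
    std::printf("Actually read [%llu, %llu) = %llu KB\n",
                (unsigned long long)r.begin, (unsigned long long)r.end,
                (unsigned long long)((r.end - r.begin) / 1024));
    return 0;
}
```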
The working set is 200 MB (if we want to reach that 60 fps in the imaginary game on our Trinity platform). How much more memory do we need for the cache? Is 2.5x the working set enough (512 MB)? How about 5x (1 GB) or 10x (2 GB)?
Our virtual texture system has a static 1024-page cache (128x128 pixel pages, two DXT5-compressed layers per page). Our average working set per frame is around 200-400 pages, and it peaks as high as 600 pages. The cache is so small that it has to reload all textures if you spin the camera around 360 degrees, but this doesn't matter, as the HDD streaming speed is enough to push new data in at a steady pace. You never see any texture popping when rotating or moving the camera. The only occasion where you see texture popping is when the camera suddenly teleports to a completely different location (the working set changes almost completely). In our game this only happens if you restart to a checkpoint or restart the level completely, so it's not a big deal (and we can predict it).
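The replacement policy itself is nothing special; a fixed pool of pages with least-recently-used eviction captures the idea. Here is a minimal sketch (the request API is hypothetical, not our actual virtual texturing code):

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

using PageId = uint64_t;

// Fixed-capacity LRU cache over 128x128 texture pages. A miss evicts the
// least recently used slot and (in a real system) kicks off a HDD read.
class PageCache {
public:
    explicit PageCache(size_t capacity) : capacity_(capacity) {}

    // Returns true on a hit; on a miss the page is inserted, evicting the
    // least recently used page when the pool is full.
    bool Request(PageId id) {
        auto it = map_.find(id);
        if (it != map_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // move to front
            return true;
        }
        if (lru_.size() == capacity_) {
            map_.erase(lru_.back());   // evict least recently used page
            lru_.pop_back();
        }
        lru_.push_front(id);
        map_[id] = lru_.begin();
        return false;                  // miss: stream this page from disk
    }

private:
    size_t capacity_;
    std::list<PageId> lru_;
    std::unordered_map<PageId, std::list<PageId>::iterator> map_;
};

int main() {
    PageCache cache(1024);             // static 1024-page pool
    cache.Request(42);                 // first touch: miss, schedules a load
    cache.Request(42);                 // later touches while resident: hit
    return 0;
}
```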
If the game behaves similarly to our existing console game, we need a cache size of around 3x the working set for texture data. A big percentage of the memory accessed (or written) per frame goes to textures. If we assume for a moment that all other memory accesses are as stable as texture accesses (a cache multiplier of 3x), we only need 600 MB of memory for a fully working game. For some memory bandwidth hungry parts of the game this actually is true. And things are even better for some parts: shadow maps, post processing buffers, the back buffer, etc. are fully regenerated every frame, so we need no extra memory storage to hold caches of these (the cache multiplier is 1x).
Game logic streaming is a harder thing to analyze and generalize. For example our console game has a large free roaming outdoor world. It's nowhere near as big as the world in Skyrim, for example, but the key point here is that we only keep a slice of the world in memory at once, so the world size could theoretically be limitless (with no extra memory cost). Our view distance is 2 kilometers, so we do not need to keep a full representation of the game world in memory beyond that. The data quality required at a given distance follows a pretty much logarithmic scale (texture mip mapping, object geometry quality, heightfield quality, vegetation map quality, etc.). The amount of data required shrinks dramatically as distance grows. This is of course only true for easy cases such as graphics processing, heightfields, etc. Game logic doesn't automatically scale, so you must scale it manually to reach that 200 MB per frame memory access limit. Your game would slow down to a halt if you simply tried to read the full AI data of every single NPC in the large scale world, no matter how simple your processing was.
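To see why distance costs so little memory: one lower detail level per doubling of distance means the level grows with the log of the distance, not with the distance itself. A sketch with made-up constants:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Pick a detail level from distance: one level lower per doubling of distance.
// fullDetailRange and maxLevel are illustrative constants, not engine values.
int DetailLevel(float distanceMeters) {
    const float fullDetailRange = 16.0f;   // everything closer gets level 0
    const int   maxLevel = 10;
    if (distanceMeters <= fullDetailRange) return 0;
    int level = static_cast<int>(std::log2(distanceMeters / fullDetailRange));
    return std::min(level, maxLevel);
}

int main() {
    for (float d : {10.0f, 50.0f, 200.0f, 1000.0f, 2000.0f})
        std::printf("%6.0f m -> detail level %d\n", d, DetailLevel(d));
    return 0;
}
```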
Our heightmap cache (used in physics, raycasts and terrain visualization) keeps around 4x the working set. We do physics simulation (and exact collision) only for things near the player (100 meters max). When an object enters this area, we add the corresponding physics objects to our physics world. It's hard to estimate exactly how big a percentage of our physics world structures is accessed per frame, but I would estimate around 10%. So we basically have a 10x working set "cache" for physics.
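The activation logic itself is simple; a sketch of the idea (the 100 meter radius is from above, everything else here is hypothetical):

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

struct GameObject {
    Vec3 position;
    bool hasPhysicsProxy = false;   // is this object currently in the physics world?
};

// Add/remove physics proxies so only objects near the player are simulated.
void UpdatePhysicsActivation(std::vector<GameObject>& objects, const Vec3& player) {
    const float radius = 100.0f;                 // exact simulation only within 100 m
    const float radiusSq = radius * radius;
    for (GameObject& obj : objects) {
        const float dx = obj.position.x - player.x;
        const float dy = obj.position.y - player.y;
        const float dz = obj.position.z - player.z;
        const bool inside = (dx * dx + dy * dy + dz * dz) <= radiusSq;
        if (inside && !obj.hasPhysicsProxy) {
            obj.hasPhysicsProxy = true;          // would add a body to the physics world here
        } else if (!inside && obj.hasPhysicsProxy) {
            obj.hasPhysicsProxy = false;         // would remove the body from the physics world
        }
    }
}

int main() {
    std::vector<GameObject> objects = {{{10, 0, 5}}, {{500, 0, 0}}};
    UpdatePhysicsActivation(objects, {0, 0, 0});
    return 0;
}
```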
Basically no component in our game required more than 10x memory compared to its working set. The average requirement was around 3x. So theoretically a game with similar memory access patterns would only need 600 MB of memory on our example Trinity platform. And this includes as much texture resolution as you could ever want (virtual texturing works that way). And it includes as much other (physics, game logic, etc.) data as you can process per frame (given the limited bandwidth). Of course another game might need, for example, an average of 10x the working set for caches, but that's still only 2 GB. Assuming the game is properly optimized (predictable memory accesses are a must-have for good performance) and utilizes JIT streaming well, it will not benefit much if we add more main memory to our Trinity platform beyond that 2 GB.
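Putting the rough numbers together (200 MB per-frame working set, 3x average and 10x worst-case cache multipliers):

```cpp
#include <cstdio>

int main() {
    const double workingSetMB = 200.0;     // per-frame accessible memory at 60 fps

    // Rough cache multipliers from the discussion above.
    const double avgMultiplier   = 3.0;    // typical component (e.g. virtual texture cache)
    const double worstMultiplier = 10.0;   // worst observed component (physics structures)

    std::printf("Average case : %.0f MB\n", workingSetMB * avgMultiplier);    // 600 MB
    std::printf("Pessimistic  : %.0f MB (~2 GB)\n", workingSetMB * worstMultiplier);
    return 0;
}
```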
More memory of course makes developers' lives easier. Predicting data access patterns can be very hard for some styles of games and structures. But mindlessly increasing the cache sizes much beyond the working set sizes doesn't help either (as we all know, increasing cache size beyond the working set size gives on average only a logarithmic improvement in cache hit rate = diminishing returns very quickly).
My conclusion: The usable memory amount is very much tied to the available memory bandwidth. More bandwidth allows games to access more memory. So it's kind of counterproductive to swap a smaller, faster memory for a larger, slower one. More available memory means that I want to access more memory, but in reality the slower bandwidth allows me to access less. So the percentage of accessible memory drops radically.