Memory bandwidth vs memory amount *spin off*

Discussion in 'Console Technology' started by sebbbi, Jul 1, 2012.

  1. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,044
    The next gen speculation thread started to have an interesting debate about memory bandwidth vs memory amount. I don't personally want to contribute to the next gen speculation, but the "memory bandwidth vs memory amount" topic is very interesting in its own right. So I decided to make a thread for this topic, as I have personally been doing a lot of memory access and bandwidth analysis lately for our console technology, and I have programmed our virtual texturing system (and many other JIT data streaming components).

    Historically memory performance has improved linearly (very slowly) compared to the exponential (Moore's law) growth of CPU performance. Relative memory access times (latencies) have grown to be over 400x higher (in clock cycles) compared to the first PC computers, and there are no signs that this development will slow down in the future, unless we invent some radically new ways of storing data. None of the currently known future technologies is going to solve the problem, just provide some band-aid. So we need to adapt.

    Some links to background information first:

    1. Presentation by Sony R&D. Targeted for game technology programmers. Has a very good real life example how improving your memory access pattern can improve your performance by almost 10x. Also has nice charts (slides 17 and 18) showing how memory speed has increased historically compared to ALU:
    http://harmful.cat-v.org/software/O...ls_of_Object_Oriented_Programming_GCAP_09.pdf

    2. Benchmark results of a brand new x86 chip with unified memory architecture (CPU & GPU share the same memory & memory controller). Benchmark shows system performance with all available DDR3 speeds from DDR3-800 to DDR3-1866. All other system settings are identical, only memory bus bandwidth is scaled up/down. We can see an 80% performance (fps) improvement in the gaming benchmark just by increasing the DDR3 memory clock:
    http://www.tomshardware.com/reviews/a10-5800k-a8-5600k-a6-5400k,3224-5.html

    3. A GPU benchmark comparing the old Geforce GTS 450 (1 GB, GDDR5) card to a brand new Kepler based Geforce GT 640 (2 GB, DDR3). The new Kepler based card has twice the memory amount and twice the ALU performance, but only half of the memory bandwidth (because of DDR3). Despite the much faster theoretical shader performance and twice the memory amount, it loses pretty badly in the benchmarks because of its slower memory bus:
    http://www.anandtech.com/show/5969/zotac-geforce-gt-640-review-

    I completely disagree with this. And I will now try to explain why. As a professional, you of course know most of the background facts, but I need to explain them first, so that my later remarks aren't left standing without a factual base.

    --- ---

    I will use the x86 based Trinity APU [link 2] as my example system, as it has close enough performance and memory bandwidth compared to current generation consoles (it's only around 2x-4x faster overall) and it has unified memory (single memory bus shared between CPU & GPU). It's much easier to talk about a well known system, with lots of public benchmarks results around the net.

    Let's assume we are developing a vsync locked 60 fps game, so each frame must complete in 16.6 ms. Let's assume our Trinity system is equipped with the fastest DDR3 it supports (DDR3-1866). According to the Tom's Hardware synthetic bandwidth benchmark, this configuration gives us 14 GB/s of bandwidth. Divide that by 60, and we get 233 MB of bandwidth per frame. Let's round that down to an even 200 MB per frame to ease our calculations. A real game never utilizes memory bandwidth as well as a synthetic benchmark, so even the 200 MB per frame figure is optimistic.
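
    The per-frame figure above is plain division. A minimal sketch of that arithmetic, using the 14 GB/s and 60 fps numbers from the post; the function name and the 1 GB = 1000 MB convention are assumptions made for illustration:

```python
def per_frame_budget_mb(bandwidth_gb_per_s: float, fps: float) -> float:
    """Memory traffic available in one frame, in MB (assumes 1 GB = 1000 MB)."""
    return bandwidth_gb_per_s * 1000.0 / fps


if __name__ == "__main__":
    # DDR3-1866 Trinity figure quoted from the synthetic benchmark.
    print(f"{per_frame_budget_mb(14.0, 60.0):.0f} MB per frame at 60 fps")
    # The same bus at 30 fps has twice the per-frame budget.
    print(f"{per_frame_budget_mb(14.0, 30.0):.0f} MB per frame at 30 fps")
```

    The budget scales inversely with frame rate, which is why a 30 fps target doubles the amount of memory that can be touched per frame on the same bus.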

    Now I know that my game should never access more than 200 MB of unique memory per frame if I want to reach my vsync locked 60 fps. If I access more memory, my frame rate dips as the memory subsystem cannot give me enough data, and my CPU & GPU start stalling.

    How about CPU & GPU caches? Caches only help with repeated access to the same data. Caches do not allow us to access any more unique data per frame. Also it's worth noticing that if you access the same memory for example at the beginning of your frame, in the middle of your frame and at the end of your frame, you will pay as much as if you did three unique memory accesses. Caches are very small, and old data gets replaced very fast. Our Trinity CPU has 4 MB of L2 cache and we move 200 MB of data to the cache every frame. Our cache gets fully replaced by new data (200/4 =) 50 times every frame. Data only stays in cache for 0.33 ms. If we access it again after this period, we must fetch it from memory again (wasting our valuable 200 MB per frame bandwidth). It's not uncommon that a real game accesses every piece of data in the current working set (on average) twice per frame, leaving us with 100 MB per frame of unique accessible memory. Examples: Shadowmaps are first rendered (to textures in memory) and sampled later during the lighting pass. Physics simulation moves objects (positions & rotations) and later in the frame those same objects are rendered (accessing the same position and rotation data again).
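
    The cache turnover numbers above can be reproduced with a short calculation. This is only a sketch: it assumes all traffic streams through the cache once and evicts the oldest data, and the function name is made up for illustration:

```python
def cache_turnover(traffic_mb_per_frame: float, cache_mb: float, frame_ms: float):
    """Return (full cache replacements per frame, average residency in ms).

    Assumes every byte of traffic passes through the cache once and pushes
    out the oldest data, which is a simplification of real cache behaviour.
    """
    refills_per_frame = traffic_mb_per_frame / cache_mb
    residency_ms = frame_ms / refills_per_frame
    return refills_per_frame, residency_ms


if __name__ == "__main__":
    refills, residency = cache_turnover(200.0, 4.0, 16.6)
    print(f"cache replaced {refills:.0f} times per frame")
    print(f"data survives about {residency:.2f} ms")
```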

    However let's keep the theoretical 200 MB per frame number, as engines differ, and access patterns differ (and we do not really want to go that far in the analysis). In a real game you can likely access only around 100 MB - 150 MB of unique memory per frame, so the forthcoming analysis is optimistic. A real game could likely access less memory and thus have a smaller working set.

    So far we know that the processing and rendering of a single frame never requires more than 200 MB of memory (we can't reach 60 fps otherwise). If your game has a static scene, you will not need more memory than that. However static scenes are not much fun, and thus this scenario is highly unlikely in real games (except for maybe a chess game with a fixed camera). So the billion dollar question becomes, how much does the working set (memory accesses) change from frame to frame in a 60 fps game?

    In a computer game, objects and cameras do not really "move" around, they get repositioned every frame. In order for this repositioning to look like smooth movement we can only change the positions very slightly from frame to frame. This basically means that our working set can only change slightly from frame to frame. According to my analysis (for our game), our working set changes around 1%-2% per frame in the general case, and peaks at around 10%. An especially notable fact is that our virtual texturing system's working set never changes more than 2% per frame (textures are the biggest memory user in most games).

    We assume that a game with a similar memory access pattern (similarly changing working set from frame to frame) is running on our Trinity example platform. Basically this means that in the average case our working set changes by 2 MB to 4 MB per frame, and it peaks at around 20 MB per frame. We can stream this much data from a standard HDD. However HDDs have long latencies and long seek times, so we must stream data in advance and bundle data in slightly bigger chunks than we would like, to combat the slow seek time. Both streaming in advance (prefetching) and loading in bigger chunks (loading a slightly wider working set) require extra memory. The question becomes: how much larger than our working set does the memory cache need to be?

    The working set is 200 MB (if we want to reach that 60 fps on the imaginary game on our Trinity platform). How much more memory do we need for the cache? Is working set x2.5 enough (512 MB)? How about 5x (1 GB) or 10x (2 GB)?

    Our virtual texture system has a static 1024 page cache (128x128 pixel pages, 2 DXT5 compressed layers per page). Our average working set per frame is around 200-400 pages, and it peaks as high as 600 pages. The cache is so small that it has to reload all textures if you spin the camera around 360 degrees, but this doesn't matter, as the HDD streaming speed is enough to push new data in at a steady pace. You never see any texture popping when rotating or moving the camera. The only occasion where you see texture popping is when the camera suddenly teleports to a completely different location (working set changes almost completely). In our game this only happens if you restart to a checkpoint or restart the level completely, so it's not a big deal (and we can predict it).
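
    To make the page cache concrete, here is a small sketch of a fixed-size texture page cache. The post does not say which replacement policy the real system uses, so least-recently-used (LRU) eviction is an assumption, as are all class and function names. The footprint estimate uses the standard DXT5 rate of 16 bytes per 4x4 pixel block and ignores page borders, mip data and indirection tables:

```python
from collections import OrderedDict

PAGE_SIZE_PX = 128         # 128x128 pixel pages (from the post)
LAYERS_PER_PAGE = 2        # two DXT5 layers per page (from the post)
CACHE_PAGES = 1024         # static cache size (from the post)
DXT5_BYTES_PER_BLOCK = 16  # DXT5 stores each 4x4 pixel block in 16 bytes


def page_bytes() -> int:
    """Compressed size of one page across all of its layers."""
    blocks_per_side = PAGE_SIZE_PX // 4
    return blocks_per_side * blocks_per_side * DXT5_BYTES_PER_BLOCK * LAYERS_PER_PAGE


class PageCache:
    """Fixed-capacity page cache with LRU eviction (policy is an assumption)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> payload, least recent first
        self.loads = 0              # number of pages streamed in so far

    def touch(self, page_id) -> None:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # hit: mark as most recently used
            return
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        self.pages[page_id] = None           # placeholder for streamed texels
        self.loads += 1

    def render_frame(self, visible_page_ids) -> int:
        """Touch every visible page and return how many had to be streamed in."""
        before = self.loads
        for page_id in visible_page_ids:
            self.touch(page_id)
        return self.loads - before


if __name__ == "__main__":
    total_mib = page_bytes() * CACHE_PAGES / (1024 * 1024)
    print(f"page: {page_bytes()} bytes, whole cache: {total_mib:.0f} MiB")

    cache = PageCache(CACHE_PAGES)
    print("first frame loads:", cache.render_frame(range(0, 400)))
    # Next frame: the visible set shifts by 8 pages (2% of 400).
    print("next frame loads:", cache.render_frame(range(8, 408)))
```

    Under these assumptions the whole cache is only a few tens of megabytes, and a frame whose visible set shifts by 2% only needs a handful of new pages streamed in.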

    If the game behaves similarly to our existing console game, we need a cache size of around 3x the working set for texture data. A big percentage of the memory accessed per frame (or stored to memory) goes to the textures. If we assume for a moment that all other memory accesses are as stable as texture accesses (cache multiplier of 3x), we only need 600 MB of memory for a fully working game. For some memory bandwidth hungry parts of the game this actually is true. And things are even better for some parts: shadow maps, post processing buffers, back buffer, etc are fully generated again every frame, so we need no extra memory storage to hold caches of these (cache multiplier is 1x).

    Game logic streaming is a harder thing to analyze and generalize. For example our console game has a large free roaming outdoor world. It's nowhere near as big as the worlds in Skyrim for example, but the key point here is that we only keep a slice of the world in memory at once, so the world size could theoretically be limitless (with no extra memory cost). Our view distance is 2 kilometers, so we do not need to keep a full representation of the game world in memory beyond that. The data quality required at a given distance follows a pretty much logarithmic scale (texture mip mapping, object geometry quality, heightfield quality, vegetation map quality, etc). The amount of data required shrinks dramatically as distance grows. This is of course only true for easy cases such as graphics processing, heightfields, etc. Game logic doesn't automatically scale. However you must scale it manually to stay within that 200 MB per frame memory access limit. Your game would slow to a halt if you simply tried to read the full AI data of every single individual NPC in the large scale world, no matter how simple your processing would be.

    Our heightmap cache (used in physics, raycasts and terrain visualization) keeps around 4x the working set. We do physics simulation (and exact collision) only for things near the player (100 meters max). When an object enters this area, we add corresponding physics objects to our physics world. It's hard to estimate exactly how big a percentage of our physics world structures is accessed per frame, but I would estimate around 10%. So we basically have a 10x working set "cache" for physics.

    Basically no component in our game required more than 10x memory compared to its working set. The average requirement was around 3x. So theoretically a game with similar memory access patterns would only need 600 MB of memory on our example Trinity platform. And this includes as much texture resolution as you ever want (virtual texturing works that way). And it includes as much other (physics, game logic, etc) data as you can process per frame (given the limited bandwidth). Of course another game might need for example an average of 10x working set for caches, but that's still only 2 GB. Assuming the game is properly optimized (predictable memory accesses are a must-have for good performance) and utilizes JIT streaming well, it will not benefit much if we add more main memory to our Trinity platform beyond that 2 GB.
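
    The estimate above boils down to multiplying the per-frame working set by a cache multiplier. A minimal sketch using the 200 MB working set and the 1x, 3x and 10x multipliers mentioned in this post; the function name and the labels are made up for illustration:

```python
WORKING_SET_MB = 200.0  # per-frame working set on the example Trinity system

# Cache multipliers quoted in the post.
MULTIPLIERS = {
    "regenerated every frame": 1.0,
    "average component": 3.0,
    "worst component": 10.0,
}


def memory_needed_mb(working_set_mb: float, cache_multiplier: float) -> float:
    """Memory needed to hold the working set plus its streaming cache."""
    return working_set_mb * cache_multiplier


if __name__ == "__main__":
    for label, multiplier in MULTIPLIERS.items():
        needed = memory_needed_mb(WORKING_SET_MB, multiplier)
        print(f"{label}: {needed:.0f} MB")
```

    Because the working set itself is capped by bandwidth, the result scales with bandwidth rather than with how much content the game ships.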

    More memory of course makes developers' lives easier. Predicting data access patterns can be very hard for some styles of games and structures. But mindlessly increasing the cache sizes much beyond working set sizes doesn't help either (as we all know that increasing cache size beyond working set size gives on average only logarithmic improvement on cache hit rate = diminishing returns very quickly).

    My conclusion: Usable memory amount is very much tied to available memory bandwidth. More bandwidth allows the games to access more memory. So it's kind of counterintuitive to swap a faster, smaller memory for a slower, larger one. More available memory means that I want to access more memory, but in reality the slower bandwidth allows me to access less. So the percentage of accessible memory drops radically.
     
  2. ultragpu

    Veteran

    Joined:
    Apr 21, 2004
    Messages:
    3,844
    Location:
    Australia
    Very impressive analysis Sebbi and thanks for sharing your thoughts. I am curious though about the Sparse Voxel Octree technique used in UE4, it was mentioned to be memory intensive on the system, so do you think the same logic applies here as well?
     
  3. french toast

    Veteran

    Joined:
    Jan 5, 2012
    Messages:
    1,648
    Location:
    Leicestershire - England
    Fantastic write up. And a very interesting topic..on the other thread there is talk of 8gb ram..but as you state having loads of ram like that can unbalance the system and be not a great deal of use if there is not enough bandwidth...

    Nvidia have balanced their kepler 670-680 very well in regards to both parameters..where as amd gcn has massive amounts of ram..and also more bandwidth..but at least in the comparison I read all those extra resources only gave a slight advantage at high resolutions..so was that a waste on amds part? Or is it because pc games are not made to take advantage of all those resources in the same way a console would?

    8gb of ddr 3 looks to me to be a complete waste of time, because you would have to stick a mammoth wide (384-512bit ?) bus on it to achieve the required bandwidth to make the most of that ram.
    Of course you could have another edram setup...but then that wouldn't change the bandwidth to main ram...so again the use of having 8gb is not as advantageous as it looks at first glance.

    For me you are better off having 4gb of lightning fast unified ram..something like gddr 5...that would be much more useful imo.

    What about the latency and read speeds of a hdd vs a fast ssd??
    Would going for a smaller 2gb of ultra fast gddr 5 mated to a very fast ssd be more beneficial than 8gb ddr3, 64mb edram and a hdd?
     
  4. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,417
    Location:
    Planet Earth.
    There are still a number of games/engines that just load levels, those would benefit from more memory as it would mean bigger levels, I expect them to be a dying breed though.
    (Using memory as a giant I/O cache, subdividing the world at creation to have perfectly smooth game experience, quite understandable when you have to read from optical drives.)

    On a second note, I'd like to emphasize the need for ECC as memory amount and bandwidth increase. It's not a problem we can just continue to ignore, all modern CPUs should already have ECC.
    (Google published results of 3–10×10^-9 error/bit·h.)

    On a third note I'd urge gamers not to purchase those CPU with embedded (immediate mode) GPU, hoping that if enough people do that Intel/AMD will improve their offer for "just" CPU.
    (Plus that's really a waste to have something you won't use :( )
     
  5. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,636
    If in a stream like benchmark AMD is only getting ~50% memory bandwidth then they either have a bottleneck at some point between the CPU and the memory controller or have a sub-optimal memory controller.

    In the case in question, we are talking about a max memory bandwidth in the range of 75-100 GB/s and a realistic bandwidth on the order of 70-80% of peak for a range of 56-75 GB/s. Which gives a per frame bandwidth of roughly 1GB or a bit higher at 60 FPS.

    This for the ranges you gave works out to an optimal memory size around 3-10 GB.

    But we haven't even factored in the issue of HDD bandwidth and access times vs optical bandwidth and access times. BR has an order of magnitude higher access latency than HDD and at its peak roughly half the bandwidth of a modern HDD. This further pushes up the pre-buffering/streaming requirement for an actual game engine.

    Then if the design has a high speed temporary buffer of reasonable size (32MB+), this also reduces the amount of non-static texture data that must be stored and read further increasing the relative size of the texture bandwidth and therefore the streaming texture cache space required.

    So while the original post was informative, I don't believe that it captures the real scope nor details of this as it relates to what might be seen in next gen hardware. It also is an example of a single game design that may or may not have levels and load time between levels. Part of my comment on the 2 GB as a load cache specifically relates to games that do have levels and load times between levels. Using 2GB to stream in the next level data while on the current level would without a doubt improve the user experience.
     
  6. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,636
    I'm not so sure that they'll be a dying breed. Nor that having a large game world really limits things, esp if there are methods of fast travel (as the speed of movement through the game world increases, the rate at which and the size of which your streaming must work increases).

    Not really going to happen. The issue is that consumers just don't care. For important data structures, the realistic course is self check-sums.
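
    As an illustration of what a self check-sum on an important data structure could look like, here is a minimal sketch using CRC32 from Python's standard `zlib` module. The class and method names are hypothetical, and unlike ECC this only detects corruption, it does not correct it:

```python
import zlib


class ChecksummedBlock:
    """Byte buffer that carries a CRC32 of its own contents."""

    def __init__(self, payload: bytes):
        self.payload = bytearray(payload)
        self.crc = zlib.crc32(self.payload)

    def write(self, offset: int, data: bytes) -> None:
        """Legitimate update: change the data, then refresh the checksum."""
        self.payload[offset:offset + len(data)] = data
        self.crc = zlib.crc32(self.payload)

    def is_intact(self) -> bool:
        """Recompute the CRC32 and compare it with the stored value."""
        return zlib.crc32(self.payload) == self.crc


if __name__ == "__main__":
    block = ChecksummedBlock(b"save game header")
    print("after creation:", block.is_intact())

    block.write(0, b"SAVE")
    print("after a normal write:", block.is_intact())

    # Simulate a memory error by flipping one bit behind the structure's back.
    block.payload[3] ^= 0x01
    print("after a bit flip:", block.is_intact())
```

    The cost is one extra pass over the data per check, so this only makes sense for structures that are small or rarely verified.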

    The embedded GPUs really aren't aimed at gamers. They are primarily designed for cost/packaging/board/thermal reasons. If you want to game, you will always be better off with a discrete GPU until such a time as moderate sized on package memories become viable/standard. Though we are getting closer: a single wide IO DRAM will be able to provide in the range of 100-200 GB/s of bandwidth and between 512-1024 MB of capacity. Combine that with a main memory in the range of 50 GB/s and the integrated GPUs will finally be able to stretch their legs. Realistically, that is all about 3-5 years out for mainstream computers at the front edge. Lots of other markets though would prefer if PCs got there sooner so they could leverage off of them.
     
  7. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,044
    But a very fast PC hard drive is only around 90 MB/s. With a perfectly linear access pattern (requires data duplication on assets) it takes 11 seconds to load each gigabyte of data. For example to fill 8 gigabytes of memory you need 88 seconds, and to fill 12 gigabytes of memory you need 133 seconds.

    On 5400 RPM hard drives (used in laptops and gaming consoles) the loading times would be double of that: 176 seconds for 8 GB and 266 seconds for 12 GB.
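
    The loading times above follow directly from memory size divided by sustained read speed. A rough sketch, assuming a perfectly linear read, 1 GB = 1000 MB, and that the 5400 RPM drive runs at half the 90 MB/s figure; the function name is made up for illustration. The post rounds to 11 seconds per gigabyte, so its totals are slightly lower than the unrounded results:

```python
def fill_time_s(memory_gb: float, disk_mb_per_s: float) -> float:
    """Seconds needed to fill memory from disk with a perfectly linear read."""
    return memory_gb * 1000.0 / disk_mb_per_s


if __name__ == "__main__":
    for label, speed in (("fast desktop HDD", 90.0), ("5400 RPM HDD", 45.0)):
        for size_gb in (8.0, 12.0):
            seconds = fill_time_s(size_gb, speed)
            print(f"{label}: {size_gb:.0f} GB in about {seconds:.0f} s")
```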

    Anything over 20 seconds, and the game experience gets degraded. I have stopped playing some games (for example Gran Turismo 5) because of too long loading times. I personally prefer my games (the games I develop) to have zero loading times if at all possible. We had 3 second loading times in Trials Evolution (because we streamed almost everything).
    Yes, but it's still a pure memory benchmark that has a perfectly optimized memory access pattern. I wouldn't expect to have even 50% of that in a real game. And if you count all the double memory accesses a real game needs during its frame (for example shadow map rendering), you can halve that figure again as well.
    You used Battlefield 3 as your example. They are using lots of techniques similar to ours. Lots of data streaming and only a small subset in memory. This whitepaper explains their streaming system pretty well (excellent reading btw): http://publications.dice.se/attachments/GDC12_Terrain_in_Battlefield3.pdf

    Of course if you are having completely linear game and you have extra 2 GB to burn, you can load your next level at the same time you play the current one. But I don't see this as good use of resources. Instead of this extra 2 GB I would choose higher clocked memory to improve my frame rate, increase my view distance and to add more dynamic physics driven stuff to my levels (all these require bandwidth).
     
  8. Rangers

    Legend

    Joined:
    Aug 4, 2006
    Messages:
    11,179

    So all the people crying for 4GB in PS4 are nuts? :razz:

    You also build your argument around 60 FPS, which is nice and all, but the vast vast majority of games are 30. We even saw devs like Insomniac publicly announce a switch from 60 to 30 this gen (for Ratchet).

    So, using Aaronspink's analysis at 1GB 60 FPS, a typical 30 FPS game could access 2GB per frame. I guess this does put a hard limit on some things, but what if the content changes rapidly frame to frame (as it would seem to in any video game...)? Wouldn't the 8 GB system be at an advantage over the 2 GB one that now has to go to an HDD or something magnitudes slower to get new data? In essence the 8GB system could "buffer" 3 additional frames, vs 0 for the 2GB. Or am I totally not getting it?
     
  9. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran Subscriber

    Joined:
    Jan 11, 2008
    Messages:
    3,219
    Location:
    New Zealand
    This explains why PC games let you save and reload from anywhere whereas console games use checkpoints, I get it now.

    25GB/S = ~2.5-8.3GB working set.
    50GB/S = 5-16.6GB
    100GB/S = 10-33.3GB

    Does that make sense?

    25GB / 30 * 3:1-10:1?

    Effectively that means that current consoles could have used a lot more memory even at the same bandwidth. Am I right?
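
    The rule of thumb in this post can be written out as a small function: per-frame bandwidth at 30 fps multiplied by the 3:1 to 10:1 cache ratios from the opening post. This is only a sketch of that back-of-envelope formula, and the function name is hypothetical:

```python
def capacity_range_gb(bandwidth_gb_per_s: float, fps: float,
                      min_multiplier: float = 3.0, max_multiplier: float = 10.0):
    """Return (low, high) useful memory size in GB for a given bandwidth."""
    per_frame_gb = bandwidth_gb_per_s / fps
    return per_frame_gb * min_multiplier, per_frame_gb * max_multiplier


if __name__ == "__main__":
    for bandwidth in (25.0, 50.0, 100.0):
        low, high = capacity_range_gb(bandwidth, fps=30.0)
        print(f"{bandwidth:.0f} GB/s -> {low:.1f} to {high:.1f} GB")
```

    With these inputs 25 GB/s lands at roughly 2.5 to 8 GB and 100 GB/s at roughly 10 to 33 GB. Note that this uses peak bandwidth; the opening post argues a real game reaches well under that, which would shrink the ranges accordingly.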
     
  10. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,636
    The problem is that the bandwidth/latency curve is actually worse for something like GDDR5 @ 5 Gb/s than DDR4 at 3 Gb/s. I.e., GDDR5 has roughly the same latency as DDR4 but needs to fill more bandwidth, so in general it is going to deliver lower efficiency. In both cases you handle this somewhat by going to small sub-channel widths. This increases your effective latency somewhat due to the latency caused by the transfer overhead, but gives higher sustained bandwidth due to high occupancy per transaction per sub-channel (assuming you have a reasonable interleave function of course; most of these in modern systems are somewhat complex hash functions that tend to work fairly well, esp when given 8+ sub-channels to interleave between).

    Intermediate data is better served by a high speed buffer memory rather than the main pool.

    But given enough thread level parallelism, which in a graphics workload you will have in spades, you should be able to sustain on the order of 80-90% effective bandwidth with narrow channel DDR4 given an appropriately designed memory controller.

    The reality of the situation is you effectively end up paying a ~4x capacity cost for around 50% higher bandwidth between GDDR and DDR. This will likely get worse over time as the types of device that need/can afford GDDR get smaller further pushing up the cost of GDDR relative to DDR.

    The trend is pretty clear, in that conserving large pool bandwidth is going to be more of a factor going forward with some relief coming from relatively small high bandwidth intermediate buffer memories (eDRAM and stack/wide io dram).


    The problem is that higher speed memory is going to be increasingly boutique. So the reality is would you rather have 4x the memory or <50% higher bandwidth...

    Another interesting conundrum, is say you have 2-3 GB per frame of bandwidth but only 2 GB of memory. In that case, you simply cannot use the bandwidth as you will never be able to stream in assets at a high enough rate.
     
  11. upnorthsox

    Veteran

    Joined:
    May 7, 2008
    Messages:
    1,842

    Can you please provide a link to this 100GB DDR3 memory? Thanks.
     
  12. upnorthsox

    Veteran

    Joined:
    May 7, 2008
    Messages:
    1,842
    No, he's saying that people thinking 8GB at 1/4 the bandwidth is somehow better are crazy. :cool:
     
  13. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    34,857
    Location:
    Under my bridge
    The majority discussion about the choice of RAM for next gen machines wasn't addressing the discussion of BW versus capacity, so has been moved to its own discussion here.
     
  14. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    34,857
    Location:
    Under my bridge
    Looking at the counter argument aaronspink has made, do you think you could draw up a minimum amount of RAM needed for a given bandwidth to be effective? At some point there just won't be enough RAM to fit the content. At the other end, which you argue probably isn't as high as people imagine, the RAM will be an excess and a waste of money. Where does the sweet spot lie?

    You can always burn up more RAM or BW with your game. If you have an excess of BW, just use loads of transparency and particles, like PS2. An imbalance too far in either direction will limit the options devs have and force them into specific approaches, or add excessive cost to the machines. So as I say to sebbbi above, where does the sweetspot lie given a specified bandwidth?

    Most Economical Capacity = BW (GB/s) / 25 or something similar?
     
  15. Kb-Smoker

    Regular

    Joined:
    Aug 26, 2005
    Messages:
    614
    Seems like an SSD would solve some of the problems with having 2GB, but given the cost it's unlikely.

    So, as DICE/Epic have stated that 2GB would not be enough, wouldn't that be really telling for this debate where we have 2GB vs 8GB? For them to make a public comment means either Sony would not listen to direct feedback or what?? I remember Epic doing the same last gen with X360 but don't remember them doing it in public.

    I feel the debate would change almost 100% if we are comparing 4 GB gddr5 vs 8GB DDR3/4.
     
  16. upnorthsox

    Veteran

    Joined:
    May 7, 2008
    Messages:
    1,842
    This is the first question to answer. IMO, 2GB at any bandwidth may be enough to start the next gen, but like 512MB this gen, it'll soon become a limiter. 4GB is the likely sweet spot. I don't see a general use case for 8GB vs 4GB, especially if it comes at 1/2 bandwidth cost.
     
  17. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran Subscriber

    Joined:
    Jan 11, 2008
    Messages:
    3,219
    Location:
    New Zealand
    One thing perhaps to keep in mind is that the OS may use a lot of memory as the needs of the device expand between generations. It may have a very high residence to use ratio. I.e. it may use a lot of memory, however it likely won't use a lot of bandwidth.
     
  18. upnorthsox

    Veteran

    Joined:
    May 7, 2008
    Messages:
    1,842

    You bring up a good point that the memory amount covers both cpu and gpu.

    That said, I currently have Win7 Pro, 8 browser windows, 2 Adobe pdf's, 1 vpn connection, 2 rdp sessions, and a full virus scan running (which hammers the piss out of my cpu) and I'm still only utilizing 1.3GB of memory. At 2GB that wouldn't leave much for a game running too, but at 4GB I'd still have 2.7GB available (but no cpu :evil:).
     
  19. ERP

    ERP
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Location:
    Redmond, WA
    While I agree in principle with the OP, I wanted to point some things out.
    The available memory bandwidth per frame is only interesting as it applies to total memory if you can predict with some degree of certainty which 200MB or so you're going to touch, and you can actually get it off disk before you will need it.

    So while above some threshold more memory doesn't help you with higher res textures that doesn't make it useless.

    The big issue with a lot of memory, as has been pointed out, is filling it. I think it's interesting for parametric data that expands to many times its read size and can't be computed on use. This is what most would refer to as procedural content; at some level you can consider parametric content to be data compression with extreme compression ratios.
    Really the memory is still just a cache, but it's a cache for computation rather than disk reads. It's one area I'd be seriously looking at going forwards.

    Voxelization of data is interesting, and slight changes in orientation can radically change the way structures like this are walked, requiring vastly more memory than the actual amount walked. You also don't want to have to keep writing the data, wasting your bandwidth, when you're repeatedly reading it. Again a computation cache.

    As an aside The Sony paper is interesting but doesn't age well, you can still kill yourself with virtual function calls, but it's nothing like it was circa PS2 or worse PS1 with direct mapped single digit kilobyte caches.
     
  20. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    34,857
    Location:
    Under my bridge
    This is one of the most obvious advantages, the old trade between computation power and storage. Rather than recomputing varied assets on the fly, they could be saved on a single creation.
     
