Old 01-Jul-2012, 09:34   #1
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Memory bandwidth vs memory amount *spin off*

The next-gen speculation thread started to have an interesting debate about memory bandwidth vs memory amount. I don't personally want to contribute to the next-gen speculation, but the "memory bandwidth vs memory amount" topic is very interesting in its own right. So I decided to make a thread for it, as I have personally been doing a lot of memory access and bandwidth analysis lately for our console technology, and I have programmed our virtual texturing system (and many other JIT data streaming components).

Historically, memory performance has improved linearly (very slowly) compared to the exponential (Moore's law) growth of CPU performance. Relative memory access times (latencies) have grown to be over 400x higher (in clock cycles) compared to the first PCs, and there are no signs that this development will slow down in the future, unless we invent some radically new way of storing data. None of the currently known future technologies is going to solve the problem, just provide some band-aid. So we need to adapt.

Some links to background information first:

1. A presentation by Sony R&D, targeted at game technology programmers. It has a very good real-life example of how improving your memory access pattern can improve your performance by almost 10x. It also has nice charts (slides 17 and 18) showing how memory speed has increased historically compared to ALU performance:
http://harmful.cat-v.org/software/OO...ng_GCAP_09.pdf

2. Benchmark results for a brand new x86 chip with a unified memory architecture (CPU & GPU share the same memory & memory controller). The benchmark shows system performance with all available DDR3 speeds from DDR3-800 to DDR3-1866. All other system settings are identical; only the memory bus bandwidth is scaled up/down. We can see an 80% performance (fps) improvement in the gaming benchmark just by increasing the DDR3 memory clock:
http://www.tomshardware.com/reviews/...0k,3224-5.html

3. A GPU benchmark comparing the old GeForce GTS 450 (1 GB, GDDR5) card to a brand new Kepler-based GeForce GT 640 (2 GB, DDR3). The new Kepler-based card has twice the memory amount and twice the ALU performance, but only half the memory bandwidth (because of DDR3). Despite the much faster theoretical shader performance and twice the memory amount, it loses pretty badly in the benchmarks because of its slower memory bus:
http://www.anandtech.com/show/5969/z...gt-640-review-

Quote:
Originally Posted by aaronspink View Post
In the console space, using 2GB as a disk cache alone will make for a better end user experience than 2x or even 3-4x gpu performance.
I completely disagree with this, and I'll try to explain why. As a professional, you of course know most of the background facts, but I need to lay them out first, so that my later remarks don't stand without a factual base.

--- ---

I will use the x86-based Trinity APU [link 2] as my example system, as it has close enough performance and memory bandwidth compared to current-generation consoles (it's only around 2x-4x faster overall) and it has unified memory (a single memory bus shared between CPU & GPU). It's much easier to talk about a well-known system, with lots of public benchmark results around the net.

Let's assume we are developing a vsync-locked 60 fps game, so each frame must complete in 16.6 ms. Let's assume our Trinity system is equipped with the fastest DDR3 it supports (DDR3-1866). According to Tom's Hardware's synthetic bandwidth benchmark, this configuration gives us 14 GB of bandwidth per second. Divide that by 60, and we get 233 MB of bandwidth per frame. Let's round that down to an even 200 MB per frame to ease up our calculations. A real game never utilizes memory bandwidth as well as a synthetic benchmark, so even the 200 MB per frame figure is optimistic.

Now I know that my game should never access more than 200 MB of unique memory per frame if I want to reach my vsync-locked 60 fps. If I access more memory, my frame rate dips, as the memory subsystem cannot give me enough data and my CPU & GPU start stalling.
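The budget above is simple arithmetic; here is a quick sketch of it (my own illustrative code, using the DDR3-1866 figures quoted above):

```python
# Per-frame bandwidth budget for a vsync-locked game.
def bandwidth_per_frame_mb(bandwidth_gb_s: float, fps: int) -> float:
    """Unique megabytes we can touch per frame before the bus saturates."""
    return bandwidth_gb_s * 1000.0 / fps  # GB/s -> MB per frame

budget = bandwidth_per_frame_mb(14, 60)
print(f"{budget:.0f} MB per frame")  # ~233 MB; rounded down to 200 MB above
```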

How about CPU & GPU caches? Caches only help with repeated accesses to the same data; they do not allow us to access any more unique data per frame. It's also worth noticing that if you access the same memory, for example, at the beginning of your frame, in the middle of your frame and at the end of your frame, you pay as much as if you did three unique memory accesses. Caches are very small, and old data gets replaced very fast. Our Trinity CPU has 4 MB of L2 cache, and we move 200 MB of data through the cache every frame. Our cache gets fully replaced by new data (200/4 =) 50 times every frame. Data only stays in cache for 0.33 ms; if we access it again after this period, we must fetch it from memory again (wasting our valuable 200 MB per frame of bandwidth). It's not uncommon for a real game to access every piece of data in the current working set (on average) twice per frame, leaving us with 100 MB per frame of unique accessible memory. Examples: shadow maps are first rendered (to textures in memory) and sampled later during the lighting pass. Physics simulation moves objects (positions & rotations), and later in the frame those same objects are rendered (accessing the same position and rotation data again).
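The cache-turnover numbers can be sanity-checked in a few lines (an illustrative sketch using the same figures as the paragraph above):

```python
# How long data survives in a 4 MB L2 when 200 MB flows through per frame.
cache_mb = 4
traffic_mb_per_frame = 200
frame_ms = 1000.0 / 60  # ~16.6 ms per frame at 60 fps

turnovers_per_frame = traffic_mb_per_frame / cache_mb  # cache fully replaced 50x
residency_ms = frame_ms / turnovers_per_frame          # ~0.33 ms per turnover
print(turnovers_per_frame, round(residency_ms, 2))
```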

However, let's keep the theoretical 200 MB per frame number, as engines differ and access patterns differ (and we do not really want to go that far in the analysis). In a real game you can likely access only around 100-150 MB of unique memory per frame, so the forthcoming analysis is optimistic: a real game would likely access less memory and thus have a smaller working set.

So far we know that the processing and rendering of a single frame can never require more than 200 MB of memory (we can't reach 60 fps otherwise). If your game has a static scene, you will not need more memory than that. However, static scenes are not much fun, so this scenario is highly unlikely in real games (except maybe a chess game with a fixed camera). So the billion-dollar question becomes: how much does the working set (the set of memory accesses) change from frame to frame in a 60 fps game?

In a computer game, objects and cameras do not really "move" around; they get repositioned every frame. For this repositioning to look like smooth movement, we can only change the positions very slightly from frame to frame. This basically means that our working set can also only change slightly from frame to frame. According to my analysis (for our game), our working set changes around 1%-2% per frame in the general case, and peaks at around 10%. An especially notable fact is that our virtual texturing system's working set never changes more than 2% per frame (and textures are the biggest memory user in most games).

Assume a game with a similar memory access pattern (a similarly changing working set from frame to frame) is running on our Trinity example platform. This basically means that in the average case our working set changes by 2 MB to 4 MB per frame, and peaks at around 20 MB per frame. We can stream this much data from a standard HDD. However, HDDs have long latencies and long seek times, so we must stream data in advance and bundle it in slightly bigger chunks than we would like, to combat the slow seek time. Both streaming in advance (prefetching) and loading in bigger chunks (loading a slightly wider working set) require extra memory. The question becomes: how much larger does the memory cache need to be than our working set?
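As a rough sketch of the streaming load those percentages imply (my own illustrative numbers; the HDD note assumes, as the cache discussion suggests, that only the part of the frame-to-frame delta missing from the in-memory cache actually hits the disk):

```python
working_set_mb = 200
avg_change = (0.01, 0.02)  # 1-2% per frame in the general case
peak_change = 0.10         # ~10% peak

avg_delta_mb = [working_set_mb * c for c in avg_change]  # 2-4 MB per frame
peak_delta_mb = working_set_mb * peak_change             # 20 MB per frame
# Only true cache misses are fetched from disk, which is why a standard
# HDD (~90 MB/s linear) can keep up with this rate of working-set change.
print(avg_delta_mb, peak_delta_mb)
```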

The working set is 200 MB (if we want to reach that 60 fps in the imaginary game on our Trinity platform). How much more memory do we need for the cache? Is 2.5x the working set enough (512 MB)? How about 5x (1 GB), or 10x (2 GB)?

Our virtual texture system has a static 1024-page cache (128x128 pixel pages, two DXT5-compressed layers per page). Our average working set per frame is around 200-400 pages, and it peaks as high as 600 pages. The cache is so small that it has to reload all textures if you spin the camera around 360 degrees, but this doesn't matter, as the HDD streaming speed is enough to push new data in at a steady pace. You never see any texture popping when rotating or moving the camera. The only occasion where you see texture popping is when the camera suddenly teleports to a completely different location (the working set changes almost completely). In our game this only happens if you restart to a checkpoint or restart the level completely, so it's not a big deal (and we can predict it).
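For scale, the size of such a page cache is easy to work out (a sketch using DXT5's fixed rate of 16 bytes per 4x4 block, i.e. 1 byte per pixel):

```python
PAGE_DIM = 128             # 128x128 pixel pages
DXT5_BYTES_PER_PIXEL = 1   # DXT5: 16 bytes per 4x4 block
LAYERS = 2                 # two DXT5-compressed layers per page
PAGES = 1024               # static cache size

page_bytes = PAGE_DIM * PAGE_DIM * DXT5_BYTES_PER_PIXEL * LAYERS  # 32 KiB/page
cache_mib = PAGES * page_bytes / (1024 * 1024)                    # 32 MiB total
print(page_bytes, cache_mib)
```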

If a game behaves similarly to our existing console game, we need a cache size of around 3x the working set for texture data. A big percentage of the memory accessed per frame (or stored to memory) goes to textures. If we assume for a moment that all other memory accesses are as stable as texture accesses (a cache multiplier of 3x), we only need 600 MB of memory for a fully working game. For some memory bandwidth hungry parts of the game this actually is true. And things are even better for some parts: shadow maps, post-processing buffers, the back buffer, etc. are fully regenerated every frame, so we need no extra memory to hold caches of these (the cache multiplier is 1x).

Game logic streaming is a harder thing to analyze and generalize. For example, our console game has a large free-roaming outdoor world. It's nowhere near as big as the worlds in Skyrim, for example, but the key point here is that we only keep a slice of the world in memory at once, so the world size could theoretically be limitless (with no extra memory cost). Our view distance is 2 kilometers, so we do not need to keep a full representation of the game world in memory beyond that. The data quality required at a distance follows a pretty much logarithmic scale (texture mip-mapping, object geometry quality, heightfield quality, vegetation map quality, etc.): the data required shrinks dramatically as distance grows. This is of course only true for easy cases such as graphics processing, heightfields, etc. Game logic doesn't automatically scale; you must scale it manually to stay within that 200 MB per frame memory access limit. Your game would slow to a halt if you simply tried to read the full AI data of every single individual NPC in a large-scale world, no matter how simple your processing was.

Our heightmap cache (used in physics, raycasts and terrain visualization) keeps around 4x the working set. We do physics simulation (and exact collision) only for things near the player (100 meters max). When an object enters this area, we add the corresponding physics objects to our physics world. It's hard to estimate exactly how big a percentage of our physics world structures are accessed per frame, but I would estimate around 10%. So we basically have a 10x working set "cache" for physics.

Basically, no component in our game required more than 10x memory compared to its working set. The average requirement was around 3x. So theoretically a game with similar memory access patterns would only need 600 MB of memory on our example Trinity platform. And this includes as much texture resolution as you could ever want (virtual texturing works that way). And it includes as much other (physics, game logic, etc.) data as you can process per frame (given the limited bandwidth). Of course, another game might need, for example, an average of 10x the working set for caches, but that's still only 2 GB. Assuming the game is properly optimized (predictable memory accesses are a must-have for good performance) and utilizes JIT streaming well, it will not benefit much from adding main memory to our Trinity platform beyond that 2 GB.
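The whole estimate boils down to one multiplication (a sketch; the multipliers are the per-component figures discussed above):

```python
working_set_mb = 200  # per-frame bandwidth budget on Trinity
cache_multipliers = {
    "render targets (regenerated each frame)": 1,
    "textures (virtual texturing)": 3,
    "heightmaps": 4,
    "physics": 10,
}

def required_memory_mb(multiplier: float, working_set: float = working_set_mb) -> float:
    """Memory needed = cache multiplier times the per-frame working set."""
    return multiplier * working_set

print(required_memory_mb(3))   # 600 MB for an "average" game
print(required_memory_mb(10))  # 2000 MB (~2 GB) worst case
```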

More memory of course makes a developer's life easier. Predicting data access patterns can be very hard for some styles of games and structures. But mindlessly increasing cache sizes far beyond working set sizes doesn't help either (as we all know, increasing cache size beyond the working set size gives on average only a logarithmic improvement in cache hit rate = diminishing returns very quickly).

My conclusion: the usable memory amount is very much tied to the available memory bandwidth. More bandwidth allows games to access more memory. So it's counterproductive to swap smaller, faster memory for a larger, slower pool: more available memory means I want to access more memory, but in reality the slower bandwidth allows me to access less. So the percentage of accessible memory drops radically.
Old 01-Jul-2012, 10:49   #2
ultragpu
Member
 
Join Date: Apr 2004
Location: Australia
Posts: 3,026

Very impressive analysis, sebbbi, and thanks for sharing your thoughts. I am curious, though, about the sparse voxel octree technique used in UE4. It was mentioned to be memory intensive on the system, so do you think the same logic applies there as well?
Old 01-Jul-2012, 10:59   #3
french toast
Senior Member
 
Join Date: Jan 2012
Location: Leicestershire - England
Posts: 1,634

Fantastic write-up, and a very interesting topic. On the other thread there's talk of 8 GB of RAM, but as you state, having loads of RAM like that can unbalance the system and not be a great deal of use if there is not enough bandwidth...

Nvidia have balanced their Kepler 670/680 very well in regards to both parameters, whereas AMD's GCN has massive amounts of RAM and also more bandwidth, but at least in the comparison I read, all those extra resources only gave a slight advantage at high resolutions. So was that a waste on AMD's part? Or is it because PC games are not made to take advantage of all those resources in the same way a console would?

8 GB of DDR3 looks to me to be a complete waste of time, because you would have to stick a mammoth-wide (384-512 bit?) bus on it to achieve the required bandwidth to make the most of that RAM.
Of course you could have another eDRAM setup... but then that wouldn't change the bandwidth to main RAM... so again having 8 GB is not as advantageous as it looks at first glance.

For me you are better off having 4 GB of lightning-fast unified RAM, something like GDDR5... that would be much more useful imo.

What about the latency and read speeds of an HDD vs a fast SSD?
Would going for a smaller 2 GB of ultra-fast GDDR5 mated to a very fast SSD be more beneficial than 8 GB DDR3, 64 MB eDRAM and an HDD?
Old 01-Jul-2012, 11:15   #4
Rodéric
a.k.a. Ingenu
 
Join Date: Feb 2002
Location: Carnon Plage, France.
Posts: 2,943

There are still a number of games/engines that just load levels; those would benefit from more memory, as it would mean bigger levels. I expect them to be a dying breed, though.
(Using memory as a giant I/O cache and subdividing the world at creation time to get a perfectly smooth game experience is quite understandable when you have to read from optical drives.)

On a second note, I'd like to emphasize the need for ECC as memory amounts and bandwidths increase. It's not a problem we can just continue to ignore; all modern CPUs should already have ECC.
(Google published results of 3–10 × 10⁻⁹ errors/bit·h.)

On a third note, I'd urge gamers not to purchase those CPUs with an embedded (immediate mode) GPU, hoping that if enough people do that, Intel/AMD will improve their offerings for "just" CPUs.
(Plus it's really a waste to have something you won't use.)
__________________
So many things to do, and yet so little time to spend...
Old 01-Jul-2012, 11:45   #5
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,596

If in a STREAM-like benchmark AMD is only getting ~50% of memory bandwidth, then they either have a bottleneck somewhere between the CPU and the memory controller or a sub-optimal memory controller.

In the case in question, we are talking about a max memory bandwidth in the range of 75-100 GB/s and a realistic bandwidth on the order of 70-80% of peak, for a range of 56-75 GB/s. This gives a per-frame bandwidth of roughly 1 GB, or a bit higher, at 60 fps.

For the ranges you gave, this works out to an optimal memory size of around 3-10 GB.
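In sketch form, my reading of that arithmetic (illustrative code; the rounding matches the figures above only approximately):

```python
# Peak bandwidth 75-100 GB/s, realistic efficiency 70-80% of peak.
peak_gb_s = (75, 100)
efficiency = (0.70, 0.80)

sustained = (peak_gb_s[0] * efficiency[0], peak_gb_s[1] * efficiency[1])
per_frame_gb = tuple(s / 60 for s in sustained)  # ~0.9-1.3 GB/frame at 60 fps
# With the 3x-10x cache multipliers from the opening post, roughly
# 1 GB per frame implies an optimal memory size of about 3-10 GB.
print(sustained, per_frame_gb)
```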

But we haven't even factored in the issue of HDD bandwidth and access times vs optical bandwidth and access times. BR has an order of magnitude higher access latency than an HDD and, at its peak, roughly half the bandwidth of a modern HDD. This further pushes up the pre-buffering/streaming requirement for an actual game engine.

Then, if the design has a high-speed temporary buffer of reasonable size (32 MB+), this also reduces the amount of non-static texture data that must be stored and read, further increasing the relative share of texture bandwidth and therefore the streaming texture cache space required.

So while the original post was informative, I don't believe it captures the real scope or details of this as it relates to what might be seen in next-gen hardware. It is also an example of a single game design that may or may not have levels and load times between levels. Part of my comment on 2 GB as a load cache specifically relates to games that do have levels and load times between levels. Using 2 GB to stream in the next level's data while on the current level would without a doubt improve the user experience.
__________________
Aaron Spink
speaking for myself inc.
Old 01-Jul-2012, 11:57   #6
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,596

Quote:
Originally Posted by Rodéric View Post
There are still a number of games/engines that just load levels; those would benefit from more memory, as it would mean bigger levels. I expect them to be a dying breed, though.
(Using memory as a giant I/O cache and subdividing the world at creation time to get a perfectly smooth game experience is quite understandable when you have to read from optical drives.)
I'm not so sure that they'll be a dying breed, nor that having a large game world really limits things, especially if there are methods of fast travel (as the speed of movement through the game world increases, the rate at which, and the size at which, your streaming must work increases).

Quote:
On a second note, I'd like to emphasize the need for ECC as memory amounts and bandwidths increase. It's not a problem we can just continue to ignore; all modern CPUs should already have ECC.
(Google published results of 3–10 × 10⁻⁹ errors/bit·h.)
That's not really going to happen. The issue is that consumers just don't care. For important data structures, the realistic course is self-checksums.

Quote:
On a third note, I'd urge gamers not to purchase those CPUs with an embedded (immediate mode) GPU, hoping that if enough people do that, Intel/AMD will improve their offerings for "just" CPUs.
(Plus it's really a waste to have something you won't use.)
The embedded GPUs really aren't aimed at gamers. They are primarily designed for cost/packaging/board/thermal reasons. If you want to game, you will always be better off with a discrete GPU, until such a time as moderately sized on-package memories become viable/standard. Though we are getting closer: a single wide-IO DRAM will be able to provide in the range of 100-200 GB/s of bandwidth and between 512-1024 MB of capacity. Combined with a main memory in the range of 50 GB/s, the integrated GPUs will finally be able to stretch their legs. Realistically, that is all about 3-5 years out for mainstream computers at the front edge. Lots of other markets, though, would prefer it if PCs got there sooner so they could leverage off of them.
Old 01-Jul-2012, 12:54   #7
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388

Quote:
Originally Posted by Rodéric View Post
There are still a number of games/engines that just load levels; those would benefit from more memory, as it would mean bigger levels. I expect them to be a dying breed, though.
But even a very fast PC hard drive only reaches around 90 MB/s. With a perfectly linear access pattern (which requires data duplication in the assets), it takes 11 seconds to load each gigabyte of data. For example, to fill 8 gigabytes of memory you need 88 seconds, and to fill 12 gigabytes of memory you need 133 seconds.

On 5400 RPM hard drives (used in laptops and gaming consoles) the loading times would be double that: 176 seconds for 8 GB and 266 seconds for 12 GB.
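The load-time arithmetic, as a small sketch for any drive speed (illustrative only; the 45 MB/s figure for a 5400 RPM drive is my assumption, i.e. half the fast-drive rate):

```python
def fill_time_s(gigabytes: float, drive_mb_s: float) -> float:
    """Seconds to stream `gigabytes` into memory at a linear read rate."""
    return gigabytes * 1000.0 / drive_mb_s

print(round(fill_time_s(1, 90)))  # ~11 s per GB on a fast desktop HDD
print(round(fill_time_s(8, 90)))  # ~89 s to fill 8 GB
print(round(fill_time_s(8, 45)))  # ~178 s on a 5400 RPM drive
```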

Anything over 20 seconds and the game experience gets degraded. I have stopped playing some games (for example Gran Turismo 5) because of overly long loading times. I personally prefer my games (the games I develop) to have zero loading times if at all possible. We had 3-second loading times in Trials Evolution (because we streamed almost everything).
Quote:
Originally Posted by aaronspink View Post
If in a stream like benchmark AMD is only getting ~50% memory bandwidth then they either have a bottleneck at some point between the CPU and the memory controller or have a sub-optimal memory controller.
Yes, but it's still a pure memory benchmark with a perfectly optimized memory access pattern. I wouldn't expect to get even 50% of that in a real game. And if you count all the double memory accesses a real game needs during its frame (for example shadow map rendering), you can halve that figure again.
Quote:
Originally Posted by aaronspink View Post
It also is an example of a single game design that may or may not have levels and load time between levels. Part of my comment on the 2 GB as a load cache specifically relates to games that do have levels and load times between levels. Using 2GB to stream in the next level data while on the current level would without a doubt improve the user experience.
You used Battlefield 3 as your example. They are using a lot of techniques similar to ours: lots of data streaming with only a small subset in memory. This whitepaper explains their streaming system pretty well (excellent reading, btw): http://publications.dice.se/attachme...ttlefield3.pdf

Of course, if you have a completely linear game and an extra 2 GB to burn, you can load your next level while you play the current one. But I don't see this as a good use of resources. Instead of that extra 2 GB, I would choose higher-clocked memory to improve my frame rate, increase my view distance and add more dynamic physics-driven stuff to my levels (all of these require bandwidth).
Old 01-Jul-2012, 13:39   #8
Rangers
Regular
 
Join Date: Aug 2006
Posts: 9,477

Quote:
Originally Posted by sebbbi View Post

Of course, if you have a completely linear game and an extra 2 GB to burn, you can load your next level while you play the current one. But I don't see this as a good use of resources. Instead of that extra 2 GB, I would choose higher-clocked memory to improve my frame rate, increase my view distance and add more dynamic physics-driven stuff to my levels (all of these require bandwidth).

So all the people crying for 4 GB in the PS4 are nuts?

You also build your argument around 60 fps, which is nice and all, but the vast, vast majority of games are 30. We even saw devs like Insomniac publicly announce a switch from 60 to 30 this gen (for Ratchet).

So, using aaronspink's analysis of 1 GB per frame at 60 fps, a typical 30 fps game could access 2 GB per frame. I guess this does put a hard limit on some things, but what if the content changes rapidly frame to frame (as it would seem to in any video game...)? Wouldn't the 8 GB system be at an advantage over the 2 GB one, which now has to go to an HDD or something magnitudes slower to get new data? In essence the 8 GB system could "buffer" 3 additional frames, vs 0 for the 2 GB one. Or am I totally not getting it?
Old 01-Jul-2012, 13:51   #9
Squilliam
Beyond3d isn't defined yet
 
Join Date: Jan 2008
Location: New Zealand
Posts: 3,172

This explains why PC games let you save and reload from anywhere whereas console games use checkpoints. I get it now.

25 GB/s = ~2.5-8.3 GB working set
50 GB/s = ~5-16.6 GB
100 GB/s = ~10-33.2 GB

Does that make sense?

(25 GB/s ÷ 30 fps) × 3:1-10:1?

Effectively that means that current consoles could have used a lot more memory, even at the same bandwidth. Am I right?
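That table follows from one small function (a sketch assuming the 3:1-10:1 cache multipliers from the opening post; minor rounding differences from the figures above are possible):

```python
def working_set_range_gb(bw_gb_s: float, fps: int = 30, multipliers=(3, 10)):
    """Rough usable-memory range: per-frame bandwidth times cache multiplier."""
    per_frame_gb = bw_gb_s / fps
    return tuple(round(per_frame_gb * m, 1) for m in multipliers)

for bw in (25, 50, 100):
    print(bw, working_set_range_gb(bw))  # 25 -> (2.5, 8.3)
```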
__________________
It all makes sense now: Gay marriage legalized on the same day as marijuana makes perfect biblical sense.
Leviticus 20:13 "A man who lays with another man should be stoned". Our interpretation has been wrong all these years!
Old 01-Jul-2012, 13:52   #10
aaronspink
Senior Member
 
Join Date: Jun 2003
Posts: 2,596

Quote:
Originally Posted by sebbbi View Post
Yes, but it's still a pure memory benchmark with a perfectly optimized memory access pattern. I wouldn't expect to get even 50% of that in a real game. And if you count all the double memory accesses a real game needs during its frame (for example shadow map rendering), you can halve that figure again.
The problem is that the bandwidth/latency curve is actually worse for something like GDDR5 at 5 Gb/s than DDR4 at 3 Gb/s. I.e., GDDR5 has roughly the same latency as DDR4 but needs to fill more bandwidth, so in general it is going to deliver lower efficiency. In both cases you handle this somewhat by going to small sub-channel widths. This increases your effective latency somewhat, due to the latency caused by the transfer overhead, but gives higher sustained bandwidth due to high occupancy per transaction per sub-channel (assuming you have a reasonable interleave function, of course; most of these in modern systems are somewhat complex hash functions that tend to work fairly well, especially when given 8+ sub-channels to interleave between).

Intermediate data is better served by a high speed buffer memory rather than the main pool.

But given enough thread-level parallelism, which a graphics workload will have in spades, you should be able to sustain on the order of 80-90% effective bandwidth with narrow-channel DDR4, given an appropriately designed memory controller.

The reality of the situation is that you effectively end up paying a ~4x capacity cost for around 50% higher bandwidth between GDDR and DDR. This will likely get worse over time as the market of devices that need/can afford GDDR shrinks, further pushing up the cost of GDDR relative to DDR.

The trend is pretty clear: conserving large-pool bandwidth is going to be more of a factor going forward, with some relief coming from relatively small, high-bandwidth intermediate buffer memories (eDRAM and stacked/wide-IO DRAM).


Quote:
Of course, if you have a completely linear game and an extra 2 GB to burn, you can load your next level while you play the current one. But I don't see this as a good use of resources. Instead of that extra 2 GB, I would choose higher-clocked memory to improve my frame rate, increase my view distance and add more dynamic physics-driven stuff to my levels (all of these require bandwidth).
The problem is that higher-speed memory is going to be increasingly boutique. So the reality is: would you rather have 4x the memory, or <50% higher bandwidth...

Another interesting conundrum: say you have 2-3 GB per frame of bandwidth but only 2 GB of memory. In that case, you simply cannot use the bandwidth, as you will never be able to stream assets in at a high enough rate.
Old 01-Jul-2012, 13:53   #11
upnorthsox
Senior Member
 
Join Date: May 2008
Posts: 1,429

Quote:
Originally Posted by aaronspink View Post
If in a STREAM-like benchmark AMD is only getting ~50% of memory bandwidth, then they either have a bottleneck somewhere between the CPU and the memory controller or a sub-optimal memory controller.

In the case in question, we are talking about a max memory bandwidth in the range of 75-100 GB/s and a realistic bandwidth on the order of 70-80% of peak, for a range of 56-75 GB/s. This gives a per-frame bandwidth of roughly 1 GB, or a bit higher, at 60 fps.

For the ranges you gave, this works out to an optimal memory size of around 3-10 GB.

But we haven't even factored in the issue of HDD bandwidth and access times vs optical bandwidth and access times. BR has an order of magnitude higher access latency than an HDD and, at its peak, roughly half the bandwidth of a modern HDD. This further pushes up the pre-buffering/streaming requirement for an actual game engine.

Then, if the design has a high-speed temporary buffer of reasonable size (32 MB+), this also reduces the amount of non-static texture data that must be stored and read, further increasing the relative share of texture bandwidth and therefore the streaming texture cache space required.

So while the original post was informative, I don't believe it captures the real scope or details of this as it relates to what might be seen in next-gen hardware. It is also an example of a single game design that may or may not have levels and load times between levels. Part of my comment on 2 GB as a load cache specifically relates to games that do have levels and load times between levels. Using 2 GB to stream in the next level's data while on the current level would without a doubt improve the user experience.

Can you please provide a link to this 100 GB/s DDR3 memory? Thanks.
Old 01-Jul-2012, 13:57   #12
upnorthsox
Senior Member
 
Join Date: May 2008
Posts: 1,429

Quote:
Originally Posted by Rangers View Post
So all the people crying for 4 GB in the PS4 are nuts?

You also build your argument around 60 fps, which is nice and all, but the vast, vast majority of games are 30. We even saw devs like Insomniac publicly announce a switch from 60 to 30 this gen (for Ratchet).

So, using aaronspink's analysis of 1 GB per frame at 60 fps, a typical 30 fps game could access 2 GB per frame. I guess this does put a hard limit on some things, but what if the content changes rapidly frame to frame (as it would seem to in any video game...)? Wouldn't the 8 GB system be at an advantage over the 2 GB one, which now has to go to an HDD or something magnitudes slower to get new data? In essence the 8 GB system could "buffer" 3 additional frames, vs 0 for the 2 GB one. Or am I totally not getting it?
No, he's saying that people who think 8 GB at 1/4 the bandwidth is somehow better are crazy.
Old 01-Jul-2012, 17:02   #13
Shifty Geezer
uber-Troll!
 
Join Date: Dec 2004
Location: Under my bridge
Posts: 30,837

The majority of the discussion about the choice of RAM for next-gen machines wasn't addressing the debate of BW versus capacity, so it has been moved to its own discussion here.
__________________
Shifty Geezer
...
Flashing Samsung mobile firmwares. Know anything about this? Then please advise me at -
http://forum.beyond3d.com/showthread.php?p=1862910
Old 01-Jul-2012, 17:09   #14
Shifty Geezer
uber-Troll!
 
Join Date: Dec 2004
Location: Under my bridge
Posts: 30,837

Quote:
Originally Posted by sebbbi View Post
My conclusion: the usable memory amount is very much tied to the available memory bandwidth. More bandwidth allows games to access more memory. So it's counterproductive to swap smaller, faster memory for a larger, slower pool: more available memory means I want to access more memory, but in reality the slower bandwidth allows me to access less. So the percentage of accessible memory drops radically.
Looking at the counter-argument aaronspink has made, do you think you could draw up the minimum amount of RAM needed for a given bandwidth to be effective? At some point there just won't be enough RAM to fit the content. At the other end, which you argue probably isn't as high as people imagine, the RAM will be an excess and a waste of money. Where does the sweet spot lie?

Quote:
Originally Posted by aaronspink View Post
Another interesting conundrum: say you have 2-3 GB per frame of bandwidth but only 2 GB of memory. In that case, you simply cannot use the bandwidth, as you will never be able to stream in assets at a high enough rate.
You can always burn up more RAM or BW with your game. If you have an excess of BW, just use loads of transparency and particles, like the PS2. An imbalance too far in either direction will limit the options devs have and force them into specific approaches, or add excessive cost to the machines. So as I say to sebbbi above, where does the sweet spot lie for a given bandwidth?

Most Economical Capacity = BW (GB/s) / 25 or something similar?
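Playing with that guessed formula (the divisor of 25 is pure speculation, not anything established):

```python
# Hypothetical rule of thumb from the post above:
# economical capacity (GB) = bandwidth (GB/s) / 25.
def economical_capacity_gb(bandwidth_gb_s, divisor=25):
    return bandwidth_gb_s / divisor

for bw in (25.6, 50.0, 100.0, 200.0):
    print(f"{bw:6.1f} GB/s -> {economical_capacity_gb(bw):.2f} GB")
```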
__________________
Shifty Geezer
...
Flashing Samsung mobile firmwares. Know anything about this? Then please advise me at -
http://forum.beyond3d.com/showthread.php?p=1862910
Shifty Geezer is offline   Reply With Quote
Old 01-Jul-2012, 17:27   #15
Kb-Smoker
Member
 
Join Date: Aug 2005
Posts: 614
Default

Seems like an SSD would solve some of the problems with having 2GB, but given the cost it's unlikely.

Since DICE/Epic have stated that 2GB would not be enough, wouldn't that be really telling for this debate, where we're comparing 2GB vs 8GB? For them to comment publicly means either Sony would not listen to direct feedback, or what? I remember Epic doing the same thing last gen with the X360, but I don't remember them doing it in public.

I feel the debate would change almost completely if we were comparing 4GB GDDR5 vs 8GB DDR3/4.
Kb-Smoker is offline   Reply With Quote
Old 01-Jul-2012, 17:42   #16
upnorthsox
Senior Member
 
Join Date: May 2008
Posts: 1,429
Default

Quote:
Originally Posted by Shifty Geezer View Post
Looking at the counter argument aaronspink has made, do you think you could draw up a minimum amount of RAM needed for a given bandwidth to be effective? At some point there just won't be enough RAM to fit the content. At the other end, which you argue probably isn't as high as people imagine, the RAM will be an excess and a waste of money. Where does the sweet spot lie?
You can always burn up more RAM or BW with your game. If you have an excess of BW, just use loads of transparency and particles, like the PS2. An imbalance too far in either direction will limit the options devs have and force them into specific approaches, or add excessive cost to the machines. So as I say to sebbbi above, where does the sweet spot lie for a given bandwidth?

Most Economical Capacity = BW (GB/s) / 25 or something similar?
This is the first question to answer. IMO, 2GB at any bandwidth may be enough to start the next gen, but like 512MB this gen, it'll soon become a limiter. 4GB is the likely sweet spot. I don't see a general use case for 8GB vs 4GB, especially if it comes at 1/2 bandwidth cost.
upnorthsox is offline   Reply With Quote
Old 01-Jul-2012, 17:47   #17
Squilliam
Beyond3d isn't defined yet
 
Join Date: Jan 2008
Location: New Zealand
Posts: 3,172
Default

One thing perhaps to keep in mind is that the OS may use a lot of memory as the needs of the device expand between generations. It may have a very high residency-to-use ratio, i.e. it may use a lot of memory, but it likely won't use a lot of bandwidth.
__________________
It all makes sense now: Gay marriage legalized on the same day as marijuana makes perfect biblical sense.
Leviticus 20:13 "A man who lays with another man should be stoned". Our interpretation has been wrong all these years!
Squilliam is offline   Reply With Quote
Old 01-Jul-2012, 18:15   #18
upnorthsox
Senior Member
 
Join Date: May 2008
Posts: 1,429
Default

Quote:
Originally Posted by Squilliam View Post
One thing perhaps to keep in mind is that the OS may use a lot of memory as the needs of the device expand between generations. It may have a very high residency-to-use ratio, i.e. it may use a lot of memory, but it likely won't use a lot of bandwidth.

You bring up a good point that the memory amount covers both CPU and GPU.

That said, I currently have Win7 Pro, 8 browser windows, 2 Adobe PDFs, 1 VPN connection, 2 RDP sessions, and a full virus scan running (which hammers the piss out of my CPU), and I'm still only using 1.3GB of memory. At 2GB that wouldn't leave much for a game running too, but at 4GB I'd still have 2.7GB available (but no CPU).
upnorthsox is offline   Reply With Quote
Old 01-Jul-2012, 18:32   #19
ERP
Moderator
 
Join Date: Feb 2002
Location: Redmond, WA
Posts: 3,669
Default

While I agree in principle with the OP, I wanted to point some things out.
The available memory bandwidth per frame is only interesting as it applies to total memory if you can predict with some degree of certainty which 200MB or so you're going to touch, and you can actually get it off disk before you need it.

So while above some threshold more memory doesn't help you with higher-res textures, that doesn't make it useless.

The big issue with a lot of memory, as has been pointed out, is filling it. I think it's interesting for parametric data that expands to many times its read size and can't be computed on use. What most would refer to as procedural content: at some level you can consider parametric content to be data compression with extreme compression ratios.
Really the memory is still just a cache, but it's a cache for computation rather than disk reads. It's one area I'd be seriously looking at going forwards.

Voxelization of data is interesting: slight changes in orientation can radically change the way structures like this are walked, requiring vastly more memory than the actual amount walked. You also don't want to keep writing the data, wasting your bandwidth, when you're repeatedly reading it. Again, a computation cache.

As an aside, the Sony paper is interesting but hasn't aged well. You can still kill yourself with virtual function calls, but it's nothing like it was circa PS2, or worse PS1, with direct-mapped single-digit-kilobyte caches.
ERP is offline   Reply With Quote
Old 01-Jul-2012, 18:54   #20
Shifty Geezer
uber-Troll!
 
Join Date: Dec 2004
Location: Under my bridge
Posts: 30,837
Default

Quote:
Originally Posted by ERP View Post
The big issue with a lot of memory, as has been pointed out, is filling it. I think it's interesting for parametric data that expands to many times its read size and can't be computed on use. What most would refer to as procedural content: at some level you can consider parametric content to be data compression with extreme compression ratios.
Really the memory is still just a cache, but it's a cache for computation rather than disk reads. It's one area I'd be seriously looking at going forwards.
This is one of the most obvious advantages: the old trade-off between computation power and storage. Rather than recomputing varied assets on the fly, they could be saved after a single creation.
__________________
Shifty Geezer
...
Flashing Samsung mobile firmwares. Know anything about this? Then please advise me at -
http://forum.beyond3d.com/showthread.php?p=1862910
Shifty Geezer is offline   Reply With Quote
Old 01-Jul-2012, 19:39   #21
tunafish
Member
 
Join Date: Aug 2011
Posts: 408
Default

Quote:
Originally Posted by Rodéric View Post
On a second note, I'd like to emphasize the need for ECC as memory amount and bandwidth increase.
No, it doesn't. Everyone keeps ignoring that in the Google report, they found bit errors to scale linearly with the physical volume of the RAM, not with the number of bits. This agrees with the theory that nearly all bit errors are caused by radiation, either from the materials the chips are made from or from outside sources. As you scale the RAM to smaller processes, the bit error rates go down. Eight chips of RAM always have roughly the same number of errors, regardless of whether they are 2GB of DDR4 or 512MB of GDDR3.
tunafish is offline   Reply With Quote
Old 01-Jul-2012, 22:40   #22
Rodéric
a.k.a. Ingenu
 
Join Date: Feb 2002
Location: Carnon Plage, France.
Posts: 2,943
Default

I meant that it was related to the amount of data going through the chips, and to having more DIMMs/chips.
(We usually have more DIMMs when we get more memory, but you are correct.)
I'm not sure whether the google report talks about what I meant though...

Also we may be going off-topic.
__________________
So many things to do, and yet so little time to spend...
Rodéric is offline   Reply With Quote
Old 02-Jul-2012, 00:11   #23
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Quote:
Originally Posted by aaronspink View Post
In the case in question, we are talking about a max memory bandwidth in the range of 75-100 GB/s and a realistic bandwidth on the order of 70-80% of peak, for a range of 56-75 GB/s. Which gives a per-frame bandwidth of roughly 1GB, or a bit higher, at 60 FPS.
I thought we were talking about known current systems and current unified memory architectures (Trinity, Sandy/Ivy Bridge, maybe even Xbox 360), not some rumored future ones. We only have actual facts and benchmark data for existing systems (everything else is pure speculation, and as I said earlier, I'm not interested in participating in next-gen speculation).

Please don't tell me you think a 70-100 GB/s unified memory architecture is considered "slow" by today's standards. Not even Intel's highest-end 12-thread Sandy Bridge E or the fully enabled 16-thread Xeon server CPU versions are equipped with a memory system that fast. Quad-channel DDR3-1600 is the fastest officially supported, and it provides 51 GB/s of theoretical bandwidth (37 GB/s in benchmarks, not far from AMD's utilization percentages: http://www.anandtech.com/show/5091/i...gh-end-alive/4). These chips cost $1000+, and the motherboards supporting quad-channel memory aren't cheap either.

Let's look at the highest-end desktop APUs available with unified memory. Dual-channel DDR3-1600 is the maximum officially supported memory for Intel's flagship desktop APU (Ivy Bridge). Dual-channel DDR3-1866 is the maximum officially supported memory for AMD's flagship desktop APU (Trinity). Memory bandwidths are 25.6 GB/s and 29.9 GB/s respectively. These figures match perfectly with my calculations for the "slow" memory system (common DDR3 memory at the highest commonly available clocks).

Of course you can find memory kits designed for CPU overclockers. I actually bought this kind of premium memory stick for my old Q6600-based desktop. The problem with these enthusiast kits is that they are produced in very low quantities (cherry-picked parts), and thus the price is very high. For example, the cheapest DDR3-2400 kit (2 x 4 GB) I found on newegg.com was the G.SKILL Ripjaws Z series at $96.99. In comparison you will find standard DDR3-1600 kits (2 x 4 GB) for $40.99. As DDR3-1600 is the highest officially supported on Intel platforms, it is commonly used in brand new high-end gaming desktops, and thus is the most relevant high-volume product that we can still somehow qualify as "slow and cheap".
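The theoretical numbers above are easy to verify: peak DDR bandwidth is just the transfer rate times an 8-byte (64-bit) channel width times the channel count, in decimal GB/s as usual:

```python
# Peak theoretical DDR bandwidth: 64-bit (8-byte) bus per channel,
# one transfer per MT/s. Decimal GB/s, as memory marketing uses.
def ddr_bandwidth_gb_s(mt_per_s, channels):
    return mt_per_s * 8 * channels / 1000

print(ddr_bandwidth_gb_s(1600, 2))  # dual-channel DDR3-1600 -> 25.6
print(ddr_bandwidth_gb_s(1866, 2))  # dual-channel DDR3-1866 -> ~29.9
print(ddr_bandwidth_gb_s(1600, 4))  # quad-channel DDR3-1600 -> 51.2
```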
Quote:
Originally Posted by aaronspink View Post
Then if the design has a high speed temporary buffer of reasonable size (32MB+), this also reduces the amount of non-static texture data that must be stored and read further increasing the relative size of the texture bandwidth and therefore the streaming texture cache space required.
Relatively large manual high-speed "caches" such as the Xbox 360 EDRAM are very good at reducing redundant bandwidth usage (especially for GPU rendering). EDRAM removes all the memory bandwidth waste you get from blending, overdraw, MSAA and z-buffering; basically you get all of these for free. The bandwidth-free overdraw of course helps with shadow maps as well, but since the Xbox 360 cannot sample from EDRAM, you eventually have to copy the shadow map to main memory (consuming memory bandwidth) and sample it from there (consuming memory bandwidth just like any static texture). The same is true for g-buffer rendering and sampling (it must eventually be copied to main memory and sampled from there, consuming memory bandwidth).

However, no matter how excellent EDRAM is, it cannot increase the maximum total accessible unique memory per frame. It can "only" (drastically) reduce the waste from double (or even higher) access counts to the same memory regions, and thus get us nearer to the theoretical maximum (= 200 MB of unique memory per frame, assuming we still use the current highest-end desktop APU unified memory systems as our "system of choice"). I have already stated in many threads how much I like the EDRAM in the Xbox 360, so I won't do that again.
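To put a rough number on the kind of framebuffer traffic EDRAM keeps on-chip (all figures invented for illustration, and ignoring z and MSAA traffic, which only make it worse):

```python
# Rough main-memory bandwidth cost of color-buffer traffic without EDRAM:
# every overdrawn pixel reads and writes the color buffer once.
def framebuffer_traffic_gb_s(width, height, bytes_per_pixel, overdraw, fps):
    pixels = width * height
    # read + write per covered pixel, times average overdraw factor
    bytes_per_frame = pixels * bytes_per_pixel * 2 * overdraw
    return bytes_per_frame * fps / 1e9

# 720p, 4-byte color, 4x average overdraw, 60 FPS (illustrative numbers)
print(framebuffer_traffic_gb_s(1280, 720, 4, 4, 60))
```

With EDRAM all of this stays on the daughter die instead of eating into the main-memory budget.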
Quote:
Originally Posted by ERP View Post
The available memory bandwidth per frame is only interesting as it applies to total memory if you can predict with some degree of certainty which 200MB or so you're going to touch, and you can actually get it off disk before you will need it.
Of course. Without exact knowledge of your access patterns, excellent prediction algorithms, and good fallback plans (stalling doesn't count), you need to keep considerable extra overhead data in memory (just in case).
Quote:
Originally Posted by ERP View Post
So while above some threshold more memory doesn't help you with higher res textures that doesn't make it useless.
Extra memory is of course always a good thing to have. It allows you to keep some (hard-to-predict) data components permanently in memory, and it saves development time as well. That's not an insignificant gain. More is always better, unless it means we have to compromise somewhere else. Aaronspink stated he would prefer to have 2 GB of extra memory instead of a 3-4x faster GPU, and that's something I cannot agree with (especially if that GPU is 3-4x slower because of bandwidth limitations that in turn limit the usability of the extra 2 GB I would get in the trade).
Quote:
Originally Posted by ERP View Post
What most would refer to procedural content, at some level you can consider parametric content to be data compression with extreme compression ratios.
Really the memory is still just a cache, but it's a cache for computation rather than disk reads. It's one area I'd be seriously looking at going forwards.
Parametric content (artist controlled) will be very important in the future. However, I also see it as a way to reduce memory accesses. Why would you store the calculations to memory if you can recalculate them into the L1 cache every time and waste no bandwidth at all? ALU is basically free (compared to memory accesses), and it will become even more free in the future (while memory accesses will remain expensive).

However, if the parametric generation consumes more bandwidth than accessing the generated data, then I am a huge supporter of caching it. For example, in our virtual texturing system the terrain texture is generated (blended with a complex formula) from a huge number of artist-placed decals. In the worst-case areas there are almost 10 layers of decals on top of each other, but we burn that data once to the virtual texture cache, and during terrain rendering a single texture lookup is enough (the generated data gets repeatedly reused 60 times per second, just like data loaded from the HDD).
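A toy sketch of that burn-once, sample-many idea (the class, names, and numbers here are invented for illustration, nothing to do with our actual code):

```python
# Toy compute cache: blend N decal layers once per tile, then reuse
# the baked result on every subsequent lookup instead of re-blending.
class TileCache:
    def __init__(self):
        self._baked = {}

    def get(self, tile_id, decal_layers):
        if tile_id not in self._baked:
            # Expensive path: runs once per tile, not once per frame.
            value = 0.0
            for weight, color in decal_layers:
                # Simple alpha blend of one scalar "color" channel.
                value = value * (1.0 - weight) + color * weight
            self._baked[tile_id] = value
        return self._baked[tile_id]

cache = TileCache()
layers = [(0.5, 1.0), (0.25, 0.0)]
# 60 lookups per second hit the baked copy; the blend ran only once.
first = cache.get("terrain_3_7", layers)
again = cache.get("terrain_3_7", layers)
assert first == again
```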
Quote:
Originally Posted by ERP View Post
As an aside The Sony paper is interesting but doesn't age well, you can still kill yourself with virtual function calls
That's not the main point of the paper. Yes, it's nice that you can avoid some branches and virtual calls, but the main point (and the main performance gain) is the improved memory access pattern. The component model is a good approach, and many developers are using it in their newest engines.

Last edited by sebbbi; 02-Jul-2012 at 00:23.
sebbbi is offline   Reply With Quote
Old 02-Jul-2012, 01:22   #24
ERP
Moderator
 
Join Date: Feb 2002
Location: Redmond, WA
Posts: 3,669
Default

Quote:
Originally Posted by sebbbi View Post
That's not the main point of the paper. Yes, it's nice that you can avoid some branches and virtual calls, but the main point (and the main performance gain) is the improved memory access pattern. The component model is a good approach, and many developers are using it in their newest engines.
That's certainly a valid point, but IME, outside of a few small blocks of code that transform streams of data, modern CPUs rarely suffer L2 cache misses, and almost never miss in the ICache.
Most of the win in the streaming case is not polluting the cache with data you will never read.

As an aside, one of the things that irritates me about new college grads is their lack of understanding of basic memory architecture and behavior. None of this stuff is rocket science.
ERP is offline   Reply With Quote
Old 02-Jul-2012, 01:35   #25
Brimstone
B3D Shockwave Rider
 
Join Date: Feb 2002
Posts: 1,835
Default

Wasn't the Jon Olick demo of the id Tech 6 stuff over 1 gig with just a single model on screen?
__________________
When God plays an online shooter he plays Shadowrun. He buys resurrection first round and selects Dwarf.

www.shadowrunshow.com
Brimstone is offline   Reply With Quote
