Will GPUs with 4GB VRAM age poorly?

Just my 2 cents real quick: I think 4GB is going to be fine for a while still if you're gaming at 1080p. I think when you go to a larger res than that, you should really be looking at more for any kind of "future proofing". (I don't believe in the term, but I get how it's used.)
Screen resolution hasn't been the main factor in VRAM usage since... the advent of shaders?
 
I hate to break it to you guys, but from now on GPUs with 4GB will age very poorly, because I just ordered a new laptop with 16GB VRAM. :runaway:

Seriously though, the last time I ordered a new laptop, end of 2012, I got a laptop with 4GB VRAM (K5000M), which was 4 times as much as its predecessor had and the max you could get in a laptop. At that time people thought I was nuts.

Now, we are already moving beyond 4GB in some games. 6-8GB VRAM usage will be normal in about 2 years for "high" or "ultra" configurations. The PS5 or Xbox Next will at least double RAM, which will give VRAM usage another boost. 4K will become the new Full HD.

I think in 4-5 years 16GB will be standard for middle-class GPUs. Everybody will wonder how 4GB could ever have been enough, just like today with 1GB.
 
I think in 4-5 years 16GB will be standard for middle-class GPUs. Everybody will wonder how 4GB could ever have been enough, just like today with 1GB.
Let's say you are playing at 144 fps (high frame rate monitor + high end GPU). You want to access all 16 GB of data every frame (otherwise part of the data will just sit there doing nothing = WASTE). 144 frames/s * 16 GB/frame = 2304 GB/s. That's a lot of bandwidth. Usually over 50% of the bandwidth is used by repeatedly accessing the render targets and other temporary data. BF1 presentation describes that their 4K mem layout (RTs and other temps) is around 500 MB. So if we assume 50% of bandwidth is used to access this small 0.5 GB region, the remaining 15.5GB has only half of the bandwidth left. So in order to access it all on every frame, you need 4.5 TB/s memory bandwidth.
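Rough numbers behind that, as a quick sanity check (the 144 fps, 16 GB and the 50% render target share are just the assumptions above):

```python
# Back-of-the-envelope: bandwidth needed to touch all of VRAM every frame.
# Assumptions from the post above: 144 fps, 16 GB card, ~0.5 GB of render
# targets/temporaries consuming ~50% of total bandwidth.

fps = 144
vram_gb = 16.0
asset_gb = vram_gb - 0.5            # everything that isn't RTs/temporaries

naive_bw = fps * vram_gb            # 2304 GB/s: touch all 16 GB once per frame
asset_bw = fps * asset_gb           # 2232 GB/s just for the 15.5 GB of assets
total_bw = 2 * asset_bw             # RTs/temps eat the other half of the bandwidth
print(naive_bw, total_bw / 1000)    # 2304.0 GB/s, ~4.46 TB/s
```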

This explains why I am not a big believer in huge VRAM sizes until we get higher bandwidth memory systems. I am eagerly waiting for Vega's memory paging system. AMD's Vega slides point out that around half of the allocated GPU memory is not actively used (wasted) in current titles. The ratio of wasted allocated memory will only increase when 8 GB and 12 GB cards become common. I would be surprised if the active data set in future games using 12 GB of VRAM is more than 4 GB (accessing that 4 GB at 144 fps would need over 1 TB/s of bandwidth with the same assumptions as above). Nvidia has similar paging technology in their Pascal cards. P100 with NVLink shows very nice paging results with massive CUDA tasks that access memory amounts far beyond the GPU's capacity. Hopefully this is the future. Waste of memory needs to end soon :)
 
How well would PCI-E work for random small accesses? It might be that in the PC world swapping from the GPU is not such a great idea unless one does large block transfers that are not so time critical?
 
How well would PCI-E work for random small accesses? It might be that in the PC world swapping from the GPU is not such a great idea unless one does large block transfers that are not so time critical?
You wouldn't perform small accesses. Rather you'd migrate pages of memory, which will be a minimum of 4 kB and generally larger, like 64 kB. PCI-E can handle these sizes fine.
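As a rough sketch of why larger pages are preferred (the ~12 GB/s effective PCI-E 3.0 x16 rate and the ~5 µs per-transfer overhead below are ballpark assumptions, not measurements):

```python
# Ballpark page-migration cost over PCI-E. Link rate and fixed per-transfer
# overhead are rough assumptions for illustration only.

link_bytes_per_s = 12e9   # assumed effective PCIe 3.0 x16 throughput
overhead_s = 5e-6         # assumed per-transfer cost (DMA setup, completion)

for page in (4 * 1024, 64 * 1024):
    t = overhead_s + page / link_bytes_per_s
    print(f"{page // 1024:>3} kB page: ~{t * 1e6:.1f} us")

# ~5.3 us for 4 kB vs ~10.5 us for 64 kB: the fixed overhead dominates small
# pages, so migrating fewer, larger pages makes better use of the bus.
```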
 
Let's say you are playing at 144 fps (high frame rate monitor + high end GPU). You want to access all 16 GB of data every frame (otherwise part of the data will just sit there doing nothing = WASTE). 144 frames/s * 16 GB/frame = 2304 GB/s. That's a lot of bandwidth. Usually over 50% of the bandwidth is used by repeatedly accessing the render targets and other temporary data. BF1 presentation describes that their 4K mem layout (RTs and other temps) is around 500 MB. So if we assume 50% of bandwidth is used to access this small 0.5 GB region, the remaining 15.5GB has only half of the bandwidth left. So in order to access it all on every frame, you need 4.5 TB/s memory bandwidth.

This explains why I am not a big believer of huge VRAM sizes, until we get higher bandwidth memory systems. I am eagerly waiting for Vega's memory paging system. AMDs Vega slides point out that around half of the allocated GPU memory is not actively used (wasted) in current titles. The ratio of wasted allocated memory will only increase when 8 GB and 12 GB cards become common. I am surprised if the active data set in future games using 12 GB of VRAM is more than 4 GB (accessing that 4 GB at 144 fps would need over 1 TB bandwidth with same assumptions as above). Nvidia has similar paging technology in their Pascal cards. P100 with NVlink shows very nice paging results with massive CUDA tasks that access memory amount far beyond GPUs capacity. Hopefully this is the future. Waste of memory needs to end soon :)
I'd like to challenge your assumption here. Not every byte of VRAM needs to be accessed every frame in order to be useful. Having larger video memory will enable more detailed worlds (no more close-up blurriness) and less compromise. It will enable faster loading times (or no loading times at all once everything is in memory) and instant teleporting in big open worlds. I know you're an engine programmer, you live and breathe efficiency, but historically there has never been such a thing as too much memory (640 kB should be enough *cough* *cough*). Having to worry less about memory pressure will give engine programmers such as yourself more time to do other things. At least until those 16 GB become too small once again.
 
I'd like to challenge your assumption here. Not every byte of VRAM needs to be accessed every frame in order to be useful. Having larger video memory will enable more detailed worlds (no more close-up blurriness) and less compromise. It will enable faster loading times (or no loading times at all once everything is in memory) and instant teleporting in big open worlds.
Loading data to fill 16GB of memory takes 4x longer than loading data to fill 4GB of memory. Good streaming technologies are essential in reducing loading times. The gap between storage and RAM speed is getting wider every day. If you load everything during the loading screen, you will need to load for a considerably longer time.
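Rough numbers (the drive speeds below are just illustrative assumptions):

```python
# Time to fill VRAM from storage at assumed sequential read speeds
# (illustrative drive figures, not benchmarks).

drives = {"HDD ~150 MB/s": 150e6, "SATA SSD ~500 MB/s": 500e6, "NVMe ~2 GB/s": 2e9}

for name, speed in drives.items():
    for vram_gb in (4, 16):
        seconds = vram_gb * 1024**3 / speed
        print(f"{name}: fill {vram_gb} GB in ~{seconds:.0f} s")

# Filling 16 GB takes ~115 s from an HDD and ~34 s even from a SATA SSD --
# hence streaming during gameplay instead of one huge loading screen.
```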

A good streaming technology will not keep up-close details of anything in memory, except for surfaces close to the player character. As the mip map distance is logarithmic, the area around the player that would access the highest texture mip level is very small. The streaming system will of course load data from a longer radius to ensure that the data is present when needed, but there's no reason to keep all the highest mip level data permanently loaded in memory. If you did this in a AAA game, then even a 16 GB GPU wouldn't be enough in current games (to provide results identical to a 4 GB GPU with a good streaming system).

I agree that instant teleporting is a problem for all data streaming systems. However, the flip side would be to load everything to memory, and that drastically increases level loading times. But contrary to common belief, a very fine grained system (such as virtual texturing) actually handles instant teleporting better than coarse grained streaming systems. This is because virtual texturing only needs to load data to render a single frame. You can load 1080p worth of texel pages in <200 ms. This still feels instant. With a more coarse grained system (load a whole area), you would need to wait a lot longer. Loading everything at startup is obviously impossible for open worlds. A 50 GB Blu-ray disc doesn't fit in memory (and there might be downloadable DLC areas in the game world as well). You need at least some form of streaming. My experience is that fine grained is better than coarse grained. But only a handful of developers have implemented fine grained streaming systems, as the engineering and maintenance is a huge effort.
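A rough estimate of why a single frame's worth of pages is so small (page size, compression rate and overfetch factor below are illustrative assumptions, not figures from any particular engine):

```python
# Upper-bound estimate of texture data needed to shade one 1080p frame with
# virtual texturing. Page size, compression rate and overfetch factor are
# assumptions for illustration only.

screen_texels = 1920 * 1080          # ideally ~1 unique texel per pixel
page_texels = 128 * 128
bytes_per_texel = 1                  # block-compressed color, roughly
overfetch = 4                        # partial pages, multiple layers, mips

pages = screen_texels / page_texels * overfetch        # ~500 pages
megabytes = pages * page_texels * bytes_per_texel / 1e6
print(f"~{pages:.0f} pages, ~{megabytes:.0f} MB")      # ~8 MB

# Even with generous overfetch this is only a few megabytes, which fits
# comfortably inside the <200 ms budget mentioned above.
```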

I have been developing several console games (last gen and current) that allowed players to create levels containing all game assets (even 2 GB of assets in single level on a console with 512 MB of memory). We didn't limit asset usage at all. There was a single big game world that contained all the levels. With a fine grained streaming system (including virtual texturing for all texture data) we managed to hit 3-5 second loading time for levels. This is what is possible with a good fine grained streaming system.
I know you're an engine programmer, you live and breathe efficiency, but historically there has never been such a thing as too much memory (640 kB should be enough *cough* *cough*). Having to worry less about memory pressure will give engine programmers such as yourself more time to do other things. At least until those 16 GB become too small once again.
Wasting memory is easy, but it comes with a cost. HBM1, for example, didn't become widely used because it was capped at 4 GB. All current PC games would be fine with 4 GB if memory was used sparingly. But as developers are wasting memory, products with a larger amount (8 GB) of slower memory are winning the race. The problem here is that the faster memory would give improved visuals, as faster memory = better looking algorithms can be used. But we instead need to settle for a larger amount of slower memory, since memory management is not done in a good way. A larger memory size always means that the memory needs to be further away from the processing unit. This means that it is slower. Larger != no compromise.

Custom memory paging (such as software virtual texturing) and custom fine grained streaming systems are complex and require lots of developer resources and maintenance. This is a bit similar to automated caches vs scratchpad memories (Cell SPUs and GPU groupshared memory vs automated L1/L2/L3 CPU caches). An automated system is a bit less efficient in the worst case (and uses more energy), but requires a much smaller amount of developer work. Hopefully Vega's automated memory paging system delivers similar gains to game memory management. A developer could load a huge amount of assets and textures to system RAM without thinking about GPU memory at all, but only the currently active set of memory pages would be resident on the GPU (fully automated). In the best case this is like fully automated tiled resources for everything. No developer intervention needed. CUDA (Pascal P100) also offers a paging hint API for the developer. This way you can tell the system in advance if you know that some data is needed. This is a bit similar to CPU cache prefetch hints. Way better than a fully manual system, but you also have just the right amount of control when you need it. This is the future.
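To make the shape of such a system concrete, here is a tiny toy sketch of demand paging with a prefetch hint. It is purely illustrative (an LRU over 64 kB pages), not how Vega's HBCC or CUDA's paging actually work internally:

```python
# Toy GPU page-residency manager: all assets live in "system RAM" (a dict),
# and only recently touched pages stay "resident" within a small VRAM budget.
# Purely illustrative; real hardware/drivers do this transparently per page.

from collections import OrderedDict

PAGE_SIZE = 64 * 1024

class ResidencyManager:
    def __init__(self, vram_pages):
        self.vram_pages = vram_pages
        self.resident = OrderedDict()     # page_id -> data, kept in LRU order
        self.system_ram = {}              # backing store for every page

    def upload(self, page_id, data):
        self.system_ram[page_id] = data   # cheap: system RAM is plentiful

    def access(self, page_id):
        if page_id in self.resident:      # hit: just refresh the LRU position
            self.resident.move_to_end(page_id)
            return self.resident[page_id]
        # miss: migrate the page over PCI-E, evicting the least recently used
        if len(self.resident) >= self.vram_pages:
            self.resident.popitem(last=False)
        self.resident[page_id] = self.system_ram[page_id]
        return self.resident[page_id]

    def prefetch(self, page_ids):
        # hint: pull pages in before the shader actually needs them
        for pid in page_ids:
            self.access(pid)
```

In a driver/hardware implementation none of this appears in game code at all; the CUDA-style hint API corresponds roughly to the prefetch call here.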
 
...
This is because virtual texturing only needs to load data to render a single frame. You can load 1080p worth of texel pages in <200 ms. This still feels instant. With a more coarse grained system (load a whole area), you would need to wait a lot longer. Loading everything at startup is obviously impossible for open worlds. A 50 GB Blu-ray disc doesn't fit in memory (and there might be downloadable DLC areas in the game world as well). You need at least some form of streaming. My experience is that fine grained is better than coarse grained. But only a handful of developers have implemented fine grained streaming systems, as the engineering and maintenance is a huge effort.

I have been developing several console games (last gen and current) that allowed players to create levels containing all game assets (even 2 GB of assets in single level on a console with 512 MB of memory). We didn't limit asset usage at all. There was a single big game world that contained all the levels. With a fine grained streaming system (including virtual texturing for all texture data) we managed to hit 3-5 second loading time for levels. This is what is possible with a good fine grained streaming system.

Wasting memory is easy, but it comes with a cost. HBM1, for example, didn't become widely used because it was capped at 4 GB. All current PC games would be fine with 4 GB if memory was used sparingly. But as developers are wasting memory, products with a larger amount (8 GB) of slower memory are winning the race. The problem here is that the faster memory would give improved visuals, as faster memory = better looking algorithms can be used. But we instead need to settle for a larger amount of slower memory, since memory management is not done in a good way. A larger memory size always means that the memory needs to be further away from the processing unit. This means that it is slower. Larger != no compromise.

Custom memory paging (such as software virtual texturing) and custom fine grained streaming systems are complex and require lots of developer resources and maintenance. This is a bit similar to automated caches vs scratchpad memories (Cell SPUs and GPU groupshared memory vs automated L1/L2/L3 CPU caches). An automated system is a bit less efficient in the worst case (and uses more energy), but requires a much smaller amount of developer work. Hopefully Vega's automated memory paging system delivers similar gains to game memory management. A developer could load a huge amount of assets and textures to system RAM without thinking about GPU memory at all, but only the currently active set of memory pages would be resident on the GPU (fully automated). In the best case this is like fully automated tiled resources for everything. No developer intervention needed. CUDA (Pascal P100) also offers a paging hint API for the developer. This way you can tell the system in advance if you know that some data is needed. This is a bit similar to CPU cache prefetch hints. Way better than a fully manual system, but you also have just the right amount of control when you need it. This is the future.

But aren't there more "off the shelf" engines that support fine-grained/virtual texturing streaming systems these days?
Two that come to mind are ID Tech's latest engine and also UE4 with middleware such as Granite.

Like you, I do think the future is around unified memory/paging for gaming, or some form of it, but my concern is that this is still not coming anytime soon, even with Vega.
One reason is that AMD carefully demoed their HBCC solution with Deus Ex: Mankind Divided limited to only 2GB of VRAM. In reality their solution should have been compared against both 4GB and 8GB, but we know why they did not. So to me this is the future, but further out than Vega.
And by then Intel will have a more general Optane 'Cache' consumer solution, which IMO would be preferable as it does not rely upon GPU drivers but hooks into the OS very well. It would be interesting to see how this shapes up as a potential solution using a smaller 100GB Optane cache SSD (substantially lower latencies than a standard SSD).
Micron will also release a product at some point.

Cheers
 
I'm very curious how the pathological use case would be handled. I wonder if AMD has implemented some sort of mechanism to gather misses into packed bundles which would be serviced by the CPU and then unpacked on the GPU side. If you needed 10 bytes here, 1 kB there, etc., this could optimize PCI-E traffic nicely, but of course it would add to complexity/silicon size/latency.

Streaming in a game engine is much easier than implementing random swapping that doesn't hit cases with user-observable hitches. Especially so if you don't have a bus/IO fabric optimized for two-way, fairly small, random memory access patterns :) Textures are probably the "easy" use case, as they can always be reloaded from disk. Swapping game-generated data structures would need two-way traffic, where data is stored to main RAM as part of the swap.
 
But aren't there more "off the shelf" engines that support fine-grained/virtual texturing streaming systems these days?
Two that come to mind are ID Tech's latest engine and also UE4 with middleware such as Granite.
Yes, there are. But most games are not using virtual texturing or similarly fine grained streaming solutions. Most engines obviously have more coarse grained streaming solutions available; otherwise you couldn't use them for most AAA games at all.
Like you, I do think the future is around unified memory/paging for gaming, or some form of it, but my concern is that this is still not coming anytime soon, even with Vega.
One reason is that AMD carefully demoed their HBCC solution with Deus Ex: Mankind Divided limited to only 2GB of VRAM. In reality their solution should have been compared against both 4GB and 8GB, but we know why they did not. So to me this is the future, but further out than Vega.
And by then Intel will have a more general Optane 'Cache' consumer solution, which IMO would be preferable as it does not rely upon GPU drivers but hooks into the OS very well. It would be interesting to see how this shapes up as a potential solution using a smaller 100GB Optane cache SSD (substantially lower latencies than a standard SSD).
Micron will also release a product at some point.
Intel's Optane is the other way around. The speed difference between system RAM and disk storage is growing all the time; we will eventually need another level of cache between them. SSDs improved the situation, but SSDs have scaled down in price and up in capacity very slowly. We need a fast, large and robust caching solution between the storage and the memory.

However, system RAM size is also scaling up rapidly and prices are scaling down rapidly. It is more economical to keep textures in system RAM and page in only the active set to GPU memory on demand.
I'm very curious how the pathological use case would be handled. I wonder if AMD has implemented some sort of mechanism to gather misses into packed bundles which would be serviced by the CPU and then unpacked on the GPU side. If you needed 10 bytes here, 1 kB there, etc., this could optimize PCI-E traffic nicely, but of course it would add to complexity/silicon size/latency.

Streaming in a game engine is much easier than implementing random swapping that doesn't hit cases with user-observable hitches. Especially so if you don't have a bus/IO fabric optimized for two-way, fairly small, random memory access patterns :) Textures are probably the "easy" use case, as they can always be reloaded from disk. Swapping game-generated data structures would need two-way traffic, where data is stored to main RAM as part of the swap.
I am talking about paging from system RAM to GPU video memory. This is a significantly simpler, lower latency and higher bandwidth operation compared to streaming missing pages from an HDD (like virtual texturing does). We are talking about sub-millisecond latencies instead of 10+ millisecond latencies. We have seen low end GPUs that extend their memory with system RAM and directly sample textures from there. See here: http://www.trustedreviews.com/opinions/ati-hypermemory-vs-nvidia-turbocache-updated.

Animation is by definition smooth, to fool the eye into seeing movement (instead of separate images). This means that the huge majority of the data accessed during two consecutive frames is identical. My experience is that roughly 1%-2% of the active data set changes per frame during normal gameplay (and I am talking about a tight virtual texturing implementation with a 256 MB texture pool). We are talking about ~5 megabytes of PCI-E transfer per frame. Obviously a camera teleport needs to stream more, but dropping a frame or two during a camera teleport isn't actually visible. My experience is that you need at least a 250 ms pause during a camera teleport to "notice" it. But this obviously is game specific. Tracer in Overwatch for example would be unplayable if teleports weren't instantaneous. However, Tracer jumps very short distances and faces the same direction, so streaming should be able to keep up pretty well. Time will tell how this works. If the GPU can also directly sample textures over PCI-Express, it could do so and make the whole page resident in the background after that.
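The steady-state numbers work out roughly like this (the 1-2% change rate and the 256 MB pool are the figures above; the frame rates are just examples):

```python
# Per-frame PCI-E traffic for a virtual texturing pool where only a small
# fraction of the active set changes each frame. Pool size and change rate
# are the figures from the post above; the fps values are illustrative.

pool_mb = 256
change_rate = 0.02                   # ~1-2% of the active set per frame

for fps in (60, 144):
    per_frame_mb = pool_mb * change_rate          # ~5 MB per frame
    per_second_gb = per_frame_mb * fps / 1000
    print(f"{fps} fps: ~{per_frame_mb:.1f} MB/frame, ~{per_second_gb:.2f} GB/s")

# Even at 144 fps this is well under 1 GB/s -- a small fraction of a
# PCI-E 3.0 x16 link, so steady-state paging traffic is not the bottleneck.
```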
 
That brings me to a question. Would a next-gen console with 16GB of RAM be sufficient, then?
This discussion has no relevance to unified memory systems. I am solely talking about paging from system RAM to GPU memory. You still need large system RAM. But system RAM is cheap.
 
Loading data to fill 16GB of memory takes 4x longer than loading data to fill 4GB of memory. Good streaming technologies are essential in reducing loading times. The gap between storage and RAM speed is getting wider every day. If you load everything during the loading screen, you will need to load for a considerably longer time.

When you talk about "streaming" in the context of being an alternative to "loading" (i.e. "loading screens"), would Star Citizen's "Mega Map" initiative be a good example of taking "streaming" about as far as you could (see 13:51 below)?

Since Star Citizen is attempting to have very large "maps" (i.e. solar system size), my understanding is that they are putting together a novel system to load one empty "map" with a very brief loading screen and then stream in things around the player (e.g. the ground, buildings, other space ships, etc) as the player moves around the solar system.


This discussion has no relevance to unified memory systems. I am solely talking about paging from system RAM to GPU memory. You still need large system RAM. But system RAM is cheap.

How much cheaper is system RAM (e.g. DDR4) compared to video RAM (e.g. GDDR5, HBM)?

I don't see explicit mentions of "GDDR5" or "HBM" on a site like DRAMeXchange. Otherwise, it's relatively straightforward to compare the flash in SSDs or the DRAM chips in RAM DIMMs. Teasing out GDDR5 pricing is tougher.

http://www.dramexchange.com/

I've always wondered why GDDR5 isn't used as system memory in more integrated systems (laptops, etc) where system memory is already permanently soldered to the motherboard. It seems especially useful in APU situations where integrated GPUs are often starved of bandwidth.
 
How much cheaper is system RAM (e.g. DDR4) compared to video RAM (e.g. GDDR5, HBM)?
It's not only about the price. It isn't possible to scale fast memory to sizes that large. And I am not talking about GDDR5; fast = HBM2 or MCDRAM or something else.

For example Intel's newest Xeon Phi processor supports up to:
- 384 GB of DDR4 memory (102 GB/s)
- 16 GB of MCDRAM (400+ GB/s)

Xeon Phi's MCDRAM can be configured as a cache for the DDR4 main memory. Similarly, in future desktop GPUs, 8 GB of HBM2 could be configured as a cache for 64 GB of DDR4 main memory. This could work either at cache line granularity or at page granularity.
 
I've always wondered why GDDR5 isn't used as system memory
The high latency of GDDR5 memory prevents that. It can work perfectly well in a GPU environment, as GPUs hide latency well with their parallel nature. However, in a general system environment, it is not ideal. CPUs need as little latency as possible, hence why DDR4 or DDR3 is preferable.
 
The high latency of GDDR5 memory prevents that. It can work perfectly well in a GPU environment, as GPUs hide latency well with their parallel nature. However, in a general system environment, it is not ideal. CPUs need as little latency as possible, hence why DDR4 or DDR3 is preferable.
Is that why consoles are so weak? Would GDDR5 bottleneck a Zen processor?
 
The high latency of GDDR5 memory prevents that. It can work perfectly well in a GPU environment, as GPUs hide latency well with their parallel nature. However, in a general system environment, it is not ideal. CPUs need as little latency as possible, hence why DDR4 or DDR3 is preferable.
I thought this was a long-dismissed theory, with the reality being that GDDR5 has similar timings to DDR3. The high latency observed on GPUs is a result of the operating clock speed (in absolute terms) and the throughput-oriented memory pipeline (in relative terms).
 
One reason is that AMD carefully demoed their HBCC solution with Deus Ex: Mankind Divided limited to only 2GB of VRAM. In reality their solution should have been compared against both 4GB and 8GB, but we know why they did not.
Why didn't they?
 
Wasting memory is easy, but it comes with a cost. HBM1, for example, didn't become widely used because it was capped at 4 GB. All current PC games would be fine with 4 GB if memory was used sparingly. But as developers are wasting memory, products with a larger amount (8 GB) of slower memory are winning the race.

Is there actually an engine in production use that does do well with increasing resolution on a 4 GiB budget? And by that I mean also beyond UHD resolution. Of course, given that your other assets are slim enough, you will get very far with a 4 GiB framebuffer, but even engines pretty good at streaming, like Doom's, do offer some options to really kill 4 GiB cards. Whether that's waste or not... well, you rarely get a linear increase in image quality for your GFLOPS or GBytes.
 
Is there actually an engine in production use that does do well with increasing resolution on a 4 GiB budget? And by that I mean also beyond UHD resolution. Of course, given that your other assets are slim enough, you will get very far with a 4 GiB framebuffer, but even engines pretty good at streaming, like Doom's, do offer some options to really kill 4 GiB cards. Whether that's waste or not... well, you rarely get a linear increase in image quality for your GFLOPS or GBytes.
Current engines haven't been designed primarily with 4K in mind. Render targets and other temporary buffers have so far been quite small, but 4K makes them 4x larger compared to 1080p. Developers are improving their memory managers and rendering pipelines to utilize memory better.

Brand new Frostbite GDC presentation is perfect example of this:
http://www.frostbite.com/2017/03/framegraph-extensible-rendering-architecture-in-frostbite/

Pages 57 and 58 describe their PC 4K GPU memory utilization. The old system used 1042 MB at 4K; the new system uses only 472 MB. This is a modern engine with lots of post processing passes. Assets (textures and meshes) are obviously an additional memory cost on top of this, and this is where a good fine grained texture streaming technology helps a lot (whether it is a fully custom solution or an automatic GPU caching/paging solution).
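To see why 4K hits render targets so hard, here's a quick estimate with a made-up deferred G-buffer layout (illustrative formats only, not Frostbite's actual layout):

```python
# Rough render-target footprint at 1080p vs 4K for a hypothetical deferred
# G-buffer. The layer list and byte sizes are illustrative assumptions.

layers_bytes_per_pixel = [
    8,   # HDR color, RGBA16F
    4,   # normals
    4,   # albedo + roughness
    4,   # depth/stencil
    4,   # motion vectors / misc
]
bpp = sum(layers_bytes_per_pixel)     # 24 bytes per pixel

for name, w, h in (("1080p", 1920, 1080), ("4K", 3840, 2160)):
    mb = w * h * bpp / 1e6
    print(f"{name}: ~{mb:.0f} MB")    # ~50 MB vs ~199 MB, i.e. exactly 4x

# Add the intermediate buffers for all the post processing passes and the
# totals quickly reach the hundreds of MB the framegraph work is reclaiming.
```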
 