Will GPUs with 4GB VRAM age poorly?

Discussion in 'Architecture and Products' started by DavidGraham, Jul 10, 2016.

  1. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    579
    Likes Received:
    283
     Screen resolution hasn't been the main factor in VRAM usage since... the advent of shaders?
     
  2. N00b

    Regular

    Joined:
    Mar 11, 2005
    Messages:
    683
    Likes Received:
    100
    I hate to break it to you guys, but from now on GPUs with 4GB will age very poorly, because I just ordered a new laptop with 16GB VRAM. :runaway:

    Seriously though, the last time I ordered a new laptop, end of 2012, I got a laptop with 4GB VRAM (K5000M), which was 4 times as much as its predecessor had and the max you could get in a laptop. At that time people thought I was nuts.

     Now, we are already moving beyond 4GB in some games. 6-8GB VRAM usage will be normal in about 2 years for "high" or "ultra" configurations. PS5 or Xbox Next will at least double RAM, which will give VRAM usage another boost. 4K will become the new Full HD.

     I think in 4-5 years 16GB will be standard for mid-range GPUs. Everybody will wonder how 4GB could ever have been enough, just like today with 1GB.
     
  3. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Let's say you are playing at 144 fps (high frame rate monitor + high end GPU). You want to access all 16 GB of data every frame (otherwise part of the data will just sit there doing nothing = WASTE). 144 frames/s * 16 GB/frame = 2304 GB/s. That's a lot of bandwidth. Usually over 50% of the bandwidth is used by repeatedly accessing the render targets and other temporary data. BF1 presentation describes that their 4K mem layout (RTs and other temps) is around 500 MB. So if we assume 50% of bandwidth is used to access this small 0.5 GB region, the remaining 15.5GB has only half of the bandwidth left. So in order to access it all on every frame, you need 4.5 TB/s memory bandwidth.
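     A quick back-of-the-envelope sketch of that math in Python; the 144 fps, 16 GB touched per frame, ~0.5 GB render-target region and 50% bandwidth share are the assumptions stated above, everything else is just arithmetic:

```python
# Back-of-the-envelope check of the numbers above (assumptions, not measurements):
# 144 fps, a 16 GB data set touched once per frame, and ~50% of total bandwidth
# spent on a ~0.5 GB render-target/temp region.

fps = 144
dataset_gb = 16.0          # full VRAM touched once per frame
rt_bandwidth_share = 0.5   # fraction of bandwidth eaten by render targets / temps
rt_region_gb = 0.5         # BF1-style 4K render target + temp footprint

naive_bw = fps * dataset_gb                      # 2304 GB/s if RTs were free
asset_gb = dataset_gb - rt_region_gb             # 15.5 GB of "other" data
asset_bw = fps * asset_gb                        # bandwidth needed just for assets
total_bw = asset_bw / (1.0 - rt_bandwidth_share) # assets only get half of the bus

print(f"naive requirement : {naive_bw:.0f} GB/s")
print(f"asset traffic     : {asset_bw:.0f} GB/s")
print(f"total requirement : {total_bw:.0f} GB/s (~{total_bw / 1000:.1f} TB/s)")

# The same assumptions with a 4 GB active set give roughly
# 144 * (4 - 0.5) / 0.5 = ~1 TB/s, in line with the 4 GB figure in the next paragraph.
```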

     This explains why I am not a big believer in huge VRAM sizes until we get higher-bandwidth memory systems. I am eagerly waiting for Vega's memory paging system. AMD's Vega slides point out that around half of the allocated GPU memory is not actively used (wasted) in current titles. The ratio of wasted allocated memory will only increase when 8 GB and 12 GB cards become common. I would be surprised if the active data set in future games using 12 GB of VRAM is more than 4 GB (accessing that 4 GB at 144 fps would need over 1 TB/s of bandwidth with the same assumptions as above). Nvidia has similar paging technology in their Pascal cards. P100 with NVLink shows very nice paging results with massive CUDA tasks that access far more memory than the GPU's capacity. Hopefully this is the future. Waste of memory needs to end soon :)
     
    homerdog, nnunn, Pixel and 9 others like this.
  4. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    1,528
    Likes Received:
    377
    Location:
    Earth
     How well would PCI-E work for random small accesses? It might be that in the PC world swapping from the GPU is not such a great idea unless one does large block transfers that are not so time critical?
     
  5. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,432
    Likes Received:
    261
     You wouldn't perform small accesses. Rather you'd migrate pages of memory, which will be a minimum of 4 kB and generally larger, like 64 kB. PCI-E can handle these sizes fine.
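     A rough sketch of what page-sized transfers cost on the bus; the ~16 GB/s figure is an assumed PCI-E 3.0 x16 throughput, not a measurement:

```python
# Rough cost of migrating whole pages over an assumed PCI-E 3.0 x16 link (~16 GB/s).
# Ignores per-transfer setup latency, which in practice dominates for tiny 4 kB moves.

PCIE_GBPS = 16.0  # assumed effective link throughput

def transfer_us(size_bytes, gbps=PCIE_GBPS):
    """Microseconds spent on the wire for a transfer of size_bytes."""
    return size_bytes / (gbps * 1e9) * 1e6

for size_kb in (4, 64, 1024):
    print(f"{size_kb:4d} kB page -> ~{transfer_us(size_kb * 1024):6.2f} us on the wire")
```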
     
    Lightman and BRiT like this.
  6. N00b

    Regular

    Joined:
    Mar 11, 2005
    Messages:
    683
    Likes Received:
    100
     I'd like to challenge your assumption here. Not every byte of VRAM needs to be accessed every frame in order to be useful. Having larger video memory will enable more detailed worlds (no more close-up blurriness) and less compromise. It will enable faster loading times (or no loading times at all once everything is in memory) and instant teleporting in big open worlds. I know you're an engine programmer, you live and breathe efficiency, but historically there has never been such a thing as too much memory (640kb should be enough *cough* *cough*). Having to worry less about memory pressure will give engine programmers such as yourself more time to do other things. At least until those 16 GB become too small once again.
     
    nnunn likes this.
  7. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
     Loading data to fill 16GB of memory takes 4x longer than loading data to fill 4GB. Good streaming technology is essential in reducing loading times. The gap between storage and RAM speed is getting wider every day. If you load everything during the loading screen, you will need to load for a considerably longer time.

     A good streaming technology will not keep up-close details of anything in memory, except for surfaces close to the player character. As the mip map distance is logarithmic, the area around the player that would access the highest texture mip level is very small. The streaming system will of course load data from a longer radius to ensure that the data is present when needed, but there's no reason to keep all the highest-mip data loaded permanently in memory. If you did this in an AAA game, then even a 16 GB GPU wouldn't be enough in current games (to provide results identical to a 4 GB GPU with a good streaming system).
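     A toy illustration of that logarithmic mip distance point; the 2 m mip-0 radius and 512 m view distance below are made-up numbers, not from the post:

```python
# Toy illustration (invented numbers): if mip 0 is only needed out to some radius r0,
# then mip N is roughly what you need out to r0 * 2**N, so the region that requires
# the full-resolution mip is a tiny slice of the visible world.
import math

r0_m = 2.0              # hypothetical radius (metres) where mip 0 texels are visible
world_radius_m = 512.0  # hypothetical view distance

mips = int(math.ceil(math.log2(world_radius_m / r0_m)))
total_area = math.pi * world_radius_m ** 2
for mip in range(mips + 1):
    inner = r0_m * 2 ** (mip - 1) if mip > 0 else 0.0
    outer = min(r0_m * 2 ** mip, world_radius_m)
    area = math.pi * (outer ** 2 - inner ** 2)
    print(f"mip {mip}: needed from {inner:6.1f} m to {outer:6.1f} m "
          f"(~{100 * area / total_area:8.4f}% of the visible world area)")
```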

     I agree that instant teleporting is a problem for all data streaming systems. However, the alternative is to load everything into memory, and that drastically increases level loading times. But contrary to common belief, a very fine-grained system (such as virtual texturing) actually handles instant teleporting better than coarse-grained streaming systems. This is because virtual texturing only needs to load the data required to render a single frame. You can load 1080p worth of texel pages in <200 ms. This still feels instant. With a more coarse-grained system (load a whole area), you would need to wait a lot longer. Loading everything at startup is obviously impossible for open worlds. A 50 GB Blu-ray disc doesn't fit into memory (and there might be downloadable DLC areas in the game world as well). You need at least some form of streaming. My experience is that fine-grained is better than coarse-grained. But only a handful of developers have implemented fine-grained streaming systems, as the engineering and maintenance is a huge effort.
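     A rough upper-bound sketch of that "1080p worth of texel pages in <200 ms" figure; the page size, bytes per texel, overfetch factor and drive speed are all my own assumptions:

```python
# Rough upper bound (assumed numbers) on how much texture data a single 1080p frame
# can reference with virtual texturing, and how long loading it from disk might take.

screen_px = 1920 * 1080
page_texels = 128 * 128            # commonly used virtual-texture page size
bytes_per_texel = 1                # e.g. block-compressed albedo, ballpark figure
overfetch = 8                      # partially covered pages, multiple layers, borders

pages = screen_px / page_texels * overfetch
data_mb = pages * page_texels * bytes_per_texel / 2**20

hdd_mbps = 100.0                   # assumed sustained HDD read speed
print(f"~{pages:.0f} pages, ~{data_mb:.0f} MB, "
      f"~{1000 * data_mb / hdd_mbps:.0f} ms from a {hdd_mbps:.0f} MB/s drive")
```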

     I have been developing several console games (last gen and current) that allowed players to create levels containing all game assets (even 2 GB of assets in a single level on a console with 512 MB of memory). We didn't limit asset usage at all. There was a single big game world that contained all the levels. With a fine-grained streaming system (including virtual texturing for all texture data) we managed to hit 3-5 second loading times for levels. This is what is possible with a good fine-grained streaming system.

     Wasting memory is easy, but it comes with a cost. HBM1, for example, didn't become widely used because it was capped at 4 GB. All current PC games would be fine with 4 GB if memory was used sparingly. But as developers are wasting memory, products with a larger amount (8 GB) of slower memory are winning the race. The problem here is that the faster memory would give improved visuals, since faster memory means you can use better-looking algorithms. Instead we have to settle for a larger amount of slower memory, since memory management is not done well. A larger memory size always means that the memory needs to be further away from the processing unit, which means it is slower. Larger != no compromise.

     Custom memory paging (such as software virtual texturing) and custom fine-grained streaming systems are complex and require lots of developer resources and maintenance. This is a bit similar to automated caches vs scratchpad memories (Cell SPUs and GPU groupshared memory vs automated L1/L2/L3 CPU caches). An automated system is a bit less efficient in the worst case (and uses more energy), but requires much less developer work. Hopefully Vega's automated memory paging system delivers similar gains for game memory management. The developer could load a huge amount of assets and textures into system RAM without thinking about GPU memory at all, with only the currently active set of memory pages resident on the GPU (fully automated). In the best case this is like fully automated tiled resources for everything, with no developer intervention needed. CUDA (Pascal P100) also offers a paging hint API for the developer. This way you can tell the system in advance if you know that some data will be needed. This is a bit similar to CPU cache prefetch hints. Way better than a fully manual system, but you also have just the right amount of control when you need it. This is the future.
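     As a conceptual sketch only (not Vega's or CUDA's actual mechanism), the resident-page-pool idea with on-demand faults, LRU eviction and an optional prefetch hint might look roughly like this:

```python
# A toy model (not any vendor's implementation) of automated paging: a small pool of
# resident pages, faults pull pages in on demand, a prefetch hint makes pages resident
# ahead of time, and LRU decides what gets evicted.
from collections import OrderedDict

class PagePool:
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.resident = OrderedDict()           # page_id -> True, ordered by recency

    def _make_resident(self, page_id):
        if page_id in self.resident:
            self.resident.move_to_end(page_id)
            return False                        # already resident, no transfer needed
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)   # evict least recently used page
        self.resident[page_id] = True
        return True                             # page had to be copied over the bus

    def access(self, page_id):
        """Shader touches a page; returns True if it caused a 'page fault' copy."""
        return self._make_resident(page_id)

    def prefetch(self, page_ids):
        """Developer hint: make pages resident ahead of time (prefetch-hint style)."""
        for p in page_ids:
            self._make_resident(p)

pool = PagePool(capacity_pages=4)
pool.prefetch([0, 1, 2])
print([pool.access(p) for p in (0, 1, 2, 3, 4, 0)])  # faults only on 3, 4 and the evicted 0
```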
     
    #227 sebbbi, Mar 25, 2017
    Last edited: Mar 25, 2017
    homerdog, nnunn, Gubbi and 10 others like this.
  8. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
     But aren't there more "off the shelf" engines that support fine-grained/virtual-texturing streaming systems these days?
     Two that come to mind are id Tech's latest engine and also UE4 with some middleware such as Granite.

     Like you, I do think the future of gaming is around unified memory/paging or some form of it, but my concern is that this is still not coming anytime soon, even with Vega.
     One reason is that AMD carefully demoed their HBCC solution with Deus Ex: Mankind Divided using only 2GB of VRAM; in reality their solution should have been compared against both 4GB and 8GB, but we know why they did not. So to me this is the future, but further out than Vega.
     And by then Intel would have a more general Optane 'Cache' consumer solution, which IMO would be preferable as it does not rely upon GPU drivers but hooks very well into the OS; it would be interesting to see how this shapes up as a potential solution using a smaller 100GB Optane cache SSD (substantially lower latencies than a standard SSD).
     Micron will also release a product at some point in the future.

    Cheers
     
  9. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    1,528
    Likes Received:
    377
    Location:
    Earth
     I'm very curious how the pathological use case would be handled. I wonder if AMD has implemented some sort of mechanism to gather misses into packed bundles which would be serviced by the CPU and then unpacked on the GPU side. If you needed 10 bytes here, 1kB there, etc., this could optimize PCI-E traffic nicely, but of course it would add to complexity/silicon size/latency.

     Streaming in a game engine is much easier than implementing random swapping that avoids user-observable hitches. Especially so if you don't have a bus/IO fabric optimized for two-way, fairly small, random memory access patterns :) Textures are probably the "easy" use case, as they can always be reloaded from disk. Swapping game-generated data structures would need two-way traffic where data is stored back to main RAM as part of the swap.
     
    nnunn likes this.
  10. msia2k75

    Regular Newcomer

    Joined:
    Jul 26, 2005
    Messages:
    326
    Likes Received:
    29
     That brings up a question: would a next-gen console with 16GB of RAM be sufficient, then?
     
  11. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
     Yes, there are. But most games are not using virtual texturing or similarly fine-grained streaming solutions. Most engines obviously have more coarse-grained streaming solutions available; otherwise you couldn't use them for most AAA games at all.

     Intel's Optane is about the other direction. The speed gap between system RAM and storage is growing all the time; eventually we will need another level of cache between them. SSDs improved the situation, but SSDs have scaled down in price and up in capacity very slowly. We need a fast, large and robust caching solution between storage and memory.

     However, system RAM size is also scaling up rapidly and prices are scaling down rapidly. It is more economical to keep textures in system RAM and page in only the active set to GPU memory on demand.

     I am talking about paging from system RAM to GPU video memory. This is a significantly simpler, lower-latency and higher-bandwidth operation compared to streaming missing pages from an HDD (like virtual texturing does). We are talking about sub-millisecond latencies instead of 10+ millisecond latencies. We have seen low-end GPUs that extend their memory with system RAM and directly sample textures from there. See here: http://www.trustedreviews.com/opinions/ati-hypermemory-vs-nvidia-turbocache-updated.

     Animation is by definition smooth enough to fool the eye into seeing movement (instead of separate images). This means that the huge majority of the data accessed during two consecutive frames is identical. My experience is that roughly 1%-2% of the active data set changes per frame during normal gameplay (and I am talking about a tight virtual texturing implementation with a 256 MB texture pool). We are talking about ~5 megabytes of PCI-E transfer per frame. Obviously a camera teleport needs to stream more, but dropping a frame or two during a camera teleport isn't actually visible. My experience is that you need at least a 250 ms pause during a camera teleport to "notice" it. But this obviously is game specific. Tracer in Overwatch, for example, would be unplayable if teleports weren't instantaneous. However, Tracer jumps very short distances and keeps facing the same direction, so streaming should be able to keep up pretty well. Time will tell how this works. If the GPU can also directly sample textures over PCI-Express, it could do so and make the whole page resident in the background after that.
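     The ~5 MB figure follows directly from those stated assumptions (256 MB pool, ~1-2% churn per frame); the PCI-E throughput below is my own assumed number:

```python
# The "~5 MB per frame" figure derived from the assumptions in the post.
pool_mb = 256            # virtual texture pool size
change_per_frame = 0.02  # ~1-2% of the active set changes between frames

delta_mb = pool_mb * change_per_frame
pcie_gbps = 16.0         # assumed PCI-E 3.0 x16 throughput
print(f"~{delta_mb:.1f} MB/frame -> "
      f"~{delta_mb / 1024 / pcie_gbps * 1000:.2f} ms of bus time per frame")
```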
     
    #231 sebbbi, Mar 25, 2017
    Last edited: Mar 25, 2017
    homerdog, HTupolev and manux like this.
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    This discussion has no relevance to unified memory systems. I am solely talking about paging from system RAM to GPU memory. You still need large system RAM. But system RAM is cheap.
     
    BRiT likes this.
  13. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    When you talk about "streaming" in the context of being an alternative to "loading" (i.e. "loading screens"), would Star Citizen's "Mega Map" initiative be a good example of taking "streaming" about as far as you could (see 13:51 below)?

    Since Star Citizen is attempting to have very large "maps" (i.e. solar system size), my understanding is that they are putting together a novel system to load one empty "map" with a very brief loading screen and then stream in things around the player (e.g. the ground, buildings, other space ships, etc) as the player moves around the solar system.



    How much cheaper is system RAM (e.g. DDR4?) compared to video RAM (e.g. GDDR5, HBM?)?

     I don't see explicit mentions of "GDDR5" or "HBM" on sites like DRAMeXchange. Otherwise, it's relatively straightforward to compare the flash in SSDs or the DRAM chips in RAM DIMMs. Teasing out GDDR5 pricing is tougher.

    http://www.dramexchange.com/

    I've always wondered why GDDR5 isn't used as system memory in more integrated systems (laptops, etc) where system memory is already permanently soldered to the motherboard. It seems especially useful in APU situations where integrated GPUs are often starved of bandwidth.
     
  14. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
     It's not only about the price. It isn't possible to scale fast memory to sizes that large. And I am not talking about GDDR5. Fast = HBM2 or MCDRAM or something else.

    For example Intel's newest Xeon Phi processor supports up to:
    - 384 GB of DDR4 memory (102 GB/s)
    - 16 GB of MCDRAM (400+ GB/s)

     The Xeon Phi's MCDRAM can be configured as a cache for the DDR4 main memory. Similarly, in a future desktop GPU, 8 GB of HBM2 could be configured as a cache for 64 GB of DDR4 main memory. It could work either at cache-line granularity or at page granularity.
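     A toy model of why the hit rate of such a cache matters so much; the bandwidth figures are ballpark assumptions in line with the Xeon Phi numbers above, and the blend is deliberately crude:

```python
# Crude model (assumed numbers) of fast memory used as a cache in front of slow memory:
# each access pays the cost of the level it hits, so effective bandwidth depends
# almost entirely on how often the working set hits the fast memory.
hbm2_gbps = 400.0   # fast on-package memory (MCDRAM / HBM2 class)
ddr4_gbps = 100.0   # large main memory (quad-channel DDR4 class)

for hit_rate in (0.5, 0.9, 0.99):
    effective = 1.0 / (hit_rate / hbm2_gbps + (1.0 - hit_rate) / ddr4_gbps)
    print(f"hit rate {hit_rate:4.0%}: ~{effective:.0f} GB/s effective")
```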
     
    nnunn likes this.
  15. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,718
    Likes Received:
    2,454
     The high latency of GDDR5 memory prevents that. It works perfectly well in a GPU environment, as GPUs hide latency well with their parallel nature. However, in a general system environment it is not ideal: CPUs need as little latency as possible, hence why DDR4 or DDR3 is preferable.
     
  16. ProspectorPete

    Regular Newcomer

    Joined:
    Feb 1, 2017
    Messages:
    414
    Likes Received:
    137
    Is that why consoles are so weak? Would GDDR5 bottleneck a Zen processor?
     
  17. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
     I thought this was a long-dismissed theory, given that in reality GDDR5 has similar timings to DDR3. The high latency observed on GPUs is a result of their operating clock speed (in absolute terms) and throughput-oriented memory pipeline (in relative terms).
     
    Pixel, homerdog and Putas like this.
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Why didn't they?
     
  19. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,796
    Likes Received:
    2,054
    Location:
    Germany
     Is there actually an engine in production use that does well with increasing resolution on a 4 GiB budget? And by that I mean also beyond UHD resolution. Of course, given that your other assets are slim enough, you will get very far with a 4 GiB frame buffer, but even engines that are pretty good at streaming, like the one in Doom, offer some options that really kill 4 GiB cards. Whether that's waste or not... well, you rarely get a linear increase in image quality for your GFLOPS or GBytes.
     
  20. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
     Current engines haven't been designed primarily with 4K in mind. Render targets and other temporary buffers have so far been quite small, but 4K makes them 4x larger compared to 1080p. Developers are improving their memory managers and rendering pipelines to utilize memory better.

     The brand new Frostbite GDC presentation is a perfect example of this:
    http://www.frostbite.com/2017/03/framegraph-extensible-rendering-architecture-in-frostbite/

     Pages 57 and 58 describe their PC 4K GPU memory utilization. The old system used 1042 MB at 4K; the new system uses only 472 MB. This is a modern engine with lots of post-processing passes. Assets (textures and meshes) are obviously an additional memory cost on top of this, and this is where a good fine-grained texture streaming technology helps a lot (whether it is a fully custom solution or an automatic GPU caching/paging solution).
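     The big saving in that presentation comes from aliasing transient render targets whose lifetimes don't overlap within the frame. A toy sketch of that idea follows; the pass numbers and sizes are invented, not Frostbite's:

```python
# Toy sketch of transient render-target aliasing (FrameGraph-style idea, simplified):
# targets whose lifetimes within a frame don't overlap can share physical memory.

# (name, size in MB, first pass that writes it, last pass that reads it) -- invented data
targets = [
    ("gbuffer",    94, 0, 3),
    ("shadow",     32, 1, 2),
    ("hdr_scene",  63, 3, 6),
    ("bloom",      16, 4, 5),
    ("ldr_output", 16, 6, 7),
]

naive = sum(size for _, size, _, _ in targets)

# Greedy interval assignment: reuse a "slot" once its previous occupant's lifetime ends.
slots = []   # each slot is (size_mb, last_pass_in_use)
for name, size, first, last in sorted(targets, key=lambda t: t[2]):
    for i, (slot_size, free_after) in enumerate(slots):
        if first > free_after:                       # lifetimes don't overlap: alias it
            slots[i] = (max(slot_size, size), last)
            break
    else:
        slots.append((size, last))                   # no free slot, allocate new memory

aliased = sum(size for size, _ in slots)
print(f"naive: {naive} MB, aliased: {aliased} MB")
```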
     
    Lightman, AlBran, BRiT and 1 other person like this.