Is 4GB enough for a high-end GPU in 2015?

sebbbi · Jun 16, 2015

Ethatron said:
Not sure what you're referring to. R8G8 is better than BC5 whenever your average angle is above ~12 degrees. Otherwise BC5 is only better if your encoder can exploit the 8.6 fixed-point precision possible with the hardware or, if the encoder doesn't, when it happens to be better by chance, as the hardware does 8.6 anyway.

Modern GPUs (already last gen consoles) do not truncate BC5 interpolated value pair (to 8 bit) after decompression. This allows more precision than 8 bit uncompressed. However the BC3 format alpha is truncated to 8 bit after decompression (so this particular format never exceeds 8 bit uncompressed in quality).

8 bit uncompressed channel provides 256 different values. This is not enough for large smoothly curved high specular surfaces (such as car hoods) especially in physically based lighting pipelines. You can see some minor banding in the highlights.

BC5 stores two 8-bit palette endpoints per channel for each 4x4 tile, plus a 3-bit interpolation index per texel that selects one of 8 values between them. On a large, smoothly curved surface the palette endpoints are very close to each other. In the areas with the most noticeable banding the endpoints differ by one. In this case the 3-bit interpolation gives you 6 extra values between the two 8-bit endpoints, which produces higher quality than 10-bit uncompressed channels. Crytek's SIGGRAPH presentation from a few years ago describes the benefits of this normal texture compression method.
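To make the arithmetic above concrete, here is a minimal Python sketch of the palette for one BC5 channel. It assumes the standard unsigned 8-value mode (first endpoint greater than the second) and models the interpolated values as exact fractions, which is an idealization: the thread puts real hardware at roughly 8.6 fixed point. The function name `bc5_palette` is made up for illustration.

```python
from fractions import Fraction
import math

def bc5_palette(e0: int, e1: int) -> list[Fraction]:
    """Palette of one BC5 channel in 8-value mode (requires e0 > e1).

    Values stay as exact fractions of the 0..255 range to model hardware
    that does not truncate the interpolated results back to 8 bits.
    """
    assert 0 <= e1 < e0 <= 255
    palette = [Fraction(e0), Fraction(e1)]
    # Six interpolated values between the two stored endpoints.
    for i in range(1, 7):
        palette.append(Fraction((7 - i) * e0 + i * e1, 7))
    return palette

# Endpoints that differ by a single 8-bit step, as on a smooth car hood.
values = sorted(bc5_palette(129, 128))
step = values[1] - values[0]            # spacing between neighbouring palette values
levels_full_range = 255 * 7 + 1         # levels if this spacing covered all of 0..255
equivalent_bits = math.log2(levels_full_range)
print(len(values), step, round(equivalent_bits, 2))
```

With endpoints one step apart the spacing is 1/7 of an 8-bit step, which corresponds to roughly 10.8 bits of an equivalent uncompressed channel. That only holds inside such a tile; tiles with a wider endpoint range get coarser spacing.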

Rough areas suffer some LSB loss from BC5, but this is often impossible to see with the naked eye, since rough areas (by definition) do not have smooth highlights (banding is not possible). Also, if you use Toksvig mapping or some other specular AA method, it will smooth your highlights in rough areas, further hiding this particular issue. In the end, BC5 with a Crytek-style texture compressor beats uncompressed R8G8 in quality and needs half the bandwidth and half the memory.
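The "half the memory" figure follows directly from the bits per texel: R8G8 is 16 bits per texel, while BC5 packs a 4x4 tile into 128 bits, i.e. 8 bits per texel. A small sketch of the footprint, where the helper name `texture_bytes` and the 4096x4096 example size are assumptions chosen for illustration:

```python
R8G8_BPP = 16   # two 8-bit channels per texel
BC5_BPP = 8     # one 128-bit block per 4x4 tile

def texture_bytes(width, height, bits_per_texel, block=1, mips=True):
    """Size of a 2D texture in bytes, optionally including the full mip chain."""
    total = 0
    while True:
        # Round each dimension up to whole blocks (4x4 for BC formats).
        w = -(-width // block) * block
        h = -(-height // block) * block
        total += w * h * bits_per_texel // 8
        if not mips or (width == 1 and height == 1):
            return total
        width, height = max(1, width // 2), max(1, height // 2)

size = 4096
r8g8 = texture_bytes(size, size, R8G8_BPP, block=1, mips=False)
bc5 = texture_bytes(size, size, BC5_BPP, block=4, mips=False)
print(r8g8 / 2**20, "MiB vs", bc5 / 2**20, "MiB")
```

For the top mip level the ratio is exactly two. With a full mip chain it is marginally less than two, because the smallest BC5 mips still occupy a whole 4x4 block.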

mczak · Jun 16, 2015

sebbbi said:
Modern GPUs (already last gen consoles) do not truncate BC5 interpolated value pair (to 8 bit) after decompression. This allows more precision than 8 bit uncompressed. However the BC3 format alpha is truncated to 8 bit after decompression (so this particular format never exceeds 8 bit uncompressed in quality).

Interesting. I always thought BC4/5 is just like the alpha channel of BC3. The WGF specs though actually indeed mention the precision is higher. Though according to specs, higher precision for BC3 alpha would be allowed as well, just not required.

Ethatron · Jun 16, 2015

sebbbi said:
Modern GPUs (already last gen consoles) do not truncate BC5 interpolated value pair (to 8 bit) after decompression.

Correct, they use about 8.6+ fixed point.

sebbbi said:
This allows more precision than 8 bit uncompressed.

Not unconditionally, and not the way you mean it. If the angle is above 12 degrees (which is 8/255 BTW) you get technically more bits for the interpolated values than in the R8G8 case, but they don't lead to more precise interpolated values per se, as you have fewer bins than R8G8 in the first place and you're not free to choose them. They might be better, but they are more likely worse. Depends on the content.

sebbbi said:
8 bit uncompressed channel provides 256 different values. This is not enough for large smoothly curved high specular surfaces (such as car hoods) especially in physically based lighting pipelines. You can see some minor banding in the highlights.

That's the <12 degree case.

sebbbi said:
[...] Crytek's few years old SIGGRAPH presentation describes the benefits of this normal texture compression method.

No. Anton just describes that if you happen to have a BC5 compressor which supports float->BC5 compression, "by chance" you get better normals from the 8.6+ interpolation. He doesn't propose to write an actual coder which knows anything about it. Which means that by chance a tile could also end up worse. You can also use 8bit source material and end up having a similar "chance" to get better or worse results.

I wrote a BC5 compressor that actually knows about the higher precision, and is likely the best after pure brute-force. It also found its way into the CryEngine. Feel free to check if you can get even better normal maps.

sebbbi said:
In the end, BC5 with a Crytek-style texture compressor beats uncompressed R8G8 in quality and needs half the bandwidth and half the memory.

It has half the bandwidth, but it isn't unconditionally better, BC5 is lossy after all.

Ethatron · Jun 16, 2015

mczak said:
Interesting. I always thought BC4/5 is just like the alpha channel of BC3. The WGF specs though actually indeed mention the precision is higher. Though according to specs, higher precision for BC3 alpha would be allowed as well, just not required.

No, BC4 and BC5 behave differently from the BC3 alpha channel, even if the block coding is identical. I think they didn't want to change the specs for DXT5 "after-the-fact" and break old hardware.

sebbbi · Jun 16, 2015

Ethatron said:
Anton just describes that if you happen to have a BC5 compressor which supports float->BC5 compression, "by chance" you get better normals from the 8.6+ interpolation. He doesn't propose to write an actual coder which knows anything about it. Which means that by chance a tile could also end up worse. You can also use 8bit source material and end up having a similar "chance" to get better or worse results.

I am talking about the "exhaustive" compute shader compressor that operates on 16 bit (per channel) source normals. This compressor also doesn't choose the best value simply by minimizing the (square) error of the 2 channels, it decodes the 2 channel BC5 compressed data block back to normalized 3d vectors and compares it against the original normal vectors. Maybe you are talking about an earlier presentation? I am not 100% sure that the presentation I was referring to was held at SIGGRAPH (it might have been an earlier one).
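The error metric described above can be sketched as follows: instead of minimizing the squared error of the two stored channels independently, decode each candidate back to a unit normal and compare it with the original. This is only an illustration of the metric, not the actual compute shader; the function names are hypothetical, the palettes are assumed to be already decoded to the [-1, 1] range, and the search over endpoints that a full exhaustive compressor would also do is left out.

```python
import math

def reconstruct_normal(x: float, y: float) -> tuple[float, float, float]:
    """Rebuild a unit normal from the two stored channels in [-1, 1]."""
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    length = math.sqrt(x * x + y * y + z * z)
    return (x / length, y / length, z / length)

def angular_error(original, x: float, y: float) -> float:
    """Angle in radians between the original normal and the decoded one."""
    decoded = reconstruct_normal(x, y)
    dot = sum(a * b for a, b in zip(original, decoded))
    return math.acos(max(-1.0, min(1.0, dot)))

def best_palette_pair(original, palette_x, palette_y):
    """Try every (x, y) palette combination and keep the smallest angular error."""
    return min(
        ((px, py) for px in palette_x for py in palette_y),
        key=lambda p: angular_error(original, p[0], p[1]),
    )

# Toy example: an up-vector and two tiny hand-written palettes.
up = (0.0, 0.0, 1.0)
choice = best_palette_pair(up, [-0.1, 0.0, 0.1], [0.2, 0.0])
print(choice, angular_error(up, *choice))
```

Measuring the error on the reconstructed 3D vector matters because the renderer lights with that vector, not with the two raw channels.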

Or maybe the compressor I am talking about is your compressor

But the conclusion stands, BC5 is an excellent normal map compression format assuming you have a good compressor. With 16 bit source data it often looks better than R8G8 and is only half the price. BC5 is also good for storing material properties (roughness, specular, emissive, etc). It has a nice "pseudo float" property. Channels (and texture regions) that are filled with closer to zero values get more precision. This is useful especially for non-linear data.

Ethatron · Jun 16, 2015

sebbbi said:
I am talking about the "exhaustive" compute shader compressor that operates on 16 bit (per channel) source normals. This compressor also doesn't choose the best value simply by minimizing the (square) error of the 2 channels, it decodes the 2 channel BC5 compressed data block back to normalized 3d vectors and compares it against the original normal vectors. Maybe you are talking about an earlier presentation? I am not 100% sure that the presentation I was referring to was held at SIGGRAPH (it might have been an earlier one).

It should be "Reaching the speed of light" from SIGGRAPH 2010, yes. Such a brute force compressor is/was not practical for a production pipeline, it simply takes too long. Anton simply stated how to achieve the maximum possible quality with BC5. Although he missed the problem of 255/2, only signed or 254 valued unsigned formats can represent perfect up-vectors, regardless if 2 or 3 channel encoded.
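The 255/2 problem mentioned above is easy to check numerically: with the usual unsigned 0..255 mapping no code decodes to exactly zero, so a perfect up-vector (x = y = 0) cannot be stored. A short sketch, assuming the common decode conventions written out in the comments (the function names are made up for illustration):

```python
def decode_unorm8(v: int) -> float:
    """Common unsigned mapping: 0..255 -> -1..1."""
    return v / 255.0 * 2.0 - 1.0

def decode_unorm8_254(v: int) -> float:
    """Unsigned mapping restricted to 0..254, so 127 is the exact midpoint."""
    return v / 254.0 * 2.0 - 1.0

def decode_snorm8(v: int) -> float:
    """Signed mapping: -127..127 -> -1..1, with -128 clamped to -1."""
    return max(v / 127.0, -1.0)

has_zero_255 = any(decode_unorm8(v) == 0.0 for v in range(256))
has_zero_254 = any(decode_unorm8_254(v) == 0.0 for v in range(255))
has_zero_snorm = any(decode_snorm8(v) == 0.0 for v in range(-128, 128))

# Closest the 255-valued mapping gets to zero (codes 127 and 128).
nearest = min(abs(decode_unorm8(v)) for v in range(256))
print(has_zero_255, has_zero_254, has_zero_snorm, nearest)
```

In the 255-valued case the two nearest codes sit about 0.0039 on either side of zero, so a flat surface always carries a slight tilt.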

sebbbi said:
Or maybe the compressor I am talking about is your compressor

I thought back then that the criterion for maximum quality was plain and obvious (I've been in signal processing and data compression for 20 years), but it took me a while to get an algorithm done which is pretty much within 99.5% (or more) of brute-force in 0.5% (or less) of the time. So, in a way I'd say yes.

sebbbi said:
But the conclusion stands, BC5 is an excellent normal map compression format assuming you have a good compressor.

Yes, ofc, no doubt about that. It's just not the silver bullet. Not to mention the interpolation problems arising from the parallel projection.

sebbbi said:
With 16 bit source data it often looks better than R8G8 and is only half the price. BC5 is also good for storing material properties (roughness, specular, emissive, etc). It has a nice "pseudo float" property. Channels (and texture regions) that are filled with closer to zero values get more precision. This is useful especially for non-linear data.

Indeed. Interestingly, for smooth textures it's possible to store sRGB values in the black domain with a smaller bin size than sRGB8, which in turn invalidates the white-compression of sRGB, giving back space there as well. You need to live with the typical block artefacts in high-variance areas though. Verrry useful for low frequency lightmaps. (BC4 has no sRGB permutation, so no chance to get best of both.)

sebbbi · Jun 16, 2015

Ethatron said:
No, BC4 and BC5 behave differently from the BC3 alpha channel, even if the block coding is identical. I think they didn't want to change the specs for DXT5 "after-the-fact" and break old hardware.

The original reason why DXT5 (now called BC3) was decompressed to RGBA8 was that this gave full rate filtering and occupied 32 bits per pixel in texture cache. Back then the texture caches stored uncompressed texture data (DXT blocks were decompressed to the cache). Nowadays GPU L1 caches contain compressed data. This gives 4x improvement on storage capacity for BC formats at a small extra cost for filtering (just a few extra transistors for fixed function palette interpolation and bit shuffling). Improved cache utilization is a good reason why nobody should ignore the BC compressed formats on modern GPUs.

NVIDIA's old GPUs decoded DXT1 (BC1) to R5G5B5A1 (16 bpp) in their texture cache. Other GPUs (including NVIDIA's nowadays) decode this to RGBA8, as all modern GPUs have full rate filtering for RGBA8. BC5 was originally created by ATI and was called 3Dc. BC5 was decoded to the texture cache as RG16 (at least on ATI hardware). All 32 bpp normalized formats have traditionally been full rate on all ATI cards, so there was no performance loss compared to RG8 decoding. ATI's texture cache also didn't provide any storage gains for textures of less than 32 bpp. I believe these are the reasons why BC5 (also known as 3Dc, ATI2 and DXN) decoded to more bits, and this is still a big advantage of this format.
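The cache capacity gain from keeping blocks compressed can be worked out from the bits per texel. The sketch below uses the decode targets named in this post (BC1 and BC3 to RGBA8, BC5 to RG16); the 16 KB cache size is a hypothetical figure chosen only to have a number to divide, and the helper name is made up.

```python
# name: (compressed bits per texel, decompressed bits per texel)
FORMATS = {
    "BC1": (4, 32),   # 64-bit block per 4x4 tile, decoded to RGBA8
    "BC3": (8, 32),   # 128-bit block per 4x4 tile, decoded to RGBA8
    "BC5": (8, 32),   # 128-bit block per 4x4 tile, decoded to RG16
}

CACHE_BYTES = 16 * 1024  # hypothetical L1 texture cache size

def texels_in_cache(cache_bytes: int, bits_per_texel: int) -> int:
    """How many texels fit in the cache at a given storage cost."""
    return cache_bytes * 8 // bits_per_texel

gains = {
    name: texels_in_cache(CACHE_BYTES, packed) / texels_in_cache(CACHE_BYTES, unpacked)
    for name, (packed, unpacked) in FORMATS.items()
}
for name, gain in gains.items():
    print(f"{name}: {gain:.0f}x more texels when cached compressed")
```

The 8 bits per texel formats give the 4x figure quoted above, and BC1 at 4 bits per texel would give 8x under the same assumptions.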

sebbbi · Jun 16, 2015

Ethatron said:
(BC4 has no sRGB permutation, so no chance to get best of both.)

Yes. Some time ago I was looking for an optimal format to store a distance field. I was bummed not to find BC4_sRGB in the DX11 format list. I don't understand why only the lousy quality RGB triplets get sRGB support. Even the BC3 alpha channel doesn't get sRGB (it is linear while the RGB is in gamma space).

BC7 is nice because it has sRGB support and higher quality than BC3 (at same cost). It is the best sRGB option available. Unfortunately BC7 compressors are still way slower than the others. You definitely don't want to reconvert all your BC7 textures for each new data revision.

Alessio1989 · Jun 17, 2015

sebbbi said:
Yes. Some time ago I was looking for an optimal format to store a distance field. I was bummed not to find BC4_sRGB in the DX11 format list. I don't understand why only the lousy quality RGB triplets get sRGB support. Even the BC3 alpha channel doesn't get sRGB (it is linear while the RGB is in gamma space).

BC7 is nice because it has sRGB support and higher quality than BC3 (at same cost). It is the best sRGB option available. Unfortunately BC7 compressors are still way slower than the others. You definitely don't want to reconvert all your BC7 textures for each new data revision.

Isn't TexConv of DirectXTex fast enough (gpu accelerated)? https://directxtex.codeplex.com/wikipage?title=Texconv
AMD also provided a new Compress lib recently: http://developer.amd.com/tools-and-sdks/graphics-development/amdcompress/

Kaarlisk · Jun 19, 2015

A little OT, but at least I stumbled upon a review (not new; I had somehow missed it) that has allayed my fears that buying a 4GB GTX 960 was completely useless.
http://www.gamersnexus.net/guides/1888-evga-supersc-4gb-960-benchmark-vs-2gb/Page-2
Yes, there are games where there is zero difference between 2/4GB. There are games where 4GB gets an advantage which is useless anyway because the average frame rate is still unplayable.
And there are also games where, at playable frame rates, there is noticeably less stutter with the 4GB card.

revan · Jun 24, 2015

"Are 4GB HBM enough?
Fiji uses the new and much more advanced memory type HBM, but is limited to 4,096 MB. AMD asserts that four gigabytes are enough for future games. The fact that the Radeon R9 290X and Radeon R9 390(X) already come with eight gigabytes should not contradict this argument. According to AMD, Fiji is significantly more economical with memory than the GPUs with attached GDDR5. This statement was followed up in numerous games.

Memory Usage
If you look at the memory usage in different games, you quickly notice that the Radeon R9 Fury X actually behaves differently than, for example, the Radeon R9 390X and the Radeon R9 290X. While the GDDR5 graphics cards allocate, for example, seven gigabytes of memory in Assassin's Creed Unity, the Radeon R9 Fury X comes in at around 3,950 MB in the same game, narrowly under the four-gigabyte limit, without any difference in performance at first glance.

Even more significant is the new treatment of memory in Call of Duty: Advanced Warfare and Middle-earth: Shadow of Mordor. In the first-person shooter, the HBM-based graphics card uses only 3.1 gigabytes, while other four-gigabyte cards fill the full memory. And also in Middle-earth, which is actually decried as a VRAM eater, the allocation remains a little lower at 3,771 megabytes. "

http://www.computerbase.de/2015-06/amd-radeon-r9-fury-x-test/11/

"Memory is full" is not the same as "memory is required", it seems...

snc · Jun 24, 2015

Yeah, 4gb is enough... http://hardocp.com/article/2015/06/24/amd_radeon_r9_fury_x_video_card_review/6#.VYrz8xbjnYk

ArcticCircle · Jun 24, 2015

snc said:
Yeah, 4gb is enough... http://hardocp.com/article/2015/06/24/amd_radeon_r9_fury_x_video_card_review/6#.VYrz8xbjnYk

I noticed something similar in PCPer's Fury X review. For example, GTA 5 shows really high frametime spikes, and I get that same effect when I run out of VRAM. And given how muddy and blurry some of GTA 5's textures are, that's not a good thing.

Alessio1989 · Jun 24, 2015

revan said:
"Are 4GB HBM enough?
Fiji uses the new and much more advanced memory type HBM, but is limited to 4,096 MB. AMD asserts that four gigabytes are enough for future games. The fact that the Radeon R9 290X and Radeon R9 390(X) already come with eight gigabytes should not contradict this argument. According to AMD, Fiji is significantly more economical with memory than the GPUs with attached GDDR5. This statement was followed up in numerous games.

Memory Usage
If you look at the memory usage in different games, you quickly notice that the Radeon R9 Fury X actually behaves differently than, for example, the Radeon R9 390X and the Radeon R9 290X. While the GDDR5 graphics cards allocate, for example, seven gigabytes of memory in Assassin's Creed Unity, the Radeon R9 Fury X comes in at around 3,950 MB in the same game, narrowly under the four-gigabyte limit, without any difference in performance at first glance.

Even more significant is the new treatment of memory in Call of Duty: Advanced Warfare and Middle-earth: Shadow of Mordor. In the first-person shooter, the HBM-based graphics card uses only 3.1 gigabytes, while other four-gigabyte cards fill the full memory. And also in Middle-earth, which is actually decried as a VRAM eater, the allocation remains a little lower at 3,771 megabytes. "

http://www.computerbase.de/2015-06/amd-radeon-r9-fury-x-test/11/

"Memory is full" is not the same as "memory is required", it seems...

Yeah, WDDM 1.x is so Jurassic

snc · Jun 26, 2015

link "4gb is enough"

Alessio1989 · Jun 26, 2015

snc said:
link "4gb is enough"

Thanks WDDM 1.x for that u.u. It stutters at more like ~3.1 - 3.2 GB. Considering that only a small part of the VRAM pool is reserved for the OS (not ~800 MB for sure), all this should be caused by external memory fragmentation.
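For readers unfamiliar with the term, external fragmentation means that plenty of memory is free in total, but not in one contiguous piece. The toy model below illustrates only that idea; the slot size, block size and allocation pattern are invented for the example and say nothing about how WDDM or any driver actually manages VRAM.

```python
def largest_free_run(heap: list[bool]) -> int:
    """Length of the longest contiguous run of free (False) slots."""
    best = run = 0
    for used in heap:
        run = 0 if used else run + 1
        best = max(best, run)
    return best

# Toy heap: 4096 slots of 1 MB each; True means the slot is in use.
HEAP_MB = 4096
BLOCK_MB = 64
heap = [False] * HEAP_MB

# Fill the heap with 64 MB allocations, then free every other one.
for start in range(0, HEAP_MB, BLOCK_MB):
    heap[start:start + BLOCK_MB] = [True] * BLOCK_MB
for start in range(0, HEAP_MB, 2 * BLOCK_MB):
    heap[start:start + BLOCK_MB] = [False] * BLOCK_MB

free_total = heap.count(False)             # total free space in MB
free_contiguous = largest_free_run(heap)   # largest single free piece in MB
can_fit_128mb = free_contiguous >= 128
print(free_total, free_contiguous, can_fit_128mb)
```

Half of the heap is free here, yet a single 128 MB resource cannot be placed without moving or evicting something, which is the kind of situation that could show up as stutter well before the pool is nominally full.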

Malo · Jun 26, 2015

Alessio1989 said:
Thanks WDDM 1.x for that u.u. It stutters at more like ~3.1 - 3.2 GB. Considering that only a small part of the VRAM pool is reserved for the OS (not ~800 MB for sure), all this should be caused by external memory fragmentation.

Then why does a 980ti not experience the same issues?

Alessio1989 · Jun 26, 2015

Malo said:
Then why does a 980ti not experience the same issues?

I don't know; bad drivers, or just a different (worse) memory management implementation. A good starting point for investigating this issue could be a GPUView session with GTA5 on an AMD Fury X (another interesting detail would be the GPU pre-emption granularity exposed to DXGI).
You can compare memory management in DX 11 (and prior versions) to the way Java handles unreferenced objects in memory, where the garbage collector is implementation-defined by the JVM (here, the driver): the efficiency of (de)allocation and heap (de)fragmentation is almost entirely (but not totally) hidden from the developer. The manual memory management of D3D12 and the WDDM 2.0 memory reservation model (which is more "console-like") will give developers the tools to solve these issues.

Dominik D · Jun 26, 2015

For full-screen applications, preemption granularity and engine dependencies don't matter that much: the GPU is exclusive to you and e.g. DWM won't try to preempt you.

Alessio1989 · Jun 26, 2015

I know, that's called "exclusive" mode... I was just curious to see if there are any changes from the other GCN GPUs, especially considering the presentation mode changes that will affect Windows 10 (full-screen included).
Also, does anyone know if there are issues in windowed mode too?
