I don’t know what tessellation shaders are but mesh shaders were introduced in Turing. Can’t really trace it back to AMD.
Mesh shaders first appeared with Turing.
Pretransforming instances at BVH build time would be borderline suicidal for performance (and memory...
RDNA2 doesn’t have it but it has nothing to do with accelerating BVH building. It accelerates traversal in the presence of instanced geometry...
There is a misconception on this forum that using ML/DL for graphics means throwing away any prior knowledge about a problem and just replacing an...
Well, I guess we'll have to shut down research centers, university departments and corporate research labs. No doubt there is a lot of poor work...
Culling is likely to require a small fixed number of clock cycles, while scan conversion requires a variable amount of work and time. The optimal...
The most likely and charitable explanation is that one vendor had more time to refine and improve their implementation of DXR while the other had...
I believe you're missing the forest for the trees. These are early but tremendously encouraging developments. No one has claimed you should throw...
Yes, they are MIMD units on both Turing and Ampere.
Full flexibility possibly comes with a hefty price tag since BVH traversal can be highly irregular and doesn't map well to SIMD cores. There's...
Solving PDEs with neural networks is still in its infancy but it's making rapid progress. For instance this was a recent breakthrough by a...
If you simply use brute force you won't get very far..(ahh, if it were that easy..)
A100 FP64 peak via tensor cores is 19.5TF/s.
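Quick sanity check of that figure from the published specs (108 SMs, ~1.41 GHz boost clock, and the tensor cores doing 64 FP64 FMAs per SM per clock); a trivial sketch, obviously not how you'd measure it:

[CODE]
#include <cstdio>

int main() {
    // Peak = SMs * FMAs per SM per clock * 2 FLOPs per FMA * clock.
    const double sms = 108, clock_hz = 1.41e9, fp64_fma_per_sm_per_clk = 64;
    const double flops = sms * fp64_fma_per_sm_per_clk * 2.0 * clock_hz;
    printf("FP64 tensor peak: %.1f TF/s\n", flops * 1e-12);   // ~19.5
    return 0;
}
[/CODE]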
Mirror-like reflections are not exactly a worst case scenario. Glossy reflections/refractions and GI on the other hand generate more challenging...
DXR 1.1 is fully supported on Turing and Ampere and it works very well. Also it is already possible to support stochastic LOD in DXR at full...
You don’t need a deep stack to efficiently handle traversal, whether it’s stored on dedicated memory, registers or cache. For instance see this...
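To make the point concrete, here's a generic illustration (not the scheme from the link; the node layout and the intersection tests are placeholders I made up): with a threaded node layout, where each node also stores a skip link to the next node to visit when its subtree is culled or finished, you can drop the stack entirely.

[CODE]
#include <cstdint>
#include <limits>

// A "threaded" BVH node: besides its children/primitives it stores the index
// of the node to visit next when its subtree is skipped or finished. Children
// are laid out so that a child's skip link leads to its sibling and the last
// child's skip link leads to the parent's skip target. With that link,
// traversal needs no stack at all.
struct Node {
    uint32_t first_child;   // first child if internal, primitive range if leaf
    uint32_t skip;          // where to go when this subtree is culled/done
    bool     is_leaf;
};
struct Ray { float t_max; /* origin, direction... */ };

// Placeholder intersection tests; real code would do slab/triangle tests.
static bool  intersect_box (const Node&, const Ray&) { return true; }
static float intersect_leaf(const Node&, const Ray&) { return std::numeric_limits<float>::infinity(); }

constexpr uint32_t kDone = ~0u;   // skip link of the last node in the layout

float trace_closest(const Node* nodes, Ray r) {
    float closest = std::numeric_limits<float>::infinity();
    uint32_t cur = 0;                                           // start at the root
    while (cur != kDone) {
        const Node& n = nodes[cur];
        if (!intersect_box(n, r)) { cur = n.skip; continue; }   // cull whole subtree
        if (n.is_leaf) {
            float t = intersect_leaf(n, r);
            if (t < closest) { closest = t; r.t_max = t; }      // shorten the ray
            cur = n.skip;
        } else {
            cur = n.first_child;                                // descend
        }
    }
    return closest;
}
[/CODE]

The price of going fully stackless like this is a fixed traversal order (no near-child-first sorting), which is one reason real implementations often prefer a small stack or restart trails instead; either way the storage involved is tiny.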
Yup, but I was thinking about treelet leaves, not the whole BVH. Regardless, it was just a way to say that there is not a strict need to read...
(disclaimer: I have no idea how it works, this is just a hypothesis) CU sends ray data + pointer to BVH node(s) to RT unit, which fetches the BVH...
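Roughly, in pseudo-C++ (again, pure speculation mirroring the guess above; the 4-wide node and the helper name are invented): the fixed-function unit only does the node fetch plus the box/triangle tests, while the loop, the stack and the choice of what to test next stay in shader code on the CU.

[CODE]
#include <cstdint>

struct Ray { /* origin, direction, t_max... */ };

// What the fixed-function intersection unit would hand back for one node.
struct IsectResult {
    uint32_t children[4];   // child nodes whose boxes the ray hit (4-wide BVH assumed)
    int      num_children;
    float    hit_t;         // triangle hit distance if the node was a leaf
    bool     is_leaf;
};

// Stand-in for the hardware "intersect node" operation: fetch the node's
// data from memory and run the box or triangle tests. Not defined here.
IsectResult rt_unit_intersect(uint32_t node_index, const Ray& r);

void trace(uint32_t root, const Ray& r) {
    uint32_t stack[32];                                 // tree depth assumed < 32
    int top = 0;
    stack[top++] = root;
    while (top > 0) {                                   // loop runs as shader code on the CU
        uint32_t node = stack[--top];
        IsectResult res = rt_unit_intersect(node, r);   // offloaded to the RT unit
        if (res.is_leaf) {
            // record res.hit_t, shorten the ray, etc.
        } else {
            for (int i = 0; i < res.num_children; ++i)  // pointer chasing stays on the CU
                stack[top++] = res.children[i];
        }
    }
}
[/CODE]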
It matters in case each traversal step needs to be taken on the CUs, i.e. the pointer chasing happens in shader code while the intersection...
With half of the CUs..
Yep, I’ve seen that image posted on twitter and some of the numbers were completely wrong.
180W for Navi 10? If you start there you get 215W for the chip alone, without considering the much higher clock (if true) and everything else on...
If a 5700XT with 40 CUs and 1.755 GHz game clock has a TDP of 225W (https://www.techpowerup.com/gpu-specs/radeon-rx-5700-xt.c3339), a 72 CU part...
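Crude version of that scaling argument (naively scaling the whole 225W board figure with CU count at the same game clock; memory, fan and VRM losses obviously don't scale like that, and higher clocks/voltage would push it further up):

[CODE]
#include <cstdio>

int main() {
    // Naive scaling of the 5700 XT's 225 W board power with CU count alone,
    // at the same game clock. Treat it as a rough illustration only: it shows
    // why 72 CUs at similar or higher clocks needs a big perf/W jump to stay
    // in a sane power envelope.
    const double board_w_40cu = 225.0;
    printf("72 CUs, same clock, naive scaling: %.0f W\n", board_w_40cu * 72.0 / 40.0);  // ~405 W
    return 0;
}
[/CODE]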
Does TGP represent the power used by the gfx board?
If the BVH traversal logic is indeed handled by the CUs.. it's not a given the texture units fetch BVH and triangle data. Perhaps the CUs do it...
Although it does use the tensor cores.
QW-Net doesn't reconstruct higher resolution images (i.e. supersampling without super resolution). See below for more details: [MEDIA]
Upsampling an image using CNNs is relatively easy and fast these days. To do it in a temporally stable fashion while adding information (and not...
AFAIK no modern GPU works this way. Each pixel runs on a SIMD/T lane.
DirectML is just an API for implementing certain classes of DNNs. It won't magically put a superres/supersampling solution in the hands of developers.
That’s absolutely not the case. I can’t get into details but on some IMRs it can be done efficiently (barring pathological cases) with little to...
Not a waste as the output of TF32 matrix multiplication is FP32 and can be used as input to FP32 math.
Sadly GPUs almost never scale 2x, especially when they get very big.
Is it a ‘meaner’ part? I haven’t heard anything about it.
AFAIK TPUs go faster if you use BF16, but they also support FP32. Google say BF16 is close to a drop-in replacement for FP32, so it’s not...
All major DL frameworks, which call into NVIDIA APIs, will be automatically using TF32 by default. The value is not having to change your...
There’s no claim of 156 TF/s for FP32. That claim is for TF32, which is a mixed precision format for matrix multiplication and addition. The input...
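If it helps, here's a tiny emulation of what happens to the inputs (a sketch only: it truncates where the hardware rounds, and the function name is mine): operands keep FP32's 8-bit exponent but only 10 mantissa bits, while the multiply, the accumulation and the result stay ordinary FP32.

[CODE]
#include <cstdint>
#include <cstring>
#include <cstdio>

// Reduce a float to TF32 input precision: same 8-bit exponent as FP32, but
// only 10 mantissa bits (like FP16). We just zero the low 13 mantissa bits;
// the hardware rounds, but the idea is the same.
float to_tf32(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFE000u;                 // sign + exponent + top 10 mantissa bits
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}

int main() {
    float a = 1.234567f, b = 7.654321f;
    float tf32_in = to_tf32(a) * to_tf32(b);   // inputs reduced, math and result in FP32
    printf("fp32: %.7f  tf32 inputs: %.7f\n", a * b, tf32_in);
    return 0;
}
[/CODE]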
That makes no sense. Fixing the acceleration structure depth either implies there is no upper bound on the number of children an internal node...
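The relationship is simple: with branching factor b and depth d a tree can reference at most b^d leaves, so holding N primitives needs roughly d >= log_b(N); fix d and you must either let b grow without bound or cap the scene size. A two-line check (numbers are arbitrary):

[CODE]
#include <cmath>
#include <cstdio>

int main() {
    // Minimum depth for N primitives with branching factor b: ceil(log_b(N)).
    // E.g. 10 million triangles in a 4-wide BVH:
    const double N = 10e6, b = 4.0;
    printf("min depth: %.0f\n", std::ceil(std::log(N) / std::log(b)));  // 12
    return 0;
}
[/CODE]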
Convolutional networks, unlike fully connected ones, can be typically run at arbitrary resolutions and don't necessarily need re-training. As for...
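One way to see why (generic parameter-count argument, the numbers are arbitrary): a convolution's weights depend only on kernel size and channel counts, so the same filters slide over any resolution, whereas a fully connected layer's weight matrix is tied to the flattened input size.

[CODE]
#include <cstdio>

int main() {
    // A conv layer's parameter count is independent of input resolution;
    // a fully connected layer's is not.
    const int in_ch = 3, out_ch = 16, k = 3;
    printf("3x3 conv weights:        %d\n", out_ch * in_ch * k * k);        // 432, any resolution
    printf("FC weights at 1280x720:  %d\n", 1280 * 720  * in_ch * out_ch);  // tied to the resolution
    printf("FC weights at 1920x1080: %d\n", 1920 * 1080 * in_ch * out_ch);
    return 0;
}
[/CODE]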
On NV47/PS3 it was very important to use FP16 registers where possible in order to save register space and get more threads running, which increased...
Using 0.57 as the scaling factor per manufacturing node (more realistic than 0.5) we get 0.57*0.57*0.76 (half node) -> ~0.25 scale factor, which would...
We are seeking a research scientist to join our team of graphics researchers in San Francisco, California. If you are passionate about graphics...
Process is just one of many variables; saying that this or that SoC, if manufactured with a different process, would be better or worse means...
You were linking gfx performance to the process, which doesn't mean much if we don't even know the area devoted to graphics in various SoCs. You...
Do you have a comparison in terms of performance per area?
This has got to be the best B3D thread ever :)
GPUs haven't "ditched" any significant FF HW in ages. Several works at HPG/SIGGRAPH this year actually show renewed interest in adding FF HW...
That's not quite correct. Any DX11 GPU allows you to "record" all fragments that contribute to a pixel into a variable size data structure (e.g. a...
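For context, a minimal model of one common incarnation of such a structure, per-pixel linked lists; this sketches the data structure on the CPU, with std::atomic standing in for the UAV atomics (a global allocation counter plus an InterlockedExchange on the per-pixel head pointer) that a pixel shader would use, and the later sort-and-blend pass is only hinted at in a comment:

[CODE]
#include <atomic>
#include <cstdint>
#include <vector>

// One recorded fragment: colour/depth plus a link to the previous fragment
// that landed on the same pixel.
struct FragmentNode {
    float    depth;
    uint32_t color;     // packed RGBA
    uint32_t next;      // index of the next node in this pixel's list, or kEnd
};
constexpr uint32_t kEnd = ~0u;

struct PerPixelLists {
    std::vector<FragmentNode>          pool;       // node buffer (UAV in the real thing)
    std::vector<std::atomic<uint32_t>> heads;      // per-pixel head pointers
    std::atomic<uint32_t>              counter{0}; // global allocation counter

    PerPixelLists(size_t pixels, size_t max_frags) : pool(max_frags), heads(pixels) {
        for (auto& h : heads) h.store(kEnd);
    }

    // What the pixel shader does per fragment: bump the counter to allocate a
    // node, fill it in, then atomically swap it in as the new list head.
    void record(uint32_t pixel, float depth, uint32_t color) {
        uint32_t idx = counter.fetch_add(1);
        if (idx >= pool.size()) return;                 // out of node memory
        pool[idx].depth = depth;
        pool[idx].color = color;
        pool[idx].next  = heads[pixel].exchange(idx);   // link to the old head
    }
    // A later full-screen pass walks each list, sorts by depth and blends.
};
[/CODE]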
From IVB to HSW the number of threads per EU went from 8 to 7, see the developer guide:...
OIT is done by just adding some code to your shader, so it interacts with MSAA the same way any other shader does; there is nothing special about...