AMD: Navi Speculation, Rumours and Discussion [2019-2020]

No, AMD isn't playing that game anymore.

I'm sorry but AMD CPUs are accidentally popular once more.

I have $800 set aside for a new card. I'd love to get a new AMD card. Last time I went with a Vega 56 for $375 on launch day. If we double that price and I can get 50% more performance than a 2080 Ti, especially with ray tracing, and it has comparable features, I'm down with it.
 
Alleged benchmarks of MI100 show it beating A100 in FP32 by a large margin, while losing by a large margin in FP64, FP16, and mixed precisions.

https://adoredtv.com/exclusive-cdna-and-mi100-presentation-slides-leak/

Makes total sense; I'd expect non-matrix FP16 operations to also win. The A100 is clearly designed to maximize matrix-multiplication machine learning, and unless AMD redesigned the GPU entirely, I'd expect it to be just a fairly linear advancement over Vega.

I wonder how much demand there is for such a GPU?
 
In their earlier article, the facts of which they re-confirmed in the most recent one, they had an alleged MI100 slide showing roughly quarter-rate FP64 at 9.5 TFLOPS and 42 TFLOPS for FP32. But what I don't get is how they would generate 150 TFLOPS of FP16 out of that, and why MI100 wouldn't be named MI150 then.

Then in their last slide, shown above, a "2.4x better FP32 performance" is mentioned (which, sans marketing, is probably rather +140%), yet in the other slides from the same article it shows only a marginal advantage in "delivered SGEMM" capped at 300 watts.

Either I need more coffee, or I just cannot make much sense out of these numbers.

FWIW, with standard GCN CUs, you'd need somewhat north of 2210 MHz in a 120-CU part for 34 TFLOPS of FP32. Giving it 95% efficiency for SGEMM, that's something around 2330 MHz.
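
If it helps, here is that back-of-the-envelope math spelled out; a minimal sketch, assuming the standard GCN layout of 64 FP32 lanes per CU and 2 FLOPs per lane per clock (FMA), with the target and efficiency figures taken from above:

```cpp
#include <cstdio>

// Clock estimate for a hypothetical 120-CU GCN part, assuming
// 64 FP32 lanes per CU and 2 FLOPs per lane per clock (FMA).
int main() {
    const double flops_per_clock = 120.0 * 64.0 * 2.0; // 15360 FLOPs/clock
    const double target_tflops = 34.0;                  // FP32 figure from above
    const double sgemm_efficiency = 0.95;               // assumed fraction of peak

    const double peak_mhz  = target_tflops * 1e12 / flops_per_clock / 1e6;
    const double sgemm_mhz = peak_mhz / sgemm_efficiency;

    printf("Clock for %.0f TFLOPS peak FP32:  ~%.0f MHz\n", target_tflops, peak_mhz);
    printf("Clock at %.0f%% SGEMM efficiency: ~%.0f MHz\n", sgemm_efficiency * 100, sgemm_mhz);
    return 0;
}
```

That prints roughly 2213 MHz and 2330 MHz, matching the figures above.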
 
In their earlier article, the facts of which they re-confirmed in the most recent one, they had an alleged MI100 slide showing roughly quarter-rate FP64 at 9.5 TFLOPS and 42 TFLOPS for FP32.
It's 42 TF peak for SGEMM FP32, not for the usual ops.
why MI100 wouldn't be named MI150 then.
They just name them bigger numbers now, even pretending MI60 doesn't exist and all.
 
In an AdoredTV video (a discussion between three people), one of them said the FP32 number is "tricky", as they had finally worked out how that number is obtained. But he didn't reveal why.
 
Then maybe they just forgot to mention and footnote that little factoid here. Funny, because in some AMD slide decks, there is more space for footnotes than for actual content.
Not sure why these are even discussed in the Ampere thread, but that's not an AMD slide.
 
Not sure why these are even discussed in the Ampere thread, but that's not an AMD slide.
I realize they are not AMD-branded, but "OEM system availability" seems to indicate it's not a slide from a specific manufacturer. But whose are they?
 
I realize they are not AMD-branded, but "OEM system availability" seems to indicate it's not a slide from a specific manufacturer. But whose are they?
They're quite surely just Adored's own fabrications. I mean, what company would ever write stuff like "but not much else" on a slide?
[attached image: upload_2020-7-31_12-54-30.png]

edit:
To be clear, I'm not saying it's impossible for him to have the real slide deck too, but those are not from AMD or any other company and I wouldn't bet the others are real either.

Like this one:
[attached image: upload_2020-7-31_12-59-5.png]

The AMD portion is copy/pasted from another source, and someone forgot to fill in the same background color inside letters like "0" and "O".
 
New AMD raytracing patent with an intersection unit in each CU:
http://www.freepatentsonline.com/10706609.pdf
US10706609B1 Efficient data path for ray triangle intersection
https://patents.google.com/patent/US10706609B1/en

A division unit would be useful for a straightforward implementation of a certain type of ray-triangle intersection test that is useful in ray tracing operations. This certain type of ray-triangle intersection test includes a step that transforms the coordinate system into the viewspace of the ray, thereby reducing the problem of intersection to one of 2D triangle rasterization. However, a straightforward implementation of this transformation requires floating point division, as the transformation utilizes a shear operation to set the coordinate system such that the magnitudes of the ray direction on two of the axes are zero. This shear operation, when applied to the vertices of the triangle, requires multiplication by a ratio of the ray direction magnitude in one axis to the ray direction magnitude in another axis, which requires division. Instead of using the most straightforward implementation of this transform, the technique described herein scales the entire coordinate system by the magnitude of the ray direction in the axis that is the denominator of the shear ratio, which removes division.
...
Conceptually, the ray-triangle test involves projecting the triangle into the viewspace of the ray so that it is possible to perform a simpler test similar to testing for coverage in two dimensional rasterization of a triangle as is commonly performed in graphics processing pipelines. More specifically, projecting the triangle into the viewspace of the ray transforms the coordinate system so that the ray points downwards in the z direction and the x and y components of the ray are 0 ... The vertices of the triangle are transformed into this coordinate system. Such a transform allows the test for intersection to be made by simply asking whether the x, y coordinates of the ray fall within the triangle defined by the x, y coordinates of the vertices of the triangle, which is the rasterization operation described above.
...
The ray-triangle intersection unit 702 does not include a divider. Not including a divider is beneficial because a divider consumes a large amount of computer chip die area and power. Thus, not including a divider improves the amount of die area taken up by the ray intersection unit 139.
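
To make the quoted idea concrete, here is a minimal, hypothetical sketch of such a division-free ray/triangle test, loosely following the published watertight ray/triangle intersection approach (Woop et al.) that the patent language echoes. The names and structure are my illustration, not AMD's hardware; the axis permutation that puts the largest ray-direction component on z, and the hit-distance (t) computation, are omitted for brevity:

```cpp
struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// Division-free projection of a triangle into the ray's viewspace. Instead of
// shearing the vertices by the ratios dx/dz and dy/dz (which needs a floating
// point divide), the whole coordinate system is scaled by dz:
//     x' = dz*x - dx*z,   y' = dz*y - dy*z
// Scaling both axes this way multiplies each 2D edge function by dz*dz >= 0,
// so the inside/outside signs are unchanged and only mul/add is needed.
bool rayHitsTriangle(Vec3 org, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2) {
    // Translate the triangle so the ray origin sits at (0, 0, 0).
    const Vec3 a = sub(v0, org), b = sub(v1, org), c = sub(v2, org);

    // Scaled shear: no division anywhere.
    const float ax = dir.z * a.x - dir.x * a.z, ay = dir.z * a.y - dir.y * a.z;
    const float bx = dir.z * b.x - dir.x * b.z, by = dir.z * b.y - dir.y * b.z;
    const float cx = dir.z * c.x - dir.x * c.z, cy = dir.z * c.y - dir.y * c.z;

    // 2D edge functions, the same coverage test as triangle rasterization.
    const float u = cx * by - cy * bx;
    const float v = ax * cy - ay * cx;
    const float w = bx * ay - by * ax;

    // Hit if all edge functions agree in sign (accepting either winding).
    return (u >= 0.0f && v >= 0.0f && w >= 0.0f) ||
           (u <= 0.0f && v <= 0.0f && w <= 0.0f);
}
```

The point is exactly the one the patent makes: the shear becomes multiply-only, so a fixed-function intersection unit built this way doesn't need a divider.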
 
With the AMD RT patents, do we already know whether this method can do things that Turing can't, or vice versa?
 
Of course the PCIe 4.0 x4 connection isn't enough (to reach PS5 I/O throughput speeds). The GPU would need an embedded decompression block, which both consoles already have. That was my suggestion.
In fact, given how much more dynamic the GPU market is compared to the motherboard, CPU and storage ones, I think this is the most likely to come to fruition at the moment.

Yes, I totally agree the best place for a hardware decompression block (or at least the most likely place), if the PC were to get one, is on the GPU. I'd edge more towards it being a shader-based implementation, though, if it were to exist at all. But if we do get that, then that's most of the problem solved as far as I can see. The buses in a normal setup are still plenty fast enough to match or exceed even PS5 throughput, if a similar level of compression is used and the software I/O overheads are reduced compared with today through, for example, unbuffered data transfers, GPUDirect Storage-style DMA, or whatever magic DirectStorage brings to the table.

That said, as long as it could be used like a regular SSD too, an SSD directly connected to the GPU could have some big advantages, especially if it used 8 lanes. That would both allow it to double the speed of regular setups (assuming any drives supported it) and place it perfectly for a GPU-based decompression solution to decompress all data coming off the disk, rather than just the data intended for the GPU itself. It'd also bypass any technical limitations in DMA'ing data directly from the SSD to the GPU.

I think shader-based decompression is more a method to offload the CPU than to achieve higher performance (given the CPU's comparatively low thread count).

I'm not sure how it would compare performance-wise, but I was more looking at it as a way to offload the CPU. PC GPUs will have plenty of horsepower to spare next gen relative to the consoles, but it'll be a long time before CPUs can spare the equivalent of five Zen 2 cores just on decompression.

Bigger installs alone don't solve the I/O throughput limit.

Agreed. I suggest this as a way to sidestep the decompression bottleneck rather than to speed up I/O. Raw bandwidth isn't really an issue given that 7 GB/s drives are incoming, which even without compression are well into next-gen console territory (a little ahead of XSX, a little behind PS5).

Getting very large installs of decompressed data, with long initial load screens that then make the game occupy a truckload of RAM, is a more plausible solution if we're not getting dedicated decompression hardware on the PC anytime soon.

I'm not sure loading times would have to be especially long. You wouldn't need to fill up your main RAM with cached data just to start the game; that could stream in in the background. Let's assume you can start the game with "only" 8 GB in VRAM. On a 7 GB/s drive, that's going to be ready in under 2 seconds. Even if you pre-cache another 120 GB into DRAM in the background, that'd take less than 20 seconds of in-game streaming, and then you'd likely have the entire game cached in RAM!
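
For what it's worth, the arithmetic checks out; a quick sketch assuming a sustained 7 GB/s drive and ignoring compression (so these are worst-case times):

```cpp
#include <cstdio>

// Streaming-time estimates for a hypothetical drive sustaining 7 GB/s reads.
int main() {
    const double drive_gb_per_s  = 7.0;    // sustained sequential read speed
    const double initial_vram_gb = 8.0;    // data needed before the game starts
    const double background_gb   = 120.0;  // optional pre-cache into DRAM

    printf("Initial 8 GB VRAM fill:   %.1f s\n", initial_vram_gb / drive_gb_per_s);
    printf("Background 120 GB cache:  %.1f s\n", background_gb / drive_gb_per_s);
    return 0;
}
```

That's about 1.1 s for the initial fill and about 17 s for the background cache, in line with the estimates above.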
 