AMD: Speculation, Rumors, and Discussion (Archive)

You realise that NVidia was doing attribute interpolation in shader code from barycentrics stored in shared memory way before AMD, don't you?
This is trivial; of course I know that, but what do you want to prove with it? :-?

Anyway, vertex attributes consume a pitiful amount of the shared memory of each CU.
Do you have some more precise numbers than "a pitiful amount"?
 
Honestly, I absolutely don't understand why Nvidia hasn't already done it. I was pretty sure it would happen with Pascal already, but since it's still not there, I'll now bet on Volta. It's a logical evolution; it was extremely logical when AMD introduced it, and it would be a logical fit for the Nvidia architecture.

Possibly Nvidia are looking longer term at a Temporal SIMD/SIMT type solution, which still fits in with scalar execution.
Nvidia had a patent for a solution along these lines back in 2011, created by Ronny Krashinsky.
http://www.freepatentsonline.com/y2013/0042090.html

Going back a few years, a real-world example of Temporal SIMD, I think using a Tesla core baseline: http://lpgpu.org/wp/wp-content/uploads/2013/06/dart.pdf and http://lpgpu.org/wp/wp-content/uploads/2014/02/PEGPUM_2014_tub_dart.pdf

Cheers
 
To change direction like that would take time; most likely Volta would be the target for such a change, as the pipeline we have seen from nV since Kepler has been evolutionary.

It's not such a radical change (nothing like a U-turn), but it's basically a different way of thinking about it. I don't know, maybe it could...
Possibly Nvidia are looking longer term at a Temporal SIMD/SIMT type solution, which still fits in with scalar execution.
Nvidia had a patent for a solution along these lines back in 2011, created by Ronny Krashinsky.
http://www.freepatentsonline.com/y2013/0042090.html

Going back a few years, a real-world example of Temporal SIMD, I think using a Tesla core baseline: http://lpgpu.org/wp/wp-content/uploads/2013/06/dart.pdf and http://lpgpu.org/wp/wp-content/uploads/2014/02/PEGPUM_2014_tub_dart.pdf

Cheers

Really interesting read, thank you. But it's quite different from the use of a scalar unit... I mean, it's so simple, why not take advantage of it? It would even be a real leap in CUDA capability (and CUDA certainly needs a good leap now, or a good kick in the ass).
 
It's not such a radical change (nothing like a U-turn), but it's basically a different way of thinking about it. I don't know, maybe it could...


Really interesting read, thank you. But it's quite different from the use of a scalar unit... I mean, it's so simple, why not take advantage of it? It would even be a real leap in CUDA capability (and CUDA certainly needs a good leap now, or a good kick in the ass).
Glad you liked them, and yeah, I thought so too.
Well, one thing we know about Nvidia is that they only do incremental, transitional architecture changes; Andrew Lauritzen gave a really good recent example of this in: https://forum.beyond3d.com/threads/...nd-analysis-thread.57188/page-68#post-1919927 with context also in: https://forum.beyond3d.com/threads/...nd-analysis-thread.57188/page-69#post-1920350
Cheers
 
Do you have some more precise numbers than "a pitiful amount"?
If you can't work them out for yourself then I guess you're excused from the thread.

Hint: it's four bytes per vertex per attribute. And on current GCN there's a maximum of 4 triangles per hardware thread. 40 threads per CU. You can do the rest of the math...

EDIT: per vertex, not per triangle
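For what it's worth, here is that math spelled out as a rough sketch. The constants are the ones from the hint above; the attribute count is just an illustrative assumption, not a claim about any particular workload.

Code:
# Rough LDS budget for per-vertex attributes, using the numbers in the post above.
BYTES_PER_VERTEX_PER_ATTR = 4   # one 32-bit value per vertex per scalar attribute
VERTICES_PER_TRIANGLE = 3
MAX_TRIS_PER_WAVE = 4           # per the hint above
MAX_WAVES_PER_CU = 40
LDS_BYTES_PER_CU = 64 * 1024    # 64 KiB of LDS per CU on GCN

def attribute_lds(num_scalar_attrs):
    per_tri = BYTES_PER_VERTEX_PER_ATTR * VERTICES_PER_TRIANGLE * num_scalar_attrs
    per_wave = per_tri * MAX_TRIS_PER_WAVE
    per_cu = per_wave * MAX_WAVES_PER_CU
    return per_wave, per_cu, per_cu / LDS_BYTES_PER_CU

# e.g. 8 scalar attributes (two vec4s): 384 bytes/wave, 15360 bytes/CU, ~23% of LDS
print(attribute_lds(8))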
 
Thanks for the heads up, I have all but forgotten many things about Cayman. But this looks not much like asynchronous compute, since it talks specifically about executing multiple compute kernels without a context switch, but no mix of graphics and compute, which is what today's rage is all about. Or am I getting it wrong again?
Cayman could mix graphics and compute on the same CU (or SIMD as it was called then). So could Cypress (5870), Juniper, etc., but for Evergreen parts there was a shared wave launch point between graphics and compute. Cayman worked very much like async works today on AMD hardware.
 
OlegSH,

If you have xyzw for position that's 16 bytes/vertex, 48 bytes/triangle. There can be up to 16 triangles per PS wave so a max of 768 bytes just to interpolate position for dense triangles. If triangles are this dense though the bottleneck is likely geometry processing much of the time.
 
If you can't work them out for yourself then I guess you're excused from the thread.

Hint: it's four bytes per vertex per attribute. And on current GCN there's a maximum of 4 triangles per hardware thread. 40 threads per CU. You can do the rest of the math...

EDIT: per vertex, not per triangle
I don't think it's entirely correct to say 4 bytes per vertex. The interpolation doesn't use the per-vertex values directly; accordingly, the values are named P0, P10, and P20 in the AMD docs. So it's more like 3 floats per triangle (per attribute channel), rather than one per vertex. Obviously, though, it amounts to the same thing.
But you make it sound like that's nothing. The hw limit is 32 attributes, and these can be vec4. So, in the end, you can have 32 (attributes) × 4 (channels) × 12 (bytes per scalar value to interpolate) × 4 (max 4 prims per thread) = 6144 bytes for a single thread.
Luckily, as far as I understand, the hw does not actually need to reserve all LDS space for the worst case (4 prims per thread), albeit my understanding of how this works ends right there... (I don't really know where the max of 4 prims per thread comes from either; it's certainly not visible from the encoding of the vinterp instructions, since these just have a 15-bit mask indicating which quad in a thread comes from a new primitive, which the hw then uses to determine the correct LDS address.)
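For anyone following along, the interpolation mczak is describing works roughly like this (a simplified sketch in Python of what the v_interp_p1/v_interp_p2 pair does with the P0/P10/P20 values held in LDS, not actual shader code):

Code:
# Simplified model of GCN parameter interpolation for one attribute channel.
# P0, P10 and P20 are the three per-triangle values stored in LDS;
# i and j are the per-pixel barycentric coordinates from the rasterizer.
def interp(P0, P10, P20, i, j):
    tmp = P10 * i + P0      # v_interp_p1
    return P20 * j + tmp    # v_interp_p2

# With P10 = P1 - P0 and P20 = P2 - P0 this is the usual barycentric form
# P0*(1 - i - j) + P1*i + P2*j, which is why it amounts to "3 floats per triangle".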
 
Mczak, see my comment above yours. Assuming Jaws' use of "thread" means wavefront in AMD terminology, up to 16 prims can contribute to one PS wave.
 
You can have a $300~400 PC that provides the same or a better console experience; just turn your sights to the used hardware market. It shouldn't be that hard to reach XBO level of quality (which runs games at 720p or 900p @ 30 fps). Heck, most of the time you will exceed it.
 
The consoles punch above their weight in either type of comparison, so you have to wonder why that argument was even attempted in the first place.

More often than not I can see an AMD 7850 perform very well against a PS4 in multiplatform games when similar settings are being used. Neither of them runs the "ultra" settings, but medium-high works well. Obviously you can't run platform exclusives on the PC, and I'm sure there are examples to support any argument, but in general the 7850 runs games pretty well.
 
You can have a $300~400 PC that provides the same or a better console experience; just turn your sights to the used hardware market. It shouldn't be that hard to reach XBO level of quality (which runs games at 720p or 900p @ 30 fps). Heck, most of the time you will exceed it.

If you are going to go used, then:

Can a used 150+ USD PC match a 150+ USD console? Or less (you can find used XBOs on eBay for under 120 USD)? At that point the comparison becomes meaningless. What if someone is willing to sell a used console for just 10 USD? Or what if someone is selling a used PC for 5 USD? At some point the comparison becomes ridiculous. Are 55-60+ million people going to be able to easily get a PC that can match PS4/XBO at 300-350 USD? That is, after all, the current install base of consoles worldwide.

New pricing gives representative cost for similar gaming experiences. Consoles have no rivals for an equivalent graphical gaming experience.

Regards,
SB
 
OlegSH,

If you have xyzw for position that's 16 bytes/vertex, 48 bytes/triangle. There can be up to 16 triangles per PS wave so a max of 768 bytes just to interpolate position for dense triangles. If triangles are this dense though the bottleneck is likely geometry processing much of the time.
So with 768 bytes per wave in that case (which is not the maximum) and 40 waves max per CU, it would be 30 kB, which leaves 34 kB of LDS free. That is more than one can use in a single wave/workgroup. In reality, neither 40 waves per CU nor 16 triangles per wave is terribly probable, so one could up the number of parameters a bit before running into a hard wall.

Edit:
@mczak, the limit of up to 16 triangles per wave comes from the fact that a triangle always occupies at least one quad (2×2 pixels), no matter how tiny it is, and a 64-lane wave holds 16 quads.
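A quick sanity check of those numbers, treating the triangle and wave counts as parameters (this assumes 64 KiB of LDS per CU):

Code:
LDS_BYTES_PER_CU = 64 * 1024                   # 64 KiB per CU on GCN

def lds_left(bytes_per_tri, tris_per_wave=16, waves_per_cu=40):
    used = bytes_per_tri * tris_per_wave * waves_per_cu
    return used, LDS_BYTES_PER_CU - used

# Position only (48 bytes/triangle): 30720 bytes used, ~34 kB left per CU.
print(lds_left(48))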
 
If you are going to go used, then:

Can a used 150+ USD PC match a 150+ USD console? Or less (you can find used XBOs on eBay for under 120 USD)? At that point the comparison becomes meaningless. What if someone is willing to sell a used console for just 10 USD? Or what if someone is selling a used PC for 5 USD? At some point the comparison becomes ridiculous. Are 55-60+ million people going to be able to easily get a PC that can match PS4/XBO at 300-350 USD? That is, after all, the current install base of consoles worldwide.

New pricing gives representative cost for similar gaming experiences. Consoles have no rivals for an equivalent graphical gaming experience.
Invalid argument; the comparison here is not for price's sake, it's for the experience. Going console or PC largely doesn't depend on the price, but on the overall ecosystem and user experience: ease of gaming, exclusives, accessibility of playing with controllers, etc. If the price of entry for the PC is the barrier, then mildly used PCs can provide the same or better graphical fidelity, and going slightly higher in price can provide even better fidelity. Not to mention the capacity for expansion later on, by upgrading CPUs and GPUs to get more quality.
 
Cayman could mix graphics and compute on the same CU (or SIMD as it was called then). So could Cypress (5870), Juniper, etc., but for Evergreen parts there was a shared wave launch point between graphics and compute. Cayman worked very much like async works today on AMD hardware.
I wasn't aware of that - thanks for the enlightenment! :)
 
More often than not I can see an AMD 7850 perform very well against a PS4 in multiplatform games when similar settings are being used. Neither of them runs the "ultra" settings, but medium-high works well. Obviously you can't run platform exclusives on the PC, and I'm sure there are examples to support any argument, but in general the 7850 runs games pretty well.

Agreed. There are cases where the console will punch above its weight (Doom for example, although we await Vulkan to see how that changes), but there are cases of the reverse too (Alien Isolation), so overall I'd call it a wash. PC graphics cards do seem to compare pretty well pound for pound (specs, not price) vs their console counterparts these days.

My 670 for example, which isn't light years ahead of a 7850, especially in modern games, is able to consistently play games at higher settings and/or framerates than those shown in the Digital Foundry face-offs on PS4. It does fall down in the texture department though. That is certainly an area where AMD was more future-proofed in the early days of GCN.
 
AMD Radeon Rx 480 3DMark 11 Performance Benchmark Surfaces
Yes, I am fairly certain this is the real thing, you guys: an entry appeared in the Futuremark database under device ID 67DF:C7, and that hardware ID has been tied to the AMD Radeon RX 480 for a while now. The product is listed at a 1266 MHz clock frequency, which was announced for the AMD Radeon RX 480.

Now, the result that surfaced was from 3DMark 11, the Performance benchmark. A bit of an underpowered benchmark, I'd say. But the result is interesting nonetheless, as the Radeon RX 480 would position itself precisely where it was expected to be: at GeForce GTX 980 performance level, closing in towards Hawaii (390) and Fiji (Fury).

The card also lists 8GB of graphics memory, which again matches the announcement info. So yes, that looks to be the real thing. AMD announced the Radeon Polaris GPU in the form of the Radeon RX 480 last week at Computex: a graphics card that will perform just above 5 TFLOPS with a 150W TDP. It'll have 36 CUs (× 64 shader processors per CU = 2304 shader processors). The card will be available in both 4GB and 8GB versions and has 256-bit GDDR5 memory at 256 GB/s (= 8 Gbps effective, much like the GeForce GTX 1070). The card will run in the 1267 MHz range on its boost clock. The card will start at 199 USD for the 4GB model; 8GB models will likely become available at a price premium. The product did not get a launch date, but we expect late this month.

http://www.guru3d.com/news-story/amd-radeon-rx-480-3dmark-11-performance-benchmark-surfaces.html
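As a quick check, the headline numbers in that article are at least consistent with each other (assuming an FMA counts as two FLOPs and the stated 8 Gbps on a 256-bit bus):

Code:
shaders = 36 * 64                          # 36 CUs x 64 shader processors = 2304
boost_ghz = 1.266
tflops = shaders * 2 * boost_ghz / 1000    # ~5.8 TFLOPS at the boost clock
bandwidth_gbs = 256 / 8 * 8                # 256-bit bus at 8 Gbps = 256 GB/s
print(tflops, bandwidth_gbs)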
 