AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

He basically confirms that primitive shaders allow separate position and attribute shaders, enabling more efficient frustum culling. By implication, a tile-binning rasterizer also becomes more efficient and is interconnected with this feature.
It is. Has that been made public yet, or is it still under NDA?

Or am I just confused..... :p ;)
 
He basically confirms that primitive shaders allow separate position and attribute shaders, enabling more efficient frustum culling. By implication, a tile-binning rasterizer also becomes more efficient and is interconnected with this feature.
Are you sure it's only frustum culling? I think they also found a way to do better back-face culling.

Gamersnexus.net/Mike Mantor said:
"You can then product that with the eye-ray, and if it’s a positive result, it’s facing the direction of the view, and if it’s negative it’s a back-faced triangle and you don’t need to do it. […] State data goes in to whether or not you can opportunistically throw a triangle away. You can be rendering something where you can actually fly inside of an object, see the interior of it, and then when you come outside, you can see outside-in – in those instances, you can’t do back-faced culling.”
http://www.gamersnexus.net/guides/3010-primitive-discarding-in-vega-with-mike-mantor
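
For anyone who wants the test Mantor is describing in concrete form, here is a minimal sketch of back-face culling via the dot product. The function name and the camera-space conventions are my own illustration, not AMD's hardware logic:

```python
import numpy as np

def is_back_facing(v0, v1, v2, eye):
    """Back-face test for one triangle with counter-clockwise winding.

    v0, v1, v2: triangle vertices (3D points); eye: camera position.
    Illustrative only: the sign convention depends on winding order.
    """
    normal = np.cross(v1 - v0, v2 - v0)  # CCW winding => outward-facing normal
    eye_ray = v0 - eye                   # ray from the eye towards the triangle
    # Positive dot => normal points away from the eye => back-facing, cullable.
    return np.dot(normal, eye_ray) > 0.0

# A triangle in the z = -5 plane, wound CCW as seen from the origin.
tri = [np.array([0.0, 0.0, -5.0]),
       np.array([1.0, 0.0, -5.0]),
       np.array([0.0, 1.0, -5.0])]
eye = np.array([0.0, 0.0, 0.0])
print(is_back_facing(*tri, eye))  # False: it faces the camera, so keep it
```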
 
Boost clock has changed for RX Vega. It's no longer the maximum clock speed of the chip. AMD's boost number is now closer to NVIDIA's in that it's a guarantee, but a chip is allowed to go higher if it can.
It appears that the max clock (DPM7) for Vega 64 air is 1630 MHz (that's the card we have been shown in the leaks), while the max clock for Vega 64 liquid is 1750 MHz.
https://sapphirenation.net/sapphire-radeon-vega/
 
I was waiting for the interviewer to ask whether the hardware and driver can make this happen automatically, or whether developers have to specifically code for these primitive shaders.
 
At 70-100 MH/s for Ethereum, Vega is bound to fly off the shelves, regardless of whether it consumes 250 watts to get there.
Part of me hopes this is true, part of me fears it is true.

Regarding mining...

Professional mining software developer here.

Ethash (the proof-of-work used in Ethereum) is memory bandwidth bound.

1 Ethash requires reading 8 KiB from the DAG (64 accesses of 128 bytes), so 1 MH/s requires ~8.19 GB/s of memory bandwidth (1,000,000 × 8,192 bytes ≈ 8.19 GB).

Vega 56 has ~409.6 GB/s; 409.6 / 8.19 ≈ 50 MH/s

Vega 64 has ~483.8 GB/s; 483.8 / 8.19 ≈ 59 MH/s

Both versions use SK Hynix HBM2, which is a 1.6 Gbps part @ 1.2 V (SK Hynix Q2'17 databook). That is the memory speed in Vega 56. Vega 64 runs the same part @ 1.89 Gbps, meaning it is factory overclocked and unlikely to go much faster.

Tl;dr: peak Ethash performance is unlikely to exceed ~59 MH/s on Vega 64 and ~50 MH/s on a stock Vega 56.
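
To make that arithmetic reproducible, here is a small sketch. The 8 KiB-per-hash figure is from the post above, and Vega's 2048-bit HBM2 interface width is assumed; everything else is unit conversion:

```python
BYTES_PER_HASH = 64 * 128  # Ethash: 64 DAG accesses of 128 bytes = 8 KiB

def peak_ethash_mhs(bus_width_bits, pin_speed_gbps):
    """Theoretical Ethash ceiling implied by raw memory bandwidth."""
    bandwidth_gbs = bus_width_bits * pin_speed_gbps / 8   # GB/s
    return bandwidth_gbs * 1e9 / BYTES_PER_HASH / 1e6     # MH/s

# Vega 56: 2048-bit HBM2 @ 1.6 Gbps -> 409.6 GB/s -> ~50 MH/s
print(peak_ethash_mhs(2048, 1.6))
# Vega 64: 2048-bit HBM2 @ 1.89 Gbps -> ~483.8 GB/s -> ~59 MH/s
print(peak_ethash_mhs(2048, 1.89))
```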

Add to that, that's THEORETICAL peak throughput. I was reading coders discussing maximum throughput with Ethereum, and they said something like 85% of peak is the achievable maximum.

Quadro GP100 is said to get ~70 MH/s out of its 720 GB/s theoretical maximum memory bandwidth. If PCGH's tests are an indication of reality, at 85% efficiency we're only going to get ~42.5 MH/s out of ~400 GB/s of practical bandwidth.
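
Extending the sketch above with that 85% figure (the efficiency number is this poster's estimate, not a measured constant, and the function name is my own):

```python
def achievable_mhs(bandwidth_gbs, efficiency=0.85):
    """Derate the bandwidth-derived Ethash ceiling by an assumed efficiency."""
    return bandwidth_gbs * 1e9 / (64 * 128) / 1e6 * efficiency

print(achievable_mhs(409.6))  # Vega 56: 50 x 0.85   = ~42.5 MH/s
print(achievable_mhs(483.8))  # Vega 64: ~59 x 0.85  = ~50 MH/s
print(achievable_mhs(720.0))  # GP100:   ~88 x 0.85  = ~75 MH/s
```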
 
Regarding mining...
https://www.reddit.com/r/Amd/comments/6rm0cs/
Add to that, that's THEORETICAL peak throughput. [...]
Thanks, that's an interesting post. Even 50-60 MH/s would be quite a number, especially if possible on Vega 56 at more sane consumption levels (undervolting included). The problem I see here is the example Fiji set. It also uses HBM, which, along with the GDDR5X in the 1080, 1080 Ti and Titan X/Xp, does not seem to lend itself well to Ethash's access patterns. What those memory types have in common is a rather large prefetch size, i.e. comparatively low command rates/clocks relative to the number of bits moved. I don't see this being remedied in HBM gen 2 apart from the higher clocks in general - even in pseudo-channel mode, all transferred DWORDs share one single AWORD.
 
Regarding mining... [...]

Add to that, that's THEORETICAL peak throughput. [...]

And what kind of benefits can we expect from the HBCC when properly configured? I'd still count that as an unknown factor.
 
Regarding mining... [...]

Add to that, that's THEORETICAL peak throughput. [...]
I don't know who the poster is, but they're using Samsung HBM2, not SK Hynix.
 
Has a graphics card manufacturer ever released a product of any kind running memory above its rated speed at stock settings—excluding overclocked partner versions?
 
So primitive shader now makes sense to me after watching the interview with Michael Mantor.

A key concept here is that culling takes place before computing attributes, which increases efficiency because:
  • shader cycles aren't wasted on computing attribute data for vertices that would be culled - in the traditional pipeline, culling only happens as part of primitive assembly (a fixed-function stage), just before rasterisation
  • pipeline inter-stage buffering isn't wasted on vertices that get culled, i.e. the pipeline is less likely to choke because the rasteriser (or fragment shader, etc.) can't keep up
Since the primitive shader performs culling, it effectively moves culling much earlier in the pipeline. Culling is usually either a fixed-function stage, or the developer has coded the geometry shader to perform culling (if there is a geometry shader). A GS that does culling doesn't prevent the fixed-function culling stage from also culling, just before rasterisation.

Sometimes primitive shader culling has no effect on the quantity of per-vertex attribute data, because there is no attribute data except for position. Shadow map and early depth passes are examples of simple vertex data with just position. So there are no savings related to the effort of shading non-position attributes, and buffer capacity savings are reduced.

In theory primitive shader culling can run at higher throughput than fixed-function culling, simply because fixed-function stages are never sized for the worst case. So even ignoring attribute shading and buffering capacity, culling during the primitive shader should increase overall pipeline throughput.
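
As a toy illustration of why the ordering matters: positions get shaded first, culling runs on positions alone, and only survivors pay for attribute work. This is a CPU-side sketch of the scheduling idea, not AMD's actual pipeline, and every function name here is made up:

```python
import numpy as np

def shade_position(v):
    """Cheap pass: just the position half of the vertex work (identity here)."""
    return np.asarray(v, dtype=float)

def shade_attributes(p):
    """Stand-in for the expensive attribute work (normals, UVs, lighting...)."""
    return {"pos": p, "fake_uv": p[:2]}

def back_facing(p0, p1, p2):
    """Screen-space winding test: non-positive signed area = back-facing (CCW front)."""
    area = (p1[0]-p0[0])*(p2[1]-p0[1]) - (p2[0]-p0[0])*(p1[1]-p0[1])
    return area <= 0.0

def outside_frustum(p0, p1, p2):
    """Trivial reject against a single plane (x = -1), for brevity."""
    return all(p[0] < -1.0 for p in (p0, p1, p2))

def primitive_shader(triangles):
    survivors = []
    for tri in triangles:
        pos = [shade_position(v) for v in tri]        # position-only pre-pass
        if outside_frustum(*pos) or back_facing(*pos):
            continue                                  # culled before any attribute work
        survivors.append([shade_attributes(p) for p in pos])
    return survivors                                  # only these reach the rasteriser

tris = [
    [(0, 0, 0), (1, 0, 0), (0, 1, 0)],      # front-facing, in view: kept
    [(0, 0, 0), (0, 1, 0), (1, 0, 0)],      # back-facing (CW winding): culled
    [(-3, 0, 0), (-2, 0, 0), (-3, 1, 0)],   # fully outside the plane: culled
]
print(len(primitive_shader(tris)))  # 1
```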

So, in what situation can the driver not convert a vertex shader (or a VS/GS pair or DS/GS pair or a DS) into a primitive shader with culling before vertex output?

Presumably it can't do this when fixed function culling is not part of the pipeline, e.g. when GS is performing stream out.

What other situations are there where automatic conversion by the driver is not possible?

In what situations would performance suffer from attempting to use a primitive shader? E.g. when practically no culling is taking place (so the fixed-function stage is never a meaningful bottleneck), or when the primitive shader compiles to a VGPR hog, causing a poor-occupancy slowdown.
 
OK, a question on the mining thing... even though it's "unlikely to exceed ~59 MH/s", isn't that still pretty insanely good? I thought they were all excited when it was supposed to hit 60 MH/s, did I miss something or get it wrong?
 
Yes, ~60 MH/s on Ethereum would be ungodly good. The best I've seen so far is around 37-ish, which both a Titan Xp (here it's actually a tad better than a 1080 Ti due to its full 384-bit memory bus) and a Vega FE can achieve with a memory OC. There's one caveat though: power - and in more dimensions than just plain cost.

The closest I've seen so far is around 85% of the theoretical value of ~8.2 GB per MH, with plain GDDR5 interfaces as in the 1050 (Ti) through 1070 and of course the RX 470-580. GDDR5X as well as HBM gen 1 and gen 2 seem to underperform, and plain FLOPS only play a very minor role.

For example, a 1070 can get to ~31 MH/s with 9.2 Gbps RAM (OC), while a 1080 stays below 26 MH/s with 11 Gbps RAM.
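
Those two data points can be sanity-checked against the bandwidth-derived ceiling from earlier in the thread. The 256-bit bus widths are the cards' standard configurations; the observed hash rates are the ones quoted above:

```python
BYTES_PER_HASH = 64 * 128  # 8 KiB per Ethash

def efficiency(observed_mhs, bus_width_bits, pin_speed_gbps):
    """Observed hash rate as a fraction of the bandwidth-derived ceiling."""
    bandwidth_gbs = bus_width_bits * pin_speed_gbps / 8
    ceiling_mhs = bandwidth_gbs * 1e9 / BYTES_PER_HASH / 1e6
    return observed_mhs / ceiling_mhs

print(efficiency(31, 256, 9.2))   # GTX 1070, GDDR5:  ~0.86 of peak
print(efficiency(26, 256, 11.0))  # GTX 1080, GDDR5X: ~0.61, i.e. G5X underperforms
```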
 
Yes, ~60 MH/s on Ethereum would be ungodly good. [...]
Thanks for the explanation, but could you dumb it down for me just a bit? I'm not getting it. :oops:
 