AMD: Speculation, Rumors, and Discussion (Archive)

1. Backface culling: Solved
2. View volume culling: Solved
3. Hidden surface culling: Solved in TBDR
4. Zero coverage culling, especially after tessellation (see the sketch below)
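For illustration, here is a minimal sketch of what tests 1 and 4 boil down to per triangle in screen space, written as CUDA-compatible C. The struct and function names are made up, and real GPUs do the equivalent in fixed-function hardware, not in code like this:

```cuda
struct Vec2 { float x, y; };

// Test 1, backface: twice the signed area of the projected triangle.
// With counter-clockwise front faces, area <= 0 means cull (the sign
// flips with the opposite winding convention).
__host__ __device__ float signedArea2(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Test 4, zero coverage: if the triangle's screen-space bounding box
// contains no pixel center (centers sit at integer + 0.5), no sample
// can possibly be covered, so the triangle can be discarded.
__host__ __device__ bool coversNoPixels(Vec2 a, Vec2 b, Vec2 c) {
    float minX = fminf(a.x, fminf(b.x, c.x));
    float maxX = fmaxf(a.x, fmaxf(b.x, c.x));
    float minY = fminf(a.y, fminf(b.y, c.y));
    float maxY = fmaxf(a.y, fmaxf(b.y, c.y));
    // First/last pixel-center column and row inside the box; an empty
    // range on either axis means zero coverage.
    return ceilf(minX - 0.5f) > floorf(maxX - 0.5f) ||
           ceilf(minY - 0.5f) > floorf(maxY - 0.5f);
}
```

Test 4 is the interesting one after tessellation, where huge numbers of tiny triangles can fall entirely between pixel centers (or sample positions, with MSAA).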
Ignorant question of the day: on today's non-Polaris GPUs, how are pipeline stalls visible to the programmer, and how does one typically deal with them?
Depends on which type of stall you refer to. For the generic stall on memory transactions: batched prefetch / read-ahead and explicit use of the LDS.
Preferably this is not done explicitly in software, but rather implicitly by the hardware itself.

You can stall not only on memory transactions but also (especially with larger shader programs) on the instruction cache, which is much harder to account for, since you don't have explicit control over the corresponding cache.
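To put a concrete shape on the "batched prefetch / explicit LDS" idea, a minimal CUDA sketch (the LDS is what CUDA calls __shared__ memory; the kernel, sizes and access pattern are all made up for illustration). The whole block stages a tile into LDS in one coalesced batch, so the long-latency global loads are in flight together instead of each lane stalling on its own dependent load:

```cuda
#include <cuda_runtime.h>

#define TILE 256  // threads per block; also elements staged per batch

// Illustrative kernel: stage a tile of 'in' into shared memory (the
// LDS equivalent), then do the actual work against the staged copy.
__global__ void sumTiles(const float* in, float* out, int n) {
    __shared__ float tile[TILE];
    float acc = 0.0f;
    for (int base = 0; base < n; base += TILE) {
        int idx = base + threadIdx.x;
        // Batched prefetch: one coalesced global load per thread into LDS.
        tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
        __syncthreads();            // whole tile resident before any use
        // Arbitrary reuse pattern standing in for real work; it hits LDS,
        // not global memory, so it does not stall on DRAM latency.
        acc += tile[(threadIdx.x * 7 + 1) % TILE];
        __syncthreads();            // tile fully consumed before overwrite
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
// Launch e.g. as sumTiles<<<1, TILE>>>(d_in, d_out, n);
```

As said above, on real hardware much of this read-ahead happens implicitly; the explicit version only pays off when the staged data gets reused enough.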
 
I think a while ago someone was blown away by Polaris' discard abilities. I think he said "in some cases it's faster than anything on the market", referring to the discard capabilities Polaris has.

Likely this one:
There are some situations where Polaris is incredibly fast. Faster than anything on the market. The secret is probably that special culling mechanism in the hardware, which helps the GPU effectively cull those false-positive primitives that aren't visible on the screen. Today's hardware can't do this.
Single-wavefront performance is also incredibly good: 10-100 times faster than anything on the market. This is good for VR.
 
You mean GP106?
How do you expect GP106 to reach Polaris 10's level of performance? A 256-bit bus? A 128/192-bit bus with GDDR5X (what's the extra cost, what does availability look like, 4 GB or 8 GB)? And when do you think it will show up? If it's making it for back to school, it kind of needs to be ASAP. If we assume GP106 is the same size as P10 with a 256-bit bus, then based off the earlier numbers it would need to sell for $277 for a 58% margin; then again, NV could take a lower margin.
Considering that Nvidia traditionally (well, maybe rather recently) is able to get more effective performance out of cards with comparably lower bandwidth/FLOPS, I do not see why a potential GP106 with a 192-bit-wide bus should not be able to play in the same ballpark as the supposed RX 480 numbers imply. After all, we're considering performance in the ballpark of the R9 390/GTX 970, right? And the GTX 970 had 224+32 bit (196 GB/s in 3.5-GB mode) with its segmented memory, and still managed to keep up as long as memory capacity was not an issue. With commodity 8-Gbps GDDR5 chips, you get to 192 GB/s on a 192-bit interface.

Let's assume for a second the GTX 1070 were 100% bandwidth limited. In our 1080p performance index (1080p selected so as not to gimp the 970 further here, since that's not the point), for example, it scores 81.0, whereas the 970 is at 50.9. So, even with a 100% assumed memory limitation (and correspondingly linear performance degradation), a 192-bit 1070 would be at 60.75. Which should indicate that a 192-bit interface at 8 Gbps should suffice for that level of performance, at the very least.
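Spelling that arithmetic out (linear scaling with bandwidth being the simplifying assumption, as stated):

$$192\,\mathrm{bit} \times 8\,\mathrm{Gbps} \,/\, 8\,\tfrac{\mathrm{bit}}{\mathrm{byte}} = 192\ \mathrm{GB/s}, \qquad 81.0 \times \frac{192}{256} = 60.75 \;>\; 50.9\ \text{(GTX 970)}$$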

Sure, Polaris isn't looking like RV770, but it's nowhere near R600, and nowhere near as bad a situation as you make out.
Depending on where you choose to look, there are numbers out there that make Polaris P10 stand out even against a Fury X. So...
 
Considering that Nvidia traditionally (well, maybe rather recently) is able to get more effective performance out of cards with comparably lower bandwidth/FLOPS, I do not see why a potential GP106 with a 192-bit-wide bus should not be able to play in the same ballpark as the supposed RX 480 numbers imply. [...]
Just to add.
The challenge for GP106 is not necessarily bus bandwidth but the GPC/SM configuration, in comparison to even the 970.
The 970 still had 4 GPCs, but with SMs disabled in several of them, giving on average 3.25 SMs per GPC.
The 1060 looks to be only 2 GPCs, albeit with a full complement of 5 Pascal SMs per GPC; that may be too much of a compromise to reach around 970-980 levels (see the rough shader counts below).
Like you say, unfortunately we still cannot tell with any kind of reliability where the 480 sits in comparison, due to the insane spread of the 'rumour-leaks' put out there, which places it all over the map for now. But that will be a critical factor: whether it is closer to the 390, varies depending upon game/benchmark, or is up at 390X level.
Probably the most solid recent lead on performance is SapphireEd suggesting that going from a 970/390 to the 480 would be a sidegrade; so the 480 could be equal to those, perhaps with a little bit more.
Cheers
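Putting rough shader counts to that comparison (the 970 configuration is public; the 1060 and 480 figures are the rumored configurations at this point):

$$\text{GTX 970: } 13\ \mathrm{SMM} \times 128 = 1664 \qquad \text{GTX 1060: } 10\ \mathrm{SM} \times 128 = 1280 \qquad \text{RX 480: } 36\ \mathrm{CU} \times 64 = 2304$$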
 
The most solid indication of performance, whether or not you trust all the leaked scores, is AMD itself. When they market the 480 as the entry point to the VR experience (which was the R9 290 before), chances are the RX 470 won't reach that level of performance. That in turn means it is unlikely that the RX 480 is massively (>25%*) faster than an R9 290.

*My assumption, based on gut feeling/experience, not on any information I received or calculation I have done. Insert whatever 480/470 performance difference you like.
 
Depends on which type of stall you refer to. For the generic stall on memory transactions: batched prefetch / read-ahead and explicit use of the LDS.
Preferably this is not done explicitly in software, but rather implicitly by the hardware itself.

You can stall not only on memory transactions but also (especially with larger shader programs) on the instruction cache, which is much harder to account for, since you don't have explicit control over the corresponding cache.
What about changes of graphics state for the fixed-function portion, or at the API level? Changes to that context are still cited as having a performance cost, and penalties to the abstract graphics pipeline could be more obvious than items such as LDS or I-cache penalties. Depending on what level of programming is being used, many of those items may not even be exposed.
 
The most solid indication of performance, whether or not you trust all the leaked scores, is AMD itself.
Agreed!
And taking that concept on a tangent: when the RX 480 reviews and benches come out in a few days, try to use AMD's own products as the comparison for the RX 480. Compare its cost, its graphics, its compute to the R9 290, the R9 390, the R9 Fury. Using AMD cards as the comparison really helps to isolate the actual hardware performance, since all the AMD cards share the same basic strengths and weaknesses of driver and architecture. When deciding whether to buy, you should compare to NVidia cards, but that comparison will be much less clear, with huge variation depending on DX11/DX12, per game, which specific benchmarks you pick, async battles, G-Sync vs FreeSync, etc., because those have large, existing, well-known variations between the two companies.

What we'll inevitably see this week in reviews and forums is an immediate comparison to NVidia cards like the last-gen GTX 980 and current-gen GTX 1070. Those comparisons will be loaded with so many implied caveats (the same per-game variation, DX11/DX12, async, etc) that the question of "how good is the new RX 480" will in fact be overshadowed by the NV vs AMD differences. So my hint to you, dear forum readers, is to use R9 290, R9 390 and R9 Fury as your comparison goalposts this week. When you scan down a page of Game X's FPS rankings, let your eyes first look for 390 and Fury, not GTX xxx. The NVidia comparisons are inevitable, interesting, and important, but expect that in discussions the known AMD vs NVidia differences will overshadow the actual new RX 480's (likely dramatic) improvements. Don't get distracted by those differences until you're about to buy a card.
 
The most solid indication of performance, whether or not you trust all the leaked scores, is AMD itself.
Actually, AMD has been playing an intentional smoke-and-mirrors game with the RX 480, so before the launch I wouldn't trust their numbers either.
 
Actually, AMD has been playing an intentional smoke-and-mirrors game with the RX 480, so before the launch I wouldn't trust their numbers either.
To be fair, SapphireEd's very recent comment (on the Sapphire livestream yesterday, I think) that going from a 970/390 would be a sidegrade (I assume this means it may still be a small % faster) is kind of aligned with those figures AMD presented a while ago, albeit, as you say, they were carefully chosen and with a specific context from AMD.

Cheers
 
To be fair, SapphireEd's very recent comment that going from a 970/390 would be a sidegrade (I assume this means it may still at times be a small % faster) is kind of aligned with those figures AMD presented a while ago, albeit, as you say, they were carefully chosen and with a specific context from AMD.

Cheers
What I meant by smoke and mirrors wasn't choosing the most favourable scenarios for the RX 480 in their comparisons; that's just normal practice in any such comparison, not smoke and mirrors. And not all smoke-and-mirrors cases try to make the thing look better than it is.
 
What I meant by smoke and mirrors wasn't choosing the most favourable scenarios for the RX 480 in their comparisons; that's just normal practice in any such comparison, not smoke and mirrors. And not all smoke-and-mirrors cases try to make the thing look better than it is.
Sure, I was just commenting on the numbers aspect, and yeah, I agree with you: I'm not fully trusting any numbers until we see a broad range of reviews and benchmarks.
Cheers
 
Thanks to a good colleague, I've been able to get access to some real performance numbers for the RX 480. The details are typical of a 1080p review, so savor them ;) :

Doom
Averages (R9 380 baseline: 55-60 fps):
RX 480: 163%
GTX 970: 147%
R9 380: 100%

Minimums (R9 380 baseline: 40-45 fps):
RX 480: 158%
R9 380: 100%
GTX 970: 93%

Firestrike
Graphics Score (R9 380 baseline: 8700-8800 points):
RX 480: 149%
GTX 970: 142%
R9 380: 100%

Gears of War: Ultimate
Averages (R9 380 baseline: 55-60 fps):
RX 480: 158%
GTX 970: 153%
R9 380: 100%

Minimums (R9 380 baseline: 50-55 fps):
RX 480: 119%
GTX 970: 115%
R9 380: 100%

Regards

P.S.: The GTX 970 is a custom model clocked well above stock, so this is not bad at all.

A friend of his has access to this data; the numbers were/will be used in a "standard 1080p review". The 970 was a custom with a heavy OC.

http://foro.noticias3d.com/vbulleti...941&page=317&p=5520420&viewfull=1#post5520420
 
4. Zero coverage culling, especially after tessellation
This patent was filed by Intel last month, so I doubt it's present on current-generation GPUs. It seems this is more of an optimization to cull triangles faster, not to cull more of them. In current GPUs, if a triangle doesn't cover any pixels, it is culled.

You can stall not only on memory transactions, but (especially for larger shader programs) also on the instruction cache. Which is much harder to account for, since you don't have explicit control over the corresponding cache.
Well, an instruction-cache miss is also a memory transaction ;-)
Detecting stalls as an application programmer is hard unless the HW vendors actually provide counters, e.g. the idle time of a particular unit (AMD's GPUShaderAnalyzer shows this). It's mostly based on conjecture and a trial-and-error process: if you are being stalled by texture fetches, then turning texturing off will improve your framerate; if you're stalled by the instruction cache, then reducing the size of your shader program will reduce the stalls. Stalls mainly happen on memory transactions, as their lookup times are non-deterministic. Shader operations and other fixed-function operations like depth test/blending have deterministic timing, and hence the HW is typically optimized to prevent stalls in them.
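For what it's worth, a minimal sketch of that trial-and-error approach from the application side, in CUDA terms. The kernel names are hypothetical; the point is the A/B timing: build an ablated variant of the suspect kernel and compare event timings.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Average the wall time of repeated launches using CUDA events.
template <typename Launch>
float timeKernelMs(Launch launch, int iters = 100) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();                          // warm-up: fills caches, JIT, etc.
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// Usage with hypothetical kernels: if stubbing out the texture fetches
// (or shrinking the shader body) changes the time a lot, that points at
// the stall source, exactly as described above.
//   float full  = timeKernelMs([&] { shadeFull<<<grid, block>>>(args); });
//   float noTex = timeKernelMs([&] { shadeNoTex<<<grid, block>>>(args); });
//   if (full > 1.5f * noTex) std::printf("likely texture-fetch bound\n");
```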
 
5. Pixel culling / Early out

Pixels could be primitives too. Regrouping/rearranging a thread block following an early out or, more importantly, divergent paths would likely help performance: remove all the holes in the wavefronts that pop up at runtime. That should help in some alpha and post-processing situations. It would need to be a really fast operation, or apply only to long-running shaders, to make sense. It would seem to make a lot more sense for compute.
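To make the "remove the holes" idea concrete, here is a CUDA sketch of warp-level compaction after an early out (the kernel and the early-out test are made up, and whatever hardware mechanism is being speculated about here would of course do this below the programmer's view):

```cuda
#include <cuda_runtime.h>

// Repack items that survive a per-thread early out into a dense array,
// so a follow-up pass runs full warps/wavefronts instead of holey ones.
// *packedCount must be zeroed before launch.
__global__ void compactSurvivors(const float* items, int n,
                                 float* packed, int* packedCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool alive = (i < n) && (items[i] > 0.0f);  // stand-in early-out test

    unsigned mask = __ballot_sync(0xFFFFFFFFu, alive);  // surviving lanes
    if (mask == 0) return;             // whole warp early-outs together

    int lane = threadIdx.x & 31;
    int leader = __ffs(mask) - 1;      // lowest surviving lane
    // Survivors below me in this warp = my slot within the warp's chunk.
    int offsetInWarp = __popc(mask & ((1u << lane) - 1u));

    // One atomic per warp reserves contiguous slots for all its survivors.
    int warpBase = 0;
    if (lane == leader) warpBase = atomicAdd(packedCount, __popc(mask));
    warpBase = __shfl_sync(0xFFFFFFFFu, warpBase, leader);

    if (alive) packed[warpBase + offsetInWarp] = items[i];
}
```

The compaction itself costs a ballot, a popcount and one atomic per warp, which is why, as said, it only pays off in front of long-running work.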

This patent was filed by Intel last month, so I doubt it's present on current-generation GPUs. It seems this is more of an optimization to cull triangles faster, not to cull more of them. In current GPUs, if a triangle doesn't cover any pixels, it is culled.
It seems likely the primitive discard may be a shader, so it wouldn't be surprising if it could be adapted to cull just about anything.
 
To be fair, SapphireEd's very recent comment (on the Sapphire livestream yesterday, I think) that going from a 970/390 would be a sidegrade (I assume this means it may still be a small % faster) is kind of aligned with those figures AMD presented a while ago

He only answers a direct question about whether it would make sense to upgrade a 390, and he said no: cards like the "390, 390x, 980, 980ti, fury" would be too close to be worth upgrading, without saying whether the 480 would be faster or slower than any of those... So he is not the one pointing out the 390 as the target, and he's not really telling us anything either. (https://www.twitch.tv/sapphirepr/v/74687671 from 1:23:50)
 
5. Pixel culling / Early out

Pixels could be primitives too. Regrouping/rearranging a thread block following an early out or, more importantly, divergent paths would likely help performance: remove all the holes in the wavefronts that pop up at runtime. That should help in some alpha and post-processing situations. It would need to be a really fast operation, or apply only to long-running shaders, to make sense. It would seem to make a lot more sense for compute.

If you're talking about EarlyZ, then it's part of hidden surface culling. Constructing pixel shader warps after EarlyZ seems like a sane choice to reduce divergence, but don't TBDRs already do this? For compute, I don't think you can apply culling before running the shader, as there is no concept of EarlyZ in compute. It's just data in -> compute shader -> data out; you can't make any assumptions about the data itself.

It seems likely the primitive discard may be a shader, so it wouldn't be surprising if it could be adapted to cull just about anything.
Doing stuff in a shader certainly gives you more programmability, but it's not so good for power efficiency. Primitive discard in all GPUs has been fixed-function for ages; it's a fixed-function task meant to be done more efficiently by fixed-function units.
 
He only answers a direct question about whether it would make sense to upgrade a 390, and he said no: cards like the "390, 390x, 980, 980ti, fury" would be too close to be worth upgrading, without saying whether the 480 would be faster or slower than any of those... So he is not the one pointing out the 390 as the target, and he's not really telling us anything either. (https://www.twitch.tv/sapphirepr/v/74687671 from 1:23:50)
Yes, I listened to it the other day.
I guess it comes down to interpretation.
Telling someone not to upgrade from a 390 would normally suggest it is not that much more powerful.
He has to be somewhat vague due to NDA.

Cheers
Edit:
Wanted to check if my context was right.
The question was specifically "Would you recommend upgrading from a 390 to a 480?"
And his response was no, but he expanded it to include the models above that, talking about them as all being part of the enthusiast/performance class of cards.
 
The R9 380X is the right comparison baseline for the RX 480, in my opinion. In terms of architecture it's the closest, and in terms of market positioning it's also the closest. And if the RX 480 is only 50-60% faster, then that's definitely not as good as the 1080 versus the 980, where we see a 70-80% gain. Obviously there are outliers.

The other thing I'm wondering is whether Polaris has hardware support for conservative rasterisation...
 