AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

The settings are typically low, and games with various effects like POM tend to turn them off.
Sony had to get the word out to devs with a fix or clarification about using AF, as there was a trend of titles that omitted it even when the Xbox One or PC versions had it.
 
Looks like it's time for some review sites to pick up AF perf metrics.

[image: 7KpyLAZ.png]
 
In the case of ROPs, could the shader engines be export-limited? CUs need to negotiate access to an export bus and buffers on the ROP end. If that's per shader engine, and the organization in the block diagram hints at this, then scaling CU count may not scale the data path and capacity for the ROPs.
Shader engines are always export limited. There's no point firing finished pixels at the ROPs faster than the ROPs can process them. The rasterisation rate is the fundamental for export. Shader engine rasterisation rate hasn't changed. Has it?

AMD used to queue and sort pixels as they left shading, to maintain ordering when ordering is required. I presume that still happens. That's a rendering mode choice made by the developer, so isn't always on.
 
http://forums.overclockers.co.uk/showpost.php?p=28199976&postcount=706

Just to confirm.

The AMD Radeon™ Fury X is an enthusiast graphics card designed to provide multi-display 4K gaming at 60Hz.

In addition, Active DisplayPort 1.2a-to-HDMI 2.0 adapters are set to debut this summer. These adapters will enable any graphics cards with DP1.2a outputs to deliver 4K@60Hz gaming on UHD televisions that support HDMI 2.0.
That confirms that they don't have HDMI 2.0 (probably because they didn't want to spend the R&D on a new transceiver for 28nm), but I really don't think that is a big deal. Isn't the latency in TVs too high anyway?
But for the few for whom it is: a passive converter is $25. Active converters are going to be quite a bit more expensive...
 
My videocard can send signals to my monitor via both HDMI and DP.

However, when connected via HDMI, it recognises the monitor as a TV and sends it a suboptimal signal.
 
Shader engines are always export limited. There's no point firing finished pixels at the ROPs faster than the ROPs can process them.
It is quite possible for a shader engine to export at a sustained rate lower than what the ROPs can process. An overclocking test isn't going to make that distinction. A mix of wavefronts with sufficient ALU, TEX, or other resource consumption can lead to gaps in the utilization of the export bus and ROPs, as occupancy limits would prevent the launch of additional wavefronts with their own export requirements.
Increasing the number of wavefronts that can be kept in flight can bring the contended export path back into play, particularly since the bandwidth constraint has been significantly relaxed.

Some amount of buffering has to be done on the far end of the export process, and the ROPs have a tile import-filter-export loop going on that is intimately tied to how well memory or the compression subsystem can move them, so even if the ROPs themselves are not on a memory controller's clock domain, what they plug into is.
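As a toy illustration of the occupancy argument above (every number below is made up for illustration, none of them are real GCN figures), this is roughly how export-bus utilisation falls out of wavefronts in flight:

Code:
# Toy model: how many wavefronts need to be in flight before the shared
# export bus of a shader engine stops sitting idle.
# Hypothetical figures only -- not real GCN numbers.

alu_cycles_per_wave    = 400   # assumed ALU/TEX work per wavefront
export_cycles_per_wave = 16    # assumed export-bus time per wavefront

def export_bus_utilisation(waves_in_flight):
    """Fraction of cycles the export bus is busy, assuming waves overlap
    perfectly and only contend for the one export bus."""
    demand = waves_in_flight * export_cycles_per_wave / (
        alu_cycles_per_wave + export_cycles_per_wave)
    return min(1.0, demand)

for waves in (4, 8, 16, 26, 40):
    print(f"{waves:3d} waves in flight -> export bus {export_bus_utilisation(waves):6.1%} busy")

With few waves resident (occupancy-limited) the export path and the ROPs behind it sit mostly idle; add CUs or relax occupancy and the contended path comes back into play.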


The rasterisation rate is the fundamental for export. Shader engine rasterisation rate hasn't changed. Has it?
MRTs? The rasterizer isn't going to guess how many outbound values there could be. Does it know what format or filtering demands may be placed on the ROPs after coverage has been determined?
 
It is quite possible for a shader engine to export at a sustained rate lower than what the ROPs can process. An overclocking test isn't going to make that distinction. A mix of wavefronts with sufficient ALU, TEX, or other resource consumption can lead to gaps in the utilization of the export bus and ROPs, as occupancy limits would prevent the launch of additional wavefronts with their own export requirements.
Increasing the number of wavefronts that can be kept in flight can bring the contended export path back into play, particularly since the bandwidth constraint has been significantly relaxed.

Some amount of buffering has to be done on the far end of the export process, and the ROPs have a tile import-filter-export loop going on that is intimately tied to how well memory or the compression subsystem can move them, so even if the ROPs themselves are not on a memory controller's clock domain, what they plug into is.
Fundamentally, I agree with you.

Whether the ROPs clock with the shader clocks or are fixed, overclocking the shaders is going to expose a limit somewhere in the render back end, whether it's actual fillrate, the delta colour compression system (if it is, like NVidia's, variable in throughput) or buffer cache non-reuse or MC queueing.

Unfortunately without a lot more data, there's no way to tell how all these subsystems' bottlenecks interact and which of them are actually causing substantially less than linear scaling with overclock.

One of the curiosities of coin mining is that multiple subsystems are "beating" against each other, such that the optimal performance can be gained by setting the memory clock to a precise, irrational, multiplier of the core clock, which is not necessarily the fastest memory clock attainable (and depending on mining algorithm can be a vastly lower than maximum memory clock). And there aren't any ROPs in that, just kernel execution, export, cache, MC and GDDR. Oh and the potential for temperature-/current-induced throttling.
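For what it's worth, here's a sketch of the sort of empirical sweep miners end up doing; measure_hashrate() is an invented stand-in curve just so the loop runs, a real harness would time the actual kernel at each setting:

Code:
# Hold the core clock, step the memory clock, keep whichever setting
# measures fastest. The stand-in curve below is invented purely to show
# that the optimum need not be the maximum memory clock.

def measure_hashrate(core_mhz, mem_mhz):
    # Invented stand-in: peaks at an awkward mem/core ratio.
    ratio = mem_mhz / core_mhz
    return 100.0 - 40.0 * abs(ratio - 1.37)   # arbitrary units

def find_best_mem_clock(core_mhz, mem_min=1250, mem_max=1500, step=5):
    best_rate, best_mem = -1.0, None
    for mem_mhz in range(mem_min, mem_max + 1, step):
        rate = measure_hashrate(core_mhz, mem_mhz)
        if rate > best_rate:
            best_rate, best_mem = rate, mem_mhz
    return best_mem, best_rate

print(find_best_mem_clock(1000))   # often nowhere near mem_max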

And each card will have different optima for clocks, voltages, issue granularity...

So, erm, I'm afraid to say, it's practically impossible to characterise formally.

MRTs? The rasterizer isn't going to guess how many outbound values there could be. Does it know what format or filtering demands may be placed on the ROPs after coverage has been determined?
MRTs are just extra bytes written per work item, just the same as writing fatter pixels. I'm unsure what you're alluding to. The export rate is measured in bytes per clock. That translates to the baseline rate of the ROPs for the simplest pixels. At least, that's been the case for ages in AMD/ATI.
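To put rough numbers on that, with the baseline picked purely for illustration as 64 RGBA8 pixels per clock:

Code:
# Export cost in bytes per pixel: MRTs and fat formats just multiply it.
# Baseline of 64 px/clk at 4 bytes each is an assumption for illustration.

BASELINE_BYTES_PER_CLOCK = 64 * 4          # 256 bytes/clock

def clocks_per_64_pixels(bytes_per_pixel):
    """Clocks needed to export 64 pixels at the baseline byte rate."""
    return 64 * bytes_per_pixel / BASELINE_BYTES_PER_CLOCK

print(clocks_per_64_pixels(4))    # one RGBA8 target    -> 1.0
print(clocks_per_64_pixels(8))    # one RGBA16F target  -> 2.0
print(clocks_per_64_pixels(16))   # two RGBA16F MRTs    -> 4.0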

I'm certainly not trying to excuse the observed scaling. Even though AMD ditched fast double precision, they still couldn't fit in more ROPs. If Tonga is any guide, then they spent a hell of a lot of transistors on delta colour compression instead of adding more ROPs.

I wonder if AMD's colour delta compression algorithm acts as a multiplier on certain kinds of fillrate.

Perhaps AMD is doing more buffering-and-sorting after export, to feed into the colour delta compression algorithm (or, it is a fully integrated colour delta compression buffer-and-sort after export). It might be a deliberately-long pipeline with some kind of fancy feed-forward that means that the effective fillrate averages higher than the nominal rate we associate with ROPs.

A configuration like this could help with small triangles as the quantity of pixel-quads with less than four shaded pixels increases substantially.
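A quick way to see the quad-efficiency falloff: rasterise a right triangle of shrinking size and compare covered pixels against shaded lanes (four per touched 2x2 quad). Toy code only, no real hardware behaviour implied:

Code:
# Covered pixels vs. shaded lanes for a right triangle of a given size.

def quad_efficiency(size):
    covered, quads = 0, set()
    for y in range(size):
        for x in range(size):
            if x + y < size:                  # pixel inside the triangle
                covered += 1
                quads.add((x // 2, y // 2))   # 2x2 quad the pixel lands in
    shaded_lanes = 4 * len(quads)
    return covered, shaded_lanes, covered / shaded_lanes

for size in (32, 8, 4, 2):
    print(size, quad_efficiency(size))        # efficiency drops as triangles shrink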

I just have no idea what's going on.
 
One of the curiosities of coin mining is that multiple subsystems are "beating" against each other, such that the optimal performance can be gained by setting the memory clock to a precise, irrational, multiplier of the core clock, which is not necessarily the fastest memory clock attainable (and depending on mining algorithm can be a vastly lower than maximum memory clock). And there aren't any ROPs in that, just kernel execution, export, cache, MC and GDDR. Oh and the potential for temperature-/current-induced throttling.
I've not delved deeply into that topic, but I have also read discussion that the limited dependence on bandwidth encourages underclocking the RAM.
Crossing clock domains does incur synchronization penalties. It sounds like the miners are empirically determining what clocks allow for the most favorable divisors in terms of wait states. GPUs are not monolithic clock domains, with at least the GDDR portion being an obvious boundary. The export bus itself could be hiding a clock boundary as well. The domains might be the same speed, but possibly not fully synchronous.
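A toy model of what "favorable divisors" could mean at such a crossing: assume a transfer launched on a core-clock edge has to wait for the next memory-clock edge, then average that wait over many edges for a few hypothetical clock pairs. Real synchronisers are more involved; this only shows the ratio dependence:

Code:
# Average wait (ns) at a naive clock-domain crossing for various memory
# clocks against a 1000 MHz core clock. Illustrative model only.

def average_crossing_wait_ns(core_mhz, mem_mhz, edges=100000):
    t_core = 1000.0 / core_mhz     # core period in ns
    t_mem  = 1000.0 / mem_mhz      # memory period in ns
    total = 0.0
    for i in range(edges):
        t = i * t_core             # time of this core edge
        total += (-t) % t_mem      # wait until the next memory edge
    return total / edges

for mem_mhz in (1250, 1375, 1500):             # hypothetical memory clocks
    print(mem_mhz, round(average_crossing_wait_ns(1000, mem_mhz), 4))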

MRTs are just extra bytes written per work item, just the same as writing fatter pixels. I'm unsure what you're alluding to. The export rate is measured in bytes per clock.
I interpreted the phrase rasterization rate as being the number of pixels being output by the rasterizer at the head of each shader engine per cycle.
 
LegitReviews posted a news item about Fury X driving 3x4K screens in Dirt Rally:

The most impressive part about the 12K AMD Eyefinity demo shown this week by AMD running the game title Dirt Rally is that they needed just one video card in the PC to push all those pixels at a playable frame rate, the AMD Radeon R9 Fury X. The frame rate was pushing 60FPS, which isn’t too bad considering what the single $649 graphics card is doing behind the scenes to push out all those pixels at an acceptable rate. AMD informed us that two Radeon R9 290X’s or a Radeon R9 295×2 get around 45-50 FPS on this exact setup.
Read more at http://www.legitreviews.com/12k-gam...ry-x-graphics-card_166585#kUudp7DK2vKbftl6.99

Apart from their special math of getting to a 12K resolution by pairing three 4K screens, it seems Fiji has some horsepower. I wonder if this result is mainly possible thanks to Delta Colour Compression in the new chip, as 290X CF has a combined 620GB/s+ vs. Fiji's 512GB/s, or 5632 shaders vs. 4096 shaders on paper. That should mean CF wins every time. I'm assuming Dirt Rally still uses the same engine as F1 2015 and previous Dirt titles, which scales quite well with CF.
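For reference, the "12K" arithmetic is just three 4K panels side by side:

Code:
# Three 4K panels in Eyefinity: 3 x 3840 = 11520 wide, still 2160 tall.

w, h, panels, fps = 3840, 2160, 3, 60
pixels_per_frame = w * panels * h
print(pixels_per_frame)                  # 24,883,200 -- 3x a single 4K frame
print(pixels_per_frame * fps / 1e9)      # ~1.49 billion pixels per second at 60 fps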
 
That's the thing, right? It's impressive that the current state of the art can drive 3x4K with acceptable frame rates, but in a vacuum, without comparisons against other configurations, it's difficult to say whether it's merely impressive or insanely great.
 
I've not delved deeply into that topic, but I have also read discussion that the limited dependence on bandwidth encourages underclocking the RAM.
Some coin algorithms place strong emphasis on bandwidth, Litecoin's use of scrypt being an attempt originally to defeat GPU acceleration of mining by requiring a vast working set :LOL:

So it's not simply a case of finding the lowest practicable memory clock. It very much depends on the algorithm.

Crossing clock domains does incur synchronization penalties. It sounds like the miners are empirically determining what clocks allow for the most favorable divisors in terms of wait states. GPUs are not monolithic clock domains, with at least the GDDR portion being an obvious boundary. The export bus itself could be hiding a clock boundary as well. The domains might be the same speed, but possibly not fully synchronous.
Yes agreed.

I interpreted the phrase rasterization rate as being the number of pixels being output by the rasterizer at the head of each shader engine per cycle.
It is.

Rasterisation rate matches fillrate for 32-bits per pixel shading. Export is denominated in bytes per clock, but that corresponds with RGBA 8-bit per channel. All three are aligned. Well, that's how things used to be.

I have a bad feeling we're talking at cross purposes.
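To put numbers on that alignment anyway, using commonly quoted GCN figures (16 pixels/clock per rasteriser, four shader engines, 64 ROPs; assumptions rather than confirmed Fiji specifics):

Code:
# How rasterisation, fillrate and export line up for plain RGBA8 under
# the assumed figures above.

rasterisers, px_per_raster_clk = 4, 16
rops = 64
bytes_per_px_rgba8 = 4

raster_px_per_clk = rasterisers * px_per_raster_clk            # 64 pixels/clock
rop_px_per_clk    = rops                                       # 64 pixels/clock
export_bytes_clk  = raster_px_per_clk * bytes_per_px_rgba8     # 256 bytes/clock

print(raster_px_per_clk, rop_px_per_clk, export_bytes_clk)
# Fatter formats or MRTs raise bytes per pixel, at which point export and
# ROP throughput fall below the rasterisation rate.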
 
LegitReviews posted a news item about Fury X driving 3x4K screens in Dirt Rally:
Apart from their special math of getting to a 12K resolution by pairing three 4K screens, it seems Fiji has some horsepower. I wonder if this result is mainly possible thanks to Delta Colour Compression in the new chip, as 290X CF has a combined 620GB/s+ vs. Fiji's 512GB/s, or 5632 shaders vs. 4096 shaders on paper. That should mean CF wins every time. I'm assuming Dirt Rally still uses the same engine as F1 2015 and previous Dirt titles, which scales quite well with CF.


On the other hand, the rendering on those two side monitors doesn't seem to need a lot of work.

BTW:

[image: iDMd7b5.png]


Despite everything, it's always good to see an IHV acknowledging that gamers aren't dead after all. :)
 
It is.

Rasterisation rate matches fillrate for 32-bits per pixel shading. Export is denominated in bytes per clock, but that corresponds with RGBA 8-bit per channel. All three are aligned. Well, that's how things used to be.

I have a bad feeling we're talking at cross purposes.
I suppose I was approaching the contention issue in a design scenario where occupancy issues caused both the rasterizer and export bus to be underutilized at smaller CU counts. Raising the CU count could eventually reduce occupancy constraints to the point that enough wavefronts need enough export cycles that the underutilization goes away.

The ROPs came to my mind first because their execution process is linked to moving tiles in and out from memory, which would not scale with overclocking the core and would exert back pressure to wavefronts trying to export. My assumption, perhaps incorrect, was that the rasterizer's end of the process has a higher likelihood of sourcing its data from on-chip and so could benefit from higher GPU clocks relative to the more memory-heavy ROPs, leading to the rasterizer stage and the CU array both waiting with ready outputs for the ROPs to catch up and free up export buffers.
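Here's a crude producer/consumer sketch of that back-pressure picture: the shader side produces exports at a rate that scales with core clock, the ROP/memory side drains at a fixed rate, and a finite export buffer sits in between. All rates and sizes are invented:

Code:
# Fraction of cycles the CUs stall on full export buffers as the core
# clock is scaled up while the drain (memory-tied) rate stays put.

def stall_fraction(core_clock_scale, buffer_slots=16, cycles=100000):
    produce_rate = 1.0 * core_clock_scale    # exports per cycle from the CUs
    drain_rate   = 1.0                       # exports per cycle the ROPs retire
    occupancy, stalled = 0.0, 0
    for _ in range(cycles):
        occupancy = max(0.0, occupancy - drain_rate)   # ROP side drains first
        if occupancy + produce_rate <= buffer_slots:
            occupancy += produce_rate                  # export accepted
        else:
            stalled += 1                               # CUs wait on buffers
    return stalled / cycles

for scale in (1.0, 1.1, 1.2, 1.3):           # core overclock factors
    print(scale, round(stall_fraction(scale), 3))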
 
Now we have a pretty good sense of things. In certain respects, Fiji has grown by roughly half-again compared to Hawaii, including peak shader arithmetic, texture filtering capacity, and memory bandwidth. That 512 GB/s of memory bandwidth comes courtesy of HBM, Fiji's signature innovation, and puts the Fury X in a class by itself in at least one department.

In other respects, including peak triangle throughput for rasterization and pixel fill rates, Fiji is simply no more capable in theory than Hawaii. As a result, Fiji offers a very different mix of resources than its predecessor. There's tons more shader and computing power on tap, and the Fury X can access memory via its texturing units and HBM interfaces at much higher rates than the R9 290X.

In situations where a game's performance is limited primarily by shader effects processing, texturing, or memory bandwidth, the Fury X should easily outpace the 290X. On the other hand, if gaming performance is gated by any sort of ROP throughput—including raw pixel-pushing power, blending rates for multisampled anti-aliasing, or effects based on depth and stencil like shadowing—the Fury X has little to offer beyond the R9 290X. The same is true for geometry throughput.

The Fury X substantially outruns the GeForce GTX 980 Ti in terms of integer texture filtering, shader math rates, and memory bandwidth, too, since the 980 Ti more or less matches the 290X in those departments. But the Fury X has only about 70% of the ROP and triangle rasterization rates of the big GeForce.

With Fiji, AMD is offering a rather different vision of how GPUs ought to be used by game developers. That's one reason I'd expect to see continuing fights between the GPU vendors over what effects folks incorporate into PC games. Nvidia will likely emphasize geometric complexity and tessellation, and AMD will probably push for prettier pixels instead of more polygons.
http://techreport.com/review/28499/amd-radeon-fury-x-architecture-revealed
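The ~70% figure checks out against commonly quoted specs (clocks approximate, and the 980 Ti taken at its base clock):

Code:
# Rough ratio of ROP fill and triangle rasterisation rates, Fury X vs.
# GTX 980 Ti, using commonly quoted figures.

fury_x   = dict(rops=64, tris_per_clk=4, mhz=1050)
gtx980ti = dict(rops=96, tris_per_clk=6, mhz=1000)

def gpixels_per_s(gpu):
    return gpu["rops"] * gpu["mhz"] / 1000           # Gpixels/s

def gtris_per_s(gpu):
    return gpu["tris_per_clk"] * gpu["mhz"] / 1000   # Gtris/s

print(gpixels_per_s(fury_x) / gpixels_per_s(gtx980ti))   # ~0.70
print(gtris_per_s(fury_x) / gtris_per_s(gtx980ti))       # ~0.70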
 
One of the curiosities of coin mining is that multiple subsystems are "beating" against each other, such that the optimal performance can be gained by setting the memory clock to a precise, irrational, multiplier of the core clock, which is not necessarily the fastest memory clock attainable
Interesting. Might this be the reason for odd stock memory speeds, for example the GTX 960's 7010 (1753) MHz? Then again, those cards have a variable boost clock, so maybe the boost steps are also fitted to the memory clock?
Also interesting is whether OEMs' factory overclocks take this into account.
 