AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Discussion in 'Architecture and Products' started by iMacmatician, Apr 10, 2014.

  1. 3dilettante

    3dilettante Legend Alpha

    The settings are typically low, and games with various effects like POM tend to turn them off.
    Sony had to get the word out to devs with a fix or clarification about using AF, as there was a trend of titles that omitted it, even when the Xbox One or PC versions had it.
     
  2. [​IMG]
     
    elect, Lightman, Newguy and 3 others like this.
  3. Jawed

    Jawed Legend

    Fascinating idea.
     
  4. flopper

    flopper Newcomer

    http://forums.overclockers.co.uk/showpost.php?p=28199976&postcount=706

    Just to confirm.

    The AMD Radeon™ Fury X is an enthusiast graphics card designed to provide multi-display 4K gaming at 60Hz.

    In addition, Active DisplayPort 1.2a-to-HDMI 2.0 adapters are set to debut this summer. These adapters will enable any graphics cards with DP1.2a outputs to deliver 4K@60Hz gaming on UHD televisions that support HDMI 2.0.
     
    hoom and Lightman like this.
  5. Jawed

    Jawed Legend

    Shader engines are always export limited. There's no point firing finished pixels at the ROPs faster than the ROPs can process them. The rasterisation rate is the fundamental for export. Shader engine rasterisation rate hasn't changed. Has it?

    AMD used to queue and sort pixels as they left shading, to maintain ordering when ordering is required. I presume that still happens. That's a rendering mode choice made by the developer, so isn't always on.
     
    Razor1 likes this.
  6. silent_guy

    silent_guy Veteran Subscriber

    That confirms that they don't have HDMI 2.0 (probably because they didn't want to spend the R&D on a new transceiver for 28nm), but I really don't think that is a big deal. Isn't the latency in TVs too high anyway?
    But for the few for whom it is: a passive converter is $25. Active converters are going to be quite a bit more expensive...
     
  7. UniversalTruth

    UniversalTruth Veteran

    My videocard can send signals to my monitor via both HDMI and DP.

    However, when connected via HDMI, it recognises the monitor as a TV and sends it a suboptimal signal.
     
  8. 3dilettante

    3dilettante Legend Alpha

    It is quite possible for a shader engine to export at a sustained rate lower than what the ROPs can process. An overclocking test isn't going to make that distinction. A mix of wavefronts with sufficient ALU, TEX, or other resource consumption can lead to gaps in the utilization of the export bus and ROPs, as occupancy limits would prevent the launch of additional wavefronts with their own export requirements.
    Increasing the number of wavefronts that can be kept in flight can bring the contended export path back into play, particularly since the bandwidth constraint has been significantly relaxed.
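
As a rough illustration of the occupancy argument above, here is a toy steady-state model. It assumes every wavefront shades for a fixed number of cycles and then needs a fixed number of export cycles, and that only the occupancy limit caps how many wavefronts overlap. The function name and the cycle counts are hypothetical and not taken from any AMD documentation.

```python
def export_bus_utilization(wavefronts_in_flight, shade_cycles, export_cycles):
    """Fraction of cycles the export bus is busy in a toy steady-state model.

    Assumption: each wavefront repeats a (shade, then export) period and the
    export bus can serve one wavefront at a time.
    """
    per_wave_period = shade_cycles + export_cycles
    # Export demand generated by all resident wavefronts, per cycle.
    demand = wavefronts_in_flight * export_cycles / per_wave_period
    # The bus cannot be more than 100% busy; beyond that, wavefronts stall.
    return min(1.0, demand)


if __name__ == "__main__":
    # Hypothetical ALU/TEX-heavy shader: 300 cycles of work, 16 cycles of export.
    for n in (4, 8, 16, 32):
        util = export_bus_utilization(n, shade_cycles=300, export_cycles=16)
        print(f"{n:2d} wavefronts in flight -> export bus busy {util:.0%}")
```

In this sketch a low occupancy leaves gaps on the export bus, and raising the number of wavefronts in flight eventually saturates it, at which point export becomes the contended resource again.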

    Some amount of buffering has to be done on the far end of the export process, and the ROPs have a tile import-filter-export loop going on that is intimately tied to how well memory or the compression subsystem can move them, so even if the ROPs themselves are not on a memory controller's clock domain, what they plug into is.


    MRTs? The rasterizer isn't going to guess how many outbound values there could be. Does it know what format or filtering demands may be placed on the ROPs after coverage has been determined?
     
  9. Jawed

    Jawed Legend

    Fundamentally, I agree with you.

    Whether the ROPs clock with the shader clocks or are fixed, overclocking the shaders is going to expose a limit somewhere in the render back end, whether it's actual fillrate, the delta colour compression system (if it is, like NVidia's, variable in throughput) or buffer cache non-reuse or MC queueing.

    Unfortunately without a lot more data, there's no way to tell how all these subsystems' bottlenecks interact and which of them are actually causing substantially less than linear scaling with overclock.

    One of the curiosities of coin mining is that multiple subsystems are "beating" against each other, such that the optimal performance can be gained by setting the memory clock to a precise, irrational, multiplier of the core clock, which is not necessarily the fastest memory clock attainable (and, depending on the mining algorithm, can be vastly lower than the maximum memory clock). And there aren't any ROPs in that, just kernel execution, export, cache, MC and GDDR. Oh, and the potential for temperature-/current-induced throttling.

    And each card will have different optima for clocks, voltages, issue granularity...

    So, erm, I'm afraid to say, it's practically impossible to characterise formally.

    MRTs are just extra bytes written per work item, just the same as writing fatter pixels. I'm unsure what you're alluding to. The export rate is measured in bytes per clock. Which translates to match the baseline rate of the ROPs for the simplest pixels. At least, that's been the case for ages in AMD/ATI.
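
The bytes-per-clock view above can be written out as simple arithmetic. This is only a sketch: the 64 bytes per clock figure is a hypothetical export width (16 pixels per clock at 4 bytes each), not a published number for any shader engine.

```python
def pixels_per_clock(export_bytes_per_clock, bytes_per_target, num_render_targets=1):
    """Pixels exported per clock when export is limited purely by bytes per clock."""
    # MRTs simply add bytes per pixel, the same as a fatter single target.
    bytes_per_pixel = bytes_per_target * num_render_targets
    return export_bytes_per_clock / bytes_per_pixel


# Hypothetical export width: 16 pixels/clock of RGBA8 (4 bytes each).
EXPORT_BYTES_PER_CLOCK = 64

if __name__ == "__main__":
    print("RGBA8, 1 target :", pixels_per_clock(EXPORT_BYTES_PER_CLOCK, 4))
    print("RGBA8, 4 MRTs   :", pixels_per_clock(EXPORT_BYTES_PER_CLOCK, 4, 4))
    print("RGBA16F, 1 target:", pixels_per_clock(EXPORT_BYTES_PER_CLOCK, 8))
```

Under this assumption, four RGBA8 render targets cost the same export time as one 16-byte pixel, which is the sense in which MRTs are "just extra bytes".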

    I'm certainly not trying to excuse the observed scaling. Even though AMD ditched fast double precision they couldn't fit in more ROPs. If Tonga is any guide, then they spent a hell of a lot of transistors on delta colour compression, instead of adding more ROPs.

    I wonder if AMD's colour delta compression algorithm acts as a multiplier on certain kinds of fillrate.

    Perhaps AMD is doing more buffering-and-sorting after export, to feed into the colour delta compression algorithm (or, it is a fully integrated colour delta compression buffer-and-sort after export). It might be a deliberately-long pipeline with some kind of fancy feed-forward that means that the effective fillrate averages higher than the nominal rate we associate with ROPs.

    A configuration like this could help with small triangles as the quantity of pixel-quads with less than four shaded pixels increases substantially.

    I just have no idea what's going on.
     
  10. 3dilettante

    3dilettante Legend Alpha

    I've not delved deeply into that topic, but I have also read discussion that the limited dependence on bandwidth encourages underclocking the RAM.
    Crossing clock domains does incur synchronization penalties. It sounds like the miners are empirically determining what clocks allow for the most favorable divisors in terms of wait states. GPUs are not monolithic clock domains, with at least the GDDR portion being an obvious boundary. The export bus itself could be hiding a clock boundary as well. The domains might be the same speed, but possibly not fully synchronous.
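
To make the wait-state idea concrete, here is a deliberately simplified sketch. It assumes two free-running clocks that start phase-aligned, that a request becomes ready on every core clock edge, and that it is picked up on the next memory clock edge. Real hardware adds synchronizer stages and unknown phase offsets, so the function name and the clock values are illustrative only.

```python
from fractions import Fraction
from math import gcd


def crossing_stats(core_mhz, mem_mhz):
    """Return (pattern length in core cycles, mean wait in memory clock periods).

    Time is measured in units of 1/(core_mhz * mem_mhz) microseconds, so core
    edges fall on multiples of mem_mhz and memory edges on multiples of core_mhz.
    """
    g = gcd(core_mhz, mem_mhz)
    # Number of core edges before the phase relationship ("beat") repeats.
    pattern_len = core_mhz // g
    # Wait from each core edge to the next memory edge at or after it.
    total_wait = sum((-k * mem_mhz) % core_mhz for k in range(pattern_len))
    mean_wait = Fraction(total_wait, pattern_len * core_mhz)
    return pattern_len, mean_wait


if __name__ == "__main__":
    for core, mem in ((1000, 1000), (1000, 500), (1000, 1500), (1050, 1500)):
        length, wait = crossing_stats(core, mem)
        print(f"core {core} MHz, mem {mem} MHz: "
              f"pattern repeats every {length} core cycles, "
              f"mean wait {wait} memory periods")
```

Even in this toy version, the mean wait depends on the ratio between the two clocks rather than on the raw memory speed, which is consistent with miners finding their best settings empirically.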

    I interpreted the phrase rasterization rate as being the number of pixels being output by the rasterizer at the head of each shader engine per cycle.
     
  11. Lightman

    Lightman Veteran Subscriber

    LegitReviews posted a news item about Fury X driving 3x4K screens in Dirt Rally:

    Apart from their special math to get 12K resolution by pairing 3 4K screens, it seems Fiji has some horsepower. I wonder if this result is mainly possible thanks to Delta Colour Compression in the new chip, as 290X CF has a combined 620GB/s+ vs. 512GB/s for Fiji, or 5632 shaders vs. 4096 shaders, on paper. This should mean CF wins every time. I'm assuming Dirt Rally is still using the same engine as F1 2015 and previous Dirt titles, which scales quite well with CF.
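
For reference, the on-paper comparison can be worked through numerically. The sketch below takes the bandwidth and shader figures exactly as quoted in the post and assumes each screen is a standard 3840x2160 UHD panel; the variable names are only for illustration.

```python
# Assumption: each "4K" screen is a 3840x2160 UHD panel.
UHD_WIDTH, UHD_HEIGHT = 3840, 2160

single_screen_pixels = UHD_WIDTH * UHD_HEIGHT
triple_screen_pixels = 3 * single_screen_pixels
surround_width = 3 * UHD_WIDTH  # the "12K" figure is the combined width only

# Figures as quoted in the post above (on-paper, not measured).
cf_bandwidth_gbs, fiji_bandwidth_gbs = 620, 512
cf_shaders, fiji_shaders = 5632, 4096

bandwidth_ratio = cf_bandwidth_gbs / fiji_bandwidth_gbs
shader_ratio = cf_shaders / fiji_shaders

if __name__ == "__main__":
    print(f"Pixels per frame, one screen   : {single_screen_pixels:,}")
    print(f"Pixels per frame, three screens: {triple_screen_pixels:,}")
    print(f"Combined width: {surround_width} x {UHD_HEIGHT}")
    print(f"290X CF vs Fiji bandwidth: {bandwidth_ratio:.2f}x")
    print(f"290X CF vs Fiji shaders  : {shader_ratio:.2f}x")
```

The three-screen setup renders three times the pixels of one 4K screen, not the nine times that a true 12K-class resolution would imply, and the quoted CF figures come out ahead of Fiji on paper on both counts.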
     
  12. silent_guy

    silent_guy Veteran Subscriber

    That's the thing, right? It's impressive that the current state of the art can drive 3x4k with acceptable frame rates, but in a vacuum without comparisons against other configurations, it's difficult to say whether or not it's just impressive or insanely great.
     
    nnunn likes this.
  13. Jawed

    Jawed Legend

    Some coin algorithms place strong emphasis on bandwidth, Litecoin's use of scrypt being an attempt originally to defeat GPU acceleration of mining by requiring a vast working set :lol:

    So it's not simply a case of finding the lowest practicable memory clock. It very much depends on the algorithm.

    Yes agreed.

    It is.

    Rasterisation rate matches fillrate for 32-bits per pixel shading. Export is denominated in bytes per clock, but that corresponds with RGBA 8-bit per channel. All three are aligned. Well, that's how things used to be.

    I have a bad feeling we're talking at cross purposes.
     

  14. On the other hand, the rendering on those two monitors to the sides doesn't seem to need a lot of work.

    BTW:

    [​IMG]

    Despite everything, it's always good to see an IHV acknowledging that gamers aren't dead after all. :)
     
  15. 3dilettante

    3dilettante Legend Alpha

    I suppose I was approaching the contention issue in a design scenario where occupancy issues caused both the rasterizer and export bus to be underutilized in smaller CU counts. Raising the CU count could eventually reduce occupancy constraints to the point that enough wavefronts need enough export cycles that the underutilization goes away.

    The ROPs came to my mind first because their execution process is linked to moving tiles in and out from memory, which would not scale with overclocking the core and would exert back pressure to wavefronts trying to export. My assumption, perhaps incorrect, was that the rasterizer's end of the process has a higher likelihood of sourcing its data from on-chip and so could benefit from higher GPU clocks relative to the more memory-heavy ROPs, leading to the rasterizer stage and the CU array both waiting with ready outputs for the ROPs to catch up and free up export buffers.
     
  16. gamervivek

    gamervivek Regular

    Yeah, I was there.

    https://forum.beyond3d.com/threads/...as-ati-5870-af-filtering-broken.49106/page-11

    But that was supposed to be a 'mad' smiley.
     
  17. pharma

    pharma Veteran

    http://techreport.com/review/28499/amd-radeon-fury-x-architecture-revealed
     
    Razor1 likes this.
  18. gamervivek

    gamervivek Regular

  19. Kaarlisk

    Kaarlisk Regular Subscriber

    Interesting. Might this be the reason for odd stock memory speeds, for example, GTX 960's 7010 (1753) MHz? Though there, they have a variable boost clock, so maybe the steps are also fitted to the memory clock?
    Also interesting is whether the OEMs' overclocks take this into account.
     
  20. Razor1

    Razor1 Veteran
