AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

How would the register files be organized in these dynamic sized wavefront systems?
The register file is really 256x 2048-bit registers and wouldn't need to change. That's why I'm so impressed by this idea, because all the shenanigans involved in traditional dynamic wavefront formation relating to context manipulation and fragmentation are entirely obviated.

As I understand it, the current design uses three cycles to read operands (64 operands from address A, then address B and finally address C in FMA A * B + C) and a fourth to write resultants.

The 64 operand slots need to be "swizzled in time" in order to feed them into the SIMD: A, B and C 0-15, then A, B and C 16-31 etc. So GCN already has an "operand collector" to support this (not exactly a FIFO, but the lanes read from it as though it were a per-lane FIFO per operand).

This latter operation, modulated by the execution mask, then chooses which operands are delivered in the temporal predication scheme. If after four cycles of register operations (three reads, one write), when the "FIFO" is ready to deliver operands to the instruction that's about to start, there are only two operands in a specific lane, then that lane runs at half speed.

In GCN as it currently exists there has to be some kind of resultant collector, so that a single write cycle can send all 64 resultants to the register file. The resultants arrive in this collector over four cycles, but need to be written coherently as a single operation in a single cycle to a single address (with masking for resultants that should not be written - GCN already has masking for resultant writes, completely distinct from predication).

Similarly, in the temporal predication scheme, the resultant "FIFO" (it's just a buffer, but the document talks about a FIFO), modulated by the execution mask, enables delivery of resultants to the correct subset of registers. This would be a minor change to the existing design.

Couple these two collectors with a forwarding network so that resultants can be consumed by the immediately successive instruction and you have a drop-in replacement for the existing pipeline that uses two collectors and a forwarding network :cool:
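To make the operand-collection and masking idea concrete, here's a toy model (a sketch under my own assumptions about the 16-lane/4-beat split and naming, not the patent's exact scheme): operands for a 64-wide wavefront are gathered into per-lane queues over the register-read cycles, the execution mask filters out inactive work items, and each physical lane then only spends compute beats on the operands it actually received.

```python
# Toy model of temporal predication via the operand "FIFO": names and the
# 16-lane/4-beat split are illustrative assumptions, not hardware documentation.
WAVE_SIZE = 64
PHYS_LANES = 16   # each physical lane serves four work items over four beats

def collect_operands(wave_a, wave_b, wave_c, exec_mask):
    """Gather (a, b, c) operand triples into a queue per physical lane,
    skipping work items whose exec_mask bit is clear."""
    queues = [[] for _ in range(PHYS_LANES)]
    for item in range(WAVE_SIZE):
        if exec_mask & (1 << item):
            queues[item % PHYS_LANES].append((wave_a[item], wave_b[item], wave_c[item]))
    return queues

def issue(queues):
    """Each lane runs only as many beats as it has queued operands: a lane
    with 2 of its 4 work items active needs half the beats of a full lane."""
    results, busy_beats = [], 0
    for q in queues:
        busy_beats = max(busy_beats, len(q))
        results.append([a * b + c for (a, b, c) in q])   # one FMA per beat
    return results, busy_beats

# Example: only work items 0-31 are active, so every lane queues two triples
# instead of four and finishes in two beats instead of four.
a = list(range(WAVE_SIZE)); b = [2] * WAVE_SIZE; c = [1] * WAVE_SIZE
res, beats = issue(collect_operands(a, b, c, (1 << 32) - 1))
print(beats)   # -> 2
```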

It is easy to see the potential improvements of utilization (in branchy code).
It's not merely branchy code: it also helps wavefronts where not all lanes have meaningful data, e.g. quads with fewer than four active fragments.

However (if I understood properly) the register files consume more power than ALUs.
I don't understand that conclusion. The register file is pretty dumb. One could argue that fetching a 2048-bit register when only, say, four disparate 32-bit operands are required is wasteful, but that's a different class of problem.

Operand and resultant collection is already a separate workload in GCN. There's more work in temporal predication, because the execution mask (or its proxy: the count of actual operands) has to modulate lane clocking and resultant routing.

In the end all that's actually sought is a net gain in power efficiency.

Would this new design consume significantly more power in simple (non branchy) code?
The "null" scenario, where the execution mask is "-1", would lead to the regulators setting up all lanes to "full speed" and the resultant write operation working without any "shuffle". That's pretty close to "no cost". But there is still temporal predication functionality (transistors) sat there "idle", consuming power, etc.

The voltage regulation would appear to add power consumption regardless of whether predication was active or not. 4096 VALU lanes each with private voltage regulation seems to be about an order of magnitude more density in voltage regulation than seen anywhere else in computing (uneducated guess to be fair). On the other hand, contemporary super-efficient designs appear to feature high density voltage regulation merely to work.

A coarse grained voltage regulation architecture is going to suffer power losses simply because it's coarse grained (it's harder to distribute regulated voltages than it is to distribute unregulated voltages - impedance and latency are enemies over wide areas) and also because it's slow to react and because it results in more of the chip being supplied with the wrong voltage (or running at the wrong clock).

In general you spend transistors and increase architectural complexity because it results in more compute efficiency. See DCC as an example.

I can't see how we can quantify the pros and cons here. The proof will be in whether a GPU using this scheme is actually built.
 
In modern x86 CPUs, division latency depends on the actual values of the operands, although that may be too complex an operation for this scenario.
GCN uses macros for division...

The example does have a CU with multiple SIMDs, but talks about relative delay between lanes--possibly not in the same SIMD. That scenario makes different execution times more plausible, particularly if dealing with low-precision in one SIMD and heavy DP arithmetic in the other.
They'd still indirectly interact, since the rest of the CU may be synchronous and its arbitration cycles would otherwise be bound by the worst-case time of one of them.
Yes, I've now realised this is the particular case where CU-wide synchronisation is meaningful. The SALU is CU-wide and all of its interactions with the SIMDs and their register files depend on fitting into the four-clock cadence (where the four SIMDs each have a one-clock offset, so that they all spread across the SALU's own four-cycle cadence).

Other instruction types also interact meaningfully with this cadence, I presume: e.g. TEX and MEM both return results directly into RF.

My own unresolved question relates to the fact that the register file is at the centre of practically every instruction, whether VALU or some other unit associated with a CU. The three-read/one-write cadence for the register file is, in many respects, overloaded when all of these operations are happening "simultaneously".

When a CU sub-unit returns a result, it wants RF write bandwidth, but VALU will, generally, fully occupy RF write bandwidth.

A VALU hardware-thread switch (amongst wavefronts) generally incurs latency, usually 16 cycles. My guess is that this is one case when some RF overload is relieved. TEX or MEM result writes to RF may fit into this VALU downtime. The downtime might even be extended to cope with the RF operations that are "queued".

One could argue that RF clocking should be completely disconnected from VALU clocking, to enable less contention for RF cycles. On the other hand, perhaps the overload isn't generally so onerous that it would be worthwhile running RF at 50% (or more, say) higher clock than VALU. There can be long periods of VALU execution where there is zero RF contention.

RF clocking that's disconnected from VALU clocking might be coming anyway, but I can't see how temporal predication, per se, would require this disconnect.
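Nobody outside AMD knows the real arbitration, but a toy model makes the contention argument above concrete: one RF write port, VALU resultants take priority, and TEX/MEM returns queue up until a VALU bubble (such as a wavefront switch) frees a slot. The priorities, bubble spacing and return timings below are invented for illustration.

```python
# Toy single-write-port arbiter: VALU writes win, memory returns wait in a queue
# and drain only during VALU downtime. All numbers are made up for illustration.
from collections import deque

def simulate(cycles, valu_busy_pattern, mem_return_cycles):
    pending = deque()              # TEX/MEM results waiting for a write slot
    mem_written = 0
    for cyc in range(cycles):
        if cyc in mem_return_cycles:
            pending.append(cyc)    # a memory result arrives this cycle
        if valu_busy_pattern[cyc % len(valu_busy_pattern)]:
            continue               # write port consumed by a VALU resultant
        if pending:
            pending.popleft()      # drain one queued return in the idle slot
            mem_written += 1
    return mem_written, len(pending)

# VALU writes on 15 of every 16 cycles (one-cycle bubble per wave switch).
busy = [True] * 15 + [False]
print(simulate(64, busy, mem_return_cycles={5, 6, 40}))   # -> (3, 0)
```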
 
When a CU sub-unit returns a result, it wants RF write bandwidth, but VALU will, generally, fully occupy RF write bandwidth.
Up to 70% of values are only read once and 50% of all values produced are only read once, within three instructions of being produced.

The RFC was able to reduce the number of MRF accesses by 40–70%, saving 36% of register file energy without a performance penalty.
https://www.cs.utexas.edu/users/skeckler/pubs/micro11.pdf
Using a RF cache and some arbitration for collisions within a CU it could be done. If those numbers hold, then half the time you wouldn't be writing to the RF. Reads could be even lower assuming the design didn't oversubscribe the RF. That same reduction should be sufficient to fetch multiple waves for the dynamic wavefront. Previously I assumed doubling the execution units to put more work on the RF, but dynamic waves might also work as only a single operand would be required for each.
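A minimal sketch of the register-file-cache idea from the paper, just to show why main-register-file (MRF) traffic drops when most values die young; the cache size, LRU policy and write-allocate behaviour here are my assumptions, not the paper's exact design.

```python
# Tiny RFC in front of the MRF: results allocate in the RFC and only touch the
# MRF on eviction; reads that hit the RFC cost no MRF access. Illustrative only.
from collections import OrderedDict

class RegisterFileCache:
    def __init__(self, entries=6):
        self.entries = entries
        self.rfc = OrderedDict()   # small cache: reg -> value, in LRU order
        self.mrf = {}              # backing main register file
        self.mrf_reads = self.mrf_writes = 0

    def write(self, reg, value):
        if reg in self.rfc:
            self.rfc.move_to_end(reg)
        elif len(self.rfc) >= self.entries:
            old, val = self.rfc.popitem(last=False)
            self.mrf[old] = val    # only evicted (long-lived) values reach the MRF
            self.mrf_writes += 1
        self.rfc[reg] = value

    def read(self, reg):
        if reg in self.rfc:        # hit: no MRF access
            self.rfc.move_to_end(reg)
            return self.rfc[reg]
        self.mrf_reads += 1        # miss: pay for a main-RF read
        return self.mrf.get(reg)

rf = RegisterFileCache()
rf.write("v0", 42)                 # a short-lived temporary
print(rf.read("v0"), rf.mrf_reads, rf.mrf_writes)   # -> 42 0 0
```

A value produced and re-read within a few instructions never leaves the RFC, so it costs no MRF access at all; only long-lived values get written back, which is broadly where the paper's reported access reduction would come from.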
 
GCN uses macros for division...
It was an example of an operation whose latency was data-dependent, and which was visible to the rest of the pipeline. GPU examples that might match the patent's claim about operand size would be the various precision modes. The 4-cycle cadence is maintained, but the actual propagation of signals from read to writeback would not be the same with operands ranging from 8 to 64 bits, plus the assortment of modifiers and special behaviors in GCN that might devolve into simpler bypassing of values or flushes to a hardwired value. A lot of these cases could have been done in a shorter clock cycle at fixed voltage, if the clock wasn't sized to the longest single-cycle instruction. Alternatively, the shallower operations could be allowed to stretch their propagation time out to the full cycle by running at a lower voltage.
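To put rough numbers on that (the delays below are invented, not measured GCN figures), here's the kind of slack being described when the single-cycle period has to be sized to the slowest operation:

```python
# Invented propagation delays to illustrate the argument: the fixed clock period
# covers the worst case, so simpler/narrower ops leave most of the cycle unused
# (or, equivalently, could stretch to fill it at a lower voltage).
delays_ns = {"fp64 fma": 0.60, "fp32 fma": 0.38, "fp16 mul": 0.22, "bypass/modifier": 0.08}
period = max(delays_ns.values())          # clock sized to the longest op
for op, d in sorted(delays_ns.items(), key=lambda kv: -kv[1]):
    print(f"{op:17s} uses {d / period:4.0%} of the cycle, {1 - d / period:4.0%} slack")
```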

Yes, I've now realised this is the particular case where CU-wide synchronisation is meaningful. The SALU is CU-wide and all of its interactions with the SIMDs and their register files depend on fitting into the four-clock cadence (where the four SIMDs each have a one-clock offset, so that they all spread across the SALU's own four-cycle cadence).
The SALU or the CU's scheduling block could depend on the 4-cycle cadence. However, if a SIMD doesn't signal that it is ready, vector instruction issue could be skipped. Since VALU operations can vary based on certain conditions like tunable DP rate or possibly variable-latency operands like reading directly from the LDS, the CU may already have the ability to forego a VALU op if the SIMD isn't able to resolve itself fully in time.

Other instruction types also interact meaningfully with this cadence, I presume: e.g. TEX and MEM both return results directly into RF.
Possibly, they put results on a buffer/FIFO of some kind, and the SIMD's logic tries to read them in when it can. The vector register file could be dual-ported, with half the bandwidth used by the VALU, or the SIMD's logic might opportunistically steal accesses when operations don't need all operand cycles.

My own unresolved question relates to the fact that the register file is at the centre of practically every instruction, whether VALU or some other unit associated with a CU. The three-read/one-write cadence for the register file is, in many respects, overloaded when all of these operations are happening "simultaneously".
With AMD's exascale concept, only the vector ALUs and their crossbars are asynchronous, and at the same time the SRAM arrays are not kept at near-threshold voltages. One possibility when both conditions are true is that VALUs can at most be the same clock as the register file, and at other times are running at a lower rate with a modified adaptive clock+voltage scheme that tries to keep the lanes from deviating past a certain threshold.
 
The register file is really 256x 2048-bit registers and wouldn't need to change. That's why I'm so impressed by this idea, because all the shenanigans involved in traditional dynamic wavefront formation relating to context manipulation and fragmentation are entirely obviated.

As I understand it, the current design uses three cycles to read operands (64 operands from address A, then address B and finally address C in FMA A * B + C) and a fourth to write resultants.
SIMD execution units are 16 wide. SIMD executes one 64-wide instruction in 4 cycles. Four instructions are in execution at once. My guess is that the first 3 cycles are spent collecting registers, and the fourth executing (I have a theory about how results are written, but it requires a bit more explaining). Lanes 0-15 start on cycle 0, lanes 16-31 start on cycle 1, lanes 32-47 on cycle 2, lanes 48-63 on cycle 3. Then the next instruction starts and interleaves with the tail of the current one. This guarantees that the result of each 16-wide part is ready before the next instruction needs to read that 16-wide part. The 4-cycle cadence means that each 16-wide register file serves just one read per cycle. Thus I am guessing that there are four 16-wide register files per SIMD. There's no sense in making one big multi-ported register file. GCN1 and GCN2 had no direct way of reading registers of different lanes. All cross-lane swizzles went through the LDS hardware (long, variable latency). GCN3 adds DPP instructions for faster cross-lane swizzles inside a SIMD. However, DPP needs 2 cycles between a write and a read of the same register (other independent instructions or NOPs). Thus there's still no need for a direct fast path to access registers of the other 16-wide parts of the wave.

I would be surprised if the register file wasn't split into four 16-wide parts, each with one read/write port (it doesn't need separate read and write ports either). One big register file per SIMD wouldn't make any sense with the 4-cycle cadence. Obviously we will never know, since GCN hasn't got any bank conflicts or any explicit software scheduling requirements, so it's not possible to figure this out by measuring timing differences. This information is not needed by programmers, as it has zero effect on performance. There's no need to be open about your low-level power saving features, unless they create bottlenecks that need to be avoided by software.
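For what it's worth, a throwaway model of that banking guess shows the 4-cycle cadence naturally keeping each of four 16-wide banks down to one access per cycle. The bank-per-quarter-wave mapping is precisely the assumption being illustrated here, not a documented GCN detail.

```python
# Four 16-wide banks, one per quarter of the wave; each quarter does three
# operand reads and one resultant write over four consecutive cycles, offset
# by one cycle per quarter. The assert confirms no bank ever sees two
# accesses in the same cycle. Purely a sketch of the guess above.
BANKS = 4   # one 16-wide register-file bank per quarter-wave

def schedule(num_instructions):
    """Return {cycle: [(bank, instruction, phase)]} for back-to-back FMAs,
    where phases 0-2 are the three operand reads and phase 3 is the write."""
    timeline = {}
    for instr in range(num_instructions):
        for quarter in range(BANKS):
            start = instr * 4 + quarter          # each quarter starts one cycle later
            for phase in range(4):
                timeline.setdefault(start + phase, []).append((quarter, instr, phase))
    return timeline

for cycle, accesses in sorted(schedule(2).items()):
    banks = [b for (b, _, _) in accesses]
    assert len(banks) == len(set(banks))         # at most one access per bank per cycle
    print(cycle, accesses)
```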
 
This SGEMM optimization article spills some details about Nvidia Kepler & Maxwell register banking (see chapter "Register Banks and Reuse"):
https://github.com/NervanaSystems/maxas/wiki/SGEMM

Every time I read these low level CUDA optimization articles, I am glad that GCN doesn't have register bank conflicts (it is basically the only GPU that doesn't). Fighting against the compiler to avoid register bank conflicts wouldn't be that much fun. Of course we need to fight against LDS bank conflicts, but at least you have full control over the LDS addressing. If the code runs badly it's always your fault :)
 
Using a RF cache and some arbitration for collisions within a CU it could be done. If those numbers hold, then half the time you wouldn't be writing to the RF. Reads could be even lower assuming the design didn't oversubscribe the RF.
For most of NVidia's history with a real register file (G80 onwards), the register file has been an expensive lumbering beast, using banking and various kinds of quite expensive operand collectors coupled to hardware-thread scoreboarding to solve RF collisions. These days it's much more efficient, but it's still lumbering compared with GCN, as banking and compiler targeting are still a requirement, as I understand it.

AMD's GPUs have had resultant forwarding for the "next" instruction since R300 (in GCN it is strictly for the next instruction, but in R300 the resultant could live for multiple instructions before being consumed). So it's not strictly-speaking a cache (LRF is a better term), and it doesn't mean that resultants aren't written to RF, but it's one of the reasons why AMD's register file has always supported extremely short execution pipeline latencies.

The R300 architecture has multiple write ports. I'm unclear on whether GCN does (which would entirely solve the problem I described earlier), but I'm unsure whether the expense would be worth it in GCN (because of VALU downtime).

In my opinion what they are describing in that paper is very close to the architecture seen in CPUs. A small register file (LRF), backed by a cache hierarchy. Larrabee, with its nice swizzles across its lanes and relatively small RF, is the epitome of this kind of design.

I would like to see GPUs generalised to this kind of architecture: it's sickening that we're still fighting with register allocation. In the meantime, NVidia's larger RF these days is a significant advantage over GCN.

I've talked in the past about an alternative RF architecture for AMD: a design that supports only four hardware threads per SIMD (no more, no less) with each hardware thread having a private RF. This would enable more RF capacity. But AMD seems determined to keep us fighting register allocation.

That same reduction should be sufficient to fetch multiple waves for the dynamic wavefront. Previously I assumed doubling the execution units to put more work on the RF, but dynamic waves might also work as only a single operand would be required for each.
G80 was, because of its crazy expensive scheduling and register file and operand collector, already capable of doing all this (though it was never done for real). It could send DWORDs from arbitrary locations in RF to arbitrary lanes. (This actually requires a different mapping of registers versus hardware threads in RF, but that mapping is switched on by the compiler, at the cost of increased RF latency)...

If SIMD lanes can be clocked independently, then there really is no point in dynamic wavefront formation or building a CU from multiple, various-width, SIMDs.

NVidia's 32-wide SIMD design is more amenable to improved efficiency for predication. Turning off individual lanes is easier than having multiple clocking speeds (and by implication, "off") that this patent application describes. But with more than one SIMD, and the varying widths, things aren't so simple...
 
It was an example of an operation whose latency was data-dependent, and which was visible to the rest of the pipeline. GPU examples that might match the patent's claim about operand size would be the various precision modes. The 4-cycle cadence is maintained, but the actual propagation of signals from read to writeback would not be the same with operands ranging from 8 to 64 bits, plus the assortment of modifiers and special behaviors in GCN that might devolve into simpler bypassing of values or flushes to a hardwired value. A lot of these cases could have been done in a shorter clock cycle at fixed voltage, if the clock wasn't sized to the longest single-cycle instruction. Alternatively, the shallower operations could be allowed to stretch their propagation time out to the full cycle by running at a lower voltage.
I think this explanation is the clearest and simplest match with the patent application in the sense of varying clocks, but it doesn't require per-lane clocking: all lanes are running the same instruction with the same bitness of operands. Output modifiers (e.g. 2x, 0.5x, don't write) are "fast" operations that don't come with a clocking penalty. It would be simpler to not have these sub-operations than to implement per-lane clocking just to make them synchronise, if a re-design resulted in them introducing a clocking penalty. They're used too infrequently, in general (5 or 10% of kernels have any instructions that do this?).

I don't like the way the patent dismisses "a few nanoseconds" delay, that apparently arises naturally due to lane clocking mismatches. This is basically saying that after a while one lane is a whole instruction behind another lane. It literally doesn't compute.

If, instead, the argument went like this: "we can't run a SIMD-16 at 3 GHz because we can't keep the lanes synchronised across the entire CU's four SIMDs, which do all need to be synchronised, let alone within each SIMD - instead we run at 1.5GHz because the tolerances are slack enough to make it work" followed by saying, "we can run the entire CU at a median rate of 3GHz, because individual lanes now adjust automatically around the 3GHz target" - then I'd be convinced.

Perhaps this is actually all the patent application is saying? It's not explicit in this though.

To be honest, getting substantially higher SIMD clocks solely because of this, even without the power savings from temporal predication, seems very much worthwhile.
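For what it's worth, the shape of that argument is easy to put in toy form (the per-lane frequency spread below is invented): with one shared clock the worst lane sets the pace for everything, while per-lane adjustment lets the ensemble sit near its median capability.

```python
# Invented per-lane capabilities, only to illustrate lockstep-vs-per-lane clocking.
import random, statistics

random.seed(0)
lane_fmax_ghz = [random.gauss(3.0, 0.25) for _ in range(64)]   # what each lane could do
lockstep = min(lane_fmax_ghz)              # one shared clock: worst lane sets it
per_lane = statistics.mean(lane_fmax_ghz)  # lanes adjusting around their own limits
print(f"lockstep clock: {lockstep:.2f} GHz, per-lane mean: {per_lane:.2f} GHz")
```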

The SALU or the CU's scheduling block could depend on the 4-cycle cadence. However, if a SIMD doesn't signal that it is ready, vector instruction issue could be skipped. Since VALU operations can vary based on certain conditions like tunable DP rate or possibly variable-latency operands like reading directly from the LDS, the CU may already have the ability to forego a VALU op if the SIMD isn't able to resolve itself fully in time.
NOPs seem to be commonly used for these kinds of scenarios as far as I can tell. Sometimes the NOP is put there by the compiler as it knows there's fixed duration latency. (Sometimes the compiler just puts entirely superfluous NOPs into the code :mad: )

Though I don't think anyone outside of AMD knows the full inner workings of the WAITCNT family of instructions and how data flows either as results arrive or are queued, etc.
 
AMD's GPUs have had resultant forwarding for the "next" instruction since R300 (in GCN it is strictly for the next instruction, but in R300 the resultant could live for multiple instructions before being consumed). So it's not strictly-speaking a cache (LRF is a better term), and it doesn't mean that resultants aren't written to RF, but it's one of the reasons why AMD's register file has always supported extremely short execution pipeline latencies.
The access patterns presented in that paper are what I found interesting. In the paper cache would be accurate as it was an energy saving feature more than anything.

The same would apply for GCN, but I was envisioning more capabilities with the cache: emulating additional ports, for instance, since it shouldn't complicate the RF. It would enable the asynchronous behavior and allow temporal execution given sufficient size. Grab 16 lanes of data and feed them into a scalar as an independent scratchpad, for example.

If SIMD lanes can be clocked independently, then there really is no point in dynamic wavefront formation or building a CU from multiple, various-width, SIMDs.
It should still increase apparent execution resources. Personally I like the temporal idea over the variable wavefront idea as well. That's why I theorized a 16+1+1 SIMD design where the scalars would largely be temporal and operate like DSPs or more capable accumulators. That would make sense with "per lane" clocks and asynchronous behavior where speeds were adjusted to line up with RF access. It also tracks with the "nested" instructions AMD listed previously, as commonly used instruction patterns could be hard-coded.

Even with independent clocks, there will be limits to the clock deltas and scheduling complexity. By the time all lanes are independent they would have Zen, likely with less prediction logic, and hardly what I'd call a parallel architecture.
 
Sorry for being so out of touch, but do we know when to expect Vega's release? I feel like we've been hearing rumors for ever.
 
I think this explanation is the clearest and simplest match with the patent application in the sense of varying clocks, but it doesn't require per-lane clocking: all lanes are running the same instruction with the same bitness of operands.
The patent is purposefully staying broad on the exact relationship of the lanes. It makes a passing reference that an embodiment could have for example 64 lanes, and that an embodiment could be grouping those lanes into a set of SIMD units.
However, the exact set of associations, whether some of those lanes are independent "scalar" lanes or whether there are special states like being predicated off, would be covered by the patent so long as some of them happen to be doing something different.

As long as the plurality of lanes might not finish at the same time, it matters in a synchronously clocked CU, and the patent's various ideas for speeding up or slowing down the lanes come into effect.
The scenarios it envisions differ in how reactive/proactive they are (detect lack of progress or predict it based on instruction type/data), and how freely the lanes can be adjusted in voltage/clock relative to one another (truly per-lane internal clocks, amortizing the clocking adjustment over multiple instructions).

Output modifiers (e.g. 2x, 0.5x, don't write) are "fast" operations that don't come with a clocking penalty.
They don't come with a clocking penalty as long as the clock period+voltage are raised so that the worst-case is covered plus some safety margin. It may be that there's some other big operation that takes longer at the same voltage, so they're "free".
The current adaptive clocking scheme is already removing a lot of the safety margin, and now it seems this patent is looking at the base clock period and voltage, and opting to cut into it. They currently appear to have no effect because the designs must have a clock+voltage point that hides them.

It would be simpler to not have these sub-operations than to implement per-lane clocking just to make them synchronise, if a re-design resulted in them introducing a clocking penalty. They're used too infrequently, in general (5 or 10% of kernels have any instructions that do this?).
The context of the patent is a US Department of Energy contract, which I think means exascale. It's not that the modifiers necessitate per-lane clocking in a general sense, but that the GPU is going to be cutting voltages and pushing timings as close to the limit as possible.

I don't like the way the patent dismisses "a few nanoseconds" delay, that apparently arises naturally due to lane clocking mismatches.
I think the patent is just stating that the operations themselves take a variable amount of wall-clock time, not cycles, and gives a delay of one or more nanoseconds as an example. For an exascale GPU running at or below 1 GHz (perhaps significantly below), all events happen on the order of nanoseconds.

This is basically saying that after a while one lane is a whole instruction behind another lane. It literally doesn't compute.
The sentence "The difficulty is that the slowest lane will determine the performance for all the lanes." would indicate that the slow lane won't be behind. Everything else will be held for its sake.

If, instead, the argument went like this: "we can't run a SIMD-16 at 3 GHz because we can't keep the lanes synchronised across the entire CU's four SIMDs, which do all need to be synchronised, let alone within each SIMD - instead we run at 1.5GHz because the tolerances are slack enough to make it work" followed by saying, "we can run the entire CU at a median rate of 3GHz, because individual lanes now adjust automatically around the 3GHz target" - then I'd be convinced.
The argument is more that they can keep things synchronized, but would be leaving a vast majority of the cycle time unused and with voltage higher than necessary for many simpler operations.

To be honest, getting substantially higher SIMD clocks solely because of this, even without the power savings from temporal predication, seems very much worthwhile.
Possibly, although they may be aiming for lower voltages at more modest clock ranges, given the context. Granted, if power consumption is reduced it might allow for more headroom in a non-HPC product.

NOPs seem to be commonly used for these kinds of scenarios as far as I can tell. Sometimes the NOP is put there by the compiler as it knows there's fixed duration latency. (Sometimes the compiler just puts entirely superfluous NOPs into the code :mad: )

Though I don't think anyone outside of AMD knows the full inner workings of the WAITCNT family of instructions and how data flows either as results arrive or are queued, etc.
A NOP couldn't be inserted into the middle of the execution of a VALU operation that is loading one operand from the LDS. It would still be sitting in the instruction buffer, and no wait states are listed for this class of instructions. S_WAITCNT couldn't work either, since it is a wait set in the instruction stream and is either expired or too late for the VALU instruction.

One idea I have is that the VALU_CNT that was mentioned briefly is being used, and for external purposes the architecture behaves as if there's a wait count of 0 at all times for the vector path. It simplifies things from a software perspective, but still allows for some complex behaviors internally.
 
I would be surprised if the register file wasn't split into four 16-wide parts, each with one read/write port (it doesn't need separate read and write ports either). One big register file per SIMD wouldn't make any sense with the 4-cycle cadence. Obviously we will never know, since GCN hasn't got any bank conflicts or any explicit software scheduling requirements, so it's not possible to figure this out by measuring timing differences.
To the read/write port idea, if there were a set of wavefronts co-resident on the same SIMD, one FMA, one LDS, and one VMEM that just spammed those operations, could conflicts show as reduced throughput?
 
SIMD execution units are 16 wide. SIMD executes one 64-wide instruction in 4 cycles. Four instructions are in execution at once.
I can't tell what you mean by "execution" here and why you say "four" instructions are in execution at once. Perhaps you're referring to decode, operand fetch, compute and resultant write?

The logic to compute an instruction runs for four clocks. After that, another instruction runs for four clocks. Therefore at any point, the 4-clock computation logic is working on either one or two instructions. Labelling the logic stages as A, B, C and D:

A - instruction 2, work items 0 to 15
B - instruction 1, work items 48 to 63
C - instruction 1, work items 32 to 47
D - instruction 1, work items 16 to 31

I want to check what you're saying before going any further.
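For what it's worth, a throwaway sketch of that rotation reproduces the A-D table above and shows the computation logic only ever holding pieces of one or two instructions at a time (instructions are numbered from 0 here, where the table counts from 1; this is an illustration of the labelling, not hardware documentation):

```python
# Stage A is the quarter-wave that entered this cycle, D the one that entered
# three cycles ago; back-to-back 4-cycle instructions with a one-cycle offset
# per quarter.
def stage_contents(cycle):
    """Which (instruction, work-item range) occupies each of stages A-D."""
    stages = {}
    for label, depth in zip("ABCD", range(4)):
        start_cycle = cycle - depth            # when this quarter entered stage A
        if start_cycle >= 0:
            quarter = start_cycle % 4
            stages[label] = (start_cycle // 4, f"items {quarter * 16}-{quarter * 16 + 15}")
    return stages

for cyc in range(4, 7):
    s = stage_contents(cyc)
    in_flight = {instr for instr, _ in s.values()}
    print(cyc, s, "instructions in flight:", len(in_flight))
# Cycle 4 matches the table: A holds instruction 1 (i.e. the second instruction),
# items 0-15, while B/C/D finish instruction 0's items 48-63, 32-47 and 16-31.
```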
 
Don't think this has been posted yet.

http://wccftech.com/amd-radeon-rx-vega-8-gb-hbm2-teaser-video-leak/

Did we previously have a confirmation of a CLC-cooled Vega 10 XT?

So first thing’s first, the AMD Radeon RX Vega graphics card using the full blown Vega 10 GPU and up to 8 GB HBM2 VRAM will feature an AIO liquid cooler. Just like the Fury X, the card will benefit from a heatsink that utilizes liquid cooling for better thermal performance. The card will feature a unique silver and red color scheme with a GPU tachometer that displays the load on the card. There are also small switches on the cut out located at the backplate. These switches can be configured manually to change color of the tachometer which is set to red by default.

And apparently this is new as well, but I think we've seen it before.

https://streamable.com/4iyfi

But that article had a comment with a video that I think might actually be new.


Like many others, I immediately noticed there was a Halo 3 image flashing through the background (see below). Purely from personal experience, I'm borderline certain that image isn't new (Halo 3 came out in 2007), so either some marketing technician fucked up with an old image or Halo 3 is getting some kind of new release, potentially tied to either Ryzen 5 and/or Vega.

b4amWpOh.jpg
 
Maybe the Halo: Master Chief Collection is coming to PC? I'd be down for that.

Regards,
SB

That's possible, but my gut is that the image was misused. MS has never been generous about sharing Halo with PC gamers and I have a hunch the MCC wasn't super successful on Xbox.

Luckily, with Halo Online, you can get a near-identical Halo 3 multiplayer experience (forge and all) right now due to the hard work of modders.

Looks like the same video from which the bad quality off-screen shots were posted earlier.

Yeah, the only new info that I could tell was the existence of a water-cooled top tier Vega part. I don't think we've explicitly seen a rumor for that.

Though it isn't surprising in the least. AMD has the relationships and expertise to make it happen. Keep the GPU chilled at like 65C rather than 80+C and you'll earn some of your power budget back, which can be used to provide more performance.

Anandtech worded it well.

The final element in AMD’s plan to improve energy efficiency on Fiji is a bit more brute-force but none the less important, and that’s temperature controls. As our long-time readers may recall from the R9 290 (Hawaii) launch in 2013, with the reference R9 290X AMD picked a higher temperature gradient over lower operating temperatures in order to maximize the cooling efficiency of their reference cooler. The tradeoff was that they had to accept higher leakage as a result of the higher temperatures, though as AMD’s second-generation 28nm product they felt they had leakage under control.

But with R9 Fury X in particular and its large, overpowered closed loop liquid cooler, AMD has gone in the opposite direction. AMD no longer needs to rely on temperature gradients to boost cooler performance, and as a result they’ve significantly dialed down the average operating temperature of the Fiji GPU in R9 Fury X in order to further mitigate leakage and reduce overall power consumption. Whereas R9 290X would go to 95C, R9 Fury X essentially tops out at 65C, as that’s the point after which it will start ramping up the fan speed rather than allow the GPU to get any warmer. This 30C reduction in GPU temperature undoubtedly saves AMD some power on leakage, and while the precise amount isn’t disclosed, as leakage is a non-linear relationship the results could be rather significant for Fiji.

If I had to bet, I'd say that amd would offer a premium amd-branded clc-cooled Vega 10 XT, then a "regular" custom cooled Vega 10 XT, then the typical custom cooled Vega 10 Pro.

That would probably let them reap the performance benefits of the clc for purposes of matching/beating the 1080 ti, but also offer a cheaper option that doesn't have the expensive clc.
 
That's possible, but my gut is that the image was misused. MS has never been generous about sharing Halo with PC gamers and I have a hunch the MCC wasn't super successful on Xbox.

Didn't Spencer say all Xbox exclusives would be getting a dual-release between Xbox and UWP from now on?
 
I think so. Forza was a declaration of intentions about that. The real question is whether this will also apply to Halo.

Sent from my HTC One using Tapatalk
 