AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Seeing less than ideal scaling in Fiji memory-wise... did we explore the possibility that 8×512-bit wide memory controllers might not be sufficiently fine-grained for current workloads? Even if you apply 128 KiB of L2 to each of them?
HBM has the same access granularity as GDDR5. The difference is that there are many more channels running at the same time from a single stack. If Fiji's block diagram is correct, every HBM controller is now managing a whole bunch of independent access queues, and this would definitely require some work on the driver side to balance out the requests more efficiently.
 
AFAIK we are still looking for performance ~45% over Hawaii at the same clock, since memory as a bottleneck should be much relieved, if not mostly eliminated.
What scaling?
In the review thread I linked the Hardware.fr review showing Crysis 3 and The Witcher 3 scaling almost perfectly over the HD7970 in the same test.
The fillrate test at TechReport shows 64 GP/s instead of 67, whereas the 290X test shows 66 instead of 67. Is that what you're referring to?
Also, the HBM configuration results in 32 channels of 128 bits each, since each HBM chip has 8 channels.
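To visualize the difference in channel count, here's a minimal sketch assuming a simple round-robin interleave (the actual address swizzling and interleave granularity are undisclosed, so the 256-byte stride below is purely illustrative):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative only: assume consecutive 256-byte blocks rotate round-robin
// across channels. Real GPUs use undisclosed address swizzling/hashing, so
// this is a toy model of channel count, not Fiji's actual mapping.
constexpr uint64_t kInterleaveBytes = 256;

uint32_t ChannelForAddress(uint64_t addr, uint32_t numChannels) {
    return static_cast<uint32_t>((addr / kInterleaveBytes) % numChannels);
}

int main() {
    // 8 controllers on a Hawaii-style block diagram vs. 32 HBM channels
    // (4 stacks x 8 channels of 128 bits each) on Fiji.
    for (uint64_t addr = 0; addr < 16 * kInterleaveBytes; addr += kInterleaveBytes) {
        std::printf("block @%6llu -> 8-channel ch %2u, 32-channel ch %2u\n",
                    static_cast<unsigned long long>(addr),
                    ChannelForAddress(addr, 8),
                    ChannelForAddress(addr, 32));
    }
    return 0;
}
```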

I am talking about what AMD depicts on their block diagram. Eight memory controllers, same as in Hawaii, serving eight times as wide a memory bus.
Additionally, we are seeing far less scaling than expected in memory-intensive compute tests (e.g. Luxmark).

HBM has the same access granularity as GDDR5. The difference is that there are many more channels running at the same time from a single stack. If Fiji's block diagram is correct, every HBM controller is now managing a whole bunch of independent access queues, and this would definitely require some work on the driver side to balance out the requests more efficiently.

^^ This, rather.
 
HBM has the same access granularity as GDDR5. The difference is that there are many more channels running at the same time from a single stack. If Fiji's block diagram is correct, every HBM controller is now managing a whole bunch of independent access queues, and this would definitely require some work on the driver side to balance out the requests more efficiently.

Is there a practical difference between (1) a single controller managing 8 channels and (2) 8 separate controllers?
 
I doubt it's a single controller accessing memory in 512-bit chunks; that would be rather impractical, especially for random accesses.
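A back-of-the-envelope sketch of that intuition, with assumed round numbers (the 32-byte request and per-channel burst sizes are not confirmed Fiji figures): if several channels were ganged together and accessed in lockstep, a small random request would drag in far more data than it needs.

```cpp
#include <cstdio>

// Toy model: a random access that only needs requestBytes still fetches at
// least one full burst from however many channels are ganged together.
double UsefulFraction(double requestBytes, double burstBytesPerChannel, int gangedChannels) {
    return requestBytes / (burstBytesPerChannel * gangedChannels);
}

int main() {
    const double request = 32.0; // assumed small random access
    const double burst   = 32.0; // assumed minimum burst per 128-bit channel
    std::printf("independent channel:    %.0f%% of fetched data is useful\n",
                100.0 * UsefulFraction(request, burst, 1));
    std::printf("4 channels in lockstep: %.0f%% of fetched data is useful\n",
                100.0 * UsefulFraction(request, burst, 4));
    return 0;
}
```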
 
AFAIK we are still looking for performance ~45% over Hawaii at the same clock, since memory as a bottleneck should be much relieved, if not mostly eliminated.
I can't see any justification for that point of view.

I am talking about what AMD depicts on their block diagram. Eight memory controllers, same as in Hawaii, serving eight times as wide a memory bus.
I don't see how the count of MCs is meaningful without knowing what they're doing, especially as we know there are 32 channels in a 4-way HBM system. At best you could argue that L2 is per MC.

Additionally, we are seeing far less scaling than expected in memory-intensive compute tests (e.g. Luxmark).
Is there actual data? Do you know that Luxmark scales?

Actual data is required for a discussion.
 
If Fiji's block diagram is correct, every HBM controller is now managing a whole bunch of independent access queues, and this would definitely require some work on the driver side to balance out the requests more efficiently.
Is it realistic to expect that the driver has this much control over scheduling operations at this low a level? Especially since there are a bunch of cache levels involved?
 
I can't see any justification for that point of view.
I'd say 64/44 equates to ~45% more. Or what exactly is it you're lacking justification for? I am talking about tests that scale approximately 40-45% over Hawaii, apart from pure synthetics.

I don't see how the count of MCs is meaningful without knowing what they're doing, especially as we know there are 32 channels in a 4-way HBM system. At best you could argue that L2 is per MC.
That's why I'm asking.

Is there actual data? Do you know that Luxmark scales?

Actual data is required for a discussion.
There is, and yes, it scales, even across multiple accelerators.
 
I'd say 64/44 equates to ~45% more. Or what exactly is it you're lacking justification for? I am talking about tests that scale approximately 40-45% over Hawaii, apart from pure synthetics.
Theoretical fillrate is precisely 0% faster than Hawaii per clock. Other things in Fiji are also limited in their advantage over Hawaii.
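To make that concrete, here's a quick sketch using only the unit counts already mentioned in this thread (Fiji 64 CUs vs. Hawaii 44, both with 64 ROPs); the "weakest link" view is a deliberate oversimplification, not a real performance model:

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Per-clock unit counts as discussed in this thread.
    const double fijiCUs = 64.0, hawaiiCUs = 44.0;   // shader throughput
    const double fijiROPs = 64.0, hawaiiROPs = 64.0; // fillrate

    const double shaderGain = fijiCUs / hawaiiCUs;   // ~1.45x
    const double fillGain   = fijiROPs / hawaiiROPs; // 1.00x

    // Oversimplified "weakest link" view: a fillrate-bound workload sees none
    // of the ~45% shader advantage at equal clocks.
    std::printf("shader-bound upper bound: +%.0f%%\n", 100.0 * (shaderGain - 1.0));
    std::printf("fill-bound upper bound:   +%.0f%%\n", 100.0 * (fillGain - 1.0));
    std::printf("mixed, limited by weakest link: +%.0f%%\n",
                100.0 * (std::min(shaderGain, fillGain) - 1.0));
    return 0;
}
```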

If you want to evaluate scaling, compare with the HD7970, where almost every parameter of Fiji is twice that of Tahiti per clock. Or against Tonga/Antigua. That's why I mentioned my post in the review thread, where 2 games do scale as expected, at least at Hardware.fr. Most games don't.

Why? No good answer. Driver? CPU overhead? Geometry bottlenecks? API overhead? etc. In my opinion, once we have a per-game answer to that question we can get somewhere.

The first step would be to try to find settings in each game that do result in scaling according to theoreticals. That'll then give you a list of graphics options (the ones that had to be turned off to get there) that hurt scaling.

I wonder how long that list is. And which were written by NVidia.

If possible I'd look at the scaling of minimum framerates, excluding those caused by texture loading glitches.

Draw up a similar list of options that hurt scaling for NVidia and that'll make for a cool article comparing the two architectures and drivers.
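For what it's worth, a rough sketch of what that per-option sweep could look like; the option names and the measure() hook are placeholders for a real benchmark harness:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical harness: for each graphics option, measure the Fury X / 290X
// framerate ratio with the option on and off, and flag options whose presence
// noticeably hurts scaling. measure() is a placeholder for a real benchmark run.
struct Option { std::string name; };

double measure(const std::string& gpu, const Option& opt, bool enabled) {
    (void)gpu; (void)opt; (void)enabled;
    return 60.0; // placeholder: plug in real benchmark results here
}

int main() {
    const std::vector<Option> options = {{"option_a"}, {"option_b"}, {"vendor_effect_c"}};
    const double expectedScaling = 1.45; // theoretical per-clock shader advantage

    for (const auto& opt : options) {
        const double on  = measure("Fury X", opt, true)  / measure("290X", opt, true);
        const double off = measure("Fury X", opt, false) / measure("290X", opt, false);
        if (off >= 0.95 * expectedScaling && on < 0.95 * off) {
            std::printf("option '%s' appears to hurt scaling (%.2fx -> %.2fx)\n",
                        opt.name.c_str(), off, on);
        }
    }
    return 0;
}
```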

There is, and yes, it scales, even across multiple accelerators.
On that page the same person has posted results for 2x HD7970

http://luxmark.info/node/417

and 3x HD7970

http://luxmark.info/node/639

Those results don't indicate linear scaling with GPU count.

Have you tried underclocking the Fury X to observe variations in performance in this test? It's probably easier to measure differences by underclocking than overclocking, since there's more range to play with!
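If someone does run that, here's a tiny sketch of how the underclocking sweep could be read (the clock points are placeholders; the idea is simply to check whether the score tracks the core clock or stays flat, which would point at a different bottleneck):

```cpp
#include <cstdio>
#include <vector>

struct Sample { double coreMHz; double score; };

int main() {
    // Hypothetical underclocking sweep: fill in measured Luxmark scores here.
    std::vector<Sample> samples = {
        {1050.0, 0.0}, {950.0, 0.0}, {850.0, 0.0}, {750.0, 0.0},
    };
    // If score/clock stays roughly constant the test is core-clock bound;
    // if the score barely moves as the clock drops, something else limits it.
    for (const Sample& s : samples) {
        if (s.score > 0.0)
            std::printf("%4.0f MHz: %.4f points per MHz\n", s.coreMHz, s.score / s.coreMHz);
    }
    return 0;
}
```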
 
What kind of problem? The control data is accessed first. It almost never misses the cache (see above). If the control data load misses the cache the GPU will just load that cache line, and you get doubled latency for this particular ROP block load, but lots of further loads will (with high likelihood) access the same control data cache line.
If latency is a problem you could simply assume that the tile is uncompressed if the control data is not cache resident. But if, as mczak suggests, fast clear is part of the same scheme, then that's unlikely. In any case, there's also opportunity for prefetching tile control data based on rasteriser/early-Z output.
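A minimal sketch of the lookup being described, with the control-word states and caching policy assumed for illustration (the actual metadata format isn't public):

```cpp
#include <cstdint>
#include <unordered_map>

// Assumed per-tile compression states; the real hardware encoding is not public.
enum class TileState : uint8_t { Uncompressed, Compressed, FastCleared };

struct ControlCache {
    std::unordered_map<uint64_t, TileState> lines; // cached control data per tile

    // On a ROP tile access, consult the control data first. A miss costs an
    // extra memory round trip for the control line (the "doubled latency"
    // case), but nearby tiles sharing the same line will then hit.
    TileState Lookup(uint64_t tileId, bool& extraLatency) {
        auto it = lines.find(tileId);
        if (it != lines.end()) { extraLatency = false; return it->second; }
        extraLatency = true;
        TileState fetched = TileState::Compressed; // placeholder for the memory fetch
        lines.emplace(tileId, fetched);
        return fetched;
    }
};

int main() {
    ControlCache cache;
    bool extraLatency = false;
    cache.Lookup(42, extraLatency); // first access to a line: miss, extra latency
    cache.Lookup(42, extraLatency); // subsequent access: hit, no extra latency
    return 0;
}
```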
 
What is the policy for a miss when attempting to write changed compression status to the control line? For that matter, is the assumption that nothing else but the ROP may have cached a copy of the now-stale control line?

Resources bound to ROPs are always write-only; you cannot create read-write RTVs, or simultaneously bind a read-only SRV and a write-only RTV for the same resource. The bound write-only RTV is also the only active writeable "pointer" (a.k.a. declspec/attribute restrict). Only UAVs can be read-write [and scatter], and append/consume is a staggered-access read-write resource.
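For reference, a minimal D3D11 sketch of that distinction; error handling and cleanup are omitted, and this only illustrates the API-level binding rules, not any particular hardware:

```cpp
#include <d3d11.h>

void IllustrateBindings(ID3D11Device* device, ID3D11DeviceContext* ctx) {
    // A texture can carry RTV, SRV and UAV bind flags at creation time...
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width = 1024; desc.Height = 1024; desc.MipLevels = 1; desc.ArraySize = 1;
    desc.Format = DXGI_FORMAT_R32_FLOAT;
    desc.SampleDesc.Count = 1;
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE |
                     D3D11_BIND_UNORDERED_ACCESS;

    ID3D11Texture2D* tex = nullptr;
    ID3D11RenderTargetView* rtv = nullptr;
    ID3D11ShaderResourceView* srv = nullptr;
    ID3D11UnorderedAccessView* uav = nullptr;
    device->CreateTexture2D(&desc, nullptr, &tex);
    device->CreateRenderTargetView(tex, nullptr, &rtv);
    device->CreateShaderResourceView(tex, nullptr, &srv);
    device->CreateUnorderedAccessView(tex, nullptr, &uav);

    // ...but while the RTV is bound for output, the same subresource cannot
    // also be bound as an SRV input; the runtime breaks the read-write hazard
    // by nulling the read binding. The bound RTV is effectively write-only.
    ctx->OMSetRenderTargets(1, &rtv, nullptr);

    // A UAV, by contrast, may be both read and written (and scattered to)
    // by a single shader.
    ctx->CSSetUnorderedAccessViews(0, 1, &uav, nullptr);
}
```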
 
Resources bound to ROPs are always write-only; you cannot create read-write RTVs, or simultaneously bind a read-only SRV and a write-only RTV for the same resource. The bound write-only RTV is also the only active writeable "pointer" (a.k.a. declspec/attribute restrict). Only UAVs can be read-write [and scatter], and append/consume is a staggered-access read-write resource.
The L2 line-sharing conflict was also resolved with the answer to how the control data is allocated statically to a specific RBE. It could have been write-only with respect to software, but then different RBEs or compressors would still be passing around their own variants due to partial updates to the 64-byte region.
 
Do you mean each control data word (at memory access granularity of 32B) is statically assigned to one RBE, or that a fixed 2B of a control data word are statically assigned to each RBE?
 
The setup that part was addressing would be the case where control data is stored in the L2, which would be at a 64B granularity. A physical L2 cache line does not historically have a relationship with an RBE or with the link between the color caches and memory, so one would need to be defined by that proposed change.

The granularity of requiring a separate cache line per RBE or color cache is somewhat more expensive capacity-wise, since it loses some of the locality of access between adjacent tiles and pulls in data for distant ones, but it avoids a problem where the L2 does not readily support partial updates without heavier synchronization. If segments of a line were updated with the compression status of separate ROP partitions, the behavior would differ from how traffic like L1 writes is handled. My initial question was prompted by the fact that this path is historically separate from the vector memory path, and interacting with a subsystem that has different behaviors usually requires additional synchronization, which for GCN usually means going back to memory.
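A toy illustration of the sharing pattern in question, using the 64-byte line and 2-byte-per-RBE split from the discussion above (this shows the software analogue only, not a claim about how the hardware handles it):

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Toy model of one 64-byte control line shared by several RBEs, each owning a
// 2-byte slice. In software, concurrent writes to distinct bytes are fine; the
// concern raised above is that a cache which tracks and writes back whole
// 64-byte lines would need some way to merge these partial updates from
// different producers (i.e. extra synchronization), whereas giving each RBE
// its own line avoids that at a cost in capacity and locality.
struct ControlLine {
    std::array<uint8_t, 64> bytes{};
};

void UpdateSlice(ControlLine& line, int rbe, uint16_t status) {
    const int offset = rbe * 2;                        // 2-byte slice per RBE
    line.bytes[offset]     = static_cast<uint8_t>(status & 0xFF);
    line.bytes[offset + 1] = static_cast<uint8_t>(status >> 8);
}

int main() {
    ControlLine line;
    std::vector<std::thread> rbes;
    for (int i = 0; i < 4; ++i)
        rbes.emplace_back(UpdateSlice, std::ref(line), i, static_cast<uint16_t>(0x0100 + i));
    for (auto& t : rbes) t.join();
    return 0;
}
```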
 
I'm a bit intrigued by the Nano. Is it a cut-down version with fewer SPs?
From the Anandtech Fury X review (this detail may have been mentioned in other reviews, but this is the first time I've noticed it):

Unlike the R9 Fury, AMD has announced the bulk of the specs for the R9 Nano. This card will feature a fully enabled Fiji GPU, and given AMD’s goals I suspect this is where we’re going to see the lowest leakage bins end up.
 
From the Anandtech Fury X review (this detail may have been mentioned in other reviews, but this is the first time I've noticed it):

Nice, you've reminded me to check this review now that it's out. As always, and in contrast to other reviews, it's full of technical information and goes deep into the details.
 
Interestingly, per the page in the Anandtech review covering Fiji's organization, there are architectural concerns that could have been in play with regards to the shader engines and ROP allocation.
http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/4

Hawaii's revision of GCN architecturally tops out at 4 shader engines, and is capped at 16 RBEs.
Fiji and its unit counts may represent the extent to which GCN can be incrementally leveraged.
 
Considering the die size limit, they used GCN's possibilities well. Just like with Tahiti, the desire for more ROPs has little ground to stand on.
 
Do you think it couldn't have used more primitive throughput? That is architecturally defined by the limits described.
That would probably require expanded data paths to the backend to scale the throughput and synchronization.

Anyway, I wonder how GCN should evolve from now on. Probably it's time to double the SIMD clusters in the multiprocessor for better compute density and a more optimized power design. And stuff that scalar ALU with more bells and whistles. In other words: beefier and more feature-rich CUs, but a bit fewer of them. That should simplify and optimize the internal datapath layout.
 