AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

I really don't think the Nano would work as a cut-down part. Just look at the R9 290's power draw compared to the 290X. The Nano will be an undervolted, downclocked, fully enabled Fiji, much like laptop chips being binned for their lower power draw.

The laptop versions of larger chips are usually cut-down, just like the GM204 in the 980M and 970M.
The 290 consumes less power than the 290X.
So I don't really get your point.

Regardless, downclocking rather than disabling units may not be the best method. Disabling ALUs and TMUs allows for lower power consumption without decreasing fillrate, for example, which is rather important if they're marketing the card for 4K resolutions.

As such, the Nano will likely cost the same as the Fury X while performing worse. It will be akin to the low-TDP CPUs currently on the market: you will be paying for the better performance/watt and the smaller form factor, not for the best perf/$.

I've seen absolutely no indication whatsoever pointing at the Nano being as expensive as the Fury X, or even as expensive as the air-cooled Fury.
 
Point taken on fillrate. I guess a combination of cutting down and underclocking/undervolting would be the best approach. I was thinking along the lines of modern Apple Ax chips, which are wide and clocked low.

My point with the 290 was that its power usage is not a whole lot lower, but I guess you'd need to compare perf/watt of the 290X vs the 290 with both clocked the same to get the real picture of whether disabling ALUs is worth the power savings.

I definitely think we will see the Fury Nano priced higher than the Fury, even if it underperforms it slightly. Just look at current small-form-factor cards like the short versions of the 970. Hopefully it is cheaper than the Fury X.
 
I really don't think the Nano would work as a cut-down part. Just look at the R9 290's power draw compared to the 290X.
The disabled shaders do save some power, but that is outweighed by the increased voltage setting for the 290. It is not a fair comparison to the R9 Nano.
 
How about the Nano having only 3 instead of 4 HBM stacks and thus only 3GB? It's unlikely, but not impossible.
That'd be a way to salvage interposer integration failures if they're common enough. Otherwise, are you going to throw out an otherwise working Fiji and 3 HBM stacks?

Hynix performs the interposer bonding step, not AMD or TSMC.
 
Surely salvage is exactly the point of the Nano?
I'm certainly expecting only 3 functional HBM stacks on it, just by the logic of it being a salvage part.
If Nano has 4 functional HBM stacks then there is 0 problem with yield & AMD will be creaming it.

Nano specs will presumably be some kind of lowest common denominator that covers most of the possible partial-fail states: 3 HBM stacks, part of the ALUs, TMUs and ROPs disabled, and reduced clocks.

Some may be capable of running not far short of full specs.
It's even possible that it winds up in a 9600 Pro situation, with demand being such that they send out some fully functional GPU & HBM parts disabled down to Nano specs to meet it.
But I think it's pretty likely that there will be plenty of partially failed parts available for the Nano.


Handy that it can give an attractively sized SKU that beats 290X with much lower power as well :cool:
 
AMD is already pushing it by selling the Fury X with 4GB when its direct competition carries 6GB and the 390 series are all coming with 8GB now.
No, I don't think it's feasible they'll sell the Nano with 3GB.

Tahiti cards had 3GB in all models, and they only released a salvaged version with 2GB on a 256-bit bus, the 7870 XT, some 2 years after the original cards' release.
Hawaii was never sold in a card with less than 4GB/512bit.

The only way I see Fiji scrapping stacks of HBM is if/when it supports HBM2 and it gets 6GB instead of 8GB.
 
Coil whine is one of the most aggravating sounds known to man. Really good news if they've nailed it.
Zero whine on my current ASUS 290X DC2. New "solid"* type inductors don't seem to be as susceptible as the older ones.

*Where the magnetic material is a powder pressed (and possibly also baked) into shape around the coil during manufacturing, rather than the coil being wound around a solid, pre-manufactured core.
 
Yesterday Gibbo from Overclockers.co.uk posted this:



If it really has no coil whine at all then I'm mega happy! I can't remember the last time I had a GPU without coil whine, and it's still annoying after all these years of listening to it!

Another 3x4K showcase from AMD Fury X:

http://wccftech.com/amd-r9-fury-x-p...i-at-12k-resolution-and-60-fps/#ixzz3debRMFB2

Somehow they again come up with a 12K label when it's not even 8K, but the result is impressive nonetheless!
I don't think there are coil whine "free" cards in this performance segment. Cards might be built so that they don't whine in typical gaming fps regions, say up to 100-150, but certainly not coil whine free without reservations. :)
 
I don't think there are coil whine "free" cards in this performance segment. Cards might be built so that they don't whine in typical gaming fps regions, say up to 100-150, but certainly not coil whine free without reservations. :)

At least it's going in the right direction then :)

Btw, will your review be published once the NDA lifts, or later on?

I would like to see a detailed analysis of both compute and gaming performance, but that might be a bit of an unrealistic expectation for a launch review considering the amount of time manufacturers give you to toy with a new product. Hopefully we will get that by the time the non-X Fury launches!
 
I don't think there are coil whine "free" cards in this performance segment. Cards might be built so that they don't whine in typical gaming fps regions, say up to 100-150, but certainly not coil whine free without reservations. :)

Not nice. Wouldn't it be optimal if the coil whine were shifted to FPS regions that are almost never touched, for instance 1 to 10 FPS, rather than those over 100-150 FPS?
 
BC is 28 times faster.
Just wanted to demonstrate how PNG is not inherently serial.

If you get a min/max range from the 8x8 tile and delta against the gradient, you'd be as fast as BC. You can also split the tile into 2 sets (à la ETC). I would use variable Golomb codes for the delta values and just store the distribution parameter, but I don't think they used any variable coding scheme. 1:1, 1:2, 1:4 and 1:8 is more likely, and much easier.
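For illustration, here's a minimal sketch of that per-tile delta idea, assuming a simple planar gradient fitted to the tile corners; the names and the predictor are my own guesses, not anything confirmed about the actual hardware:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdlib>

struct TileDeltas {
    uint8_t minVal, maxVal;          // min/max range of the 8x8 tile
    std::array<int16_t, 64> delta;   // residual per pixel after prediction
    int maxAbsDelta;                 // later drives the choice of encoding
};

// Simple planar predictor: interpolate between the four corner samples.
static int predict(const std::array<uint8_t, 64>& px, int x, int y) {
    int top = px[0]  + (px[7]  - px[0])  * x / 7;
    int bot = px[56] + (px[63] - px[56]) * x / 7;
    return top + (bot - top) * y / 7;
}

TileDeltas computeDeltas(const std::array<uint8_t, 64>& px) {
    TileDeltas out{};
    auto mm = std::minmax_element(px.begin(), px.end());
    out.minVal = *mm.first;
    out.maxVal = *mm.second;
    out.maxAbsDelta = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x) {
            int d = int(px[y * 8 + x]) - predict(px, x, y);
            out.delta[y * 8 + x] = int16_t(d);
            out.maxAbsDelta = std::max(out.maxAbsDelta, std::abs(d));
        }
    return out;
}
```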
Will that be lossless?
 
Just wanted to demonstrate how PNG is not inherently serial.

Well, you bought it with a hell of a lot of routing. :)
If, say, you implement what you were drawing as a pipeline, and you can manage an outstanding queue, you might actually get away with a sustained throughput of 2 cycles per tile, which would be very nice. But I have my doubts that this implementation meets the latency requirements of the render feedback loop (I mean tiles circulating between the raster output and memory). I suspect you can't give away a single cycle of added latency, or otherwise: "do it again".

Will that be lossless?

Sure, the approach is always the same for this stuff (depth buffer, MSAA etc.): when you overflow, you just drop to the uncompressed baseline.
If the delta values fit into 2 bits you go, say, 1:8; otherwise, when 3 bits, then 1:4; when 4 bits, 1:2; otherwise dump 1:1. It's never really cheap to figure out the right permutation for the encoding, but if it's within 4-8 cycles per tile, it might be in a better place than the pipelined implementation above.
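A minimal sketch of that ratio-selection rule, assuming signed per-pixel deltas like the ones sketched earlier; the thresholds and names are illustrative, not the actual hardware encoding:

```cpp
#include <cstdint>

enum class TileMode : uint8_t { Ratio8, Ratio4, Ratio2, Uncompressed };

// Pick the smallest encoding that every delta of a tile fits into, or fall
// back to storing the tile raw (1:1), which keeps the scheme lossless.
TileMode pickMode(const int16_t* deltas, int count) {
    auto allFitIn = [&](int bits) {
        const int lo = -(1 << (bits - 1)), hi = (1 << (bits - 1)) - 1;
        for (int i = 0; i < count; ++i)
            if (deltas[i] < lo || deltas[i] > hi) return false;
        return true;
    };
    if (allFitIn(2)) return TileMode::Ratio8;   // 2-bit deltas -> 1:8
    if (allFitIn(3)) return TileMode::Ratio4;   // 3-bit deltas -> 1:4
    if (allFitIn(4)) return TileMode::Ratio2;   // 4-bit deltas -> 1:2
    return TileMode::Uncompressed;              // overflow: dump 1:1
}
```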

The acceptability very much depends on the exact statistical profile of the ROP-cache/MC transfers. An implementation might not give a damn about latency because tiles are only ever re-touched every 1k cycles. Or if going to memory already takes 400 cycles, what does it matter if we add 50-100 more? There might be a window where you can pick a dog-slow implementation, like with the XB1's zlib compression: if that thing were serial, or you had to use it serially, it would be a killer; if asynchronous, you might not care that much.
My preference, though, is to try to make it really tight (better safe than sorry) without forgetting about compression efficiency.

Anyway. Little bit off-topic, sorry for the distraction. :)
 
How do you know how many bytes a compressed tile occupies when reading it back, so that you don't read too much and ruin the entire purpose of compressing in the first place? (Or, well, half of the reason anyway...)
 
Isn't there an on-chip metadata cache?
Not exactly on-chip. The tile information is stored alongside the actual buffers nowadays and gets automatically loaded/stored (IIRC it has a separate programmable address for this structure, so it's not embedded in the fixed miptree structures, but I'm too lazy to check different hw with the open-source drivers). I believe this data is stored/loaded more or less like ordinary data (so, 32 bytes of it at a time or so).

You need at least 2 bits per tile (cleared, 1:1, 1:2, 1:4; 1 bit more if you want to support higher ratios), so with 8x8 tiles it can add up to quite something (for a 16kx16k buffer and 2 bits per tile that would give 1MB per RT).

On earlier generations (before r600 I think, going back to the original r100) there was indeed a fixed cache for that on chip (for zs, not sure about msaa color), so when changing the depth buffer the information had to be stored/loaded manually (or, alternatively, it was only used for one depth buffer). Chips with a hierarchical z buffer (all the more expensive ones) had to store significantly more bits on chip (basically at least some z value per tile), and at least on some chips it was necessary to switch off hierarchical z if the resolution was too high due to insufficient cache.
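As a sanity check on that 1MB figure, a quick back-of-the-envelope calculation (purely illustrative numbers, assuming 2 bits of mode data per 8x8 tile):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t width = 16384, height = 16384;  // 16k x 16k render target
    const uint64_t tileDim = 8;                    // 8x8 pixel tiles
    const uint64_t bitsPerTile = 2;                // cleared / 1:1 / 1:2 / 1:4

    const uint64_t tiles = (width / tileDim) * (height / tileDim);  // 4M tiles
    const uint64_t metaBytes = tiles * bitsPerTile / 8;             // ~1 MiB

    std::printf("%llu tiles -> %llu bytes of metadata (%.2f MiB)\n",
                (unsigned long long)tiles,
                (unsigned long long)metaBytes,
                metaBytes / (1024.0 * 1024.0));
    return 0;
}
```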
 