AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

sebbbi · Jul 5, 2015

Jawed said:
I have no idea why AMD keeps pushing compute, when it's not been the route to success since R600.

Most modern graphics engines are compute heavy. You only need rasterization to fill the G-buffer and render the shadow maps. Everything else is compute shaders nowadays.

And there will be games that are purely compute based, such as Media Molecule's Dreams. ROPs are dead silicon for these games. Fury X will be quite a nice product for running games like these.

Jawed said:
Though I think Lauritzen identified a serious bottleneck in GCN with the tiled lighting scheme that BF4 uses, so BF4 might not be a good candidate because that's known as "broken", and pretty fundamental to the game.

Tiled compute based lighting suits the GCN architecture perfectly. You just need to optimize the shader in a certain way (to make the compiler do what you want). The GCN scalar unit is actually a big performance boost for tiled algorithms (but unfortunately on PC it is impossible to write code for it directly).

Jawed · Jul 5, 2015

If games are compute-bound, then GCN compute efficiency in these games must be terrible - worse than the VLIW GPUs of yore.

As for what I said about GCN and tiled lighting, I think I'm misremembering, now that I've bothered to rummage:

https://forum.beyond3d.com/posts/1611685/

Unfortunately the link to the Intel page doesn't work, so I can't be 100% sure it's the comparison I was thinking of. Wrong version of Battlefield, too.

EDIT: found another post, which has a working link to the Intel site:

https://forum.beyond3d.com/posts/1638737/

where HD7970 has a serious problem with MSAA. That does seem to be what I was remembering, though there is no number presented for Tahiti.

silent_guy · Jul 5, 2015

pjbliverpool said:
But do we know that a 256bit GDDR5 interface (not exactly low end) is cheaper than a single HBM2 stack?

We don't know. It's a matter of beliefs and, IMHO, common sense. The yearly volume of GPUs with GDDR5 must be on the order of 50 million. The volume of HBM will be 2 orders of magnitude lower. Doesn't matter whether it's 20x or 100x lower, that alone will have a major impact on price. Add to that: single source, die stacking, interposer etc.
My personal belief, based on the experience of having seen component cost differences that vary incredibly based on volume, is that HBM2 will be some low integer factor more expensive than GDDR5, but others are obviously going to disagree heavily.

Alexko · Jul 5, 2015

silent_guy said:
We don't know. It's a matter of beliefs and, IMHO, common sense. The yearly volume of GPUs with GDDR5 must be on the order of 50 million. The volume of HBM will be 2 orders of magnitude lower. Doesn't matter whether it's 20x or 100x lower, that alone will have a major impact on price. Add to that: single source, die stacking, interposer etc.
My personal belief, based on the experience of having seen component cost differences that vary incredibly based on volume, is that HBM2 will be some low integer factor more expensive than GDDR5, but others are obviously going to disagree heavily.

Will it, though?

I mean, if both AMD and NVIDIA decide to move their entire GPU lineup to HBM, what's left for GDDR5? There's the PS4, some low-volume Xeon Phis, and that's about it. HBM, on the other hand, would suddenly be manufactured in very large volume, provided that manufacturers can actually produce this much of it by converting GDDR5 production lines fast enough.

Rikimaru · Jul 5, 2015

I doubt HBM will ever be cheaper than GDDR5, even in high volumes. It needs an interposer and special assembly, and stack yield is a factor too.

silent_guy · Jul 5, 2015

Alexko said:
if both AMD and NVIDIA decide to move their entire GPU lineup to HBM, what's left for GDDR5?

It's an interesting chicken and egg situation, isn't it?

FuryX has shown that massive BW is not the silver bullet they hoped it would be. It remains to be seen if that's due to AMD incompetence or due to fundamental reasons. But if it's the latter, then there's just no point in using HBM for anything but the highest SKUs for 14/16nm.

Conclusion: there's a lot left for GDDR5.

Maybe for 10nm...

sebbbi · Jul 5, 2015

Jawed said:
If games are compute-bound, then GCN compute efficiency in these games must be terrible - worse than the VLIW GPUs of yore.

ALU was a big bottleneck on last gen consoles, so it takes a while to change your code base and habits to be a perfect fit for modern GPUs. GCN is not ALU bound in most shaders, but that doesn't mean that adding CUs shouldn't improve the performance (almost linearly), since additional CUs give a linear increase in total registers, L1 cache, LDS, etc (= many other potential bottlenecks in compute code).

Jawed said:
As for what I said about GCN and tiled lighting, I think I'm misremembering, now that I've bothered to rummage:
https://forum.beyond3d.com/posts/1611685/

Unfortunately the link to the Intel page doesn't work, so I can't be 100% sure it's the comparison I was thinking of. Wrong version of Battlefield, too.

EDIT: found another post, which has a working link to the Intel site:

https://forum.beyond3d.com/posts/1638737/

where HD7970 has a serious problem with MSAA. That does seem to be what I was remembering, though there is no number presented for Tahiti.

Andrew's comparison was between 5000 series (VLIW5) Radeon and 400 series (Fermi) GeForce. Old VLIW5 Radeons had big performance bottlenecks in dynamic array indexing (of memory and/or LDS). I remember measuring constant buffer array indexing to be roughly 2x faster than LDS array indexing (and indexing structured buffers was roughly 6x slower than a constant buffer). We used CPU based tile light binning on consoles (older AMD GPUs) and DX10 era PCs, because dynamic loops (that indexed the light lists using the loop counter) were much slower compared to unrolled code (multiple shader permutations based on light counts). VLIW Radeons also had some severe bank conflict scenarios on registers and LDS.
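The CPU-side tile light binning mentioned above can be sketched roughly as follows. This is only an illustration, not code from any shipped engine: the light representation (a screen-space circle in pixels), the 16-pixel tile size and the function name `bin_lights` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical light representation: screen-space x, y and radius in pixels.
Light = Tuple[float, float, float]


def bin_lights(lights: List[Light], width: int, height: int,
               tile: int = 16) -> Dict[Tuple[int, int], List[int]]:
    """Conservatively assign each light to every tile its bounding square touches."""
    tiles_x = -(-width // tile)   # ceiling division
    tiles_y = -(-height // tile)
    bins: Dict[Tuple[int, int], List[int]] = {}
    for index, (x, y, radius) in enumerate(lights):
        # Clamp the light's bounding square to the screen, in tile units.
        x0 = max(0, int((x - radius) // tile))
        x1 = min(tiles_x - 1, int((x + radius) // tile))
        y0 = max(0, int((y - radius) // tile))
        y1 = min(tiles_y - 1, int((y + radius) // tile))
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


if __name__ == "__main__":
    demo = bin_lights([(8, 8, 4), (16, 16, 4)], width=32, height=32)
    for tile_xy in sorted(demo):
        print(tile_xy, demo[tile_xy])
```

With per-tile light counts known on the CPU, a renderer could pick an unrolled shader permutation per tile instead of running a dynamic loop, which is the trade-off described in the post.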

GCN doesn't have any of these problems. Bad GCN multisampled performance in the tiled lighting shader is most likely explained by bad occupancy. If you don't use all the tricks in your book to force the AMD shader compiler to behave properly, you will end up with high VGPR usage in the complex tiled lighting shaders. Multisampled versions are much more prone to VGPR pressure, since the shader is more complex. I think we spent at least a month optimizing the VGPR usage of our tiled lighting shader. But the end result is nice. We can push 16k visible lights at locked 60 fps (on a middle class GCN 1.1 GPU). It is silly how even simple things such as changing the order of two lines of shader code can cut the VGPR usage down by 2-3 (giving up to 10% extra performance for a shader that has poor occupancy).
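To make the occupancy point concrete, here is a rough sketch of how VGPR count limits the number of wavefronts a GCN SIMD can keep resident. The figures used (256 VGPRs per lane per SIMD, at most 10 waves per SIMD, allocation in blocks of 4 registers) are the commonly published GCN numbers and are treated as assumptions here; the function name is made up for illustration, and other limits such as SGPRs and LDS are ignored.

```python
# Assumed GCN figures (not from the post above): each SIMD has 256 VGPRs
# per lane, holds at most 10 wavefronts, and allocates VGPRs in blocks of 4.
VGPRS_PER_SIMD = 256
MAX_WAVES_PER_SIMD = 10
VGPR_GRANULARITY = 4


def waves_per_simd(vgprs_used: int) -> int:
    """Estimate resident wavefronts per SIMD when VGPRs are the only limit."""
    if vgprs_used <= 0:
        raise ValueError("shader must use at least one VGPR")
    # Round the allocation up to the next multiple of the granularity.
    allocated = -(-vgprs_used // VGPR_GRANULARITY) * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // allocated)


if __name__ == "__main__":
    for vgprs in (24, 32, 64, 66, 84, 86, 128):
        print(f"{vgprs:3d} VGPRs -> {waves_per_simd(vgprs)} waves per SIMD")
```

Under these assumptions, trimming a shader from 66 to 64 VGPRs (or from 86 to 84) gains one extra resident wave per SIMD, which is the kind of step a 2-3 register saving can buy.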

Compute shader optimizations in general don't "port" well across architectures. You get big performance differences simply by changing your thread group size (128 = 16x8, 256 = 16x16, 512 = 32x16, 1024 = 32x32 threads). Less than 256 threads per group doesn't suit GCN well. And 1024 threads per group (32x32 tile) is hopeless to get running at high enough occupancy (for any complex multisampled tiled lighting implementation). Depending on the GPU resource bottlenecks, a different group size is optimal. If the same shader code is used on multiple generations of AMD and NVIDIA GPUs, the thread group size will likely not be perfectly optimal for each. The wrong thread group size alone is enough to severely hamper performance on some GPUs.
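A small sketch of why the group size interacts with register pressure. It assumes the usual published GCN figures (64 threads per wavefront, 4 SIMDs per compute unit, 256 VGPRs per lane per SIMD), none of which are stated in the post, and it ignores allocation granularity, LDS and SGPR limits. The helper names are hypothetical.

```python
# Assumed GCN figures (not stated in the post): 64 threads per wavefront,
# 4 SIMDs per compute unit, 256 VGPRs per lane per SIMD.
WAVE_SIZE = 64
SIMDS_PER_CU = 4
VGPRS_PER_SIMD = 256


def waves_per_group(width: int, height: int) -> int:
    """Number of wavefronts needed for a width x height thread group."""
    threads = width * height
    return -(-threads // WAVE_SIZE)  # ceiling division


def max_vgprs_for_groups(width: int, height: int, groups_per_cu: int) -> int:
    """Largest per-thread VGPR count that lets `groups_per_cu` whole groups
    be resident on one CU, assuming waves spread evenly over the SIMDs."""
    total_waves = waves_per_group(width, height) * groups_per_cu
    waves_on_busiest_simd = -(-total_waves // SIMDS_PER_CU)
    return VGPRS_PER_SIMD // waves_on_busiest_simd


if __name__ == "__main__":
    for w, h in ((16, 8), (16, 16), (32, 16), (32, 32)):
        print(f"{w}x{h}: {waves_per_group(w, h)} waves per group, "
              f"<= {max_vgprs_for_groups(w, h, 1)} VGPRs for 1 group, "
              f"<= {max_vgprs_for_groups(w, h, 2)} VGPRs for 2 groups per CU")
```

Under these assumptions a 32x32 group needs 16 wavefronts, so the shader has to stay at or below 64 VGPRs for even one group to fit on a CU, and at or below 32 for two groups. That is consistent with the remark that 1024-thread groups are very hard to run at good occupancy with a complex shader.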

Because GCN is starved for register space, it is slightly more reliant on a good shader compiler than some other GPUs. The GCN scalar unit could be a big help for register pressure, but utilizing it perfectly would require even more sophisticated compiler logic. Tile based lighting is a perfect candidate for scalar optimizations, since you can split the thread group into 8x8 subgroups (sub tiles), and offload all subgroup calculations and data loads (and registers to hold that data) to the scalar unit. This saves lots of ALU and registers. The OpenCL 2.0 shading language has subgroup operations that would help the GCN shader compiler to use the scalar unit (and lane swizzles) better. Our PC versions have always been DirectX-based, so I don't know how well this works in practice on PC.

UniversalTruth · Jul 5, 2015

silent_guy said:
It's an interesting chicken and egg situation, isn't it?

FuryX has shown that massive BW is not the silver bullet they hoped it would be. It remains to be seen if that's due to AMD incompetence or due to fundamental reasons. But if it's the latter, then there's just no point in using HBM for anything but the highest SKUs for 14/16nm.

Conclusion: there's a lot left for GDDR5.

Maybe for 10nm...

You got everything wrong here.

Wasn't it you who claimed that GDDR5 was the mainstream solution?

Have you noticed that the results with overclocked HBM showed Fury X scaling perfectly fine with more memory bandwidth?
It is very interesting that you draw the opposite conclusion, claiming that there is no need for more memory bandwidth.

Grall · Jul 5, 2015

@sebbbi, why is it you think that AMD isn't fixing its shader compiler? I've seen here on B3D that developers have been dissing it since at least the radeon 5000 series IIRC. You dev guys should know perfectly well what is wrong with that compiler and I'm sure you've told AMD repeatedly why exactly it sucks. Why is it so hard to get it fixed?

Razor1 · Jul 5, 2015

UniversalTruth said:
You got everything wrong here.

Wasn't it you who claimed that GDDR5 was the mainstream solution?

Have you noticed that the results with overclocked HBM showed Fury X scaling perfectly fine with more memory bandwidth?
It is very interesting that you draw the opposite conclusion, claiming that there is no need for more memory bandwidth.

The problem is twofold: HBM and the extra bandwidth didn't show that they are going to save Fury X, and architectural changes are still just as important as the different memory technology.

So without a more efficient architecture from a power usage perspective, the extra bandwidth doesn't do Fury X any good, since Fury X seems to be close to its limits. (Yes, we have seen overclocking of the VRAM and the chip, but the overclocking is still very limited percentage-wise, even with water cooling, and we haven't seen how much power consumption even this small amount of overclocking incurs; it might be quite large.)

If Fury X's core could have been clocked more than the 10% or so that reviewers have been getting, and could maintain it, the extra HBM bandwidth would have come in very handy. Right now it doesn't really look like it, though we will have to wait and see whether the voltage unlocks change that.

UniversalTruth · Jul 5, 2015

Yes, there will be architectural optimisations with Arctic Islands on 14 nm. Then, even the die shrunk Fiji will need HBM2.

For my part, and I hope for the good of everyone, GDDR5 should be retired as soon as possible and become a thing of the past.

Razor1 · Jul 5, 2015

I don't think they will just be doing a die shrink of Fiji; it doesn't seem like that would hold up well. At least I hope Arctic Islands is going to be a whole new line from the top down.

Lower-end cards also have to look at cost of production much more than the higher-end cards, so IMO GDDR5 will be there for another generation of graphics cards, just not at the high end.

Edit: and this will slow down mass production (total amount of production) of HBM, since the high-end cards are a limited segment.

lanek · Jul 5, 2015

Razor1 said:
If Fury X's core could have been clocked more than the 10% or so that reviewers have been getting, and could maintain it, the extra HBM bandwidth would have come in very handy. Right now it doesn't really look like it, though we will have to wait and see whether the voltage unlocks change that.

From what I can imagine, vcore / clock speed / power is really tight on Fiji; basically you need a vcore increase to overclock it. They have set it really close to the limit to keep power consumption and temperature in check, instead of leaving a good margin. Of course, time will tell once we are able to see what it gives in practice.

To be honest, this is not the first time I have seen that with AMD GPUs: a mere 100MHz more was possible on my 7970, and with a bit of vcore I hit 1400MHz under water.

Let's hope that will also be the case for Fiji.

Maxwell, on the other hand, doesn't really need a vcore increase. AMD GCN GPUs have always scaled with vcore increases. (Then of course there are always particular samples that hit a wall after a certain clock speed, whatever vcore is used. I had a Sapphire 7970 like that; it was replaced after dying, and the second one, the same model, overclocked like a beast.)

I can imagine the reason for Maxwell is more that the vcore is set high, but as they limit the turbo clock speed by temperature and power limit, the "real time" vcore is way lower and fluctuates more. Hence you see it fly the moment you drop the power limit; there is a lot of headroom left. Just be aware that there are some big differences depending on the model: the power limit and voltage are way higher than default on retail OC models. And that is pretty much the type of model you see in shops and that people own.

silent_guy · Jul 5, 2015

UniversalTruth said:
You got everything wrong here.

Coming from you, that's one hell of a compliment.

Wasn't it you who claimed that GDDR5 was the mainstream solution?

Yes, yes, I did. In fact, I still think so: my 'two orders of magnitude' statement is a confirmation of that. What makes you think I've changed my opinion on this?

Have you noticed that the results with overclocked HBM showed Fury X scaling perfectly fine with more memory bandwidth? It is very interesting that you draw the opposite conclusion, claiming that there is no need for more memory bandwidth.

Have you noticed that a gm200 with 33% less bandwidth performs better than or identically to a FuryX? That shows that 168GB/s on the FuryX is largely wasted.
If increased memory BW on FuryX provides more performance, then that's a pretty strong indication that AMD is incompetent in their usage of bandwidth. Coming back to your 'scaling perfectly fine' argument, check out Damien's memory overclocking numbers. And look up the case where only memory is overclocked. If you average the performance increase for a +8% memory clock, you end up with +2.3%. Call me wrong all you want, that does not make a strong case for memory being the prime performance limiter.
Furthermore, if you compare memory-only overclock numbers against the GPU core clock overclock numbers, there are only two cases where a memory overclock has a bigger impact than a core overclock.
It's clear that when improving AMD's performance, the most !/$ is in their core logic, not the memory BW.
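As a quick check of the scaling argument, the two averages quoted in this post (+8% memory clock, +2.3% performance) can be turned into a single scaling ratio. The helper name is hypothetical and the inputs are just the figures quoted here, not new measurements.

```python
def scaling_efficiency(clock_gain_pct: float, perf_gain_pct: float) -> float:
    """Fraction of a clock increase that shows up as performance."""
    if clock_gain_pct <= 0:
        raise ValueError("clock gain must be positive")
    return perf_gain_pct / clock_gain_pct


if __name__ == "__main__":
    # Figures quoted in the post: +8% memory clock, +2.3% average performance.
    eff = scaling_efficiency(8.0, 2.3)
    print(f"memory clock scaling efficiency: {eff:.2f}")
```

A ratio of roughly 0.29 means less than a third of the extra memory clock turns into frame rate, whereas a purely bandwidth-bound workload would be expected to sit close to 1.0.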

UniversalTruth · Jul 5, 2015

Ok, I understand that you are not happy with performance but I do not want a 3 TIMES larger PCB footprint with GDDR5 just because you bring the argument about the performance. Which actually is irrelevant when considering all other benefits of HBM.

entity279 · Jul 5, 2015

Grall said:
@sebbbi, why is it you think that AMD isn't fixing its shader compiler? I've seen here on B3D that developers have been dissing it since at least the radeon 5000 series IIRC. You dev guys should know perfectly well what is wrong with that compiler and I'm sure you've told AMD repeatedly why exactly it sucks. Why is it so hard to get it fixed?

Non-deterministic (a label I would give to the fact that reordering instructions significantly changes the output, as posted by sebbbi) and generally unpredictable (shader) compiler output is not something you can file under one (or several!) bug reports.

From an outside bird's-eye view, this and other similar posts point to rather systematic failures both in engineering the compiler and in having it provide an adequate abstraction level.
And then it follows naturally that only a few vendors can afford the expertise to build, maintain and publish state-of-the-art graphics engines.

Just my comment; sebbbi's answer will surely be much more revealing.

Alexko · Jul 5, 2015

silent_guy said:
It's an interesting chicken and egg situation, isn't it?

FuryX has shown that massive BW is not the silver bullet they hoped it would be. It remains to be seen if that's due to AMD incompetence or due to fundamental reasons. But if it's the latter, then there's just no point in using HBM for anything but the highest SKUs for 14/16nm.

Conclusion: there's a lot left for GDDR5.

Maybe for 10nm...

Yes, it very much is a chicken and egg situation. And those are problematic when many actors are involved and need to move at once to effect change, but here there are only two significant actors.

I don't know why you dismiss the size and power benefits of HBM so readily.

silent_guy · Jul 5, 2015

Alexko said:
I don't know why you dismiss the size and power benefits of HBM so readily.

Size: I believe that a GDDR5 GPU designed with water-cooling in mind right from the start can be made significantly smaller than the conventionally cooled GPUs of today. They just never bothered because the cooler needs to be big. We'll have to wait for the Fury to see whether that's really true or not. IMO, small size is a consequence of making a technology choice, but not a fundamental influencer of using that technology.

Power benefits: the combination of HBM and a highest-end chip with an efficient core is going to be awesome. But if the next x04 chip from Nvidia in 14/16nm will be similar in performance to gm200, then it should consume significantly less power than gm200 while keeping GDDR5 and well below 200W, negating a strong need for the HBM power savings there.

It's a matter of nice to have vs necessity. More than extra BW, AMD needed HBM power savings as a band aid to stay borderline competitive. I expect them to fix their power efficiency issues with the process shrink. (After all, they must have been doing something worthwhile in their core logic in the last 3 years.) And Nvidia didn't need the power savings at all.

UniversalTruth · Jul 5, 2015

Again this.

The water-cooling was a consequence of using the HBM, not a reason for using HBM.

With GDDR5 and a water cooler you will still need the immensely large PCB; it won't get smaller because of this.

silent_guy · Jul 5, 2015

UniversalTruth said:
Ok, I understand that you are not happy with performance but I do not want a 3 TIMES larger PCB footprint with GDDR5 just because you bring the argument about the performance. Which actually is irrelevant when considering all other benefits of HBM.

You should learn to look past the marketing slides. Your 3x size reduction (a 69% size reduction) is for the part of the PCB that contains just the GPU and the memory. In reality, the height of a FuryX is identical to a 980Ti, and the PCB length is 7.7" vs 10.5", good for a PCB size reduction of just 26%. If that 26% size reduction (and a big honking fan attached to it) is really what matters to you, then knock yourself out and go for a FuryX.
But let's wait one more week for an apples-to-apples comparison between similarly cooled Fury and GTX 980Ti.
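For reference, the PCB figure above follows directly from the two lengths quoted in this post. Since the heights are said to be identical, the area reduction equals the length reduction. The helper name is made up for the example.

```python
def reduction_pct(old: float, new: float) -> float:
    """Percentage by which `new` is smaller than `old`."""
    if old <= 0:
        raise ValueError("old size must be positive")
    return (1.0 - new / old) * 100.0


if __name__ == "__main__":
    # PCB lengths in inches as quoted in the post: 980Ti 10.5, FuryX 7.7.
    print(f"PCB length reduction: {reduction_pct(10.5, 7.7):.1f}%")
```

This comes out just under 27%, in line with the roughly 26% quoted in the post.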
