AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

AMD's tessellation performance can vary quite substantially with driver revision.

Hope you can see this graph directly:
http://www.pcgameshardware.de/scree...ellation_Performance_vs_R9_280_part1-pcgh.png

If not, you need to scroll down a bit here:
http://www.pcgameshardware.de/Grafikkarten-Grafikkarte-97980/Tests/AMD-Radeon-R9-285-Test-1134146/4/
Alright, I finally worked it out after going through a few driver revisions. It was Cat 14.12 (Omega) where TessMark performance jumped up. I don't run these low-level tests every time (they never change, until now), so I hadn't picked up on it earlier. Still not quite sure what AMD did to boost performance, though.

Wow! I wonder what happened there. Is there a similar boost for Tonga?
x64: 152.9fps (was: 134)
x32: 302.5fps (was: 294)

So a slight uptick in perf, but not like on 290. I'm not sure why 285 basically got this boost first, since 290 wasn't this fast when I initially reviewed 285.
 
Trying to install the latest beta driver, but since it's fucking beta shit it doesn't even work. It stops at the 'identifying graphics hardware' step and never progresses from there. Un-fucking-believable; I've tried twice now and nothing ever happens.

How hard can it be to just get a goddamn installer to work? All it has to do is stick some files in the right place (and create a billion registry keys, for some incomprehensible reason).

That's unfortunate. Never had any driver problems myself. I always use their AMD Catalyst Install Manager to quickly remove everything including the manager, reboot and install a new version. It's the official method. Worked every single time, including 15.5 beta. At least on Win 8.1 x64.
 
AMD's problem is that they need an architectural improvement for GCN, and honestly that improvement needs to be included in their first FinFET GPUs or they're going to have a bad time.
The one thing I keep asking in this thread is: what kind of "big architectural improvements" do you want in GCN 2.0, exactly? I still haven't heard an answer.

When you have 4096 very simple CISC cores - 50 to 80 times more than an Intel Xeon Phi, which uses modified Atom cores - you can improve performance simply by scaling down the process node to add more cores and increasing the running frequency.

Does it really make sense to increase complexity, widen bus/register bit widths, implement out-of-order and superscalar execution, etc.? Especially when the known AMD performance problems seem to come from underutilizing the existing GCN cores with non-optimal driver/compiler/optimizer code.


Or perhaps this is where those "IP blocks" (I think that's what Dave Baumann once called them) come into play. If my memory serves me right, it's not just "GCN x.y"; there are individual "IP levels" for the smaller blocks inside the GPU, which aren't tied to updating every other block too.
Updating just a few modular blocks still requires a complete redesign with all the validation stages, so this will be a new chip, not a minor revision/stepping.

What they could do is update the fixed-function rasterizer block to support Conservative Rasterization and Rasterizer Ordered Views to bring Fiji to feature level 12_1, and maybe update the virtual memory block to support Volume Tiled Resources as well, as this shouldn't require updates to the GCN blocks.

Unfortunately it doesn't really seem that AMD implemented feature level 12_1 in Fiji.
 
AMD are currently beaten by the 9xx cards at lower resolutions, and lower CPU overhead would make a difference; it would also help at the lower end, where they aren't playing well with an i3. Assuming the hardware is there, of course.

Maxwell also has quite a large amount of cache, which Tonga probably implements too, so even that might be covered.
It would be disappointing if they don't implement 12.1/11.3 in Fiji. And from that slide, it's not even Tiled Resources tier 3...
 
I've also never had any problems installing Catalyst.
I have, however, never used any third party software to uninstall them. Using any third party driver uninstallers is probably the main reason I've seen for subsequent installs failing. And the corruption remains even if a driver has been installed successfully.
 
Alright, I finally worked it out after going through a few driver revisions. It was Cat 14.12 (Omega) where TessMark performance jumped up. I don't run these low-level tests every time (they never change, until now), so I hadn't picked up on it earlier. Still not quite sure what AMD did to boost performance, though.

x64: 152.9fps (was: 134)
x32: 302.5fps (was: 294)

So a slight uptick in perf, but not like on 290. I'm not sure why 285 basically got this boost first, since 290 wasn't this fast when I initially reviewed 285.

To make it a bit clearer - and this has been discussed a few posts further on - TessMark is really not the best way to test tessellation. It uses OpenGL 4 (which at the time was clearly shaped by the green team within OpenGL), so it doesn't necessarily reflect real DX tessellation performance. That said, we all know the difference in tessellation performance between Kepler, Maxwell and their AMD counterparts.

This benchmark is also coded in a way that is absolutely not efficient on many points.

It's still a useful point of reference, it just doesn't achieve its purpose very well. And it doesn't change the lower tessellation performance of AMD GPUs (although, oddly enough, I use far higher levels of subdivision surfaces and tessellation with OpenCL, and AMD GPUs deliver much higher performance in an engine like LuxRender than anything you will ever see in a game).

AMD's tessellation performance is of course sub-par compared to Maxwell or Kepler, but strangely, in CG graphics this is absolutely not a problem... in games, which don't approach the quality of the raytracing we use, it is. I try to understand it: I don't do low-poly modelling, I'm more of an adept of extremely high poly, and I adapt the system to be able to render that, and the geometry/tessellation engine of AMD GPUs has never entered into the question. Why it is a problem for gaming, I really don't know. Of course, they are two different cases. I use local and global displacement mapping and ultra tessellation (or subdivision surface) levels that aren't even imaginable in a game, but whatever (I have to switch between AMD and Nvidia GPUs for my work from time to time) - that is not where I find the difference.

This is completely off topic, but with the changes made to the OpenGL teams after Vulkan (aside from its director, who is still Nvidia's number two), we should see some strange things happen on this front.
 
Updating just a few modular blocks still requires a complete redesign with all the validation stages, so this will be a new chip, not a minor revision/stepping.
Obviously, but it still may be a relatively easy job compared to trying to fit actually new blocks, like the TrueAudio DSPs, in there.
Maybe, maybe not - I wouldn't agree with the way you read that (supposed) slide, especially since all the GCNs (at least 1.1+) already do FL12_0 with resource binding tier 3.
 
The one thing I keep asking in this thread is: what kind of "big architectural improvements" do you want in GCN 2.0, exactly? I still haven't heard an answer.

Does it really make sense to increase complexity, widen bus/register bit widths, implement out-of-order and superscalar execution, etc.? Especially when the known AMD performance problems seem to come from underutilizing the existing GCN cores with non-optimal driver/compiler/optimizer code.
I like GCN compute units. The design is elegant. I definitely do not want OoO. I like that GCN is a memory based architecture. All the resources are stored in memory and cached by general purpose cache hardware. There are no sudden performance pitfalls. Performance is all about the memory access patterns (= something that the developer has full control over).

The AMD shader compiler is the biggest issue for GCN on PC. Current generation consoles have roughly the same GPU architecture as the AMD PC GPUs, making it easy for the developers to compare the compiler results (instruction counts, GPR counts, etc). A new (fully rewritten) shader compiler would be a big step in catching up with NVIDIA.

On the architecture side, I would prefer improvements to the GCN scalar unit. Right now the scalar unit has some limitations, and removing them would make it easier for the shader compiler to offload work (and register pressure) to the scalar unit. One scalar register takes only 64 bits to store, while a 64 wide vector register takes (64 lanes * 32 bits/lane =) 2048 bits to store (= much more die space and heat). Modern GPUs with scalar architectures (not to be confused with the scalar unit) need to run lots of threads (as each scalar thread does less work compared to a VLIW thread). More simultaneous threads of course means more register pressure. Surprisingly many instructions could be offloaded to the scalar unit if the compiler was better and the scalar unit supported the full instruction set. This would be a good way for AMD to improve performance, reduce the register pressure and save power. But this strategy also needs a very good compiler to work.
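
To put the storage argument in numbers (a minimal Python sketch; the 64-bit and 2048-bit figures are the ones quoted above, while the count of wave-uniform values is just a made-up example):

```python
# Register-file cost of keeping a wave-uniform value in a vector register
# versus a scalar register, using the sizes quoted in the post above.
SGPR_BITS = 64            # one scalar register (as stated above)
VGPR_BITS = 64 * 32       # 64 lanes * 32 bits/lane = 2048 bits

def savings_bits(num_uniform_values):
    """Bits of register file freed if the given number of wave-uniform
    values live in scalar registers instead of vector registers."""
    return num_uniform_values * (VGPR_BITS - SGPR_BITS)

# Hypothetical example: 10 wave-uniform values (descriptor addresses,
# loop bounds, constants) offloaded to the scalar register file.
print(savings_bits(10))   # 19840 bits freed, i.e. almost 10 full VGPRs
```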

Tonga (only in R9 285) already added two important things: color compression and 16 bit FP support. Color compression makes AMD bandwidth usage comparable to NVIDIA. This both saves power and improves performance. 16 bit FP support is going to be very important in the future, as the register pressure keeps increasing. For example many developers have switched to quaternion (QTangent) based transformations (from matrices), because quaternions need less GPRs (and consume less bandwidth). 16 bit FP is enough for quaternion math (multiply, slerp, etc) in shaders. This reduces the GPR load even further (allowing the GPU to run more threads -> better latency hiding). I would also like to see native 16 bit integer types. The most important thing with 16 bit registers is the reduced register file (GPR) usage. If we also get double speed execution for 16 bit types, then even better (but that is less relevant than saving GPRs).
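
As a minimal sketch of the quaternion math referred to above (plain Python, not tied to any particular engine and not the QTangent encoding itself, just the rotate-a-vector-by-a-unit-quaternion step):

```python
import math

# Rotate vector v by unit quaternion q = (x, y, z, w) using
# v' = v + 2 * cross(q.xyz, cross(q.xyz, v) + w * v).
# A quaternion needs 4 values where a 3x3 rotation matrix needs 9,
# which is where the GPR/bandwidth saving mentioned above comes from.

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate(q, v):
    x, y, z, w = q
    qv = (x, y, z)
    t = cross(qv, v)
    t = (t[0] + w * v[0], t[1] + w * v[1], t[2] + w * v[2])
    c = cross(qv, t)
    return (v[0] + 2 * c[0], v[1] + 2 * c[1], v[2] + 2 * c[2])

# 90 degree rotation around Z: q = (0, 0, sin(45 deg), cos(45 deg))
q = (0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))
print(rotate(q, (1.0, 0.0, 0.0)))  # ~ (0, 1, 0)
```

Every operation in rotate() is a short multiply-add chain, which is exactly the kind of math the post says is fine at 16 bit precision.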

The tessellation and geometry shader design is still bad in GCN. I know this is not an easy issue to solve, but it currently makes geometry shaders useless and limits tessellation usage to small amplification factors. Obviously we also want to have ROVs and conservative rasterization. These are great features that should be supported ASAP.
 
I like that GCN is a memory based architecture. All the resources are stored in memory and cached by general purpose cache hardware. There are no sudden performance pitfalls. Performance is all about the memory access patterns (= something that the developer has full control over).
If I may (if you have the time). Are Maxwell and current Intel iGPUs also memory based architectures, or are they something else? And if you really have the time or a quick link, what is the alternative to "memory based"?

Current generation consoles have roughly the same GPU architecture as the AMD PC GPUs, making it easy for the developers to compare the compiler results (instruction counts, GPR counts, etc).
So there is a difference? And if so, is the console compiler better because it has a different API?
 
The Xbox compiler is written by Microsoft. We're not talking about the HLSL to bytecode compiler (which is the same fxc), but the bytecode to hardware GCN instruction compiler.
 
Although it is absolutely necessary for many things, it is somehow considered technically separate from the actual driver.
Ahaha, okay, so it's about politics then? :p Anyway, I chose "express uninstall of all AMD software", which reasonably should include the not-the-driver-but-also-part-of-the-driver catalyst, AND the actual driver itself. Yet when driver cleaner sorted through my stuff, it found tons of remnants. After killing all of that stuff the install program was able to proceed.

This - leaving basically everything behind after "uninstalling" - is an issue AMD has never bothered to resolve, and it's been extremely well documented over the years. It takes something like a minute for the Catalyst installer to do its thing (I haven't actually timed it, but it won't be wildly off), while "uninstall" finishes in a snap of the fingers by comparison. It stands to reason that it doesn't actually do all that much uninstalling in that short amount of time.
 
When you have 4096 very simple CISC cores - 50 to 80 times more than an Intel Xeon Phi, which uses modified Atom cores - you can improve performance simply by scaling down the process node to add more cores and increasing the running frequency.

No. Repeat after me: an ALU (the lane in a SIMD) is not a core; the marketing people who came up with this initially should be punished for eternity. If Fiji actually is 4096 ALUs, it's going to be a 64-core machine, each core being a 64-wide (logical) SIMD. The Phi (for example) has a comparable core count, but narrower SIMD. Intel is honest about counting their cores. As is AMD (sort of) in their APU marketing (see e.g. the 7850K, which is 4 CPU + 8 GPU "compute" cores).
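
To put rough numbers on that (a back-of-the-envelope Python sketch; the Fiji figure assumes the rumoured 64 CUs of 64 lanes each, and the Phi figure assumes a 61-core Knights Corner with 16-wide fp32 SIMD):

```python
# "Cores" versus SIMD lanes for the two chips discussed above.
# Assumed figures: Fiji = 64 CUs x 64 lanes (rumoured),
# Knights Corner = 61 cores x 16 fp32 lanes (512-bit SIMD).
fiji_cores, fiji_lanes = 64, 64
phi_cores, phi_lanes = 61, 16

fiji_alus = fiji_cores * fiji_lanes   # 4096 "stream processors"
phi_alus = phi_cores * phi_lanes      # 976 fp32 lanes

print(fiji_alus, phi_alus)               # 4096 976
print(round(fiji_cores / phi_cores, 2))  # 1.05 -> comparable core counts
print(round(fiji_alus / phi_alus, 2))    # 4.2  -> the gap is SIMD width
```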

There are definitely elements left to explore in making this arrangement more efficient, i.e. bringing it closer to a realization of the SIMT conceptual model - I find this presentation by Andy Glew to be an excellent exploration, albeit probably a tad optimistic (IMHO): http://parlab.eecs.berkeley.edu/sites/all/parlab/files/20090827-glew-vector.pdf. Note that this does not necessarily mean much for graphics in the near term, and that's where the game is being played. NV is doing wonderfully due to bringing graphics prowess to the table at exactly the right moment.
 
Ahaha, okay, so it's about politics then? :p Anyway, I chose "express uninstall of all AMD software", which reasonably should include the not-the-driver-but-also-part-of-the-driver catalyst, AND the actual driver itself. Yet when driver cleaner sorted through my stuff, it found tons of remnants. After killing all of that stuff the install program was able to proceed.

This - leaving basically everything behind after "uninstalling" - is an issue AMD has never bothered to resolve, and it's been extremely well documented over the years. It takes something like a minute for the Catalyst installer to do its thing (I haven't actually timed it, but it won't be wildly off), while "uninstall" finishes in a snap of the fingers by comparison. It stands to reason that it doesn't actually do all that much uninstalling in that short amount of time.
If I'm not mistaken, it's Windows that wants to keep old driver files, and those are the ones that driver cleaner software finds and cleans up.
 
If this is true, Fiji has 2x more memory bandwidth than it needs.
I also never understood why people are expecting all the bandwidth saving technologies in Fiji. For example, if the compression had anything to do with the ballooning size of Tonga, then you'd better drop it.
 
I also never understood why people are expecting all the bandwidth saving technologies in Fiji. For example, if the compression had anything to do with the ballooning size of Tonga, then you'd better drop it.
I still expect that the full Tonga has a 384-bit memory bus; the CU count has already been confirmed to equal Tahiti's.
 