AMD: Volcanic Islands R1100/1200 (8***/9*** series) Speculation/Rumour Thread

It's not "driver issues", it's the fact that the R9 290 was supposed to compete against the GTX 770; now, with the recent price cuts, it will compete with the GTX 780.
Apparently the new drivers will increase clock speeds, and reviewers have been asked to re-run their R9 290 benchmarks with the new drivers (once they receive them).

The 280X already does a fine job against the 770. It would be silly for AMD to think that 7xx prices wouldn't adjust to the new competitive reality.
 
I've only now started to read the R9 290X reviews, starting with the one from Anandtech. Ryan Smith writes the following:
At a high level the biggest change here is that AMD is now segmenting their hardware into “shader engines”. Conceptually the idea is similar to NVIDIA’s SMXes, with each Shader Engine (SE) representing a collection of hardware including shaders/CUs, geometry processors, rasterizers, and L1 cache.
I'm confused: shouldn't an SE of AMD correspond to a GPC of Nvidia, and a CU to an SMX?

Moving forward, AMD designs are going to scale up and down both with respect to the number of SEs and in the number of CUs in each SE. This distinction is important because unlike NVIDIA’s SMX model, where the company can only scale down hardware by cutting whole SMXes, AMD can technically maintain up to 4 SEs while scaling down the number of CUs within each SE.
Nvidia can cut down the number of SMXes per GPC while leaving all GPCs active, so this paragraph doesn't quite make sense, IMHO.

Am I missing something?
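To put the scaling idea in toy code (a rough sketch of my own; the full-chip counts are Hawaii's 4 SEs x 11 CUs and GK110's 5 GPCs x 3 SMXes, and the salvage configs are just illustrations):

# Toy model of the two scaling schemes (illustration only, not a hardware spec).
def amd_salvage(ses=4, cus_per_se=11, cus_disabled=4):
    # AMD: keep all SEs active and shave CUs off each SE (assumes an even cut).
    per_se = cus_per_se - cus_disabled // ses
    return {"shader_engines": ses, "cus_per_se": per_se, "total_cus": ses * per_se}

def nvidia_salvage(gpcs=5, smx_per_gpc=3, smx_disabled=1):
    # Nvidia: scaling down means disabling whole SMXes.
    return {"gpcs": gpcs, "total_smx": gpcs * smx_per_gpc - smx_disabled}

print(amd_salvage())     # 4 SEs x 10 CUs = 40 CUs (an R9 290-style cut)
print(nvidia_salvage())  # 14 of 15 SMXes (a GTX Titan-style configuration)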
 
I've only now started to read the R9 290X reviews, starting with the one from Anandtech. Ryan Smith writes the following:

I'm confused: shouldn't an SE of AMD correspond to a GPC of Nvidia, and a CU to an SMX?
An SMX comes with its own polymorph engine, so there are certain classes of logic that get subdivided differently.
At a vector engine level, I would say the ALU resources of an SMX most resemble a group of CUs sharing the same scalar and instruction caches.
However, the sharing of the register file, texturing, and shared memory doesn't make it a very strong pattern.

Am I missing something?
I'm curious whether someone will come out and explain if the "new" shader engine hierarchy of Hawaii differs from Tahiti's shader engines, which were discussed in this thread before the big reveal of the 290X.
 
Got it: I didn't realize that SMX also had this polymorph business going on.

Is it fair to say that, for AMD and the R9 290X, all geometry handling is split into 4 blocks, while for Nvidia and GK110 there's an additional level of hierarchy, in that you have 5 GPCs and 15 polymorph engines?
 
I'm thinking there's a rough split between primitive setup/generation and rasterization.

The rasterizer portion subdivides screen space, and this leaves a GPC and an SE similarly delineated.

The portion that belongs in the geometry and primitive blocks is more numerous for Nvidia, but that doesn't conflict with the GPC arrangement.
AMD has kept the same count for both, but the geometry processors have arrows that feed their output to rasterizers in other SEs, just as the polymorph engines need to distribute to other GPCs.

So it seems to be a similar hierarchy, except that Nvidia has more units at the setup level. The raster portion has fewer blocks in Hawaii, but they seem to be heftier given the raw pixel throughput of the design.
 
All the geometry processing done by NV is distributed at the multiprocessor level. The primitive setup stage distribution is coarser, grouping several multiprocessors together (GF100: 4 per setup block, GK104: 2, GK110: 3).

AMD is still using the more conventional "monolithic" front-end approach, replicating the whole block to distribute the workload.
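For concreteness, that grouping just falls out of the multiprocessor-to-GPC ratios (a quick arithmetic check; the full-chip unit counts below are the ones I believe these dies have):

# Multiprocessors per primitive-setup block = total SM(X)s / GPCs.
chips = {
    "GF100": {"sms": 16, "gpcs": 4},
    "GK104": {"sms": 8,  "gpcs": 4},
    "GK110": {"sms": 15, "gpcs": 5},
}
for name, c in chips.items():
    print(name, c["sms"] // c["gpcs"], "multiprocessors per setup block")
# GF100: 4, GK104: 2, GK110: 3 -- matching the grouping above.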
 
It's not "driver issues", it's the fact that the R9 290 was supposed to compete against the GTX 770; now, with the recent price cuts, it will compete with the GTX 780.
Apparently the new drivers will increase clock speeds, and reviewers have been asked to re-run their R9 290 benchmarks with the new drivers (once they receive them).

The 290 vs the 770?
HardOCP shows the 7970 beating the 770,
so I would say the 290 is meant to sit between the 770 and the 780.
New drivers down the line are likely to increase performance a lot.

My 7970 sits at 100 fps in BF4; it runs better than BF3, and Mantle isn't even out yet.
For a card made so long ago, it's a beast.
 
At a vector engine level, I would say the ALU resources of an SMX most resemble a group of CUs sharing the same scalar and instruction caches. However, the sharing of the register file, texturing, and shared memory doesn't make it a very strong pattern.

It's easier for me to look at it as 1 CU = 1 SMX. Multiple CUs share an L1 cache but otherwise each CU is self contained and indivisible, just like an SMX.

Both the SMX and the CU are the smallest units of computation in their respective architectures.
 
A couple of potential explanations come to mind. One, TessMark uses OpenGL, and it's possible AMD hasn't updated its OpenGL drivers to take full advantage of Hawaii's quad geometry engines. Two, the drivers could be fine, and we could be seeing an architectural limitation of the Hawaii chip. As I noted earlier, large amounts of geometry amplification tend to cause data flow problems. It's possible the 290X is hitting some internal bandwidth barrier at the x32 and x64 tessellation levels that's common to GCN-based architectures. I've asked AMD to comment on these results but haven't heard back yet. I'll update this text if I find out more.

???

http://techreport.com/review/25509/amd-radeon-r9-290x-graphics-card-reviewed/6
 
Got it: I didn't realize that SMX also had this polymorph business going on.

Is it fair to say that, for AMD and the R9 290X, all geometry handling is split into 4 blocks, while for Nvidia and GK110 there's an additional level of hierarchy, in that you have 5 GPCs and 15 polymorph engines?
I think of an AMD SE as being closest to an Nvidia GPC. One difference with regard to geometry processing is that Nvidia has a pre-cull rate that's independent of the post-cull rate.

I'm still disappointed in AMD's tessellation implementation.
http://techreport.com/review/25509/amd-radeon-r9-290x-graphics-card-reviewed/6
Multiply NVidia's score by the tests' tessellation factors, and you get a roughly constant number, i.e. tris/s is constant. AMD, OTOH, loses throughput with t-factor. Hawaii didn't fix anything.

There's no need for off-chip buffering, no matter how high the t-factors. It's a really lazy algorithm.
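To make the "multiply the score by the t-factor" point concrete, here's the normalization applied to invented TessMark-style numbers (the fps values below are made up purely to show the pattern; the real numbers are in the TechReport link):

# fps * tessellation factor is proportional to triangles/second.
# Scores below are hypothetical, for illustration only.
scores = {
    "NVidia": {8: 400, 16: 200, 32: 100, 64: 50},
    "AMD":    {8: 400, 16: 180, 32: 70,  64: 25},
}
for vendor, by_factor in scores.items():
    print(vendor, {f: fps * f for f, fps in by_factor.items()})
# NVidia's product stays flat (3200 everywhere) -> constant tris/s.
# AMD's product falls with the factor -> throughput drops as amplification rises.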
 
You're getting caught up in the off-chip buffering statement and not understanding what it means. Granted, no one has explained it. I'm not at liberty to go into much detail, but replace "off-chip buffering" with "cached memory." If you process a HS on CU0, you don't want all of the DS verts for that patch to execute on the same CU, so the output data needs to go to a location up the memory hierarchy. On AMD hardware that's cached memory that could get flushed off chip.

Nvidia's SMXes have more compute per LDS than an AMD CU, but I suspect they can write HS output to the L2 as well.
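A toy sketch of the scheduling point (my own illustration, not AMD's actual scheduler): once DS waves for one patch fan out over several CUs, the HS output has to sit somewhere they can all read it.

# Round-robin placement of DS wavefronts across CUs (illustration only).
# Waves for one patch land on different CUs, so the HS output for the patch
# must live in memory visible to all of them (cached memory / L2), not in
# the LDS of the CU that ran the hull shader.
NUM_CUS = 4

def launch_ds_waves(patch_id, num_waves):
    for wave in range(num_waves):
        cu = wave % NUM_CUS  # spread the patch's waves over the CUs
        print(f"patch {patch_id}, DS wave {wave} -> CU{cu}")

launch_ds_waves(patch_id=0, num_waves=6)  # fans out over CU0..CU3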
 
You're getting caught up in the off-chip buffering statement and not understanding what it means. Granted, no one has explained it. I'm not at liberty to go into much detail, but replace "off-chip buffering" with "cached memory." If you process a HS on CU0, you don't want all of the DS verts for that patch to execute on the same CU, so the output data needs to go to a location up the memory hierarchy. On AMD hardware that's cached memory that could get flushed off chip.
If you design the tessellation unit properly, there's no need to generate all the DS verts at once. The hardware tessellator should have access to patch data (maybe a few wavefronts' worth, so that the case of zero amplification doesn't cause excessive slowdown) and create wavefronts of DS verts as they're needed.

Why should any DS verts have to go off chip? AMD doesn't even need the tessellator to do anything but put a couple of indices in each DS vert, and a shader program (inserted into the DS) can calculate the barycentric coords from there.
 
I'm still disappointed in AMD's tessellation implementation.
http://techreport.com/review/25509/amd-radeon-r9-290x-graphics-card-reviewed/6
Multiply NVidia's score by the tests' tessellation factors, and you get a roughly constant number, i.e. tris/s is constant. AMD, OTOH, loses throughput with t-factor. Hawaii didn't fix anything.

There's no need for off-chip buffering, no matter how high the t-factors. It's a really lazy algorithm.

TessMark is really odd. I don't know why, but for a long time they didn't even test their software on AMD GPUs. I don't know if it comes from the driver or from TessMark not having been updated, but Pitcairn got lower results at 32x-64x. If you look at Unigine results, it's way different (even if, in Unigine's extreme tessellation mode, tessellation isn't the only thing being tested).
 
These results are weird. TechSpot shows the GTX 760 beating the 670, and Guru3D shows the 680 ahead of the 770.
 
If you design the tessellation unit properly, there's no need to generate all the DS verts at once. The hardware tessellator should have access to patch data (maybe a few wavefronts' worth, so that the case of zero amplification doesn't cause excessive slowdown) and create wavefronts of DS verts as they're needed.

Why should any DS verts have to go off chip? AMD doesn't even need the tessellator to do anything but put a couple of indices in each DS vert, and a shader program (inserted into the DS) can calculate the barycentric coords from there.
You generate enough DS verts to fill a wavefront/warp and launch it. One per clock per tessellator.

Your last sentence is correct, but some of the previous comments are not.

The hardware tessellator only needs tessellation factors. It outputs UVs, and the DS does the rest. The DS needs access to the HS output, and if you only have a few wavefronts in flight, performance will suck. Hence you want DS waves from the same patch to execute on multiple CUs/SMXes in parallel.
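As a plain-Python stand-in for that division of labour (illustrative only, assuming a flat triangle-domain patch): the tessellator hands the DS nothing but a (u, v) pair, and the DS derives the barycentrics and evaluates the patch from the HS control points.

# Toy triangle-domain "domain shader" (illustration, not real shader code).
def domain_shader(uv, control_points):
    u, v = uv
    w = 1.0 - u - v                  # third barycentric coordinate
    p0, p1, p2 = control_points      # HS output for this patch
    return tuple(u * a + v * b + w * c for a, b, c in zip(p0, p1, p2))

patch = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # toy flat patch
print(domain_shader((0.25, 0.25), patch))  # -> (0.25, 0.5, 0.0)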
 
Concerning Mantle and BF4...

-BF3 used heavy object instancing to reduce the draw call count and improve performance; in one example, from 4000 to 900, less than a quarter (see the sketch after this list). I don't know if BF4 does the same, but it's highly likely.

-In BF4 single player: the HD 7850 achieves about 55 fps and the HD 7870 about 65 fps, both at 1680x1050 with the High preset.

-The PS4 GPU runs the game at 1600x900 @ 60 fps, also at the High preset. So results seem to fall between the HD 7850 and HD 7870, as expected and consistent with the console's low clock speeds.

-The XB1 runs at 1280x720 @ 60 fps, also with the High preset. Again, consistent with its much weaker GPU.

-I don't know if both consoles are running code that is close to the metal, and whether it's much closer than Mantle or not. That raises questions about how much performance can be extracted using the Mantle API, and whether it will affect the CPU more than the GPU or vice versa.
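On the draw-call point in the first item: a minimal mock of how instancing collapses calls (the function names are invented for this sketch, not from any real API, and the scene composition is made up):

from collections import defaultdict

def draw_naive(objects):
    return len(objects)                  # one draw call per object

def draw_instanced(objects):
    by_mesh = defaultdict(list)
    for mesh, transform in objects:
        by_mesh[mesh].append(transform)  # batch identical meshes together
    return len(by_mesh)                  # one instanced call per unique mesh

scene = ([("rock", i) for i in range(3000)]
         + [("tree", i) for i in range(900)]
         + [("crate", i) for i in range(100)])
print(draw_naive(scene))      # 4000 draw calls
print(draw_instanced(scene))  # 3 instanced draw calls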
 
The tested resolution has 22.5% more pixels than the PS4's. And we've yet to see what advantage Mantle brings to the table. With those elements factored in, it looks like the 7850 should be as fast as or faster than the PS4.
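The 22.5% figure checks out:

# Pixel-count comparison behind the 22.5% figure.
tested = 1680 * 1050   # resolution used in the benchmarks quoted above
ps4    = 1600 * 900    # PS4's reported render resolution in BF4
print(tested, ps4)       # 1764000 1440000
print(tested / ps4 - 1)  # 0.225 -> 22.5% more pixels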
 