AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Yesterday, on a completely different internet connection, your image did show up, while on multiple computers at this connection it didn't. Today it shows on my computer :???:

Are you using chrome?
 
Are you using chrome?
No, FF and IE. Anyway I can only guess that my ISP was being creative with something or other, since the problem has disappeared. I had wondered whether Win8CP was the cause, but I ruled that out.
 
The Radeon 7900 review thread discussed a possible weakness in AMD's design when it came to MRTs and MSAA. Building the G-buffer in various deferred schemes is an area where Nvidia handled things significantly better.
 
The Radeon 7900 review thread discussed a possible weakness in AMD's design when it came to MRTs and MSAA. Building the G-buffer in various deferred schemes is an area where Nvidia handled things significantly better.

So, apart from the obvious economic reasons that probably made them tease it on a Kepler card, it's not crazy to think that a heavily deferred engine like the future Unreal Engine 4 could perform much better on an Nvidia architecture, is it?
 
The Radeon 7900 review thread discussed a possible weakness in AMD's design when it came to MRTs and MSAA. Building the G-buffer in various deferred schemes is an area where Nvidia handled things significantly better.
There's no "why" in that discussion, as far as I can tell.
 
There's no "why" in that discussion, as far as I can tell.
To add to this point, if I'm not mistaken, AMD GPUs read a render target (configured as input to a pixel shader) through the TMU data path. Therefore it could actually be a TMU weakness, not one of the ROPs. That would also explain why the performance relation between Pitcairn and Tahiti stays virtually the same with MSAA.
 
To add to this point, if I'm not mistaken, AMD GPUs read a render target (configured as input to a pixel shader) through the TMU data path. Therefore it could actually be a TMU weakness, not one of the ROPs. That would also explain why the performance relation between Pitcairn and Tahiti stays virtually the same with MSAA.
I would think everybody reading a render target in a pixel shader would do so through the TMU data path? A render target should look pretty much like any ordinary texture when accessed in the pixel shader. Maybe it's more likely to have non-full-speed throughput due to an "odd" format, but otherwise what's the difference?
 
The Radeon 7900 review thread discussed a possible weakness in AMD's design when it came to MRTs and MSAA. Building the G-buffer in various deferred schemes is an area where Nvidia handled things significantly better.

That's the feeling I get when hearing some people talk about the results in BF3 when comparing Tahiti to Fermi. Without MSAA, Tahiti's performance is about where you'd expect it, but once you enable MSAA, performance tanks compared to Fermi cards.

Regards,
SB
 
I would think everybody reading a render target in a pixel shader would do so through the TMU data path? A render target should look pretty much like any ordinary texture when accessed in the pixel shader.
Going back into the mists of time, CUDA was often (though not always) higher performance reading linearly organised buffers through non-TMU paths rather than through the TMUs.

What we could simply be seeing in this scenario, is that NVidia's "CUDA-specific" linear access hardware is better than AMD's. It wasn't that long ago that doing linear buffers in OpenCL on AMD was a disaster zone (because it was based upon the vertex fetch hardware) and AMD might still be climbing that curve.

AMD's initial support for UAVs was something of a kludge as far as I can tell: the multiple UAVs required by D3D were emulated by configuring a single physical UAV in hardware and splitting it up. Additionally, AMD hardware has severe constraints on the size of a UAV. A common complaint amongst OpenCL programmers is (was?) that it is impossible to allocate a single monster UAV (that is, a linear buffer) using the majority of graphics memory (e.g. 900MB out of 1GB); there's some kind of hardware/driver restriction that only allows for a 50% allocation. Allocating texture memory in OpenCL is less constrained.

Textures are normally non-linearly mapped across memory channels.

On Xenos there's a non-linear organisation of MSAA'd render targets (and regular render targets? can't remember). If you rummage you'll find swizzle "hacks" for deferred MSAA rendering on XB360. Something like that, I forget the details.

It seems to me that anyone with access to both AMD and NVidia cards should be able to enumerate the MSAA levels and MRT configurations with some clever code, to observe the performance profiles of the raw access techniques, without being obscured by all the other stuff that happens in a game frame.
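
Just to make that concrete, here's a rough sketch of the kind of probe I have in mind. It's entirely hypothetical: it assumes Python with moderngl and a working standalone GL context, times on the CPU around finish() instead of using proper GPU timer queries, and every name, size and sample count in it is just something I picked for illustration.

# Hypothetical microbenchmark sketch: time fullscreen fills into N MRTs at
# various MSAA sample counts. Assumes Python + moderngl + a standalone GL context.
import struct
import time

import moderngl

WIDTH, HEIGHT = 1920, 1080
DRAWS = 200  # fullscreen quads per measurement

VS = """
#version 330
in vec2 in_pos;
void main() { gl_Position = vec4(in_pos, 0.0, 1.0); }
"""

def make_fs(num_targets):
    # One output per render target, all written for every pixel.
    outs = "\n".join(f"layout(location = {i}) out vec4 c{i};" for i in range(num_targets))
    body = "\n".join(f"    c{i} = vec4(0.25 * {i});" for i in range(num_targets))
    return f"#version 330\n{outs}\nvoid main() {{\n{body}\n}}\n"

ctx = moderngl.create_standalone_context()
quad = ctx.buffer(struct.pack("8f", -1, -1, 1, -1, -1, 1, 1, 1))  # fullscreen strip

for samples in (0, 2, 4, 8):
    for mrts in (1, 2, 4):
        try:
            # RGBA16F colour attachments, roughly matching a fat G-buffer.
            attachments = [
                ctx.renderbuffer((WIDTH, HEIGHT), components=4, samples=samples, dtype="f2")
                for _ in range(mrts)
            ]
            fbo = ctx.framebuffer(color_attachments=attachments)
        except Exception as exc:  # unsupported sample count / format combination
            print(f"samples={samples} mrts={mrts}: unsupported ({exc})")
            continue

        prog = ctx.program(vertex_shader=VS, fragment_shader=make_fs(mrts))
        vao = ctx.simple_vertex_array(prog, quad, "in_pos")

        fbo.use()
        vao.render(moderngl.TRIANGLE_STRIP)  # warm up
        ctx.finish()

        t0 = time.perf_counter()
        for _ in range(DRAWS):
            vao.render(moderngl.TRIANGLE_STRIP)
        ctx.finish()
        dt = time.perf_counter() - t0

        gpix = DRAWS * WIDTH * HEIGHT / dt / 1e9
        print(f"samples={samples} mrts={mrts}: {gpix:.2f} Gpix/s (shaded)")

Obviously a trivial flat-colour shader and CPU-side timing won't reproduce a real G-buffer pass, but it should be enough to see whether the relative MSAA/MRT hit differs between the two vendors.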
 
Maybe I missed the thumbnail?

What if I choose a random factor for each chip: a multiple of 3 for Cape Verde, a multiple of 6 for Pitcairn, and a multiple of 8 for Tahiti?


I know this isn't very scientific but... There should be something more than just ROPs, TMUs, SPs and memory bandwidth there...
 
What we could simply be seeing in this scenario, is that NVidia's "CUDA-specific" linear access hardware is better than AMD's. It wasn't that long ago that doing linear buffers in OpenCL on AMD was a disaster zone (because it was based upon the vertex fetch hardware) and AMD might still be climbing that curve.
I'd think that GCN's better cache architecture should potentially fix such issues? Though I'm largely missing how any necessary synchronization etc. really works for UAVs...
 
Going back into the mists of time, CUDA was often (though not always) higher performance reading linearly organised buffers through non-TMU paths rather than through the TMUs.

What we could simply be seeing in this scenario, is that NVidia's "CUDA-specific" linear access hardware is better than AMD's. It wasn't that long ago that doing linear buffers in OpenCL on AMD was a disaster zone (because it was based upon the vertex fetch hardware) and AMD might still be climbing that curve.
We've had caching for buffer reads for EG/NI chips for quite a while now. It doesn't help when a buffer is read/write, but there are plenty of buffers that are read only so it's still quite beneficial. SI has caching all the time of course.

Jawed said:
AMD's initial support for UAVs was something of a kludge as far as I can tell - for the multiple UAVs that are required by D3D, using an emulation that configures a single physical UAV in hardware and splits it up. Additionally AMD hardware has severe constraints on the size of a UAV - a common complaint amongst OpenCL programmers is (was?) that it is impossible to allocate a single monster UAV (that is, a linear buffer) to use the majority of graphics memory (e.g. 900MB out of 1GB). There's some kind of hardware/driver restriction that only allows for 50% allocation. Allocating texture memory in OpenCL is less constrained.
There are some reasons for this. First, the GPU's memory pool is split into two regions: CPU visible and invisible. The CPU visible region we expose is 256MB, normally. This means that you have at most 768MB of contiguous memory on a 1GB card. The way the OpenCL conformance tests are written, you have to be able to allocate a buffer of the maximal size you report, which is sort of impossible to guarantee unless you're conservative. I believe Nvidia only exposes 128MB of CPU visible memory, so they have a larger contiguous pool to work with. They also may handle memory allocations differently, but we use VidMM and expose two memory pools. Note that I believe we've improved this (memory allocation) behavior recently, but you're still going to have some limits caused by having two memory pools.

My understanding is that if everyone were using 64-bit OSes (and apps) we could expose all the video memory to the CPU and not worry about having separate memory pools, not to mention facilitating faster data uploads in some cases.
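
For anyone curious, the split is easy to see from the host side. A minimal sketch, assuming PyOpenCL is installed (just an illustration, not an official tool), that prints how the maximum single allocation compares to the total device memory:

# Minimal sketch: compare the reported global memory size to the largest
# single allocation each OpenCL device claims to support. Assumes PyOpenCL.
import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        total = dev.global_mem_size
        max_alloc = dev.max_mem_alloc_size
        print(f"{dev.name.strip()}: "
              f"global {total / 2**20:.0f} MB, "
              f"max single alloc {max_alloc / 2**20:.0f} MB "
              f"({100.0 * max_alloc / total:.0f}% of total)")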
 
There are some reasons for this. First, the GPU's memory pool is split into two regions: CPU visible and invisible. The CPU visible region we expose is 256MB, normally. This means that you have at most 768MB of contiguous memory on a 1GB card. The way the OpenCL conformance tests are written, you have to be able to allocate a buffer of the maximal size you report, which is sort of impossible to guarantee unless you're conservative. I believe Nvidia only exposes 128MB of CPU visible memory, so they have a larger contiguous pool to work with. They also may handle memory allocations differently, but we use VidMM and expose two memory pools. Note that I believe we've improved this (memory allocation) behavior recently, but you're still going to have some limits caused by having two memory pools.

My understanding is that if everyone were using 64-bit OSes (and apps) we could expose all the video memory to the CPU and not worry about having separate memory pools, not to mention facilitating faster data uploads in some cases.

At least on the HD5xxx family, you cannot allocate a single OpenCL buffer larger than 128MB (though you can allocate multiple 128MB buffers). I haven't recently verified whether this limit is still present, but I assume so. It is one of the most annoying limitations of AMD's older hardware and was a severe limitation for most OpenCL applications on AMD.

In my opinion, this limit was more annoying than not having access to the whole GPU memory pool.

Another note: in the past, using linear data stored in an OpenCL image buffer was an effective way to improve performance over storing the data in an OpenCL linear buffer. This optimization was quite annoying to code, too.
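
If someone wants to re-check whether the 128MB limit is still there, a rough probe along these lines would do. This is only a sketch on my part, assuming PyOpenCL; the tiny "touch" kernel exists solely to force the allocation to actually be backed by memory, since buffer creation alone can be deferred by the runtime.

# Rough probe: find the largest single OpenCL buffer that can actually be
# allocated and touched on a device. Assumes PyOpenCL; names are my own.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Writing the first and last byte forces the runtime to really back the buffer.
prg = cl.Program(ctx, """
__kernel void touch(__global uchar *p, ulong n) {
    p[0] = 1;
    p[n - 1] = 1;
}
""").build()

def try_alloc(nbytes):
    try:
        buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=nbytes)
        prg.touch(queue, (1,), None, buf, np.uint64(nbytes))
        queue.finish()
        return True
    except cl.Error:
        return False

MB = 1 << 20
size = 64 * MB
largest = 0
while try_alloc(size):   # keep doubling until a request fails
    largest = size
    size *= 2

print(f"largest single buffer successfully allocated: {largest // MB} MB")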
 
To get back to the question of MSAA performance:
I would think everybody reading a render target in a pixel shader would do so through the TMU data path? A render target should look pretty much like any ordinary texture when accessed in the pixel shader. Maybe it's more likely to have non-full-speed throughput due to an "odd" format, but otherwise what's the difference?
Exactly, that was my idea. ;)
Just imagine nV can do it at full speed and AMD only at half speed (or some other difference). Factor in that AMD's TMUs are already slower for the FP16 format used quite often (afaik) for such render targets, and you may arrive at a significant difference.
 
At least on the HD5xxx family, you cannot allocate a single OpenCL buffer larger than 128MB (though you can allocate multiple 128MB buffers). I haven't recently verified whether this limit is still present, but I assume so. It is one of the most annoying limitations of AMD's older hardware and was a severe limitation for most OpenCL applications on AMD.
If you're using Linux, then the issue is lack of VM support. In Windows we support VM for all EG/NI/SI chips and don't have these issues. Currently, only SI has VM support in Linux.
Dade said:
Another note: in the past, using linear data stored in an OpenCL image buffer was an effective way to improve performance over storing data on OpenCL linear buffer. This optimization was quite annoying to code too.
This is probably because read-only images are always cached. Buffers used read-only would be cached as well, as long as you don't alias pointers. E.g. given "kernel void foo(global float* in, global float* out)", if the same memory object were bound to both "in" and "out", then "in" would not be cached.

Sorry for the OT, but I thought it was worth explaining.
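
To make the aliasing point concrete, here's a small made-up example (my own illustration, not anything out of the compiler documentation). The input is declared const and restrict and bound to a different memory object than the output, so the compiler is free to treat it as read-only and cacheable; binding the same memory object to both arguments would break that assumption.

# Small made-up example of the aliasing point above. Assumes PyOpenCL.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# 'in_buf' is declared const + restrict, so the compiler may treat it as
# read-only and route it through the cached path; binding the same memory
# object to both arguments would break that promise.
SRC = """
__kernel void scale(__global const float * restrict in_buf,
                    __global float * restrict out_buf)
{
    size_t i = get_global_id(0);
    out_buf[i] = 2.0f * in_buf[i];
}
"""
prg = cl.Program(ctx, SRC).build()

host = np.arange(1024, dtype=np.float32)
a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host)
b = cl.Buffer(ctx, mf.WRITE_ONLY, size=host.nbytes)

prg.scale(queue, host.shape, None, a, b)   # distinct buffers: 'a' stays cacheable
result = np.empty_like(host)
cl.enqueue_copy(queue, result, b)
queue.finish()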
 
Well, that's only 3GB per chip, much like all their other CrossFire-on-a-stick cards.

Given that there's already a 6GB single-GPU card coming out, 6GB for two GPUs is hardly amazing.
 
So two full Tahitis @ 850 MHz with only a 300W TDP?
(and probably a BIOS switch making it a full 7970x2 at around 375W)
That sounds pretty efficient :)
 