OCCT and the HD 2600 Pro missing SIMD

Frontino

Newcomer
I always knew the HD 2600 Pro was equipped with the RV630, which should consist of 3 SIMDs of 8 VLIW5 units each, for a supposed total of 120 ALUs, but my Asus HD2600Pro 256 MB could have 1 SIMD deactivated (or maybe RV630 never had 3 SIMDs at all!).
I realized, while testing with the GPU stress bench from OCCT and loading both cards' GPUs at 100% @ 1440x900 fullscreen with shader complexity 8, that:

with my current 9800 GX2 stock clocked I get 50 FPS.
with only 1 G92 active I get 25 FPS (exactly half-performance).
with the HD 2600 Pro I get 6 FPS.

Just so you know, my CPU is practically idling with both cards.

Now, if I make a simple calculation to work out the performance ratio between G92 and RV630 ALUs, a strange result comes up.
The formula I applied was:

(G92's) 256 ALUs * 1500 MHz / 50 FPS * 6 FPS / (RV630's) 600 MHz = 76.8, which, rounded to 80, brings me to exactly the 80 ALUs contained in 2 SIMDs of an RV630.
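Here's the same back-of-the-envelope arithmetic as a quick Python sketch (my own check, using the numbers above; note it assumes FPS scales linearly with ALU count and shader clock):

```python
# Infer the RV630 ALU count from relative OCCT FPS, assuming FPS scales
# linearly with (ALU count * shader clock).
g92_alus = 256       # 2 x 128 SPs on the 9800 GX2
g92_clock = 1500     # MHz, shader domain
g92_fps = 50

rv630_clock = 600    # MHz
rv630_fps = 6

rv630_alus = g92_alus * g92_clock / g92_fps * rv630_fps / rv630_clock
print(rv630_alus)    # 76.8 -> rounded to 80, i.e. 2 SIMDs of 40 ALUs
```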

What reinforces my idea is that one day I installed an out-of-date GPU-Z, which showed 80 unified shaders.

What do you think of this theory?
 
I think you're trying to decide how many cylinders a car has by how fast it goes. The reality is that they're not linked.

Each card will have its own unique characteristics that impact performance. Using a performance comparison between two wildly different architectures to try to define one of the architectures is plainly wrong.
 
What reinforces my idea is that one day I installed an out-of-date GPU-Z, which showed 80 unified shaders.

What do you think of this theory?

That you haven't come across the surprising fact that there is more to a GPU than shaders. :p

Anyway,
shader complexity 8

http://www.software-listings.com/occt-310-beta-4.htm

Use the shader complexity to adjust the shader, and the test. On some cards, some values are much better than others. For instance, value 3 is far better on HD4XXX than 4… on GeForce GTX280, 8 seems better. OCCT will suggest the best one for you, if known. 1 is the default.

needs moar tuning.
 
I forgot to mention that the OCCT GPU Test is heavily shader limited.
I tried overclocking and downclocking the 9800 GX2's Core and RAM, both independently and simultaneously, and there's no change in FPS, while overclocking the Shaders from 1500 MHz to 1750 MHz adds 7 FPS.
So, memory bandwidth and fillrate don't count.
And I remind you that, with shader complexity 8, the HD 2600 Pro loads at 100%, and lowering that value on the 9800 GX2 gives a lower load on its GPUs. So, to get comparable results, I chose to use the maximum value for both cards.
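As a rough sanity check on the "heavily shader limited" claim (a quick sketch assuming FPS scales linearly with shader clock, which only holds if nothing else is the bottleneck):

```python
# If OCCT were purely shader-bound, FPS should scale with the shader clock.
base_fps = 50
base_clock = 1500          # MHz, stock shader clock
oc_clock = 1750            # MHz, overclocked

predicted_fps = base_fps * oc_clock / base_clock
print(predicted_fps)       # ~58.3, i.e. about +8 FPS; the measured +7 FPS
                           # is close, so the test looks mostly shader-bound
```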
 
What reinforces my idea is that one day I installed an out-of-date GPU-Z, which showed 80 unified shaders.

What do you think of this theory?
Have you tried an up-to-date version of GPU-Z?
I would expect newer versions to have either the same or better detection abilities.

Anyway, you could also say that the test proves that on average one R6xx Stream Processor has approximately 2/3 of the real-world throughput of a G8x/G9x Unified Shader at the same clock.
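A quick check of that 2/3 figure (my own sketch; it assumes the full 120 ALUs of the RV630 really are active):

```python
# Per-ALU, per-MHz throughput in the OCCT test for both chips.
g92_rate = 50 / (256 * 1500)     # FPS per (ALU * MHz), both G92s of the GX2
rv630_rate = 6 / (120 * 600)     # assuming all 120 RV630 ALUs are working

print(rv630_rate / g92_rate)     # ~0.64, i.e. roughly the 2/3 stated above
```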
 
GPU-Z does not detect the number of shader units present. It uses the device ID to look up what numbers it should display. It is nothing but a glorified lookup application.

Earlier versions simply have the wrong numbers entered in their internal lookup tables.
 
Ok. I'm leaning towards the theory that, somehow, some of the 120 ALUs are idling and the "100% GPU load" is in fact the driver load and not the actual chip load.
That would explain why the 9800 GX2's GPUs are at 100%, but if I downclock or overclock the core while letting the shaders at stock frequency it doesn't change the framerate in the OCCT GPU test.
Because that would mean part of the chip is idling, right?
 
GPU-Z does not detect the number of shader units present. It uses the device ID to look up what numbers it should display. It is nothing but a glorified lookup application.

Earlier versions simply have the wrong numbers entered in their internal lookup tables.

I used to think that too until it detected the shader mod of a 6950 BIOS hack that doesn't touch the ids.

It must read some registers or something too.
 
I forgot to mention that the OCCT GPU Test is heavily shader limited.
I tried overclocking and downclocking the 9800 GX2's Core and RAM, both independently and simultaneously, and there's no change in FPS, while overclocking the Shaders from 1500 MHz to 1750 MHz adds 7 FPS.
So, memory bandwidth and fillrate don't count.
And I remind you that, with shader complexity 8, the HD 2600 Pro loads at 100%, and lowering that value on the 9800 GX2 gives a lower load on its GPUs. So, to get comparable results, I chose to use the maximum value for both cards.

Alright, so since I get 20 fps with a 4850 at your settings, I have ~245, rounded to 240 shaders, 3 SIMDs :cry:

and yes it goes to 100% utilisation, the CPU isn't getting hot, etc. etc.

Or my benchmark number isn't correct and maybe ATI did some driver hax, but these are from before that, or rather during the OCCT debacle:

http://www.xtremesystems.org/FORUMS/showpost.php?p=3802008&postcount=412

http://techreport.com/articles.x/16820/4

http://www.xtremesystems.org/forums/showpost.php?p=3802203&postcount=421

and if we introduce the mysterious dual-issue MUL in your original calculations, we get 115.2, rounded to 120 shaders on rv630!!! :oops::p
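For what it's worth, the 4850 number fed through the same linear-scaling arithmetic (my own quick sketch, taking the 4850's stock 625 MHz core clock and 800 SPs as given):

```python
# 4850 result through the original formula (no dual-issue MUL counted):
rv770_alus = 256 * 1500 / 50 * 20 / 625
print(rv770_alus)    # 245.76 -> "about 240 shaders", i.e. only 3 of 10 SIMDs
```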
 
I used to think that too until it detected the shader mod of a 6950 BIOS hack that doesn't touch the ids.

It must read some registers or something too.
For the past few generations we have been providing a lot of the register spec for this type of detection to the dev.
 
Alright, so since I get 20 fps with a 4850 at your settings, I have ~245, rounded to 240 shaders, 3 SIMDs :cry:

and yes it goes to 100% utilisation, the CPU isn't getting hot, etc. etc.

Or my benchmark number isn't correct and maybe ATI did some driver hax, but these are from before that, or rather during the OCCT debacle:

http://www.xtremesystems.org/FORUMS/showpost.php?p=3802008&postcount=412

http://techreport.com/articles.x/16820/4

http://www.xtremesystems.org/forums/showpost.php?p=3802203&postcount=421

and if we introduce the mysterious dual-issue MUL in your original calculations, we get 115.2, rounded to 120 shaders on rv630!!! :oops::p

Oh, shoot! I forgot the 2nd MUL from the Special Function Units. Yes, there are 128 SFUs in each G92, so I should have calculated 384 ops * 2 GPUs * 1500 MHz / 50 FPS * 6 FPS / 600 MHz / 2 ops = 115.2 rounded to 120 ALUs.
So, OCCT loads G92's SFUs also.
Thanks for reminding me that!
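Written out as a quick check (my own sketch of the corrected arithmetic above):

```python
# Counting MADD as 2 ops plus the SFU MUL as 1: 128 * 3 = 384 "ops" per G92,
# 768 for both GPUs of the GX2; the RV630 MADD counts as 2 ops (the /2).
rv630_alus = 384 * 2 * 1500 / 50 * 6 / 600 / 2
print(rv630_alus)    # 115.2 -> close to the full 120 ALUs (3 SIMDs)
```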

As for your 4850, you could try shader complexity 3 (which is the most stressful setting for HD4xxx) and post your FPS. I'll post mine too.
 
And if we take into account the 2nd MUL to calculate again for your 4850:
768 ops * 1500 MHz / 50 FPS * 20 FPS / 625 MHz / 2 ops / 80 ALUs = 4.608 SIMDs, rounded to 5 SIMDs, which gives 50 FPS / 1,152,000 MFlops * 625 MHz * 800 ops = 21.7 FPS.
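And the same thing as a sketch (my own check of the arithmetic above, same op-counting assumptions as before):

```python
# Infer active SIMDs on the 4850 from its 20 FPS, counting the G92's
# MADD+MUL (768 "ops" for the GX2) and the RV770 MADD as 2 ops.
simds = 768 * 1500 / 50 * 20 / 625 / 2 / 80
print(simds)          # 4.608 -> rounds to 5 SIMDs

# Cross-check: 5 SIMDs = 400 ALUs = 800 ops at 625 MHz should then give:
predicted_fps = 50 / (768 * 1500) * 625 * (5 * 80 * 2)
print(predicted_fps)  # ~21.7 FPS, close to the measured 20 FPS
```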

If your 4850 really had only 5 SIMDs active, instead of 10 (remember, the GPU is at 100%), it would be a scandal for AMD on par with Watergate.

Or someone could come up and assure us that GPU utilization actually reflects driver saturation.
 
I would look at bottlenecks in the application. There are many cases that easily show the number of shaders / SIMDs in the architecture. I'd suggest looking at the GPGPU forum and searching for some of Prunedtree's posts for a start.
 
I made new tests today and every shader complexity except "1" brings the GPU temps to a max of 98/95 °C.
Complexity 1, instead, tops out at 104/101 °C.
But MSI Afterburner shows 99-100% utilization in every scenario, which means it relates not to actual chip occupancy but to the driver.
 
I noticed that at complexity 1, the framerate is also influenced by the Core and Memory clock rates.
Now I'm sure I can't compare ATI and Nvidia results for ALU count with the OCCT GPU Test.
 
I made new tests today and every shader complexity except "1" brings the GPU temps to a max of 98/95 °C.
Complexity 1, instead, tops out at 104/101 °C.
But MSI Afterburner shows 99-100% utilization in every scenario, which means it relates not to actual chip occupancy but to the driver.

I used to get 100% utilization in Far Cry 2, but the temps were about 10 °C lower than in the Crysis benchmark, which didn't load it to a constant 100%. The Crysis benchmark in turn ran at lower temperatures than the DX9 SDK demos, which had framerates of ~300 fps. And ATITool trumped them all, with frame rates in excess of 1000 fps and 100% GPU utilization.

I can run the tests but I don't think we can conclude anything worthwhile unless the guy releases the source code.
 
Ok. I'm leaning towards the theory that, somehow, some of the 120 ALUs are idling and the "100% GPU load" is in fact the driver load and not the actual chip load.

This means that 100% of the time some part of the GPU is doing something.

You CANNOT have all the units inside the chip "doing something" at the same time (even if you tried with a very unrealistic workload like FurMark, some register file ports and internal buses prevent this; some RF ports and buses are always shared by multiple FUs, so those cannot be used together).

And when you have "reasonable shader code" doing something reasonable, there are lots of data dependencies in that code. If one scalar instruction is executing and data dependencies prevent other instructions from being executed, 4/5 of the shader ALUs of an ATI GPU are idling and 1/5 is doing the work.

In the same situation with nvidia, 1/2 of the shader ALUs are idling and 1/2 are doing the work.

This is the main reason why nvidia gets "better efficiency" from shader code and "gets closer to theoretical flops" in real code, and also why the Radeon 6900 usually gets better shader performance than the Radeon 5800 even though it has fewer shader ALUs.
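As a toy illustration of the dependency argument (my own sketch, not a real shader trace or compiler model):

```python
# A fully serial chain of scalar ops: each op depends on the previous one,
# so only one of the five lanes in a VLIW5 ALU block can be filled per issue.
chain_length = 100                  # number of dependent ops, arbitrary

lanes = 5                           # R6xx-style VLIW5 block
busy_lane_cycles = chain_length * 1 # one lane working per issued bundle
total_lane_cycles = chain_length * lanes
print(busy_lane_cycles / total_lane_cycles)   # 0.2 -> 1/5 working, 4/5 idling
```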

That would explain why the 9800 GX2's GPUs are at 100%, but if I downclock or overclock the core while letting the shaders at stock frequency it doesn't change the framerate in the OCCT GPU test.
Because that would mean part of the chip is idling, right?

If the code spends 100% of the time doing raster operations and the shaders are doing very little, it shows 100% GPU utilization, but the shader clock speed has no impact on performance.
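A minimal way to picture that (a toy bottleneck model, my own sketch, not how the driver actually measures load): frame time is set by whichever stage is slowest, so in a raster-bound frame the GPU looks fully busy while shader clock changes do nothing.

```python
# Toy model: frame rate limited by the slowest stage (stages overlap).
def fps(raster_work, shader_work, raster_rate, shader_rate):
    frame_time = max(raster_work / raster_rate, shader_work / shader_rate)
    return 1.0 / frame_time

# Raster-bound case: doubling the shader rate changes nothing,
# yet some part of the GPU is busy 100% of the time.
print(fps(raster_work=10.0, shader_work=1.0, raster_rate=100.0, shader_rate=50.0))
print(fps(raster_work=10.0, shader_work=1.0, raster_rate=100.0, shader_rate=100.0))
```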
 