Evolve (GPU Benchmark)

Scott_Arm

New GPU benchmark. It's not free, but you can buy it from a few different stores.


With the rapid evolution of GPU technology, a one-size-fits-all benchmark is no longer sufficient, as the market has exploded with devices of different performance characteristics and form factors, all at different power envelopes and performance. This makes it more important than ever for a benchmark to give deep insight into exactly what is being measured. Evolve addresses this challenge with a multifaceted approach, providing seven distinct performance scores:
  • Energy Consumption – Measuring power efficiency to optimize performance-per-watt.
  • Ray Tracing Capabilities – Evaluating real-time ray tracing performance.
  • Rasterization Performance – Assessing traditional rendering capabilities.
  • Compute Performance – Measuring general-purpose GPU computing strength.
  • Driver Effectiveness – Analyzing how effectively software leverages hardware.
  • Acceleration Structure Build – Measuring how quickly ray tracing acceleration structures can be built.
  • Workgraphs Performance – Evaluating the emerging work graphs execution model (a world-first).
And as an industry-first, Evolve introduces workgraph performance evaluation, pioneering a new way to assess the next generation of GPU workloads.
 
My first thought was it didn't seem to / claim to have a mechanism for measuring power consumption linked to the benchmarks, which I feel should've been included as it relates to their stated concerns of a plethora of devices "all at different power envelopes and performance." Great, Vendor A's product can generate 20% more performance than Vendor B's product, but at the cost of 75% more power consumption. Is it worth it?

My second thought was even more out there: how does it ensure a consistent quality output? Remember when drivers "cheated" in 3DMark? Radeon's radial aniso optimization was a big stinkfest, then I think NVIDIA at some point was inserting z-clip planes to bump a few more frames in certain tests. Then we got into the feces-throwing with 16 bit vs 24 bit vs 32 bit and later NVIDIA's FX series forcing certain FP32 shaders to use FP16. Is there going to be any mechanism to test the output to ensure shortcuts aren't being taken?

Doesn’t have a denoiser.
And this thought never crossed my mind until you mentioned it, but I feel like it somehow fits into my second thought above. Even if the hardware does provide a denoiser, is it a tenable solution?
 
Hi Scott, thanks for posting about our benchmark suite!

I'm Jasper Bekkers, CTO of Traverse Research. We've been developing this benchmark suite for a while now and we're really happy to see it out the door. It's also very nice to see that this community is looking at it; I've been lurking on Beyond3D for a long time!


Doesn’t have a denoiser.

Correct, the path tracer mode deliberately doesn't have a denoiser at this time, since in that mode we're comparing raw ray-tracing performance first and foremost. However, we do have plans to integrate and/or build denoisers into the path tracer to get it to look good and perform better ;-). The "ray tracing" modes do have various bespoke denoisers in the different subsystems (GI, reflections, shadows, etc. all have denoisers built in).

My first thought was it didn't seem to / claim to have a mechanism for measuring power consumption linked to the benchmarks, which I feel should've been included as it relates to their stated concerns of a plethora of devices "all at different power envelopes and performance." Great, Vendor A's product can generate 20% more performance than Vendor B's product, but at the cost of 75% more power consumption. Is it worth it?
I'm not sure if I understand this correctly; we measure energy consumption throughout the benchmark run. Note that we don't do perf/watt scores, though: we calculate the energy score directly from energy consumption and don't factor it into our other scores.
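
For a rough idea of what "calculate the energy score directly from energy consumption" could look like in practice, here is a minimal, hypothetical sketch using NVIDIA's NVML energy counter via the pynvml bindings. The reference constant and scoring formula are purely illustrative, not Evolve's actual implementation:

```python
# Hypothetical sketch, not Evolve's code: measure the energy used during a
# benchmark run via NVML's cumulative energy counter (NVIDIA, Volta or newer),
# then turn it into a standalone "energy score" that isn't mixed into any
# performance score.
import pynvml

REFERENCE_JOULES = 10_000.0  # made-up reference point for scaling the score


def run_benchmark():
    """Stand-in for the actual GPU workload."""
    pass


pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# nvmlDeviceGetTotalEnergyConsumption returns millijoules since driver load.
start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
run_benchmark()
end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
pynvml.nvmlShutdown()

joules = (end_mj - start_mj) / 1000.0
energy_score = REFERENCE_JOULES / max(joules, 1e-6)  # less energy -> higher score
print(f"energy used: {joules:.1f} J, energy score: {energy_score:.2f}")
```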

My second thought was even more out there: how does it ensure a consistent quality output? Remember when drivers "cheated" in 3DMark? Radeon's radial aniso optimization was a big stinkfest, then I think NVIDIA at some point was inserting z-clip planes to bump a few more frames in certain tests. Then we got into the feces-throwing with 16 bit vs 24 bit vs 32 bit and later NVIDIA's FX series forcing certain FP32 shaders to use FP16. Is there going to be any mechanism to test the output to ensure shortcuts aren't being taken?

We have written up some rules on how we expect this benchmark to be used and optimized for: https://www.evolvebenchmark.com/technical-guide#chapter-3. However, the understanding is that these rules aren't, and won't be, perfect. The primary rule is "don't cheat", and we'll blocklist drivers that we catch cheating.

However, it's widely known that GPU drivers will perform some kind of "app-opt" for all games and apps out there: settings specific to an application that might change certain driver behavior under the hood. We're absolutely fine with those, as long as the benchmark keeps running as expected, since this is exactly what the "game optimized drivers" are also doing.
 
I'm not sure if I understand this correctly; we measure energy consumption throughout the benchmark run. Note that we don't do perf/watt scores, though: we calculate the energy score directly from energy consumption and don't factor it into our other scores.
Welcome Jasper! Thanks for taking the time to reply! If your suite is indeed taking power measurements during the benchmark run, then it's my fault for missing the detail. The only other benchmark I'm personally aware of which takes power consumption as a function of performance is the SpecRate series. Nicely done!

We have written up some rules on how we expect this benchmark to be used and optimized for: https://www.evolvebenchmark.com/technical-guide#chapter-3. However, the understanding is that these rules aren't, and won't be, perfect. The primary rule is "don't cheat", and we'll blocklist drivers that we catch cheating.

However, it's widely known that GPU drivers will perform some kind of "app-opt" for all games and apps out there: settings specific to an application that might change certain driver behavior under the hood. We're absolutely fine with those, as long as the benchmark keeps running as expected, since this is exactly what the "game optimized drivers" are also doing.
I generally agree with your take; driver optimizations will absolutely include a heaping serving of application optimizations. It's a very gray area yet somehow a fine line too; some people get VERY angry about optimizations that perhaps are invisible to the naked eye, yet influence "scores" greatly. Further to the point, how would you as the app owner realistically be able to detect drivers doing nefarious things? What even constitutes the difference between nefarious things and obvious optimization? I like that you've documented your expectations, so at least future conversations can revolve around a documented case of what should be happening versus whatever might be happening at the time of discourse.

Glad you're here, hopefully you stick around :)
 
Welcome Jasper! Thanks for taking the time to reply! If your suite is indeed taking power measurements during the benchmark run, then it's my fault for missing the detail. The only other benchmark I'm personally aware of which takes power consumption as a function of performance is the SpecRate series. Nicely done!


I generally agree with your take; driver optimizations will absolutely include a heaping serving of application optimizations. It's a very gray area yet somehow a fine line too; some people get VERY angry about optimizations that perhaps are invisible to the naked eye, yet influence "scores" greatly. Further to the point, how would you as the app owner realistically be able to detect drivers doing nefarious things? What even constitutes the difference between nefarious things and obvious optimization? I like that you've documented your expectations, so at least future conversations can revolve around a documented case of what should be happening versus whatever might be happening at the time of discourse.

Glad you're here, hopefully you stick around :)
The power stuff is what I've found most interesting so far: a laptop switching from 50W to 100W when plugged in, an MSI Claw sitting at ~15W whenever it's running, or a 4090 that swings between 200W and 380W and switches very rapidly within a frame, etc. I've attached a screenshot of an example. We then convert the power usage into an energy score as well.
[Attached screenshot: GPU power draw over a benchmark run]
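
For anyone who wants to see that kind of behaviour on their own card, a rough sketch of periodically sampling the software-reported power during a run might look like the following. This assumes NVIDIA's NVML via the pynvml bindings, and note that on recent GPUs this reading is averaged over roughly a second, so the fastest intra-frame swings won't be fully visible:

```python
# Rough illustrative sketch: sample software-reported GPU power while a
# workload runs, to observe swings like the ones described above.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
t_end = time.time() + 5.0  # sample for five seconds while something is rendering
while time.time() < t_end:
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # reported in mW
    samples.append(watts)
    time.sleep(0.05)

pynvml.nvmlShutdown()
print(f"min {min(samples):.0f} W, max {max(samples):.0f} W, "
      f"avg {sum(samples) / len(samples):.0f} W")
```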
Further to the point, how would you as the app owner realistically be able to detect drivers doing nefarious things? What even constitutes the difference between nefarious things and obvious optimization? I like that you've documented your expectations, so at least future conversations can revolve around a documented case of what should be happening versus whatever might be happening at the time of discourse.

Thanks, yeah, it's going to be an interesting ride to watch. Of course, historically benchmarks have been cheated on and that's not great, so we at least documented our expectations to make sure we're all on the same page.
Glad you're here, hopefully you stick around :)
This account is from 2014 still so we'll see, I've been relatively silent so far, but I think I'll be around ;-)
 
I know there are YouTube reviewers out there who get pretty deep into the technical weeds on how to properly measure power and then generate a "joules per frame" chart. The challenge for you is that you can only rely on what the hardware (and related software) tell you. One reason for bringing this up is that the Intel Arc B-series cards (IIRC; maybe I'm confusing it, apologies if so) misreported their power draw in software measurements, ostensibly because they pull more from the PCIe slot than from the PSU 6-pin/8-pin cable and apparently only report what's being fed by the PSU directly. Gamers Nexus is the YouTuber I'm thinking of specifically in this example; they basically had to break out a PCIe slot interposer to be able to discretely measure JUST the power consumption of the card itself, separately from the rest of the PCIe root port in the CPU.

Thus, an example of where a software configuration (or maybe worse, a hardware misconfiguration?) can negatively affect your ability to generate reliable data. And it's not even something you could control nor even attempt to detect...
 
I know there are YouTube reviewers out there who get pretty deep into the technical weeds on how to properly measure power and then generate a "joules per frame" chart. The challenge for you is that you can only rely on what the hardware (and related software) tell you. One reason for bringing this up is that the Intel Arc B-series cards (IIRC; maybe I'm confusing it, apologies if so) misreported their power draw in software measurements, ostensibly because they pull more from the PCIe slot than from the PSU 6-pin/8-pin cable and apparently only report what's being fed by the PSU directly. Gamers Nexus is the YouTuber I'm thinking of specifically in this example; they basically had to break out a PCIe slot interposer to be able to discretely measure JUST the power consumption of the card itself, separately from the rest of the PCIe root port in the CPU.

Thus, an example of where a software configuration (or maybe worse, a hardware misconfiguration?) can negatively affect your ability to generate reliable data. And it's not even something you could control nor even attempt to detect...

Agreed. Even ignoring Intel, AMD's and Nvidia's software telemetry, as far as I know, aren't directly comparable, and relying on software-reported readings versus actually measuring physically with a current clamp, power from the wall, etc. can be misleading. Love that there's a new benchmark out there though, and the focus on specific microbenchmarks in addition to just a single numerical 'score'.

 
Agreed. Even ignoring Intel, AMD's and Nvidia's software telemetry, as far as I know, aren't directly comparable, and relying on software-reported readings versus actually measuring physically with a current clamp, power from the wall, etc. can be misleading
Indeed, this is a problem; thankfully it's one that's gotten better over time, with both AMD's and Intel's measurements getting closer. We query this at a few points in the system, and the one closest to the silicon is the one we prefer.

Having said that: yes, there will be some obvious differences. I did recently get an ElmorLabs BENCHLAB (https://www.elmorlabs.com/product/benchlab/) that I have yet to try out (it will always mismatch what we report, because it measures the whole GPU instead of the chip, but maybe at some point we should add direct support for it - if only the todo list wasn't +inf long already).
 
Help me understand this point here:
...(it will always mismatch what we report, because it measures the whole GPU instead of the chip...
Is your tool attempting to only report the power of the discrete GPU silicon, rather than the power consumption of the board?

As a proponent of Folding @ Home, I've spent an unreasonable amount of time working to optimize my several PCs to deliver their best PPD per watt. For the NVIDIA cards I'm using, I've noticed their configurable power limit (nvidia-smi -i 0 -pl nnn ) seems to measure the consumption for (most of) the PCB. I can watch the reported power consumption move based on fan speeds and memory speeds. I've come to a point where I have managed to tweak out a combination of custom fan profile, slight memory underclock, and GPU undervolt + overclock along with a custom power limit to optimize maximum PPD for minimal power draw for each card in my house.

The reason I'm asking is: the GPU consumption, by itself and without considering the rest of the PCB, seems an arbitrary and not necessarily useful distinction. The GPU isn't capable of generating any useful work without the rest of the PCB it's connected to... Is this simply a limitation of how certain vendors report (or perhaps, do not report) power consumption at a full PCB level?
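
For reference, the same software-reported numbers that nvidia-smi works with can also be read programmatically; here's a small hypothetical sketch using the pynvml bindings (illustrative only, and these are the driver's board-level figures):

```python
# Sketch: read the enforced power limit (what `nvidia-smi -pl` acts on), the
# allowed limit range, and the current software-reported draw.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0   # mW -> W
lo_mw, hi_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0            # mW -> W

print(f"enforced limit: {limit_w:.0f} W "
      f"(allowed {lo_mw / 1000:.0f}-{hi_mw / 1000:.0f} W), "
      f"current draw: {draw_w:.0f} W")
pynvml.nvmlShutdown()
```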
 
Is your tool attempting to only report the power of the discrete GPU silicon, rather than the power consumption of the board?

I just checked the code - turns out I was lying! We actually do exactly what you're saying (for all the reasons you state as well). On AMD we're using https://gpuopen.com/manuals/adlx/adlx-_d_o_x__i_a_d_l_x_g_p_u_metrics__g_p_u_total_board_power/ and on Nvidia we're using https://docs.nvidia.com/deploy/nvml...iceQueries_1g7ef7dff0ff14238d08a19ad7fb23fc87. For Intel we were doing it wrong, however, so I've now created & submitted a fix to our measurement library.

Thanks for making me take a look!
 
What’s up with the fan speed reporting over 100%?

The Nvidia API reports it as over 100% and the API we're calling is documented to be able to do that: "The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed. This value may exceed 100% in certain cases".

We debated clamping it so it wouldn't be misleading; however, we felt it was better not to tamper with the API results and just pass them along as they come in. Having said that, we're in the process of removing the percentage in favor of just outputting RPM directly, since that's more sensible.
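
To make the quoted behaviour concrete, here's a tiny illustrative sketch of reading that percentage yourself, assuming NVIDIA's NVML via the pynvml bindings; this is the same "percentage of the product's maximum noise tolerance fan speed" value quoted above:

```python
# Sketch: query the NVML fan-speed percentage discussed above. Per NVIDIA's
# documentation it is relative to the maximum noise-tolerance fan speed and
# can legitimately exceed 100%.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

percent = pynvml.nvmlDeviceGetFanSpeed(handle)
print(f"fan speed: {percent}% of the max noise-tolerance speed")
if percent > 100:
    print("above 100% - allowed per the NVML documentation")

pynvml.nvmlShutdown()
```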
 
The Nvidia API reports it as over 100% and the API we're calling is documented to be able to do that: "The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed. This value may exceed 100% in certain cases".

We debated clamping it so it wouldn't be misleading; however, we felt it was better not to tamper with the API results and just pass them along as they come in. Having said that, we're in the process of removing the percentage in favor of just outputting RPM directly, since that's more sensible.
Having both is useful. Maybe you could show the RPM and then the percent next to it like (%) for context.
 
Is there a particular use case you have in mind for this?
If the API-reported % really is based on some threshold determined by the manufacturer (as your quote suggests), this is useful information to me. For instance, I'd like to know if my fan is running at or above what NVIDIA considers to be 100%.

Keep in mind I don't know who your target market is, so I can only speak for myself. People who use things like this all the time might only care about the RPM. So adding the RPM is a good move IMO.
 
Interesting benchmark and it's really nice/great/cool to have Mr. Bekkers explaining everything and answering questions, serious thanks and big love for your time and help!

Just one thing: no free demo version or anything for us cheap enthusiasts who really want to try it but have spouses/sig others who really don't like us "wasting money on computers"?
 