Value of Hardware Unboxed benchmarking

Probably, but not any of the big players in PC space. One is close but not quite the same.
Point being, every manufacturer has a different name for them; if you want to use a common name, it should be what they are, matrix units or matrix multiplication units, not some made up crap like "AI cores".
"Matrix units" is about as "made up crap" as "AI cores" which is my point.
 
Probably, but not any of the big players in PC space. One is close but not quite the same.
Point being, every manufacturer has a different name for them; if you want to use a common name, it should be what they are, matrix units or matrix multiplication units, not some made up crap like "AI cores".

How is it any different to the regular old ALUs which are essentially the same but have had different marketing names forever? CUDA cores, Vector engines, SIMD lanes etc.
 
Which is what exactly? You can do matrix multiplications on a CPU.
Yes, you can, just like with many other processing units, including the usual shader units. These, however, are purpose-built to do just that, and to do it fast.
How is it any different to the regular old ALUs which are essentially the same but have had different marketing names forever? CUDA cores, Vector engines, SIMD lanes etc.
And they're often referred to by the common term shader units: units built to run shader programs.
 
Yes, you can, just like with many other processing units, including the usual shader units. These, however, are purpose-built to do just that, and to do it fast.
RDNA3 "usual shader units" are "purpose built" to do matrix multiplication "fast" too. There isn't really anything which sets these units aside from your usual ALUs aside from the fact that they can do some specific math ops faster. Nvidia has been running FP16 shader math on tensor cores since Turing IIRC.
 
And they're often referred to by the common term shader units: units built to run shader programs.
Shaders can run on CPUs too; what these units truly are is mixed-precision floating point (FP) units, with precisions of FP64/32/16, plus INT8.

Tensor cores are mixed-precision FMA systolic arrays with precisions of FP64/32/16/8/4 and INT8/4/1; sometimes they are referred to as GEMM (general matrix multiply) accelerators.
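For context, GEMM is just the generalized matrix multiply-accumulate D = A·B + C. Here's a minimal scalar sketch of the operation a tensor core performs in hardware, using plain float everywhere (the real units typically take FP16/INT8 inputs and accumulate at a wider precision):

```cpp
#include <cstddef>

// Naive GEMM: D = A * B + C for row-major N x N matrices.
// A tensor core runs this same multiply-accumulate pattern, but on a small
// fixed-size tile (e.g. 16x16) as one hardware operation, usually with
// low-precision inputs feeding a wider accumulator.
void gemm(const float* A, const float* B, const float* C, float* D, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = C[i * N + j];                    // start from the accumulator input
            for (std::size_t k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * N + j];      // chained multiply-adds
            D[i * N + j] = acc;
        }
    }
}
```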
 
Well, to accurately define what each unit is, we need to peel away decades of marketing crap.

A core is nothing more than a load/store unit (LDU) and an arithmetic logic unit (ALU) consisting of an adder and a multiplier. That ALU is integer-only and is also called an execution unit; this is the simplest form of a core. That form of CPU core should be capable of doing one instruction per clock, but in practice it does less than that, since it stalls often and outputs anywhere from 0.1 to 0.9 instructions per clock (which is why it's sub-scalar); we can call it barely scalar in its best case.

The core later evolved to have an additional address generation unit (AGU) to handle complex memory configurations, and a separate ALU for floating point operations, which they called the FPU. That distinction is very important: they could have called this a 2-core CPU, as it had two execution units (one integer and one floating point), but they chose not to, since the core still had one LDU and one AGU, took data from a single thread, and shared registers, caches, and control logic between the units.

They later evolved the core with pipelining (splitting instruction execution into stages so several instructions can be in flight at once), to achieve guaranteed scalar operation (meaning the core can output one instruction per clock) and potentially super-scalar operation (more than one instruction per clock).
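To make the sub-scalar/scalar/super-scalar terminology concrete, a toy calculation (the instruction and cycle counts below are made-up round numbers, not measurements of any real CPU):

```cpp
#include <cstdio>

// IPC = instructions retired / clock cycles taken.
static double ipc(double instructions, double cycles) { return instructions / cycles; }

int main() {
    std::printf("sub-scalar:   %.1f IPC\n", ipc(1000, 5000)); // unpipelined: ~5 cycles per instruction
    std::printf("scalar:       %.1f IPC\n", ipc(1000, 1000)); // ideal pipeline: one retires every clock
    std::printf("super-scalar: %.1f IPC\n", ipc(1000,  400)); // multiple execution units in flight
}
```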

Later they multiplied the execution units inside the core (e.g. 4 ALUs, 2 FPUs, 4 LDUs, 2 AGUs) and pushed the super-scalar output to always be more than 1. They refused to call this kind of core multi-core, since it still shared caches, registers, and control logic between units, still only handled instructions from one thread, and its output still hovered around 1 instruction per clock (meaning 1.2 or 1.5, etc.).

During all of this, a new form of computing was invented, called packed math, where the core loads several pieces of data of the same type and fuses them into a single wide instruction, executing it across its multiple execution units. It is effectively multiple data in a single instruction, or Single Instruction Multiple Data (SIMD). For example, instead of loading each pixel of an image serially, 4 pixels are loaded as one wide operand and processed by one instruction, speeding up execution considerably. However, SIMD is still not multi-core, because it operates within the confines of a super-scalar core and is basically a memory trick.

Later they dedicated separate execution units (ALUs and FPUs) to SIMD, but they still did not call this multi-core, because it still operated within the confines of a single thread and shared resources and caches between the SIMD execution units and the regular execution units. Examples of SIMD inside the CPU core include MMX, SSE, and AVX.
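A minimal sketch of that packed-math idea with SSE intrinsics: four floats (say, four pixel values) are loaded, added, and stored by one instruction each instead of four:

```cpp
#include <immintrin.h>

// Brighten four pixel values with a single packed add.
// The equivalent scalar code would issue four separate additions.
void brighten4(float* pixels /* at least 4 floats */, float amount) {
    __m128 px  = _mm_loadu_ps(pixels);   // load 4 floats into one 128-bit register
    __m128 add = _mm_set1_ps(amount);    // broadcast the constant into all 4 lanes
    __m128 out = _mm_add_ps(px, add);    // one instruction, four additions
    _mm_storeu_ps(pixels, out);          // store all 4 results back
}
```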

Later they added the ability to load two or more threads onto the core: if one thread stalls, another thread takes its place. This is the Simultaneous Multi-Threading (SMT), or Hyper-Threading (HT), approach, and it also maximizes the super-scalar operation of the core. They still refused to call it multi-core, because it's conditional (it relies on stalls), still shares the caches, and its final output also hovered around 1 instruction per clock (1.7, 1.9, etc.).
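You can see the "extra threads are not extra cores" point on any SMT machine: the standard library only reports logical processors, i.e. physical cores multiplied by SMT threads. A minimal sketch:

```cpp
#include <cstdio>
#include <thread>

int main() {
    // On an 8-core/16-thread SMT CPU this prints 16: each SMT thread shows up
    // as a logical processor, but there are still only 8 physical cores underneath.
    std::printf("logical processors: %u\n", std::thread::hardware_concurrency());
}
```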

So pipelining, multiple execution units, SIMD, and SMT all exist to boost the instructions per clock of a single core, which now stands at 2 to 4 instructions per clock per core in modern CPUs. That is how strict the definition of a core is in the CPU world. And even though we've deviated far from the original core definition, we refuse to call any of those deviations multi-core, because they take data from a single thread and share resources (caches, registers, memory bus, control logic) with other units.

In the harsh world of CPU core definitions, a true multi-core is multiple super-scalar cores, each operating on a different thread, with a giant cache shared between all of them.

AMD tried to change the definition of a CPU core with Bulldozer, by counting each integer ALU block as a true core while sharing FPU blocks between "these cores", but they backtracked from this.

A GPU core is "supposedly" the same as a CPU core; the difference is that a GPU core is a single execution unit, specifically a fused multiply-add (FMA) unit, that handles mixed precision (integer and floating point). A GPU has thousands of these units, each capable of operating on data from a different thread, which is why GPU vendors count each execution unit as a core: on the basis that they work on different threads.
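The basic operation such a unit performs is a single fused multiply-add, a*b + c computed in one step with one rounding; in standard C++ that is just std::fma:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    float a = 2.0f, b = 3.0f, c = 1.0f;
    // One fused multiply-add: a * b + c with a single rounding.
    // A GPU "core" in the marketing sense is essentially one unit that can
    // issue this operation every clock.
    float r = std::fma(a, b, c);
    std::printf("%f\n", r);  // 7.0
}
```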

However, I am not convinced by this. The GPU has many types of "cores": texture, raster, shader, tensor, ray tracing, etc. Should we consider them all equal cores and include all of them in the final count? Furthermore, most of these "cores" share resources with other cores, including register files, caches, and control logic.

If we apply the same strict CPU definitions, then the SM or CU (a large constellation of FMA execution units) is the true GPU core: it has a fixed number of shader, texture, raster, tensor, ray tracing, etc. units, all sharing the same register files, caches, and control logic.

Intel gets this: they don't call their small FMA shader units cores at all, they call them execution units, and they call their larger grouping an Xe Core, while NVIDIA calls it a Streaming Multiprocessor (SM) and AMD calls it a Compute Unit (CU). Each SM/CU/Xe core has multiple execution (FMA) units operating under the principle of SIMD (or SIMT): single instruction, multiple data/threads. Each SM/CU/Xe core also employs different types of computation units: tensor units are just mixed-precision FMA systolic arrays, i.e. GEMM (general matrix multiply) accelerators; texture units are samplers with address generators; ray tracing units are MIMD units; etc.

So a 4090 doesn't really have 16,000 cores; it has 128 SMs/cores. The Arc A770 has 32 Xe cores, the 3060 has 28 SMs/cores, etc. If you look at the times of old, the GeForce 256 (the first GPU) didn't have cores, only pixel "shaders"; it was only after the advent of CUDA and the merging of vertex and pixel shaders into one unified FMA unit that NVIDIA started calling these units "cores".
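If you want the SM-level number rather than the marketing one, the CUDA runtime reports it directly. A minimal sketch, assuming the CUDA toolkit is installed; the advertised "CUDA core" count is just multiProcessorCount times the FP32 ALUs per SM (128 on Ada):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    // An RTX 4090 reports 128 here; the "16384 CUDA cores" figure is
    // simply 128 SMs x 128 FP32 ALUs per SM.
    std::printf("%s: %d SMs\n", prop.name, prop.multiProcessorCount);
}
```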
 
This seems even worse than low, mid and high end:

Entry-Level Tasks (Around 896 CUDA cores)
Gaming (3,584 – 4,864+ CUDA cores)
Professional Graphics and Visualization (9,728+ CUDA cores)
Machine Learning and AI (10,572 – 16,384+ CUDA cores)
Scientific Simulations (16,384+ CUDA cores)
 
That sounds even more complicated than the long-used standard of low-, mid- and high-end.
Unfortunately the definition of low, mid, and high end is a moving target. I bought a GTX970 for like $350 (~$470 now) and it was the 2nd fastest GPU in the world at the time (980Ti/Titan X didn't exist yet). 470 bucks won't even get you a 4070 which is far from high end these days in terms of both price and performance. The 4070 sits way lower in the product stack than the 970 did. Simply using price (preferably adjusted for inflation) is an easy way to clear up any confusion.
 
Final Fantasy 7 Rebirth will be released in January. And it only works with GPUs supporting "DX12 ultimate".

Another game playable on Turing and not on the 5700 series. Another win for HBU...
 
This one is a bit weird though, as it's UE5, which doesn't explain why it would require DX12U features. It's possible that the requirement isn't actually firm.
 
This one is a bit weird though, as it's UE5, which doesn't explain why it would require DX12U features. It's possible that the requirement isn't actually firm.
It's UE4 though, and it came with bad indirect lighting on PS5; they supposedly enhanced the lighting for the PC version, which could be why they demand DX12U.

On the Steam page they say the game demands GPUs with Shader Model 6.6, which could be another reason.
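For what it's worth, that requirement maps to a single runtime feature query. A minimal Windows-only sketch checking Shader Model 6.6 on the default adapter (needs a Windows SDK recent enough to define D3D_SHADER_MODEL_6_6; error handling kept minimal):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdio>
#pragma comment(lib, "d3d12.lib")

int main() {
    Microsoft::WRL::ComPtr<ID3D12Device> device;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_12_0,
                                 IID_PPV_ARGS(&device)))) {
        std::printf("no D3D12 device\n");
        return 1;
    }
    // Ask for the highest model we care about; on success the runtime clamps
    // HighestShaderModel to what the driver/GPU supports (runtimes that don't
    // know SM 6.6 fail the call instead).
    D3D12_FEATURE_DATA_SHADER_MODEL sm{ D3D_SHADER_MODEL_6_6 };
    bool ok = SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_SHADER_MODEL,
                                                    &sm, sizeof(sm))) &&
              sm.HighestShaderModel >= D3D_SHADER_MODEL_6_6;
    std::printf("Shader Model 6.6 %s\n", ok ? "supported" : "NOT supported");
}
```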
 
This one is a bit weird though, as it's UE5, which doesn't explain why it would require DX12U features. It's possible that the requirement isn't actually firm.
It’s UE4. This is likely just a bad decision. Probably similar to when games arbitrarily required AVX on the CPU despite not using any of the instructions.
 