The NVIDIA GPU release cycle is about two years. The RTX 2080 Ti was announced in August 2018 with 4,352 CUDA cores, about 1.2 times as many as the previous-generation GTX 1080 Ti, and it delivered exactly twice the performance for password recovery.
When NVIDIA announced the new 3000 series of GPUs in September 2020, we were blown away: the RTX 3080 had 8,704 CUDA cores, twice as many as the RTX 2080 Ti, at a lower price point.
We were eager to run Passware Kit Forensic benchmarks to find out whether the new card would deliver a 2x gain in performance over the 2000 series. However, the benchmark results showed a mere 6% increase in performance over the 2080 Ti, and we initially thought that there must be a software glitch.
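To put the mismatch in numbers, here is a minimal Python sketch that compares the growth in CUDA core count with the password recovery speedups reported above. The 2080 Ti and 3080 core counts and both speedups come from this article; the GTX 1080 Ti figure of 3,584 cores is an assumption taken from NVIDIA's public spec sheet, and the function names are purely illustrative.

```python
# CUDA core counts. The 2080 Ti and 3080 figures are quoted in the article;
# the 1080 Ti figure (3,584) is an assumption from NVIDIA's public spec sheet.
CUDA_CORES = {"GTX 1080 Ti": 3584, "RTX 2080 Ti": 4352, "RTX 3080": 8704}

# Password recovery speedups reported in the article (new card vs. old card).
REPORTED_SPEEDUP = {
    ("GTX 1080 Ti", "RTX 2080 Ti"): 2.00,  # "exactly twice the performance"
    ("RTX 2080 Ti", "RTX 3080"): 1.06,     # "a mere 6% increase"
}


def core_ratio(old, new):
    """How many times more CUDA cores the new card has."""
    return CUDA_CORES[new] / CUDA_CORES[old]


def speedup_per_core_ratio(old, new):
    """Reported speedup divided by the growth in core count."""
    return REPORTED_SPEEDUP[(old, new)] / core_ratio(old, new)


for old, new in REPORTED_SPEEDUP:
    print(f"{old} -> {new}: "
          f"{core_ratio(old, new):.2f}x cores, "
          f"{REPORTED_SPEEDUP[(old, new)]:.2f}x speed, "
          f"ratio {speedup_per_core_ratio(old, new):.2f}")
```

A ratio above 1 means the speedup outpaced the core count, as it did going from the 1080 Ti to the 2080 Ti. A ratio near 0.5 means the advertised cores doubled while the speed barely moved.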
Unfortunately, a software issue was not the cause.
Why is the performance increase marginal if the new GPU has double the number of CUDA cores?
Let’s take a deep dive into the NVIDIA GPU architecture. An NVIDIA GPU consists of several GPCs (Graphics Processing Clusters), each with multiple SMs (Streaming Multiprocessors).
Let’s take the RTX 3080 as an example. According to NVIDIA specs, this GPU has 68 SMs, the same number as the 2080 Ti. So why has the number of CUDA cores in the spec sheet doubled?
The reason is that the number of FP32 ALUs (arithmetic-logic units) in each SM was doubled. However, password recovery applications use INT32 units exclusively, and the number of INT32 ALUs in the RTX 3080 is unchanged from the RTX 2080 Ti.
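The arithmetic behind the spec sheet can be sketched in a few lines of Python. The SM count and the CUDA core totals are from this article. The per-SM unit counts are assumptions: the FP32 figures are obtained by dividing the advertised core count by 68, and the INT32 figure of 64 per SM follows NVIDIA's published description of the Turing SM together with the statement above that it did not change.

```python
SM_COUNT = 68  # both the RTX 2080 Ti and the RTX 3080, per the article

# Units per SM (assumed: FP32 = advertised CUDA cores / 68, INT32 = 64 on both).
PER_SM = {
    "RTX 2080 Ti": {"fp32": 64, "int32": 64},
    "RTX 3080": {"fp32": 128, "int32": 64},
}


def totals(gpu):
    """Total FP32 and INT32 units across all SMs of the given GPU."""
    return {unit: SM_COUNT * count for unit, count in PER_SM[gpu].items()}


for gpu in PER_SM:
    t = totals(gpu)
    print(f"{gpu}: {t['fp32']} FP32 units (advertised as CUDA cores), "
          f"{t['int32']} INT32 units")
```

Under these assumptions the advertised CUDA core count tracks the FP32 total only, so it doubles while the INT32 total stays at 4,352 on both cards.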
Streaming Multiprocessors: RTX 2080 Ti and RTX 3080
Here is a technical explanation from Tony Tamasi of NVIDIA, published on Reddit:
“One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.”
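The datapath layout in the quote can be turned into a rough, idealized throughput model. The sketch below is an assumption-laden illustration, not a simulator: it ignores clock speed, memory traffic, and instruction scheduling, and it assumes that a Turing SM partition has 16 FP32 lanes and 16 INT32 lanes that can work side by side, as NVIDIA describes for that architecture. The function names are hypothetical.

```python
def turing_clocks(fp_ops, int_ops):
    """Clocks for one Turing SM partition: 16 FP32 lanes and 16 INT32 lanes
    running concurrently, so the busier lane type sets the time."""
    return max(fp_ops / 16, int_ops / 16)


def ampere_clocks(fp_ops, int_ops):
    """Clocks for one Ampere SM partition: an FP32-only datapath (16 lanes)
    plus a shared datapath (16 lanes) that runs FP32 or INT32 each clock."""
    # INT32 work fits only on the shared datapath (16 lanes);
    # all work combined can use at most 32 lanes per clock.
    return max(int_ops / 16, (fp_ops + int_ops) / 32)


def ampere_speedup(int_fraction, total_ops=1_000_000):
    """Idealized Ampere-over-Turing speedup for a given INT32 share of the work."""
    int_ops = total_ops * int_fraction
    fp_ops = total_ops - int_ops
    return turing_clocks(fp_ops, int_ops) / ampere_clocks(fp_ops, int_ops)


for share in (0.0, 0.25, 0.5, 1.0):
    print(f"INT32 share {share:.0%}: {ampere_speedup(share):.2f}x")
```

In this model a pure FP32 workload gets the advertised 2x, while an all-INT32 workload, which is how password recovery is characterized above, gets no per-clock gain at all.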
The number of transistors in the GPU has increased from 18.6 billion in the 2080 Ti to 28.3 billion in the 3080, and power consumption has increased with it. Our measurements show that under full load, the 3080 consumes 25% more power, and that would be over 1 kW for a single 12-GPU high-density Decryptum system.
All in all, the RTX 3080 is a solid performer, but there is no significant gain in performance for password recovery. Moreover, the additional power and cooling requirements make this GPU less than ideal for high-density, high-performance password recovery solutions.
We’re waiting for the new AMD GPUs to become available to see if they might be winning the performance race now.