Performance Benchmark

From Hours to Seconds: How a $249 Jetson GPU Beat a MacBook Pro by 50,000×

A real-world Monte Carlo benchmark pitting NVIDIA's Jetson Orin Nano against a multi-threaded Java simulation on Apple Silicon

The Setup

Poker is a game of incomplete information and probability. To understand the true odds of any given starting hand, you have to simulate millions — ideally hundreds of millions — of random games and count the wins. This is Monte Carlo simulation: raw statistical brute force.

I built exactly this: a Texas Hold'em Monte Carlo engine that shuffles a deck, deals cards to 8 players, evaluates all 7-card hands, determines a winner, and tallies the result into a 52×52 statistics matrix. The matrix ultimately tells you the win probability for every possible starting-hand combination.
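The loop described above can be sketched in a few dozen lines of Java. This is an illustrative outline, not the actual engine: the class name, the deck layout, and the pluggable `scorer` hook are all hypothetical, the demo scorer is a placeholder rather than a real 7-card poker evaluator, and ties simply go to the earlier seat.

```java
import java.util.Random;
import java.util.function.ToIntFunction;

/** Illustrative sketch only: names and the scoring hook are hypothetical. */
final class HoldemTallySketch {
    static final int PLAYERS = 8;      // player count taken from the article
    static final int DECK_SIZE = 52;

    // Both matrices are indexed [lowerCard][higherCard] for a player's two hole cards.
    final long[][] appearances = new long[DECK_SIZE][DECK_SIZE];
    final long[][] wins = new long[DECK_SIZE][DECK_SIZE];

    /** Fisher-Yates shuffle, in place. */
    static void shuffle(int[] deck, Random rng) {
        for (int i = deck.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = deck[i];
            deck[i] = deck[j];
            deck[j] = tmp;
        }
    }

    /** Plays `hands` deals; `scorer` maps a 7-card hand to a strength (higher wins). */
    void simulate(long hands, Random rng, ToIntFunction<int[]> scorer) {
        int[] deck = new int[DECK_SIZE];
        for (int i = 0; i < DECK_SIZE; i++) deck[i] = i;
        int[] seven = new int[7];

        for (long h = 0; h < hands; h++) {
            shuffle(deck, rng);
            // Assumed layout: deck[0..15] are hole cards, deck[16..20] is the board.
            System.arraycopy(deck, 2 * PLAYERS, seven, 2, 5);
            int winner = 0;
            int bestScore = Integer.MIN_VALUE;
            for (int p = 0; p < PLAYERS; p++) {
                int a = deck[2 * p];
                int b = deck[2 * p + 1];
                appearances[Math.min(a, b)][Math.max(a, b)]++;
                seven[0] = a;
                seven[1] = b;
                int score = scorer.applyAsInt(seven);
                if (score > bestScore) {   // ties go to the earlier seat (simplification)
                    bestScore = score;
                    winner = p;
                }
            }
            int wa = deck[2 * winner];
            int wb = deck[2 * winner + 1];
            wins[Math.min(wa, wb)][Math.max(wa, wb)]++;
        }
    }

    /** Win rate of one starting-hand cell, or 0 if it never appeared. */
    double winRate(int lo, int hi) {
        long seen = appearances[lo][hi];
        return seen == 0 ? 0.0 : (double) wins[lo][hi] / seen;
    }

    public static void main(String[] args) {
        HoldemTallySketch sim = new HoldemTallySketch();
        // Placeholder scorer (NOT a poker evaluator): sum of the card indices.
        sim.simulate(10_000, new Random(42), cards -> {
            int sum = 0;
            for (int c : cards) sum += c;
            return sum;
        });
        System.out.printf("win rate of cell [0][13]: %.4f%n", sim.winRate(0, 13));
    }
}
```

Swapping the placeholder scorer for a real hand evaluator is the only change needed to turn the tallies into meaningful win rates; the shuffle, deal, and tally structure stays the same.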

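I implemented it twice: once as a multi-threaded Java program (11 threads) on a MacBook Pro, and once as a CUDA kernel on a Jetson Orin Nano.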

Same algorithm. Same math. Same statistical goal. Very different hardware. The results were staggering.


The Numbers

Version 1: MacBook Pro (Java, 11 threads)

Actual terminal output:

    All threads finished in 13,552,158 ms
    Hand appearances : 7,999,992
    Best hand [0][13]: 46.62% win rate

- Wall-clock time: 3.76 hours
- Throughput: 73.8 hands / second
- Hands simulated: 1,000,000

Version 2: Jetson Orin Nano (CUDA)

Actual terminal output:

    GPU time : 267,175 ms
    Throughput : 3.68M hands/sec
    TOTAL hands: 983,040,000
    Best slot : 41.77% win rate

- Wall-clock time: 4.5 min
- Throughput: 3,679,381 hands / second
- Hands simulated: 983,040,000

The Comparison

| Metric | MacBook Pro (Java) | Jetson Orin Nano (CUDA) |
|---|---|---|
| Hands simulated | 1,000,000 | 983,040,000 |
| Wall-clock time | 3.76 hours | 4.5 minutes |
| Throughput | 73.8 hands / sec | 3,679,381 hands / sec |

Speed advantage (throughput): ~50,000×

- 983× more hands completed by the GPU
- 51× less wall-clock time
- 154 days Java would need for the same workload
- 272 ms of GPU time to match Java's entire run

The GPU completed 983× more work in 51× less time. If the Java version had attempted the same 983 million hands, it would have taken 154 days of continuous computation. The GPU finished in 4.5 minutes.
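The headline ratios follow directly from the four raw figures in the terminal output. The short Java sketch below recomputes them; the class and method names are hypothetical, and only the four constants come from the runs themselves.

```java
/** Recomputes the comparison figures from the quoted terminal output. */
final class BenchmarkArithmetic {
    // Raw figures copied from the two terminal outputs.
    static final double JAVA_MS = 13_552_158.0;
    static final double JAVA_HANDS = 1_000_000.0;
    static final double GPU_MS = 267_175.0;
    static final double GPU_HANDS = 983_040_000.0;

    static double handsPerSecond(double hands, double ms) {
        return hands / (ms / 1000.0);
    }

    static double javaThroughput() { return handsPerSecond(JAVA_HANDS, JAVA_MS); }

    static double gpuThroughput() { return handsPerSecond(GPU_HANDS, GPU_MS); }

    /** Throughput ratio: the "~50,000x" headline number. */
    static double throughputRatio() { return gpuThroughput() / javaThroughput(); }

    /** How many times more hands the GPU run simulated. */
    static double workRatio() { return GPU_HANDS / JAVA_HANDS; }

    /** How many times longer the Java run took in wall-clock terms. */
    static double timeRatio() { return JAVA_MS / GPU_MS; }

    /** Days the Java version would need for the GPU's workload at its measured rate. */
    static double javaDaysForGpuWorkload() {
        return GPU_HANDS / javaThroughput() / 86_400.0;
    }

    /** Milliseconds the GPU would need for the Java workload at its measured rate. */
    static double gpuMsForJavaWorkload() {
        return JAVA_HANDS / gpuThroughput() * 1000.0;
    }

    public static void main(String[] args) {
        System.out.printf("Java throughput : %.1f hands/s%n", javaThroughput());
        System.out.printf("GPU throughput  : %.0f hands/s%n", gpuThroughput());
        System.out.printf("Throughput ratio: %.0f x%n", throughputRatio());
        System.out.printf("Work ratio      : %.2f x%n", workRatio());
        System.out.printf("Time ratio      : %.1f x%n", timeRatio());
        System.out.printf("Java days needed: %.1f%n", javaDaysForGpuWorkload());
        System.out.printf("GPU ms needed   : %.0f%n", gpuMsForJavaWorkload());
    }
}
```

Note that the ~50,000× figure is the product of the two smaller ones: about 983 times the work in about 1/51 of the time. The extrapolations to 154 days and 272 ms assume each version would sustain its measured throughput at the other workload size.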

Why Is the Gap So Large?

1. Parallelism: 11 threads vs 65,536 threads

The MacBook Pro ran 11 Java threads in parallel — one per logical CPU core. That is genuinely good use of a modern laptop.

The Jetson Orin Nano launched a grid of 65,536 CUDA threads (256 blocks of 256 threads), all running the same poker simulation code and scheduled by the hardware across 1,024 physical CUDA cores. Not every thread executes in the same clock cycle, but the GPU keeps its cores busy by switching between threads at essentially no cost. The GPU is a machine purpose-built for this kind of embarrassingly parallel work, where every iteration is completely independent of every other.
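The launch geometry is simple arithmetic, sketched here in Java to stay in one language. The class and method names are hypothetical. The index formula mirrors CUDA's usual `blockIdx.x * blockDim.x + threadIdx.x`, and the even split of hands across threads is an assumption: the article does not say how the work was divided.

```java
/** Launch-geometry arithmetic for the 256 x 256 grid described above. */
final class LaunchGeometry {
    static final int THREADS_PER_BLOCK = 256;
    static final int BLOCKS = 256;
    static final long TOTAL_HANDS = 983_040_000L; // from the GPU terminal output

    static int totalThreads() {
        return THREADS_PER_BLOCK * BLOCKS;
    }

    /** Java mirror of CUDA's blockIdx.x * blockDim.x + threadIdx.x. */
    static int globalThreadId(int blockIdx, int threadIdx) {
        return blockIdx * THREADS_PER_BLOCK + threadIdx;
    }

    /** Hands per thread, assuming the workload is split evenly (an assumption). */
    static long handsPerThread() {
        return TOTAL_HANDS / totalThreads();
    }

    public static void main(String[] args) {
        System.out.println("total threads   : " + totalThreads());
        System.out.println("last thread id  : " + globalThreadId(BLOCKS - 1, THREADS_PER_BLOCK - 1));
        System.out.println("hands per thread: " + handsPerThread());
    }
}
```

Under that even-split assumption the total divides exactly, at 15,000 hands per thread.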

2. Thread-level independence on the GPU

One unexpected bottleneck in the Java version is the card deck itself. The Java version's Card class uses a single shared static deck. When multiple threads need to shuffle, they must take turns, because only one thread can hold the Card.class lock at a time. Even with the lock narrowed as far as possible, all 11 threads still serialise through every shuffle.

On the GPU, each of the 65,536 threads has its own private cuRAND random number generator state and its own private deck array in local memory. There is no locking, no waiting, and no contention during the simulation loop: every thread shuffles and deals independently for the entire run.
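The same share-nothing idea can be expressed on the CPU. The sketch below is a hypothetical Java illustration, not the benchmarked code: each worker owns its RNG, its deck, and its counters, and results are merged only once at the end, so no lock is taken inside the loop. As a stand-in for the poker tally, it counts which card ends up on top after each shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Share-nothing workers: private RNG, private deck, private counters. */
final class PrivateDeckWorkers {

    /** Runs one worker; nothing in here is visible to any other thread. */
    static long[] runWorker(long seed, int shuffles) {
        SplittableRandom rng = new SplittableRandom(seed);
        int[] deck = new int[52];
        for (int i = 0; i < 52; i++) deck[i] = i;
        long[] topCardCounts = new long[52];

        for (int s = 0; s < shuffles; s++) {
            // Fisher-Yates shuffle on the worker's own deck: no lock needed.
            for (int i = 51; i > 0; i--) {
                int j = rng.nextInt(i + 1);
                int tmp = deck[i];
                deck[i] = deck[j];
                deck[j] = tmp;
            }
            topCardCounts[deck[0]]++;
        }
        return topCardCounts;
    }

    /** Starts the workers and merges their private counters after they finish. */
    static long[] runAll(int threads, int shufflesPerThread) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<long[]>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final long seed = 1000L + t; // a distinct seed per worker
                futures.add(pool.submit(() -> runWorker(seed, shufflesPerThread)));
            }
            long[] total = new long[52];
            for (Future<long[]> f : futures) {
                long[] part = f.get();
                for (int c = 0; c < 52; c++) total[c] += part[c];
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] counts = runAll(11, 10_000);
        long sum = 0;
        for (long c : counts) sum += c;
        System.out.println("shuffles tallied: " + sum); // prints "shuffles tallied: 110000"
    }
}
```

The design choice is the same one the CUDA version makes: pay a little memory per thread for private state, and synchronise only at the final merge.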

3. Integer arithmetic vs string operations

The Java implementation represents cards as strings — "ClubA", "DiamondK" — and must parse them back to values on every hand check. String operations involve memory allocation, bounds checking, character scanning, and garbage collection pressure.

The CUDA version encodes every card as a single integer (0–51) and computes suit as card / 13 and value as card % 13 + 1, which costs one integer division and one remainder. The entire 7-card hand fits in a handful of registers. There is no heap allocation, no parsing, and no GC, just arithmetic.
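The integer encoding works just as well in Java. The sketch below uses the card / 13 and card % 13 + 1 formulas from the text. The class name is hypothetical; Club = 0 and Diamond = 1 follow from the Java output's [0][13] being ClubA + DiamondA, while the order of the remaining two suits and the "10" label are assumptions.

```java
/** Integer card encoding: one int in 0..51 per card, decoded with / and %. */
final class CardCodec {
    // Club=0 and Diamond=1 match the article's [0][13] = ClubA + DiamondA.
    // The order of the last two suits is an assumption.
    private static final String[] SUITS = {"Club", "Diamond", "Heart", "Spade"};
    private static final String[] VALUES =
        {"A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"};

    /** Suit index 0..3. */
    static int suit(int card) {
        return card / 13;
    }

    /** Card value 1..13, where 1 is the ace. */
    static int value(int card) {
        return card % 13 + 1;
    }

    /** Inverse of suit() and value(). */
    static int encode(int suit, int value) {
        return suit * 13 + (value - 1);
    }

    /** Builds the string form only when a human needs to read it. */
    static String name(int card) {
        return SUITS[suit(card)] + VALUES[value(card) - 1];
    }

    public static void main(String[] args) {
        System.out.println(name(0));   // prints "ClubA"
        System.out.println(name(13));  // prints "DiamondA"
        System.out.println(name(25));  // prints "DiamondK"
    }
}
```

Strings are produced only at the reporting edge; the simulation loop itself never needs to touch them.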

4. The Jetson's role: a purpose-built edge AI device

The Jetson Orin Nano is not a gaming GPU. It is a $249 embedded AI accelerator designed to run neural network inference and parallel workloads at the edge — on robots, cameras, and autonomous systems — without a data center behind it. It consumes roughly 7–15 watts under load.

The MacBook Pro it was compared against costs ten times as much and draws considerably more power during a sustained compute workload. And yet the Jetson won, decisively, for this class of problem. This is what edge computing is becoming: not a compromise, but a superpower for the right workload.


Statistical Consistency: The Results Agree

Despite the 983× difference in sample size, both versions arrived at the same statistical conclusion. Both identified pocket aces (the ace of one suit paired with the ace of another) as the highest win-rate starting hand:

| Version | Samples | Best hand | Win rate |
|---|---|---|---|
| Java (MacBook Pro) | 1M hands | ClubA + DiamondA (pocket aces) | 46.62% |
| GPU (Jetson) | 983M hands | SpadeA + DiamondA (pocket aces) | 41.77% |

The GPU's lower win percentage for the same hand type is what a larger sample would be expected to produce. With nearly a billion samples, the Monte Carlo estimate converges toward the true probability. At 1 million samples each cell of the matrix is still noisy, and reporting the best cell out of a noisy matrix favours whichever pocket-aces combination happened to have a lucky run. This is the law of large numbers at work.
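To put rough numbers on that noise, the standard error of an estimated win rate p from n independent samples is sqrt(p(1 − p) / n). The Java sketch below applies it per matrix cell under several assumptions that the article does not confirm: both runs used 8 players, appearances are spread evenly over the 1,326 unordered two-card starting hands, and samples within a cell can be treated as independent. All names are hypothetical.

```java
/** Rough standard-error estimate for a single cell of the win-rate matrix. */
final class WinRateError {
    // Number of unordered two-card starting hands: 52 * 51 / 2 = 1,326.
    static final int STARTING_HANDS = 52 * 51 / 2;

    /** Binomial standard error of a proportion p estimated from n samples. */
    static double standardError(double p, double n) {
        return Math.sqrt(p * (1.0 - p) / n);
    }

    /** Expected appearances per cell if deals are spread evenly (an assumption). */
    static double appearancesPerCell(double hands, int players) {
        return hands * players / STARTING_HANDS;
    }

    public static void main(String[] args) {
        double javaN = appearancesPerCell(1_000_000.0, 8);
        double gpuN = appearancesPerCell(983_040_000.0, 8);
        double javaSe = standardError(0.4662, javaN);
        double gpuSe = standardError(0.4177, gpuN);
        System.out.printf("Java: ~%.0f samples per cell, SE ~%.2f points%n", javaN, 100 * javaSe);
        System.out.printf("GPU : ~%.0f samples per cell, SE ~%.3f points%n", gpuN, 100 * gpuSe);
    }
}
```

Under these assumptions the Java cell has a standard error of roughly 0.6 percentage points, against roughly 0.02 for the GPU cell, so the 1M-hand estimate is about 30 times noisier. The gap between the two reported rates is several standard errors wide, so choosing the maximum cell out of a noisy matrix probably plays a part as well.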

What This Means for Your Projects

Monte Carlo simulation is not just a poker trick. The same embarrassingly parallel pattern appears across science and industry.

Any domain where you need to run the same independent computation millions of times is a candidate for this kind of GPU acceleration. And increasingly, that GPU does not need to live in a rack in a data center. It can live at the edge, embedded in the device doing the work, consuming on the order of ten watts.


~50,000×

A $249 Jetson Orin Nano running a CUDA kernel outran an 11-thread Java program on a premium laptop by a factor of roughly 50,000 on a real Monte Carlo workload.

That number deserves to sit with you for a moment. It is not a benchmark artifact or a cherry-picked microbenchmark. It is the direct result of architectural choices: massive parallelism (65,536 threads vs 11), zero inter-thread contention, pure integer arithmetic, and hardware built from the ground up to do many simple things at the same time.

If your application involves any kind of parallel simulation, search, or sampling loop — and if speed or energy efficiency matters — the question is no longer whether GPU acceleration helps. The question is why you haven't already reached for it.