Choosing the Right GPU for Local AI Inference
Compare VRAM, memory bandwidth, thermals, and power before buying a GPU for private LLMs.

The right GPU is the one that fits your models, case, power budget, and noise tolerance. For local AI, VRAM is usually the first limit you hit, but memory bandwidth and cooling can matter just as much once you start loading larger models or serving multiple requests.
Start with the workload
If you mainly want chat and summarisation, a mid-range card can be enough. If you want larger context windows, image models, or more than one user at a time, you will run out of VRAM and throughput sooner.
Read "Best Hardware for Self-Hosted AI" for the wider system picture before you lock in a card.
VRAM comes first
More VRAM means fewer compromises. It lets you run larger models, hold more context, and keep the whole model on the GPU instead of spilling into much slower system memory. Capacity often matters more than having the newest architecture.
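As a rough sizing aid, the capacity argument can be turned into arithmetic: weights take about parameter count times bits per weight, and the key-value cache grows with context length and the number of concurrent sequences. The sketch below assumes a dense transformer. The function names, the 10% overhead allowance, and the example model shape (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, not measurements of any specific model or runtime.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def weight_bytes(params_billion, bits_per_weight):
    """Approximate size of the model weights in bytes."""
    return params_billion * 1e9 * bits_per_weight / 8


def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens,
                   bytes_per_value=2, sequences=1):
    """Approximate KV cache size in bytes.

    One key vector and one value vector (hence the factor 2) are stored
    per layer, per KV head, per token, per concurrent sequence.
    """
    return (2 * layers * kv_heads * head_dim
            * context_tokens * bytes_per_value * sequences)


def estimate_vram_gib(params_billion, bits_per_weight, layers, kv_heads,
                      head_dim, context_tokens, sequences=1,
                      overhead_fraction=0.10):
    """Weights plus KV cache, padded by an assumed runtime overhead."""
    total = (weight_bytes(params_billion, bits_per_weight)
             + kv_cache_bytes(layers, kv_heads, head_dim,
                              context_tokens, sequences=sequences))
    return total * (1 + overhead_fraction) / GIB


# Hypothetical 7B-parameter model, 4-bit weights, 4096-token context.
estimate = estimate_vram_gib(7, 4, layers=32, kv_heads=32,
                             head_dim=128, context_tokens=4096)
print(f"Estimated VRAM: {estimate:.1f} GiB")
```

Doubling the context length or the number of concurrent sequences doubles the KV cache term, which is why long contexts and multiple users eat into headroom so quickly.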
Bandwidth still affects responsiveness
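Two cards with the same VRAM can behave very differently if one has much higher memory bandwidth. During generation, the GPU has to read the model weights from VRAM for every new token, so bandwidth sets a ceiling on tokens per second. Once you move past small models, it can be the difference between smooth token generation and frustrating lag.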
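A common back-of-the-envelope check is to divide memory bandwidth by model size, on the assumption that each generated token requires reading all weights once. The sketch below applies that rule to two hypothetical cards. The function name and all figures are made up for illustration, and the result is an upper bound for single-stream generation: it ignores compute limits, KV cache reads, and batching, so real throughput will be lower.

```python
def max_decode_tokens_per_second(bandwidth_gb_per_s, model_size_gb):
    """Rough upper bound on tokens/s for one stream.

    Assumes every generated token reads the full set of weights
    from VRAM exactly once and nothing else limits throughput.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_per_s / model_size_gb


# Two hypothetical cards with the same VRAM but different bandwidth,
# both running a hypothetical model that occupies 8 GB.
for name, bandwidth in (("card A", 400), ("card B", 900)):
    ceiling = max_decode_tokens_per_second(bandwidth, model_size_gb=8)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

The same model on the same amount of VRAM gets a very different ceiling, which is the gap the paragraph above describes.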
Power and thermals are buying criteria
Big GPUs need more than a strong PSU. They need airflow, physical clearance, and a case that does not recirculate hot air. If the machine lives near people, efficiency is a feature, not a luxury.
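For the power side, a simple headroom calculation helps when comparing PSUs. This is a minimal sketch: the function name, the component wattages, and the 30% default headroom are assumptions chosen for illustration, not vendor guidance, so check the GPU manufacturer's PSU recommendation as well.

```python
def recommended_psu_watts(gpu_watts, cpu_watts, other_watts,
                          headroom_fraction=0.30):
    """Sum the expected sustained draw and add an assumed safety margin."""
    sustained = gpu_watts + cpu_watts + other_watts
    return sustained * (1 + headroom_fraction)


# Hypothetical build: 320 W GPU, 125 W CPU, 75 W for drives, fans, and board.
target = recommended_psu_watts(320, 125, 75)
print(f"Look for a PSU rated around {target:.0f} W or higher")
```

Every watt in this sum also ends up as heat inside the case, so the same figure is a useful input when judging airflow and noise.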
Pair your build with "Proxmox Setup for AI Workloads" if you want to isolate the GPU inside a VM or service layer.
Buying used versus new
Used cards can be excellent value, especially when you prioritise VRAM over prestige. Just budget for fan wear, warranty risk, and higher idle draw on older generations.
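One way to compare a used and a new card is price per gigabyte of VRAM plus the cost of idle power over the time you expect to own it. The sketch below uses hypothetical prices, idle wattages, and electricity rates; the function names are illustrative, and fan wear and warranty risk are not modelled.

```python
def price_per_gb_vram(price, vram_gb):
    """Purchase price divided by VRAM capacity."""
    return price / vram_gb


def idle_energy_cost(idle_watts, hours_per_day, days, price_per_kwh):
    """Cost of the energy a card draws while idle over a given period."""
    kwh = idle_watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


# Hypothetical cards: (label, price, VRAM in GB, idle draw in watts).
cards = (("used, older generation", 450, 24, 30),
         ("new, current generation", 800, 16, 12))

for label, price, vram_gb, idle_watts in cards:
    per_gb = price_per_gb_vram(price, vram_gb)
    # Assume an always-on machine for two years at 0.30 per kWh.
    idle_cost = idle_energy_cost(idle_watts, 24, 730, 0.30)
    print(f"{label}: {per_gb:.2f} per GB VRAM, "
          f"{idle_cost:.2f} idle energy over two years")
```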
Conclusion
Choose a GPU around the model size and usage pattern you actually expect. A sensible card with enough VRAM and decent cooling will outperform an overambitious buy that is awkward to power, cool, or fit into your case.
