Berlin Buzzwords 2025

Which GPU for Local LLMs?
2025-06-16, Kesselhaus

You’re using local LLMs, for example to power RAG. You want to deploy them in production, but you’re not sure on what hardware: which type of GPU? How much memory does it need? Should you use a larger model but quantize it more aggressively?

Our benchmark results and their interpretation will give you some answers.


It’s easy to offload the LLM - in solutions such as RAG - to external services like OpenAI. This is great for PoCs, but if you have a lot of requests, a local LLM makes more sense from both a cost and a latency point of view, especially in the context of RAG, where the retrieval query itself adds latency and the context that has to be sent to the LLM can be significant.

For this session, we’ll use llama.cpp - which supports inference on many models across many platforms - and benchmark some LLMs on various GPUs. When presenting results, we’ll focus on cost, throughput (tokens/s), and memory usage. Memory usage is fixed for a given model, but we’ll explore how quantization shrinks it and how that influences throughput, especially since the memory saved lets us fit a larger context. A larger context means we can process more queries in parallel.
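
To give a flavour of what such a measurement looks like, here is a minimal throughput sketch using the llama-cpp-python bindings (the benchmarks themselves are run with llama.cpp); the model file and parameters are placeholders you would adjust for your own GPU and quantization.

import time
from llama_cpp import Llama

# Placeholder model file and settings - adjust for your GPU and quantization.
llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,       # a smaller quantization leaves VRAM for a larger context
)

prompt = "Summarize the retrieved documents: ..."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tokens/s")

For systematic comparisons across models, quantizations, and batch sizes, llama.cpp also ships a dedicated llama-bench tool, which is closer to how the numbers in the talk are gathered.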

Participants will get a better sense of how to deploy their RAG/LLM in production from a hardware, model, and quantization perspective.


Tags:

Data Science, Scale, Operations

Level:

Intermediate

Radu has been in the search space for many years, mainly on Elasticsearch, Solr, OpenSearch, and, more recently, Vespa.ai. He helps users with both the relevance and the operations side of retrieval. He enjoys education in all its forms (training, blog posts, books, conferences...) and has had the chance to be involved in all of them.

Software engineer, trainer, consultant, and author from time to time - some would say he is an all-in-one battle weapon focused on information retrieval, performance, and user search experience. However, he also likes all the other cool stuff happening in the IT world, and he enjoys sharing his knowledge by giving talks at various meetups and conferences.