
LM Studio: Use LLM locally with offloading to NVIDIA and AMD graphics cards

10.06.2026, 20:08, by Andreas Bunen

In the rapidly evolving world of AI, large language models (LLMs) have become indispensable tools for tasks such as document creation, conversational AI and customer support. Now that many of these models can run on NVIDIA GeForce RTX and AMD Radeon RX GPUs, users no longer need to rely solely on cloud services or data centers for AI-powered tasks.

LM Studio Windows (Image © PCMasters)

However, larger models often exceed the memory capacity of consumer GPUs, so innovative solutions are needed to bring these powerful tools to everyday systems.

Local AI with GPU offloading

GPU offloading is a technique that allows users with less powerful GPUs to utilize the power of LLMs by splitting the processing between the CPU and GPU. This method ensures that users can benefit from GPU acceleration even when an LLM exceeds the GPU's available video memory.

LM Studio takes advantage of this approach and provides an intuitive interface for downloading and running LLMs locally. It is based on llama.cpp and lets users set exactly how much of a model is offloaded to the GPU, giving them precise control over the balance between speed and VRAM usage.
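To make the idea of partial offloading more concrete, here is a minimal sketch in Python of how one might estimate the number of model layers that fit into a given amount of VRAM. It is not LM Studio's or llama.cpp's actual logic. The function name `layers_that_fit`, the `reserve_gb` safety margin and the layer count of 46 in the example are assumptions made for illustration, and the sketch treats all layers as equally large, which is only approximately true in practice.

```python
import math


def layers_that_fit(model_size_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many layers of a model can be placed on the GPU.

    Assumption: every layer takes the same share of the model size.
    reserve_gb is a hypothetical margin kept free for the context cache,
    the desktop and other applications that also use VRAM.
    """
    if model_size_gb <= 0 or n_layers <= 0:
        raise ValueError("model_size_gb and n_layers must be positive")
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    # Never report more layers than the model actually has.
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Example: a 19 GB model with an assumed 46 layers.
print(layers_that_fit(19.0, 46, 12.0))  # 12 GB card: 24 layers on the GPU
print(layers_that_fit(19.0, 46, 24.0))  # 24 GB card: all 46 layers fit
```

A result like this is only a starting point for the offloading setting. The actual limit depends on the quantization, the context length and whatever else is occupying VRAM at the time.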

LM Studio: searching for and installing models (Image © PCMasters)

More performance without maximum VRAM

A large model such as Gemma-2-27B, with 27 billion parameters, requires 19 GB of VRAM for full GPU acceleration. With GPU offloading, users with less powerful systems can still achieve a significant performance boost by running part of the model on the GPU and the rest on the CPU.
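The memory figure follows from simple arithmetic: the weights need roughly the parameter count times the bits stored per weight, divided by eight to get bytes. The Python sketch below illustrates this. The function name `weight_memory_gb` and the bit widths shown are assumptions for illustration, since the article does not state which quantization the 19 GB figure refers to, and the estimate ignores the context cache and runtime overhead.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Rough size of the model weights in GB (1 GB = 1e9 bytes).

    Ignores the context cache and runtime overhead, so real usage is higher.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# 27 billion parameters at a few illustrative precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(27, bits):.1f} GB")
# 16-bit: 54.0 GB
#  8-bit: 27.0 GB
#  4-bit: 13.5 GB
```

By this estimate, 19 GB for 27 billion parameters lies between the 4-bit and 8-bit cases, which would be consistent with a quantized model plus some overhead. It also shows why an unquantized 16-bit copy is far beyond the VRAM of any consumer GPU.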

Tests have shown that offloading can dramatically improve throughput, from just a few tokens per second on the CPU alone to much higher speeds when a larger portion of the model is processed on the GPU. On our test system with a Ryzen 9 9750X CPU, 32 GB of RAM and a GeForce RTX 4090, we ran Llama 3.2 3B at around 77 tokens per second, which is already very fast and more than sufficient for everyday use.
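To put a tokens-per-second figure into perspective, it helps to convert it into the waiting time for a complete answer. The following Python sketch does that conversion. The helper name `seconds_for_answer`, the answer length of 500 tokens and the CPU-only rate of 5 tokens per second are illustrative assumptions, not measurements from our test.

```python
def seconds_for_answer(n_tokens, tokens_per_second):
    """Time in seconds needed to generate n_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return n_tokens / tokens_per_second


answer_tokens = 500  # assumed length of a longer answer

# Measured rate from the test system versus an assumed CPU-only rate.
print(f"GPU (77 tok/s): {seconds_for_answer(answer_tokens, 77):.1f} s")
print(f"CPU  (5 tok/s): {seconds_for_answer(answer_tokens, 5):.1f} s")
```

At 77 tokens per second a 500-token answer takes about six and a half seconds, while the assumed CPU-only rate would need well over a minute for the same text.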

Optimizing LLM performance

LM Studio's GPU offloading slider lets you find the right balance between performance and memory usage for your system configuration. Whether you're using a high-end GeForce RTX 4090 or a more modest GPU, LM Studio helps you get the most out of LLMs on your local system.

LM Studio's settings offer several llama.cpp runtimes: CPU llama.cpp for CPUs, CUDA llama.cpp for NVIDIA graphics cards and Vulkan llama.cpp for AMD graphics cards. Additional models can also be searched for and installed in the settings. This makes LM Studio a well-suited front end for experienced and less experienced users alike.

