BD Brian Detering Professor of Programming – University of Southern California
AI Tools

Local LLMs for Developers: Ollama vs LM Studio vs Jan

Brian Detering
Brian Detering Tech Writer & Developer

Running language models locally means no API costs, no data leaving your machine, and no rate limits. For developers working with proprietary code, sensitive data, or in air-gapped environments, local LLMs are increasingly practical.

I tested three tools for running LLMs locally — Ollama, LM Studio, and Jan — to see which ones are ready for real developer workflows in 2026.

Why Run LLMs Locally

Cloud AI services like GitHub Copilot and Cursor send your code to external servers. For most developers, this is fine. But if you work with proprietary algorithms, financial data, medical records, or classified information, local inference is the only option that satisfies data governance requirements.

Local LLMs also eliminate recurring costs. After the initial hardware investment, inference is free — useful for high-volume tasks like batch code analysis, documentation generation, and automated testing.

Ollama

Ollama is the Docker of local LLMs. Pull a model (ollama pull llama3.1), run it (ollama run llama3.1), and you have a local inference server with an OpenAI-compatible API. The simplicity is remarkable — from installation to running your first query takes under five minutes.

The model library covers the major open models: Llama 3.1, Mistral, Code Llama, Phi-3, Gemma, DeepSeek Coder, and dozens more. Model files (Modelfiles) let you customize system prompts, parameters, and quantization. Creating a specialized coding assistant with a specific system prompt is a one-file configuration.

The API compatibility is Ollama’s practical advantage. Any tool that supports the OpenAI API can point at your Ollama server instead. This includes aider (terminal AI coding tool), Continue (VS Code extension), and custom scripts. Switch between cloud and local inference by changing a URL.

Performance depends on hardware. On an M2 MacBook Pro with 16GB RAM, a 7B parameter model runs at 30-40 tokens/second — fast enough for interactive use. A 70B model requires 64GB+ RAM and is slower but more capable. GPU acceleration (NVIDIA CUDA, Apple Metal) makes larger models practical.

Best for

Developers who want a simple, CLI-first way to run local models with API compatibility. The best option for integrating local LLMs into existing tools and scripts. Essential for teams implementing Zero Trust security where code cannot leave the network.

LM Studio

LM Studio provides a desktop GUI for discovering, downloading, and running local LLMs. The model browser shows available models from Hugging Face with size, quantization options, and hardware requirements. Download a model, load it, and start chatting — no command line required.

The chat interface is polished and supports multiple conversations, system prompts, and parameter tuning (temperature, top-p, max tokens). For experimenting with different models and prompts, the GUI is faster than editing CLI configs.

LM Studio also runs a local server with OpenAI-compatible API, similar to Ollama. The difference is that LM Studio manages model files through the GUI rather than CLI commands. For developers who prefer visual tools, this is more approachable.

Model performance visualization shows token generation speed, memory usage, and GPU utilization in real time. This is useful for choosing the right quantization level — you can see exactly how a Q4 vs Q8 quantization affects speed and memory on your hardware.

Best for

Developers who prefer a GUI for model management and experimentation. Good for trying different models before committing to one for production use. Useful for teams where not everyone is comfortable with CLI tools.

Jan

Jan is an open-source desktop application that positions itself as a local alternative to ChatGPT. The interface is clean and conversation-focused, with model management, chat history, and extension support built in.

The extension system is Jan’s differentiator. Extensions add functionality like RAG (Retrieval-Augmented Generation) over local documents, tool use, and custom integrations. The Codebase extension indexes your local repository and answers questions about your code using the local model — no data leaves your machine.

Jan supports both local models (through llama.cpp) and remote APIs (OpenAI, Anthropic) in the same interface. You can start a conversation with a local model and switch to a cloud model for a harder question, maintaining conversation context.

The assistants feature lets you create specialized AI personas with custom system prompts, model selections, and document contexts. A “Code Reviewer” assistant with a code review system prompt and your style guide as context is practical for local code review without sending code to external services.

Best for

Developers who want a ChatGPT-like experience running entirely locally. Teams that need RAG over private documents. Anyone who wants to mix local and cloud models in a single interface.

Hardware Considerations

For 7B models (good for code completion, simple explanations): 8GB RAM minimum, 16GB recommended. Apple Silicon Macs perform well here.

For 13-34B models (better reasoning, longer context): 16-32GB RAM, GPU acceleration strongly recommended.

For 70B+ models (approaching cloud model quality): 64GB+ RAM or a dedicated GPU with 24GB+ VRAM (RTX 4090, A6000).

Quantization (Q4, Q5, Q8) trades quality for speed and memory. Q4 runs twice as fast as Q8 with slightly lower quality. For most coding tasks, Q5 is the sweet spot.

Verdict

Ollama is the best for developers who want CLI-first simplicity and API compatibility. If you plan to integrate local LLMs into scripts, tools, or pipelines, start here.

LM Studio is the best for model experimentation through a visual interface. The performance monitoring helps you find the right model and quantization for your hardware.

Jan is the best for a complete local AI assistant experience with RAG and extensions. If you want a ChatGPT replacement that runs on your machine, Jan is the most polished option.

Start with Ollama — it takes five minutes to install and run your first model. If you need a GUI or advanced features, try LM Studio or Jan alongside it.

Brian Detering

About Brian Detering

Brian Detering is a software engineer, educator, and tech writer based in Los Angeles. He teaches programming and software engineering at the University of Southern California, where his work spans programming languages, systems architecture, and applied AI. With over a decade of hands-on experience building production systems, Brian writes about the tools and workflows that actually make developers more productive — from CI/CD pipelines and containerization to API testing and security best practices. When he's not teaching or writing code, he's usually benchmarking the latest dev tools or tinkering with homelab infrastructure.

Related Articles