Ollama Local LLMs Cheat Sheet

Run Llama, Mistral, and other open models locally with Ollama. This cheat sheet covers installation, model management, API usage, and optimization.

Last Updated: December 24, 2025

Installation

# macOS (the install.sh script is Linux-only)
# Download the app from ollama.com, or install with Homebrew:
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download installer from ollama.com

# Verify installation
ollama --version

# Start Ollama server
ollama serve

Basic Commands

ollama run llama3.3      # Download and run Llama 3.3 interactively
ollama pull mistral      # Download the Mistral model
ollama list              # List installed models
ollama rm llama3.3       # Remove a model
ollama ps                # Show running models
ollama cp source dest    # Copy a model

Popular Models

Model               Size               Description
llama3.3            70B                Latest Llama 3.3 (best quality, 70B only)
llama3.1            8B / 70B / 405B    Previous Llama generation; the 8B size runs on most hardware
mistral             7B                 Fast, efficient general model
mixtral             8x7B               Mixture of experts, very capable
codellama           7B / 13B / 34B     Specialized for code
phi3                3.8B               Small, fast Microsoft model
gemma2              9B / 27B           Google's open model
qwen2.5             0.5B - 72B         Alibaba's latest model
deepseek-coder-v2   16B / 236B         Excellent for coding

Model Variants

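# Pull a specific quantization (tag names vary by model; check the model's page on ollama.com)
# Llama 3.3 ships only as 70B, so the 8B examples use llama3.1
ollama pull llama3.1:8b-instruct-q4_0       # 4-bit quantization
ollama pull llama3.1:8b-instruct-q8_0       # 8-bit quantization
ollama pull llama3.3:70b-instruct-q4_K_M    # Medium quality 4-bit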

# Quantization levels (smallest to largest):
# q2_K - Very small, lowest quality
# q3_K_S, q3_K_M, q3_K_L - Small, low quality
# q4_0, q4_1, q4_K_S, q4_K_M - Medium size/quality (recommended)
# q5_0, q5_1, q5_K_S, q5_K_M - Larger, better quality
# q6_K - Very large, high quality
# q8_0 - Largest quantized level, near-lossless quality
# f16 - Unquantized 16-bit half precision (about twice the size of q8_0)
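
As a rough guide to what these levels mean in practice, weight size scales with parameter count times bits per weight. The sketch below is an estimate under stated assumptions, not a measurement: q4_0 and q8_0 store one 16-bit scale per block of 32 weights (4.5 and 8.5 bits per weight), the q4_K_M figure is an approximate average, and runtime overhead such as the KV cache is ignored. The function name `estimate_size_gb` is made up for illustration.

```python
# Approximate bits per weight for a few quantization levels.
# q4_0 / q8_0: 32 weights per block plus one 16-bit scale.
# q4_K_M mixes block types; 4.85 is an approximate average (assumption).
BITS_PER_WEIGHT = {
    "q4_0": 4.5,
    "q4_K_M": 4.85,
    "q8_0": 8.5,
    "f16": 16.0,
}


def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Rough weight size in GB (10^9 bytes), excluding KV cache and overhead."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8


for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant}: ~{estimate_size_gb(8, quant):.1f} GB")
```

Actual download sizes differ somewhat because some tensors are kept at higher precision, so treat the result as a lower bound when checking whether a model fits in memory.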

Interactive Mode

# Start chat
ollama run llama3.3

# Special commands in chat:
/bye           # Exit
/clear         # Clear conversation
/set parameter <name> <value>  # Set a model parameter
/show info     # Show model info
/? or /help    # Show help

# Examples:
>>> /set parameter temperature 0.7
>>> /set parameter top_p 0.9
>>> /show info
>>> /clear

API Usage

# Python (install the client first, in a shell: pip install ollama)

import ollama

# Simple generation
response = ollama.chat(model='llama3.3', messages=[
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
])
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True,
):
    print(chunk['message']['content'], end='', flush=True)

# With options
response = ollama.chat(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Hello'}],
    options={
        'temperature': 0.7,
        'top_p': 0.9,
        'top_k': 40,
    }
)
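
Each `ollama.chat` call above is independent: the model only sees the messages passed in that call. For a multi-turn conversation, the caller keeps the history and resends it every time. The following is a minimal sketch of that pattern; the helper `chat_turn` and the offline `fake_send` backend are hypothetical names used only so the example runs without a server.

```python
def chat_turn(history, user_text, send=None):
    """Append a user message, get a reply, and record it in the history."""
    if send is None:
        import ollama  # imported lazily so the helper also works with a stub backend

        def send(messages):
            response = ollama.chat(model='llama3.3', messages=messages)
            return response['message']['content']

    history.append({'role': 'user', 'content': user_text})
    reply = send(history)  # the full history goes out on every turn
    history.append({'role': 'assistant', 'content': reply})
    return reply


def fake_send(messages):
    """Stand-in backend so the example runs offline."""
    return f"(reply after seeing {len(messages)} messages)"


history = [{'role': 'system', 'content': 'You are a concise assistant.'}]
print(chat_turn(history, 'Name one planet.', send=fake_send))
print(chat_turn(history, 'Name another one.', send=fake_send))
```

Dropping the `send` argument makes the helper call a local Ollama server instead of the stub.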

REST API

# Generate completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat completion
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'

# List models
curl http://localhost:11434/api/tags

# Show model info
curl http://localhost:11434/api/show -d '{
  "model": "llama3.3"
}'

# Pull model
curl http://localhost:11434/api/pull -d '{
  "model": "mistral"
}'
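
The same endpoints can be called from Python without any third-party package. When `"stream"` is left on, `/api/generate` returns one JSON object per line, each carrying a `"response"` fragment, and the last one has `"done": true`. The sketch below uses only the standard library; the helper names (`build_request`, `collect_stream`, `generate`) are illustrative, and `generate` needs a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, stream=True):
    """Build a POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def collect_stream(lines):
    """Join the "response" fragments from a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(model, prompt):
    """Send the prompt and return the full text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return collect_stream(resp)
```

Calling `generate("llama3.3", "Why is the sky blue?")` returns the assembled answer once the stream ends; to print tokens as they arrive, loop over the response lines directly instead of joining them.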

Model Options

Parameter        Range           Description
temperature      0.0 - 2.0       Randomness (default 0.8)
top_p            0.0 - 1.0       Nucleus sampling (default 0.9)
top_k            1 - 100         Top-k sampling (default 40)
num_ctx          128 - 131072    Context window size in tokens (maximum depends on the model)
repeat_penalty   0.0 - 2.0       Penalize repetition (default 1.1)
seed             Any integer     Random seed for reproducibility
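
To make `top_k` and `top_p` concrete, here is a simplified sketch of how the two filters narrow the candidate tokens before one is sampled. It is a toy illustration of the general technique, not Ollama's actual sampler; the function name and the four-token distribution are invented for the example.

```python
def filter_top_k_top_p(probs, top_k=40, top_p=0.9):
    """Return the renormalized distribution after top-k, then top-p filtering."""
    # Rank tokens from most to least likely and keep at most top_k of them.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"blue": 0.5, "clear": 0.3, "grey": 0.15, "green": 0.05}
print(filter_top_k_top_p(probs, top_k=3, top_p=0.7))
```

With these inputs only "blue" and "clear" survive: `top_k=3` drops "green", and `top_p=0.7` is reached after the second token. Lower values of either setting make output more predictable; temperature then controls how sharply the remaining probabilities are weighted.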

Custom Modelfile

# Save the following as ./Modelfile
FROM llama3.3

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40

# Set system prompt
SYSTEM """
You are a helpful coding assistant specializing in Python.
Provide clear, well-commented code examples.
"""

# Build model (run in a shell; not part of the Modelfile)
ollama create my-python-assistant -f ./Modelfile

# Use custom model
ollama run my-python-assistant
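
When several custom models share a template, the Modelfile text can be generated rather than written by hand. The sketch below builds the same three instruction types used above (`FROM`, `PARAMETER`, `SYSTEM`); the helper name `render_modelfile` is hypothetical, and the result still has to be saved and built with `ollama create`.

```python
def render_modelfile(base, parameters, system_prompt):
    """Return Modelfile text with FROM, PARAMETER, and SYSTEM instructions."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    lines.append(f'SYSTEM """\n{system_prompt.strip()}\n"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.3",
    {"temperature": 0.7, "top_p": 0.9, "top_k": 40},
    "You are a helpful coding assistant specializing in Python.",
)
print(text)
# To build: write `text` to ./Modelfile, then run
#   ollama create my-python-assistant -f ./Modelfile
```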

Embeddings

# Python
import ollama

# Generate embeddings
embedding = ollama.embeddings(
    model='llama3.3',
    prompt='The sky is blue'
)
print(embedding['embedding'])

# REST API
curl http://localhost:11434/api/embeddings -d '{
  "model": "llama3.3",
  "prompt": "The sky is blue"
}'

# Use a dedicated embedding model such as nomic-embed-text (recommended)
# Pull it first, in a shell: ollama pull nomic-embed-text
embedding = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Your text here'
)
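
Embedding vectors are usually compared with cosine similarity, for example to rank documents against a query. The sketch below shows that comparison in plain Python; the three-dimensional vectors are toy stand-ins for real embeddings (which have hundreds of dimensions), and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document keys sorted from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )


# Toy vectors standing in for embedding['embedding'] values
docs = {"sky": [0.9, 0.1, 0.0], "ocean": [0.6, 0.4, 0.1], "tax": [0.0, 0.2, 0.9]}
print(rank_by_similarity([1.0, 0.0, 0.0], docs))  # ['sky', 'ocean', 'tax']
```

In real use, `query_vec` and each entry in `docs` would come from the embedding calls shown above, all generated with the same model.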

LangChain Integration

# pip install langchain-ollama
# (langchain_community's Ollama classes and callback managers are deprecated)
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.3")

# Stream tokens to stdout as they arrive
for chunk in llm.stream("Tell me a joke"):
    print(chunk, end="", flush=True)

# Or get the full response at once
response = llm.invoke("Tell me a joke")

# With chat models
from langchain_ollama import ChatOllama

chat = ChatOllama(model="llama3.3")
messages = [
    ("system", "You are a helpful assistant"),
    ("human", "What is 2+2?")
]
response = chat.invoke(messages)

Performance Optimization

Use smaller quantizations (q4_0, q4_K_M): faster inference, less memory
Reduce num_ctx when prompts are short: a smaller context window uses less memory and speeds up prompt processing
Use GPU acceleration: CUDA (NVIDIA) and Metal (Apple Silicon) are detected automatically
Keep models loaded in memory: set the keep_alive parameter to avoid reload delays between requests
Choose smaller models (7B vs 70B): balance speed against quality
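
Two of these tips are set per request: `num_ctx` goes inside `options`, while `keep_alive` is a top-level field of the request body. The sketch below builds such a body for `/api/chat`; the helper name and the default values are illustrative choices, not recommendations.

```python
import json


def build_chat_payload(model, prompt, num_ctx=2048, keep_alive="10m"):
    """Build an /api/chat request body with a context size and keep-alive."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window for this request, in tokens
        "keep_alive": keep_alive,  # how long the model stays loaded afterwards
    }


print(json.dumps(build_chat_payload("mistral", "Hello"), indent=2))
```

With the Python client, the same settings are passed to `ollama.chat` as `options={'num_ctx': 2048}` and `keep_alive='10m'`.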

Environment Variables

# Set model storage location
export OLLAMA_MODELS=/path/to/models

# Set host and port (0.0.0.0 listens on all network interfaces; use 127.0.0.1 to stay local)
export OLLAMA_HOST=0.0.0.0:11434

# GPU settings
export CUDA_VISIBLE_DEVICES=0  # Restrict Ollama to specific NVIDIA GPUs (by index)
# Layer offload is a model option, not an environment variable:
# set num_gpu (e.g. PARAMETER num_gpu 35 in a Modelfile)

# Keep models loaded
export OLLAMA_KEEP_ALIVE=5m    # Keep in memory for 5 minutes
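
`OLLAMA_HOST` is also useful on the client side for locating the server. The sketch below derives a base URL from it under a few assumptions: a missing scheme means `http`, a missing port means 11434, and the bind address `0.0.0.0` is swapped for loopback because clients cannot connect to it portably. The function name is hypothetical and IPv6 addresses are not handled.

```python
import os

DEFAULT_HOST = "127.0.0.1:11434"


def client_base_url(env=os.environ):
    """Derive an HTTP base URL for clients from OLLAMA_HOST."""
    host = env.get("OLLAMA_HOST", DEFAULT_HOST).strip() or DEFAULT_HOST
    if "://" not in host:
        host = "http://" + host
    scheme, rest = host.split("://", 1)
    # 0.0.0.0 is a bind address for the server; clients connect via loopback.
    if rest.startswith("0.0.0.0"):
        rest = "127.0.0.1" + rest[len("0.0.0.0"):]
    # Add the default port when none is given (no IPv6 handling in this sketch).
    if ":" not in rest:
        rest += ":11434"
    return f"{scheme}://{rest}"


print(client_base_url({"OLLAMA_HOST": "0.0.0.0:11434"}))  # http://127.0.0.1:11434
```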

Troubleshooting

Issue                    Solution
Out of memory            Use smaller model or quantization (q4_0)
Slow generation          Check GPU usage, reduce context size
Model won't load         Check disk space, verify model name
API connection refused   Ensure ollama serve is running
GPU not detected         Update drivers, check CUDA/Metal installation
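
For the "API connection refused" case, a quick reachability check separates a stopped server from a bug in client code. The sketch below sends a GET to the server root, which answers with HTTP 200 when Ollama is running. The function name and the injectable `opener` argument (there only so the check can be exercised without a server) are illustrative.

```python
import urllib.error
import urllib.request


def is_server_up(base_url="http://localhost:11434", opener=urllib.request.urlopen):
    """Return True if the Ollama server answers at base_url."""
    try:
        with opener(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


if __name__ == "__main__":
    if is_server_up():
        print("Ollama is reachable")
    else:
        print("Connection failed: start the server with `ollama serve`")
```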
💡 Pro Tip: Start with mistral:7b-instruct-q4_0 for a fast, capable model that runs on most hardware. For coding, use deepseek-coder-v2:16b. For production, llama3.1:8b-instruct-q4_K_M offers a good balance of speed and quality (Llama 3.3 has no 8B variant). Exact tag names vary by model, so check the model's page on ollama.com, and always test different quantizations to find your own sweet spot.