Ollama Local LLMs Cheat Sheet

Run Llama, Mistral, and other open models locally with Ollama. This sheet covers installation, model management, API usage, and optimization.

Last Updated: December 24, 2025

Installation

# macOS (the install.sh script is Linux-only)
# Download the app from ollama.com/download, or install with Homebrew:
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download installer from ollama.com

# Verify installation
ollama --version

# Start Ollama server (skip if the desktop app or system service is already running it)
ollama serve

Basic Commands

ollama run llama3.3

Download and run Llama 3.3 interactively

ollama pull mistral

Download Mistral model

ollama list

List installed models

ollama rm llama3.3

Remove a model

ollama ps

Show running models

ollama cp source dest

Copy a model

Popular Models

Model	Size	Description
`llama3.3`	70B	Llama 3.3 (best quality, needs the most memory)
`mistral`	7B	Fast, efficient general model
`mixtral`	8x7B	Mixture of experts, very capable
`codellama`	7B / 13B / 34B	Specialized for code
`phi3`	3.8B	Small, fast Microsoft model
`gemma2`	9B / 27B	Google's open model
`qwen2.5`	0.5B - 72B	Alibaba's Qwen 2.5 family
`deepseek-coder-v2`	16B / 236B	Excellent for coding

Model Variants

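# Pull specific quantization (exact tags vary; check the model's Tags page on ollama.com)
# llama3.3 ships only as 70B, so the 8B examples use llama3.1
ollama pull llama3.1:8b-instruct-q4_0       # 4-bit quantization
ollama pull llama3.1:8b-instruct-q8_0       # 8-bit quantization
ollama pull llama3.3:70b-instruct-q4_K_M    # Medium quality 4-bit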

# Quantization levels (smallest to largest):
# q2_K - Very small, lowest quality
# q3_K_S, q3_K_M, q3_K_L - Small, low quality
# q4_0, q4_1, q4_K_S, q4_K_M - Medium size/quality (recommended)
# q5_0, q5_1, q5_K_S, q5_K_M - Larger, better quality
# q6_K - Large, high quality
# q8_0 - Largest quantized format, close to original quality
# f16 - Unquantized 16-bit half precision (huge)
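
To get a rough feel for what these levels mean for download size and memory, the weights-only footprint can be estimated as parameters × bits per weight ÷ 8. The sketch below is an illustration, not part of Ollama: the function name is hypothetical, and the bits-per-weight figures cover only the simple block formats (32 weights per block plus a 16-bit scale, and a further 16-bit minimum for the `_1` variants). K-quants are left out because their effective size depends on the tensor mix, and the KV cache and runtime overhead come on top of this estimate.

```python
# Bits per weight for the simple block formats:
# each block stores 32 weights plus a 16-bit scale (and a 16-bit minimum for *_1).
BITS_PER_WEIGHT = {
    "q4_0": 4.5,
    "q4_1": 5.0,
    "q5_0": 5.5,
    "q5_1": 6.0,
    "q8_0": 8.5,
    "f16": 16.0,
}


def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Estimate the weights-only size in GB (10^9 bytes)."""
    if quant not in BITS_PER_WEIGHT:
        raise ValueError(f"no estimate available for {quant!r}")
    # billions of parameters * bits per weight / 8 bits per byte = billions of bytes
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in ("q4_0", "q8_0", "f16"):
    print(f"8B at {quant}: ~{estimate_weights_gb(8, quant):.1f} GB")
```

By this estimate an 8B model needs about 4.5 GB at q4_0 and 16 GB at f16, which is why the 4-bit levels are the usual starting point on consumer hardware.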

Interactive Mode

# Start chat
ollama run llama3.3

# Special commands in chat:
/bye           # Exit
/clear         # Clear conversation
/set parameter <name> <value>  # Set a parameter
/show info     # Show model info
/? or /help    # Show help

# Examples:
>>> /set parameter temperature 0.7
>>> /set parameter top_p 0.9
>>> /show info
>>> /clear

API Usage

# Install the Python client (shell)
pip install ollama

# Python
import ollama

# Simple chat request
response = ollama.chat(model='llama3.3', messages=[
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
])
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True,
):
    print(chunk['message']['content'], end='', flush=True)

# With options
response = ollama.chat(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Hello'}],
    options={
        'temperature': 0.7,
        'top_p': 0.9,
        'top_k': 40,
    }
)
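
Each `ollama.chat` call above is stateless, so a multi-turn conversation has to resend the earlier messages. A minimal sketch of one way to keep that history follows. The `Conversation` class and its `chat_fn` parameter are hypothetical names introduced for illustration; `chat_fn` defaults to `ollama.chat` and exists only so the class can be exercised without a running server.

```python
from typing import Callable, Optional


class Conversation:
    """Keeps the message history so each request carries the earlier turns."""

    def __init__(self, model: str, system: Optional[str] = None,
                 chat_fn: Optional[Callable] = None):
        self.model = model
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})
        self._chat_fn = chat_fn

    def ask(self, prompt: str) -> str:
        chat_fn = self._chat_fn
        if chat_fn is None:
            import ollama  # imported lazily; needs `pip install ollama` and a running server
            chat_fn = ollama.chat
        # Send the full history plus the new user turn.
        pending = self.messages + [{"role": "user", "content": prompt}]
        response = chat_fn(model=self.model, messages=pending)
        reply = response["message"]["content"]
        # Commit both turns only after the call succeeded.
        self.messages = pending + [{"role": "assistant", "content": reply}]
        return reply
```

With a server running, `chat = Conversation("llama3.3", system="Answer briefly.")` followed by `chat.ask("Why is the sky blue?")` and `chat.ask("And at sunset?")` would let the second question refer back to the first.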

REST API

# Generate completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat completion
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'

# List models
curl http://localhost:11434/api/tags

# Show model info
curl http://localhost:11434/api/show -d '{
  "model": "llama3.3"
}'

# Pull model (streams progress updates as it downloads)
curl http://localhost:11434/api/pull -d '{
  "model": "mistral"
}'
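
Unless `"stream": false` is set, `/api/generate` answers with one JSON object per line, each carrying a `response` fragment and a `done` flag. A minimal sketch of consuming that stream from Python using only the standard library follows. The helper names are hypothetical, and the address is the default used in the curl examples.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://localhost:11434"  # default address used in the curl examples


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a streaming POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_response_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment from each newline-delimited JSON chunk."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final chunk is marked done


def generate(model: str, prompt: str) -> str:
    """Send the request and join the streamed fragments (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return "".join(iter_response_text(resp))
```

Calling `generate("llama3.3", "Why is the sky blue?")` returns the whole answer as one string. Iterating over `iter_response_text(resp)` directly would instead let a caller print fragments as they arrive.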

Model Options

Parameter	Range	Description
`temperature`	0.0 - 2.0	Randomness (default 0.8)
`top_p`	0.0 - 1.0	Nucleus sampling (default 0.9)
`top_k`	1 - 100	Top-k sampling (default 40)
`num_ctx`	128 - 131072	Context window size in tokens (upper limit depends on the model)
`repeat_penalty`	0.0 - 2.0	Penalize repetition (default 1.1)
`seed`	Any integer	Random seed for reproducibility
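
A small pre-flight check can catch out-of-range options before a request is sent. The sketch below encodes the ranges from the table as this sheet's guidance. Ollama itself may accept values outside them, and the names `OPTION_RANGES` and `check_options` are hypothetical.

```python
# Ranges copied from the table above (guidance from this sheet, not limits enforced by Ollama).
OPTION_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "top_k": (1, 100),
    "num_ctx": (128, 131072),
    "repeat_penalty": (0.0, 2.0),
}
INTEGER_OPTIONS = {"top_k", "num_ctx", "seed"}


def check_options(options: dict) -> list:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    for name, value in options.items():
        if name != "seed" and name not in OPTION_RANGES:
            problems.append(f"{name}: not covered by the table")
            continue
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected a number, got {value!r}")
            continue
        if name in INTEGER_OPTIONS and not isinstance(value, int):
            problems.append(f"{name}: expected an integer, got {value!r}")
            continue
        if name in OPTION_RANGES:
            low, high = OPTION_RANGES[name]
            if not low <= value <= high:
                problems.append(f"{name}: {value} is outside {low} - {high}")
    return problems


print(check_options({"temperature": 0.7, "top_p": 0.9, "top_k": 40}))
print(check_options({"temperature": 3, "top_k": 40.5}))
```

A dictionary that passes can be handed to the `options` argument of `ollama.chat` or to the `"options"` field of a REST request.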

Custom Modelfile

# Modelfile contents (save as ./Modelfile)
FROM llama3.3

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40

# Set system prompt
SYSTEM """
You are a helpful coding assistant specializing in Python.
Provide clear, well-commented code examples.
"""

# Build model (run in a shell)
ollama create my-python-assistant -f ./Modelfile

# Use custom model
ollama run my-python-assistant
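
When several custom models differ only in their parameters or system prompt, the Modelfile text can be generated rather than written by hand. This is a sketch using only the `FROM`, `PARAMETER`, and `SYSTEM` directives shown above. `render_modelfile` is a hypothetical helper, not an Ollama API.

```python
def render_modelfile(base: str, system: str = "", **parameters) -> str:
    """Render Modelfile text from a base model, optional system prompt, and parameters."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    if system:
        if '"""' in system:
            raise ValueError('system prompt must not contain """')
        # Triple quotes allow a multi-line system prompt.
        lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.3",
    system="You are a helpful coding assistant specializing in Python.",
    temperature=0.7,
    top_p=0.9,
)
print(text)
```

Writing `text` to `./Modelfile` and running `ollama create my-python-assistant -f ./Modelfile` would then build the model as before.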

Embeddings

# Python
import ollama

# Generate embeddings
embedding = ollama.embeddings(
    model='llama3.3',
    prompt='The sky is blue'
)
print(embedding['embedding'])

# REST API
curl http://localhost:11434/api/embeddings -d '{
  "model": "llama3.3",
  "prompt": "The sky is blue"
}'

# A dedicated embedding model such as nomic-embed-text is recommended.
# Pull it first from a shell: ollama pull nomic-embed-text
embedding = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Your text here'
)
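
Embeddings are usually compared with cosine similarity, for example to rank documents against a query. The sketch below is plain Python with no Ollama dependency. The function names are hypothetical, and the vectors would in practice be the `embedding` lists returned by the calls above.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       documents: Dict[str, Sequence[float]]) -> List[Tuple[float, str]]:
    """Return (score, label) pairs sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in documents.items()]
    return sorted(scored, reverse=True)
```

Only vectors produced by the same embedding model are comparable, so the query and the documents should all be embedded with, say, `nomic-embed-text`.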

LangChain Integration

# Requires: pip install langchain-ollama
from langchain_ollama import ChatOllama, OllamaLLM

llm = OllamaLLM(model="llama3.3")

response = llm.invoke("Tell me a joke")

# Stream tokens to stdout as they arrive
for chunk in llm.stream("Tell me a joke"):
    print(chunk, end="", flush=True)

# With chat models
chat = ChatOllama(model="llama3.3")
messages = [
    ("system", "You are a helpful assistant"),
    ("human", "What is 2+2?")
]
response = chat.invoke(messages)
print(response.content)

Performance Optimization

Use smaller quantizations (q4_0, q4_K_M)

Faster inference, less memory

Reduce num_ctx for shorter contexts

Lower context = faster generation

Use GPU acceleration

CUDA/Metal automatically detected

Keep models loaded in memory

Set keep_alive parameter

Choose smaller models (7B vs 70B)

Balance speed vs quality
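
Two of these tips, keeping the model loaded and reducing the context, can be applied per request. A sketch of a `/api/chat` request body that does both follows. It assumes the `keep_alive` request field and the `num_ctx` option, and the function name and default values are illustrative.

```python
import json


def build_chat_payload(model: str, prompt: str,
                       keep_alive: str = "10m", num_ctx: int = 2048) -> dict:
    """Build a /api/chat body that keeps the model loaded and caps the context size."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,         # how long the model stays in memory after the call
        "options": {"num_ctx": num_ctx},  # smaller context window, less memory per request
    }


payload = build_chat_payload("mistral", "Summarize this in one line.")
print(json.dumps(payload, indent=2))
```

The same settings can be passed to the Python client as `ollama.chat(..., keep_alive="10m", options={"num_ctx": 2048})`.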

Environment Variables

# Set model storage location
export OLLAMA_MODELS=/path/to/models

# Set host and port (0.0.0.0 listens on all network interfaces)
export OLLAMA_HOST=0.0.0.0:11434

# GPU settings
export CUDA_VISIBLE_DEVICES=0  # Limit Ollama to specific NVIDIA GPUs
# The number of layers to offload is a per-model option (num_gpu),
# set through the API options rather than an environment variable

# Keep models loaded
export OLLAMA_KEEP_ALIVE=5m    # Keep in memory for 5 minutes
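
`OLLAMA_HOST` is also a convenient place for client scripts to look up the server address. The sketch below turns its value into a base URL for a client on the same machine. `client_base_url` is a hypothetical helper that handles only the `host:port` form shown above and full URLs (IPv6 literals are not handled).

```python
import os
from typing import Mapping, Optional


def client_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Turn an OLLAMA_HOST value into a URL that a local client can call."""
    env = os.environ if env is None else env
    value = env.get("OLLAMA_HOST", "").strip() or "localhost:11434"
    if "://" in value:
        return value.rstrip("/")  # already a full URL, use it as given
    host, _, port = value.partition(":")
    # 0.0.0.0 means "listen on every interface"; connect through localhost instead.
    if host in ("", "0.0.0.0"):
        host = "localhost"
    return f"http://{host}:{port or '11434'}"


print(client_base_url({"OLLAMA_HOST": "0.0.0.0:11434"}))
```

Passing a dictionary instead of reading `os.environ` keeps the function easy to test. Called with no argument, it uses the real environment.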

Troubleshooting

Issue	Solution
Out of memory	Use smaller model or quantization (q4_0)
Slow generation	Check GPU usage, reduce context size
Model won't load	Check disk space, verify model name
API connection refused	Ensure ollama serve is running
GPU not detected	Update drivers, check CUDA/Metal installation
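
For the "API connection refused" case, a quick first check is whether anything is listening on the Ollama port at all. This is a minimal sketch using only the standard library. The function name is hypothetical, and the defaults are the address used throughout this sheet.

```python
import socket


def ollama_port_open(host: str = "localhost", port: int = 11434,
                     timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts.
        return False
```

If `ollama_port_open()` returns `False`, start the server with `ollama serve` or check that `OLLAMA_HOST` points at the address the client is using. A `True` result only shows that the port is open, not that the process behind it is Ollama.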

💡 Pro Tip: Start with mistral:7b-instruct-q4_0 for a fast, capable model that runs on most hardware. For coding, try deepseek-coder-v2:16b. For production, an 8B model at q4_K_M such as llama3.1:8b-instruct-q4_K_M gives a good balance of speed and quality (llama3.3 ships only as 70B). Always test several quantizations on your own hardware before settling on one.
