Last Updated: December 24, 2025
Installation
# macOS
# Download the app from ollama.com/download, or install with Homebrew:
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download installer from ollama.com
# Verify installation
ollama --version
# Start Ollama server
ollama serve
Basic Commands
ollama run llama3.3 # Download and run Llama 3.3 interactively
ollama pull mistral # Download the Mistral model
ollama list # List installed models
ollama rm llama3.3 # Remove a model
ollama ps # Show running models
ollama cp source dest # Copy a model
Popular Models
| Model | Size | Description |
|---|---|---|
| llama3.3 | 70B | Latest Llama 3.3 (best quality) |
| mistral | 7B | Fast, efficient general model |
| mixtral | 8x7B | Mixture of experts, very capable |
| codellama | 7B / 13B / 34B | Specialized for code |
| phi3 | 3.8B | Small, fast Microsoft model |
| gemma2 | 9B / 27B | Google's open model |
| qwen2.5 | 0.5B - 72B | Alibaba's latest model |
| deepseek-coder-v2 | 16B / 236B | Excellent for coding |
Model Variants
# Pull a specific quantization (tags differ per model; check the model's Tags page on ollama.com)
ollama pull llama3.1:8b-instruct-q4_0 # 4-bit quantization
ollama pull llama3.1:8b-instruct-q8_0 # 8-bit quantization
ollama pull llama3.3:70b-instruct-q4_K_M # Medium quality 4-bit
# Quantization levels (smallest to largest):
# q2_K - Very small, lowest quality
# q3_K_S, q3_K_M, q3_K_L - Small, low quality
# q4_0, q4_1, q4_K_S, q4_K_M - Medium size/quality (recommended)
# q5_0, q5_1, q5_K_S, q5_K_M - Larger, better quality
# q6_K - Very large, high quality
# q8_0 - Largest quantized format, highest quality
# f16 - 16-bit half precision, unquantized (huge)
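To get a rough feel for what these levels mean in memory terms, the sketch below estimates the size of the weights alone. It only covers q4_0, q8_0, and f16, whose bits-per-weight figures follow directly from the GGUF block layout; `estimate_weights_gb` is a hypothetical helper, and real usage is higher once the context cache and runtime overhead are added.

```python
# Bits per weight for three GGUF formats. q4_0 and q8_0 store 32 weights per
# block plus one 16-bit scale, which works out to 4.5 and 8.5 bits per weight.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q8_0": 8.5, "f16": 16.0}


def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billion * bits / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"8B model at {quant}: ~{estimate_weights_gb(8, quant):.1f} GB")
```

Treat the result as a lower bound when deciding whether a model fits in RAM or VRAM.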
Interactive Mode
# Start chat
ollama run llama3.3
# Special commands in chat:
/bye # Exit
/clear # Clear conversation
/set parameter <name> <value> # Set a model parameter
/show info # Show model info
/? or /help # Show help
# Examples:
>>> /set parameter temperature 0.7
>>> /set parameter top_p 0.9
>>> /show info
>>> /clear
API Usage
# Python
# Install the client first (shell): pip install ollama
import ollama
# Simple generation
response = ollama.chat(model='llama3.3', messages=[
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
])
print(response['message']['content'])
# Streaming: each chunk carries the next fragment of the reply
for chunk in ollama.chat(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True,
):
    print(chunk['message']['content'], end='', flush=True)
# With options
response = ollama.chat(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Hello'}],
    options={
        'temperature': 0.7,
        'top_p': 0.9,
        'top_k': 40,
    }
)
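Each `ollama.chat` call is stateless, so a multi-turn conversation has to resend the earlier messages. A minimal sketch of one way to keep that history follows; `chat_turn` and `default_send` are hypothetical helpers, and the `send` argument exists only so the logic can be exercised without a running server.

```python
from typing import Callable

Message = dict[str, str]


def default_send(model: str, messages: list[Message]) -> str:
    """Send the full history to a local Ollama server and return the reply."""
    import ollama  # imported lazily so chat_turn can be tested offline

    response = ollama.chat(model=model, messages=messages)
    return response["message"]["content"]


def chat_turn(
    history: list[Message],
    user_text: str,
    model: str = "llama3.3",
    send: Callable[[str, list[Message]], str] = default_send,
) -> str:
    """Append the user message, get a reply, and record it in the history."""
    history.append({"role": "user", "content": user_text})
    reply = send(model, history)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Reuse the same `history` list across calls; clearing it has the same effect as `/clear` in interactive mode.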
REST API
# Generate completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3",
"prompt": "Why is the sky blue?",
"stream": false
}'
# Chat completion
curl http://localhost:11434/api/chat -d '{
"model": "llama3.3",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
# List models
curl http://localhost:11434/api/tags
# Show model info
curl http://localhost:11434/api/show -d '{
"model": "llama3.3"
}'
# Pull model
curl http://localhost:11434/api/pull -d '{
"model": "mistral"
}'
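When `"stream"` is left at its default, `/api/generate` replies with one JSON object per line, each carrying a `response` fragment, and a final object with `"done": true`. The sketch below shows one way to reassemble that with only the standard library; `collect_stream` and `generate` are hypothetical names, and the base URL assumes the default local port used above.

```python
import json
import urllib.request
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the text fragments of a streamed /api/generate response."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(prompt: str, model: str = "llama3.3",
             base_url: str = "http://localhost:11434") -> str:
    """POST a prompt and return the full completion (needs a running server)."""
    body = json.dumps({"model": model, "prompt": prompt}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return collect_stream(response)  # the response iterates line by line
```

`/api/chat` streams the same way, but the text sits under `message.content` rather than `response`.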
Model Options
| Parameter | Range | Description |
|---|---|---|
| temperature | 0.0 - 2.0 | Randomness (default 0.8) |
| top_p | 0.0 - 1.0 | Nucleus sampling (default 0.9) |
| top_k | 1 - 100 | Top-k sampling (default 40) |
| num_ctx | 128 - 131072 | Context window size |
| repeat_penalty | 0.0 - 2.0 | Penalize repetition (default 1.1) |
| seed | Any integer | Random seed for reproducibility |
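A small pre-flight check can catch typos before an `options` dict is sent. The sketch below encodes the ranges from the table above; `check_options` is a hypothetical helper, and the ranges are this sheet's guidance, not limits enforced by Ollama itself.

```python
# Ranges copied from the Model Options table (guidance, not hard limits).
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "top_k": (1, 100),
    "num_ctx": (128, 131072),
    "repeat_penalty": (0.0, 2.0),
}


def check_options(options: dict) -> list[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for name, value in options.items():
        if name == "seed":
            if not isinstance(value, int):
                problems.append(f"seed={value!r} is not an integer")
            continue
        if name not in RANGES:
            problems.append(f"{name}: not listed in the table")
            continue
        low, high = RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    return problems


if __name__ == "__main__":
    print(check_options({"temperature": 0.7, "top_p": 0.9, "top_k": 40}))
```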
Custom Modelfile
# Create a file named Modelfile with these contents:
FROM llama3.3
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
# Set system prompt
SYSTEM """
You are a helpful coding assistant specializing in Python.
Provide clear, well-commented code examples.
"""
# Build the model (run in a shell; not part of the Modelfile)
ollama create my-python-assistant -f ./Modelfile
# Use custom model
ollama run my-python-assistant
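If you maintain several custom models, generating the Modelfile from code keeps them consistent. A minimal sketch, using only the FROM, PARAMETER, and SYSTEM instructions shown above; `render_modelfile` is a hypothetical helper.

```python
def render_modelfile(base: str, system: str, **parameters) -> str:
    """Build Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    # Triple quotes let the system prompt span several lines.
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3.3",
        "You are a helpful coding assistant specializing in Python.",
        temperature=0.7,
        top_p=0.9,
    )
    with open("Modelfile", "w", encoding="utf-8") as handle:
        handle.write(text)
```

After writing the file, build it with the same `ollama create ... -f ./Modelfile` command as before.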
Embeddings
# Python
import ollama
# Generate embeddings
embedding = ollama.embeddings(
model='llama3.3',
prompt='The sky is blue'
)
print(embedding['embedding'])
# REST API
curl http://localhost:11434/api/embeddings -d '{
"model": "llama3.3",
"prompt": "The sky is blue"
}'
# Use nomic-embed-text for embeddings (recommended)
ollama pull nomic-embed-text
embedding = ollama.embeddings(
model='nomic-embed-text',
prompt='Your text here'
)
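Embeddings are usually compared with cosine similarity. The sketch below is a plain-Python version that works on the lists returned under `embedding['embedding']`; `cosine_similarity` is a hypothetical helper, and for large collections a vectorized library would be a better fit.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in the range -1 to 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)
```

Embed both texts with the same model before comparing them; vectors from different models are not comparable.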
LangChain Integration
# Install first (shell): pip install langchain-ollama
from langchain_ollama import ChatOllama, OllamaLLM
llm = OllamaLLM(model="llama3.3")
# Stream tokens to stdout as they arrive
for chunk in llm.stream("Tell me a joke"):
    print(chunk, end="", flush=True)
# With chat models
chat = ChatOllama(model="llama3.3")
messages = [
    ("system", "You are a helpful assistant"),
    ("human", "What is 2+2?")
]
response = chat.invoke(messages)
print(response.content)
Performance Optimization
| Tip | Effect |
|---|---|
| Use smaller quantizations (q4_0, q4_K_M) | Faster inference, less memory |
| Reduce num_ctx for shorter contexts | Lower context = faster generation |
| Use GPU acceleration | CUDA/Metal automatically detected |
| Keep models loaded in memory | Set keep_alive parameter |
| Choose smaller models (7B vs 70B) | Balance speed vs quality |
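Two of these tips map directly onto request fields: `keep_alive` is a top-level field of `/api/chat` and `/api/generate`, while `num_ctx` goes inside `options`. A minimal sketch of such a request body follows; `build_chat_payload` is a hypothetical helper and the default values are illustrative, not recommendations.

```python
import json


def build_chat_payload(prompt: str, model: str = "mistral",
                       num_ctx: int = 2048, keep_alive: str = "10m") -> dict:
    """Request body for /api/chat with a reduced context and a keep-alive."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,        # how long the model stays in memory
        "options": {"num_ctx": num_ctx},  # smaller context, less memory
    }


if __name__ == "__main__":
    print(json.dumps(build_chat_payload("Hello"), indent=2))
```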
Environment Variables
# Set model storage location
export OLLAMA_MODELS=/path/to/models
# Set host and port
export OLLAMA_HOST=0.0.0.0:11434
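# GPU settings
export CUDA_VISIBLE_DEVICES=0 # Restrict Ollama to specific NVIDIA GPUs (here, GPU 0)
# Layer offload is set per model rather than by environment variable:
# use the num_gpu option (for example, PARAMETER num_gpu 35 in a Modelfile)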
# Keep models loaded
export OLLAMA_KEEP_ALIVE=5m # Keep in memory for 5 minutes
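A client script can honor the same `OLLAMA_HOST` setting instead of hard-coding `localhost:11434`. The sketch below is one possible interpretation; `client_base_url` is a hypothetical helper, it assumes a plain hostname or IPv4 address (no IPv6 literals), and it maps the bind-all address `0.0.0.0` to `127.0.0.1` because clients cannot connect to the former on every platform.

```python
import os
from typing import Mapping, Optional


def client_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Derive an HTTP base URL for clients from OLLAMA_HOST."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip()
    if not host:
        return "http://127.0.0.1:11434"  # default local server
    if "://" not in host:
        host = "http://" + host
    scheme, rest = host.split("://", 1)
    name, _, port = rest.rstrip("/").partition(":")
    if name == "0.0.0.0":
        name = "127.0.0.1"  # bind-all address is for servers, not clients
    return f"{scheme}://{name}:{port or '11434'}"


if __name__ == "__main__":
    print(client_base_url())
```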
Troubleshooting
| Issue | Solution |
|---|---|
| Out of memory | Use smaller model or quantization (q4_0) |
| Slow generation | Check GPU usage, reduce context size |
| Model won't load | Check disk space, verify model name |
| API connection refused | Ensure ollama serve is running |
| GPU not detected | Update drivers, check CUDA/Metal installation |
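For the "API connection refused" row, a quick TCP check tells you whether anything is listening before you debug further. A minimal sketch, assuming the default port 11434; `server_is_listening` is a hypothetical helper and only confirms that the port accepts connections, not that the process behind it is Ollama.

```python
import socket


def server_is_listening(host: str = "127.0.0.1", port: int = 11434,
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if server_is_listening():
        print("Port 11434 is accepting connections.")
    else:
        print("Nothing is listening on port 11434; try running: ollama serve")
```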
💡 Pro Tip: Start with mistral:7b-instruct-q4_0 for a fast, capable model that runs on most hardware. For coding, use deepseek-coder-v2:16b. For a general-purpose 8B model with a good balance of speed and quality, try llama3.1:8b-instruct-q4_K_M. Test several quantizations on your own hardware before settling on one.