Integrating Ollama with FastAPI allows you to serve private, local large language models (LLMs) like Llama 2 or Mistral via a high-performance REST API. To implement this, install Ollama locally and use the ollama-python library to create a service abstraction layer that communicates with the Ollama server at http://localhost:11434. By utilizing FastAPI’s StreamingResponse, you can deliver real-time, token-by-token chat interactions to frontends while maintaining full data privacy and zero API costs. This architecture supports advanced features like conversation memory management and dynamic model switching through Pydantic-validated request schemas.
🎓 What You’ll Learn
By the end of this tutorial, you’ll be able to:
- Install and configure Ollama locally for AI model serving
- Create an AI service abstraction layer that works with any LLM
- Build streaming chat endpoints for real-time responses
- Implement conversation history management
- Handle multiple AI models dynamically
- Create a simple HTML chat interface to test your API
- Prepare your architecture for future model integrations (HuggingFace, RunPod, etc.)
📖 Understanding Ollama
What is Ollama?
Ollama is a tool that lets you run large language models (LLMs) locally on your computer. Think of it as Docker for AI models.
Real-world analogy:
- ❌ Without Ollama: Calling external APIs (OpenAI, Anthropic) – costs money per request, requires internet
- ✅ With Ollama: AI models running on your machine – free, private, works offline
Why Use Ollama for Development?
| Benefit | Description |
|---|---|
| Free | No API costs, unlimited requests |
| Private | Data never leaves your machine |
| Offline | Works without internet |
| Fast | No network latency |
| Learning | Perfect for understanding AI integration |
| Flexibility | Easy to switch between models |
System Requirements
- RAM: 8GB minimum (16GB+ recommended)
- Storage: ~4-50GB per model (varies by model size)
- OS: macOS, Linux, or Windows
- Optional: GPU for faster inference (NVIDIA/AMD)
🛠️ Step-by-Step Implementation
Step 1: Install Ollama
On macOS:
brew install ollama
On Linux:
curl -fsSL https://ollama.com/install.sh | sh
On Windows:
Download and install from: https://ollama.com/download
Verify Installation:
ollama --version
Expected output: ollama version 0.x.x
Step 2: Start Ollama Service
# Start Ollama server
ollama serve
What this does:
- Starts a REST API server at http://localhost:11434
- Manages AI models
- Handles inference requests
- Keeps running in the background
Leave this terminal open or run as a background service.
Step 3: Pull an AI Model
Open a new terminal:
# Pull Llama 2 (7B parameters, ~4GB)
ollama pull llama2
# Or try smaller/faster models:
ollama pull llama2:7b-chat # Optimized for chat
ollama pull phi # Microsoft's Phi (1.3B, very fast)
ollama pull mistral # Mistral 7B (good quality/speed balance)
ollama pull codellama # Specialized for code
# List installed models
ollama list
Available Models:
- llama2 – Meta’s Llama 2 (good all-rounder)
- mistral – Mistral 7B (fast, high quality)
- phi – Microsoft Phi (tiny, very fast)
- codellama – Code generation specialist
- neural-chat – Intel’s chat model
- orca-mini – Compact but capable
Model Size Guide:
- 3B parameters = ~2GB, very fast, basic quality
- 7B parameters = ~4GB, fast, good quality
- 13B parameters = ~8GB, slower, better quality
- 70B parameters = ~40GB, slow, excellent quality (requires powerful GPU)
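The download sizes above follow from simple arithmetic: most models Ollama distributes are 4-bit quantized, so a rough estimate is parameters × 4 bits ÷ 8 bytes. A quick sketch (the 4-bit figure is a rule of thumb, not something Ollama guarantees for every model):

```python
def approx_download_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Rough size of a quantized model: parameters x (bits / 8), in decimal GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

approx_download_gb(7)    # ~3.5 GB, matching the "~4GB" figure for 7B models
approx_download_gb(70)   # ~35 GB
```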
Step 4: Test Ollama from Command Line
# Interactive chat
ollama run llama2
# You'll see a prompt:
>>> Hello! Tell me about FastAPI.
# Exit with: /bye
Testing the REST API directly:
# Generate completion
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "What is FastAPI?",
"stream": false
}'
# Chat completion
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"stream": false
}'
Step 5: Install Python Ollama Client
# Activate your virtual environment
source venv/bin/activate # or venv\Scripts\activate on Windows
# Install ollama-python
pip install ollama
# Update requirements
pip freeze > requirements.txt
Step 6: Create AI Models
Create app/models/ai.py:
"""
AI-related Pydantic models
Request and response models for AI endpoints
"""
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
from datetime import datetime
# ============================================
# MESSAGE MODELS
# ============================================
class ChatMessage(BaseModel):
"""
Individual chat message
Represents a single message in a conversation
"""
role: Literal["user", "assistant", "system"] = Field(
...,
description="Message role (user, assistant, or system)"
)
content: str = Field(
...,
min_length=1,
description="Message content"
)
timestamp: Optional[datetime] = Field(
default=None,
description="When the message was created"
)
class Config:
json_schema_extra = {
"example": {
"role": "user",
"content": "What is FastAPI?",
"timestamp": "2024-01-15T10:30:00"
}
}
# ============================================
# CHAT REQUEST/RESPONSE
# ============================================
class ChatRequest(BaseModel):
"""
Chat completion request
Send a message and get AI response
"""
message: str = Field(
...,
min_length=1,
max_length=4000,
description="User message"
)
model: str = Field(
default="llama2",
description="AI model to use"
)
conversation_id: Optional[str] = Field(
None,
description="Conversation ID for context (optional)"
)
temperature: float = Field(
default=0.7,
ge=0.0,
le=2.0,
description="Sampling temperature (0.0-2.0). Higher = more creative"
)
max_tokens: Optional[int] = Field(
None,
ge=1,
le=4096,
description="Maximum tokens to generate"
)
stream: bool = Field(
default=False,
description="Stream response in real-time"
)
class Config:
json_schema_extra = {
"example": {
"message": "Explain FastAPI in simple terms",
"model": "llama2",
"temperature": 0.7,
"stream": False
}
}
class ChatResponse(BaseModel):
"""
Chat completion response
AI model's response to user message
"""
message: str = Field(..., description="AI response")
model: str = Field(..., description="Model used")
conversation_id: str = Field(..., description="Conversation ID")
created_at: datetime = Field(default_factory=datetime.now)
finish_reason: Optional[str] = Field(None, description="Why generation stopped")
# Token usage (if available)
prompt_tokens: Optional[int] = Field(None, description="Tokens in prompt")
completion_tokens: Optional[int] = Field(None, description="Tokens in completion")
total_tokens: Optional[int] = Field(None, description="Total tokens used")
class Config:
json_schema_extra = {
"example": {
"message": "FastAPI is a modern Python web framework...",
"model": "llama2",
"conversation_id": "conv_123",
"created_at": "2024-01-15T10:30:00",
"total_tokens": 150
}
}
# ============================================
# CONVERSATION MODELS
# ============================================
class Conversation(BaseModel):
"""
Conversation with message history
Maintains context across multiple messages
"""
conversation_id: str = Field(..., description="Unique conversation ID")
model: str = Field(..., description="AI model being used")
messages: List[ChatMessage] = Field(default=[], description="Message history")
created_at: datetime = Field(default_factory=datetime.now)
updated_at: datetime = Field(default_factory=datetime.now)
metadata: dict = Field(default={}, description="Additional conversation metadata")
class Config:
json_schema_extra = {
"example": {
"conversation_id": "conv_abc123",
"model": "llama2",
"messages": [
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi! How can I help?"}
],
"created_at": "2024-01-15T10:30:00"
}
}
# ============================================
# MODEL INFORMATION
# ============================================
class ModelInfo(BaseModel):
"""
Information about an AI model
"""
name: str = Field(..., description="Model name")
size: Optional[str] = Field(None, description="Model size (e.g., '7B', '13B')")
family: Optional[str] = Field(None, description="Model family")
parameter_size: Optional[str] = Field(None, description="Number of parameters")
quantization: Optional[str] = Field(None, description="Quantization level")
modified_at: Optional[datetime] = Field(None, description="Last modified date")
class Config:
json_schema_extra = {
"example": {
"name": "llama2:latest",
"size": "3.8GB",
"family": "llama",
"parameter_size": "7B"
}
}
class ModelListResponse(BaseModel):
"""List of available models"""
models: List[ModelInfo] = Field(..., description="Available models")
count: int = Field(..., description="Number of models")
# ============================================
# GENERATION OPTIONS
# ============================================
class GenerationOptions(BaseModel):
"""
Options for text generation
Controls how the AI generates responses
"""
temperature: float = Field(
default=0.7,
ge=0.0,
le=2.0,
description="Randomness (0=deterministic, 2=very random)"
)
top_p: float = Field(
default=0.9,
ge=0.0,
le=1.0,
description="Nucleus sampling threshold"
)
top_k: int = Field(
default=40,
ge=1,
description="Top-k sampling parameter"
)
repeat_penalty: float = Field(
default=1.1,
ge=0.0,
description="Penalty for repeating tokens"
)
class Config:
json_schema_extra = {
"example": {
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.1
}
}
🔍 Understanding Model Parameters:
- temperature: Controls randomness
- 0.0 = Deterministic, always picks most likely word
- 0.7 = Balanced (good for chat)
- 1.5+ = Creative, unpredictable
- top_p: Nucleus sampling
- 0.9 = Consider top 90% probability mass
- Lower = more focused, higher = more diverse
- top_k: Limits vocabulary to top K tokens
- 40 = Consider 40 most likely next words
- Higher = more options, lower = more focused
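To make these knobs concrete, here is a toy, stdlib-only sketch of temperature plus top-k sampling over a tiny vocabulary. It illustrates the idea only; it is not Ollama's actual sampler, and the vocabulary and scores are invented:

```python
import math
import random

def sample_next_token(logits: dict, temperature: float = 0.7,
                      top_k: int = 40, seed: int = 0) -> str:
    """Toy illustration of temperature + top-k sampling."""
    # top_k: keep only the K highest-scoring candidate tokens
    candidates = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    if temperature == 0.0:
        # Deterministic: always pick the most likely token
        return candidates[0][0]
    # Softmax with temperature: lower T sharpens the distribution,
    # higher T flattens it (more random picks)
    scaled = [score / temperature for _, score in candidates]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    rng = random.Random(seed)
    return rng.choices([tok for tok, _ in candidates], weights=weights)[0]

logits = {"the": 3.0, "a": 2.0, "banana": -1.0}
sample_next_token(logits, temperature=0.0)  # greedy: always "the"
```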
Step 7: Create AI Service
Create app/services/ai_service.py:
"""
AI Service - Ollama Integration
Handles communication with Ollama for AI inference
"""
import ollama
from typing import List, Dict, Any, Optional, AsyncGenerator
from datetime import datetime
import uuid
import json
from app.models.ai import (
ChatMessage,
ChatRequest,
ChatResponse,
Conversation,
ModelInfo,
GenerationOptions
)
from app.core.config import settings
from app.utils.logger import logger
class OllamaService:
"""
Service for interacting with Ollama AI models
Provides methods for chat, streaming, and model management
"""
def __init__(self, base_url: str = None):
"""
Initialize Ollama service
Args:
base_url: Ollama API base URL (default from settings)
"""
self.base_url = base_url or settings.OLLAMA_BASE_URL
self.client = ollama.Client(host=self.base_url)
# In-memory conversation storage (use database in production)
self.conversations: Dict[str, Conversation] = {}
logger.info(
"Ollama service initialized",
extra={"base_url": self.base_url}
)
# ============================================
# MODEL MANAGEMENT
# ============================================
async def list_models(self) -> List[ModelInfo]:
"""
List available Ollama models
Returns:
List of available models
"""
try:
response = self.client.list()
models = []
for model_data in response.get('models', []):
size_bytes = model_data.get('size')
models.append(
ModelInfo(
name=model_data.get('name'),
# Ollama reports size in bytes; convert to a readable string
size=f"{size_bytes / 1e9:.1f}GB" if size_bytes else None,
modified_at=model_data.get('modified_at')
)
)
logger.info(f"Listed {len(models)} models")
return models
except Exception as e:
logger.error(f"Error listing models: {e}")
raise
async def check_model_exists(self, model_name: str) -> bool:
"""
Check if a model exists locally
Args:
model_name: Name of the model to check
Returns:
True if model exists, False otherwise
"""
models = await self.list_models()
return any(model.name.startswith(model_name) for model in models)
# ============================================
# CONVERSATION MANAGEMENT
# ============================================
def create_conversation(self, model: str = None) -> Conversation:
"""
Create a new conversation
Args:
model: AI model to use
Returns:
New conversation object
"""
conversation_id = f"conv_{uuid.uuid4().hex[:12]}"
conversation = Conversation(
conversation_id=conversation_id,
model=model or settings.DEFAULT_AI_MODEL,
messages=[],
created_at=datetime.now(),
updated_at=datetime.now()
)
self.conversations[conversation_id] = conversation
logger.info(
"Conversation created",
extra={
"conversation_id": conversation_id,
"model": conversation.model
}
)
return conversation
def get_conversation(self, conversation_id: str) -> Optional[Conversation]:
"""
Get conversation by ID
Args:
conversation_id: Conversation identifier
Returns:
Conversation if found, None otherwise
"""
return self.conversations.get(conversation_id)
def add_message_to_conversation(
self,
conversation_id: str,
role: str,
content: str
) -> None:
"""
Add a message to conversation history
Args:
conversation_id: Conversation identifier
role: Message role (user/assistant/system)
content: Message content
"""
conversation = self.conversations.get(conversation_id)
if conversation:
message = ChatMessage(
role=role,
content=content,
timestamp=datetime.now()
)
conversation.messages.append(message)
conversation.updated_at = datetime.now()
# ============================================
# CHAT (NON-STREAMING)
# ============================================
async def chat(self, request: ChatRequest) -> ChatResponse:
"""
Generate chat completion (non-streaming)
Args:
request: Chat request with message and options
Returns:
Chat response from AI model
"""
try:
# Get or create conversation
if request.conversation_id:
conversation = self.get_conversation(request.conversation_id)
if not conversation:
conversation = self.create_conversation(request.model)
# Re-key under the client-supplied ID (drop the auto-generated entry)
self.conversations.pop(conversation.conversation_id, None)
conversation.conversation_id = request.conversation_id
self.conversations[request.conversation_id] = conversation
else:
conversation = self.create_conversation(request.model)
# Add user message to history
self.add_message_to_conversation(
conversation.conversation_id,
"user",
request.message
)
# Prepare messages for Ollama
messages = [
{"role": msg.role, "content": msg.content}
for msg in conversation.messages
]
logger.info(
"Generating chat completion",
extra={
"conversation_id": conversation.conversation_id,
"model": request.model,
"message_count": len(messages)
}
)
# Call Ollama API
response = self.client.chat(
model=request.model,
messages=messages,
options={
"temperature": request.temperature,
"num_predict": request.max_tokens
} if request.max_tokens else {"temperature": request.temperature},
stream=False
)
# Extract response
assistant_message = response['message']['content']
# Add assistant response to history
self.add_message_to_conversation(
conversation.conversation_id,
"assistant",
assistant_message
)
logger.info(
"Chat completion generated",
extra={
"conversation_id": conversation.conversation_id,
"response_length": len(assistant_message)
}
)
# Build response
return ChatResponse(
message=assistant_message,
model=request.model,
conversation_id=conversation.conversation_id,
created_at=datetime.now(),
finish_reason="stop"
)
except Exception as e:
logger.error(f"Error in chat: {e}", extra={"error": str(e)})
raise
# ============================================
# STREAMING CHAT
# ============================================
async def chat_stream(
self,
request: ChatRequest
) -> AsyncGenerator[str, None]:
"""
Generate streaming chat completion
Args:
request: Chat request with message and options
Yields:
Chunks of response as they're generated
"""
try:
# Get or create conversation
if request.conversation_id:
conversation = self.get_conversation(request.conversation_id)
if not conversation:
conversation = self.create_conversation(request.model)
# Re-key under the client-supplied ID (drop the auto-generated entry)
self.conversations.pop(conversation.conversation_id, None)
conversation.conversation_id = request.conversation_id
self.conversations[request.conversation_id] = conversation
else:
conversation = self.create_conversation(request.model)
# Add user message
self.add_message_to_conversation(
conversation.conversation_id,
"user",
request.message
)
# Prepare messages
messages = [
{"role": msg.role, "content": msg.content}
for msg in conversation.messages
]
logger.info(
"Starting streaming chat",
extra={
"conversation_id": conversation.conversation_id,
"model": request.model
}
)
# Stream from Ollama (note: ollama.Client is synchronous; consider
# ollama.AsyncClient in production so streaming doesn't block the event loop)
full_response = ""
stream = self.client.chat(
model=request.model,
messages=messages,
options={"temperature": request.temperature},
stream=True
)
for chunk in stream:
if 'message' in chunk and 'content' in chunk['message']:
content = chunk['message']['content']
full_response += content
# Yield as Server-Sent Event format
yield f"data: {json.dumps({'content': content})}\n\n"
# Add complete response to conversation
self.add_message_to_conversation(
conversation.conversation_id,
"assistant",
full_response
)
# Send completion event
yield f"data: {json.dumps({'done': True, 'conversation_id': conversation.conversation_id})}\n\n"
logger.info(
"Streaming chat completed",
extra={
"conversation_id": conversation.conversation_id,
"response_length": len(full_response)
}
)
except Exception as e:
logger.error(f"Error in streaming chat: {e}")
error_data = json.dumps({"error": str(e)})
yield f"data: {error_data}\n\n"
# ============================================
# SIMPLE GENERATION (NO CONVERSATION)
# ============================================
async def generate(
self,
prompt: str,
model: str = None,
options: GenerationOptions = None
) -> str:
"""
Simple text generation (no conversation context)
Args:
prompt: Input prompt
model: Model to use
options: Generation options
Returns:
Generated text
"""
try:
model = model or settings.DEFAULT_AI_MODEL
logger.info(
"Generating text",
extra={"model": model, "prompt_length": len(prompt)}
)
response = self.client.generate(
model=model,
prompt=prompt,
options=options.model_dump() if options else {},
stream=False
)
return response['response']
except Exception as e:
logger.error(f"Error in generate: {e}")
raise
# ============================================
# GLOBAL SERVICE INSTANCE
# ============================================
ollama_service = OllamaService()
🔍 Service Architecture Explained:
- Conversation Management: Stores chat history in memory
- Production: Use Redis or database
- Allows context-aware responses
- Streaming Support: Real-time token-by-token responses
- Better UX (user sees response as it’s generated)
- Uses Server-Sent Events (SSE)
- Model Abstraction: Easy to swap AI providers
- Today: Ollama
- Tomorrow: Add OpenAI, Anthropic, etc.
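The model-abstraction point can be made concrete with a typing.Protocol: any provider that implements the same chat signature can be swapped in behind the same endpoints. A minimal sketch (ChatProvider and EchoProvider are illustrative stubs, not part of the tutorial's code):

```python
import asyncio
from typing import Protocol

class ChatProvider(Protocol):
    """Minimal interface any backend (Ollama, OpenAI, ...) could implement."""
    async def chat(self, model: str, messages: list) -> str: ...

class EchoProvider:
    """Stub provider used only to show the shape of the abstraction."""
    async def chat(self, model: str, messages: list) -> str:
        # A real provider would call its API here; we just echo the last message
        return f"[{model}] " + messages[-1]["content"]

async def handle_chat(provider: ChatProvider, user_message: str) -> str:
    """Endpoint-style code depends on the Protocol, not a concrete provider."""
    return await provider.chat("llama2", [{"role": "user", "content": user_message}])

reply = asyncio.run(handle_chat(EchoProvider(), "hi"))
# reply == "[llama2] hi"
```

Swapping Ollama for another backend then means writing one new class, with no changes to the endpoint layer.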
Step 8: Create AI Endpoints
Create app/api/v1/endpoints/ai.py:
"""
AI endpoints
Chat, streaming, and model management endpoints
"""
from fastapi import APIRouter, Depends, HTTPException, status
from fastapi.responses import StreamingResponse
from typing import Annotated
from app.models.ai import (
ChatRequest,
ChatResponse,
ModelListResponse,
Conversation
)
from app.services.ai_service import OllamaService, ollama_service
from app.utils.logger import logger
router = APIRouter(prefix="/ai", tags=["AI"])
def get_ai_service() -> OllamaService:
"""Dependency to get AI service"""
return ollama_service
# ============================================
# MODEL MANAGEMENT
# ============================================
@router.get(
"/models",
response_model=ModelListResponse,
summary="List available AI models"
)
async def list_models(
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
List all available Ollama models
Returns:
List of models installed locally
"""
try:
models = await service.list_models()
return ModelListResponse(
models=models,
count=len(models)
)
except Exception as e:
logger.error(f"Error listing models: {e}")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Failed to list models: {str(e)}"
)
# ============================================
# CHAT ENDPOINTS
# ============================================
@router.post(
"/chat",
response_model=ChatResponse,
summary="Chat with AI (non-streaming)"
)
async def chat(
request: ChatRequest,
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Send a message and get AI response
**Features:**
- Maintains conversation context
- Returns complete response at once
- Supports multiple models
- Configurable temperature
**Example Request:**
```json
{
"message": "Explain FastAPI in simple terms",
"model": "llama2",
"temperature": 0.7
}
```
"""
try:
# Check if model exists
if not await service.check_model_exists(request.model):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Model '{request.model}' not found. Pull it first with: ollama pull {request.model}"
)
response = await service.chat(request)
return response
except HTTPException:
raise
except Exception as e:
logger.error(f"Error in chat endpoint: {e}")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Chat failed: {str(e)}"
)
@router.post(
"/chat/stream",
summary="Chat with AI (streaming)",
description="Stream AI responses token-by-token for real-time chat experience"
)
async def chat_stream(
request: ChatRequest,
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Stream chat response in real-time
**How to use:**
1. Send POST request with message
2. Response streams as Server-Sent Events
3. Each chunk contains a piece of the response
4. Final chunk includes conversation_id
**Response format:**
data: {"content": "Fast"}
data: {"content": "API"}
data: {"content": " is"}
data: {"done": true, "conversation_id": "conv_123"}
"""
try:
# Check if model exists
if not await service.check_model_exists(request.model):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Model '{request.model}' not found. Pull it first with: ollama pull {request.model}"
)
return StreamingResponse(
service.chat_stream(request),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"X-Accel-Buffering": "no" # Disable nginx buffering
}
)
except HTTPException:
raise
except Exception as e:
logger.error(f"Error in streaming chat: {e}")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Streaming chat failed: {str(e)}"
)
# ============================================
# CONVERSATION MANAGEMENT
# ============================================
@router.get(
"/conversations/{conversation_id}",
response_model=Conversation,
summary="Get conversation history"
)
async def get_conversation(
conversation_id: str,
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Retrieve conversation history by ID
Returns:
Conversation with all messages
"""
conversation = service.get_conversation(conversation_id)
if not conversation:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Conversation '{conversation_id}' not found"
)
return conversation
@router.post(
"/conversations",
response_model=Conversation,
status_code=status.HTTP_201_CREATED,
summary="Create new conversation"
)
async def create_conversation(
model: str = "llama2",
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Create a new conversation
Args:
model: AI model to use for this conversation
Returns:
New conversation object with ID
"""
conversation = service.create_conversation(model=model)
return conversation
# ============================================
# HEALTH CHECK
# ============================================
@router.get(
"/health",
summary="Check AI service health"
)
async def ai_health_check(
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Check if Ollama service is running and accessible
Returns:
Status and available models count
"""
try:
models = await service.list_models()
return {
"status": "healthy",
"ollama_url": service.base_url,
"models_available": len(models),
"models": [model.name for model in models]
}
except Exception as e:
return {
"status": "unhealthy",
"error": str(e),
"message": "Make sure Ollama is running: ollama serve"
}
Step 9: Update API Router
Update app/api/v1/api.py:
"""
API v1 router aggregator
"""
from fastapi import APIRouter
from app.api.v1.endpoints import users, health, users_advanced, dependencies_demo, ai
# Create main v1 router
api_router = APIRouter()
# Include all endpoint routers
api_router.include_router(health.router)
api_router.include_router(users.router)
api_router.include_router(users_advanced.router)
api_router.include_router(dependencies_demo.router)
api_router.include_router(ai.router) # New AI endpoints!
Step 10: Update Configuration
Make sure your .env file has Ollama settings:
# .env
# ... existing settings ...
# AI Settings
OLLAMA_BASE_URL=http://localhost:11434
DEFAULT_AI_MODEL=llama2
These settings are already in our app/core/config.py from earlier!
Step 11: Create HTML Chat Interface
Create app/static/chat.html:
First, create the static directory:
mkdir -p app/static
Now create the HTML file:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AI Chat - FastAPI Backend</title>
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
height: 100vh;
display: flex;
justify-content: center;
align-items: center;
padding: 20px;
}
.chat-container {
width: 100%;
max-width: 800px;
height: 90vh;
background: white;
border-radius: 20px;
box-shadow: 0 20px 60px rgba(0,0,0,0.3);
display: flex;
flex-direction: column;
overflow: hidden;
}
.chat-header {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
padding: 20px;
display: flex;
justify-content: space-between;
align-items: center;
}
.chat-header h1 {
font-size: 24px;
font-weight: 600;
}
.model-selector {
background: rgba(255,255,255,0.2);
border: none;
color: white;
padding: 8px 12px;
border-radius: 8px;
font-size: 14px;
cursor: pointer;
}
.model-selector option {
background: #764ba2;
}
.chat-messages {
flex: 1;
overflow-y: auto;
padding: 20px;
background: #f5f5f5;
}
.message {
margin-bottom: 20px;
display: flex;
animation: slideIn 0.3s ease;
}
@keyframes slideIn {
from {
opacity: 0;
transform: translateY(10px);
}
to {
opacity: 1;
transform: translateY(0);
}
}
.message.user {
justify-content: flex-end;
}
.message-content {
max-width: 70%;
padding: 12px 16px;
border-radius: 12px;
word-wrap: break-word;
white-space: pre-wrap;
}
.message.user .message-content {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}
.message.assistant .message-content {
background: white;
color: #333;
box-shadow: 0 2px 8px rgba(0,0,0,0.1);
}
.message-time {
font-size: 11px;
color: #999;
margin-top: 5px;
}
.typing-indicator {
display: none;
padding: 12px 16px;
background: white;
border-radius: 12px;
box-shadow: 0 2px 8px rgba(0,0,0,0.1);
}
.typing-indicator.active {
display: inline-block;
}
.typing-indicator span {
height: 8px;
width: 8px;
background: #667eea;
border-radius: 50%;
display: inline-block;
margin: 0 2px;
animation: bounce 1.4s infinite;
}
.typing-indicator span:nth-child(2) {
animation-delay: 0.2s;
}
.typing-indicator span:nth-child(3) {
animation-delay: 0.4s;
}
@keyframes bounce {
0%, 60%, 100% {
transform: translateY(0);
}
30% {
transform: translateY(-10px);
}
}
.chat-input-container {
padding: 20px;
background: white;
border-top: 1px solid #e0e0e0;
}
.chat-input-wrapper {
display: flex;
gap: 10px;
}
.chat-input {
flex: 1;
padding: 12px 16px;
border: 2px solid #e0e0e0;
border-radius: 12px;
font-size: 14px;
resize: none;
font-family: inherit;
max-height: 120px;
}
.chat-input:focus {
outline: none;
border-color: #667eea;
}
.send-button {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
border: none;
padding: 12px 24px;
border-radius: 12px;
font-size: 14px;
font-weight: 600;
cursor: pointer;
transition: transform 0.2s;
}
.send-button:hover {
transform: scale(1.05);
}
.send-button:disabled {
opacity: 0.5;
cursor: not-allowed;
transform: scale(1);
}
.options {
display: flex;
gap: 15px;
margin-top: 10px;
font-size: 13px;
}
.option-group {
display: flex;
align-items: center;
gap: 8px;
}
.option-group label {
color: #666;
}
.option-group input[type="range"] {
width: 100px;
}
.option-group input[type="checkbox"] {
cursor: pointer;
}
.error-message {
background: #fee;
color: #c33;
padding: 12px;
border-radius: 8px;
margin: 10px 20px;
display: none;
}
.error-message.active {
display: block;
}
</style>
</head>
<body>
<div class="chat-container">
<div class="chat-header">
<h1>🤖 AI Chat</h1>
<select class="model-selector" id="modelSelector">
<option value="llama2">Llama 2</option>
<option value="mistral">Mistral</option>
<option value="phi">Phi</option>
<option value="codellama">Code Llama</option>
</select>
</div>
<div class="error-message" id="errorMessage"></div>
<div class="chat-messages" id="chatMessages">
<div class="message assistant">
<div class="message-content">
👋 Hello! I'm your AI assistant. How can I help you today?
</div>
</div>
</div>
<div class="chat-input-container">
<div class="options">
<div class="option-group">
<label for="temperature">Temperature:</label>
<input type="range" id="temperature" min="0" max="2" step="0.1" value="0.7">
<span id="temperatureValue">0.7</span>
</div>
<div class="option-group">
<input type="checkbox" id="streamToggle" checked>
<label for="streamToggle">Stream responses</label>
</div>
</div>
<div class="chat-input-wrapper">
<textarea
id="chatInput"
class="chat-input"
placeholder="Type your message..."
rows="1"
></textarea>
<button id="sendButton" class="send-button">Send</button>
</div>
</div>
</div>
<script>
const API_BASE_URL = 'http://localhost:8000/api/v1';
let conversationId = null;
// Elements
const chatMessages = document.getElementById('chatMessages');
const chatInput = document.getElementById('chatInput');
const sendButton = document.getElementById('sendButton');
const modelSelector = document.getElementById('modelSelector');
const temperatureSlider = document.getElementById('temperature');
const temperatureValue = document.getElementById('temperatureValue');
const streamToggle = document.getElementById('streamToggle');
const errorMessage = document.getElementById('errorMessage');
// Auto-resize textarea
chatInput.addEventListener('input', function() {
this.style.height = 'auto';
this.style.height = (this.scrollHeight) + 'px';
});
// Temperature slider
temperatureSlider.addEventListener('input', function() {
temperatureValue.textContent = this.value;
});
// Send message on Enter (Shift+Enter for new line)
chatInput.addEventListener('keydown', function(e) {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
sendMessage();
}
});
// Send button click
sendButton.addEventListener('click', sendMessage);
// Show error
function showError(message) {
errorMessage.textContent = message;
errorMessage.classList.add('active');
setTimeout(() => {
errorMessage.classList.remove('active');
}, 5000);
}
// Add message to chat
function addMessage(role, content) {
const messageDiv = document.createElement('div');
messageDiv.className = `message ${role}`;
const contentDiv = document.createElement('div');
contentDiv.className = 'message-content';
contentDiv.textContent = content;
messageDiv.appendChild(contentDiv);
chatMessages.appendChild(messageDiv);
chatMessages.scrollTop = chatMessages.scrollHeight;
return contentDiv;
}
// Show typing indicator
function showTyping() {
const typingDiv = document.createElement('div');
typingDiv.className = 'message assistant';
typingDiv.id = 'typingIndicator';
typingDiv.innerHTML = `
<div class="typing-indicator active">
<span></span>
<span></span>
<span></span>
</div>
`;
chatMessages.appendChild(typingDiv);
chatMessages.scrollTop = chatMessages.scrollHeight;
}
function hideTyping() {
const typing = document.getElementById('typingIndicator');
if (typing) typing.remove();
}
// Send message (non-streaming)
async function sendMessageNonStream(message) {
try {
const response = await fetch(`${API_BASE_URL}/ai/chat`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
message: message,
model: modelSelector.value,
conversation_id: conversationId,
temperature: parseFloat(temperatureSlider.value),
stream: false
})
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Request failed');
}
const data = await response.json();
conversationId = data.conversation_id;
hideTyping();
addMessage('assistant', data.message);
} catch (error) {
hideTyping();
showError(`Error: ${error.message}`);
console.error('Error:', error);
}
}
// Send message (streaming)
async function sendMessageStream(message) {
try {
const response = await fetch(`${API_BASE_URL}/ai/chat/stream`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
message: message,
model: modelSelector.value,
conversation_id: conversationId,
temperature: parseFloat(temperatureSlider.value),
stream: true
})
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Request failed');
}
hideTyping();
// Create message element for streaming
const messageDiv = document.createElement('div');
messageDiv.className = 'message assistant';
const contentDiv = document.createElement('div');
contentDiv.className = 'message-content';
messageDiv.appendChild(contentDiv);
chatMessages.appendChild(messageDiv);
// Read stream
const reader = response.body.getReader();
const decoder = new TextDecoder();
let fullResponse = '';
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
// stream: true keeps partial multi-byte characters inside the decoder
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n');
// The last element may be an incomplete line; keep it for the next chunk
buffer = lines.pop();
for (const line of lines) {
if (line.startsWith('data: ')) {
const data = JSON.parse(line.slice(6));
if (data.content) {
fullResponse += data.content;
contentDiv.textContent = fullResponse;
chatMessages.scrollTop = chatMessages.scrollHeight;
}
if (data.done) {
conversationId = data.conversation_id;
}
if (data.error) {
showError(data.error);
}
}
}
}
} catch (error) {
hideTyping();
showError(`Error: ${error.message}`);
console.error('Error:', error);
}
}
// Main send function
async function sendMessage() {
const message = chatInput.value.trim();
if (!message) return;
// Add user message
addMessage('user', message);
chatInput.value = '';
chatInput.style.height = 'auto';
// Disable input
chatInput.disabled = true;
sendButton.disabled = true;
// Show typing
showTyping();
// Send based on stream toggle
if (streamToggle.checked) {
await sendMessageStream(message);
} else {
await sendMessageNonStream(message);
}
// Enable input
chatInput.disabled = false;
sendButton.disabled = false;
chatInput.focus();
}
// Load available models on start
async function loadModels() {
try {
const response = await fetch(`${API_BASE_URL}/ai/models`);
const data = await response.json();
modelSelector.innerHTML = '';
data.models.forEach(model => {
const option = document.createElement('option');
option.value = model.name.split(':')[0];
option.textContent = model.name;
modelSelector.appendChild(option);
});
} catch (error) {
console.error('Error loading models:', error);
showError('Failed to load models. Make sure Ollama is running.');
}
}
// Initialize
loadModels();
chatInput.focus();
</script>
</body>
</html>
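The streaming client above parses Server-Sent-Events-style lines of the form `data: {...}`. If you want to exercise the same logic from backend tests, a small Python equivalent of that parsing step (assuming the same `content` / `done` / `conversation_id` payload shape the JS client expects) might look like:

```python
import json

def parse_sse_chunk(chunk: str):
    """Parse one decoded chunk of the /ai/chat/stream response.

    Mirrors the JS client: split on newlines, keep only lines that
    start with 'data: ', and JSON-decode the payload after the prefix.
    """
    events = []
    for line in chunk.split("\n"):
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events

# Two events arriving in a single chunk
chunk = 'data: {"content": "Hel"}\ndata: {"content": "lo", "done": true}\n'
events = parse_sse_chunk(chunk)
print([e.get("content") for e in events])  # ['Hel', 'lo']
```

Note that this assumes each `data:` line holds a complete JSON object, which is why the JS client needs to buffer incomplete lines between reads.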
Step 12: Serve Static Files
Update app/main.py to serve static files:
"""
FastAPI AI Backend - Main Application
Updated with static file serving for chat interface
"""
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.exceptions import RequestValidationError
from fastapi.staticfiles import StaticFiles # New import
from pydantic import ValidationError
from app.core.config import settings
from app.api.v1.api import api_router
from app.core.exceptions import AppException
from app.core.error_handlers import (
app_exception_handler,
validation_exception_handler,
generic_exception_handler
)
from app.middleware.logging_middleware import LoggingMiddleware
from app.middleware.performance_middleware import PerformanceMiddleware
from app.utils.logger import logger
def create_application() -> FastAPI:
"""Application factory pattern"""
app = FastAPI(
title=settings.APP_NAME,
version=settings.APP_VERSION,
description="""
A production-ready FastAPI backend for AI applications.
## Features
* **AI Integration** with Ollama (local LLMs)
* Streaming chat responses
* Multiple AI model support
* Conversation history management
* Advanced dependency injection
* Custom middleware stack
* Structured logging
* Background tasks
* Comprehensive error handling
## Try the Chat Interface
Visit: http://localhost:8000/chat.html
## API Documentation
* Swagger UI: /docs
* ReDoc: /redoc
""",
debug=settings.DEBUG,
docs_url="/docs",
redoc_url="/redoc",
openapi_url="/openapi.json"
)
# ============================================
# MIDDLEWARE
# ============================================
app.add_middleware(PerformanceMiddleware, slow_request_threshold=1.0)
app.add_middleware(LoggingMiddleware)
app.add_middleware(
CORSMiddleware,
allow_origins=settings.allowed_origins_list,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
expose_headers=["X-Request-ID", "X-Process-Time"]
)
# ============================================
# EXCEPTION HANDLERS
# ============================================
app.add_exception_handler(AppException, app_exception_handler)
app.add_exception_handler(RequestValidationError, validation_exception_handler)
app.add_exception_handler(ValidationError, validation_exception_handler)
app.add_exception_handler(Exception, generic_exception_handler)
# ============================================
# ROUTERS
# ============================================
app.include_router(api_router, prefix=settings.API_V1_PREFIX)
# ============================================
# STATIC FILES
# ============================================
# Mount last: routes are matched in registration order, and a mount
# at "/" matches every path, so it must come after the API routers
# or it would shadow them
app.mount("/", StaticFiles(directory="app/static", html=True), name="static")
return app
app = create_application()
@app.on_event("startup")
async def startup_event():
"""Startup tasks"""
logger.info(
"Application startup",
extra={
"app_name": settings.APP_NAME,
"version": settings.APP_VERSION,
"environment": settings.ENVIRONMENT
}
)
print(f"🚀 Starting {settings.APP_NAME} v{settings.APP_VERSION}")
print(f"📝 Environment: {settings.ENVIRONMENT}")
print(f"🤖 Ollama URL: {settings.OLLAMA_BASE_URL}")
print(f"📚 API Docs: http://{settings.HOST}:{settings.PORT}/docs")
print(f"💬 Chat Interface: http://{settings.HOST}:{settings.PORT}/chat.html")
@app.on_event("shutdown")
async def shutdown_event():
"""Shutdown tasks"""
logger.info("Application shutdown", extra={"app_name": settings.APP_NAME})
print(f"👋 Shutting down {settings.APP_NAME}")
🧪 Testing Your AI Integration
Complete Testing Workflow:
Step 1: Install and Start Ollama
# Install Ollama (if not done)
# macOS: brew install ollama
# Linux: curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama server (in terminal 1)
ollama serve
# Pull a model (in terminal 2)
ollama pull llama2
# Verify
ollama list
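A note on model names: `ollama list` shows names with a tag suffix (e.g. `llama2:latest`), and the chat interface's `loadModels()` strips that suffix with `split(':')[0]` before sending requests. The same normalization in Python, assuming the `/api/tags` response shape `{"models": [{"name": ...}]}`:

```python
import json

def model_names(tags_json: str) -> list[str]:
    """Extract base model names from an Ollama /api/tags response body."""
    data = json.loads(tags_json)
    # 'llama2:latest' -> 'llama2', matching the JS client's split(':')[0]
    return [m["name"].split(":")[0] for m in data.get("models", [])]

body = '{"models": [{"name": "llama2:latest"}, {"name": "mistral:7b"}]}'
print(model_names(body))  # ['llama2', 'mistral']
```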
Step 2: Start FastAPI Server
# In your project directory (terminal 3)
source venv/bin/activate
python main.py
You should see:
🚀 Starting FastAPI AI Backend v0.3.0
📝 Environment: development
🤖 Ollama URL: http://localhost:11434
📚 API Docs: http://localhost:8000/docs
💬 Chat Interface: http://localhost:8000/chat.html
Step 3: Run Tests
# In another terminal (terminal 4)
source venv/bin/activate
python test_ai_endpoints.py
Step 4: Try the Web Interface
Open your browser:
http://localhost:8000/chat.html
You’ll see:
- Beautiful gradient chat interface
- Model selector dropdown
- Temperature slider
- Streaming toggle
- Real-time chat with AI!
📊 Project Structure Update
fastapi-ai-backend/
├── app/
│ ├── api/v1/endpoints/
│ │ └── ai.py ✅ NEW! AI endpoints
│ ├── models/
│ │ └── ai.py ✅ NEW! AI models
│ ├── services/
│ │ └── ai_service.py ✅ NEW! Ollama service
│ ├── static/
│ │ └── chat.html ✅ NEW! Chat interface
│ └── main.py ✅ Updated (static files)
│
├── test_ai_endpoints.py ✅ NEW! AI tests
└── .env ✅ Updated (Ollama settings)
🎉 What You’ve Accomplished!
Technical Achievements:
✅ Ollama Integration
- Local LLM serving
- Multi-model support
- Streaming responses
- Conversation management
✅ Production-Ready Architecture
- Service abstraction layer
- Provider-agnostic design
- Error handling
- Structured logging
✅ Complete AI API
- Chat endpoints (streaming & non-streaming)
- Model management
- Conversation history
- Health monitoring
✅ User Interface
- Real-time chat interface
- Model switching
- Temperature control
- Streaming visualization
Skills Gained:
- ✅ AI model deployment and management
- ✅ Streaming API implementation
- ✅ Real-time web interfaces
- ✅ Conversation state management
- ✅ Service architecture patterns
🚀 What’s Next?
You now have a complete, working AI backend! Here’s what you can build on this foundation:
Immediate Enhancements:
- Add file upload to chat (PDFs, images)
- Implement code syntax highlighting
- Add conversation export (JSON, markdown)
- Build conversation search
Coming in Future Posts:
- Blog Post 7: React frontend with TypeScript
- Blog Post 8: Database integration (SQLAlchemy)
- Blog Post 9: Authentication & user management
- Blog Post 10: Deployment (Docker, production setup)
💡 Pro Tips
Performance Optimization:
# Cache model responses for identical prompts
# (note: lru_cache only works with synchronous functions)
from functools import lru_cache
@lru_cache(maxsize=100)
def cached_generate(prompt: str, model: str):
return ollama_service.generate(prompt, model)
Better Error Messages:
from fastapi import HTTPException
try:
response = await ollama_service.chat(request)
except Exception as e:
if "connection refused" in str(e).lower():
raise HTTPException(
status_code=503,
detail="Ollama service not running. Start with: ollama serve"
)
raise
Request Timeout:
import asyncio
from fastapi import HTTPException
async def chat_with_timeout(request, timeout=30):
try:
return await asyncio.wait_for(
ollama_service.chat(request),
timeout=timeout
)
except asyncio.TimeoutError:
raise HTTPException(408, "Request timeout")
🎯 Summary
You’ve successfully integrated AI into your FastAPI backend! This is a major milestone. You can now:
- 🤖 Run AI models locally for free
- 💬 Build chat applications
- 🌊 Stream responses in real-time
- 📝 Manage conversations
- 🔄 Switch between models
- 🎨 Create beautiful chat interfaces
Your FastAPI project is now a production-ready AI-powered backend that can scale to support real-world applications!
Congratulations! 🎉