Integrating Ollama with FastAPI allows you to serve private, local large language models (LLMs) like Llama 2 or Mistral via a high-performance REST API. To implement this, install Ollama locally and use the ollama-python library to create a service abstraction layer that communicates with the Ollama server at http://localhost:11434. By utilizing FastAPI’s StreamingResponse, you can deliver real-time, token-by-token chat interactions to frontends while maintaining full data privacy and zero API costs. This architecture supports advanced features like conversation memory management and dynamic model switching through Pydantic-validated request schemas.
🎓 What You’ll Learn
By the end of this tutorial, you’ll be able to:
- Install and configure Ollama locally for AI model serving
- Create an AI service abstraction layer that works with any LLM
- Build streaming chat endpoints for real-time responses
- Implement conversation history management
- Handle multiple AI models dynamically
- Create a simple HTML chat interface to test your API
- Prepare your architecture for future model integrations (HuggingFace, RunPod, etc.)
📖 Understanding Ollama
What is Ollama?
Ollama is a tool that lets you run large language models (LLMs) locally on your computer. Think of it as Docker for AI models.
Real-world analogy:
- ❌ Without Ollama: Calling external APIs (OpenAI, Anthropic) – costs money per request, requires internet
- ✅ With Ollama: AI models running on your machine – free, private, works offline
Why Use Ollama for Development?
| Benefit | Description |
|---|---|
| Free | No API costs, unlimited requests |
| Private | Data never leaves your machine |
| Offline | Works without internet |
| Fast | No network latency |
| Learning | Perfect for understanding AI integration |
| Flexibility | Easy to switch between models |
System Requirements
- RAM: 8GB minimum (16GB+ recommended)
- Storage: ~4-50GB per model (varies by model size)
- OS: macOS, Linux, or Windows
- Optional: GPU for faster inference (NVIDIA/AMD)
🛠️ Step-by-Step Implementation
Step 1: Install Ollama
On macOS:
brew install ollama
On Linux:
curl -fsSL https://ollama.com/install.sh | sh
On Windows:
Download and install from: https://ollama.com/download
Verify Installation:
ollama --version
Expected output: ollama version 0.x.x
Step 2: Start Ollama Service
# Start Ollama server
ollama serve
What this does:
- Starts a REST API server at http://localhost:11434
- Manages AI models
- Handles inference requests
- Keeps running in the background
Leave this terminal open or run as a background service.
Step 3: Pull an AI Model
Open a new terminal:
# Pull Llama 2 (7B parameters, ~4GB)
ollama pull llama2
# Or try smaller/faster models:
ollama pull llama2:7b-chat # Optimized for chat
ollama pull phi # Microsoft's Phi (1.3B, very fast)
ollama pull mistral # Mistral 7B (good quality/speed balance)
ollama pull codellama # Specialized for code
# List installed models
ollama list
Available Models:
- llama2 – Meta’s Llama 2 (good all-rounder)
- mistral – Mistral 7B (fast, high quality)
- phi – Microsoft Phi (tiny, very fast)
- codellama – Code generation specialist
- neural-chat – Intel’s chat model
- orca-mini – Compact but capable
Model Size Guide:
- 3B parameters = ~2GB, very fast, basic quality
- 7B parameters = ~4GB, fast, good quality
- 13B parameters = ~8GB, slower, better quality
- 70B parameters = ~40GB, slow, excellent quality (requires powerful GPU)
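The download sizes above follow from simple arithmetic: most models Ollama distributes are 4-bit quantized, so a rough estimate is parameters × 4 bits ÷ 8 bytes. A quick sketch (the 4-bit figure is a rule of thumb, not something Ollama guarantees for every model):

```python
def approx_download_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Rough size of a quantized model: parameters x (bits / 8), in decimal GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

approx_download_gb(7)    # ~3.5 GB, matching the "~4GB" figure for 7B models
approx_download_gb(70)   # ~35 GB
```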
Step 4: Test Ollama from Command Line
# Interactive chat
ollama run llama2
# You'll see a prompt:
>>> Hello! Tell me about FastAPI.
# Exit with: /bye
Testing the REST API directly:
# Generate completion
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "What is FastAPI?",
"stream": false
}'
# Chat completion
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{"role": "user", "content": "Hello!"}
],
"stream": false
}'
Step 5: Install Python Ollama Client
# Activate your virtual environment
source venv/bin/activate # or venv\Scripts\activate on Windows
# Install ollama-python
pip install ollama
# Update requirements
pip freeze > requirements.txt
Step 6: Create AI Models
Create app/models/ai.py:
"""
AI-related Pydantic models
Request and response models for AI endpoints
"""
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
from datetime import datetime
# ============================================
# MESSAGE MODELS
# ============================================
class ChatMessage(BaseModel):
"""
Individual chat message
Represents a single message in a conversation
"""
role: Literal["user", "assistant", "system"] = Field(
...,
description="Message role (user, assistant, or system)"
)
content: str = Field(
...,
min_length=1,
description="Message content"
)
timestamp: Optional[datetime] = Field(
default=None,
description="When the message was created"
)
class Config:
json_schema_extra = {
"example": {
"role": "user",
"content": "What is FastAPI?",
"timestamp": "2024-01-15T10:30:00"
}
}
# ============================================
# CHAT REQUEST/RESPONSE
# ============================================
class ChatRequest(BaseModel):
"""
Chat completion request
Send a message and get AI response
"""
message: str = Field(
...,
min_length=1,
max_length=4000,
description="User message"
)
model: str = Field(
default="llama2",
description="AI model to use"
)
conversation_id: Optional[str] = Field(
None,
description="Conversation ID for context (optional)"
)
temperature: float = Field(
default=0.7,
ge=0.0,
le=2.0,
description="Sampling temperature (0.0-2.0). Higher = more creative"
)
max_tokens: Optional[int] = Field(
None,
ge=1,
le=4096,
description="Maximum tokens to generate"
)
stream: bool = Field(
default=False,
description="Stream response in real-time"
)
class Config:
json_schema_extra = {
"example": {
"message": "Explain FastAPI in simple terms",
"model": "llama2",
"temperature": 0.7,
"stream": False
}
}
class ChatResponse(BaseModel):
"""
Chat completion response
AI model's response to user message
"""
message: str = Field(..., description="AI response")
model: str = Field(..., description="Model used")
conversation_id: str = Field(..., description="Conversation ID")
created_at: datetime = Field(default_factory=datetime.now)
finish_reason: Optional[str] = Field(None, description="Why generation stopped")
# Token usage (if available)
prompt_tokens: Optional[int] = Field(None, description="Tokens in prompt")
completion_tokens: Optional[int] = Field(None, description="Tokens in completion")
total_tokens: Optional[int] = Field(None, description="Total tokens used")
class Config:
json_schema_extra = {
"example": {
"message": "FastAPI is a modern Python web framework...",
"model": "llama2",
"conversation_id": "conv_123",
"created_at": "2024-01-15T10:30:00",
"total_tokens": 150
}
}
# ============================================
# CONVERSATION MODELS
# ============================================
class Conversation(BaseModel):
"""
Conversation with message history
Maintains context across multiple messages
"""
conversation_id: str = Field(..., description="Unique conversation ID")
model: str = Field(..., description="AI model being used")
messages: List[ChatMessage] = Field(default=[], description="Message history")
created_at: datetime = Field(default_factory=datetime.now)
updated_at: datetime = Field(default_factory=datetime.now)
metadata: dict = Field(default={}, description="Additional conversation metadata")
class Config:
json_schema_extra = {
"example": {
"conversation_id": "conv_abc123",
"model": "llama2",
"messages": [
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi! How can I help?"}
],
"created_at": "2024-01-15T10:30:00"
}
}
# ============================================
# MODEL INFORMATION
# ============================================
class ModelInfo(BaseModel):
"""
Information about an AI model
"""
name: str = Field(..., description="Model name")
size: Optional[str] = Field(None, description="Model size (e.g., '7B', '13B')")
family: Optional[str] = Field(None, description="Model family")
parameter_size: Optional[str] = Field(None, description="Number of parameters")
quantization: Optional[str] = Field(None, description="Quantization level")
modified_at: Optional[datetime] = Field(None, description="Last modified date")
class Config:
json_schema_extra = {
"example": {
"name": "llama2:latest",
"size": "3.8GB",
"family": "llama",
"parameter_size": "7B"
}
}
class ModelListResponse(BaseModel):
"""List of available models"""
models: List[ModelInfo] = Field(..., description="Available models")
count: int = Field(..., description="Number of models")
# ============================================
# GENERATION OPTIONS
# ============================================
class GenerationOptions(BaseModel):
"""
Options for text generation
Controls how the AI generates responses
"""
temperature: float = Field(
default=0.7,
ge=0.0,
le=2.0,
description="Randomness (0=deterministic, 2=very random)"
)
top_p: float = Field(
default=0.9,
ge=0.0,
le=1.0,
description="Nucleus sampling threshold"
)
top_k: int = Field(
default=40,
ge=1,
description="Top-k sampling parameter"
)
repeat_penalty: float = Field(
default=1.1,
ge=0.0,
description="Penalty for repeating tokens"
)
class Config:
json_schema_extra = {
"example": {
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.1
}
}
🔍 Understanding Model Parameters:
- temperature: Controls randomness
- 0.0 = Deterministic, always picks most likely word
- 0.7 = Balanced (good for chat)
- 1.5+ = Creative, unpredictable
- top_p: Nucleus sampling
- 0.9 = Consider top 90% probability mass
- Lower = more focused, higher = more diverse
- top_k: Limits vocabulary to top K tokens
- 40 = Consider 40 most likely next words
- Higher = more options, lower = more focused
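To make these knobs concrete, here is a toy, stdlib-only sketch of temperature plus top-k sampling over a tiny vocabulary. It illustrates the idea only; it is not Ollama's actual sampler, and the vocabulary and scores are invented:

```python
import math
import random

def sample_next_token(logits: dict, temperature: float = 0.7,
                      top_k: int = 40, seed: int = 0) -> str:
    """Toy illustration of temperature + top-k sampling."""
    # top_k: keep only the K highest-scoring candidate tokens
    candidates = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    if temperature == 0.0:
        # Deterministic: always pick the most likely token
        return candidates[0][0]
    # Softmax with temperature: lower T sharpens the distribution,
    # higher T flattens it (more random picks)
    scaled = [score / temperature for _, score in candidates]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    rng = random.Random(seed)
    return rng.choices([tok for tok, _ in candidates], weights=weights)[0]

logits = {"the": 3.0, "a": 2.0, "banana": -1.0}
sample_next_token(logits, temperature=0.0)  # greedy: always "the"
```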
Step 7: Create AI Service
Create app/services/ai_service.py:
"""
AI Service - Ollama Integration
Handles communication with Ollama for AI inference
"""
import ollama
from typing import List, Dict, Any, Optional, AsyncGenerator
from datetime import datetime
import uuid
import json
from app.models.ai import (
ChatMessage,
ChatRequest,
ChatResponse,
Conversation,
ModelInfo,
GenerationOptions
)
from app.core.config import settings
from app.utils.logger import logger
class OllamaService:
"""
Service for interacting with Ollama AI models
Provides methods for chat, streaming, and model management
"""
def __init__(self, base_url: str = None):
"""
Initialize Ollama service
Args:
base_url: Ollama API base URL (default from settings)
"""
self.base_url = base_url or settings.OLLAMA_BASE_URL
self.client = ollama.Client(host=self.base_url)
# In-memory conversation storage (use database in production)
self.conversations: Dict[str, Conversation] = {}
logger.info(
"Ollama service initialized",
extra={"base_url": self.base_url}
)
# ============================================
# MODEL MANAGEMENT
# ============================================
async def list_models(self) -> List[ModelInfo]:
"""
List available Ollama models
Returns:
List of available models
"""
try:
response = self.client.list()
models = []
for model_data in response.get('models', []):
size_bytes = model_data.get('size')
models.append(
ModelInfo(
name=model_data.get('name'),
# Ollama reports size in bytes; convert to a readable string
size=f"{size_bytes / 1e9:.1f}GB" if size_bytes else None,
modified_at=model_data.get('modified_at')
)
)
logger.info(f"Listed {len(models)} models")
return models
except Exception as e:
logger.error(f"Error listing models: {e}")
raise
async def check_model_exists(self, model_name: str) -> bool:
"""
Check if a model exists locally
Args:
model_name: Name of the model to check
Returns:
True if model exists, False otherwise
"""
models = await self.list_models()
return any(model.name.startswith(model_name) for model in models)
# ============================================
# CONVERSATION MANAGEMENT
# ============================================
def create_conversation(self, model: str = None) -> Conversation:
"""
Create a new conversation
Args:
model: AI model to use
Returns:
New conversation object
"""
conversation_id = f"conv_{uuid.uuid4().hex[:12]}"
conversation = Conversation(
conversation_id=conversation_id,
model=model or settings.DEFAULT_AI_MODEL,
messages=[],
created_at=datetime.now(),
updated_at=datetime.now()
)
self.conversations[conversation_id] = conversation
logger.info(
"Conversation created",
extra={
"conversation_id": conversation_id,
"model": conversation.model
}
)
return conversation
def get_conversation(self, conversation_id: str) -> Optional[Conversation]:
"""
Get conversation by ID
Args:
conversation_id: Conversation identifier
Returns:
Conversation if found, None otherwise
"""
return self.conversations.get(conversation_id)
def add_message_to_conversation(
self,
conversation_id: str,
role: str,
content: str
) -> None:
"""
Add a message to conversation history
Args:
conversation_id: Conversation identifier
role: Message role (user/assistant/system)
content: Message content
"""
conversation = self.conversations.get(conversation_id)
if conversation:
message = ChatMessage(
role=role,
content=content,
timestamp=datetime.now()
)
conversation.messages.append(message)
conversation.updated_at = datetime.now()
# ============================================
# CHAT (NON-STREAMING)
# ============================================
async def chat(self, request: ChatRequest) -> ChatResponse:
"""
Generate chat completion (non-streaming)
Args:
request: Chat request with message and options
Returns:
Chat response from AI model
"""
try:
# Get or create conversation
if request.conversation_id:
conversation = self.get_conversation(request.conversation_id)
if not conversation:
conversation = self.create_conversation(request.model)
# Re-key under the client-supplied ID (drop the auto-generated entry)
self.conversations.pop(conversation.conversation_id, None)
conversation.conversation_id = request.conversation_id
self.conversations[request.conversation_id] = conversation
else:
conversation = self.create_conversation(request.model)
# Add user message to history
self.add_message_to_conversation(
conversation.conversation_id,
"user",
request.message
)
# Prepare messages for Ollama
messages = [
{"role": msg.role, "content": msg.content}
for msg in conversation.messages
]
logger.info(
"Generating chat completion",
extra={
"conversation_id": conversation.conversation_id,
"model": request.model,
"message_count": len(messages)
}
)
# Call Ollama API
response = self.client.chat(
model=request.model,
messages=messages,
options={
"temperature": request.temperature,
"num_predict": request.max_tokens
} if request.max_tokens else {"temperature": request.temperature},
stream=False
)
# Extract response
assistant_message = response['message']['content']
# Add assistant response to history
self.add_message_to_conversation(
conversation.conversation_id,
"assistant",
assistant_message
)
logger.info(
"Chat completion generated",
extra={
"conversation_id": conversation.conversation_id,
"response_length": len(assistant_message)
}
)
# Build response
return ChatResponse(
message=assistant_message,
model=request.model,
conversation_id=conversation.conversation_id,
created_at=datetime.now(),
finish_reason="stop"
)
except Exception as e:
logger.error(f"Error in chat: {e}", extra={"error": str(e)})
raise
# ============================================
# STREAMING CHAT
# ============================================
async def chat_stream(
self,
request: ChatRequest
) -> AsyncGenerator[str, None]:
"""
Generate streaming chat completion
Args:
request: Chat request with message and options
Yields:
Chunks of response as they're generated
"""
try:
# Get or create conversation
if request.conversation_id:
conversation = self.get_conversation(request.conversation_id)
if not conversation:
conversation = self.create_conversation(request.model)
# Re-key under the client-supplied ID (drop the auto-generated entry)
self.conversations.pop(conversation.conversation_id, None)
conversation.conversation_id = request.conversation_id
self.conversations[request.conversation_id] = conversation
else:
conversation = self.create_conversation(request.model)
# Add user message
self.add_message_to_conversation(
conversation.conversation_id,
"user",
request.message
)
# Prepare messages
messages = [
{"role": msg.role, "content": msg.content}
for msg in conversation.messages
]
logger.info(
"Starting streaming chat",
extra={
"conversation_id": conversation.conversation_id,
"model": request.model
}
)
# Stream from Ollama (note: ollama.Client is synchronous; consider
# ollama.AsyncClient in production so streaming doesn't block the event loop)
full_response = ""
stream = self.client.chat(
model=request.model,
messages=messages,
options={"temperature": request.temperature},
stream=True
)
for chunk in stream:
if 'message' in chunk and 'content' in chunk['message']:
content = chunk['message']['content']
full_response += content
# Yield as Server-Sent Event format
yield f"data: {json.dumps({'content': content})}\n\n"
# Add complete response to conversation
self.add_message_to_conversation(
conversation.conversation_id,
"assistant",
full_response
)
# Send completion event
yield f"data: {json.dumps({'done': True, 'conversation_id': conversation.conversation_id})}\n\n"
logger.info(
"Streaming chat completed",
extra={
"conversation_id": conversation.conversation_id,
"response_length": len(full_response)
}
)
except Exception as e:
logger.error(f"Error in streaming chat: {e}")
error_data = json.dumps({"error": str(e)})
yield f"data: {error_data}\n\n"
# ============================================
# SIMPLE GENERATION (NO CONVERSATION)
# ============================================
async def generate(
self,
prompt: str,
model: str = None,
options: GenerationOptions = None
) -> str:
"""
Simple text generation (no conversation context)
Args:
prompt: Input prompt
model: Model to use
options: Generation options
Returns:
Generated text
"""
try:
model = model or settings.DEFAULT_AI_MODEL
logger.info(
"Generating text",
extra={"model": model, "prompt_length": len(prompt)}
)
response = self.client.generate(
model=model,
prompt=prompt,
options=options.model_dump() if options else {},
stream=False
)
return response['response']
except Exception as e:
logger.error(f"Error in generate: {e}")
raise
# ============================================
# GLOBAL SERVICE INSTANCE
# ============================================
ollama_service = OllamaService()
🔍 Service Architecture Explained:
- Conversation Management: Stores chat history in memory
- Production: Use Redis or database
- Allows context-aware responses
- Streaming Support: Real-time token-by-token responses
- Better UX (user sees response as it’s generated)
- Uses Server-Sent Events (SSE)
- Model Abstraction: Easy to swap AI providers
- Today: Ollama
- Tomorrow: Add OpenAI, Anthropic, etc.
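The model-abstraction point can be made concrete with a typing.Protocol: any provider that implements the same chat signature can be swapped in behind the same endpoints. A minimal sketch (ChatProvider and EchoProvider are illustrative stubs, not part of the tutorial's code):

```python
import asyncio
from typing import Protocol

class ChatProvider(Protocol):
    """Minimal interface any backend (Ollama, OpenAI, ...) could implement."""
    async def chat(self, model: str, messages: list) -> str: ...

class EchoProvider:
    """Stub provider used only to show the shape of the abstraction."""
    async def chat(self, model: str, messages: list) -> str:
        # A real provider would call its API here; we just echo the last message
        return f"[{model}] " + messages[-1]["content"]

async def handle_chat(provider: ChatProvider, user_message: str) -> str:
    """Endpoint-style code depends on the Protocol, not a concrete provider."""
    return await provider.chat("llama2", [{"role": "user", "content": user_message}])

reply = asyncio.run(handle_chat(EchoProvider(), "hi"))
# reply == "[llama2] hi"
```

Swapping Ollama for another backend then means writing one new class, with no changes to the endpoint layer.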
Step 8: Create AI Endpoints
Create app/api/v1/endpoints/ai.py:
"""
AI endpoints
Chat, streaming, and model management endpoints
"""
from fastapi import APIRouter, Depends, HTTPException, status
from fastapi.responses import StreamingResponse
from typing import Annotated
from app.models.ai import (
ChatRequest,
ChatResponse,
ModelListResponse,
Conversation
)
from app.services.ai_service import OllamaService, ollama_service
from app.utils.logger import logger
router = APIRouter(prefix="/ai", tags=["AI"])
def get_ai_service() -> OllamaService:
"""Dependency to get AI service"""
return ollama_service
# ============================================
# MODEL MANAGEMENT
# ============================================
@router.get(
"/models",
response_model=ModelListResponse,
summary="List available AI models"
)
async def list_models(
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
List all available Ollama models
Returns:
List of models installed locally
"""
try:
models = await service.list_models()
return ModelListResponse(
models=models,
count=len(models)
)
except Exception as e:
logger.error(f"Error listing models: {e}")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Failed to list models: {str(e)}"
)
# ============================================
# CHAT ENDPOINTS
# ============================================
@router.post(
"/chat",
response_model=ChatResponse,
summary="Chat with AI (non-streaming)"
)
async def chat(
request: ChatRequest,
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Send a message and get AI response
**Features:**
- Maintains conversation context
- Returns complete response at once
- Supports multiple models
- Configurable temperature
**Example Request:**
```json
{
"message": "Explain FastAPI in simple terms",
"model": "llama2",
"temperature": 0.7
}
```
"""
try:
# Check if model exists
if not await service.check_model_exists(request.model):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Model '{request.model}' not found. Pull it first with: ollama pull {request.model}"
)
response = await service.chat(request)
return response
except HTTPException:
raise
except Exception as e:
logger.error(f"Error in chat endpoint: {e}")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Chat failed: {str(e)}"
)
@router.post(
"/chat/stream",
summary="Chat with AI (streaming)",
description="Stream AI responses token-by-token for real-time chat experience"
)
async def chat_stream(
request: ChatRequest,
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Stream chat response in real-time
**How to use:**
1. Send POST request with message
2. Response streams as Server-Sent Events
3. Each chunk contains a piece of the response
4. Final chunk includes conversation_id
**Response format:**
data: {"content": "Fast"}
data: {"content": "API"}
data: {"content": " is"}
data: {"done": true, "conversation_id": "conv_123"}
"""
try:
# Check if model exists
if not await service.check_model_exists(request.model):
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Model '{request.model}' not found. Pull it first with: ollama pull {request.model}"
)
return StreamingResponse(
service.chat_stream(request),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"X-Accel-Buffering": "no" # Disable nginx buffering
}
)
except HTTPException:
raise
except Exception as e:
logger.error(f"Error in streaming chat: {e}")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Streaming chat failed: {str(e)}"
)
# ============================================
# CONVERSATION MANAGEMENT
# ============================================
@router.get(
"/conversations/{conversation_id}",
response_model=Conversation,
summary="Get conversation history"
)
async def get_conversation(
conversation_id: str,
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Retrieve conversation history by ID
Returns:
Conversation with all messages
"""
conversation = service.get_conversation(conversation_id)
if not conversation:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Conversation '{conversation_id}' not found"
)
return conversation
@router.post(
"/conversations",
response_model=Conversation,
status_code=status.HTTP_201_CREATED,
summary="Create new conversation"
)
async def create_conversation(
model: str = "llama2",
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Create a new conversation
Args:
model: AI model to use for this conversation
Returns:
New conversation object with ID
"""
conversation = service.create_conversation(model=model)
return conversation
# ============================================
# HEALTH CHECK
# ============================================
@router.get(
"/health",
summary="Check AI service health"
)
async def ai_health_check(
service: Annotated[OllamaService, Depends(get_ai_service)]
):
"""
Check if Ollama service is running and accessible
Returns:
Status and available models count
"""
try:
models = await service.list_models()
return {
"status": "healthy",
"ollama_url": service.base_url,
"models_available": len(models),
"models": [model.name for model in models]
}
except Exception as e:
return {
"status": "unhealthy",
"error": str(e),
"message": "Make sure Ollama is running: ollama serve"
}
Step 9: Update API Router
Update app/api/v1/api.py:
"""
API v1 router aggregator
"""
from fastapi import APIRouter
from app.api.v1.endpoints import users, health, users_advanced, dependencies_demo, ai
# Create main v1 router
api_router = APIRouter()
# Include all endpoint routers
api_router.include_router(health.router)
api_router.include_router(users.router)
api_router.include_router(users_advanced.router)
api_router.include_router(dependencies_demo.router)
api_router.include_router(ai.router) # New AI endpoints!
Step 10: Update Configuration
Make sure your .env file has Ollama settings:
# .env
# ... existing settings ...
# AI Settings
OLLAMA_BASE_URL=http://localhost:11434
DEFAULT_AI_MODEL=llama2
These settings are already in our app/core/config.py from earlier!
Step 11: Create HTML Chat Interface
Create app/static/chat.html:
First, create the static directory:
mkdir -p app/static
Now create the HTML file:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AI Chat - FastAPI Backend</title>
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
height: 100vh;
display: flex;
justify-content: center;
align-items: center;
padding: 20px;
}
.chat-container {
width: 100%;
max-width: 800px;
height: 90vh;
background: white;
border-radius: 20px;
box-shadow: 0 20px 60px rgba(0,0,0,0.3);
display: flex;
flex-direction: column;
overflow: hidden;
}
.chat-header {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
padding: 20px;
display: flex;
justify-content: space-between;
align-items: center;
}
.chat-header h1 {
font-size: 24px;
font-weight: 600;
}
.model-selector {
background: rgba(255,255,255,0.2);
border: none;
color: white;
padding: 8px 12px;
border-radius: 8px;
font-size: 14px;
cursor: pointer;
}
.model-selector option {
background: #764ba2;
}
.chat-messages {
flex: 1;
overflow-y: auto;
padding: 20px;
background: #f5f5f5;
}
.message {
margin-bottom: 20px;
display: flex;
animation: slideIn 0.3s ease;
}
@keyframes slideIn {
from {
opacity: 0;
transform: translateY(10px);
}
to {
opacity: 1;
transform: translateY(0);
}
}
.message.user {
justify-content: flex-end;
}
.message-content {
max-width: 70%;
padding: 12px 16px;
border-radius: 12px;
word-wrap: break-word;
white-space: pre-wrap;
}
.message.user .message-content {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}
.message.assistant .message-content {
background: white;
color: #333;
box-shadow: 0 2px 8px rgba(0,0,0,0.1);
}
.message-time {
font-size: 11px;
color: #999;
margin-top: 5px;
}
.typing-indicator {
display: none;
padding: 12px 16px;
background: white;
border-radius: 12px;
box-shadow: 0 2px 8px rgba(0,0,0,0.1);
}
.typing-indicator.active {
display: inline-block;
}
.typing-indicator span {
height: 8px;
width: 8px;
background: #667eea;
border-radius: 50%;
display: inline-block;
margin: 0 2px;
animation: bounce 1.4s infinite;
}
.typing-indicator span:nth-child(2) {
animation-delay: 0.2s;
}
.typing-indicator span:nth-child(3) {
animation-delay: 0.4s;
}
@keyframes bounce {
0%, 60%, 100% {
transform: translateY(0);
}
30% {
transform: translateY(-10px);
}
}
.chat-input-container {
padding: 20px;
background: white;
border-top: 1px solid #e0e0e0;
}
.chat-input-wrapper {
display: flex;
gap: 10px;
}
.chat-input {
flex: 1;
padding: 12px 16px;
border: 2px solid #e0e0e0;
border-radius: 12px;
font-size: 14px;
resize: none;
font-family: inherit;
max-height: 120px;
}
.chat-input:focus {
outline: none;
border-color: #667eea;
}
.send-button {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
border: none;
padding: 12px 24px;
border-radius: 12px;
font-size: 14px;
font-weight: 600;
cursor: pointer;
transition: transform 0.2s;
}
.send-button:hover {
transform: scale(1.05);
}
.send-button:disabled {
opacity: 0.5;
cursor: not-allowed;
transform: scale(1);
}
.options {
display: flex;
gap: 15px;
margin-top: 10px;
font-size: 13px;
}
.option-group {
display: flex;
align-items: center;
gap: 8px;
}
.option-group label {
color: #666;
}
.option-group input[type="range"] {
width: 100px;
}
.option-group input[type="checkbox"] {
cursor: pointer;
}
.error-message {
background: #fee;
color: #c33;
padding: 12px;
border-radius: 8px;
margin: 10px 20px;
display: none;
}
.error-message.active {
display: block;
}
</style>
</head>
<body>
<div class="chat-container">
<div class="chat-header">
<h1>🤖 AI Chat</h1>
<select class="model-selector" id="modelSelector">
<option value="llama2">Llama 2</option>
<option value="mistral">Mistral</option>
<option value="phi">Phi</option>
<option value="codellama">Code Llama</option>
</select>
</div>
<div class="error-message" id="errorMessage"></div>
<div class="chat-messages" id="chatMessages">
<div class="message assistant">
<div class="message-content">
👋 Hello! I'm your AI assistant. How can I help you today?
</div>
</div>
</div>
<div class="chat-input-container">
<div class="options">
<div class="option-group">
<label for="temperature">Temperature:</label>
<input type="range" id="temperature" min="0" max="2" step="0.1" value="0.7">
<span id="temperatureValue">0.7</span>
</div>
<div class="option-group">
<input type="checkbox" id="streamToggle" checked>
<label for="streamToggle">Stream responses</label>
</div>
</div>
<div class="chat-input-wrapper">
<textarea
id="chatInput"
class="chat-input"
placeholder="Type your message..."
rows="1"
></textarea>
<button id="sendButton" class="send-button">Send</button>
</div>
</div>
</div>
<script>
const API_BASE_URL = 'http://localhost:8000/api/v1';
let conversationId = null;
// Elements
const chatMessages = document.getElementById('chatMessages');
const chatInput = document.getElementById('chatInput');
const sendButton = document.getElementById('sendButton');
const modelSelector = document.getElementById('modelSelector');
const temperatureSlider = document.getElementById('temperature');
const temperatureValue = document.getElementById('temperatureValue');
const streamToggle = document.getElementById('streamToggle');
const errorMessage = document.getElementById('errorMessage');
// Auto-resize textarea
chatInput.addEventListener('input', function() {
this.style.height = 'auto';
this.style.height = (this.scrollHeight) + 'px';
});
// Temperature slider
temperatureSlider.addEventListener('input', function() {
temperatureValue.textContent = this.value;
});
// Send message on Enter (Shift+Enter for new line)
chatInput.addEventListener('keydown', function(e) {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
sendMessage();
}
});
// Send button click
sendButton.addEventListener('click', sendMessage);
// Show error
function showError(message) {
errorMessage.textContent = message;
errorMessage.classList.add('active');
setTimeout(() => {
errorMessage.classList.remove('active');
}, 5000);
}
// Add message to chat
function addMessage(role, content) {
const messageDiv = document.createElement('div');
messageDiv.className = `message ${role}`;
const contentDiv = document.createElement('div');
contentDiv.className = 'message-content';
contentDiv.textContent = content;
messageDiv.appendChild(contentDiv);
chatMessages.appendChild(messageDiv);
chatMessages.scrollTop = chatMessages.scrollHeight;
return contentDiv;
}
// Show typing indicator
function showTyping() {
const typingDiv = document.createElement('div');
typingDiv.className = 'message assistant';
typingDiv.id = 'typingIndicator';
typingDiv.innerHTML = `
<div class="typing-indicator active">
<span></span>
<span></span>
<span></span>
</div>
`;
chatMessages.appendChild(typingDiv);
chatMessages.scrollTop = chatMessages.scrollHeight;
}
function hideTyping() {
const typing = document.getElementById('typingIndicator');
if (typing) typing.remove();
}
// Send message (non-streaming)
async function sendMessageNonStream(message) {
try {
const response = await fetch(`${API_BASE_URL}/ai/chat`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
message: message,
model: modelSelector.value,
conversation_id: conversationId,
temperature: parseFloat(temperatureSlider.value),
stream: false
})
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Request failed');
}
const data = await response.json();
conversationId = data.conversation_id;
hideTyping();
addMessage('assistant', data.message);
} catch (error) {
hideTyping();
showError(`Error: ${error.message}`);
console.error('Error:', error);
}
}
// Send message (streaming)
async function sendMessageStream(message) {
try {
const response = await fetch(`${API_BASE_URL}/ai/chat/stream`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
message: message,
model: modelSelector.value,
conversation_id: conversationId,
temperature: parseFloat(temperatureSlider.value),
stream: true
})
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Request failed');
}
hideTyping();
// Create message element for streaming
const messageDiv = document.createElement('div');
messageDiv.className = 'message assistant';
const contentDiv = document.createElement('div');
contentDiv.className = 'message-content';
messageDiv.appendChild(contentDiv);
chatMessages.appendChild(messageDiv);
// Read stream
const reader = response.body.getReader();
const decoder = new TextDecoder();
let fullResponse = '';
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
// stream: true keeps partial multi-byte characters inside the decoder
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n');
// The last element may be an incomplete line; keep it for the next chunk
buffer = lines.pop();
for (const line of lines) {
if (line.startsWith('data: ')) {
const data = JSON.parse(line.slice(6));
if (data.content) {
fullResponse += data.content;
contentDiv.textContent = fullResponse;
chatMessages.scrollTop = chatMessages.scrollHeight;
}
if (data.done) {
conversationId = data.conversation_id;
}
if (data.error) {
showError(data.error);
}
}
}
}
} catch (error) {
hideTyping();
showError(`Error: ${error.message}`);
console.error('Error:', error);
}
}
// Main send function
async function sendMessage() {
const message = chatInput.value.trim();
if (!message) return;
// Add user message
addMessage('user', message);
chatInput.value = '';
chatInput.style.height = 'auto';
// Disable input
chatInput.disabled = true;
sendButton.disabled = true;
// Show typing
showTyping();
// Send based on stream toggle
if (streamToggle.checked) {
await sendMessageStream(message);
} else {
await sendMessageNonStream(message);
}
// Enable input
chatInput.disabled = false;
sendButton.disabled = false;
chatInput.focus();
}
// Load available models on start
async function loadModels() {
try {
const response = await fetch(`${API_BASE_URL}/ai/models`);
const data = await response.json();
modelSelector.innerHTML = '';
data.models.forEach(model => {
const option = document.createElement('option');
option.value = model.name.split(':')[0];
option.textContent = model.name;
modelSelector.appendChild(option);
});
} catch (error) {
console.error('Error loading models:', error);
showError('Failed to load models. Make sure Ollama is running.');
}
}
// Initialize
loadModels();
chatInput.focus();
</script>
</body>
</html>
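The streaming client above parses Server-Sent-Events-style lines of the form `data: {...}`. If you want to exercise the same logic from backend tests, a small Python equivalent of that parsing step (assuming the same `content` / `done` / `conversation_id` payload shape the JS client expects) might look like:

```python
import json

def parse_sse_chunk(chunk: str):
    """Parse one decoded chunk of the /ai/chat/stream response.

    Mirrors the JS client: split on newlines, keep only lines that
    start with 'data: ', and JSON-decode the payload after the prefix.
    """
    events = []
    for line in chunk.split("\n"):
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events

# Two events arriving in a single chunk
chunk = 'data: {"content": "Hel"}\ndata: {"content": "lo", "done": true}\n'
events = parse_sse_chunk(chunk)
print([e.get("content") for e in events])  # ['Hel', 'lo']
```

Note that this assumes each `data:` line holds a complete JSON object, which is why the JS client needs to buffer incomplete lines between reads.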
Step 12: Serve Static Files
Update app/main.py to serve static files:
"""
FastAPI AI Backend - Main Application
Updated with static file serving for chat interface
"""
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.exceptions import RequestValidationError
from fastapi.staticfiles import StaticFiles # New import
from pydantic import ValidationError
from app.core.config import settings
from app.api.v1.api import api_router
from app.core.exceptions import AppException
from app.core.error_handlers import (
app_exception_handler,
validation_exception_handler,
generic_exception_handler
)
from app.middleware.logging_middleware import LoggingMiddleware
from app.middleware.performance_middleware import PerformanceMiddleware
from app.utils.logger import logger
def create_application() -> FastAPI:
"""Application factory pattern"""
app = FastAPI(
title=settings.APP_NAME,
version=settings.APP_VERSION,
description="""
A production-ready FastAPI backend for AI applications.
## Features
* **AI Integration** with Ollama (local LLMs)
* Streaming chat responses
* Multiple AI model support
* Conversation history management
* Advanced dependency injection
* Custom middleware stack
* Structured logging
* Background tasks
* Comprehensive error handling
## Try the Chat Interface
Visit: http://localhost:8000/chat.html
## API Documentation
* Swagger UI: /docs
* ReDoc: /redoc
""",
debug=settings.DEBUG,
docs_url="/docs",
redoc_url="/redoc",
openapi_url="/openapi.json"
)
# ============================================
# MIDDLEWARE
# ============================================
app.add_middleware(PerformanceMiddleware, slow_request_threshold=1.0)
app.add_middleware(LoggingMiddleware)
app.add_middleware(
CORSMiddleware,
allow_origins=settings.allowed_origins_list,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
expose_headers=["X-Request-ID", "X-Process-Time"]
)
# ============================================
# EXCEPTION HANDLERS
# ============================================
app.add_exception_handler(AppException, app_exception_handler)
app.add_exception_handler(RequestValidationError, validation_exception_handler)
app.add_exception_handler(ValidationError, validation_exception_handler)
app.add_exception_handler(Exception, generic_exception_handler)
# ============================================
# ROUTERS
# ============================================
app.include_router(api_router, prefix=settings.API_V1_PREFIX)
# ============================================
# STATIC FILES
# ============================================
# Mount last: routes are matched in registration order, and a mount
# at "/" matches every path, so it must come after the API routers
# or it would shadow them
app.mount("/", StaticFiles(directory="app/static", html=True), name="static")
return app
app = create_application()
@app.on_event("startup")
async def startup_event():
"""Startup tasks"""
logger.info(
"Application startup",
extra={
"app_name": settings.APP_NAME,
"version": settings.APP_VERSION,
"environment": settings.ENVIRONMENT
}
)
print(f"🚀 Starting {settings.APP_NAME} v{settings.APP_VERSION}")
print(f"📝 Environment: {settings.ENVIRONMENT}")
print(f"🤖 Ollama URL: {settings.OLLAMA_BASE_URL}")
print(f"📚 API Docs: http://{settings.HOST}:{settings.PORT}/docs")
print(f"💬 Chat Interface: http://{settings.HOST}:{settings.PORT}/chat.html")
@app.on_event("shutdown")
async def shutdown_event():
"""Shutdown tasks"""
logger.info("Application shutdown", extra={"app_name": settings.APP_NAME})
print(f"👋 Shutting down {settings.APP_NAME}")
🧪 Testing Your AI Integration
Complete Testing Workflow:
Step 1: Install and Start Ollama
# Install Ollama (if not done)
# macOS: brew install ollama
# Linux: curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama server (in terminal 1)
ollama serve
# Pull a model (in terminal 2)
ollama pull llama2
# Verify
ollama list
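A note on model names: `ollama list` shows names with a tag suffix (e.g. `llama2:latest`), and the chat interface's `loadModels()` strips that suffix with `split(':')[0]` before sending requests. The same normalization in Python, assuming the `/api/tags` response shape `{"models": [{"name": ...}]}`:

```python
import json

def model_names(tags_json: str) -> list[str]:
    """Extract base model names from an Ollama /api/tags response body."""
    data = json.loads(tags_json)
    # 'llama2:latest' -> 'llama2', matching the JS client's split(':')[0]
    return [m["name"].split(":")[0] for m in data.get("models", [])]

body = '{"models": [{"name": "llama2:latest"}, {"name": "mistral:7b"}]}'
print(model_names(body))  # ['llama2', 'mistral']
```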
Step 2: Start FastAPI Server
# In your project directory (terminal 3)
source venv/bin/activate
python main.py
You should see:
🚀 Starting FastAPI AI Backend v0.3.0
📝 Environment: development
🤖 Ollama URL: http://localhost:11434
📚 API Docs: http://localhost:8000/docs
💬 Chat Interface: http://localhost:8000/chat.html
Step 3: Run Tests
# In another terminal (terminal 4)
source venv/bin/activate
python test_ai_endpoints.py
Step 4: Try the Web Interface
Open your browser:
http://localhost:8000/chat.html
You’ll see:
- Beautiful gradient chat interface
- Model selector dropdown
- Temperature slider
- Streaming toggle
- Real-time chat with AI!
📊 Project Structure Update
fastapi-ai-backend/
├── app/
│ ├── api/v1/endpoints/
│ │ └── ai.py ✅ NEW! AI endpoints
│ ├── models/
│ │ └── ai.py ✅ NEW! AI models
│ ├── services/
│ │ └── ai_service.py ✅ NEW! Ollama service
│ ├── static/
│ │ └── chat.html ✅ NEW! Chat interface
│ └── main.py ✅ Updated (static files)
│
├── test_ai_endpoints.py ✅ NEW! AI tests
└── .env ✅ Updated (Ollama settings)
🎉 What You’ve Accomplished!
Technical Achievements:
✅ Ollama Integration
- Local LLM serving
- Multi-model support
- Streaming responses
- Conversation management
✅ Production-Ready Architecture
- Service abstraction layer
- Provider-agnostic design
- Error handling
- Structured logging
✅ Complete AI API
- Chat endpoints (streaming & non-streaming)
- Model management
- Conversation history
- Health monitoring
✅ User Interface
- Real-time chat interface
- Model switching
- Temperature control
- Streaming visualization
Skills Gained:
- ✅ AI model deployment and management
- ✅ Streaming API implementation
- ✅ Real-time web interfaces
- ✅ Conversation state management
- ✅ Service architecture patterns
🚀 What’s Next?
You now have a complete, working AI backend! Here’s what you can build on this foundation:
Immediate Enhancements:
- Add file upload to chat (PDFs, images)
- Implement code syntax highlighting
- Add conversation export (JSON, markdown)
- Build conversation search
Coming in Future Posts:
- Blog Post 7: React frontend with TypeScript
- Blog Post 8: Database integration (SQLAlchemy)
- Blog Post 9: Authentication & user management
- Blog Post 10: Deployment (Docker, production setup)
💡 Pro Tips
Performance Optimization:
# Cache model responses for identical prompts
# (note: lru_cache only works with synchronous functions)
from functools import lru_cache
@lru_cache(maxsize=100)
def cached_generate(prompt: str, model: str):
return ollama_service.generate(prompt, model)
Better Error Messages:
from fastapi import HTTPException
try:
response = await ollama_service.chat(request)
except Exception as e:
if "connection refused" in str(e).lower():
raise HTTPException(
status_code=503,
detail="Ollama service not running. Start with: ollama serve"
)
raise
Request Timeout:
import asyncio
from fastapi import HTTPException
async def chat_with_timeout(request, timeout=30):
try:
return await asyncio.wait_for(
ollama_service.chat(request),
timeout=timeout
)
except asyncio.TimeoutError:
raise HTTPException(408, "Request timeout")
🎯 Summary
You’ve successfully integrated AI into your FastAPI backend! This is a major milestone. You can now:
- 🤖 Run AI models locally for free
- 💬 Build chat applications
- 🌊 Stream responses in real-time
- 📝 Manage conversations
- 🔄 Switch between models
- 🎨 Create beautiful chat interfaces
Your FastAPI project is now a production-ready AI-powered backend that can scale to support real-world applications!
Congratulations! 🎉