Internal Setup Guide: Multimodal Python Stack

This document provides detailed instructions for setting up, testing, and running the multimodal agents framework. It covers everything from API key configuration to hardware requirements and troubleshooting.

Prerequisites
Environment Setup
API Key Configuration
Hardware Requirements
Running the Examples
Testing Individual Components
Common Issues & Troubleshooting
Development Workflow
Cost Management
Production Deployment

Prerequisites

Required Software

Software	Version	Purpose	Installation
Python	3.11+	Runtime	`brew install python@3.11` or pyenv
pip/uv	Latest	Package manager	Comes with Python / `pip install uv`
ffmpeg	6.0+	Audio/video processing	`brew install ffmpeg`
PortAudio	Latest	Microphone access	`brew install portaudio`

Verify Installation

# Check Python version
python3 --version  # Should be 3.11+

# Check ffmpeg
ffmpeg -version

# Check PortAudio (for microphone)
brew list portaudio  # On macOS
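
The same checks can be scripted. The sketch below is a hypothetical helper (`check_prerequisites` is not part of the repository) that uses only the standard library to report whether the interpreter meets the 3.11 minimum and whether `ffmpeg` is on the PATH. PortAudio is a shared library rather than an executable, so it is not covered here.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version from the Required Software table


def check_prerequisites(min_python=MIN_PYTHON, tools=("ffmpeg",)):
    """Return a dict mapping each prerequisite name to True or False."""
    results = {"python": sys.version_info[:2] >= tuple(min_python)}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```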

macOS-Specific Setup

# Install Homebrew if not present
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install system dependencies
brew install python@3.11 ffmpeg portaudio

# Grant camera/microphone permissions
# System Settings > Privacy & Security > Camera/Microphone
# (System Preferences > Security & Privacy on macOS 12 and earlier)
# Add Terminal.app or your IDE

Linux-Specific Setup

# Ubuntu/Debian
sudo apt update
sudo apt install python3.11 python3.11-venv python3-pip ffmpeg portaudio19-dev libportaudio2

# Fedora
sudo dnf install python3.11 ffmpeg portaudio-devel

Environment Setup

Option 1: Using uv (Recommended)

# Install uv
pip install uv

# Create virtual environment
cd multimodal-python-stack
uv venv

# Activate
source .venv/bin/activate  # macOS/Linux
# or
.venv\Scripts\activate     # Windows

# Install dependencies
uv pip install -r requirements.txt

Option 2: Using pip

# Create virtual environment
python3.11 -m venv .venv

# Activate
source .venv/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install dependencies
pip install -r requirements.txt

Option 3: Using Poetry

# Install poetry
pip install poetry

# Install dependencies
poetry install

# Activate shell
poetry shell

Verify Installation

# Test imports
python -c "
from src.core.agent import AgentLoop
from src.models import create_model
from src.inputs import WebcamInput
print('All imports successful!')
"

API Key Configuration

Step 1: Copy Environment Template

cp .env.example .env

Step 2: Configure API Keys

Edit .env with your actual API keys:

# ===========================================
# MODEL PROVIDERS
# ===========================================

# OpenAI - Required for GPT-4o and Whisper transcription
# Get key: https://platform.openai.com/api-keys
OPENAI_API_KEY=sk-proj-...

# Anthropic - Required for Claude models
# Get key: https://console.anthropic.com/settings/keys
ANTHROPIC_API_KEY=sk-ant-api03-...

# Google - Required for Gemini models
# Get key: https://aistudio.google.com/app/apikey
GOOGLE_API_KEY=AIza...

# Groq - Required for fast Llama inference
# Get key: https://console.groq.com/keys
GROQ_API_KEY=gsk_...

# Fireworks - Required for FireLLaVA
# Get key: https://fireworks.ai/api-keys
FIREWORKS_API_KEY=fw_...

# Together - Required for Together models
# Get key: https://api.together.xyz/settings/api-keys
TOGETHER_API_KEY=...

# ===========================================
# TOOL INTEGRATIONS
# ===========================================

# Slack Webhooks - For alert examples
# Create: https://api.slack.com/messaging/webhooks
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../...

# Notion - For run-sheet logging examples
# Create integration: https://www.notion.so/my-integrations
# Then share a database with your integration
NOTION_API_KEY=secret_...
NOTION_DATABASE_ID=...
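
Before making any paid API call, it can help to confirm that the template placeholders were actually replaced. The following is a minimal sketch, not part of the repository: `audit_keys` and `EXPECTED_PREFIXES` are hypothetical names, and the prefixes are taken from the placeholder values in the template (with `sk-` used for OpenAI so that keys without the `proj` segment also pass). It never prints the key values themselves.

```python
import os

# Prefixes taken from the placeholder values in the .env template.
EXPECTED_PREFIXES = {
    "OPENAI_API_KEY": "sk-",
    "ANTHROPIC_API_KEY": "sk-ant-",
    "GOOGLE_API_KEY": "AIza",
    "GROQ_API_KEY": "gsk_",
}


def audit_keys(env, expected=EXPECTED_PREFIXES):
    """Classify each variable as 'ok', 'missing', or 'unexpected prefix'."""
    report = {}
    for name, prefix in expected.items():
        value = env.get(name, "").strip()
        if not value or value.endswith("..."):
            # Unset, empty, or still the literal placeholder from the template
            report[name] = "missing"
        elif not value.startswith(prefix):
            report[name] = "unexpected prefix"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in audit_keys(os.environ).items():
        print(f"{name}: {status}")
```

Load the `.env` file first (for example with `load_dotenv()`) if the variables are not already exported in the shell.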

Step 3: Verify API Keys

# Test OpenAI
python -c "
import os
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI()
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{'role': 'user', 'content': 'Say hello'}],
    max_tokens=10
)
print('OpenAI:', response.choices[0].message.content)
"

# Test Anthropic
python -c "
import os
from dotenv import load_dotenv
from anthropic import Anthropic
load_dotenv()
client = Anthropic()
response = client.messages.create(
    model='claude-3-5-haiku-latest',
    max_tokens=10,
    messages=[{'role': 'user', 'content': 'Say hello'}]
)
print('Anthropic:', response.content[0].text)
"

# Test Google
python -c "
import os
from dotenv import load_dotenv
import google.generativeai as genai
load_dotenv()
genai.configure(api_key=os.getenv('GOOGLE_API_KEY'))
model = genai.GenerativeModel('gemini-1.5-flash')
response = model.generate_content('Say hello')
print('Google:', response.text)
"

Getting API Keys - Detailed Instructions

OpenAI

Go to https://platform.openai.com/signup
Create account or sign in
Navigate to API Keys: https://platform.openai.com/api-keys
Click "Create new secret key"
Copy the key (starts with sk-proj-)
Add billing: https://platform.openai.com/account/billing

Pricing (as of 2025):

GPT-4o: $5.00/1M input, $15.00/1M output
GPT-4o-mini: $0.15/1M input, $0.60/1M output
Whisper: $0.006/minute

Anthropic

Go to https://console.anthropic.com/
Create account or sign in
Navigate to API Keys: https://console.anthropic.com/settings/keys
Click "Create Key"
Copy the key (starts with sk-ant-)

Pricing:

Claude 3.5 Sonnet: $3.00/1M input, $15.00/1M output
Claude 3.5 Haiku: $0.80/1M input, $4.00/1M output

Google (Gemini)

Go to https://aistudio.google.com/
Sign in with Google account
Click "Get API key" in the top right
Create API key for a new or existing project
Copy the key (starts with AIza)

Pricing:

Gemini 1.5 Flash: $0.075/1M input, $0.30/1M output (cheapest!)
Gemini 1.5 Pro: $1.25/1M input, $5.00/1M output

Groq

Go to https://console.groq.com/
Create account or sign in
Navigate to API Keys: https://console.groq.com/keys
Click "Create API Key"
Copy the key (starts with gsk_)

Pricing:

Llama 3.2 90B Vision: $0.11/1M tokens
Llama 3.2 11B Vision: $0.05/1M tokens

Slack Webhook

Go to https://api.slack.com/apps
Click "Create New App" > "From scratch"
Name it (e.g., "Multimodal Agent") and select workspace
Go to "Incoming Webhooks" in sidebar
Toggle "Activate Incoming Webhooks" ON
Click "Add New Webhook to Workspace"
Select the channel for alerts
Copy the webhook URL
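
To confirm the webhook works independently of the framework, a message can be posted with the standard library alone. This is a sketch under the assumption that the webhook accepts the usual JSON body with a `text` field; `build_slack_request` and `send_slack_message` are illustrative names, not repository functions.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for an incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, text: str, timeout: float = 10.0) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_slack_message(os.environ["SLACK_WEBHOOK_URL"], "Webhook test")` should post to the channel selected during setup; an invalid or revoked URL surfaces as `urllib.error.HTTPError`.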

Notion Integration

Go to https://www.notion.so/my-integrations
Click "New integration"
Name it (e.g., "Multimodal Agent")
Select workspace
Copy the "Internal Integration Token" (starts with secret_)

Database Setup:

Create a new Notion database with these properties:
- Title (title type)
- Status (select: pending, in_progress, completed, blocked)
- Notes (rich text)
- Timestamp (date)
- Tags (multi-select, optional)
Click "..." menu > "Add connections" > Select your integration
Copy the database ID from the URL:
- URL: https://www.notion.so/myworkspace/abc123def456...
- Database ID: abc123def456 (32-character hex string)
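
Copying the ID by hand is error-prone, so the extraction can be scripted. The sketch below assumes the database ID is the trailing 32 hex characters of the last path segment, and that anything in the query string (such as a view parameter) is not the database ID. `extract_database_id` is a hypothetical helper name.

```python
import re
from urllib.parse import urlsplit


def extract_database_id(url: str) -> str:
    """Return the 32-character hex ID at the end of a Notion URL path."""
    # urlsplit drops the query string, so only the path is inspected
    last_segment = urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]
    match = re.search(r"[0-9a-fA-F]{32}$", last_segment)
    if match is None:
        raise ValueError(f"No 32-character hex ID found in: {url}")
    return match.group(0).lower()
```

The returned value is what goes into `NOTION_DATABASE_ID`.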

Hardware Requirements

Minimum Requirements

Component	Requirement	Notes
CPU	4 cores	For video encoding/decoding
RAM	8 GB	16 GB recommended
Storage	1 GB free	For dependencies and temp files
Camera	USB or built-in	For webcam examples
Microphone	Any	For audio examples

Camera Setup

Built-in Webcam (Laptop)

Works out of the box. Device ID is typically 0.

from src.inputs import WebcamInput
webcam = WebcamInput(device_id=0)

External USB Camera

# List available cameras (macOS)
system_profiler SPCameraDataType

# List available cameras (Linux)
v4l2-ctl --list-devices

# Usually device_id=1 for external camera
webcam = WebcamInput(device_id=1)

IP Camera (RTSP)

from src.inputs import RTSPInput

# Common RTSP URL formats:
# Hikvision: rtsp://admin:password@192.168.1.100:554/Streaming/Channels/101
# Dahua: rtsp://admin:password@192.168.1.100:554/cam/realmonitor?channel=1&subtype=0
# Generic: rtsp://user:pass@ip:port/path

camera = RTSPInput(
    url="rtsp://admin:password@192.168.1.100:554/stream",
    fps=1.0,
    auto_reconnect=True
)
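
Because RTSP URLs carry the camera password inline, it is worth masking them before they reach logs or error messages. The following is a minimal standard-library sketch; `mask_rtsp_credentials` is a hypothetical helper, not something the framework is known to provide, and it does not handle bracketed IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_rtsp_credentials(url: str) -> str:
    """Replace the password in a URL with '***', leaving the rest intact."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # nothing to hide
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(mask_rtsp_credentials("rtsp://admin:password@192.168.1.100:554/stream"))
# prints: rtsp://admin:***@192.168.1.100:554/stream
```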

Testing Camera

# Quick camera test
import cv2

cap = cv2.VideoCapture(0)  # Try 0, 1, 2...
if cap.isOpened():
    ret, frame = cap.read()
    print(f"Camera working: {ret}, Frame shape: {frame.shape if ret else 'N/A'}")
    cap.release()
else:
    print("Camera not found")

Microphone Setup

Grant Permissions (macOS)

System Settings > Privacy & Security > Microphone (System Preferences > Security & Privacy on macOS 12 and earlier)
Enable for Terminal.app or your IDE

List Audio Devices

import sounddevice as sd
print(sd.query_devices())

Output:

   0 MacBook Pro Microphone, Core Audio (1 in, 0 out)
>  1 MacBook Pro Speakers, Core Audio (0 in, 2 out)
   2 External USB Mic, Core Audio (1 in, 0 out)

Use the index as device_id:

from src.inputs import MicrophoneInput
mic = MicrophoneInput(device_id=0)  # Built-in mic
mic = MicrophoneInput(device_id=2)  # External USB mic

Testing Microphone

import sounddevice as sd
import numpy as np

duration = 3  # seconds
sample_rate = 16000

print("Recording...")
audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
sd.wait()
print(f"Recorded {len(audio)} samples, max amplitude: {np.max(np.abs(audio)):.4f}")

Running the Examples

Example 1: Basic Webcam

Requirements: OpenAI API key, webcam

python examples/01_basic_webcam.py

What it does:

Captures frames from webcam every 3 seconds
Sends to GPT-4o-mini for description
Prints observations to console

Expected output:

Starting basic webcam agent...
Press Ctrl+C to stop

[Agent] I see a person sitting at a desk with a laptop, appears to be in a home office setting.

[Agent] The scene is similar, the person is now typing on the keyboard.
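
The capture, describe, print cycle of Example 1 can be pictured as a simple async loop. This is an illustrative sketch only, not the code in `examples/01_basic_webcam.py`: `run_loop`, `fake_capture`, and `fake_describe` are hypothetical names, and the stand-in functions replace the webcam and the model call so the loop runs without hardware or an API key.

```python
import asyncio


async def run_loop(capture, describe, interval_s=3.0, max_iterations=None):
    """Capture a frame, describe it, print the result, then wait."""
    observations = []
    count = 0
    while max_iterations is None or count < max_iterations:
        frame = await capture()
        text = await describe(frame)
        print(f"[Agent] {text}")
        observations.append(text)
        count += 1
        await asyncio.sleep(interval_s)  # pacing between frames
    return observations


# Stand-ins so the sketch runs without a camera or API key
async def fake_capture():
    return b"fake-frame-bytes"


async def fake_describe(frame):
    return f"received a frame of {len(frame)} bytes"


if __name__ == "__main__":
    asyncio.run(run_loop(fake_capture, fake_describe, interval_s=0.1, max_iterations=2))
```

The interval is the main cost lever: each iteration is one model request, so a 3-second interval means roughly 20 requests per minute.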

Example 2: Security Monitor

Requirements: Anthropic or OpenAI API key, webcam or RTSP camera, Slack webhook (optional)

# With webcam (demo mode)
python examples/02_security_monitor.py

# With RTSP camera
RTSP_URL="rtsp://user:pass@192.168.1.100:554/stream" python examples/02_security_monitor.py

What it does:

Monitors camera feed every 5 seconds
Detects people, unusual activity, hazards
Sends Slack alerts when something is detected

Expected output:

==================================================
Security Monitor
==================================================
  ✓ Anthropic (claude-3-5-haiku)
Slack alerts enabled
Using webcam as demo...

Monitoring started. Press Ctrl+C to stop.
--------------------------------------------------
[Observation] The frame shows an empty room with a desk and chair. No people or unusual activity detected.
--------------------------------------------------
[Observation] A person has entered the frame from the left side. They appear to be walking toward the desk.
[Alert Triggered] send_slack_alert: Person detected entering monitored area
[Alert Sent] ✓
--------------------------------------------------

Example 3: Quality Inspector

Requirements: Google or OpenAI API key, webcam, Notion API (optional)

python examples/03_quality_inspector.py

What it does:

Simulates manufacturing line inspection
Analyzes each frame for defects
Logs all inspections to Notion
Triggers PLC reject on failures (simulated)

Expected output:

============================================================
Manufacturing Quality Inspector
============================================================
Using Gemini 1.5 Flash (cost-optimized)
Notion logging disabled (set NOTION_API_KEY and NOTION_DATABASE_ID)
PLC control enabled (simulation mode)

------------------------------------------------------------
Quality inspection started. Press Ctrl+C to stop.
------------------------------------------------------------

[Inspector] Analyzing product in frame. The item appears to be...
[✓ PASSED] Inspection #2024-01-15T10:23:45
[Stats] Total: 1 | Passed: 1 | Failed: 0 | Pass Rate: 100.0% | Rate: 12.0/min

[Inspector] Analyzing product in frame. I notice a visible scratch...
[✗ FAILED] Inspection #2024-01-15T10:23:47
   Reason: Surface scratch detected on upper left quadrant, approximately 2cm long
[PLC] Reject triggered - Register 100

Example 4: Meeting Assistant

Requirements: OpenAI API key (for GPT-4o and Whisper), webcam, microphone

python examples/04_meeting_assistant.py

What it does:

Records audio continuously
Captures video periodically (every 30s)
Transcribes speech with Whisper
Extracts action items and decisions
Logs to Notion, sends summaries to Slack

Expected output:

============================================================
Meeting Assistant
============================================================
Using GPT-4o for meeting analysis
Notion action items enabled
Slack summaries enabled

Using microphone + webcam

------------------------------------------------------------
Meeting recording started. Press Ctrl+C to end meeting.
------------------------------------------------------------

[Assistant] The meeting has begun. I can see 3 people in the room...

📋 ACTION ITEM: [ACTION] John to prepare Q4 budget proposal
[Duration: 5 min | Action Items: 1 | Decisions: 0]

✅ DECISION: [DECISION] Team agreed to postpone launch to March
[Duration: 8 min | Action Items: 1 | Decisions: 1]

📤 Slack Summary Sent
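
The expected output suggests that action items and decisions are marked with `[ACTION]` and `[DECISION]` tags. Assuming the model response uses those tags one per line (an assumption based only on the sample output), pulling them out could look like the sketch below; `extract_tagged_items` is a hypothetical helper.

```python
import re

TAG_PATTERN = re.compile(r"\[(ACTION|DECISION)\]\s*(.+)")


def extract_tagged_items(text: str) -> dict:
    """Group the text following each [ACTION] or [DECISION] tag by tag name."""
    items = {"ACTION": [], "DECISION": []}
    for line in text.splitlines():
        match = TAG_PATTERN.search(line)
        if match:
            items[match.group(1)].append(match.group(2).strip())
    return items
```

Lines without a tag are ignored, so free-form commentary from the model does not end up in the action-item list.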

Example 5: Benchmark Providers

Requirements: API keys for providers you want to test

python examples/05_benchmark_providers.py

What it does:

Tests all available providers
Runs standardized scenarios
Generates latency and cost tables
Saves results to JSON

Expected output:

======================================================================
Multimodal Model Benchmark
======================================================================

Checking available providers...
  ✓ OpenAI (gpt-4o, gpt-4o-mini)
  ✓ Anthropic (claude-3-5-haiku, claude-3-5-sonnet)
  ✓ Google (gemini-1.5-flash)
  ✗ Groq (set GROQ_API_KEY)
  ✗ Fireworks (set FIREWORKS_API_KEY)
  ✗ Together (set TOGETHER_API_KEY)

Running benchmarks on 5 models...
Scenarios: single_frame, multi_frame, detailed_analysis, tool_calling

----------------------------------------------------------------------
Starting benchmarks (this may take a few minutes)...
----------------------------------------------------------------------

Benchmarking openai/gpt-4o-mini - single_frame
Benchmarking openai/gpt-4o-mini - multi_frame
...

======================================================================
Results
======================================================================

LATENCY (p50, milliseconds)
----------------------------------------------------------------------
| Provider | Model | single_frame | multi_frame | tool_calling |
|---|---|---:|---:|---:|
| openai | gpt-4o-mini | 423ms | 612ms | 489ms |
| openai | gpt-4o | 834ms | 1156ms | 923ms |
| anthropic | claude-3-5-haiku-latest | 367ms | 542ms | 421ms |
...

Results saved to: benchmarks/results/benchmark_results.json
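
The latency table reports p50, the median of the measured runs. As a rough illustration of how such a figure can be derived (the benchmark script may compute it differently), the sketch below times a callable and summarizes a list of samples. `time_call_ms` and `summarize_latencies` are hypothetical names.

```python
import statistics
import time


def time_call_ms(fn, *args, **kwargs) -> float:
    """Return the wall-clock duration of one call in milliseconds."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return (time.perf_counter() - start) * 1000.0


def summarize_latencies(samples_ms):
    """Return p50 (median), min, max, and run count for latency samples."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    return {
        "p50": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "runs": len(ordered),
    }
```

The median is less sensitive than the mean to a single slow request, which is why it is a common headline number for API latency.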

Testing Individual Components

Test Input Sources

# Test webcam
import asyncio
from src.inputs import WebcamInput

async def test_webcam():
    webcam = WebcamInput(device_id=0, fps=1.0)
    count = 0
    async for frame in webcam.stream():
        print(f"Frame {count}: shape={frame.shape}, source={frame.source}")
        count += 1
        if count >= 3:
            break
    await webcam.close()

asyncio.run(test_webcam())

# Test microphone
import asyncio
from src.inputs import MicrophoneInput

async def test_mic():
    mic = MicrophoneInput(sample_rate=16000, chunk_duration=2.0)
    count = 0
    async for chunk in mic.stream():
        print(f"Chunk {count}: samples={len(chunk.data)}, duration={chunk.duration_seconds:.2f}s")
        count += 1
        if count >= 3:
            break
    await mic.close()

asyncio.run(test_mic())

Test Model Providers

# Test model with a single frame
import asyncio
import numpy as np
from src.models import create_model
from src.core.types import Frame

async def test_model(provider, model_id):
    model = create_model(provider, model_id)

    # Create a test frame (random colored image)
    test_frame = Frame(
        data=np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8),
        source="test"
    )

    print(f"Testing {provider}/{model_id}...")
    async for event in model.analyze(
        frames=[test_frame],
        audio_transcript=None,
        tools=[],
        context=[],
        system_prompt="Describe this image briefly."
    ):
        print(f"  Event: {type(event).__name__}")
        if hasattr(event, 'content'):
            print(f"  Content: {event.content[:100]}...")

# Test each provider
asyncio.run(test_model("openai", "gpt-4o-mini"))
asyncio.run(test_model("anthropic", "claude-3-5-haiku-latest"))
asyncio.run(test_model("google", "gemini-1.5-flash"))

Test Tools

# Test Slack tool (requires SLACK_WEBHOOK_URL)
import asyncio
import os
from dotenv import load_dotenv
from src.tools import SlackAlertTool

load_dotenv()

async def test_slack():
    tool = SlackAlertTool(
        webhook_url=os.getenv("SLACK_WEBHOOK_URL"),
        default_channel="#test-alerts"
    )

    result = await tool.execute(
        message="Test alert from multimodal agent",
        severity="info"
    )

    print(f"Slack result: {result}")

asyncio.run(test_slack())

# Test Notion tool (requires NOTION_API_KEY and NOTION_DATABASE_ID)
import asyncio
import os
from dotenv import load_dotenv
from src.tools import NotionRunSheetTool

load_dotenv()

async def test_notion():
    tool = NotionRunSheetTool(
        api_key=os.getenv("NOTION_API_KEY"),
        database_id=os.getenv("NOTION_DATABASE_ID")
    )

    result = await tool.execute(
        title="Test Entry from Multimodal Agent",
        status="completed",
        notes="This is a test entry created by the setup script."
    )

    print(f"Notion result: {result}")

asyncio.run(test_notion())

Test Memory

from src.memory import SlidingWindowMemory
from src.core.types import Message
from datetime import datetime

memory = SlidingWindowMemory(max_messages=5)

# Add some messages
for i in range(7):
    memory.add(Message(
        role="user" if i % 2 == 0 else "assistant",
        content=f"Message {i}",
        timestamp=datetime.now()
    ))

# Check what's retained
context = memory.get_context()
print(f"Messages in memory: {len(context)}")
for msg in context:
    print(f"  {msg.role}: {msg.content}")
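
If `SlidingWindowMemory` drops the oldest entries once the limit is reached (an assumption based on its name and the `max_messages` parameter), the test above should report 5 messages, `Message 2` through `Message 6`. The same behavior can be reproduced with a bounded `deque` from the standard library:

```python
from collections import deque

# A deque with maxlen discards items from the opposite end when full
window = deque(maxlen=5)  # same capacity as max_messages=5
for i in range(7):
    window.append(f"Message {i}")

print(list(window))
# prints: ['Message 2', 'Message 3', 'Message 4', 'Message 5', 'Message 6']
```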

Common Issues & Troubleshooting

Camera Issues

Problem: cv2.VideoCapture(...).isOpened() returns False

cap = cv2.VideoCapture(0)
print(cap.isOpened())  # False

Solutions:

Check permissions (macOS):
- System Settings > Privacy & Security > Camera (System Preferences > Security & Privacy on macOS 12 and earlier)
- Enable for Terminal.app

Try different device IDs:

import cv2

for i in range(5):
    cap = cv2.VideoCapture(i)
    if cap.isOpened():
        print(f"Camera found at index {i}")
    cap.release()  # Release the handle whether or not the open succeeded

Check if camera is in use by another app:
```
# macOS
lsof | grep -i camera
```

Restart camera service (macOS):

sudo killall VDCAssistant
sudo killall AppleCameraAssistant

Microphone Issues

Problem: sounddevice.PortAudioError

Solutions:

Install PortAudio:

# macOS
brew install portaudio

# Then reinstall sounddevice
pip uninstall sounddevice
pip install sounddevice

Check permissions (macOS):
- System Settings > Privacy & Security > Microphone (System Preferences > Security & Privacy on macOS 12 and earlier)
- Enable for Terminal.app

List devices and use explicit ID:

import sounddevice as sd
print(sd.query_devices())
# Use the correct index in MicrophoneInput(device_id=X)

API Errors

Problem: openai.AuthenticationError

Solution: Check API key is set correctly:

import os
from dotenv import load_dotenv
load_dotenv()
print(f"Key starts with: {os.getenv('OPENAI_API_KEY', '')[:10]}...")

Problem: anthropic.RateLimitError

Solution: Add delays between requests or upgrade plan:

import asyncio

# Inside the async function that issues the requests:
await asyncio.sleep(1)  # Pause between requests
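
A fixed delay is the simplest fix; retrying with exponential backoff usually recovers faster while still easing pressure on the API. The sketch below is a generic pattern, not something the framework is known to ship: `call_with_backoff` is a hypothetical helper that wraps any zero-argument coroutine function.

```python
import asyncio
import random


async def call_with_backoff(make_request, max_attempts=5, base_delay=1.0,
                            retry_on=(Exception,)):
    """Await make_request(), retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return await make_request()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
```

In practice, narrow `retry_on` to the rate-limit exception of the provider in use, for example `retry_on=(anthropic.RateLimitError,)`, so that authentication or validation errors fail immediately instead of being retried.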

Problem: google.api_core.exceptions.ResourceExhausted

Solution: Gemini has strict rate limits. Add delays:

config = AgentConfig(frame_interval_ms=5000)  # Slow down

Import Errors

Problem: ModuleNotFoundError: No module named 'src'

Solution: Run from project root or add to path:

import sys
sys.path.insert(0, '/path/to/multimodal-python-stack')

Or set PYTHONPATH:

export PYTHONPATH="${PYTHONPATH}:/path/to/multimodal-python-stack"

Memory/Performance Issues

Problem: High memory usage with video

Solution: Reduce frame size and buffer:

webcam = WebcamInput(
    max_size=256,  # Smaller frames
    fps=0.5,       # Slower capture
)

config = AgentConfig(
    max_frames=2,  # Keep fewer frames
    max_context_messages=10,  # Smaller context
)
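
To get a feel for how much `max_size` saves, the arithmetic can be worked through directly. The sketch assumes `max_size` caps the longer side of the frame while preserving aspect ratio, which is a guess at the parameter's meaning rather than documented behavior; `scaled_size` and `rgb_bytes` are hypothetical helpers.

```python
def scaled_size(width: int, height: int, max_size: int) -> tuple[int, int]:
    """Shrink (width, height) so the longer side is at most max_size."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # already small enough
    scale = max_size / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def rgb_bytes(width: int, height: int) -> int:
    """Size of one uncompressed 8-bit RGB frame in bytes."""
    return width * height * 3


w, h = scaled_size(1920, 1080, 256)
print(w, h, rgb_bytes(1920, 1080), rgb_bytes(w, h))
# prints: 256 144 6220800 110592
```

Under that assumption, a 1080p frame shrinks from about 6.2 MB to about 0.11 MB uncompressed, which is roughly a 56x reduction per buffered frame.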

Problem: Slow response times

Solutions:

Use faster models:

model = create_model("groq", "llama-3.2-11b-vision-preview")  # Fastest
model = create_model("google", "gemini-1.5-flash")  # Fast and cheap

Reduce frame size:
```
webcam = WebcamInput(max_size=256)
```

Use low detail mode (OpenAI):

model = OpenAIVisionModel(model_id="gpt-4o-mini", image_detail="low")

Development Workflow

Running Tests

# Install dev dependencies
pip install pytest pytest-asyncio

# Run all tests
pytest tests/

# Run specific test
pytest tests/test_core.py -v

# Run with coverage
pip install pytest-cov
pytest --cov=src tests/

Code Formatting

# Install ruff
pip install ruff

# Format code
ruff format .

# Check linting
ruff check .

# Fix auto-fixable issues
ruff check --fix .

Type Checking

# Install mypy
pip install mypy

# Run type checker
mypy src/

Adding a New Model Provider

Create src/models/newprovider.py:

from src.models.base import VisionLanguageModel, ModelInfo

class NewProviderVisionModel(VisionLanguageModel):
    provider = "newprovider"

    MODELS = {
        "model-name": ModelInfo(
            model_id="model-name",
            provider="newprovider",
            display_name="Model Name",
            max_images=10,
            supports_video=False,
            supports_tools=True,
            cost_per_1k_input=0.001,
            cost_per_1k_output=0.002,
            context_window=128000,
        ),
    }

    async def analyze(self, frames, audio_transcript, tools, context, system_prompt):
        # Implementation
        ...

Register the provider in the module that defines the PROVIDERS mapping:

from src.models.newprovider import NewProviderVisionModel

PROVIDERS = {
    # ...existing providers...
    "newprovider": NewProviderVisionModel,
}

Add API key to .env.example:

NEWPROVIDER_API_KEY=...

Adding a New Tool

Create src/tools/newtool.py:

from src.tools.base import Tool
from src.core.types import ToolResult

class NewTool(Tool):
    name = "new_tool"
    description = "Does something useful"
    parameters = {
        "type": "object",
        "properties": {
            "arg1": {"type": "string", "description": "First argument"},
        },
        "required": ["arg1"]
    }

    async def execute(self, arg1: str, **kwargs) -> ToolResult:
        # Implementation
        return ToolResult(output={"result": "success"})

Export the tool from the src.tools package (its __init__.py):

from src.tools.newtool import NewTool
__all__ = [..., "NewTool"]

Cost Management

Estimating Costs

Use the cost calculator:

from src.models import create_model

model = create_model("openai", "gpt-4o-mini")

# Estimate cost for a session
frames_per_minute = 12  # 1 frame every 5 seconds
minutes = 60
total_frames = frames_per_minute * minutes

# ~85 tokens per image (low detail)
# ~50 tokens output per response
tokens_in = total_frames * 85
tokens_out = total_frames * 50

cost_in = (tokens_in / 1000) * model.cost_per_1k_input_tokens
cost_out = (tokens_out / 1000) * model.cost_per_1k_output_tokens
total_cost = cost_in + cost_out

print(f"Estimated cost for 1 hour: ${total_cost:.2f}")
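
The same estimate can be made without instantiating a model by plugging in the per-million-token prices listed under Getting API Keys. This is a standalone sketch; `estimate_session_cost` is a hypothetical function, and the token counts are the same rough figures used above (85 tokens per low-detail image, 50 tokens per response).

```python
def estimate_session_cost(frames_per_minute, minutes,
                          price_in_per_1m, price_out_per_1m,
                          tokens_per_image=85, tokens_per_response=50):
    """Estimate USD cost from frame rate, duration, and per-1M-token prices."""
    total_frames = frames_per_minute * minutes
    tokens_in = total_frames * tokens_per_image
    tokens_out = total_frames * tokens_per_response
    return (tokens_in * price_in_per_1m + tokens_out * price_out_per_1m) / 1_000_000


# GPT-4o-mini prices from this guide: $0.15 input, $0.60 output per 1M tokens
print(f"${estimate_session_cost(12, 60, 0.15, 0.60):.4f}")
# prints: $0.0308
```

The result only counts image and response tokens. System prompts, conversation context, and higher image detail settings all add input tokens, so treat the figure as a lower bound.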

Cost Comparison (1 hour of monitoring)

Provider	Model	Est. Cost/Hour
Google	gemini-1.5-flash	$0.03
Groq	llama-3.2-11b	$0.05
OpenAI	gpt-4o-mini	$0.15
Anthropic	claude-3.5-haiku	$0.60
OpenAI	gpt-4o	$3.00
Anthropic	claude-3.5-sonnet	$4.50

Setting Budget Alerts

class BudgetTracker:
    """Tracks cumulative API spend against a fixed budget."""

    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def add_cost(self, cost: float) -> None:
        self.spent += cost
        # Check the hard limit first so an overrun raises instead of also warning
        if self.spent > self.budget:
            raise RuntimeError(f"Budget exceeded: ${self.spent:.2f}/${self.budget:.2f}")
        if self.spent > self.budget * 0.8:
            print(f"WARNING: 80% of budget used (${self.spent:.2f}/${self.budget:.2f})")

# Usage
tracker = BudgetTracker(budget_usd=10.0)
# Call tracker.add_cost() after each API call

Production Deployment

Docker Setup

# Dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    libportaudio2 \
    libgl1 \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source code
COPY src/ src/
COPY examples/ examples/

# Run
CMD ["python", "examples/02_security_monitor.py"]

# Build and run (--device passes the host camera into the container; Linux hosts only)
docker build -t multimodal-agent .
docker run --env-file .env --device /dev/video0 multimodal-agent

Environment Variables for Production

# Production .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Logging
LOG_LEVEL=INFO
LOG_FORMAT=json

# Rate limiting
MAX_REQUESTS_PER_MINUTE=60

# Cost controls
BUDGET_USD_PER_HOUR=5.0

# Monitoring
SENTRY_DSN=https://...
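
Whether the framework reads these variables itself is not shown here, so application code may need to parse them. The sketch below is one way to load them into a typed object; `Settings` and `load_settings` are hypothetical names, and the defaults simply mirror the sample values above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    log_level: str
    log_format: str
    max_requests_per_minute: int
    budget_usd_per_hour: float


def load_settings(env=os.environ) -> Settings:
    """Parse production settings from a mapping, falling back to defaults."""
    return Settings(
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        log_format=env.get("LOG_FORMAT", "json"),
        # int() and float() raise ValueError on malformed values, failing fast
        max_requests_per_minute=int(env.get("MAX_REQUESTS_PER_MINUTE", "60")),
        budget_usd_per_hour=float(env.get("BUDGET_USD_PER_HOUR", "5.0")),
    )
```

Parsing once at startup means a typo such as `BUDGET_USD_PER_HOUR=five` stops the process immediately rather than surfacing hours into a run.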

Health Checks

# healthcheck.py
import asyncio
from src.models import create_model

async def check_health():
    checks = {}

    # Check OpenAI
    try:
        model = create_model("openai", "gpt-4o-mini")
        # Quick test: this likely checks configuration only, not API reachability
        checks["openai"] = "ok"
    except Exception as e:
        checks["openai"] = f"error: {e}"

    # Check camera
    try:
        import cv2
        cap = cv2.VideoCapture(0)
        checks["camera"] = "ok" if cap.isOpened() else "error: not found"
        cap.release()
    except Exception as e:
        checks["camera"] = f"error: {e}"

    return checks

if __name__ == "__main__":
    results = asyncio.run(check_health())
    for check, status in results.items():
        print(f"{check}: {status}")
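
Container orchestrators and Docker's `HEALTHCHECK` instruction decide health from the process exit code (0 for healthy, 1 for unhealthy), not from printed text. A small helper can translate the `checks` dictionary into that code. This is a sketch; `exit_code_for` is a hypothetical name, and it treats an empty result as unhealthy.

```python
def exit_code_for(checks: dict) -> int:
    """Return 0 when every check reports 'ok', otherwise 1."""
    healthy = bool(checks) and all(status == "ok" for status in checks.values())
    return 0 if healthy else 1
```

Ending the `__main__` block of `healthcheck.py` with `sys.exit(exit_code_for(results))` (after `import sys`) makes the script usable as a health probe.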

Logging Setup

import structlog
import logging

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

# Usage
logger.info("agent_started", model="gpt-4o-mini", input="webcam")
logger.info("frame_processed", latency_ms=423, tokens=156)
logger.warning("rate_limit_approaching", requests=58, limit=60)
logger.error("api_error", provider="openai", error="timeout")

Quick Reference

Start Commands

# Basic webcam
python examples/01_basic_webcam.py

# Security monitor
python examples/02_security_monitor.py

# Quality inspector
python examples/03_quality_inspector.py

# Meeting assistant
python examples/04_meeting_assistant.py

# Benchmarks
python examples/05_benchmark_providers.py

Environment Variables

Variable	Required For	Example
`OPENAI_API_KEY`	OpenAI, Whisper	`sk-proj-...`
`ANTHROPIC_API_KEY`	Anthropic	`sk-ant-...`
`GOOGLE_API_KEY`	Gemini	`AIza...`
`GROQ_API_KEY`	Groq	`gsk_...`
`SLACK_WEBHOOK_URL`	Slack alerts	`https://hooks.slack.com/...`
`NOTION_API_KEY`	Notion logging	`secret_...`
`NOTION_DATABASE_ID`	Notion logging	`abc123...`
`RTSP_URL`	IP cameras	`rtsp://user:pass@ip:port/path`

Model Quick Reference

Use Case	Recommended	Command
Cheapest	Gemini Flash	`create_model("google", "gemini-1.5-flash")`
Fastest	Groq Llama	`create_model("groq", "llama-3.2-11b-vision-preview")`
Best balance	GPT-4o-mini	`create_model("openai", "gpt-4o-mini")`
Best quality	GPT-4o	`create_model("openai", "gpt-4o")`
Best reasoning	Claude Sonnet	`create_model("anthropic", "claude-3-5-sonnet-latest")`
