Project: Auteur AI (Flow Edition) - Production Spec
Version: 2.1 (Expanded Production Build)
Target Platform: Full-Stack Web Application
Infrastructure: Self-Hosted Linux Environment (Docker Compose / K8s)
Goal: Create a functional pre-production asset management suite for Google Flow (Veo 3.1).
1. System Architecture & Infrastructure
Since resources are unlimited and the target environment is a robust self-hosted Linux cluster, we will run a microservices-ready monolithic structure via Docker. This architecture prioritizes data privacy (keeping scripts and assets local), low latency for heavy asset manipulation, and the flexibility to scale individual components (like the AI inference engine) without refactoring the entire stack.
The Stack
- Frontend: React 18 (built via Vite). We will utilize TypeScript for strict type safety across the complex JSON data structures required by Google Flow.
- Styling: TailwindCSS for utility-first styling combined with Shadcn/UI (Radix Primitives) for accessible, keyboard-navigable components.
- State Management: TanStack Query (React Query) is critical here. It will handle server-state caching, deduping requests, and managing the "loading" and "error" states of asynchronous AI operations. We will use Zustand for transient client-side state (e.g., dragging an ingredient into a slot).
- Backend: Python (FastAPI).
- Rationale: While Node.js is capable, Python is the native language of AI. Using FastAPI allows us to integrate directly with libraries like langchain, llama-index, or raw transformers pipelines if we decide to move beyond API-based LLMs in the future. FastAPI also provides automatic OpenAPI (Swagger) documentation and high-performance async support via Starlette.
- Database: PostgreSQL 16.
- Rationale: We need a robust relational database to manage the strict hierarchy of Projects -> Scenes -> Shots. PostgreSQL's binary JSON (JSONB) support is essential for storing the flexible metadata associated with AI assets and the complex, nested JSON payloads generated for Veo.
- Object Storage: MinIO.
- Rationale: A self-hosted, S3-compatible object storage server. This allows us to handle gigabytes of video references and high-res character sheets without clogging the database or the application server's file system. It supports pre-signed URLs, offloading file serving traffic directly to the client.
- AI Inference: Local Ollama instance.
- Rationale: Running Llama 3 (8B or 70B parameters) or Mistral locally ensures zero data leakage. The API will communicate with Ollama via HTTP requests, allowing for easy model swapping (e.g., testing codellama for JSON generation vs. llama3 for creative writing).
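As a sketch of how model swapping stays cheap, here is a minimal helper that builds the request body for Ollama's /api/generate HTTP endpoint. The endpoint path and field names follow Ollama's documented API; the default model name and temperature are assumptions for illustration.

```python
import json

# Default Ollama endpoint (port 11434); adjust for your deployment.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_payload(prompt: str, model: str = "llama3",
                           temperature: float = 0.2) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint.

    Swapping models (e.g. codellama vs. llama3) is a one-string change,
    which is what makes local inference easy to experiment with.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                       # one complete response
        "format": "json",                      # ask Ollama for valid JSON only
        "options": {"temperature": temperature},
    }

payload = build_generate_payload("Describe the shot.", model="codellama")
body = json.dumps(payload)  # ready to POST to OLLAMA_URL
```

Because the payload is a plain dict, the same helper serves both the creative-writing path (higher temperature) and the strict JSON-generation path (low temperature, Module 3).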
Docker Compose Services
The production docker-compose.yml will orchestrate the following interconnected services:
- frontend: A high-performance Nginx container serving the static React build. It also acts as a reverse proxy, routing /api requests to the backend and eliminating CORS issues.
- backend: The FastAPI application server running via uvicorn (Port 8000). It acts as the orchestrator.
- db: PostgreSQL 16 (Port 5432) with a persistent volume for data safety.
- minio: The S3-compatible storage engine (Port 9000 for API, 9001 for Console).
- redis: Redis (Port 6379). This is crucial for a robust production app: it serves as the message broker for Celery or ARQ (async task queues). When a user uploads a 4K video or requests a full script breakdown, the API offloads this heavy lifting to a background worker to keep the interface snappy.
- worker: A Python container running the background task consumer (Celery/ARQ) to process video thumbnails and long-running LLM inference tasks.
2. Database Schema (PostgreSQL)
The database schema needs to be robust enough to handle the relationships between creative entities. The agent must create migrations (using Alembic for Python) for the following schema. Note the addition of indices for performance.
-- Enum for strict typing of asset categories, critical for the "Slot" system
CREATE TYPE asset_type AS ENUM ('Character', 'Location', 'Object', 'Style');
-- Projects: The top-level container
CREATE TABLE projects (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
resolution TEXT DEFAULT '4K', -- e.g., '3840x2160'
aspect_ratio TEXT DEFAULT '16:9',
veo_version TEXT DEFAULT '3.1',
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- Ingredients: The reusable assets (Actors, Sets, Props)
CREATE TABLE ingredients (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID REFERENCES projects(id) ON DELETE CASCADE,
name TEXT NOT NULL,
type asset_type NOT NULL,
s3_key TEXT NOT NULL, -- The path in the MinIO bucket
s3_bucket TEXT DEFAULT 'auteur-assets',
thumbnail_key TEXT, -- Path to a generated low-res thumbnail
metadata JSONB DEFAULT '{}', -- Stores AI-generated tags (e.g., {"hair": "blue", "mood": "dark"})
created_at TIMESTAMP DEFAULT NOW()
);
-- Index for faster filtering by type within a project
CREATE INDEX idx_ingredients_project_type ON ingredients(project_id, type);
-- Scenes: Logical groupings within the script
CREATE TABLE scenes (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID REFERENCES projects(id) ON DELETE CASCADE,
slugline TEXT NOT NULL, -- e.g., "INT. SERVER ROOM - NIGHT"
raw_content TEXT, -- The full text body of the scene
sequence_number INT NOT NULL, -- For ordering scenes in the UI
created_at TIMESTAMP DEFAULT NOW()
);
-- Shots: The atomic unit of generation
CREATE TABLE shots (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
scene_id UUID REFERENCES scenes(id) ON DELETE CASCADE,
description TEXT NOT NULL, -- The visual description
duration FLOAT, -- Estimated duration in seconds
sequence_number INT, -- Order within the scene
-- "The Slot System": JSONB array of Ingredient UUIDs assigned to slots 1, 2, 3.
-- Example: ["uuid-char-1", "uuid-loc-2", null]
assigned_ingredients JSONB DEFAULT '[]',
-- The computed prompt context sent to the LLM
llm_context_cache TEXT,
-- The final output for Google Flow/Veo
veo_json_payload JSONB,
status TEXT DEFAULT 'draft', -- 'draft', 'generating_json', 'ready'
updated_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_shots_scene ON shots(scene_id);
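To make the "Slot System" concrete, here is a hypothetical Python helper showing how the assigned_ingredients JSONB array (ingredient UUIDs or nulls) might be resolved against ingredient rows before prompt construction. The function and variable names are illustrative, not part of the spec.

```python
from typing import Optional

def resolve_slots(assigned: list[Optional[str]],
                  ingredients_by_id: dict[str, dict]) -> list[Optional[dict]]:
    """Map the JSONB slot array (ingredient UUIDs or null) to full rows.

    A null slot stays None; an unknown UUID raises, surfacing stale
    references before they reach the LLM context builder.
    """
    resolved: list[Optional[dict]] = []
    for slot_id in assigned:
        if slot_id is None:
            resolved.append(None)
        elif slot_id in ingredients_by_id:
            resolved.append(ingredients_by_id[slot_id])
        else:
            raise KeyError(f"unknown ingredient id in slot: {slot_id}")
    return resolved

# Mirrors the schema comment: ["uuid-char-1", "uuid-loc-2", null]
rows = {"uuid-char-1": {"name": "Ava", "type": "Character"}}
slots = resolve_slots(["uuid-char-1", None], rows)
```

Raising on an unknown UUID (rather than silently emitting None) is a deliberate choice: ON DELETE CASCADE removes ingredient rows but not the UUIDs embedded in shots' JSONB, so dangling references are a real failure mode to detect.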
3. API Module Specifications
Module 1: Asset Library (Real Uploads & Processing)
Endpoint: POST /api/assets/upload
Logic Flow:
1. Validation: Frontend sends FormData containing the file and metadata (project_id, type). Backend validates file type (image/png, image/jpeg) and size limits.
2. Storage: Backend streams the file directly to the MinIO bucket auteur-assets using boto3 or minio-py. It generates a unique object key (e.g., proj_id/uuid.jpg).
3. Database: A record is created in the ingredients table with the s3_key.
4. Background Processing (Async): A task is pushed to the Redis queue to:
   - Generate a 200px thumbnail for the UI.
   - (Optional) Run a "Vision" LLM task (using Ollama's llava model) to auto-caption the image and populate the metadata JSONB field (e.g., "A robotic dog standing in rain").
5. Return: The API returns the new Asset object, including a pre-signed URL for immediate display.
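The validation and key-generation steps of this flow can be sketched as a small pure function. The allowed types come from the spec; the 20 MB cap and function name are assumptions for illustration.

```python
import uuid

# Allowed upload types per the spec; the size cap is an assumed limit.
ALLOWED_TYPES = {"image/png": ".png", "image/jpeg": ".jpg"}
MAX_BYTES = 20 * 1024 * 1024  # 20 MB

def make_object_key(project_id: str, content_type: str, size: int) -> str:
    """Validate an upload and build a unique MinIO object key.

    Rejects disallowed types and oversized files, then namespaces the
    object under the project id (e.g. proj_id/uuid.jpg).
    """
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported content type: {content_type}")
    if size > MAX_BYTES:
        raise ValueError("file exceeds size limit")
    return f"{project_id}/{uuid.uuid4().hex}{ALLOWED_TYPES[content_type]}"

key = make_object_key("proj-123", "image/jpeg", 1024)
```

Keeping this logic separate from the FastAPI handler makes it trivially unit-testable, while the handler itself only streams bytes to MinIO and writes the ingredients row.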
Module 2: Intelligent Script Parser (The "Ingestion Engine")
Endpoint: POST /api/scripts/parse
Logic Flow:
1. Ingest: User uploads a .txt or .fountain screenplay file.
2. Preprocessing: Backend reads the text. If it is a large script, it chunks it by scene headers (INT., EXT.).
3. AI Analysis (Ollama): The content is sent to the local LLM.
   - System Prompt: "You are a Script Supervisor. Break the following screenplay text into a structured JSON array of shots. Identify the action lines that denote visual changes. Ignore dialogue unless it implies visual action."
   - Schema Enforcement: We will use Pydantic models to validate that the LLM's output matches the expected JSON structure (Shot Description, Estimated Duration).
4. Persistence: The backend iterates through the validated JSON array and performs a bulk insert into the scenes and shots tables, ensuring sequence_numbers are preserved.
5. Notification: The frontend is notified (via polling or WebSocket) that the script is ready for review.
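The preprocessing step — chunking a script by scene headers — can be sketched with a regex split. This is a minimal illustration assuming Fountain-style sluglines at line start; the helper name is hypothetical.

```python
import re

# Fountain-style sluglines at the start of a line: INT., EXT., INT./EXT.
SLUGLINE = re.compile(r"^(INT\./EXT\.|INT\.|EXT\.)", re.MULTILINE)

def chunk_by_scene(script: str) -> list[dict]:
    """Split raw screenplay text into scene chunks keyed by slugline.

    Each chunk maps onto one `scenes` row; sequence numbers come from
    the order the sluglines appear in the file.
    """
    starts = [m.start() for m in SLUGLINE.finditer(script)]
    scenes = []
    for seq, start in enumerate(starts, start=1):
        end = starts[seq] if seq < len(starts) else len(script)
        block = script[start:end].strip()
        slugline, _, body = block.partition("\n")
        scenes.append({
            "sequence_number": seq,
            "slugline": slugline.strip(),
            "raw_content": body.strip(),
        })
    return scenes

sample = "INT. SERVER ROOM - NIGHT\nRacks hum.\nEXT. STREET - DAY\nRain falls."
parsed = chunk_by_scene(sample)
```

Chunking before LLM analysis keeps each Ollama call within the model's context window and lets the worker process scenes in parallel.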
Module 3: Flow Assembly & JSON Generation (The "Translator")
Endpoint: POST /api/shots/:id/generate-flow
Logic Flow:
1. Data Gathering: The endpoint fetches the shot record. It then queries the ingredients table to retrieve the full details (Name, Metadata, Visual Description) of the UUIDs stored in assigned_ingredients.
2. Context Construction: A rich text prompt is assembled.
   - Example: "Construct a Google Veo 3.1 JSON configuration. The shot is: '{shot.description}'. The Character is '{ingredient[0].name}', described as '{ingredient[0].metadata}'. The Location is '{ingredient[1].name}'."
   - Prompt Engineering: The prompt explicitly forbids the LLM from adding hallucinated details and forces it to map the Ingredient characteristics to the specific JSON fields required by Veo (e.g., subject.description, environment.lighting).
3. AI Action: Send to Ollama. We use a low temperature setting (e.g., 0.2) to ensure deterministic, strictly formatted JSON output.
4. Validation: The backend parses the returned JSON string into a Python dictionary. If parsing fails, it retries up to 2 times.
5. Update: The valid JSON is saved to shots.veo_json_payload, and the status is updated to 'ready'.
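The parse-and-retry loop in this flow can be sketched as follows. The helper takes any zero-argument callable that returns raw model text, so it works unchanged with the Ollama client; the function name and simulated responses are illustrative.

```python
import json
from typing import Callable

def generate_with_retry(ask_llm: Callable[[], str], retries: int = 2) -> dict:
    """Call the LLM and parse its output as JSON, retrying on failure.

    Mirrors the Validation step: up to `retries` extra attempts before
    giving up. `ask_llm` is any zero-arg callable returning raw text.
    """
    last_err: Exception | None = None
    for _ in range(retries + 1):
        raw = ask_llm()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_err = err
    raise ValueError(f"LLM never returned valid JSON: {last_err}")

# Simulated flaky model: fails once, then returns valid JSON.
responses = iter(["not json", '{"subject": {"description": "robot dog"}}'])
payload = generate_with_retry(lambda: next(responses))
```

On success the returned dict is what gets written to shots.veo_json_payload; on exhaustion the shot can be left in its previous status with an error surfaced to the UI.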
4. Frontend Integration Guidelines
- API Client: Use axios with a configured base URL (e.g., /api/v1). Implement interceptors to handle 401/403 errors and global loading states.
- Image Handling (Presigned URLs): The frontend should never try to fetch images directly from the MinIO container's internal IP. Instead, the API returns a presigned URL (valid for 1 hour) that allows the browser to fetch the image directly from the MinIO public endpoint.
- Optimistic UI: When a user updates a shot description or drags an ingredient, the UI should update immediately (using react-query's setQueryData) before the API call resolves. If the API call fails, the change is rolled back with a toast notification.
- Editor Component: For the JSON editor, use @monaco-editor/react to provide syntax highlighting and code folding, giving the "IDE" feel.
5. Implementation Prompt for Coding Agent
Copy and paste this detailed instruction block into Antigravity/Cursor/Windsurf to begin the build process:
"Act as a Senior Full-Stack Software Architect. We are building 'Auteur AI', a professional video production management application.
Core Constraint: DO NOT USE MOCK DATA. This is a real implementation meant for production deployment on a Linux cluster.
Technology Stack Definition:
Backend: Python FastAPI.
- Use SQLAlchemy (async) for ORM.
- Use Alembic for database migrations.
- Use Pydantic for strict data validation (models).
- Use boto3 for MinIO (S3) interaction.
- Use the ollama Python library for communicating with the local LLM (http://host.docker.internal:11434).

Frontend: React (Vite) + TypeScript + TailwindCSS.

- Use axios for API requests.
- Use @tanstack/react-query for data fetching and caching.
- Use shadcn/ui components for the interface.
- Use lucide-react for iconography.

Task 1: Infrastructure Setup
Create a production-ready docker-compose.yml. It must include:

- postgres (v16) with a named volume for persistence.
- minio with a create-bucket entrypoint script.
- backend service (FastAPI) with hot-reload enabled for dev.
- frontend service (Node/Vite) proxying requests to the backend.

Task 2: Database Layer
Define the SQLAlchemy models exactly matching the schema provided in the TDD (Projects, Ingredients, Scenes, Shots). Create the initial Alembic migration script.
Task 3: Backend API Implementation

- Implement the POST /api/assets/upload endpoint using UploadFile. It must save to MinIO and Postgres.
- Implement the POST /api/scripts/parse endpoint. It must accept a text file, chunk it, send it to Ollama for analysis, and store the resulting Shots.

Task 4: Frontend Development

- Set up the React Router with layouts (Sidebar/Header).
- Build the 'Asset Library' view: Fetch real data from the API, display images using presigned URLs, and implement a real file upload dropzone."