Recipe: RAG System Architecture
Purpose
Design and implement a production-ready Retrieval-Augmented Generation (RAG) system for LLM-in-Product features, one that scales to enterprise workloads while maintaining accuracy and performance.
Context
Use when building AI features that need to reference external knowledge, provide contextual responses, or maintain up-to-date information. Ideal for chatbots, knowledge management systems, customer support automation, and intelligent search features.
Complexity Level: 🔴 Advanced
Track Focus: 🎯 LLM-in-Product
Phase: 📐 Design
Time Investment: 6-12 hours
Architecture Diagram
At a high level, documents flow through ingestion → chunking → embedding → vector store, while queries flow through query processing → hybrid retrieval → re-ranking → context assembly → LLM generation. The component specifications below follow that order.
Component Specifications
1. Data Ingestion Pipeline
Content Source Integration
interface ContentSource {
  id: string;
  type: 'api' | 'file' | 'database' | 'web_scrape';
  config: {
    endpoint?: string;
    credentials?: object;
    schedule?: string;
    filters?: object;
  };
  lastUpdated: Date;
  status: 'active' | 'inactive' | 'error';
}
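To make the shape concrete, here is a hypothetical ContentSource describing a nightly API sync. The endpoint, credential reference, cron-style schedule, and filter keys are illustrative assumptions, not values prescribed by the interface:

const docsApiSource: ContentSource = {
  id: 'product-docs-api',
  type: 'api',
  config: {
    endpoint: 'https://docs.example.com/api/v1/articles', // illustrative URL
    credentials: { apiKeyRef: 'secrets/docs-api-key' },   // a reference, never the raw key
    schedule: '0 2 * * *',                                // nightly at 02:00 (assumed cron syntax)
    filters: { status: 'published', locale: 'en' }
  },
  lastUpdated: new Date('2024-01-01T02:00:00Z'),
  status: 'active'
};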
class ContentIngestionService {
  async ingestFromSource(source: ContentSource): Promise<Document[]> {
    // 1. Connect to source
    const connector = this.getConnector(source.type);
    const rawContent = await connector.fetch(source.config);

    // 2. Validate content structure
    const validContent = await this.validateContent(rawContent);

    // 3. Extract and clean text
    const cleanedContent = await this.preprocessContent(validContent);

    // 4. Create document objects
    return this.createDocuments(cleanedContent, source.id);
  }

  private async preprocessContent(content: any[]): Promise<string[]> {
    return content.map(item => {
      // Remove HTML tags, normalize whitespace, handle encoding
      let text = this.stripHtml(item.content);
      text = this.normalizeWhitespace(text);
      text = this.handleEncoding(text);
      return text;
    });
  }
}
Content Validation Rules
validation_rules:
  min_content_length: 50       # Minimum characters per document
  max_content_length: 10000    # Maximum characters per document
  required_fields: ['title', 'content', 'source']
  content_quality:
    min_word_count: 10
    max_duplicate_percentage: 80  # Reject documents more than 80% duplicated against the existing corpus
    language_detection: true
    profanity_filter: true
  metadata_requirements:
    author: optional
    created_date: required
    modified_date: optional
    tags: optional
    category: optional
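A minimal sketch of how these rules might be enforced in code, assuming documents arrive as plain objects. The field names mirror the YAML above; the function itself is illustrative rather than a fixed API, and the external checks (language detection, duplicate ratio, profanity) are omitted:

interface RawDocument {
  title?: string;
  content?: string;
  source?: string;
  created_date?: string;
}

function validateDocument(doc: RawDocument): string[] {
  const errors: string[] = [];

  // Required fields per validation_rules.required_fields, plus the required created_date
  for (const field of ['title', 'content', 'source', 'created_date'] as const) {
    if (!doc[field]) errors.push(`missing required field: ${field}`);
  }

  const content = doc.content ?? '';
  if (content.length < 50) errors.push('content shorter than min_content_length (50)');
  if (content.length > 10000) errors.push('content longer than max_content_length (10000)');

  const wordCount = content.trim().split(/\s+/).filter(Boolean).length;
  if (wordCount < 10) errors.push('fewer than min_word_count (10) words');

  // language_detection, profanity_filter, and duplicate checks would call
  // external helpers here (omitted in this sketch)
  return errors; // an empty array means the document passed
}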
2. Document Chunking Strategy
Intelligent Chunking Implementation
interface ChunkingStrategy {
  type: 'fixed' | 'semantic' | 'sliding' | 'hierarchical';
  config: {
    chunkSize: number;  // target chunk size; units depend on the strategy (sentences for semantic chunking below)
    overlap: number;
    preserveStructure: boolean;
    splitOn: string[];
  };
}
class IntelligentChunker {
  async chunkDocument(document: Document, strategy: ChunkingStrategy): Promise<Chunk[]> {
    switch (strategy.type) {
      case 'semantic':
        return this.semanticChunking(document, strategy.config);
      case 'hierarchical':
        return this.hierarchicalChunking(document, strategy.config);
      case 'sliding':
        return this.slidingWindowChunking(document, strategy.config);
      default:
        return this.fixedSizeChunking(document, strategy.config);
    }
  }

  private async semanticChunking(document: Document, config: any): Promise<Chunk[]> {
    // Use sentence embeddings to identify semantic boundaries
    const sentences = this.splitIntoSentences(document.content);
    const embeddings = await this.generateSentenceEmbeddings(sentences);

    // Find semantic break points using cosine similarity
    const breakPoints = this.findSemanticBreakpoints(embeddings, config.chunkSize);

    // Create chunks based on semantic boundaries
    return this.createSemanticChunks(sentences, breakPoints, document.metadata);
  }

  private findSemanticBreakpoints(embeddings: number[][], targetSize: number): number[] {
    const breakPoints: number[] = [];
    let currentChunkStart = 0;

    for (let i = 1; i < embeddings.length; i++) {
      const similarity = this.cosineSimilarity(embeddings[i - 1], embeddings[i]);
      // Chunk length is measured in sentences here, so targetSize must be too
      const currentChunkLength = i - currentChunkStart;

      // Break if similarity is low and the chunk is approaching target size
      if (similarity < 0.7 && currentChunkLength >= targetSize * 0.8) {
        breakPoints.push(i);
        currentChunkStart = i;
      }
    }
    return breakPoints;
  }
}
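The chunker above assumes a cosineSimilarity helper; a self-contained version is sketched below. It presumes both vectors have the same length, and treats zero vectors as having no similarity:

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors rather than dividing by zero
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}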
3. Embedding Generation
Multi-Model Embedding Strategy
interface EmbeddingModel {
  name: string;
  dimensions: number;
  maxTokens: number;
  costPer1KTokens: number; // USD per 1,000 tokens (0 for self-hosted models)
  latency: number;         // typical latency per request, ms
}

class EmbeddingService {
  private models: Map<string, EmbeddingModel> = new Map([
    ['text-embedding-ada-002', {
      name: 'OpenAI Ada v2',
      dimensions: 1536,
      maxTokens: 8192,
      costPer1KTokens: 0.0001,
      latency: 200
    }],
    ['sentence-transformers', {
      name: 'SentenceTransformers',
      dimensions: 384,
      maxTokens: 512,
      costPer1KTokens: 0,
      latency: 50
    }]
  ]);

  async generateEmbeddings(chunks: Chunk[], modelName: string): Promise<Embedding[]> {
    const model = this.models.get(modelName);
    if (!model) throw new Error(`Model ${modelName} not found`);

    // Batch process for efficiency
    const batches = this.createBatches(chunks, 100);
    const embeddings: Embedding[] = [];

    for (const batch of batches) {
      const batchEmbeddings = await this.processBatch(batch, model);
      embeddings.push(...batchEmbeddings);
    }
    return embeddings;
  }

  private async processBatch(chunks: Chunk[], model: EmbeddingModel): Promise<Embedding[]> {
    const texts = chunks.map(chunk => chunk.content);

    // Retry the API call itself with exponential backoff, rather than recursing
    // into this method, which would nest retries without bound
    const response = await this.retryWithBackoff(() => this.callEmbeddingAPI(texts, model));

    return response.data.map((embedding, index) => ({
      chunkId: chunks[index].id,
      vector: embedding.embedding,
      model: model.name,
      metadata: chunks[index].metadata, // carried forward so the vector store can filter on it
      timestamp: new Date()
    }));
  }
}
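processBatch relies on a retryWithBackoff helper that the class does not define. One plausible implementation with a capped attempt count and exponential delay is sketched below; the cap, base delay, and jitter are assumptions to tune against your provider's rate limits:

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,   // assumed cap; prevents unbounded retries
  baseDelayMs = 500  // assumed base delay
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Exponential backoff with jitter: ~500ms, ~1s, ~2s
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}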
4. Vector Database Implementation
Production Vector Store Configuration
interface VectorStoreConfig {
  provider: 'pinecone' | 'weaviate' | 'chroma' | 'qdrant';
  dimensions: number;
  indexType: 'hnsw' | 'ivf' | 'flat';
  similarity: 'cosine' | 'euclidean' | 'dot_product';
  replicas: number;
  shards: number;
}
class ProductionVectorStore {
  private config: VectorStoreConfig;
  private vectorProvider: any; // provider SDK client (Pinecone, Weaviate, etc.), injected at construction

  async createIndex(indexName: string): Promise<void> {
    const indexConfig = {
      name: indexName,
      dimensions: this.config.dimensions,
      metric: this.config.similarity,
      replicas: this.config.replicas,
      shards: this.config.shards,
      metadata_config: {
        indexed: ['category', 'source', 'timestamp', 'language']
      }
    };
    await this.vectorProvider.createIndex(indexConfig);
  }

  async upsertEmbeddings(embeddings: Embedding[]): Promise<void> {
    // Batch upsert for performance
    const batchSize = 1000;
    const batches = this.createBatches(embeddings, batchSize);

    const promises = batches.map(batch =>
      this.vectorProvider.upsert({
        vectors: batch.map(emb => ({
          id: emb.chunkId,
          values: emb.vector,
          metadata: emb.metadata
        }))
      })
    );
    await Promise.all(promises);
  }

  async search(queryVector: number[], options: SearchOptions): Promise<SearchResult[]> {
    const results = await this.vectorProvider.query({
      vector: queryVector,
      topK: options.topK || 10,
      filter: options.filters,
      includeMetadata: true,
      includeValues: false
    });

    return results.matches.map(match => ({
      chunkId: match.id,
      score: match.score,
      metadata: match.metadata
    }));
  }
}
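A hypothetical configuration and call sequence, assuming a Pinecone-style provider client behind the class. The index name and replica/shard counts are illustrative; size them for your own availability and throughput targets:

const storeConfig: VectorStoreConfig = {
  provider: 'pinecone',
  dimensions: 1536,   // must match the embedding model (ada-002 above)
  indexType: 'hnsw',
  similarity: 'cosine',
  replicas: 2,        // illustrative; raise for availability
  shards: 1
};

// Usage sketch, assuming a constructor that wires up the provider client:
// const store = new ProductionVectorStore(storeConfig);
// await store.createIndex('product-docs');
// await store.upsertEmbeddings(embeddings);
// const hits = await store.search(queryVector, { topK: 10, filters: { language: 'en' } });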
5. Query Processing Pipeline
Advanced Query Enhancement
class QueryProcessor {
  async processQuery(userQuery: string, context: UserContext): Promise<ProcessedQuery> {
    // 1. Intent classification
    const intent = await this.classifyIntent(userQuery, context);

    // 2. Query expansion
    const expandedQuery = await this.expandQuery(userQuery, intent);

    // 3. Generate query embedding
    const queryEmbedding = await this.embedQuery(expandedQuery);

    // 4. Apply context-based filtering
    const filters = this.buildFilters(intent, context);

    return {
      original: userQuery,
      expanded: expandedQuery,
      embedding: queryEmbedding,
      intent: intent,
      filters: filters,
      timestamp: new Date()
    };
  }

  private async expandQuery(query: string, intent: IntentClassification): Promise<string> {
    const expansionPrompt = `
Expand this user query to improve search relevance while maintaining the original intent.

Original Query: "${query}"
Intent: ${intent.primary}
Domain: ${intent.domain}

Generate 2-3 alternative phrasings that capture the same meaning.
Include relevant synonyms and related terms.

Expanded Query:`;

    const expansion = await this.llmService.complete({
      prompt: expansionPrompt,
      temperature: 0.3,
      maxTokens: 100
    });

    return `${query} ${expansion}`;
  }
}
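buildFilters is referenced above but not shown. A minimal sketch, assuming the intent carries a domain, a UserContext shaped roughly as below, and a vector store that indexes the corresponding metadata fields (category and language are indexed in createIndex above; a tenant field would need to be indexed too):

interface UserContextSketch {
  language?: string;
  tenantId?: string;
}

function buildFilters(
  intent: { domain?: string },
  context: UserContextSketch
): Record<string, unknown> {
  const filters: Record<string, unknown> = {};
  if (intent.domain) filters.category = intent.domain;       // scope retrieval to the intent's domain
  if (context.language) filters.language = context.language; // match the user's language
  if (context.tenantId) filters.tenant = context.tenantId;   // tenant isolation, if multi-tenant
  return filters;
}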
6. Retrieval and Re-ranking
Hybrid Retrieval Strategy
class HybridRetriever {
  async retrieve(processedQuery: ProcessedQuery, options: RetrievalOptions): Promise<RetrievalResult[]> {
    // 1. Semantic search using vector similarity (over-fetch to leave room for re-ranking)
    const semanticResults = await this.vectorStore.search(
      processedQuery.embedding,
      { topK: options.topK * 2, filters: processedQuery.filters }
    );

    // 2. Keyword-based search for exact matches
    const keywordResults = await this.keywordSearch(
      processedQuery.expanded,
      { topK: options.topK, filters: processedQuery.filters }
    );

    // 3. Combine and deduplicate results
    const combinedResults = this.combineResults(semanticResults, keywordResults);

    // 4. Re-rank using cross-encoder model
    const rerankedResults = await this.rerank(combinedResults, processedQuery.original);

    return rerankedResults.slice(0, options.topK);
  }

  private async rerank(results: RetrievalResult[], originalQuery: string): Promise<RetrievalResult[]> {
    const pairs = results.map(result => ({
      query: originalQuery,
      passage: result.content
    }));

    const relevanceScores = await this.crossEncoder.predict(pairs);

    return results.map((result, index) => ({
      ...result,
      relevanceScore: relevanceScores[index],
      finalScore: this.combineScores(result.similarityScore, relevanceScores[index])
    })).sort((a, b) => b.finalScore - a.finalScore);
  }
}
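combineScores is left undefined above. A common approach is a weighted blend of the vector-similarity score and the cross-encoder relevance score; the 0.3/0.7 weights below are assumptions to tune offline against labeled query/passage pairs:

// Assumes both scores are normalized to [0, 1]; rescale first if they are not.
function combineScores(similarityScore: number, relevanceScore: number): number {
  const SIMILARITY_WEIGHT = 0.3; // assumed; retrieval similarity as a weak prior
  const RELEVANCE_WEIGHT = 0.7;  // assumed; cross-encoder relevance usually dominates
  return SIMILARITY_WEIGHT * similarityScore + RELEVANCE_WEIGHT * relevanceScore;
}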
7. Context Assembly and Generation
Intelligent Context Management
class ContextAssembler {
  async assembleContext(
    retrievalResults: RetrievalResult[],
    userQuery: string,
    contextWindow: number = 4000
  ): Promise<AssembledContext> {
    // 1. Filter by relevance threshold
    const relevantResults = retrievalResults.filter(r => r.finalScore > 0.7);

    // 2. Optimize context within token limits
    const optimizedContext = await this.optimizeContext(relevantResults, contextWindow);

    // 3. Structure context for LLM consumption
    const structuredContext = this.structureContext(optimizedContext, userQuery);

    return {
      content: structuredContext,
      sources: optimizedContext.map(r => r.source),
      tokenCount: this.countTokens(structuredContext),
      confidence: this.calculateConfidence(optimizedContext)
    };
  }

  private async optimizeContext(
    results: RetrievalResult[],
    maxTokens: number
  ): Promise<RetrievalResult[]> {
    const optimizedResults: RetrievalResult[] = [];
    let currentTokenCount = 0;

    // Sort a copy by relevance (avoid mutating the caller's array) and add highest-scoring chunks first
    const sortedResults = [...results].sort((a, b) => b.finalScore - a.finalScore);

    for (const result of sortedResults) {
      const resultTokens = this.countTokens(result.content);

      if (currentTokenCount + resultTokens <= maxTokens) {
        optimizedResults.push(result);
        currentTokenCount += resultTokens;
      } else {
        // Try to fit a truncated version
        const truncated = this.truncateToFit(result.content, maxTokens - currentTokenCount);
        if (truncated.length > 50) { // Minimum useful content
          optimizedResults.push({
            ...result,
            content: truncated
          });
        }
        break;
      }
    }
    return optimizedResults;
  }

  private structureContext(results: RetrievalResult[], userQuery: string): string {
    const contextSections = results.map((result, index) => `
## Source ${index + 1}: ${result.source}
**Relevance:** ${(result.finalScore * 100).toFixed(1)}%

${result.content}
---`).join('\n');

    return `
# Context Information for Query: "${userQuery}"

The following information may help answer the user's question:
${contextSections}

Please use this information to provide an accurate, helpful response.
If the context doesn't contain relevant information, please say so clearly.
`;
  }
}
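countTokens and truncateToFit appear throughout the assembler. In production you would use the tokenizer matching your LLM (e.g., tiktoken for OpenAI models); the character-based heuristic below is only a sketch of the budget logic, assuming roughly 4 characters per token for English text:

// Rough heuristic; replace with the real tokenizer before relying on the budget.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function truncateToFit(text: string, maxTokens: number): string {
  const maxChars = maxTokens * 4;
  if (text.length <= maxChars) return text;
  // Prefer cutting at the last sentence boundary inside the budget, if one exists
  const slice = text.slice(0, maxChars);
  const lastStop = slice.lastIndexOf('. ');
  return lastStop > 0 ? slice.slice(0, lastStop + 1) : slice;
}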
Implementation Checklist
Phase 1: Foundation Setup (Week 1-2)
- Choose vector database provider (Pinecone, Weaviate, etc.)
- Set up development environment and dependencies
- Implement basic document ingestion pipeline
- Create document chunking service
- Set up embedding generation service
- Configure vector store with test data
Phase 2: Core RAG Pipeline (Week 3-4)
- Implement query processing and expansion
- Build semantic retrieval service
- Create re-ranking mechanism
- Develop context assembly logic
- Integrate with LLM for generation
- Implement basic output validation
Phase 3: Advanced Features (Week 5-6)
- Add intent classification
- Implement hybrid search (semantic + keyword)
- Create sophisticated re-ranking with cross-encoder
- Build feedback collection system
- Add performance monitoring
- Implement caching for common queries
Phase 4: Production Readiness (Week 7-8)
- Add comprehensive error handling and retries
- Implement rate limiting and load balancing
- Set up monitoring and alerting
- Create deployment scripts and configurations
- Add security measures (authentication, input validation)
- Conduct load testing and optimization
Phase 5: Quality & Analytics (Week 9-10)
- Implement A/B testing framework
- Add detailed analytics and reporting
- Create automated evaluation pipeline
- Set up continuous improvement mechanisms
- Document API and usage patterns
- Train team on system operation
Performance Optimization
Caching Strategy
interface CacheStrategy {
  queryCache: {
    ttl: number;      // Time to live in seconds
    maxSize: number;  // Maximum number of cached queries
    keyStrategy: 'exact' | 'semantic'; // How to match cached queries
  };
  embeddingCache: {
    ttl: number;
    maxSize: number;
  };
  resultCache: {
    ttl: number;
    maxSize: number;
  };
}
class RAGCacheManager {
  async getCachedResults(queryHash: string): Promise<RetrievalResult[] | null> {
    const cached = await this.cache.get(`query:${queryHash}`);
    if (cached && !this.isExpired(cached.timestamp, this.config.queryCache.ttl)) {
      return cached.results;
    }
    return null;
  }

  async cacheResults(queryHash: string, results: RetrievalResult[]): Promise<void> {
    await this.cache.set(`query:${queryHash}`, {
      results,
      timestamp: new Date(),
      ttl: this.config.queryCache.ttl
    });
  }
}
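The cache is keyed by a queryHash. For the 'exact' keyStrategy a hash of the normalized query is enough (the 'semantic' strategy would instead embed the query and match cached entries by similarity). A sketch using Node's built-in crypto module; including the filters in the key is an assumption, made because different filters produce different results:

import { createHash } from 'crypto';

// 'exact' key strategy: normalize, then hash, so trivial variations share a key.
function queryHash(query: string, filters: Record<string, unknown> = {}): string {
  const normalized = query.trim().toLowerCase().replace(/\s+/g, ' ');
  return createHash('sha256')
    .update(normalized)
    .update(JSON.stringify(filters)) // filters change results, so they belong in the key
    .digest('hex');
}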
Monitoring and Observability
interface RAGMetrics {
  latency: {   // milliseconds
    p50: number;
    p95: number;
    p99: number;
  };
  accuracy: {
    retrievalPrecision: number;
    retrievalRecall: number;
    generationQuality: number;
  };
  cost: {      // USD
    embeddingCost: number;
    vectorStoreCost: number;
    llmGenerationCost: number;
  };
  usage: {
    queriesPerSecond: number;
    cacheHitRate: number;
    errorRate: number;
  };
}
class RAGMonitor {
  async recordMetrics(operation: string, startTime: Date, success: boolean, metadata: any): Promise<void> {
    const duration = Date.now() - startTime.getTime();

    // Record to metrics backend (Prometheus, DataDog, etc.)
    this.metricsClient.histogram('rag_operation_duration', duration, {
      operation,
      success: success.toString(),
      ...metadata
    });

    this.metricsClient.counter('rag_operations_total', 1, {
      operation,
      status: success ? 'success' : 'error'
    });
  }
}
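A usage sketch showing how a retrieval call might be wrapped in this instrumentation; monitoredRetrieve is a hypothetical helper, and the monitor and retriever arguments are assumed to be already-constructed instances:

async function monitoredRetrieve(
  monitor: RAGMonitor,
  retriever: HybridRetriever,
  query: ProcessedQuery
): Promise<RetrievalResult[]> {
  const startTime = new Date();
  try {
    const results = await retriever.retrieve(query, { topK: 10 });
    await monitor.recordMetrics('retrieval', startTime, true, { topK: 10 });
    return results;
  } catch (error) {
    // Record the failure before re-throwing so error rates stay accurate
    await monitor.recordMetrics('retrieval', startTime, false, { topK: 10 });
    throw error;
  }
}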
Validation Checklist
Functional Validation
- System correctly ingests and processes documents
- Chunking preserves document meaning and context
- Embeddings are generated accurately and consistently
- Vector search returns relevant results
- Re-ranking improves result quality
- Context assembly stays within token limits
- LLM generates accurate responses based on retrieved context
- System handles edge cases (empty results, malformed queries)
Performance Validation
- Query latency meets SLA requirements (< 2 seconds p95)
- System handles expected concurrent load
- Memory usage remains stable under load
- Cache hit rates meet targets (> 80% for common queries)
- Cost per query stays within budget
- Vector database performance is optimized
Quality Validation
- Retrieved documents are relevant to user queries
- Generated responses are accurate and helpful
- System avoids hallucination and fabricated information
- Responses include appropriate source citations
- User feedback indicates high satisfaction (> 85%)
- A/B tests show improvement over baseline
Security Validation
- Input validation prevents injection attacks
- Authentication and authorization work correctly
- Sensitive information is properly filtered
- Audit logs capture all significant events
- Rate limiting prevents abuse
- Data handling complies with privacy regulations
Variations and Customization
Industry-Specific Adaptations
Customer Support RAG
- Emphasize recent ticket data and resolution patterns
- Include conversation context and customer history
- Integrate with ticketing systems for real-time updates
- Focus on actionable response generation
Knowledge Management RAG
- Implement document versioning and approval workflows
- Add expert annotations and manual curation
- Support multi-language content and queries
- Include access control and permission systems
E-commerce RAG
- Integrate with product catalogs and inventory systems
- Include price and availability in context
- Support image and multimodal search
- Focus on conversion-oriented responses
Architecture Variations
Serverless RAG
- Use cloud-native embedding services (AWS Bedrock, Azure OpenAI)
- Implement event-driven ingestion pipelines
- Leverage managed vector databases
- Optimize for cost and auto-scaling
Edge RAG
- Deploy smaller models for local inference
- Implement hybrid cloud-edge architecture
- Use model quantization and optimization
- Focus on latency and privacy
Multi-tenant RAG
- Implement data isolation and access controls
- Support per-tenant customization
- Use namespace-based vector storage
- Optimize for shared infrastructure efficiency
Success Metrics
Technical KPIs
- Query Latency: p95 < 2 seconds, p99 < 5 seconds
- Relevance Score: > 85% of results rated as relevant
- System Uptime: > 99.9% availability
- Cost Efficiency: < $0.10 per query including all services
- Cache Hit Rate: > 80% for repeated queries
Business KPIs
- User Satisfaction: > 85% positive ratings
- Task Completion Rate: > 90% of user questions answered successfully
- Support Ticket Reduction: 30-50% decrease in human support requests
- Time to Answer: < 30 seconds average response time
- User Engagement: > 60% return usage within 30 days
Ready to build your RAG system? Start with Phase 1: Foundation Setup and progress systematically through each implementation phase. Remember to validate extensively at each step and gather user feedback early and often.