Recipe: RAG System Architecture
Purpose
Design and implement a production-ready Retrieval-Augmented Generation (RAG) system for LLM-in-Product features, one that scales to enterprise workloads while maintaining accuracy and performance.
Context
Use when building AI features that need to reference external knowledge, provide contextual responses, or maintain up-to-date information. Ideal for chatbots, knowledge management systems, customer support automation, and intelligent search features.
Complexity Level: 🔴 Advanced
Track Focus: 🎯 LLM-in-Product
Phase: 📐 Design
Time Investment: 6-12 hours
Architecture Diagram
At a high level, documents flow through ingestion → chunking → embedding → vector store, while queries flow through query processing → hybrid retrieval → re-ranking → context assembly → LLM generation. The component specifications below follow that order.
Component Specifications
1. Data Ingestion Pipeline
Content Source Integration
interface ContentSource {
  id: string;
  type: 'api' | 'file' | 'database' | 'web_scrape';
  config: {
    endpoint?: string;
    credentials?: object;
    schedule?: string;
    filters?: object;
  };
  lastUpdated: Date;
  status: 'active' | 'inactive' | 'error';
}
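To make the shape concrete, here is a hypothetical ContentSource describing a nightly API sync. The endpoint, credential reference, cron-style schedule, and filter keys are illustrative assumptions, not values prescribed by the interface:

const docsApiSource: ContentSource = {
  id: 'product-docs-api',
  type: 'api',
  config: {
    endpoint: 'https://docs.example.com/api/v1/articles', // illustrative URL
    credentials: { apiKeyRef: 'secrets/docs-api-key' },   // a reference, never the raw key
    schedule: '0 2 * * *',                                // nightly at 02:00 (assumed cron syntax)
    filters: { status: 'published', locale: 'en' }
  },
  lastUpdated: new Date('2024-01-01T02:00:00Z'),
  status: 'active'
};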
class ContentIngestionService {
  async ingestFromSource(source: ContentSource): Promise<Document[]> {
    // 1. Connect to source
    const connector = this.getConnector(source.type);
    const rawContent = await connector.fetch(source.config);

    // 2. Validate content structure
    const validContent = await this.validateContent(rawContent);

    // 3. Extract and clean text
    const cleanedContent = await this.preprocessContent(validContent);

    // 4. Create document objects
    return this.createDocuments(cleanedContent, source.id);
  }

  private async preprocessContent(content: any[]): Promise<string[]> {
    return content.map(item => {
      // Remove HTML tags, normalize whitespace, handle encoding
      let text = this.stripHtml(item.content);
      text = this.normalizeWhitespace(text);
      text = this.handleEncoding(text);
      return text;
    });
  }
}
Content Validation Rules
validation_rules:
  min_content_length: 50       # Minimum characters per document
  max_content_length: 10000    # Maximum characters per document
  required_fields: ['title', 'content', 'source']
  content_quality:
    min_word_count: 10
    max_duplicate_percentage: 80  # Reject documents more than 80% duplicated against the existing corpus
    language_detection: true
    profanity_filter: true
  metadata_requirements:
    author: optional
    created_date: required
    modified_date: optional
    tags: optional
    category: optional
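A minimal sketch of how these rules might be enforced in code, assuming documents arrive as plain objects. The field names mirror the YAML above; the function itself is illustrative rather than a fixed API, and the external checks (language detection, duplicate ratio, profanity) are omitted:

interface RawDocument {
  title?: string;
  content?: string;
  source?: string;
  created_date?: string;
}

function validateDocument(doc: RawDocument): string[] {
  const errors: string[] = [];

  // Required fields per validation_rules.required_fields, plus the required created_date
  for (const field of ['title', 'content', 'source', 'created_date'] as const) {
    if (!doc[field]) errors.push(`missing required field: ${field}`);
  }

  const content = doc.content ?? '';
  if (content.length < 50) errors.push('content shorter than min_content_length (50)');
  if (content.length > 10000) errors.push('content longer than max_content_length (10000)');

  const wordCount = content.trim().split(/\s+/).filter(Boolean).length;
  if (wordCount < 10) errors.push('fewer than min_word_count (10) words');

  // language_detection, profanity_filter, and duplicate checks would call
  // external helpers here (omitted in this sketch)
  return errors; // an empty array means the document passed
}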
2. Document Chunking Strategy
Intelligent Chunking Implementation
interface ChunkingStrategy {
  type: 'fixed' | 'semantic' | 'sliding' | 'hierarchical';
  config: {
    chunkSize: number;  // target chunk size; units depend on the strategy (sentences for semantic chunking below)
    overlap: number;
    preserveStructure: boolean;
    splitOn: string[];
  };
}
class IntelligentChunker {
  async chunkDocument(document: Document, strategy: ChunkingStrategy): Promise<Chunk[]> {
    switch (strategy.type) {
      case 'semantic':
        return this.semanticChunking(document, strategy.config);
      case 'hierarchical':
        return this.hierarchicalChunking(document, strategy.config);
      case 'sliding':
        return this.slidingWindowChunking(document, strategy.config);
      default:
        return this.fixedSizeChunking(document, strategy.config);
    }
  }

  private async semanticChunking(document: Document, config: any): Promise<Chunk[]> {
    // Use sentence embeddings to identify semantic boundaries
    const sentences = this.splitIntoSentences(document.content);
    const embeddings = await this.generateSentenceEmbeddings(sentences);

    // Find semantic break points using cosine similarity
    const breakPoints = this.findSemanticBreakpoints(embeddings, config.chunkSize);

    // Create chunks based on semantic boundaries
    return this.createSemanticChunks(sentences, breakPoints, document.metadata);
  }

  private findSemanticBreakpoints(embeddings: number[][], targetSize: number): number[] {
    const breakPoints: number[] = [];
    let currentChunkStart = 0;

    for (let i = 1; i < embeddings.length; i++) {
      const similarity = this.cosineSimilarity(embeddings[i - 1], embeddings[i]);
      // Chunk length is measured in sentences here, so targetSize must be too
      const currentChunkLength = i - currentChunkStart;

      // Break if similarity is low and the chunk is approaching target size
      if (similarity < 0.7 && currentChunkLength >= targetSize * 0.8) {
        breakPoints.push(i);
        currentChunkStart = i;
      }
    }
    return breakPoints;
  }
}
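The chunker above assumes a cosineSimilarity helper; a self-contained version is sketched below. It presumes both vectors have the same length, and treats zero vectors as having no similarity:

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors rather than dividing by zero
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}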
3. Embedding Generation
Multi-Model Embedding Strategy
interface EmbeddingModel {
  name: string;
  dimensions: number;
  maxTokens: number;
  costPer1KTokens: number; // USD per 1,000 tokens (0 for self-hosted models)
  latency: number;         // typical latency per request, ms
}

class EmbeddingService {
  private models: Map<string, EmbeddingModel> = new Map([
    ['text-embedding-ada-002', {
      name: 'OpenAI Ada v2',
      dimensions: 1536,
      maxTokens: 8192,
      costPer1KTokens: 0.0001,
      latency: 200
    }],
    ['sentence-transformers', {
      name: 'SentenceTransformers',
      dimensions: 384,
      maxTokens: 512,
      costPer1KTokens: 0,
      latency: 50
    }]
  ]);

  async generateEmbeddings(chunks: Chunk[], modelName: string): Promise<Embedding[]> {
    const model = this.models.get(modelName);
    if (!model) throw new Error(`Model ${modelName} not found`);

    // Batch process for efficiency
    const batches = this.createBatches(chunks, 100);
    const embeddings: Embedding[] = [];

    for (const batch of batches) {
      const batchEmbeddings = await this.processBatch(batch, model);
      embeddings.push(...batchEmbeddings);
    }
    return embeddings;
  }

  private async processBatch(chunks: Chunk[], model: EmbeddingModel): Promise<Embedding[]> {
    const texts = chunks.map(chunk => chunk.content);

    // Retry the API call itself with exponential backoff, rather than recursing
    // into this method, which would nest retries without bound
    const response = await this.retryWithBackoff(() => this.callEmbeddingAPI(texts, model));

    return response.data.map((embedding, index) => ({
      chunkId: chunks[index].id,
      vector: embedding.embedding,
      model: model.name,
      metadata: chunks[index].metadata, // carried forward so the vector store can filter on it
      timestamp: new Date()
    }));
  }
}
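processBatch relies on a retryWithBackoff helper that the class does not define. One plausible implementation with a capped attempt count and exponential delay is sketched below; the cap, base delay, and jitter are assumptions to tune against your provider's rate limits:

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,   // assumed cap; prevents unbounded retries
  baseDelayMs = 500  // assumed base delay
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Exponential backoff with jitter: ~500ms, ~1s, ~2s
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}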
4. Vector Database Implementation
Production Vector Store Configuration
interface VectorStoreConfig {
  provider: 'pinecone' | 'weaviate' | 'chroma' | 'qdrant';
  dimensions: number;
  indexType: 'hnsw' | 'ivf' | 'flat';
  similarity: 'cosine' | 'euclidean' | 'dot_product';
  replicas: number;
  shards: number;
}
class ProductionVectorStore {
  private config: VectorStoreConfig;
  private vectorProvider: any; // provider SDK client (Pinecone, Weaviate, etc.), injected at construction

  async createIndex(indexName: string): Promise<void> {
    const indexConfig = {
      name: indexName,
      dimensions: this.config.dimensions,
      metric: this.config.similarity,
      replicas: this.config.replicas,
      shards: this.config.shards,
      metadata_config: {
        indexed: ['category', 'source', 'timestamp', 'language']
      }
    };
    await this.vectorProvider.createIndex(indexConfig);
  }

  async upsertEmbeddings(embeddings: Embedding[]): Promise<void> {
    // Batch upsert for performance
    const batchSize = 1000;
    const batches = this.createBatches(embeddings, batchSize);

    const promises = batches.map(batch =>
      this.vectorProvider.upsert({
        vectors: batch.map(emb => ({
          id: emb.chunkId,
          values: emb.vector,
          metadata: emb.metadata
        }))
      })
    );
    await Promise.all(promises);
  }

  async search(queryVector: number[], options: SearchOptions): Promise<SearchResult[]> {
    const results = await this.vectorProvider.query({
      vector: queryVector,
      topK: options.topK || 10,
      filter: options.filters,
      includeMetadata: true,
      includeValues: false
    });

    return results.matches.map(match => ({
      chunkId: match.id,
      score: match.score,
      metadata: match.metadata
    }));
  }
}
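A hypothetical configuration and call sequence, assuming a Pinecone-style provider client behind the class. The index name and replica/shard counts are illustrative; size them for your own availability and throughput targets:

const storeConfig: VectorStoreConfig = {
  provider: 'pinecone',
  dimensions: 1536,   // must match the embedding model (ada-002 above)
  indexType: 'hnsw',
  similarity: 'cosine',
  replicas: 2,        // illustrative; raise for availability
  shards: 1
};

// Usage sketch, assuming a constructor that wires up the provider client:
// const store = new ProductionVectorStore(storeConfig);
// await store.createIndex('product-docs');
// await store.upsertEmbeddings(embeddings);
// const hits = await store.search(queryVector, { topK: 10, filters: { language: 'en' } });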
5. Query Processing Pipeline
Advanced Query Enhancement
class QueryProcessor {
  async processQuery(userQuery: string, context: UserContext): Promise<ProcessedQuery> {
    // 1. Intent classification
    const intent = await this.classifyIntent(userQuery, context);

    // 2. Query expansion
    const expandedQuery = await this.expandQuery(userQuery, intent);

    // 3. Generate query embedding
    const queryEmbedding = await this.embedQuery(expandedQuery);

    // 4. Apply context-based filtering
    const filters = this.buildFilters(intent, context);

    return {
      original: userQuery,
      expanded: expandedQuery,
      embedding: queryEmbedding,
      intent: intent,
      filters: filters,
      timestamp: new Date()
    };
  }

  private async expandQuery(query: string, intent: IntentClassification): Promise<string> {
    const expansionPrompt = `
Expand this user query to improve search relevance while maintaining the original intent.

Original Query: "${query}"
Intent: ${intent.primary}
Domain: ${intent.domain}

Generate 2-3 alternative phrasings that capture the same meaning.
Include relevant synonyms and related terms.

Expanded Query:`;

    const expansion = await this.llmService.complete({
      prompt: expansionPrompt,
      temperature: 0.3,
      maxTokens: 100
    });

    return `${query} ${expansion}`;
  }
}
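buildFilters is referenced above but not shown. A minimal sketch, assuming the intent carries a domain, a UserContext shaped roughly as below, and a vector store that indexes the corresponding metadata fields (category and language are indexed in createIndex above; a tenant field would need to be indexed too):

interface UserContextSketch {
  language?: string;
  tenantId?: string;
}

function buildFilters(
  intent: { domain?: string },
  context: UserContextSketch
): Record<string, unknown> {
  const filters: Record<string, unknown> = {};
  if (intent.domain) filters.category = intent.domain;       // scope retrieval to the intent's domain
  if (context.language) filters.language = context.language; // match the user's language
  if (context.tenantId) filters.tenant = context.tenantId;   // tenant isolation, if multi-tenant
  return filters;
}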
6. Retrieval and Re-ranking
Hybrid Retrieval Strategy
class HybridRetriever {
  async retrieve(processedQuery: ProcessedQuery, options: RetrievalOptions): Promise<RetrievalResult[]> {
    // 1. Semantic search using vector similarity (over-fetch to leave room for re-ranking)
    const semanticResults = await this.vectorStore.search(
      processedQuery.embedding,
      { topK: options.topK * 2, filters: processedQuery.filters }
    );

    // 2. Keyword-based search for exact matches
    const keywordResults = await this.keywordSearch(
      processedQuery.expanded,
      { topK: options.topK, filters: processedQuery.filters }
    );

    // 3. Combine and deduplicate results
    const combinedResults = this.combineResults(semanticResults, keywordResults);

    // 4. Re-rank using cross-encoder model
    const rerankedResults = await this.rerank(combinedResults, processedQuery.original);

    return rerankedResults.slice(0, options.topK);
  }

  private async rerank(results: RetrievalResult[], originalQuery: string): Promise<RetrievalResult[]> {
    const pairs = results.map(result => ({
      query: originalQuery,
      passage: result.content
    }));

    const relevanceScores = await this.crossEncoder.predict(pairs);

    return results.map((result, index) => ({
      ...result,
      relevanceScore: relevanceScores[index],
      finalScore: this.combineScores(result.similarityScore, relevanceScores[index])
    })).sort((a, b) => b.finalScore - a.finalScore);
  }
}
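combineScores is left undefined above. A common approach is a weighted blend of the vector-similarity score and the cross-encoder relevance score; the 0.3/0.7 weights below are assumptions to tune offline against labeled query/passage pairs:

// Assumes both scores are normalized to [0, 1]; rescale first if they are not.
function combineScores(similarityScore: number, relevanceScore: number): number {
  const SIMILARITY_WEIGHT = 0.3; // assumed; retrieval similarity as a weak prior
  const RELEVANCE_WEIGHT = 0.7;  // assumed; cross-encoder relevance usually dominates
  return SIMILARITY_WEIGHT * similarityScore + RELEVANCE_WEIGHT * relevanceScore;
}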
7. Context Assembly and Generation
Intelligent Context Management
class ContextAssembler {
  async assembleContext(
    retrievalResults: RetrievalResult[],
    userQuery: string,
    contextWindow: number = 4000
  ): Promise<AssembledContext> {
    // 1. Filter by relevance threshold
    const relevantResults = retrievalResults.filter(r => r.finalScore > 0.7);

    // 2. Optimize context within token limits
    const optimizedContext = await this.optimizeContext(relevantResults, contextWindow);

    // 3. Structure context for LLM consumption
    const structuredContext = this.structureContext(optimizedContext, userQuery);

    return {
      content: structuredContext,
      sources: optimizedContext.map(r => r.source),
      tokenCount: this.countTokens(structuredContext),
      confidence: this.calculateConfidence(optimizedContext)
    };
  }

  private async optimizeContext(
    results: RetrievalResult[],
    maxTokens: number
  ): Promise<RetrievalResult[]> {
    const optimizedResults: RetrievalResult[] = [];
    let currentTokenCount = 0;

    // Sort a copy by relevance (avoid mutating the caller's array) and add highest-scoring chunks first
    const sortedResults = [...results].sort((a, b) => b.finalScore - a.finalScore);

    for (const result of sortedResults) {
      const resultTokens = this.countTokens(result.content);

      if (currentTokenCount + resultTokens <= maxTokens) {
        optimizedResults.push(result);
        currentTokenCount += resultTokens;
      } else {
        // Try to fit a truncated version
        const truncated = this.truncateToFit(result.content, maxTokens - currentTokenCount);
        if (truncated.length > 50) { // Minimum useful content
          optimizedResults.push({
            ...result,
            content: truncated
          });
        }
        break;
      }
    }
    return optimizedResults;
  }

  private structureContext(results: RetrievalResult[], userQuery: string): string {
    const contextSections = results.map((result, index) => `
## Source ${index + 1}: ${result.source}
**Relevance:** ${(result.finalScore * 100).toFixed(1)}%

${result.content}
---`).join('\n');

    return `
# Context Information for Query: "${userQuery}"

The following information may help answer the user's question:
${contextSections}

Please use this information to provide an accurate, helpful response.
If the context doesn't contain relevant information, please say so clearly.
`;
  }
}
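countTokens and truncateToFit appear throughout the assembler. In production you would use the tokenizer matching your LLM (e.g., tiktoken for OpenAI models); the character-based heuristic below is only a sketch of the budget logic, assuming roughly 4 characters per token for English text:

// Rough heuristic; replace with the real tokenizer before relying on the budget.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function truncateToFit(text: string, maxTokens: number): string {
  const maxChars = maxTokens * 4;
  if (text.length <= maxChars) return text;
  // Prefer cutting at the last sentence boundary inside the budget, if one exists
  const slice = text.slice(0, maxChars);
  const lastStop = slice.lastIndexOf('. ');
  return lastStop > 0 ? slice.slice(0, lastStop + 1) : slice;
}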
Implementation Checklist
Phase 1: Foundation Setup (Week 1-2)
- Choose vector database provider (Pinecone, Weaviate, etc.)
- Set up development environment and dependencies
- Implement basic document ingestion pipeline
- Create document chunking service
- Set up embedding generation service
- Configure vector store with test data
Phase 2: Core RAG Pipeline (Week 3-4)
- Implement query processing and expansion
- Build semantic retrieval service
- Create re-ranking mechanism
- Develop context assembly logic
- Integrate with LLM for generation
- Implement basic output validation
Phase 3: Advanced Features (Week 5-6)
- Add intent classification
- Implement hybrid search (semantic + keyword)
- Create sophisticated re-ranking with cross-encoder
- Build feedback collection system
- Add performance monitoring
- Implement caching for common queries
Phase 4: Production Readiness (Week 7-8)
- Add comprehensive error handling and retries
- Implement rate limiting and load balancing
- Set up monitoring and alerting
- Create deployment scripts and configurations
- Add security measures (authentication, input validation)
- Conduct load testing and optimization
Phase 5: Quality & Analytics (Week 9-10)
- Implement A/B testing framework
- Add detailed analytics and reporting
- Create automated evaluation pipeline
- Set up continuous improvement mechanisms
- Document API and usage patterns
- Train team on system operation
Performance Optimization
Caching Strategy
interface CacheStrategy {
  queryCache: {
    ttl: number;      // Time to live in seconds
    maxSize: number;  // Maximum number of cached queries
    keyStrategy: 'exact' | 'semantic'; // How to match cached queries
  };
  embeddingCache: {
    ttl: number;
    maxSize: number;
  };
  resultCache: {
    ttl: number;
    maxSize: number;
  };
}
class RAGCacheManager {
  async getCachedResults(queryHash: string): Promise<RetrievalResult[] | null> {
    const cached = await this.cache.get(`query:${queryHash}`);
    if (cached && !this.isExpired(cached.timestamp, this.config.queryCache.ttl)) {
      return cached.results;
    }
    return null;
  }

  async cacheResults(queryHash: string, results: RetrievalResult[]): Promise<void> {
    await this.cache.set(`query:${queryHash}`, {
      results,
      timestamp: new Date(),
      ttl: this.config.queryCache.ttl
    });
  }
}
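The cache is keyed by a queryHash. For the 'exact' keyStrategy a hash of the normalized query is enough (the 'semantic' strategy would instead embed the query and match cached entries by similarity). A sketch using Node's built-in crypto module; including the filters in the key is an assumption, made because different filters produce different results:

import { createHash } from 'crypto';

// 'exact' key strategy: normalize, then hash, so trivial variations share a key.
function queryHash(query: string, filters: Record<string, unknown> = {}): string {
  const normalized = query.trim().toLowerCase().replace(/\s+/g, ' ');
  return createHash('sha256')
    .update(normalized)
    .update(JSON.stringify(filters)) // filters change results, so they belong in the key
    .digest('hex');
}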
Monitoring and Observability
interface RAGMetrics {
  latency: {   // milliseconds
    p50: number;
    p95: number;
    p99: number;
  };
  accuracy: {
    retrievalPrecision: number;
    retrievalRecall: number;
    generationQuality: number;
  };
  cost: {      // USD
    embeddingCost: number;
    vectorStoreCost: number;
    llmGenerationCost: number;
  };
  usage: {
    queriesPerSecond: number;
    cacheHitRate: number;
    errorRate: number;
  };
}
class RAGMonitor {
  async recordMetrics(operation: string, startTime: Date, success: boolean, metadata: any): Promise<void> {
    const duration = Date.now() - startTime.getTime();

    // Record to metrics backend (Prometheus, DataDog, etc.)
    this.metricsClient.histogram('rag_operation_duration', duration, {
      operation,
      success: success.toString(),
      ...metadata
    });

    this.metricsClient.counter('rag_operations_total', 1, {
      operation,
      status: success ? 'success' : 'error'
    });
  }
}
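A usage sketch showing how a retrieval call might be wrapped in this instrumentation; monitoredRetrieve is a hypothetical helper, and the monitor and retriever arguments are assumed to be already-constructed instances:

async function monitoredRetrieve(
  monitor: RAGMonitor,
  retriever: HybridRetriever,
  query: ProcessedQuery
): Promise<RetrievalResult[]> {
  const startTime = new Date();
  try {
    const results = await retriever.retrieve(query, { topK: 10 });
    await monitor.recordMetrics('retrieval', startTime, true, { topK: 10 });
    return results;
  } catch (error) {
    // Record the failure before re-throwing so error rates stay accurate
    await monitor.recordMetrics('retrieval', startTime, false, { topK: 10 });
    throw error;
  }
}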
Validation Checklist
Functional Validation
- System correctly ingests and processes documents
- Chunking preserves document meaning and context
- Embeddings are generated accurately and consistently
- Vector search returns relevant results
- Re-ranking improves result quality
- Context assembly stays within token limits
- LLM generates accurate responses based on retrieved context
- System handles edge cases (empty results, malformed queries)
Performance Validation
- Query latency meets SLA requirements (< 2 seconds p95)
- System handles expected concurrent load
- Memory usage remains stable under load
- Cache hit rates meet targets (> 80% for common queries)
- Cost per query stays within budget
- Vector database performance is optimized
Quality Validation
- Retrieved documents are relevant to user queries
- Generated responses are accurate and helpful
- System avoids hallucination and fabricated information
- Responses include appropriate source citations
- User feedback indicates high satisfaction (> 85%)
- A/B tests show improvement over baseline
Security Validation
- Input validation prevents injection attacks
- Authentication and authorization work correctly
- Sensitive information is properly filtered
- Audit logs capture all significant events
- Rate limiting prevents abuse
- Data handling complies with privacy regulations
Variations and Customization
Industry-Specific Adaptations
Customer Support RAG
- Emphasize recent ticket data and resolution patterns
- Include conversation context and customer history
- Integrate with ticketing systems for real-time updates
- Focus on actionable response generation
Knowledge Management RAG
- Implement document versioning and approval workflows
- Add expert annotations and manual curation
- Support multi-language content and queries
- Include access control and permission systems
E-commerce RAG
- Integrate with product catalogs and inventory systems
- Include price and availability in context
- Support image and multimodal search
- Focus on conversion-oriented responses
Architecture Variations
Serverless RAG
- Use cloud-native embedding services (AWS Bedrock, Azure OpenAI)
- Implement event-driven ingestion pipelines
- Leverage managed vector databases
- Optimize for cost and auto-scaling
Edge RAG
- Deploy smaller models for local inference
- Implement hybrid cloud-edge architecture
- Use model quantization and optimization
- Focus on latency and privacy
Multi-tenant RAG
- Implement data isolation and access controls
- Support per-tenant customization
- Use namespace-based vector storage
- Optimize for shared infrastructure efficiency
Success Metrics
Technical KPIs
- Query Latency: p95 < 2 seconds, p99 < 5 seconds
- Relevance Score: > 85% of results rated as relevant
- System Uptime: > 99.9% availability
- Cost Efficiency: < $0.10 per query including all services
- Cache Hit Rate: > 80% for repeated queries
Business KPIs
- User Satisfaction: > 85% positive ratings
- Task Completion Rate: > 90% of user questions answered successfully
- Support Ticket Reduction: 30-50% decrease in human support requests
- Time to Answer: < 30 seconds average response time
- User Engagement: > 60% return usage within 30 days
Ready to build your RAG system? Start with Phase 1: Foundation Setup and progress systematically through each implementation phase. Remember to validate extensively at each step and gather user feedback early and often.