# Deploying Your RAG Application to Production
In this final section, we'll cover best practices for deploying your MongoDB-RAG application to production, including optimization, monitoring, and scaling strategies.
## Architecture Considerations
When deploying a RAG application to production, consider this reference architecture:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│    Document     │     │     Search      │     │       LLM       │
│   Processing    │◄────┤     Service     │◄────┤     Service     │
│    Service      │     │                 │     │                 │
│                 │     │                 │     │                 │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│  MongoDB Atlas  │     │      Cache      │     │  LLM Provider   │
│ (Vector Store)  │     │(Redis/Memcached)│     │  (OpenAI/etc.)  │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
### Microservices Approach
Consider splitting your application into specialized microservices:
- **Document Processing Service**
  - Owns the document ingestion pipeline: chunking and embedding generation
  - Processes documents asynchronously via a queue
- **Search Service**
  - Manages vector search and retrieval
  - Implements hybrid search capabilities
  - Caches common search results
- **LLM Service**
  - Handles prompt construction and LLM interaction
  - Manages prompt templates and response generation
  - Implements retry logic and fallback mechanisms
## Optimizing MongoDB Atlas

### Sizing and Scaling
For production deployments, ensure your MongoDB Atlas cluster is properly sized:
- **Cluster Tier Selection**
  - M10+ for small production workloads
  - M30+ for medium-sized applications
  - M60+ for larger workloads with high query volumes
- **Horizontal Scaling**
  - Use sharding for very large vector collections (billions of vectors)
  - Distribute data across multiple shards based on a shard key
- **Instance Size**
  - Ensure sufficient RAM for vector search operations (see the sizing sketch after this list)
  - Consider storage requirements for documents and embeddings
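As a rough rule of thumb (an assumption to sanity-check against Atlas's own sizing guidance), a vector index needs on the order of `numVectors × dimensions × bytesPerComponent` of RAM, before graph overhead:

```javascript
// Back-of-envelope RAM estimate for a vector index.
// Assumption: float32 components (4 bytes each); index graph overhead is ignored.
function estimateVectorIndexBytes(numVectors, dimensions, bytesPerComponent = 4) {
  return numVectors * dimensions * bytesPerComponent;
}

// 1M embeddings at 1536 dimensions:
console.log(estimateVectorIndexBytes(1_000_000, 1536));    // 6144000000 ≈ 6.1 GB (float32)
console.log(estimateVectorIndexBytes(1_000_000, 1536, 1)); // 1536000000 ≈ 1.5 GB (int8 quantized)
```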
### Index Optimization
Optimize your vector search indexes:
```javascript
const optimizedIndex = {
  name: "production_vector_index",
  type: "vectorSearch",
  definition: {
    fields: [
      {
        type: "vector",
        path: "embedding",
        numDimensions: 1536,
        similarity: "cosine",
        // Scalar (int8) quantization trades a little recall for much lower memory use
        quantization: "scalar"
      },
      // Declare filter fields here so $vectorSearch can pre-filter on them
      { type: "filter", path: "metadata.source" },
      { type: "filter", path: "metadata.type" },
      { type: "filter", path: "metadata.created" }
    ]
  }
};
```
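With the definition in hand, you can create the index from application code; this sketch assumes a Node.js driver 6.x+ `collection` handle:

```javascript
// Create the Atlas Vector Search index programmatically (Node.js driver 6.x+)
const indexName = await collection.createSearchIndex(optimizedIndex);
console.log(`Created search index: ${indexName}`);
```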
### Connection Pooling
Implement connection pooling for efficient database connections:
```javascript
const { MongoClient } = require('mongodb');

// In your MongoDB connection configuration
const client = new MongoClient(uri, {
  maxPoolSize: 50,           // Adjust based on expected concurrency
  minPoolSize: 5,            // Maintain minimum connections
  maxIdleTimeMS: 30000,      // Close idle connections after 30 seconds
  waitQueueTimeoutMS: 10000, // Timeout when waiting for a pooled connection
  connectTimeoutMS: 30000,   // Timeout for initial connection
  socketTimeoutMS: 45000     // Timeout for socket operations
});
```
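Because the pool belongs to the client, a common pattern is to create one `MongoClient` per process and reuse it everywhere rather than connecting per request; a minimal sketch:

```javascript
// Reuse a single client (and its connection pool) across the whole process
let clientPromise;

function getMongoClient() {
  if (!clientPromise) {
    const client = new MongoClient(process.env.MONGODB_URI, { maxPoolSize: 50 });
    clientPromise = client.connect(); // connect() resolves to the connected client
  }
  return clientPromise;
}
```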
## Caching Strategies

Implement multi-level caching to improve performance.

### Query Result Caching

Cache common search results:
```javascript
const NodeCache = require('node-cache');

const cache = new NodeCache({ stdTTL: 3600, checkperiod: 120 });

async function cachedSearch(query, options = {}) {
  // Generate cache key based on query and options
  const cacheKey = `search:${JSON.stringify({ query, options })}`;

  // Check cache first
  const cachedResults = cache.get(cacheKey);
  if (cachedResults) {
    console.log('Cache hit for query:', query);
    return cachedResults;
  }

  // If not in cache, perform the search
  const results = await rag.search(query, options);

  // Cache the results
  cache.set(cacheKey, results);
  return results;
}
```
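One caveat: `JSON.stringify` keys grow with the query and options. Hashing keeps keys short and uniformly sized; a sketch using Node's built-in `crypto` module:

```javascript
const crypto = require('crypto');

// Derive a fixed-length cache key from arbitrary query/options payloads
function searchCacheKey(query, options) {
  const payload = JSON.stringify({ query, options });
  return 'search:' + crypto.createHash('sha256').update(payload).digest('hex');
}
```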
### Embedding Caching
Cache embeddings to avoid regenerating them:
```javascript
const NodeCache = require('node-cache');

const embeddingCache = new NodeCache({ stdTTL: 86400, checkperiod: 600 });

// Wrap any embedding provider with a read-through cache
class CachedEmbeddingProvider {
  constructor(provider) {
    this.provider = provider;
  }

  async getEmbedding(text) {
    const cacheKey = `embed:${text}`;

    // Check cache first
    const cachedEmbedding = embeddingCache.get(cacheKey);
    if (cachedEmbedding) {
      return cachedEmbedding;
    }

    // Generate embedding if not cached
    const embedding = await this.provider.getEmbedding(text);

    // Cache the embedding
    embeddingCache.set(cacheKey, embedding);
    return embedding;
  }

  async getEmbeddings(texts) {
    // Serve cached embeddings; batch-generate only the misses
    const results = new Array(texts.length);
    const missing = [];
    texts.forEach((text, i) => {
      const cached = embeddingCache.get(`embed:${text}`);
      if (cached) {
        results[i] = cached;
      } else {
        missing.push(i);
      }
    });

    if (missing.length > 0) {
      const fresh = await this.provider.getEmbeddings(missing.map(i => texts[i]));
      missing.forEach((idx, j) => {
        results[idx] = fresh[j];
        embeddingCache.set(`embed:${texts[idx]}`, fresh[j]);
      });
    }
    return results;
  }
}
```
### LLM Response Caching

Cache LLM responses for repeated queries:
```javascript
const NodeCache = require('node-cache');

const llmCache = new NodeCache({ stdTTL: 3600, checkperiod: 120 });

async function cachedLLMGeneration(messages, options = {}) {
  const model = options.model || config.llm.model;

  // Key on both model and messages so different models never share entries
  const cacheKey = `llm:${model}:${JSON.stringify(messages)}`;

  // Check cache first
  const cachedResponse = llmCache.get(cacheKey);
  if (cachedResponse) {
    return cachedResponse;
  }

  // Generate response if not cached
  const response = await openai.chat.completions.create({
    model,
    messages,
    temperature: options.temperature ?? 0.3 // ?? so an explicit 0 is respected
  });

  // Cache the response
  llmCache.set(cacheKey, response);
  return response;
}
```
## Rate Limiting and Queueing

Implement rate limiting and job queueing for robust production applications.

### Rate Limiting
```javascript
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);

// Apply rate limiting middleware (assumes an existing Express `app`)
app.use('/api/rag', rateLimit({
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args)
  }),
  windowMs: 60 * 1000, // 1 minute
  max: 20,             // 20 requests per minute per client
  standardHeaders: true,
  message: {
    error: 'Too many requests, please try again later.'
  }
}));
```
### Job Queueing
Use a job queue for background processing:
```javascript
const { Queue, Worker } = require('bullmq');
const Redis = require('ioredis');

// Create Redis connection (BullMQ workers require maxRetriesPerRequest: null)
const connection = new Redis(process.env.REDIS_URL, { maxRetriesPerRequest: null });

// Create document processing queue
const documentQueue = new Queue('document-processing', { connection });

// Add document to processing queue
async function queueDocumentForProcessing(document) {
  await documentQueue.add('process-document', {
    document,
    timestamp: Date.now()
  }, {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 5000
    }
  });
  return { status: 'queued', documentId: document.id };
}

// Process documents from the queue
const processingWorker = new Worker('document-processing', async (job) => {
  const { document } = job.data;
  try {
    // Step 1: Chunk the document
    const chunks = await chunker.chunkDocument(document);

    // Step 2: Generate embeddings and store in MongoDB
    const result = await rag.ingestBatch(chunks);
    return { processed: result.processed };
  } catch (error) {
    console.error('Document processing error:', error);
    throw error;
  }
}, { connection });
```
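BullMQ workers also emit lifecycle events, which make handy hooks for the metrics and logging described below:

```javascript
// Surface job outcomes for logging/metrics
processingWorker.on('completed', (job) => {
  console.log(`Document job ${job.id} completed`);
});

processingWorker.on('failed', (job, err) => {
  console.error(`Document job ${job?.id} failed:`, err.message);
});
```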
## Monitoring and Observability

Implement comprehensive monitoring for your production RAG application.

### Prometheus Metrics
```javascript
const client = require('prom-client');

const register = new client.Registry();

// Add default metrics
client.collectDefaultMetrics({ register });

// Create custom metrics
const searchLatency = new client.Histogram({
  name: 'rag_search_latency_seconds',
  help: 'RAG search latency in seconds',
  labelNames: ['status'],
  buckets: [0.1, 0.5, 1, 2, 5, 10]
});

const llmLatency = new client.Histogram({
  name: 'llm_request_latency_seconds',
  help: 'LLM request latency in seconds',
  labelNames: ['model', 'status'],
  buckets: [0.5, 1, 3, 5, 10, 30]
});

const searchCounter = new client.Counter({
  name: 'rag_search_total',
  help: 'Total number of RAG searches',
  labelNames: ['status']
});

// Register metrics
register.registerMetric(searchLatency);
register.registerMetric(llmLatency);
register.registerMetric(searchCounter);

// Add metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Instrument the search function
async function instrumentedSearch(query, options = {}) {
  const end = searchLatency.startTimer();
  searchCounter.inc({ status: 'started' });
  try {
    const results = await rag.search(query, options);
    end({ status: 'success' });
    searchCounter.inc({ status: 'success' });
    return results;
  } catch (error) {
    end({ status: 'error' });
    searchCounter.inc({ status: 'error' });
    throw error;
  }
}
```
### Logging
Implement structured logging for better observability:
```javascript
const winston = require('winston');
const { ElasticsearchTransport } = require('winston-elasticsearch');

// Configure logger
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: { service: 'rag-service' },
  transports: [
    new winston.transports.Console(),
    new ElasticsearchTransport({
      level: 'info',
      clientOpts: { node: process.env.ELASTICSEARCH_URL },
      indexPrefix: 'rag-logs'
    })
  ]
});

// Log search events
async function loggedSearch(query, options = {}) {
  const startTime = Date.now(); // needed for durationMs below

  logger.info('Search initiated', {
    query,
    options,
    userId: options.userId,
    requestId: options.requestId
  });

  try {
    const results = await rag.search(query, options);
    logger.info('Search completed', {
      query,
      resultCount: results.length,
      topScore: results[0]?.score,
      durationMs: Date.now() - startTime,
      requestId: options.requestId
    });
    return results;
  } catch (error) {
    logger.error('Search failed', {
      query,
      error: error.message,
      stack: error.stack,
      requestId: options.requestId
    });
    throw error;
  }
}
```
## Deployment Options

### Docker Containerization

Create a `Dockerfile` for your application:
```dockerfile
FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN npm ci --omit=dev

COPY . .

ENV NODE_ENV=production
ENV PORT=3000

EXPOSE 3000

CMD ["node", "src/index.js"]
```
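Pair the `Dockerfile` with a `.dockerignore` so local artifacts and secrets never reach the image; one reasonable starting point:

```
node_modules
npm-debug.log
.env
.git
```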
### Kubernetes Deployment
Create a Kubernetes deployment manifest:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-application
  labels:
    app: rag-application
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-application
  template:
    metadata:
      labels:
        app: rag-application
    spec:
      containers:
        - name: rag-application
          image: your-registry/rag-application:latest
          ports:
            - containerPort: 3000
          env:
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials
                  key: uri
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: openai-credentials
                  key: api-key
          resources:
            requests:
              memory: "512Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 20
            periodSeconds: 15
```
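Both probes assume the application actually serves `/healthz`; a minimal sketch, reusing the Express `app` and MongoDB `client` from earlier snippets:

```javascript
// Health endpoint for the Kubernetes readiness/liveness probes
app.get('/healthz', async (req, res) => {
  try {
    // ping verifies we can still reach MongoDB
    await client.db('admin').command({ ping: 1 });
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'unavailable', error: err.message });
  }
});
```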
## Security Considerations

### API Key Management

Use environment variables and secrets management:
```javascript
const OpenAI = require('openai');
const { MongoRAG } = require('mongodb-rag');

// Load environment variables from a secure source
require('dotenv').config();

// Access keys securely
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

const rag = new MongoRAG({
  mongoUrl: process.env.MONGODB_URI,
  // ...other config
  embedding: {
    provider: process.env.EMBEDDING_PROVIDER,
    apiKey: process.env.EMBEDDING_API_KEY
  }
});
```
### MongoDB Atlas Security

- **Network Security**
  - Use IP allowlists or private endpoints
  - Enable VPC peering for cloud deployments
  - Require TLS/SSL for all connections
- **Authentication**
  - Use strong, unique passwords
  - Create database users with least-privilege roles
  - Consider SCRAM authentication
- **Encryption**
  - Enable encryption at rest
  - Use TLS/SSL for data in transit
  - Consider client-side field-level encryption for sensitive data
## Cost Optimization

### Embedding Generation Costs

Optimize embedding generation costs:
- **Batch Processing**
  - Process documents in batches
  - Use the lowest-cost embedding model that meets quality requirements
  - Cache embeddings to avoid regeneration
- **Selective Updates**
  - Only regenerate embeddings when content changes
  - Implement delta updates for document changes (see the sketch after this list)
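One way to implement delta updates is to store a content hash alongside each chunk and skip re-embedding when it hasn't changed. In this sketch, `chunksCollection` and `embeddingProvider` are hypothetical handles from your ingestion service:

```javascript
const crypto = require('crypto');

// Re-embed a chunk only when its content hash has changed (delta update)
async function upsertChunkIfChanged(chunk) {
  const contentHash = crypto.createHash('sha256').update(chunk.text).digest('hex');

  const existing = await chunksCollection.findOne(
    { _id: chunk.id },
    { projection: { contentHash: 1 } }
  );
  if (existing && existing.contentHash === contentHash) {
    return { skipped: true }; // content unchanged: keep the stored embedding
  }

  const embedding = await embeddingProvider.getEmbedding(chunk.text);
  await chunksCollection.updateOne(
    { _id: chunk.id },
    { $set: { text: chunk.text, embedding, contentHash, updatedAt: new Date() } },
    { upsert: true }
  );
  return { skipped: false };
}
```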
### LLM Usage Optimization

Reduce LLM API costs:
- **Context Trimming**
  - Send only the most relevant context to the LLM
  - Implement intelligent context selection
- **Response Caching**
  - Cache common LLM responses
  - Implement semantic caching based on query similarity
- **Model Selection**
  - Use smaller models for simpler tasks
  - Implement model cascading, trying smaller models first (see the sketch after this list)
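Model cascading can be as simple as trying the cheaper model first and escalating when the call fails or the completion is cut off; a sketch (the model names are illustrative assumptions, not a recommendation):

```javascript
// Hypothetical cascade: cheapest model first, escalate on failure or truncation
async function cascadedCompletion(messages) {
  const models = ['gpt-4o-mini', 'gpt-4o']; // ordered cheapest to most capable

  for (const model of models) {
    try {
      const response = await openai.chat.completions.create({
        model,
        messages,
        temperature: 0.3
      });
      if (response.choices[0].finish_reason === 'stop') {
        return response; // clean completion: accept it
      }
      // truncated or filtered: fall through and escalate
    } catch (err) {
      console.warn(`Model ${model} failed, escalating:`, err.message);
    }
  }
  throw new Error('All models in the cascade failed');
}
```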
## Continuous Improvement

### A/B Testing Framework

Implement A/B testing to evaluate RAG improvements:
```javascript
async function abTestSearch(query, options = {}) {
  // Determine which variant to use
  const variant = Math.random() < 0.5 ? 'A' : 'B';

  // Execute the appropriate search method
  let results;
  const startTime = Date.now();

  if (variant === 'A') {
    // Control: standard search
    results = await rag.search(query, options);
  } else {
    // Treatment: new search algorithm under evaluation
    results = await advancedSearch(query, options);
  }

  // Record metrics
  const duration = Date.now() - startTime;

  // Log experiment data
  logger.info('Search experiment', {
    variant,
    query,
    resultCount: results.length,
    topScore: results[0]?.score,
    duration,
    experimentId: 'search-algorithm-v2'
  });

  return results;
}
```
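Per-request `Math.random()` can show the same user both variants mid-session; hashing a stable identifier keeps assignment sticky (a sketch):

```javascript
const crypto = require('crypto');

// Deterministic 50/50 assignment: the same userId always gets the same variant
function assignVariant(userId, experimentId = 'search-algorithm-v2') {
  const hash = crypto.createHash('sha256').update(`${experimentId}:${userId}`).digest();
  return hash[0] < 128 ? 'A' : 'B';
}
```

`abTestSearch` could then derive `variant` from `assignVariant(options.userId)` instead of `Math.random()`.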
### Feedback Collection
Implement user feedback collection:
```javascript
// Assumes `feedbackCollection` (a MongoDB collection) and `relevanceMetrics`
// (a prom-client Counter) are initialized elsewhere
async function recordSearchFeedback(searchId, feedback) {
  await feedbackCollection.insertOne({
    searchId,
    query: feedback.query,
    results: feedback.results,
    rating: feedback.rating,
    comments: feedback.comments,
    timestamp: new Date()
  });

  // Track relevance metrics
  if (feedback.rating >= 4) {
    relevanceMetrics.inc({ status: 'good' });
  } else if (feedback.rating <= 2) {
    relevanceMetrics.inc({ status: 'poor' });
  }
}
```
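Feedback is only useful once you look at it in aggregate; for example, a quick rolling-average query over the feedback collection (a sketch):

```javascript
// Average rating and feedback volume over the last 7 days
const sevenDaysAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);

const summary = await feedbackCollection.aggregate([
  { $match: { timestamp: { $gte: sevenDaysAgo } } },
  { $group: { _id: null, avgRating: { $avg: '$rating' }, count: { $sum: 1 } } }
]).toArray();

console.log(summary[0]); // e.g. { _id: null, avgRating: 4.2, count: 137 }
```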
## Conclusion
Congratulations! You've completed the MongoDB-RAG Workshop. You now have the knowledge and skills to:
- Set up MongoDB Atlas with Vector Search
- Create and manage vector embeddings
- Build a complete RAG application with MongoDB
- Implement advanced retrieval techniques
- Deploy and optimize your application for production
### Next Steps
To continue your RAG journey:
- **Explore Advanced Features**
  - Try different embedding models
  - Experiment with hybrid search approaches
  - Implement multi-modal RAG (text, images, etc.)
- **Integrate with Your Systems**
  - Connect your RAG system to your existing data sources
  - Build domain-specific knowledge bases
  - Implement custom workflows
- **Stay Updated**
  - Follow the MongoDB Developer Blog
  - Join the MongoDB Community Forums
  - Check for updates to the mongodb-rag library
We hope you found this workshop valuable and look forward to seeing what you build with MongoDB-RAG!