Building RAG Pipelines with pgvector and OpenAI: A Practical Guide

Retrieval-Augmented Generation (RAG) has become the go-to architecture for building AI applications that need to reason over custom data. After implementing RAG pipelines that achieved 99% accuracy in product identification for a sales automation system, I want to share practical insights on building production-ready RAG systems.

What is RAG?

RAG combines the power of large language models with your own data. Instead of relying solely on the model’s training data, you retrieve relevant context from your knowledge base and augment the prompt with it. This approach:

  • Reduces hallucinations by grounding responses in facts
  • Keeps your knowledge base up-to-date without retraining
  • Provides source attribution for better trust

Architecture Overview

User Query → Vector Search → Context Retrieval → Augmented Prompt → LLM Response

Setting Up pgvector

First, enable the pgvector extension in PostgreSQL:

CREATE EXTENSION IF NOT EXISTS vector;

-- Create a products table with vector embeddings
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT,
    price DECIMAL(10, 2),
    embedding vector(1536)  -- OpenAI embedding dimension
);

Generating Embeddings with OpenAI

Using Python and OpenAI’s API:

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="your-api-key")

async def generate_embedding(text: str) -> list[float]:
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
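
Once you can generate embeddings, each product needs to be stored alongside its vector. Here's a minimal sketch, assuming `db` is an async SQLAlchemy session (the same assumption the retrieval code below makes); the helper name and the embedded text format are illustrative choices:

from sqlalchemy import text

async def insert_product(name: str, description: str, price: float):
    # Embed the text we will later search against
    embedding = await generate_embedding(f"{name}: {description}")
    await db.execute(
        text("""
            INSERT INTO products (name, description, price, embedding)
            VALUES (:name, :description, :price, (:embedding)::vector)
        """),
        # pgvector accepts the '[x, y, ...]' text format for vector input
        {"name": name, "description": description, "price": price,
         "embedding": str(embedding)},
    )
    await db.commit()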

Building the RAG Pipeline

Here’s a practical implementation I’ve used in production:

from sqlalchemy import text

# Assumes `db` is an async SQLAlchemy session (e.g. AsyncSession)
async def search_similar_products(query: str, limit: int = 5):
    # Generate query embedding
    query_embedding = await generate_embedding(query)

    # Vector similarity search with cosine distance (the <=> operator);
    # similarity = 1 - cosine distance
    sql_query = text("""
        SELECT id, name, description, price,
               1 - (embedding <=> (:embedding)::vector) AS similarity
        FROM products
        ORDER BY embedding <=> (:embedding)::vector
        LIMIT :limit
    """)

    results = await db.execute(
        sql_query,
        # Pass the embedding in pgvector's '[x, y, ...]' text format
        {"embedding": str(query_embedding), "limit": limit}
    )

    return results.fetchall()

Augmenting the Prompt

Once you've retrieved the relevant context, combine it with the user's query into an augmented prompt:

async def generate_rag_response(query: str) -> str:
    # Retrieve relevant context
    context = await search_similar_products(query)

    # Build context string
    context_text = "\n".join([
        f"- {row.name}: {row.description} (${row.price})"
        for row in context
    ])

    # Augmented prompt
    prompt = f"""
    Based on the following product catalog:

    {context_text}

    Customer Query: {query}

    Provide a helpful response identifying the relevant product(s).
    """

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful sales assistant."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content
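
Putting it together, an end-to-end call might look like this (the example query is hypothetical):

import asyncio

async def main():
    answer = await generate_rag_response("Do you have wireless keyboards under $50?")
    print(answer)

asyncio.run(main())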

Performance Optimization Tips

1. Batch Embedding Generation

Generate embeddings in batches to reduce API calls:

async def batch_generate_embeddings(texts: list[str]) -> list[list[float]]:
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]
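
The embeddings endpoint accepts a list of inputs per request, but very large catalogs should still be chunked. A minimal sketch; the batch size of 100 is an arbitrary conservative choice, not an API limit:

async def embed_catalog(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        # Send one chunk of texts per API call
        embeddings.extend(await batch_generate_embeddings(texts[i:i + batch_size]))
    return embeddings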

2. Add a Vector Index

Create an HNSW index so similarity queries use approximate nearest-neighbor search instead of scanning every row:

CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops);

3. Implement Caching

Cache frequently queried embeddings to reduce costs:

import asyncio
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_cached_embedding(text: str) -> list[float]:
    # asyncio.run() cannot be called from inside a running event loop,
    # so use this helper only from synchronous code paths
    return asyncio.run(generate_embedding(text))
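
In a fully async service, a plain dict cache sidesteps the asyncio.run() restriction; a minimal in-process sketch (for multi-worker deployments you'd swap in something shared, such as Redis):

_embedding_cache: dict[str, list[float]] = {}

async def get_embedding_cached(text: str) -> list[float]:
    # Compute the embedding once per unique text, then reuse it
    if text not in _embedding_cache:
        _embedding_cache[text] = await generate_embedding(text)
    return _embedding_cache[text]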

Key Learnings from Production

After deploying RAG pipelines processing 100+ queries daily with 99% accuracy:

  1. Data Quality Matters: Clean, structured product data significantly improves retrieval accuracy
  2. Embedding Model Choice: text-embedding-3-small offers great performance-to-cost ratio
  3. Similarity Threshold: Set a minimum similarity score (e.g., 0.75) to filter out irrelevant results (see the sketch after this list)
  4. Context Window: Include 3-5 retrieved items in context to avoid overwhelming the model
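
To make point 3 concrete, here is a minimal filtering sketch built on the search function above; the 0.75 default is a starting point to tune against your own data, not a universal constant:

async def search_with_threshold(query: str, min_similarity: float = 0.75, limit: int = 5):
    rows = await search_similar_products(query, limit=limit)
    # Keep only results whose cosine similarity clears the cutoff
    return [row for row in rows if row.similarity >= min_similarity]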

Conclusion

RAG pipelines are incredibly powerful for building domain-specific AI applications. With pgvector and OpenAI, you can achieve production-ready accuracy without the complexity of separate vector databases. The key is to focus on data quality, efficient retrieval, and proper prompt engineering.

Have questions about implementing RAG in your project? Feel free to reach out!
