TL;DR: I built a full‑stack RAG chatbot platform using Next.js 15, Express, and PostgreSQL. It features semantic chunking, hybrid search (vector + keyword), LLM reranking, and a Playwright‑based web crawler, all inside a Turborepo monorepo. This post is a deep technical breakdown of how it actually works in production.
The Problem That Started It All
Horizon started with a recurring problem my team (YBM) kept hitting: finding information inside massive websites and documentation sets is painful. As developers, we’ve all wasted hours digging through 200‑page docs, broken search boxes, or half‑maintained wikis just to answer one specific question.
So we built Horizon, a no‑nonsense chatbot that actually understands your content using Retrieval‑Augmented Generation (RAG).
What Makes Horizon Interesting (Technically)
Upload documents or let the crawler handle everything
Proper semantic chunking (no naive fixed‑size splits)
Hybrid search: vector similarity + keyword matching
LLM‑powered reranking for precision
Headless browser crawling for client‑side rendered sites
Production‑ready monorepo with clean separation of concerns
Architecture: The Monorepo Setup (Using Turborepo)
This is a Turborepo, because managing multiple repos for a single product is unnecessary pain, and I wanted to get my hands dirty with Turborepo anyway :)
```
horizon/
├── apps/
│   ├── frontend/   # Next.js 15 (React 19, Tailwind CSS 4)
│   ├── backend/    # Express.js API
│   └── worker/     # Background job processor
└── packages/
    ├── database/   # Prisma + pgvector
    ├── rag/        # Core RAG engine
    ├── queue/      # BullMQ job queue
    ├── storage/    # S3-compatible storage
    └── shared/     # Types and utilities
```
Each package does one thing well. The @repo/rag package is fully standalone - it could be dropped into another project without dragging the rest of the system with it.
This structure makes testing, debugging, and refactoring dramatically easier.
Tech Stack (And Why)
Next.js 15 + React 19: Server components, modern routing, and strong performance
Express.js: Boring, stable, predictable (that’s a compliment xD)
PostgreSQL + pgvector: Fast vector search without running a separate vector DB
Prisma: Type‑safe queries; hard to go back once you've used it (and people recommending Drizzle, please stay away)
BullMQ + Redis: Background jobs for crawling and document processing
OpenAI: Embeddings (text-embedding-3-small) and chat models
RAG Implementation: Where Quality Is Won or Lost
Most RAG tutorials show a simplified approach: split text into chunks, embed, and search. That works for demos but fails in production!
The Problem With Naive Chunking
If you split every 500 characters, you end up breaking meaning:
Sentences get cut in half
Context disappears
Retrieval quality drops
When someone asks a question, the model gets incomplete or misleading chunks.
Semantic Chunking (the Solution, Basically)
Instead of fixed‑size splitting, Horizon uses structure‑aware chunking:
Detect document structure (Markdown headings, numbered sections, ALL‑CAPS titles)
Build hierarchical section paths
Preserve semantic boundaries
Add overlap for continuity
Prefix each chunk with contextual breadcrumbs
Example context prefix:
[Documentation > Authentication > OAuth 2.0]
This context is embedded with the chunk, so semantic meaning survives vectorization.
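Here's a minimal sketch of the idea in TypeScript. It only handles Markdown headings; the real implementation also deals with numbered sections, ALL‑CAPS titles, and overlap, so treat this as illustrative rather than Horizon's exact code:

```typescript
// Structure-aware chunking: split on Markdown headings, track the
// hierarchical section path, and prefix every chunk with its
// breadcrumbs so the context survives embedding.
interface Chunk {
  text: string;          // "[Docs > Auth > OAuth 2.0]" + section body
  sectionPath: string[];
}

export function chunkByStructure(doc: string): Chunk[] {
  const chunks: Chunk[] = [];
  const path: string[] = [];
  let buffer: string[] = [];

  const flush = () => {
    const body = buffer.join("\n").trim();
    if (body) {
      chunks.push({
        text: `[${path.join(" > ")}]\n${body}`, // contextual breadcrumb prefix
        sectionPath: [...path],
      });
    }
    buffer = [];
  };

  for (const line of doc.split("\n")) {
    const heading = /^(#{1,6})\s+(.*)/.exec(line);
    if (heading) {
      flush();                             // close the previous section
      path.splice(heading[1].length - 1);  // drop deeper levels from the path
      path.push(heading[2].trim());
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```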
Embeddings: Turning Text Into Vectors
Each chunk is converted into a 1536‑dimensional vector using OpenAI embeddings (the model we use, text-embedding-3-small, outputs 1536 numbers per input).
Optimizations that matter in production:
Redis caching to avoid duplicate embedding calls
Batching to reduce API overhead
Retry logic with exponential backoff
Embeddings are stored directly in PostgreSQL using pgvector.
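A condensed sketch of that embedding path, assuming ioredis and the official openai SDK (the cache key scheme and TTL here are illustrative, not Horizon's exact code):

```typescript
import crypto from "node:crypto";
import OpenAI from "openai";
import Redis from "ioredis";

const openai = new OpenAI();
const redis = new Redis();

// Embed a batch of chunks, skipping any we've already embedded.
// The cache key is a hash of the text, so identical chunks never hit the API twice.
export async function embedChunks(texts: string[]): Promise<number[][]> {
  const keys = texts.map(
    (t) => "emb:" + crypto.createHash("sha256").update(t).digest("hex"),
  );
  const cached = await redis.mget(...keys);

  const misses = texts.filter((_, i) => !cached[i]);
  let fresh: number[][] = [];
  if (misses.length > 0) {
    fresh = await withRetry(async () => {
      const res = await openai.embeddings.create({
        model: "text-embedding-3-small", // 1536-dimensional output
        input: misses,                   // one batched API call
      });
      return res.data.map((d) => d.embedding);
    });
  }

  // Merge cache hits with fresh results, writing new vectors back to Redis.
  let f = 0;
  return Promise.all(
    texts.map(async (_, i) => {
      if (cached[i]) return JSON.parse(cached[i]!) as number[];
      const vec = fresh[f++];
      await redis.set(keys[i], JSON.stringify(vec), "EX", 60 * 60 * 24);
      return vec;
    }),
  );
}

// Exponential backoff: 1s, 2s, 4s...
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i >= attempts - 1) throw err;
      await new Promise((r) => setTimeout(r, 1000 * 2 ** i));
    }
  }
}
```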
Vector Search With pgvector
Instead of a separate vector database, Horizon runs cosine similarity search inside Postgres:
Uses HNSW indexing (full form: Hierarchical Navigable Small World; all you really need to know is that it finds approximate nearest neighbours fast)
~20–50ms query latency on ~100k chunks
Simple operational model
For most products under ~10M vectors, this is more than enough.
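Prisma doesn't support the vector column type natively on the query side, so the search itself runs as raw SQL. A minimal sketch, assuming a chunks table with a vector(1536) column (the table and column names here are hypothetical):

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Assumes something like:
//   CREATE TABLE chunks (id TEXT PRIMARY KEY, content TEXT, embedding vector(1536));
//   CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
export async function vectorSearch(queryEmbedding: number[], limit = 20) {
  const vec = `[${queryEmbedding.join(",")}]`; // pgvector's text literal format
  return prisma.$queryRaw<{ id: string; content: string; score: number }[]>`
    SELECT id,
           content,
           1 - (embedding <=> ${vec}::vector) AS score -- <=> is cosine distance
    FROM chunks
    ORDER BY embedding <=> ${vec}::vector
    LIMIT ${limit}
  `;
}
```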
Hybrid Search: Vector + Keyword
Pure vector search struggles with exact terms like:
API_KEY
PostgreSQL 15.3
Error codes
Horizon runs two searches in parallel:
Semantic vector search
Keyword‑based text search
Results are merged using Reciprocal Rank Fusion (RRF), boosting chunks that score well in both.
This dramatically improves recall without hurting precision.
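RRF is delightfully simple: each list contributes 1/(k + rank) to a chunk's score, so a chunk ranked high in both lists beats one that dominates only a single list. A sketch:

```typescript
// Merge two ranked lists of chunk ids with Reciprocal Rank Fusion.
// k = 60 is the conventional smoothing constant; rank is 0-based here,
// hence the +1 so the formula matches the usual 1/(k + rank).
export function reciprocalRankFusion(
  vectorResults: string[],  // chunk ids, best first
  keywordResults: string[], // chunk ids, best first
  k = 60,
): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorResults, keywordResults]) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```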
LLM Reranking: The Final Precision Pass
Even hybrid search isn’t perfect. The top results still need ordering.
Horizon sends the top candidates to a small LLM (gpt-5-mini) and asks it to rerank them by relevance.
This final step improved answer accuracy by ~15–20% in testing.
Small model, big impact.
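The rerank call itself is unglamorous. Roughly (the prompt is paraphrased and the JSON parsing is simplified; production code should validate the model's output):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Ask a small model to order candidate chunks by relevance to the query.
// We number the candidates and ask for the indices back, best first.
export async function rerank(query: string, candidates: string[]): Promise<string[]> {
  const numbered = candidates
    .map((c, i) => `[${i}] ${c.slice(0, 500)}`) // truncate to keep the prompt small
    .join("\n\n");

  const res = await openai.chat.completions.create({
    model: "gpt-5-mini",
    messages: [
      {
        role: "user",
        content:
          `Rank these passages by relevance to the question.\n` +
          `Question: ${query}\n\nPassages:\n${numbered}\n\n` +
          `Reply with only a JSON array of passage indices, most relevant first.`,
      },
    ],
  });

  const order: number[] = JSON.parse(res.choices[0].message.content ?? "[]");
  return order.map((i) => candidates[i]).filter(Boolean);
}
```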
Query Rewriting: Fixing User Input
Users ask vague questions like (can't blame them, we do the same):
“how does it work?”
Horizon rewrites queries using conversation context:
Before search
Using a lightweight LLM
Producing clearer, intent‑focused queries
Better input → better retrieval.
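A sketch of that rewrite step (simplified; the actual prompt is more elaborate):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

type Turn = { role: "user" | "assistant"; content: string };

// Turn a vague follow-up ("how does it work?") into a standalone,
// intent-focused query using the recent conversation as context.
export async function rewriteQuery(query: string, history: Turn[]): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-5-mini", // any cheap, fast model works here
    messages: [
      ...history.slice(-6), // a few recent turns are enough context
      {
        role: "user",
        content:
          `Rewrite this question so it is self-contained and specific, ` +
          `using the conversation above. Reply with only the rewritten question.\n\n` +
          `Question: ${query}`,
      },
    ],
  });
  return res.choices[0].message.content?.trim() || query; // fall back to the original
}
```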
The Complete RAG Pipeline
User query
Query rewriting
Vector + keyword search (parallel)
Reciprocal rank fusion
LLM reranking
Context‑limited chunk selection
Final answer generation
Total latency: usually around 650ms, but it sometimes spikes to 2.3s when Redis decides to have a moment. Still debugging that.
But worth it for the quality jump.
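Stitched together, the pipeline is just a composition of the sketches above. The declare'd helpers are hypothetical stand-ins for parts not shown in this post:

```typescript
import { rewriteQuery } from "./rewrite";        // sketched earlier
import { embedChunks } from "./embeddings";      // sketched earlier
import { vectorSearch } from "./vector-search";  // sketched earlier
import { reciprocalRankFusion } from "./rrf";    // sketched earlier
import { rerank } from "./rerank";               // sketched earlier

// Hypothetical helpers not shown in this post:
declare function keywordSearch(q: string): Promise<{ id: string }[]>;
declare function loadChunks(ids: string[]): Promise<string[]>;
declare function fitToTokenBudget(chunks: string[], budget: number): string[];
declare function generateAnswer(q: string, context: string[]): Promise<string>;

type Turn = { role: "user" | "assistant"; content: string };

export async function answer(query: string, history: Turn[]): Promise<string> {
  // 1-2. Rewrite the query, then embed the rewritten form.
  const rewritten = await rewriteQuery(query, history);
  const [embedding] = await embedChunks([rewritten]);

  // 3. Vector and keyword search run in parallel.
  const [vector, keyword] = await Promise.all([
    vectorSearch(embedding),
    keywordSearch(rewritten),
  ]);

  // 4. Reciprocal rank fusion merges the two ranked lists.
  const fusedIds = reciprocalRankFusion(
    vector.map((r) => r.id),
    keyword.map((r) => r.id),
  );

  // 5-6. Rerank the top candidates, then trim to the token budget.
  const reranked = await rerank(rewritten, await loadChunks(fusedIds.slice(0, 20)));
  const context = fitToTokenBudget(reranked, 4000);

  // 7. Final answer generation over the selected chunks.
  return generateAnswer(rewritten, context);
}
```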
Web Crawling: When Docs Don’t Exist
Most teams don’t have clean documentation files - but they do have websites.
So we decided to build an integrated website crawler into Horizon. The first version was a simple crawler built with the cheerio library from npm: we just fetched the HTML and parsed it. That worked just fine initially, but the moment we tried scraping a client's website that was client-side rendered (CSR), it pretty much failed.
After debugging, the cause was clear: CSR sites take time to populate/hydrate the HTML, and our crawler wasn't waiting for that to happen, so a basic fetch-and-parse crawler can't work for all types of sites.
Solution: there aren't many options other than simulating a browser, so I installed the popular npm library Playwright. It drives a headless browser, where we load the website and wait a few seconds to let the page load up completely before scraping it, and that solves it.
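The core of that fix is tiny. Something along these lines (the 3‑second wait mirrors the "wait a few seconds" approach; goto's networkidle option is a less crude alternative):

```typescript
import { chromium } from "playwright";

// Load the page in a headless browser and give client-side rendering
// time to hydrate the HTML before scraping it.
export async function fetchRenderedHtml(url: string): Promise<string> {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { timeout: 30_000 });
    await page.waitForTimeout(3_000); // let the CSR site finish hydrating
    return await page.content();      // fully rendered HTML, ready for parsing
  } finally {
    await browser.close();
  }
}
```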
Background Processing With Workers
Crawling and document processing are handled asynchronously using BullMQ:
API stays fast and mostly unaffected
Jobs retry on failure, which is good
Progress is tracked
Failures are debuggable
We tried doing this synchronously at first. That was a mistake, so we shifted to good old microservices.
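The producer/consumer split with BullMQ looks roughly like this (queue name, retry counts, and connection details are illustrative):

```typescript
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };

// The API just enqueues a job and returns immediately...
export const crawlQueue = new Queue("crawl", { connection });

export async function enqueueCrawl(url: string) {
  await crawlQueue.add(
    "crawl-site",
    { url },
    {
      attempts: 3,                                   // retry on failure
      backoff: { type: "exponential", delay: 5000 }, // ...with backoff
    },
  );
}

// ...while the worker process does the heavy lifting.
new Worker(
  "crawl",
  async (job) => {
    await job.updateProgress(0);   // progress is tracked per job
    // crawl -> chunk -> embed happens here
    await job.updateProgress(100);
  },
  { connection },
);
```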
Challenges (And What Actually Fixed Them)
Token limits → smart context truncation
Embedding costs → caching, deduplication, chunk sizing
Crawler blocks → throttling, UA rotation, robots compliance
Search quality → hybrid search + reranking + better chunking
Each fix added a few percentage points. Together, they made the system usable.
Performance Numbers
RAG pipeline: usually around 500–800ms
Vector search: 20–50ms most of the time
Embedding generation: 100–200ms per chunk
Crawler speed:
Static: 50–80 pages/min
CSR: 10–20 pages/min
What I Learned
Monorepos are worth it
pgvector is massively underrated
LLM reranking works surprisingly well
Playwright solves problems nothing else can
Chunking quality matters more than model choice
Caching is not optional
Conclusion
RAG is simple in theory and messy in practice.
Next: I might add multi-model support, or a way to attach custom metadata to file uploads, which could help retrieval.
Tech Stack Summary
Frontend: Next.js 15, React 19, Tailwind CSS 4, Better Auth
Backend: Express.js, Prisma, PostgreSQL + pgvector
Worker: BullMQ, Redis, Playwright
AI: OpenAI (GPT‑4, text‑embedding‑3‑small)
Infra: Turborepo, Bun, Docker
Signing Off.
