
How I Built Horizon: A Production‑Ready RAG Chatbot Platform

Feb 6, 2026

TL;DR: I built a full‑stack RAG chatbot platform using Next.js 15, Express, and PostgreSQL. It features semantic chunking, hybrid search (vector + keyword), LLM reranking, and a Playwright‑based web crawler, all inside a Turborepo monorepo. This post is a deep technical breakdown of how it actually works in production.

The Problem That Started It All

Horizon started with a recurring problem my team (YBM) kept hitting: finding information inside massive websites and documentation sets is painful. As developers, we’ve all wasted hours digging through 200‑page docs, broken search boxes, or half‑maintained wikis just to answer one specific question.

So we built Horizon, a no‑nonsense chatbot that actually understands your content using Retrieval‑Augmented Generation (RAG).

What Makes Horizon Interesting (Technically)

  • Upload documents or let the crawler handle everything

  • Proper semantic chunking (no naive fixed‑size splits)

  • Hybrid search: vector similarity + keyword matching

  • LLM‑powered reranking for precision

  • Headless browser crawling for client‑side rendered sites

  • Production‑ready monorepo with clean separation of concerns

Architecture: The Monorepo Setup (Using Turborepo)

This is a Turborepo monorepo, because managing multiple repos for a single product is unnecessary pain, and I wanted to get my hands dirty with Turborepo anyway :)

horizon/
├── apps/
│   ├── frontend/     # Next.js 15 (React 19, Tailwind CSS 4)
│   ├── backend/      # Express.js API
│   └── worker/       # Background job processor
├── packages/
│   ├── database/     # Prisma + pgvector
│   ├── rag/          # Core RAG engine
│   ├── queue/        # BullMQ job queue
│   ├── storage/      # S3-compatible storage
│   └── shared/       # Types and utilities

Each package does one thing well. The @repo/rag package is fully standalone - it could be dropped into another project without dragging the rest of the system with it.

This structure makes testing, debugging, and refactoring dramatically easier.

Tech Stack (And Why)

  • Next.js 15 + React 19: Server components, modern routing, and strong performance

  • Express.js: Boring, stable, predictable (that’s a compliment xD)

  • PostgreSQL + pgvector: Fast vector search without running a separate vector DB

  • Prisma: Type‑safe queries; hard to go back once you've used it (Drizzle fans, please stay away)

  • BullMQ + Redis: Background jobs for crawling and document processing

  • OpenAI: Embeddings (text-embedding-3-small) and chat models

RAG Implementation: Where Quality Is Won or Lost

Most RAG tutorials show a simplified approach: split text into chunks, embed, and search. That works for demos but falls apart in production.

The Problem With Naive Chunking

If you split every 500 characters, you end up breaking meaning:

  • Sentences get cut in half

  • Context disappears

  • Retrieval quality drops

When someone asks a question, the model gets incomplete or misleading chunks.

Semantic Chunking (The Fix)

Instead of fixed‑size splitting, Horizon uses structure‑aware chunking:

  • Detect document structure (Markdown headings, numbered sections, ALL‑CAPS titles)

  • Build hierarchical section paths

  • Preserve semantic boundaries

  • Add overlap for continuity

  • Prefix each chunk with contextual breadcrumbs

Example context prefix:

[Documentation > Authentication > OAuth 2.0]

This context is embedded with the chunk, so semantic meaning survives vectorization.
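
Here's a minimal TypeScript sketch of the idea (hypothetical names, not the actual @repo/rag code; the real version also handles numbered sections, ALL‑CAPS titles, and overlap):

// Split on Markdown headings, track the section path, and prefix
// every chunk with its breadcrumbs before embedding.
interface Chunk {
  text: string;          // breadcrumb-prefixed content that gets embedded
  sectionPath: string[]; // e.g. ["Documentation", "Authentication", "OAuth 2.0"]
}

export function chunkMarkdown(doc: string): Chunk[] {
  const chunks: Chunk[] = [];
  const path: string[] = [];
  let buffer: string[] = [];

  const flush = () => {
    const body = buffer.join("\n").trim();
    const crumbs = path.filter(Boolean).join(" > ");
    if (body) chunks.push({ text: `[${crumbs}]\n${body}`, sectionPath: path.filter(Boolean) });
    buffer = [];
  };

  for (const line of doc.split("\n")) {
    const heading = /^(#{1,6})\s+(.+)/.exec(line);
    if (heading) {
      flush();                             // a heading closes the previous section
      const depth = heading[1].length;
      path.length = depth - 1;             // drop deeper levels
      path[depth - 1] = heading[2].trim(); // set the current level
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}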

Embeddings: Turning Text Into Vectors

Each chunk is converted into a 1536‑dimensional vector using OpenAI embeddings (text-embedding-3-small outputs 1536 numbers per input).

Optimizations that matter in production:

  • Redis caching to avoid duplicate embedding calls

  • Batching to reduce API overhead

  • Retry logic with exponential backoff

Embeddings are stored directly in PostgreSQL using pgvector.
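
A sketch of how those three optimizations fit together (assuming an OpenAI client and an ioredis instance; the cache key scheme is illustrative):

import { createHash } from "node:crypto";
import OpenAI from "openai";
import Redis from "ioredis";

const openai = new OpenAI();
const redis = new Redis();

const cacheKey = (text: string) =>
  "emb:" + createHash("sha256").update(text).digest("hex");

export async function embedAll(texts: string[]): Promise<number[][]> {
  // 1. Cache lookup: reuse any embedding we've already paid for.
  const out: (number[] | null)[] = await Promise.all(
    texts.map(async (t) => JSON.parse((await redis.get(cacheKey(t))) ?? "null"))
  );
  const missing = texts.map((t, i) => ({ t, i })).filter(({ i }) => !out[i]);

  // 2. One batched API call for the misses, with exponential-backoff retries.
  for (let attempt = 0; missing.length > 0 && attempt < 3; attempt++) {
    try {
      const res = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: missing.map((m) => m.t),
      });
      res.data.forEach((d, j) => {
        out[missing[j].i] = d.embedding;
        redis.set(cacheKey(missing[j].t), JSON.stringify(d.embedding), "EX", 86_400);
      });
      break;
    } catch (err) {
      if (attempt === 2) throw err;
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000)); // 1s, 2s
    }
  }
  return out as number[][];
}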

Vector Search With pgvector

Instead of a separate vector database, Horizon runs cosine similarity search inside Postgres:

  • Uses HNSW indexing (Hierarchical Navigable Small World, if you must know the full form; it just makes nearest‑neighbour search fast)

  • ~20–50ms query latency on ~100k chunks

  • Simple operational model

For most products under ~10M vectors, this is more than enough.
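
The query itself is plain SQL through Prisma. A sketch, assuming a chunks table with an embedding vector(1536) column and an HNSW index built with vector_cosine_ops:

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Index assumed: CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
// <=> is pgvector's cosine-distance operator; ORDER BY on it hits that index.
export async function vectorSearch(queryEmbedding: number[], limit = 20) {
  const vec = `[${queryEmbedding.join(",")}]`; // pgvector's text literal format
  return prisma.$queryRaw<{ id: string; content: string; score: number }[]>`
    SELECT id, content,
           1 - (embedding <=> ${vec}::vector) AS score
    FROM chunks
    ORDER BY embedding <=> ${vec}::vector
    LIMIT ${limit}
  `;
}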

Hybrid Search: Vector + Keyword

Pure vector search struggles with exact terms like:

  • API_KEY

  • PostgreSQL 15.3

  • Error codes

Horizon runs two searches in parallel:

  • Semantic vector search

  • Keyword‑based text search

Results are merged using Reciprocal Rank Fusion (RRF), boosting chunks that score well in both.

This dramatically improves recall without hurting precision.
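
RRF itself is only a few lines: each list contributes 1/(k + rank) per result, so a chunk that ranks well in both searches outscores one that tops only a single list. A sketch (k = 60 is the usual constant):

export function rrfMerge(vectorIds: string[], keywordIds: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ids of [vectorIds, keywordIds]) {
    ids.forEach((id, rank) => {
      // rank is 0-based, so the top result contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}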

LLM Reranking: The Final Precision Pass

Even hybrid search isn’t perfect. The top results still need ordering.

Horizon sends the top candidates to a small LLM (gpt-5-mini) and asks it to rerank them by relevance.

This final step improved answer accuracy by ~15–20% in testing.

Small model, big impact.
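
The prompt shape is simple: number the candidates, ask for an ordering, parse it back. A sketch (the exact prompt wording here is illustrative, not Horizon's actual prompt):

import OpenAI from "openai";

const openai = new OpenAI();

export async function rerank(query: string, chunks: { id: string; content: string }[]) {
  const numbered = chunks
    .map((c, i) => `[${i}] ${c.content.slice(0, 500)}`) // truncate to keep the prompt small
    .join("\n\n");
  const res = await openai.chat.completions.create({
    model: "gpt-5-mini",
    messages: [{
      role: "user",
      content:
        `Rank these passages by relevance to the question.\n` +
        `Question: ${query}\n\nPassages:\n${numbered}\n\n` +
        `Reply with the passage indices only, best first, comma-separated.`,
    }],
  });
  // Parse "3, 0, 7, ..." back into chunk order; drop any invalid indices.
  const order = (res.choices[0].message.content ?? "").match(/\d+/g)?.map(Number) ?? [];
  return order.map((i) => chunks[i]).filter(Boolean);
}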

Query Rewriting: Fixing User Input

Users ask vague questions (can't blame them; we do the same), like:

“how does it work?”

Horizon rewrites queries using conversation context:

  • Before search

  • Using a lightweight LLM

  • Producing clearer, intent‑focused queries

Better input → better retrieval.
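
A sketch of the rewrite step (prompt wording illustrative; it just needs the conversation history and the raw question):

import OpenAI from "openai";

const openai = new OpenAI();

export async function rewriteQuery(query: string, history: string[]): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-5-mini",
    messages: [{
      role: "user",
      content:
        `Conversation so far:\n${history.join("\n")}\n\n` +
        `Rewrite this follow-up as a standalone, search-friendly query: "${query}"\n` +
        `Reply with the query only.`,
    }],
  });
  // e.g. "how does it work?" -> "how does OAuth 2.0 token refresh work?"
  return res.choices[0].message.content?.trim() || query;
}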

The Complete RAG Pipeline

  1. User query

  2. Query rewriting

  3. Vector + keyword search (parallel)

  4. Reciprocal rank fusion

  5. LLM reranking

  6. Context‑limited chunk selection

  7. Final answer generation

Total latency: usually ~650ms, but it sometimes spikes to ~2.3s when Redis decides to have a moment. Still debugging that.

But worth it for the quality jump.
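
As a shape, the whole pipeline is one short function. Here's a sketch with the dependencies injected so it stands alone; every name is hypothetical:

interface Deps {
  rewrite: (q: string, history: string[]) => Promise<string>;
  vectorSearch: (q: string) => Promise<string[]>;   // returns chunk ids
  keywordSearch: (q: string) => Promise<string[]>;
  rrfMerge: (a: string[], b: string[]) => string[];
  rerank: (q: string, ids: string[]) => Promise<string[]>;
  selectWithinBudget: (ids: string[]) => Promise<string>; // token-limited context
  generate: (q: string, context: string) => Promise<string>;
}

export async function answer(query: string, history: string[], d: Deps) {
  const q = await d.rewrite(query, history);            // step 2
  const [vec, kw] = await Promise.all([                 // step 3, in parallel
    d.vectorSearch(q),
    d.keywordSearch(q),
  ]);
  const fused = d.rrfMerge(vec, kw);                    // step 4
  const ranked = await d.rerank(q, fused.slice(0, 20)); // step 5
  const context = await d.selectWithinBudget(ranked);   // step 6
  return d.generate(q, context);                        // step 7
}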

Web Crawling: When Docs Don’t Exist

Most teams don’t have clean documentation files - but they do have websites.

So we decided to build an integrated website crawler into Horizon. The first version was a simple crawler built on the cheerio library from npm: fetch the HTML, parse it. That worked just fine initially, but the moment we pointed it at a client's site that was client‑side rendered (CSR), it pretty much failed.

After some debugging the cause was clear: CSR sites take time to populate/hydrate the HTML, and our crawler wasn't waiting for that to happen, so a basic fetch‑and‑parse crawler was never going to work for all types of sites.

Solution: there aren't many options other than simulating a browser, so I brought in the popular npm library Playwright. It drives a headless browser: we load the page, wait for it to finish rendering, then scrape the fully hydrated HTML. That solved it.
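
A sketch of that crawl step (the waiting strategy is the key part; Playwright's built‑in "networkidle" load state waits for the network to go quiet, roughly the "wait a few seconds" described above):

import { chromium } from "playwright";

export async function fetchRenderedHtml(url: string): Promise<string> {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering finishes.
    await page.goto(url, { waitUntil: "networkidle", timeout: 30_000 });
    return await page.content(); // fully hydrated HTML, ready for parsing
  } finally {
    await browser.close();
  }
}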

Background Processing With Workers

Crawling and document processing are handled asynchronously using BullMQ:

  • API stays fast and mostly unaffected

  • Jobs retry automatically on failure

  • Progress is tracked

  • Failures are debuggable 

We tried doing this synchronously at first. That was a mistake, so we shifted to good old microservices and moved the heavy lifting into the worker app.
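
A minimal sketch of the split, using BullMQ's documented Queue/Worker API (queue names and connection details are illustrative):

import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };

// API side: enqueue and return immediately, keeping the request path fast.
export const crawlQueue = new Queue("crawl", { connection });

export function enqueueCrawl(url: string) {
  return crawlQueue.add("crawl-site", { url }, {
    attempts: 3,                                   // retry on failure
    backoff: { type: "exponential", delay: 5000 }, // 5s, 10s, 20s
  });
}

// Worker process: picks jobs up from Redis, off the request path.
new Worker("crawl", async (job) => {
  await job.updateProgress(10); // progress is visible to the API
  // ... fetch, chunk, embed, store ...
  await job.updateProgress(100);
}, { connection });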

Challenges (And What Actually Fixed Them)

Token limits → smart context truncation

Embedding costs → caching, deduplication, chunk sizing

Crawler blocks → throttling, UA rotation, robots compliance

Search quality → hybrid search + reranking + better chunking

Each fix added a few percentage points. Together, they made the system usable.

Performance Numbers

  • RAG pipeline: usually around 500–800ms end to end

  • Vector search: 20–50ms most of the time

  • Embedding generation: 100–200ms per chunk

  • Crawler speed:

    • Static: 50–80 pages/min

    • CSR: 10–20 pages/min

What I Learned

  • Monorepos are worth it

  • pgvector is massively underrated

  • LLM reranking works surprisingly well

  • Playwright solves problems nothing else can

  • Chunking quality matters more than model choice

  • Caching is not optional

Conclusion

RAG is simple in theory and messy in practice.

Next: I might add multi‑model support, or a way to attach metadata to file uploads, which might help retrieval.

Tech Stack Summary

  • Frontend: Next.js 15, React 19, Tailwind CSS 4, Better Auth

  • Backend: Express.js, Prisma, PostgreSQL + pgvector

  • Worker: BullMQ, Redis, Playwright

  • AI: OpenAI (GPT‑4, text‑embedding‑3‑small)

  • Infra: Turborepo, Bun, Docker

Signing Off.