---
name: vector-index-health-check
description: Probe a vector index for near-duplicate embeddings, orphaned chunks, and stale source docs — the three silent killers of RAG quality.
title: Vector Index Health Check
category: search-retrieval
difficulty: advanced
author: admin
icon: 🩺
input: structured-data
output: markdown
phase: post
domain: data
tags: vector-database,rag-quality,embeddings,duplicate-detection,metadata-audit,pinecone,qdrant,weaviate,faiss,data-validation,index-maintenance,cosine-similarity
best_for:
  - RAG pipeline health monitoring
  - Vector database maintenance and hygiene
  - Retrieval quality diagnostics
  - Content deduplication at scale
---

## Description

Runs against a Pinecone, Qdrant, Weaviate, or FAISS index and reports three classes of problem: near-duplicates (chunks with cosine similarity > 0.98 that differ only by boilerplate), orphans (chunks whose source doc no longer exists), and stale chunks (last_modified older than a configurable threshold). Each problem class includes a suggested remediation.
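
The near-duplicate test can be sketched in plain Python. This is a minimal illustration, not the implementation behind this check: it assumes embeddings are available as lists of floats, and the names `cosine_similarity`, `normalize_text`, `is_near_duplicate`, and the `BOILERPLATE_PATTERNS` list are all hypothetical. Only the 0.98 threshold comes from the description above.

```python
import math
import re

SIMILARITY_THRESHOLD = 0.98  # threshold quoted in the description

# Hypothetical boilerplate patterns; a real corpus would need its own list.
BOILERPLATE_PATTERNS = [
    re.compile(r"copyright \d{4}.*", re.IGNORECASE),
    re.compile(r"last updated:.*", re.IGNORECASE),
]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalize_text(text):
    """Strip boilerplate lines, then collapse all runs of whitespace."""
    for pattern in BOILERPLATE_PATTERNS:
        text = pattern.sub("", text)
    return " ".join(text.split())


def is_near_duplicate(vec_a, text_a, vec_b, text_b, threshold=SIMILARITY_THRESHOLD):
    """True when similarity exceeds the threshold and the texts match after normalization."""
    if cosine_similarity(vec_a, vec_b) <= threshold:
        return False
    return normalize_text(text_a) == normalize_text(text_b)
```

Checking the text as well as the vectors keeps the scan from flagging chunks that are semantically close but say different things.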

## Why it works

RAG quality degrades silently over time as duplicates accumulate (the same FAQ reindexed from three sources), as source docs get deleted but chunks don't, and as content becomes stale without being refreshed. Users see 'worse answers' and don't know why. A periodic audit surfaces the cause instead of waiting for complaints.

## How it works

1. Pull metadata for every vector: source URL, last_modified, chunk_id.
2. Near-dup scan: for each chunk, get its top-5 nearest neighbors; flag any pair with similarity above the threshold where the text differs only in whitespace and boilerplate.
3. Orphan scan: resolve each source URL (HEAD request) and flag 404s or redirects to different content.
4. Stale scan: flag chunks whose source last_modified is older than the threshold.
5. Emit a markdown report with counts, example entries, and suggested remediations (dedup, delete, refresh).
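
The orphan scan, stale scan, and report steps could look roughly like the sketch below. It rests on several assumptions: chunk metadata arrives as dicts with `chunk_id`, `source_url`, and a timezone-aware ISO 8601 `last_modified` string; `fetch_status` is a hypothetical caller-supplied callable that performs the HEAD request and returns the status code; and the report layout is illustrative. Detecting redirects to different content is left out, since it requires comparing page bodies.

```python
from datetime import datetime, timedelta, timezone


def find_orphans(chunks, fetch_status):
    """Return chunk_ids whose source URL resolves to 404.

    fetch_status is a hypothetical callable: url -> HTTP status code.
    Each distinct URL is resolved once, since many chunks share a source.
    """
    status_by_url = {}
    orphans = []
    for chunk in chunks:
        url = chunk["source_url"]
        if url not in status_by_url:
            status_by_url[url] = fetch_status(url)
        if status_by_url[url] == 404:
            orphans.append(chunk["chunk_id"])
    return orphans


def find_stale(chunks, max_age, now=None):
    """Return chunk_ids whose last_modified is older than max_age (a timedelta)."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for chunk in chunks:
        # Assumes a timezone-aware ISO 8601 string, e.g. "2024-05-01T00:00:00+00:00".
        modified = datetime.fromisoformat(chunk["last_modified"])
        if now - modified > max_age:
            stale.append(chunk["chunk_id"])
    return stale


def render_report(orphans, stale, examples=3):
    """Build a markdown report with a count, a few examples, and a remediation per class."""
    lines = ["# Vector Index Health Report", ""]
    for title, ids, fix in [
        ("Orphaned chunks", orphans, "delete"),
        ("Stale chunks", stale, "refresh"),
    ]:
        lines.append(f"## {title}: {len(ids)}")
        lines.extend(f"- `{cid}`" for cid in ids[:examples])
        lines.append("")
        lines.append(f"Suggested remediation: {fix}")
        lines.append("")
    return "\n".join(lines)
```

Passing `fetch_status` in as a parameter keeps the scan testable without network access; in production it might wrap a HEAD call from whichever HTTP client the pipeline already uses.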
