Ai Docs Vector Db Hybrid Scraper

2 MIT

FreeCommunity

AI Systems

🚀 Hybrid AI documentation scraping system combining Crawl4AI (bulk) + Firecrawl MCP (on-demand) with Qdrant vector database for Claude Desktop/Code integration. Ultra-fast, cost-effective documentation search for developers.

What is Ai Docs Vector Db Hybrid Scraper

ai-docs-vector-db-hybrid-scraper is an advanced hybrid AI documentation scraping system that integrates Crawl4AI for bulk scraping and Firecrawl MCP for on-demand scraping, utilizing the Qdrant vector database for efficient documentation search and retrieval.

Use cases

Use cases include building knowledge bases for software libraries, enhancing developer tools with intelligent search capabilities, and automating documentation scraping for research and development purposes.

How to use

To use ai-docs-vector-db-hybrid-scraper, set up the environment by installing the required dependencies, configure the scraping parameters, and execute the scraping commands to build a vector knowledge base for documentation search.

Key features

Key features include ultra-fast embedding generation, significant storage cost reduction, improved retrieval accuracy through hybrid search, and lower API costs, making it a cost-effective solution for developers.

Where to use

ai-docs-vector-db-hybrid-scraper is suitable for use in software development, documentation management, and any field requiring efficient and accurate information retrieval from various documentation sources.

Clients Supporting MCP

The following are the main client software that supports the Model Context Protocol. Click the link to visit the official website for more information.

Claude Desktop: Official desktop application from Anthropic, natively supports MCP protocol. claude.ai

Cherry Studio: Cross-platform desktop client supporting multiple LLM providers, built-in MCP server support. cherry-ai.com

LobeChat: Modern open-source ChatGPT/LLMs UI, supports MCP protocol integration. lobehub.com

DeepChat: Cross-platform desktop AI assistant, compatible with MCP protocol, focusing on privacy and efficiency. deepchat.thinkinai.xyz

5ire: Cross-platform open-source desktop intelligent assistant MCP client, supports local knowledge base and MCP server. 5ire.app

View More MCP Clients

Overview

What is Ai Docs Vector Db Hybrid Scraper

Use cases

How to use

Key features

Where to use

Clients Supporting MCP

The following are the main client software that supports the Model Context Protocol. Click the link to visit the official website for more information.

Claude Desktop: Official desktop application from Anthropic, natively supports MCP protocol. claude.ai

Cherry Studio: Cross-platform desktop client supporting multiple LLM providers, built-in MCP server support. cherry-ai.com

LobeChat: Modern open-source ChatGPT/LLMs UI, supports MCP protocol integration. lobehub.com

DeepChat: Cross-platform desktop AI assistant, compatible with MCP protocol, focusing on privacy and efficiency. deepchat.thinkinai.xyz

5ire: Cross-platform open-source desktop intelligent assistant MCP client, supports local knowledge base and MCP server. 5ire.app

View More MCP Clients

Content

Intelligent Vector RAG Knowledge Base with Multi-Tier Web Crawling

A production-grade vector RAG system implementing research-backed best practices for intelligent
document processing, multi-tier web crawling, and hybrid search with reranking. Built with modern
Python architecture and comprehensive testing.

System Overview
Technical Architecture
Performance Benchmarks
Installation & Setup
Configuration
Usage Examples
API Reference
Testing & Quality Assurance
Development Guidelines
Deployment
Troubleshooting
Documentation
Contributing
How to Cite
License

System Overview

This system implements a sophisticated vector-based Retrieval-Augmented Generation (RAG) pipeline with intelligent web
crawling capabilities. The architecture combines multiple crawling tiers, advanced embedding techniques, and hybrid search
strategies to achieve superior performance compared to existing solutions.

Core Features

Multi-Tier Browser Automation: Five-tier routing system (httpx → Crawl4AI → Enhanced → browser-use → Playwright)
Enhanced Database Connection Pool: ML-based predictive scaling with 50.9% latency reduction and 887.9% throughput increase
Advanced Configuration Management: Interactive wizard, templates, backup/restore, and migration system
Advanced Filtering Architecture: Temporal, content type, metadata, and similarity filtering with boolean logic
Federated Search: Cross-collection search with intelligent ranking and result fusion
Personalized Ranking: User-based ranking with preference learning and collaborative filtering
Query Processing Pipeline: 14-category intent classification with Matryoshka embeddings
ML Security: Minimalistic security implementation with input validation, dependency scanning, and monitoring
Result Clustering: HDBSCAN-based organization with cluster summaries
Hybrid Vector Search: Dense + sparse embeddings with reciprocal rank fusion
Query Enhancement: HyDE (Hypothetical Document Embeddings) implementation
Advanced Reranking: Cross-encoder reranking with BGE-reranker-v2-m3
Memory-Adaptive Processing: Dynamic concurrency control based on system resources
Vector Quantization: Storage optimization with minimal accuracy loss
Collection Aliases: Zero-downtime deployments with blue-green switching
MCP Protocol Integration: Unified server for Claude Desktop/Code integration
Comprehensive Caching: DragonflyDB + in-memory LRU with intelligent warming

Technology Stack

Component	Technology	Version
Web Crawling	Crawl4AI	0.4.0+
Browser Automation	Playwright + browser-use	Latest
Vector Database	Qdrant	1.12+
Cache Layer	DragonflyDB	Latest
Database Connection Pool	SQLAlchemy Async + ML Monitoring	Latest
Machine Learning	scikit-learn + psutil	Latest
Embeddings	OpenAI + FastEmbed	Latest
Reranking	BGE-reranker-v2-m3	1.0+
Web Framework	FastAPI	0.115.0+
Configuration	Pydantic	2.0+
Package Manager	uv	Latest
Task Queue	ARQ	Latest

Technical Architecture

Multi-Tier Crawling System

The system implements a five-tier browser automation hierarchy with intelligent routing:

flowchart TB
    subgraph "Tier 1: Lightweight HTTP"
        A1[httpx] --> A2[Basic HTML parsing]
    end

    subgraph "Tier 2: Enhanced Crawling"
        B1[Crawl4AI] --> B2[JavaScript execution]
        B1 --> B3[Memory-adaptive concurrency]
    end

    subgraph "Tier 3: Advanced Routing"
        C1[Enhanced Router] --> C2[Dynamic tier selection]
        C1 --> C3[Failure recovery]
    end

    subgraph "Tier 4: AI Browser Control"
        D1[browser-use] --> D2[LLM-guided interaction]
        D1 --> D3[Multi-model support]
    end

    subgraph "Tier 5: Full Browser"
        E1[Playwright] --> E2[Complete JS rendering]
        E1 --> E3[Complex interactions]
    end

    A1 --> B1
    B1 --> C1
    C1 --> D1
    D1 --> E1

Vector Processing Pipeline

flowchart LR
    subgraph "Input Processing"
        A[Raw Documents] --> B[AST-Aware Chunking]
        B --> C[Metadata Extraction]
    end

    subgraph "Embedding Generation"
        C --> D[Dense Embeddings<br/>text-embedding-3-small]
        C --> E[Sparse Embeddings<br/>SPLADE++]
    end

    subgraph "Storage & Indexing"
        D --> F[Qdrant Vector DB]
        E --> F
        F --> G[Payload Indexing]
        F --> H[Vector Quantization]
    end

    subgraph "Search & Retrieval"
        I[Query] --> J[HyDE Enhancement]
        J --> K[Hybrid Search]
        K --> L[BGE Reranking]
        L --> M[Results]
    end

    F --> K

Enhanced Database Connection Pool Architecture

flowchart TB
    subgraph "Predictive Load Monitor"
        A1[System Metrics<br/>CPU, Memory, Connections] --> A2[ML Model<br/>RandomForestRegressor]
        A2 --> A3[Load Prediction<br/>High/Medium/Low]
    end

    subgraph "Adaptive Configuration"
        A3 --> B1[Dynamic Scaling<br/>Pool Size: 5-50]
        B1 --> B2[Timeout Adjustment<br/>30s-300s]
        B2 --> B3[Monitoring Interval<br/>10s-60s]
    end

    subgraph "Multi-Level Circuit Breaker"
        C1[CONNECTION Failures] --> C2[Circuit Breaker<br/>Per Failure Type]
        C3[TIMEOUT Failures] --> C2
        C4[QUERY Failures] --> C2
        C5[TRANSACTION Failures] --> C2
        C6[RESOURCE Failures] --> C2
    end

    subgraph "Connection Affinity Manager"
        D1[Query Pattern Analysis] --> D2[Connection Optimization<br/>READ/WRITE/TRANSACTION]
        D2 --> D3[Performance Tracking<br/>Per Connection]
        D3 --> D4[Optimal Routing]
    end

    B3 --> C2
    C2 --> D4
    D4 --> E1[Enhanced AsyncConnectionManager<br/>50.9% Latency ↓ | 887.9% Throughput ↑]

Performance Benchmarks

Crawling Performance vs. Alternatives

Metric	This System	Firecrawl	Beautiful Soup	Improvement
Average Latency	0.4s	2.5s	1.8s	6.25x faster
Success Rate	97%	92%	85%	5.4% better
Memory Usage	120MB	200MB	150MB	40% less
JS Rendering	✅	✅	❌	Feature parity
Cost	$0	$0.005/page	$0	Zero cost

Embedding Model Performance Comparison

Model	MTEB Score	Cost (per 1M tokens)	Dimensions	Use Case
text-embedding-3-small	62.3	$0.02	1536	Recommended
text-embedding-3-large	64.6	$0.13	3072	High accuracy
text-embedding-ada-002	61.0	$0.10	1536	Legacy compatibility
BGE-M3 (local)	64.1	Free	1024	Local deployment

Search Strategy Performance

Strategy	Accuracy	P95 Latency	Storage Overhead	Complexity
Dense Only	Baseline	45ms	1x	Low
Sparse Only	-15%	40ms	1.5x	Low
Hybrid + Reranking	+30%	65ms	1.2x	Optimal

Enhanced Database Connection Pool Performance

Metric	Baseline System	Enhanced System	Improvement
P95 Latency	820ms	402ms	50.9% reduction
P50 Latency	450ms	198ms	56.0% reduction
P99 Latency	1200ms	612ms	49.0% reduction
Throughput	85 ops/sec	839 ops/sec	887.9% increase
Connection Utilization	65%	92%	41.5% improvement
Failure Recovery Time	12s	3.2s	73.3% faster
Memory Usage	180MB	165MB	8.3% reduction

Key Features Delivered

🤖 ML-Based Predictive Scaling: RandomForestRegressor model for load prediction
🔧 Multi-Level Circuit Breaker: Failure categorization with intelligent recovery
🎯 Connection Affinity: Query pattern optimization for optimal routing
📊 Adaptive Configuration: Dynamic pool sizing based on system metrics
✅ Comprehensive Testing: 43% coverage with 56 passing tests

System Performance Metrics

Production Benchmarks (1000-document corpus):
┌─────────────────────────┬──────────────┬──────────────┬─────────────┐
│ Operation               │ P50 Latency  │ P95 Latency  │ Throughput  │
├─────────────────────────┼──────────────┼──────────────┼─────────────┤
│ Document Indexing       │ 0.5s         │ 1.1s         │ 28 docs/sec │
│ Database Operations     │ 198ms        │ 402ms        │ 839 ops/sec │
│ Vector Search (dense)   │ 15ms         │ 45ms         │ 250 qps     │
│ Hybrid Search + Rerank  │ 35ms         │ 85ms         │ 120 qps     │
│ Cache Hit              │ 0.8ms        │ 2.1ms        │ 5000 qps    │
│ Memory Usage           │ 415MB        │ 645MB        │ -           │
└─────────────────────────┴──────────────┴──────────────┴─────────────┘

Installation & Setup

Prerequisites

Python 3.13+ (recommended for optimal performance)
Docker Desktop with WSL2 integration (Windows) or Docker Engine (Linux/macOS)
OpenAI API key
4GB+ RAM (8GB+ recommended for production)

Quick Installation

# Clone repository
git clone https://github.com/BjornMelin/ai-docs-vector-db-hybrid-scraper.git
cd ai-docs-vector-db-hybrid-scraper

# Automated setup with dependency validation
chmod +x setup.sh
./setup.sh

# Verify installation
uv run python -c "import src; print('Installation successful')"

Environment Configuration

# Create .env file with required API keys
cat > .env << EOF
# Required
OPENAI_API_KEY="sk-..."

# Optional - For enhanced browser automation
ANTHROPIC_API_KEY="sk-ant-..."
GEMINI_API_KEY="..."
BROWSER_USE_LLM_PROVIDER="openai"
BROWSER_USE_MODEL="gpt-4o-mini"

# Optional - For premium crawling features
FIRECRAWL_API_KEY="fc-..."

# System Configuration
QDRANT_URL="http://localhost:6333"
DRAGONFLY_URL="redis://localhost:6379"
EOF

Service Initialization

# Start vector database and cache
./scripts/start-services.sh

# Verify services
curl -s http://localhost:6333/health | jq '.status'  # Should return "ok"
redis-cli -p 6379 ping  # Should return "PONG"

# Start background task worker
./scripts/start-worker.sh

Configuration

Interactive Configuration Setup

Get started quickly with the configuration wizard:

# Launch interactive setup wizard
uv run python -m src.cli.main config wizard

# Create configuration from template
uv run python -m src.cli.main config template apply production -o config.json

# Backup current configuration
uv run python -m src.cli.main config backup create config.json --description "Production backup"

# Validate configuration
uv run python -m src.cli.main config validate config.json --health-check

Advanced System Configuration

from src.config import get_config
from src.config.models import EmbeddingConfig, VectorSearchStrategy
from src.config.wizard import ConfigurationWizard

# Interactive configuration setup
wizard = ConfigurationWizard()
config_path = wizard.run_setup_wizard()

# Get unified configuration with validation
config = get_config()

# Advanced embedding configuration
embedding_config = EmbeddingConfig(
    provider="HYBRID",
    dense_model="text-embedding-3-small",
    sparse_model="SPLADE_PP_EN_V1",
    search_strategy=VectorSearchStrategy.HYBRID_RRF,
    enable_quantization=True,
    enable_reranking=True,
    reranker_model="BAAI/bge-reranker-v2-m3",
    batch_size=32,
    max_tokens_per_chunk=512
)

Configuration Templates

Five optimized templates are available for different deployment scenarios:

# Development environment
uv run python -m src.cli.main config template apply development

# Production with security hardening
uv run python -m src.cli.main config template apply production

# High-performance for maximum throughput
uv run python -m src.cli.main config template apply high_performance

# Memory-optimized for resource-constrained environments
uv run python -m src.cli.main config template apply memory_optimized

# Distributed multi-node deployment
uv run python -m src.cli.main config template apply distributed

Crawling Configuration

from src.config.models import Crawl4AIConfig

# Memory-adaptive crawler configuration
crawler_config = Crawl4AIConfig(
    enable_memory_adaptive_dispatcher=True,
    memory_threshold_percent=75.0,
    max_session_permit=20,
    enable_streaming=True,
    rate_limit_base_delay_min=0.5,
    rate_limit_max_retries=3,
    bypass_cache=False,
    word_count_threshold=50
)

Usage Examples

Basic Document Processing

from src.services import EmbeddingManager, QdrantService
from src.config import get_config

config = get_config()

async def process_documents():
    async with EmbeddingManager(config) as embeddings:
        async with QdrantService(config) as qdrant:
            # Create collection with hybrid search support
            await qdrant.create_collection(
                "knowledge_base",
                vector_size=1536,
                sparse_vector_name="sparse"
            )

            # Process documents with chunking
            texts = ["Document content...", "More content..."]
            dense_vectors, sparse_vectors = await embeddings.generate_embeddings(
                texts,
                generate_sparse=True
            )

            # Store with metadata
            await qdrant.upsert_documents(
                collection_name="knowledge_base",
                documents=texts,
                dense_vectors=dense_vectors,
                sparse_vectors=sparse_vectors,
                metadata=[{"source": "doc1"}, {"source": "doc2"}]
            )

Advanced Search with Reranking

from src.services.embeddings.reranker import BGEReranker

async def hybrid_search_with_reranking():
    async with QdrantService(config) as qdrant:
        # Perform hybrid search
        results = await qdrant.hybrid_search(
            collection_name="knowledge_base",
            query_text="vector database optimization",
            dense_weight=0.7,
            sparse_weight=0.3,
            limit=20
        )

        # Rerank results for improved relevance
        reranker = BGEReranker()
        reranked_results = await reranker.rerank(
            query="vector database optimization",
            results=results,
            top_k=5
        )

        return reranked_results

Multi-Tier Web Crawling

from src.services.browser import UnifiedBrowserManager

async def crawl_with_intelligent_routing():
    async with UnifiedBrowserManager(config) as browser:
        # Automatic tier selection based on page complexity
        result = await browser.scrape_url(
            "https://docs.complex-site.com",
            tier_preference="auto",  # Let system choose optimal tier
            enable_javascript=True,
            wait_for_content=True
        )

        # Process with enhanced chunking
        from src.chunking import enhanced_chunk_text
        chunks = enhanced_chunk_text(
            result.content,
            chunk_size=1600,
            preserve_code_blocks=True,
            enable_ast_chunking=True
        )

        return chunks

Enhanced Database Connection Pool Usage

from src.infrastructure.database import AsyncConnectionManager
from src.infrastructure.database.adaptive_config import AdaptiveConfigManager
from src.infrastructure.database.connection_affinity import ConnectionAffinityManager

async def use_enhanced_database():
    # Initialize with ML-based predictive scaling
    async with AsyncConnectionManager() as conn_mgr:
        # Adaptive configuration automatically adjusts pool size
        adaptive_config = AdaptiveConfigManager(
            strategy="AUTO_SCALING",
            initial_pool_size=10,
            max_pool_size=50,
            min_pool_size=5
        )

        # Connection affinity optimizes query routing
        affinity_mgr = ConnectionAffinityManager()

        # Execute optimized query
        async with conn_mgr.get_connection() as conn:
            # System automatically routes to optimal connection
            result = await conn.execute("SELECT * FROM documents WHERE id = ?", [doc_id])

            # Performance metrics tracked automatically
            stats = await adaptive_config.get_performance_metrics()
            print(f"P95 Latency: {stats['p95_latency_ms']}ms")
            print(f"Throughput: {stats['ops_per_second']} ops/sec")

Database Connection Pool Configuration

from src.config.models import DatabaseConfig

# Production-optimized configuration
db_config = DatabaseConfig(
    enable_adaptive_config=True,
    enable_connection_affinity=True,
    enable_circuit_breaker=True,
    adaptive_config={
        "strategy": "AUTO_SCALING",
        "monitoring_interval_seconds": 30,
        "load_prediction_window_minutes": 5,
        "scaling_factor": 1.5
    },
    circuit_breaker_config={
        "failure_threshold": 5,
        "recovery_timeout_seconds": 30,
        "half_open_max_calls": 3
    }
)

API Reference

Core Services

EmbeddingManager

class EmbeddingManager:
    """Manages embedding generation with multiple providers."""

    async def generate_embeddings(
        self,
        texts: List[str],
        generate_sparse: bool = False,
        quality_tier: str = "BALANCED"
    ) -> Tuple[List[List[float]], Optional[List[SparseVector]]]:
        """Generate dense and optionally sparse embeddings."""

    async def get_provider_stats(self) -> Dict[str, Any]:
        """Get embedding provider statistics and costs."""

QdrantService

class QdrantService:
    """Qdrant vector database operations with hybrid search."""

    async def hybrid_search(
        self,
        collection_name: str,
        query_text: str,
        dense_weight: float = 0.7,
        sparse_weight: float = 0.3,
        limit: int = 10,
        filter_conditions: Optional[Dict] = None
    ) -> List[SearchResult]:
        """Perform hybrid search with RRF fusion."""

    async def create_collection_with_quantization(
        self,
        name: str,
        vector_size: int,
        quantization_type: str = "binary"
    ) -> bool:
        """Create optimized collection with quantization."""

MCP Server Tools

The system provides 25+ MCP tools for integration with Claude Desktop:

# Available via MCP protocol
tools = [
    "search_documents",          # Hybrid search with reranking
    "add_document",             # Single document ingestion
    "add_documents_batch",      # Batch processing
    "create_project",           # Project management
    "get_server_stats",         # Performance monitoring
    "lightweight_scrape",       # Multi-tier web crawling
    # ... and 20+ more
]

Testing & Quality Assurance

Test Coverage

The system maintains comprehensive test coverage across all modules:

Test Coverage Report:
┌─────────────────────┬───────────┬─────────────┬─────────────┐
│ Module Category     │ Tests     │ Coverage    │ Status      │
├─────────────────────┼───────────┼─────────────┼─────────────┤
│ Configuration       │ 380+      │ 94-100%     │ ✅ Complete  │
│ API Contracts       │ 67        │ 100%        │ ✅ Complete  │
│ Document Processing │ 33        │ 95%         │ ✅ Complete  │
│ Vector Search       │ 51        │ 92%         │ ✅ Complete  │
│ Security            │ 33        │ 98%         │ ✅ Complete  │
│ MCP Tools           │ 136+      │ 90%+        │ ✅ Complete  │
│ Infrastructure      │ 87        │ 80%+        │ ✅ Complete  │
│ Browser Services    │ 120+      │ 85%+        │ ✅ Complete  │
│ Cache Services      │ 90+       │ 88%+        │ ✅ Complete  │
│ Total               │ 1000+     │ 90%+        │ ✅ Production │
└─────────────────────┴───────────┴─────────────┴─────────────┘

Running Tests

# Full test suite with coverage
uv run pytest --cov=src --cov-report=html

# Specific test categories
uv run pytest tests/unit/config/           # Configuration tests
uv run pytest tests/unit/services/         # Service layer tests
uv run pytest tests/integration/           # Integration tests

# Performance benchmarks
uv run pytest tests/benchmarks/            # Performance tests

Code Quality

# Linting and formatting
ruff check . --fix && ruff format .

# Type checking
mypy src/

# Security scanning
bandit -r src/

Development Guidelines

Architecture Principles

Service-Oriented Architecture: Clean separation of concerns with dependency injection
Async-First Design: Full async/await support for optimal performance
Configuration-Driven: Centralized Pydantic-based configuration with validation
Error Handling: Comprehensive error types with automatic retry logic
Observability: Built-in metrics, logging, and health checks

Contributing Workflow

# Development setup
git checkout -b feature/enhancement-name
uv sync --dev

# Pre-commit validation
ruff check . --fix && ruff format .
uv run pytest --cov=src -x
mypy src/

# Commit with conventional commits
git commit -m "feat: add enhancement description"

Deployment

Production Configuration

# docker-compose.prod.yml
version: "3.8"
services:
  qdrant:
    image: qdrant/qdrant:v1.12.0
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: "4"
    environment:
      - QDRANT__STORAGE__QUANTIZATION__ALWAYS_RAM=true
      - QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS=8

  dragonfly:
    image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.23.0
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2"
    command: >
      --logtostderr
      --cache_mode
      --maxmemory_policy=allkeys-lru
      --compression=zstd

Monitoring & Health Checks

# System health validation
./scripts/health-check.sh

# Performance monitoring
./scripts/performance-benchmark.sh

# Service metrics
curl -s http://localhost:8000/health | jq

Troubleshooting

Common Issues

High Memory Usage

# Enable quantization to reduce memory by 83%
export ENABLE_QUANTIZATION=true

# Reduce batch size for embedding generation
export EMBEDDING_BATCH_SIZE=16

Slow Search Performance

# Enable payload indexing for filtered queries
export ENABLE_PAYLOAD_INDEXING=true

# Use DragonflyDB for faster caching
export CACHE_PROVIDER=dragonfly

Connection Issues

# Verify service health
docker-compose ps
curl http://localhost:6333/health
redis-cli -p 6379 ping

# Restart services
docker-compose restart

Performance Optimization

For detailed optimization guidelines, see our consolidated documentation:

Documentation

Our documentation is organized by user role for efficient navigation and focused guidance:

📚 For End Users

Quick Start Guide - Get up and running in minutes
Search & Retrieval - Complete search functionality guide
Web Scraping - Multi-tier browser automation guide
Examples & Recipes - Practical usage examples
Troubleshooting - Common questions and solutions

👩‍💻 For Developers

API Reference - Complete API documentation (REST, Browser, MCP)
Integration Guide - SDK, Docker, and framework integration
Architecture Guide - System design and component details
Configuration Reference - Complete configuration documentation
Getting Started - Development setup and workflow
Contributing Guide - Contribution guidelines and standards

🚀 For Operators

Deployment Guide - Production deployment and optimization
Monitoring & Observability - Comprehensive monitoring setup
Operations Manual - Day-to-day operational procedures
Configuration Management - Environment and configuration management
Security Guide - Security implementation and best practices

📋 Additional Resources

Backup Documentation - Previous documentation structure (archived)
Project Analysis - Comprehensive feature analysis report

🎯 Quick Navigation by Task

What you want to do	Go to
Set up the system	Quick Start Guide
Integrate with your app	Integration Guide
Deploy to production	Deployment Guide
Monitor performance	Monitoring Guide
Understand the API	API Reference
Configure the system	Configuration Reference
Secure your deployment	Security Guide
Troubleshoot issues	Operations Manual

Contributing

We welcome contributions to improve the system’s capabilities and performance. Please see
CONTRIBUTING.md for detailed guidelines on:

Development setup and workflow
Code style and testing requirements
Performance benchmarking procedures
Documentation standards

How to Cite

If you use this system in your research or production environment, please cite:

@software{intelligent_vector_rag_2024,
  title={Intelligent Vector RAG Knowledge Base with Multi-Tier Web Crawling},
  author={Melin, Bjorn and Contributors},
  year={2024},
  url={https://github.com/BjornMelin/ai-docs-vector-db-hybrid-scraper},
  version={1.0},
  note={A production-grade vector RAG system with hybrid search and intelligent web crawling}
}

Research Foundations

This implementation builds upon established research in:

Hybrid Search: Dense-sparse vector fusion with reciprocal rank fusion
[Chen et al., 2024]
Vector Quantization: Binary and scalar quantization techniques
[Malkov & Yashunin, 2018]
Cross-Encoder Reranking: BGE reranker architecture [Xiao et al., 2023]
Memory-Adaptive Processing: Dynamic concurrency control for optimal resource utilization
HyDE Query Enhancement: Hypothetical document embedding generation
[Gao et al., 2022]

License

This project is licensed under the MIT License - see the LICENSE file for details.

Built for the AI developer community with research-backed best practices and production-grade reliability.

Dev Tools Supporting MCP

The following are the main code editors that support the Model Context Protocol. Click the link to visit the official website for more information.

Zed: High-performance collaborative code editor, supports MCP protocol, providing a smooth programming experience. zed.dev

Cursor: AI code editor built on VS Code, supports MCP protocol for context-aware programming. cursor.com

Windsurf: AI code editor from Codeium, integrates MCP protocol to provide intelligent code assistance. windsurf.com

Continue: Open-source AI programming assistant plugin, supports VS Code and JetBrains, compatible with MCP protocol. continue.dev

Trae: AI-driven code editor, supports MCP protocol, focusing on enhancing developer programming experience. trae.ai

View More MCP Dev Tools

Tools

No tools

Comments

Recommend MCP Servers

Tavily MCP Server The Tavily MCP server provides: search, extract, map, crawl tools Real-time web search capabilities through the tavily-search tool Intelligent data extraction from web pages via the tavily-extract tool Powerful web mapping tool that creates a structured map of website Web crawler that systematically explores websites.

MCP Server Chart This is a TypeScript-based MCP server that provides chart generation capabilities. It allows you to create various types of charts through MCP tools. You can also use it in Dify.

GitHub MCP Server MCP Server for the GitHub API, enabling file operations, repository management, search functionality, and more.

Brave Search MCP Server Web and local search using Brave's Search API

Firecrawl MCP Server Advanced web scraping with JavaScript rendering, PDF support, and smart rate limiting

Context7 MCP LLMs rely on outdated or generic information about the libraries you use. You get:

Slack MCP server Channel management and messaging capabilities

Sequential Thinking MCP Server Dynamic and reflective problem-solving through thought sequences

Fetch MCP Server A Model Context Protocol server that provides web content fetching capabilities.

Playwright MCP A Model Context Protocol (MCP) server that provides browser automation capabilities using [Playwright](https://playwright.dev). This server enables LLMs to interact with web pages through structured accessibility snapshots, bypassing the need for screenshots or visually-tuned models.

View All MCP Servers

Ai Docs Vector Db Hybrid Scraper

What is Ai Docs Vector Db Hybrid Scraper

Use cases

How to use

Key features

Where to use

Clients Supporting MCP

Overview

What is Ai Docs Vector Db Hybrid Scraper

Use cases

How to use

Key features

Where to use

Clients Supporting MCP

Content

Intelligent Vector RAG Knowledge Base with Multi-Tier Web Crawling

Table of Contents

System Overview

Technical Architecture

Performance Benchmarks

Key Features Delivered

Installation & Setup

Configuration

Usage Examples

API Reference

EmbeddingManager

QdrantService

Testing & Quality Assurance

Development Guidelines

Deployment

Troubleshooting

High Memory Usage

Slow Search Performance

Connection Issues

Documentation

Contributing

How to Cite

License

Dev Tools Supporting MCP

Tools

Comments

Recommend MCP Servers