Nova AI 3c9cee28d7 v1.0.3: Fix hard-coded paths, address security scan feedback
- Replace all absolute paths with dynamic resolution
- Add path portability and network behavior documentation
- Verify no custom network calls in codebase
- Update version to 1.0.3
2026-02-12 16:59:33 +00:00

OpenClaw RAG Knowledge System

Full-featured Retrieval-Augmented Generation (RAG) system for OpenClaw - search across chat history, code, documentation, and skills with semantic understanding.

Features

  • Semantic Search: Find relevant context by meaning, not just keywords
  • Multi-Source Indexing: Sessions, workspace files, skill documentation
  • Local Vector Store: ChromaDB with built-in embeddings (no API keys required)
  • Automatic Integration: AI automatically consults knowledge base when responding
  • Type Filtering: Search by document type (session, workspace, skill, memory)
  • Management Tools: Add/remove documents, view statistics, reset collection

Quick Start

Installation

# Install Python dependency
cd ~/.openclaw/workspace/rag
python3 -m pip install --user chromadb

No API keys required - This system is fully local:

  • Embeddings: all-MiniLM-L6-v2 (downloaded once, 79MB)
  • Vector store: ChromaDB (persistent disk storage)
  • Data location: ~/.openclaw/data/rag/ (auto-created)

All operations run offline; the only network access is the initial `pip install` of ChromaDB and the one-time embedding model download.
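
The data directory under `~/.openclaw/data/rag/` is resolved relative to the home directory rather than hard-coded. A minimal sketch of that resolution (the actual code in rag_system.py may differ in detail):

```python
from pathlib import Path

# Resolve ~/.openclaw/data/rag portably instead of hard-coding /home/<user>/...
DATA_DIR = Path.home() / ".openclaw" / "data" / "rag"

# ChromaDB persists its collection here; create it on first run
DATA_DIR.mkdir(parents=True, exist_ok=True)
print(DATA_DIR)
```

`Path.home()` works on any user account, which is what makes the skill portable between machines.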

Index Your Data

# Index all chat sessions
python3 ingest_sessions.py

# Index workspace code and docs
python3 ingest_docs.py workspace

# Index skill documentation
python3 ingest_docs.py skills

Search the Knowledge Base

# Interactive search mode
python3 rag_query.py -i

# Quick search
python3 rag_query.py "how to send SMS"

# Search by type
python3 rag_query.py "voip.ms" --type session
python3 rag_query.py "Porkbun DNS" --type skill

Integration in Python Code

import sys
from pathlib import Path
sys.path.insert(0, str(Path.home() / ".openclaw" / "workspace" / "rag"))
from rag_query_wrapper import search_knowledge

# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} results")

# Format for AI consumption
from rag_query_wrapper import format_for_ai
context = format_for_ai(results)
print(context)

Architecture

rag/
├── rag_system.py          # Core RAG class (ChromaDB wrapper)
├── ingest_sessions.py     # Load chat history from sessions
├── ingest_docs.py         # Load workspace files & skill docs
├── rag_query.py           # Search the knowledge base
├── rag_manage.py          # Document management
├── rag_query_wrapper.py   # Simple Python API
└── SKILL.md               # OpenClaw skill documentation

Data storage: ~/.openclaw/data/rag/ (ChromaDB persistent storage)

Usage Examples

Find Past Solutions

When you encounter a problem, search for similar past issues:

python3 rag_query.py "cloudflare bypass failed selenium"
python3 rag_query.py "voip.ms SMS client"
python3 rag_query.py "porkbun DNS API"

Search Through Codebase

Find code and documentation across your entire workspace:

python3 rag_query.py --type workspace "chromedriver setup"
python3 rag_query.py --type workspace "unifi gateway API"

Access Skill Documentation

Quick reference for any openclaw skill:

python3 rag_query.py --type skill "how to check UniFi"
python3 rag_query.py --type skill "Porkbun DNS management"

Manage Knowledge Base

# View statistics
python3 rag_manage.py stats

# Delete all sessions
python3 rag_manage.py delete --by-type session

# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"

How It Works

Document Ingestion

  1. Session transcripts: Process chat history from ~/.openclaw/agents/main/sessions/*.jsonl

    • Handles OpenClaw event format (session metadata, messages, tool calls)
    • Chunks messages into groups of 20 with overlap
    • Extracts and formats thinking, tool calls, and results
  2. Workspace files: Scans workspace for code, docs, configs

    • Supports: .py, .js, .ts, .md, .json, .yaml, .sh, .html, .css
    • Skips files > 1MB and binary files
    • Chunking for long documents
  3. Skills: Indexes all SKILL.md files

    • Captures skill documentation and usage examples
    • Organized by skill name
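
The overlapping chunking described in step 1 (groups of 20 messages, with overlap so context spanning a boundary is kept in both chunks) can be sketched as follows; the function name and the overlap of 5 are illustrative, not the actual ingest_sessions.py implementation:

```python
def chunk_messages(messages, chunk_size=20, overlap=5):
    """Split a message list into overlapping chunks so that context
    crossing a chunk boundary appears in both neighboring chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(messages), step):
        chunks.append(messages[start:start + chunk_size])
        if start + chunk_size >= len(messages):
            break
    return chunks

chunks = chunk_messages(list(range(50)), chunk_size=20, overlap=5)
print(len(chunks))  # 3 chunks: messages 0-19, 15-34, 30-49
```

The last 5 messages of each chunk repeat as the first 5 of the next, so a question-and-answer pair split across a boundary is still retrievable as one unit.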

ChromaDB uses all-MiniLM-L6-v2 embedding model (79MB) to convert text to vector representations. Similar meanings cluster together, enabling semantic search beyond keyword matching.
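
"Cluster together" here means nearby in vector space; closeness between two embeddings is typically measured with cosine similarity. A toy illustration with 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions; the vectors below are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of related / unrelated texts
send_sms = [0.9, 0.1, 0.0]
text_message = [0.8, 0.2, 0.1]
dns_record = [0.0, 0.1, 0.9]

print(cosine_similarity(send_sms, text_message))  # high: related meanings
print(cosine_similarity(send_sms, dns_record))    # low: unrelated meanings
```

This is why a query like "how to send SMS" can retrieve a chunk that never contains the word "SMS" but does discuss text messaging.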

Automatic RAG Integration

When the AI responds to a question that could benefit from context, it automatically:

  1. Searches the knowledge base
  2. Retrieves relevant past conversations, code, or docs
  3. Includes that context in the response

This happens transparently - the AI just "knows" about your past work.
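
The three steps above amount to a retrieve-then-answer loop. A hedged sketch with a stubbed search function (the real flow calls `search_knowledge` from rag_query_wrapper; only the `count` key is documented above, so the `results` list shape and the prompt format here are assumptions):

```python
def stub_search(query):
    """Stand-in for search_knowledge(); returns an assumed result shape."""
    return {
        "count": 1,
        "results": [
            {"source": "session:2026-02-11",
             "text": "Used voip.ms getSMS for inbound messages."},
        ],
    }

def build_prompt(question, search=stub_search):
    """Steps 1-3: search the knowledge base, retrieve hits,
    and prepend them as context to the question."""
    hits = search(question)
    context = "\n".join(f"[{r['source']}] {r['text']}" for r in hits["results"])
    return f"Context from knowledge base:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How did we receive SMS?")
print(prompt)
```

In practice `format_for_ai` plays the role of the context-joining step, producing text the AI can consume directly.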

Configuration

Custom Session Directory

python3 ingest_sessions.py --sessions-dir /path/to/sessions

Chunk Size Control

python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10

Custom Collection Name

from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")

Data Types

Type        Source            Description
session     session:{key}     Chat history transcripts
workspace   relative/path     Code, configs, docs
skill       skill:{name}      Skill documentation
memory      MEMORY.md         Long-term memory entries
manual      {custom}          Manually added docs
api         api-docs:{name}   API documentation

Performance

  • Embedding model: all-MiniLM-L6-v2 (79MB, cached locally)
  • Storage: ~100MB per 1,000 documents
  • Indexing time: ~1,000 docs/min
  • Search time: <100ms (after first query loads embeddings)

Troubleshooting

No Results Found

  • Check if anything is indexed: python3 rag_manage.py stats
  • Try broader queries or different wording
  • Try without filters: remove --type if using it

The first search after ingestion loads embeddings (~1-2 seconds). Subsequent searches are much faster.

Memory Issues

Reset collection if needed:

python3 rag_manage.py reset

Duplicate ID Errors

If you see "Expected IDs to be unique" errors:

  1. Reset the collection
  2. Re-run ingestion
  3. Current versions include chunk_index in ID generation, which prevents the collision
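
Including chunk_index in the document ID is what keeps IDs unique when one source file yields many chunks. A sketch of such an ID scheme (the exact format used by rag_system.py may differ):

```python
import hashlib

def make_doc_id(source, chunk_index):
    """Derive a stable, unique ID per (source, chunk) pair.
    Without chunk_index, every chunk of a file would share one ID."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]
    return f"{digest}-{chunk_index}"

ids = [make_doc_id("scripts/voipms_sms_client.py", i) for i in range(3)]
print(ids)  # same source prefix, distinct chunk suffixes
```

Re-ingesting the same file then overwrites the same IDs instead of raising "Expected IDs to be unique".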

ChromaDB Download Stuck

On first run, ChromaDB downloads the embedding model (~79MB). This takes 1-2 minutes. Let it complete.

Automatic Updates

Setup Scheduled Indexing

The RAG system includes an automatic update script that runs daily:

# Manual test
bash ~/.openclaw/workspace/scripts/rag-auto-update.sh

What it does:

  • Detects new/updated chat sessions and re-indexes them
  • Re-indexes workspace files (captures code changes)
  • Updates skill documentation
  • Maintains state to avoid re-processing unchanged files
  • Runs via cron at 4:00 AM UTC daily

Configuration:

# View cron job
openclaw cron list

# Edit schedule (if needed)
openclaw cron update <job-id> --schedule "{\"expr\":\"0 4 * * *\"}"

State tracking: ~/.openclaw/workspace/memory/rag-auto-state.json
Log file: ~/.openclaw/workspace/memory/rag-auto-update.log
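
The state file is what lets the updater skip unchanged files. One common approach, sketched here with modification times (the real rag-auto-update.sh may track state differently):

```python
import os
import tempfile

def needs_reindex(path, state):
    """True if the file is new or its mtime changed since the last index."""
    return state.get(path) != os.path.getmtime(path)

def record(path, state):
    """Remember the file's current mtime after indexing it."""
    state[path] = os.path.getmtime(path)

# Demo with a temporary file standing in for a workspace file
state = {}
fd, demo = tempfile.mkstemp(suffix=".py")
os.close(fd)

first = needs_reindex(demo, state)   # True: never seen before
record(demo, state)
second = needs_reindex(demo, state)  # False: unchanged since record
print(first, second)
os.unlink(demo)
```

Persisting `state` as JSON between runs gives exactly the rag-auto-state.json behavior described above.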

Best Practices

Automatic Update Enabled

The RAG system now automatically updates daily - no manual re-indexing needed.

After significant work, you can still manually update:

bash ~/.openclaw/workspace/scripts/rag-auto-update.sh

Use Specific Queries

Better results with focused queries:

# Good
python3 rag_query.py "voip.ms getSMS API method"

# Less specific
python3 rag_query.py "API"

Filter by Type

When you know the data type:

# Looking for code
python3 rag_query.py --type workspace "chromedriver"

# Looking for past conversations
python3 rag_query.py --type session "SMS"

Document Decisions

After important decisions, add to knowledge base:

python3 rag_manage.py add \
  --text "Decision: Use Playwright not Selenium for Reddit automation. Reason: Better Cloudflare bypass handles. Date: 2026-02-11" \
  --source "decision:reddit-automation" \
  --type "decision"

Limitations

  • Files > 1MB are automatically skipped (performance)
  • First search is slower (embedding load)
  • Requires ~100MB disk space per 1,000 documents
  • Python 3.7+ required

License

MIT License - Free to use and modify

Contributing

Contributions welcome! Areas for improvement:

  • API documentation indexing from external URLs
  • File system watch for automatic re-indexing
  • Better chunking strategies for long documents
  • Integration with external vector stores (Pinecone, Weaviate)

Documentation Files

  • CHANGELOG.md - Version history and changes
  • SKILL.md - OpenClaw skill integration guide
  • package.json - Skill metadata (no credentials required)
  • LICENSE - MIT License

Author

Nova AI Assistant for William Mantly (Theta42)

Repository

https://git.theta42.com/nova/openclaw-rag-skill

Published on: clawhub.com

Description
RAG Knowledge System for OpenClaw - Semantic search across chat history, code, docs, and skills