OpenClaw RAG Knowledge System
Retrieval-Augmented Generation for OpenClaw – Search chat history, code, docs, and skills with semantic understanding
Overview
This skill provides a complete RAG (Retrieval-Augmented Generation) system for OpenClaw. It indexes your entire knowledge base – chat transcripts, workspace code, skill documentation – and enables semantic search across everything.
Key features:
- 🧠 Semantic search across all conversations and code
- 📚 Automatic knowledge base management
- 🔍 Find past solutions, code patterns, decisions instantly
- 💾 Local ChromaDB storage (no API keys required)
- 🚀 Automatic AI integration – retrieves context transparently
Installation
Prerequisites
- Python 3.7+
- OpenClaw workspace
Setup
# Navigate to your OpenClaw workspace
cd ~/.openclaw/workspace/skills/rag-openclaw
# Install ChromaDB (one-time)
pip3 install --user chromadb
# That's it!
Quick Start
1. Index Your Knowledge
# Index all chat history
python3 ingest_sessions.py
# Index workspace code and docs
python3 ingest_docs.py workspace
# Index skill documentation
python3 ingest_docs.py skills
2. Search the Knowledge Base
# Interactive search mode
python3 rag_query.py -i
# Quick search
python3 rag_query.py "how to send SMS via voip.ms"
# Search by type
python3 rag_query.py "porkbun DNS" --type skill
python3 rag_query.py "chromedriver" --type workspace
python3 rag_query.py "Reddit automation" --type session
3. Check Statistics
# See what's indexed
python3 rag_manage.py stats
Usage Examples
Finding Past Solutions
Hit a problem? Search for how you solved it before:
python3 rag_query.py "cloudflare bypass selenium"
python3 rag_query.py "voip.ms SMS configuration"
python3 rag_query.py "porkbun update DNS record"
Searching Through Codebase
Find specific code or documentation:
python3 rag_query.py --type workspace "unifi gateway API"
python3 rag_query.py --type workspace "SMS client"
Quick Reference
Access skill documentation without digging through files:
python3 rag_query.py --type skill "how to monitor UniFi"
python3 rag_query.py --type skill "Porkbun tool usage"
Programmatic Use
From within Python scripts or OpenClaw sessions:
import os
import sys
# Add the skill directory to the import path (same location as in Setup)
sys.path.insert(0, os.path.expanduser('~/.openclaw/workspace/skills/rag-openclaw'))
from rag_query_wrapper import search_knowledge, format_for_ai
# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} relevant items")
# Format for AI consumption
context = format_for_ai(results)
print(context)
Files Reference
| File | Purpose |
|---|---|
| rag_system.py | Core RAG class (ChromaDB wrapper) |
| ingest_sessions.py | Index chat history |
| ingest_docs.py | Index workspace files & skills |
| rag_query.py | Search interface (CLI & interactive) |
| rag_manage.py | Document management (stats, delete, reset) |
| rag_query_wrapper.py | Simple Python API for programmatic use |
| README.md | Full documentation |
How It Works
Indexing
Sessions:
- Reads ~/.openclaw/agents/main/sessions/*.jsonl
- Handles OpenClaw event format (session metadata, messages, tool calls)
- Chunks messages (20 per chunk, 5 message overlap)
- Extracts and formats thinking, tool calls, results
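The overlapping chunking described in the list above can be sketched as a sliding window. This is a minimal illustration, not the code in ingest_sessions.py: the function name `chunk_messages` and its parameters are hypothetical, and only the defaults (20 messages per chunk, 5 overlapping) come from this document.

```python
from typing import List, Sequence


def chunk_messages(messages: Sequence[str], chunk_size: int = 20, overlap: int = 5) -> List[List[str]]:
    """Split messages into overlapping windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # each new chunk starts this many messages later
    chunks = []
    for start in range(0, len(messages), step):
        chunks.append(list(messages[start:start + chunk_size]))
        if start + chunk_size >= len(messages):
            break  # this window already reached the last message
    return chunks


messages = [f"msg-{i}" for i in range(50)]
for chunk in chunk_messages(messages):
    print(chunk[0], "to", chunk[-1], f"({len(chunk)} messages)")
```

With 50 messages and the defaults, this yields three chunks starting at messages 0, 15, and 30, so the last 5 messages of each chunk reappear at the start of the next. The overlap keeps a question and its answer together more often when they straddle a chunk boundary.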
Workspace:
- Scans for .py, .js, .ts, .md, .json, .yaml, .sh, .html, .css
- Skips files > 1MB and binary files
- Chunks long documents for better retrieval
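The workspace filtering rules above could look roughly like the following. It is a sketch under stated assumptions: the names `iter_indexable_files` and `looks_binary` are hypothetical, and the NUL-byte check is just one common heuristic for detecting binary files, not necessarily what ingest_docs.py does.

```python
from pathlib import Path
from typing import Iterator

# Extensions and size limit taken from the list above.
INDEXED_EXTENSIONS = {".py", ".js", ".ts", ".md", ".json", ".yaml", ".sh", ".html", ".css"}
MAX_FILE_BYTES = 1024 * 1024  # 1MB


def looks_binary(path: Path, sample_size: int = 1024) -> bool:
    """Assumed heuristic: a NUL byte in the first bytes means binary."""
    with path.open("rb") as handle:
        return b"\x00" in handle.read(sample_size)


def iter_indexable_files(root: Path) -> Iterator[Path]:
    """Yield files under root that pass the extension, size, and binary checks."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() not in INDEXED_EXTENSIONS:
            continue
        if path.stat().st_size > MAX_FILE_BYTES:
            continue
        if looks_binary(path):
            continue
        yield path


if __name__ == "__main__":
    for candidate in iter_indexable_files(Path(".")):
        print(candidate)
```

The cheap checks (extension, size) run before the file is opened, so large or irrelevant files are never read.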
Skills:
- Indexes all SKILL.md files
- Organized by skill name for easy reference
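As a rough sketch of the skill discovery described above, the snippet below collects SKILL.md files and keys them by a `skill:{name}` source string. It assumes the skill name is the name of the directory containing SKILL.md, and `find_skill_docs` is a hypothetical helper, not part of the shipped scripts.

```python
from pathlib import Path
from typing import Dict


def find_skill_docs(skills_root: Path) -> Dict[str, Path]:
    """Map 'skill:{name}' to each SKILL.md found under skills_root.

    Assumption: the parent directory name is the skill name.
    """
    return {
        f"skill:{doc.parent.name}": doc
        for doc in sorted(skills_root.rglob("SKILL.md"))
    }


if __name__ == "__main__":
    # Path taken from the Setup section; adjust for your workspace.
    root = Path.home() / ".openclaw" / "workspace" / "skills"
    for source, doc in find_skill_docs(root).items():
        print(source, "->", doc)
```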
Search
ChromaDB uses all-MiniLM-L6-v2 embeddings to convert text into vectors. Texts with similar meanings end up close together in vector space, so a search matches on meaning rather than only on shared keywords.
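To make the idea of "close together in vector space" concrete, here is a small ranking sketch using cosine similarity. The three-dimensional vectors are hand-made toy values, not real model output, and the helper names are hypothetical. Cosine similarity is one common measure; the distance metric a ChromaDB collection actually uses depends on how it is configured.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float], docs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (document, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for embeddings of each text.
docs = {
    "send a text message": [0.9, 0.1, 0.0],
    "update a DNS record": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of "how to send SMS"
print(rank(query, docs)[0][0])
```

The query shares no keyword with "send a text message", yet that document ranks first because its vector points in nearly the same direction as the query vector.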
Automatic Integration
When the AI responds, it automatically:
- Searches the knowledge base for relevant context
- Retrieves past conversations, code, or docs
- Includes that context in the response
This happens transparently – the AI "remembers" your past work.
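The retrieve-then-respond flow can be pictured with the sketch below. It is illustrative only: `build_prompt`, `fake_retrieve`, and the prompt layout are assumptions made for this example and do not describe how OpenClaw wires the skill in internally.

```python
from typing import Callable, Dict, List

# A retriever takes (query, top_k) and returns hits with 'source' and 'text'.
Retriever = Callable[[str, int], List[Dict[str, str]]]


def build_prompt(question: str, retrieve: Retriever, top_k: int = 3) -> str:
    """Prepend retrieved context to the question (hypothetical layout)."""
    hits = retrieve(question, top_k)
    if not hits:
        return question  # nothing relevant found, pass the question through
    context_lines = [f"[{hit['source']}] {hit['text']}" for hit in hits]
    return "Relevant context:\n" + "\n".join(context_lines) + "\n\nQuestion: " + question


def fake_retrieve(query: str, top_k: int) -> List[Dict[str, str]]:
    """Stand-in for a real knowledge-base lookup."""
    store = [{"source": "session:example", "text": "Switched to Playwright for Cloudflare challenges."}]
    return store[:top_k]


print(build_prompt("How did we get past Cloudflare?", fake_retrieve))
```

In a real setup, the stand-in retriever would be replaced by a call into the knowledge base, for example through the `search_knowledge` function shown under Programmatic Use.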
Management
View Statistics
python3 rag_manage.py stats
Output:
📊 OpenClaw RAG Statistics
Collection: openclaw_knowledge
Total Documents: 635
By Source:
session-001: 23
my-script.py: 5
porkbun: 12
By Type:
session: 500
workspace: 100
skill: 35
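A breakdown like the one above can be computed by counting metadata fields. The sketch below assumes each indexed document carries `type` and `source` metadata keys (suggested by the CLI flags, but not confirmed here), and `summarize` is a hypothetical helper rather than the code in rag_manage.py.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(metadatas: Iterable[Dict[str, str]]) -> Dict[str, object]:
    """Count documents overall, by type, and by source."""
    by_type: Counter = Counter()
    by_source: Counter = Counter()
    total = 0
    for meta in metadatas:
        total += 1
        by_type[meta.get("type", "unknown")] += 1
        by_source[meta.get("source", "unknown")] += 1
    return {"total": total, "by_type": dict(by_type), "by_source": dict(by_source)}


# Made-up sample metadata for illustration.
sample = [
    {"type": "session", "source": "session-001"},
    {"type": "session", "source": "session-001"},
    {"type": "skill", "source": "porkbun"},
]
print(summarize(sample))
```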
Delete Documents
# Delete all sessions
python3 rag_manage.py delete --by-type session
# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"
# Reset entire collection
python3 rag_manage.py reset
Add Manual Document
python3 rag_manage.py add \
--text "API endpoint: https://api.example.com/endpoint" \
--source "api-docs:example.com" \
--type "manual"
Configuration
Custom Session Directory
python3 ingest_sessions.py --sessions-dir /path/to/sessions
Chunk Size Control
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
Custom Collection
from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")
Data Types
| Type | Source Format | Description |
|---|---|---|
| session | session:{key} | Chat history transcripts |
| workspace | relative/path/to/file | Code, configs, docs |
| skill | skill:{name} | Skill documentation |
| memory | MEMORY.md | Long-term memory entries |
| manual | {custom} | Manually added docs |
| api | api-docs:{name} | API documentation |
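The source formats in the table can be treated as simple string templates. The sketch below builds a metadata record for a few of the types. The `make_metadata` helper and the `type`/`source` key names are assumptions for illustration, not the skill's actual schema.

```python
from typing import Dict

# Templates copied from the Source Format column above.
SOURCE_FORMATS = {
    "session": "session:{key}",
    "workspace": "{path}",
    "skill": "skill:{name}",
    "api": "api-docs:{name}",
}


def make_metadata(doc_type: str, **fields: str) -> Dict[str, str]:
    """Build a {'type', 'source'} record for a known document type."""
    template = SOURCE_FORMATS.get(doc_type)
    if template is None:
        raise ValueError(f"no source format defined for type {doc_type!r}")
    return {"type": doc_type, "source": template.format(**fields)}


print(make_metadata("skill", name="porkbun"))
print(make_metadata("workspace", path="scripts/voipms_sms_client.py"))
```

Keeping the format in one place makes it easier to delete or filter by source later, since every document of a given type follows the same pattern.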
Performance
- Embedding model: all-MiniLM-L6-v2 (79MB, cached locally)
- Storage: ~100MB per 1,000 documents
- Indexing: ~1,000 documents/minute
- Search: <100ms (after first query)
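For planning purposes, the rough figures above can be turned into a back-of-the-envelope estimate. The helper below is hypothetical and simply scales the approximate numbers linearly, which is an assumption. Real usage will vary with document length and hardware.

```python
from typing import Dict

MB_PER_1000_DOCS = 100   # "~100MB per 1,000 documents"
DOCS_PER_MINUTE = 1000   # "~1,000 documents/minute"


def estimate_cost(documents: int) -> Dict[str, float]:
    """Linear estimate of disk usage and indexing time."""
    return {
        "storage_mb": documents * MB_PER_1000_DOCS / 1000,
        "indexing_seconds": round(documents / DOCS_PER_MINUTE * 60, 1),
    }


print(estimate_cost(635))
```

For a 635-document collection, this works out to about 63.5MB of disk and roughly 38 seconds of indexing.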
Troubleshooting
No Results Found
# Check what's indexed
python3 rag_manage.py stats
# Try broader query
python3 rag_query.py "SMS" # instead of "voip.ms SMS API endpoint"
Slow First Search
The first search loads the embedding model (~1-2 seconds). Subsequent searches typically complete in under 100ms.
Duplicate ID Errors
# Reset and re-index
python3 rag_manage.py reset
python3 ingest_sessions.py
python3 ingest_docs.py workspace
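Duplicate ID errors typically arise when two documents end up with the same ID. One way an ingest script could avoid them is to derive a stable ID from the source and chunk position and skip IDs that are already present. This is a sketch of that general technique, not the ID scheme this skill actually uses. `chunk_id` and `filter_new` are hypothetical names.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def chunk_id(source: str, chunk_index: int) -> str:
    """Stable ID: the same source and index always give the same ID."""
    key = f"{source}#{chunk_index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


def filter_new(chunks: Iterable[Tuple[str, int, str]], existing_ids: Set[str]) -> List[Tuple[str, str]]:
    """Return (id, text) pairs whose IDs are not already known, without repeats."""
    seen = set(existing_ids)
    fresh = []
    for source, index, text in chunks:
        cid = chunk_id(source, index)
        if cid in seen:
            continue  # already indexed, or repeated within this batch
        seen.add(cid)
        fresh.append((cid, text))
    return fresh


batch = [("session:demo", 0, "first chunk"), ("session:demo", 0, "repeat"), ("session:demo", 1, "second chunk")]
print(filter_new(batch, existing_ids=set()))
```

With IDs that depend only on source and position, re-running an ingest over unchanged data produces the same IDs, so already-indexed chunks can be skipped instead of colliding.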
ChromaDB Model Download
The first run downloads the embedding model (79MB), which takes 1-2 minutes. Let the download complete.
Best Practices
Re-index Regularly
After significant work:
python3 ingest_sessions.py # New conversations
python3 ingest_docs.py workspace # New code/changes
Use Specific Queries
# Better
python3 rag_query.py "voip.ms getSMS method"
# Too broad
python3 rag_query.py "SMS"
Filter by Type
# Looking for code
python3 rag_query.py --type workspace "chromedriver"
# Looking for past conversations
python3 rag_query.py --type session "Reddit"
Document Decisions
After important decisions, add them manually:
python3 rag_manage.py add \
--text "Decision: Use Playwright for Reddit automation. Reason: Cloudflare bypass handles" \
--source "decision:reddit-automation" \
--type "decision"
Limitations
- Files > 1MB automatically skipped (performance)
- Python 3.7+ required
- ~100MB disk per 1,000 documents
- First search slower (embedding load)
Integration with OpenClaw
This skill integrates seamlessly with OpenClaw:
- Automatic RAG: AI automatically retrieves relevant context when responding
- Session history: All conversations indexed and searchable
- Workspace awareness: Code and docs indexed for reference
- Skill accessible: Use from any OpenClaw session or script
Example Workflow
Scenario: You're working on a new automation but hit a Cloudflare challenge.
# Search for past Cloudflare solutions
python3 rag_query.py "Cloudflare bypass selenium"
# Result shows relevant past conversation:
# "Used undetected-chromedriver but failed. Switched to Playwright which handles challenges better."
# Now you know the solution before trying it!
Repository
https://git.theta42.com/nova/openclaw-rag-skill
Published: clawhub.com
Maintainer: Nova AI Assistant
For: William Mantly (Theta42)
License
MIT License - Free to use and modify