391 lines
9.8 KiB
Markdown
391 lines
9.8 KiB
Markdown
# OpenClaw RAG Knowledge System
|
|
|
|
Full-featured Retrieval-Augmented Generation (RAG) system for OpenClaw - search across chat history, code, documentation, and skills with semantic understanding.
|
|
|
|
## Features
|
|
|
|
- **Semantic Search**: Find relevant context by meaning, not just keywords
|
|
- **Multi-Source Indexing**: Sessions, workspace files, skill documentation
|
|
- **Local Vector Store**: ChromaDB with built-in embeddings (no API keys required)
|
|
- **Automatic Integration**: AI automatically consults knowledge base when responding
|
|
- **Type Filtering**: Search by document type (session, workspace, skill, memory)
|
|
- **Management Tools**: Add/remove documents, view statistics, reset collection
|
|
|
|
## Quick Start
|
|
|
|
### Installation
|
|
|
|
```bash
|
|
# Install Python dependency
|
|
cd ~/.openclaw/workspace/rag
|
|
python3 -m pip install --user chromadb
|
|
```
|
|
|
|
**No API keys required** - This system is fully local:
|
|
- Embeddings: all-MiniLM-L6-v2 (downloaded once, 79MB)
|
|
- Vector store: ChromaDB (persistent disk storage)
|
|
- Data location: `~/.openclaw/data/rag/` (auto-created)
|
|
|
|
All operations run offline with no external dependencies besides the initial ChromaDB download.
|
|
|
|
### Index Your Data
|
|
|
|
```bash
|
|
# Index all chat sessions
|
|
python3 ingest_sessions.py
|
|
|
|
# Index workspace code and docs
|
|
python3 ingest_docs.py workspace
|
|
|
|
# Index skill documentation
|
|
python3 ingest_docs.py skills
|
|
```
|
|
|
|
### Search the Knowledge Base
|
|
|
|
```bash
|
|
# Interactive search mode
|
|
python3 rag_query.py -i
|
|
|
|
# Quick search
|
|
python3 rag_query.py "how to send SMS"
|
|
|
|
# Search by type
|
|
python3 rag_query.py "voip.ms" --type session
|
|
python3 rag_query.py "Porkbun DNS" --type skill
|
|
```
|
|
|
|
### Integration in Python Code
|
|
|
|
```python
|
|
import sys
|
|
sys.path.insert(0, '/home/william/.openclaw/workspace/rag')
|
|
from rag_query_wrapper import search_knowledge
|
|
|
|
# Search and get structured results
|
|
results = search_knowledge("Reddit account automation")
|
|
print(f"Found {results['count']} results")
|
|
|
|
# Format for AI consumption
|
|
from rag_query_wrapper import format_for_ai
|
|
context = format_for_ai(results)
|
|
print(context)
|
|
```
|
|
|
|
## Architecture
|
|
|
|
```
|
|
rag/
|
|
├── rag_system.py # Core RAG class (ChromaDB wrapper)
|
|
├── ingest_sessions.py # Load chat history from sessions
|
|
├── ingest_docs.py # Load workspace files & skill docs
|
|
├── rag_query.py # Search the knowledge base
|
|
├── rag_manage.py # Document management
|
|
├── rag_query_wrapper.py # Simple Python API
|
|
└── SKILL.md # OpenClaw skill documentation
|
|
```
|
|
|
|
Data storage: `~/.openclaw/data/rag/` (ChromaDB persistent storage)
|
|
|
|
## Usage Examples
|
|
|
|
### Find Past Solutions
|
|
|
|
When you encounter a problem, search for similar past issues:
|
|
|
|
```bash
|
|
python3 rag_query.py "cloudflare bypass failed selenium"
|
|
python3 rag_query.py "voip.ms SMS client"
|
|
python3 rag_query.py "porkbun DNS API"
|
|
```
|
|
|
|
### Search Through Codebase
|
|
|
|
Find code and documentation across your entire workspace:
|
|
|
|
```bash
|
|
python3 rag_query.py --type workspace "chromedriver setup"
|
|
python3 rag_query.py --type workspace "unifi gateway API"
|
|
```
|
|
|
|
### Access Skill Documentation
|
|
|
|
Quick reference for any openclaw skill:
|
|
|
|
```bash
|
|
python3 rag_query.py --type skill "how to check UniFi"
|
|
python3 rag_query.py --type skill "Porkbun DNS management"
|
|
```
|
|
|
|
### Manage Knowledge Base
|
|
|
|
```bash
|
|
# View statistics
|
|
python3 rag_manage.py stats
|
|
|
|
# Delete all sessions
|
|
python3 rag_manage.py delete --by-type session
|
|
|
|
# Delete specific file
|
|
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"
|
|
```
|
|
|
|
## How It Works
|
|
|
|
### Document Ingestion
|
|
|
|
1. **Session transcripts**: Process chat history from `~/.openclaw/agents/main/sessions/*.jsonl`
|
|
- Handles OpenClaw event format (session metadata, messages, tool calls)
|
|
- Chunks messages into groups of 20 with overlap
|
|
- Extracts and formats thinking, tool calls, and results
|
|
|
|
2. **Workspace files**: Scans workspace for code, docs, configs
|
|
- Supports: `.py`, `.js`, `.ts`, `.md`, `.json`, `. yaml`, `.sh`, `.html`, `.css`
|
|
- Skips files > 1MB and binary files
|
|
- Chunking for long documents
|
|
|
|
3. **Skills**: Indexes all `SKILL.md` files
|
|
- Captures skill documentation and usage examples
|
|
- Organized by skill name
|
|
|
|
### Semantic Search
|
|
|
|
ChromaDB uses `all-MiniLM-L6-v2` embedding model (79MB) to convert text to vector representations. Similar meanings cluster together, enabling semantic search beyond keyword matching.
|
|
|
|
### Automatic RAG Integration
|
|
|
|
When the AI responds to a question that could benefit from context, it automatically:
|
|
1. Searches the knowledge base
|
|
2. Retrieves relevant past conversations, code, or docs
|
|
3. Includes that context in the response
|
|
|
|
This happens transparently - the AI just "knows" about your past work.
|
|
|
|
## Configuration
|
|
|
|
### Custom Session Directory
|
|
|
|
```bash
|
|
python3 ingest_sessions.py --sessions-dir /path/to/sessions
|
|
```
|
|
|
|
### Chunk Size Control
|
|
|
|
```bash
|
|
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
|
|
```
|
|
|
|
### Custom Collection Name
|
|
|
|
```python
|
|
from rag_system import RAGSystem
|
|
rag = RAGSystem(collection_name="my_knowledge")
|
|
```
|
|
|
|
## Data Types
|
|
|
|
| Type | Source | Description |
|
|
|------|--------|-------------|
|
|
| **session** | `session:{key}` | Chat history transcripts |
|
|
| **workspace** | `relative/path` | Code, configs, docs |
|
|
| **skill** | `skill:{name}` | Skill documentation |
|
|
| **memory** | `MEMORY.md` | Long-term memory entries |
|
|
| **manual** | `{custom}` | Manually added docs |
|
|
| **api** | `api-docs:{name}` | API documentation |
|
|
|
|
## Performance
|
|
|
|
- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
|
|
- **Storage**: ~100MB per 1,000 documents
|
|
- **Indexing time**: ~1,000 docs/min
|
|
- **Search time**: <100ms (after first query loads embeddings)
|
|
|
|
## Troubleshooting
|
|
|
|
### No Results Found
|
|
|
|
- Check if anything is indexed: `python3 rag_manage.py stats`
|
|
- Try broader queries or different wording
|
|
- Try without filters: remove `--type` if using it
|
|
|
|
### Slow First Search
|
|
|
|
The first search after ingestion loads embeddings (~1-2 seconds). Subsequent searches are much faster.
|
|
|
|
### Memory Issues
|
|
|
|
Reset collection if needed:
|
|
```bash
|
|
python3 rag_manage.py reset
|
|
```
|
|
|
|
### Duplicate ID Errors
|
|
|
|
If you see "Expected IDs to be unique" errors:
|
|
1. Reset the collection
|
|
2. Re-run ingestion
|
|
3. The fix includes `chunk_index` in ID generation
|
|
|
|
### ChromaDB Download Stuck
|
|
|
|
On first run, ChromaDB downloads the embedding model (~79MB). This takes 1-2 minutes. Let it complete.
|
|
|
|
## Automatic Updates
|
|
|
|
### Setup Scheduled Indexing
|
|
|
|
The RAG system includes an automatic update script that runs daily:
|
|
|
|
```bash
|
|
# Manual test
|
|
bash /home/william/.openclaw/workspace/scripts/rag-auto-update.sh
|
|
```
|
|
|
|
**What it does:**
|
|
- Detects new/updated chat sessions and re-indexes them
|
|
- Re-indexes workspace files (captures code changes)
|
|
- Updates skill documentation
|
|
- Maintains state to avoid re-processing unchanged files
|
|
- Runs via cron at 4:00 AM UTC daily
|
|
|
|
**Configuration:**
|
|
```bash
|
|
# View cron job
|
|
openclaw cron list
|
|
|
|
# Edit schedule (if needed)
|
|
openclaw cron update <job-id> --schedule "{\"expr\":\"0 4 * * *\"}"
|
|
```
|
|
|
|
**State tracking:** `~/.openclaw/workspace/memory/rag-auto-state.json`
|
|
**Log file:** `~/.openclaw/workspace/memory/rag-auto-update.log`
|
|
|
|
## Moltbook Integration
|
|
|
|
Share RAG updates and announcements with the Moltbook community.
|
|
|
|
### Quick Post
|
|
|
|
```bash
|
|
# Post from draft
|
|
python3 scripts/moltbook_post.py --file drafts/moltbook-post-rag-release.md
|
|
|
|
# Post directly
|
|
python3 scripts/moltbook_post.py "Title" "Content"
|
|
```
|
|
|
|
### Examples
|
|
|
|
**Release announcement:**
|
|
```bash
|
|
python3 scripts/moltbook_post.py --file drafts/moltbook-post-rag-release.md --submolt general
|
|
```
|
|
|
|
**Quick update:**
|
|
```bash
|
|
python3 scripts/moltbook_post.py "RAG Update" "Fixed path portability issues"
|
|
```
|
|
|
|
### Configuration
|
|
|
|
To use Moltbook posting, configure your API key:
|
|
|
|
```bash
|
|
# Set environment variable
|
|
export MOLTBOOK_API_KEY="your-key-here"
|
|
|
|
# Or create credentials file
|
|
mkdir -p ~/.config/moltbook
|
|
cat > ~/.config/moltbook/credentials.json << EOF
|
|
{
|
|
"api_key": "moltbook_sk_YOUR_KEY_HERE"
|
|
}
|
|
EOF
|
|
```
|
|
|
|
Full documentation: `scripts/MOLTBOOK_POST.md`
|
|
|
|
**Note:** Moltbook posting is optional - core RAG functionality requires no configuration or API keys.
|
|
|
|
### Rate Limits
|
|
|
|
- Posts: 1 per 30 minutes
|
|
- Comments: 1 per 20 seconds
|
|
|
|
### Best Practices
|
|
|
|
### Automatic Update Enabled
|
|
|
|
The RAG system now automatically updates daily - no manual re-indexing needed.
|
|
|
|
After significant work, you can still manually update:
|
|
```bash
|
|
bash /home/william/.openclaw/workspace/scripts/rag-auto-update.sh
|
|
```
|
|
|
|
### Use Specific Queries
|
|
|
|
Better results with focused queries:
|
|
```bash
|
|
# Good
|
|
python3 rag_query.py "voip.ms getSMS API method"
|
|
|
|
# Less specific
|
|
python3 rag_query.py "API"
|
|
```
|
|
|
|
### Filter by Type
|
|
|
|
When you know the data type:
|
|
```bash
|
|
# Looking for code
|
|
python3 rag_query.py --type workspace "chromedriver"
|
|
|
|
# Looking for past conversations
|
|
python3 rag_query.py --type session "SMS"
|
|
```
|
|
|
|
### Document Decisions
|
|
|
|
After important decisions, add to knowledge base:
|
|
```bash
|
|
python3 rag_manage.py add \
|
|
--text "Decision: Use Playwright not Selenium for Reddit automation. Reason: Better Cloudflare bypass handles. Date: 2026-02-11" \
|
|
--source "decision:reddit-automation" \
|
|
--type "decision"
|
|
```
|
|
|
|
## Limitations
|
|
|
|
- Files > 1MB are automatically skipped (performance)
|
|
- First search is slower (embedding load)
|
|
- Requires ~100MB disk space per 1,000 documents
|
|
- Python 3.7+ required
|
|
|
|
## License
|
|
|
|
MIT License - Free to use and modify
|
|
|
|
## Contributing
|
|
|
|
Contributions welcome! Areas for improvement:
|
|
- API documentation indexing from external URLs
|
|
- File system watch for automatic re-indexing
|
|
- Better chunking strategies for long documents
|
|
- Integration with external vector stores (Pinecone, Weaviate)
|
|
|
|
## Documentation Files
|
|
|
|
- **CHANGELOG.md** - Version history and changes
|
|
- **SKILL.md** - OpenClaw skill integration guide
|
|
- **package.json** - Skill metadata (no credentials required)
|
|
- **LICENSE** - MIT License
|
|
|
|
## Author
|
|
|
|
Nova AI Assistant for William Mantly (Theta42)
|
|
|
|
## Repository
|
|
|
|
https://openclaw-rag-skill.projects.theta42.com
|
|
Published on: clawhub.com |