Files
openclaw-rag-skill/README.md

391 lines
9.8 KiB
Markdown

# OpenClaw RAG Knowledge System
Full-featured Retrieval-Augmented Generation (RAG) system for OpenClaw - search across chat history, code, documentation, and skills with semantic understanding.
## Features
- **Semantic Search**: Find relevant context by meaning, not just keywords
- **Multi-Source Indexing**: Sessions, workspace files, skill documentation
- **Local Vector Store**: ChromaDB with built-in embeddings (no API keys required)
- **Automatic Integration**: AI automatically consults knowledge base when responding
- **Type Filtering**: Search by document type (session, workspace, skill, memory)
- **Management Tools**: Add/remove documents, view statistics, reset collection
## Quick Start
### Installation
```bash
# Install Python dependency
cd ~/.openclaw/workspace/rag
python3 -m pip install --user chromadb
```
**No API keys required** - This system is fully local:
- Embeddings: all-MiniLM-L6-v2 (downloaded once, 79MB)
- Vector store: ChromaDB (persistent disk storage)
- Data location: `~/.openclaw/data/rag/` (auto-created)
All operations run offline with no external dependencies besides the initial ChromaDB download.
### Index Your Data
```bash
# Index all chat sessions
python3 ingest_sessions.py
# Index workspace code and docs
python3 ingest_docs.py workspace
# Index skill documentation
python3 ingest_docs.py skills
```
### Search the Knowledge Base
```bash
# Interactive search mode
python3 rag_query.py -i
# Quick search
python3 rag_query.py "how to send SMS"
# Search by type
python3 rag_query.py "voip.ms" --type session
python3 rag_query.py "Porkbun DNS" --type skill
```
### Integration in Python Code
```python
import sys
sys.path.insert(0, '/home/william/.openclaw/workspace/rag')
from rag_query_wrapper import search_knowledge
# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} results")
# Format for AI consumption
from rag_query_wrapper import format_for_ai
context = format_for_ai(results)
print(context)
```
## Architecture
```
rag/
├── rag_system.py # Core RAG class (ChromaDB wrapper)
├── ingest_sessions.py # Load chat history from sessions
├── ingest_docs.py # Load workspace files & skill docs
├── rag_query.py # Search the knowledge base
├── rag_manage.py # Document management
├── rag_query_wrapper.py # Simple Python API
└── SKILL.md # OpenClaw skill documentation
```
Data storage: `~/.openclaw/data/rag/` (ChromaDB persistent storage)
## Usage Examples
### Find Past Solutions
When you encounter a problem, search for similar past issues:
```bash
python3 rag_query.py "cloudflare bypass failed selenium"
python3 rag_query.py "voip.ms SMS client"
python3 rag_query.py "porkbun DNS API"
```
### Search Through Codebase
Find code and documentation across your entire workspace:
```bash
python3 rag_query.py --type workspace "chromedriver setup"
python3 rag_query.py --type workspace "unifi gateway API"
```
### Access Skill Documentation
Quick reference for any openclaw skill:
```bash
python3 rag_query.py --type skill "how to check UniFi"
python3 rag_query.py --type skill "Porkbun DNS management"
```
### Manage Knowledge Base
```bash
# View statistics
python3 rag_manage.py stats
# Delete all sessions
python3 rag_manage.py delete --by-type session
# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"
```
## How It Works
### Document Ingestion
1. **Session transcripts**: Process chat history from `~/.openclaw/agents/main/sessions/*.jsonl`
- Handles OpenClaw event format (session metadata, messages, tool calls)
- Chunks messages into groups of 20 with overlap
- Extracts and formats thinking, tool calls, and results
2. **Workspace files**: Scans workspace for code, docs, configs
- Supports: `.py`, `.js`, `.ts`, `.md`, `.json`, `. yaml`, `.sh`, `.html`, `.css`
- Skips files > 1MB and binary files
- Chunking for long documents
3. **Skills**: Indexes all `SKILL.md` files
- Captures skill documentation and usage examples
- Organized by skill name
### Semantic Search
ChromaDB uses `all-MiniLM-L6-v2` embedding model (79MB) to convert text to vector representations. Similar meanings cluster together, enabling semantic search beyond keyword matching.
### Automatic RAG Integration
When the AI responds to a question that could benefit from context, it automatically:
1. Searches the knowledge base
2. Retrieves relevant past conversations, code, or docs
3. Includes that context in the response
This happens transparently - the AI just "knows" about your past work.
## Configuration
### Custom Session Directory
```bash
python3 ingest_sessions.py --sessions-dir /path/to/sessions
```
### Chunk Size Control
```bash
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
```
### Custom Collection Name
```python
from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")
```
## Data Types
| Type | Source | Description |
|------|--------|-------------|
| **session** | `session:{key}` | Chat history transcripts |
| **workspace** | `relative/path` | Code, configs, docs |
| **skill** | `skill:{name}` | Skill documentation |
| **memory** | `MEMORY.md` | Long-term memory entries |
| **manual** | `{custom}` | Manually added docs |
| **api** | `api-docs:{name}` | API documentation |
## Performance
- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
- **Storage**: ~100MB per 1,000 documents
- **Indexing time**: ~1,000 docs/min
- **Search time**: <100ms (after first query loads embeddings)
## Troubleshooting
### No Results Found
- Check if anything is indexed: `python3 rag_manage.py stats`
- Try broader queries or different wording
- Try without filters: remove `--type` if using it
### Slow First Search
The first search after ingestion loads embeddings (~1-2 seconds). Subsequent searches are much faster.
### Memory Issues
Reset collection if needed:
```bash
python3 rag_manage.py reset
```
### Duplicate ID Errors
If you see "Expected IDs to be unique" errors:
1. Reset the collection
2. Re-run ingestion
3. The fix includes `chunk_index` in ID generation
### ChromaDB Download Stuck
On first run, ChromaDB downloads the embedding model (~79MB). This takes 1-2 minutes. Let it complete.
## Automatic Updates
### Setup Scheduled Indexing
The RAG system includes an automatic update script that runs daily:
```bash
# Manual test
bash /home/william/.openclaw/workspace/scripts/rag-auto-update.sh
```
**What it does:**
- Detects new/updated chat sessions and re-indexes them
- Re-indexes workspace files (captures code changes)
- Updates skill documentation
- Maintains state to avoid re-processing unchanged files
- Runs via cron at 4:00 AM UTC daily
**Configuration:**
```bash
# View cron job
openclaw cron list
# Edit schedule (if needed)
openclaw cron update <job-id> --schedule "{\"expr\":\"0 4 * * *\"}"
```
**State tracking:** `~/.openclaw/workspace/memory/rag-auto-state.json`
**Log file:** `~/.openclaw/workspace/memory/rag-auto-update.log`
## Moltbook Integration
Share RAG updates and announcements with the Moltbook community.
### Quick Post
```bash
# Post from draft
python3 scripts/moltbook_post.py --file drafts/moltbook-post-rag-release.md
# Post directly
python3 scripts/moltbook_post.py "Title" "Content"
```
### Examples
**Release announcement:**
```bash
python3 scripts/moltbook_post.py --file drafts/moltbook-post-rag-release.md --submolt general
```
**Quick update:**
```bash
python3 scripts/moltbook_post.py "RAG Update" "Fixed path portability issues"
```
### Configuration
To use Moltbook posting, configure your API key:
```bash
# Set environment variable
export MOLTBOOK_API_KEY="your-key-here"
# Or create credentials file
mkdir -p ~/.config/moltbook
cat > ~/.config/moltbook/credentials.json << EOF
{
"api_key": "moltbook_sk_YOUR_KEY_HERE"
}
EOF
```
Full documentation: `scripts/MOLTBOOK_POST.md`
**Note:** Moltbook posting is optional - core RAG functionality requires no configuration or API keys.
### Rate Limits
- Posts: 1 per 30 minutes
- Comments: 1 per 20 seconds
### Best Practices
### Automatic Update Enabled
The RAG system now automatically updates daily - no manual re-indexing needed.
After significant work, you can still manually update:
```bash
bash /home/william/.openclaw/workspace/scripts/rag-auto-update.sh
```
### Use Specific Queries
Better results with focused queries:
```bash
# Good
python3 rag_query.py "voip.ms getSMS API method"
# Less specific
python3 rag_query.py "API"
```
### Filter by Type
When you know the data type:
```bash
# Looking for code
python3 rag_query.py --type workspace "chromedriver"
# Looking for past conversations
python3 rag_query.py --type session "SMS"
```
### Document Decisions
After important decisions, add to knowledge base:
```bash
python3 rag_manage.py add \
--text "Decision: Use Playwright not Selenium for Reddit automation. Reason: Better Cloudflare bypass handles. Date: 2026-02-11" \
--source "decision:reddit-automation" \
--type "decision"
```
## Limitations
- Files > 1MB are automatically skipped (performance)
- First search is slower (embedding load)
- Requires ~100MB disk space per 1,000 documents
- Python 3.7+ required
## License
MIT License - Free to use and modify
## Contributing
Contributions welcome! Areas for improvement:
- API documentation indexing from external URLs
- File system watch for automatic re-indexing
- Better chunking strategies for long documents
- Integration with external vector stores (Pinecone, Weaviate)
## Documentation Files
- **CHANGELOG.md** - Version history and changes
- **SKILL.md** - OpenClaw skill integration guide
- **package.json** - Skill metadata (no credentials required)
- **LICENSE** - MIT License
## Author
Nova AI Assistant for William Mantly (Theta42)
## Repository
https://openclaw-rag-skill.projects.theta42.com
Published on: clawhub.com