Nova AI b272748209 Initial commit: OpenClaw RAG Knowledge System
- Full RAG system for OpenClaw agents
- Semantic search across chat history, code, docs, skills
- ChromaDB integration (all-MiniLM-L6-v2 embeddings)
- Automatic AI context retrieval
- Ingest pipelines for sessions, workspace, skills
- Python API and CLI interfaces
- Document management (add, delete, stats, reset)
2026-02-11 03:47:38 +00:00

# OpenClaw RAG Knowledge System
**Retrieval-Augmented Generation for OpenClaw: search chat history, code, docs, and skills with semantic understanding**
## Overview
This skill provides a complete RAG (Retrieval-Augmented Generation) system for OpenClaw. It indexes your entire knowledge base (chat transcripts, workspace code, skill documentation) and enables semantic search across everything.
**Key features:**
- 🧠 Semantic search across all conversations and code
- 📚 Automatic knowledge base management
- 🔍 Find past solutions, code patterns, decisions instantly
- 💾 Local ChromaDB storage (no API keys required)
- 🚀 Automatic AI integration: retrieves context transparently
## Installation
### Prerequisites
- Python 3.7+
- OpenClaw workspace
### Setup
```bash
# Navigate to your OpenClaw workspace
cd ~/.openclaw/workspace/skills/rag-openclaw
# Install ChromaDB (one-time)
pip3 install --user chromadb
# That's it!
```
## Quick Start
### 1. Index Your Knowledge
```bash
# Index all chat history
python3 ingest_sessions.py
# Index workspace code and docs
python3 ingest_docs.py workspace
# Index skill documentation
python3 ingest_docs.py skills
```
### 2. Search the Knowledge Base
```bash
# Interactive search mode
python3 rag_query.py -i
# Quick search
python3 rag_query.py "how to send SMS via voip.ms"
# Search by type
python3 rag_query.py "porkbun DNS" --type skill
python3 rag_query.py "chromedriver" --type workspace
python3 rag_query.py "Reddit automation" --type session
```
### 3. Check Statistics
```bash
# See what's indexed
python3 rag_manage.py stats
```
## Usage Examples
### Finding Past Solutions
Hit a problem? Search for how you solved it before:
```bash
python3 rag_query.py "cloudflare bypass selenium"
python3 rag_query.py "voip.ms SMS configuration"
python3 rag_query.py "porkbun update DNS record"
```
### Searching Through Codebase
Find specific code or documentation:
```bash
python3 rag_query.py --type workspace "unifi gateway API"
python3 rag_query.py --type workspace "SMS client"
```
### Quick Reference
Access skill documentation without digging through files:
```bash
python3 rag_query.py --type skill "how to monitor UniFi"
python3 rag_query.py --type skill "Porkbun tool usage"
```
### Programmatic Use
From within Python scripts or OpenClaw sessions:
```python
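import os
import sys
# Add the skill directory to the import path (adjust if your workspace lives elsewhere)
sys.path.insert(0, os.path.expanduser('~/.openclaw/workspace/skills/rag-openclaw'))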
from rag_query_wrapper import search_knowledge, format_for_ai
# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} relevant items")
# Format for AI consumption
context = format_for_ai(results)
print(context)
```
## Files Reference
| File | Purpose |
|------|---------|
| `rag_system.py` | Core RAG class (ChromaDB wrapper) |
| `ingest_sessions.py` | Index chat history |
| `ingest_docs.py` | Index workspace files & skills |
| `rag_query.py` | Search interface (CLI & interactive) |
| `rag_manage.py` | Document management (stats, delete, reset) |
| `rag_query_wrapper.py` | Simple Python API for programmatic use |
| `README.md` | Full documentation |
## How It Works
### Indexing
**Sessions:**
- Reads `~/.openclaw/agents/main/sessions/*.jsonl`
- Handles OpenClaw event format (session metadata, messages, tool calls)
- Chunks messages (20 per chunk, 5 message overlap)
- Extracts and formats thinking, tool calls, results
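
The overlapping chunking described above can be sketched as a sliding window. This is a minimal illustration, not the code in `ingest_sessions.py`: the function name `chunk_messages` and the plain-string message format are assumptions, and only the 20/5 defaults come from this document.

```python
from typing import List, Sequence


def chunk_messages(
    messages: Sequence[str], chunk_size: int = 20, overlap: int = 5
) -> List[List[str]]:
    """Split messages into windows of chunk_size that share `overlap` messages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # each new chunk starts this many messages later
    chunks: List[List[str]] = []
    for start in range(0, len(messages), step):
        chunks.append(list(messages[start:start + chunk_size]))
        if start + chunk_size >= len(messages):
            break  # this window already reached the last message
    return chunks


# Hypothetical session of 50 messages
messages = [f"msg-{i}" for i in range(50)]
chunks = chunk_messages(messages)
for chunk in chunks:
    print(chunk[0], "...", chunk[-1])
```

The overlap means a question and its answer are less likely to be split across two chunks with no shared context, at the cost of storing some messages twice.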
**Workspace:**
- Scans for `.py`, `.js`, `.ts`, `.md`, `.json`, `.yaml`, `.sh`, `.html`, `.css`
- Skips files > 1MB and binary files
- Chunks long documents for better retrieval
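
A file filter along these lines could look like the sketch below. The names `iter_indexable_files` and `looks_binary` are hypothetical, and the NUL-byte check is one common heuristic for binary detection; `ingest_docs.py` may use a different test.

```python
from pathlib import Path
from typing import Iterator

# Extensions and size limit as listed in this document
INDEXED_EXTENSIONS = {".py", ".js", ".ts", ".md", ".json", ".yaml", ".sh", ".html", ".css"}
MAX_FILE_BYTES = 1024 * 1024  # 1MB


def looks_binary(path: Path, sample_size: int = 8192) -> bool:
    """Heuristic (an assumption): a NUL byte in the first bytes means binary."""
    with path.open("rb") as handle:
        return b"\x00" in handle.read(sample_size)


def iter_indexable_files(root: Path) -> Iterator[Path]:
    """Yield files under root that pass the extension, size, and binary checks."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() not in INDEXED_EXTENSIONS:
            continue
        if path.stat().st_size > MAX_FILE_BYTES:
            continue  # too large to index
        if looks_binary(path):
            continue
        yield path
```

Usage would be something like `for path in iter_indexable_files(Path.home() / ".openclaw" / "workspace"): ...`, with the workspace path adjusted to your setup.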
**Skills:**
- Indexes all `SKILL.md` files
- Organized by skill name for easy reference
### Search
ChromaDB uses `all-MiniLM-L6-v2` embeddings to convert text to vectors. Similar meanings cluster together, enabling semantic search by *meaning* not just *keywords*.
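
To make "similar meanings cluster together" concrete, the sketch below ranks documents by cosine similarity. The three-dimensional vectors are hand-made toy values, not real model output (`all-MiniLM-L6-v2` produces 384-dimensional vectors), and cosine similarity is only one common measure; the distance ChromaDB actually applies depends on how the collection is configured.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (document, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values for illustration only)
doc_vecs = {
    "send SMS via API": [0.9, 0.1, 0.0],
    "update DNS record": [0.1, 0.9, 0.1],
    "text message gateway": [0.8, 0.2, 0.1],
}
query_vec = [0.85, 0.15, 0.05]  # stands in for "how to send a text"

ranked = rank(query_vec, doc_vecs)
for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

In this toy setup, "text message gateway" scores close to the SMS document even though it shares no keywords with it, while the DNS document ranks last.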
### Automatic Integration
When the AI responds, it automatically:
1. Searches the knowledge base for relevant context
2. Retrieves past conversations, code, or docs
3. Includes that context in the response
This happens transparently: the AI "remembers" your past work.
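
The three steps above amount to a retrieve-then-prompt loop, roughly as sketched here. This is not the skill's actual integration code: `build_prompt`, `fake_retrieve`, and the `source`/`text` keys are hypothetical stand-ins, and a real setup would pass in a function backed by the knowledge base.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[Dict[str, str]]]


def build_prompt(question: str, retrieve: Retriever, max_items: int = 3) -> str:
    """Search for context, keep the top hits, and prepend them to the question."""
    hits = retrieve(question)[:max_items]
    if not hits:
        return question  # nothing relevant found; send the question unchanged
    context_lines = [f"[{hit['source']}] {hit['text']}" for hit in hits]
    return "Relevant context:\n" + "\n".join(context_lines) + "\n\nQuestion: " + question


def fake_retrieve(query: str) -> List[Dict[str, str]]:
    """Stub retriever returning one canned hit (illustration only)."""
    return [{"source": "session:example", "text": "Switched to Playwright for Cloudflare challenges."}]


question = "How did we get past the Cloudflare challenge?"
prompt = build_prompt(question, fake_retrieve)
print(prompt)
```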
## Management
### View Statistics
```bash
python3 rag_manage.py stats
```
Output:
```
📊 OpenClaw RAG Statistics
Collection: openclaw_knowledge
Total Documents: 635
By Source:
session-001: 23
my-script.py: 5
porkbun: 12
By Type:
session: 500
workspace: 100
skill: 35
```
### Delete Documents
```bash
# Delete all sessions
python3 rag_manage.py delete --by-type session
# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"
# Reset entire collection
python3 rag_manage.py reset
```
### Add Manual Document
```bash
python3 rag_manage.py add \
--text "API endpoint: https://api.example.com/endpoint" \
--source "api-docs:example.com" \
--type "manual"
```
## Configuration
### Custom Session Directory
```bash
python3 ingest_sessions.py --sessions-dir /path/to/sessions
```
### Chunk Size Control
```bash
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
```
### Custom Collection
```python
from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")
```
## Data Types
| Type | Source Format | Description |
|------|--------------|-------------|
| `session` | `session:{key}` | Chat history transcripts |
| `workspace` | `relative/path/to/file` | Code, configs, docs |
| `skill` | `skill:{name}` | Skill documentation |
| `memory` | `MEMORY.md` | Long-term memory entries |
| `manual` | `{custom}` | Manually added docs |
| `api` | `api-docs:{name}` | API documentation |
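
The source formats in the table can be expressed as a small helper. `format_source` is a hypothetical function written for illustration; the skill's own scripts may build these strings differently.

```python
def format_source(doc_type: str, key: str = "") -> str:
    """Build a source string for a document type, following the table above."""
    if doc_type == "session":
        return f"session:{key}"
    if doc_type == "skill":
        return f"skill:{key}"
    if doc_type == "api":
        return f"api-docs:{key}"
    if doc_type == "memory":
        return "MEMORY.md"
    if doc_type in ("workspace", "manual"):
        return key  # relative file path or custom label, used as-is
    raise ValueError(f"unknown document type: {doc_type}")


print(format_source("skill", "porkbun"))
print(format_source("workspace", "scripts/voipms_sms_client.py"))
```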
## Performance
- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
- **Storage**: ~100MB per 1,000 documents
- **Indexing**: ~1,000 documents/minute
- **Search**: <100ms (after first query)
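
These rates allow a quick back-of-the-envelope estimate for a given collection size. The helper below is a hypothetical sketch that assumes both figures scale linearly, which is only a rough approximation; real numbers will vary with document length and hardware.

```python
from typing import Dict


def estimate_index_cost(
    doc_count: int, mb_per_1000: float = 100.0, docs_per_minute: float = 1000.0
) -> Dict[str, float]:
    """Estimate disk use and indexing time from the approximate rates above."""
    return {
        "disk_mb": doc_count * mb_per_1000 / 1000,
        "index_seconds": doc_count / docs_per_minute * 60,
    }


# 635 is the document count from the sample statistics output in this document
estimate = estimate_index_cost(635)
print(f"~{estimate['disk_mb']:.1f} MB on disk, ~{estimate['index_seconds']:.0f} s to index")
```

For 635 documents this comes to about 63.5 MB of storage and a little under 40 seconds of indexing.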
## Troubleshooting
### No Results Found
```bash
# Check what's indexed
python3 rag_manage.py stats
# Try broader query
python3 rag_query.py "SMS" # instead of "voip.ms SMS API endpoint"
```
### Slow First Search
The first search loads the embedding model (~1-2 seconds). Subsequent searches typically return in under 100ms.
### Duplicate ID Errors
```bash
# Reset and re-index
python3 rag_manage.py reset
python3 ingest_sessions.py
python3 ingest_docs.py workspace
```
### ChromaDB Model Download
The first run downloads the embedding model (79MB), which takes 1-2 minutes. Let it complete before running further commands.
## Best Practices
### Re-index Regularly
After significant work:
```bash
python3 ingest_sessions.py # New conversations
python3 ingest_docs.py workspace # New code/changes
```
### Use Specific Queries
```bash
# Better
python3 rag_query.py "voip.ms getSMS method"
# Too broad
python3 rag_query.py "SMS"
```
### Filter by Type
```bash
# Looking for code
python3 rag_query.py --type workspace "chromedriver"
# Looking for past conversations
python3 rag_query.py --type session "Reddit"
```
### Document Decisions
After important decisions, add them manually:
```bash
python3 rag_manage.py add \
--text "Decision: Use Playwright for Reddit automation. Reason: handles Cloudflare challenges better" \
--source "decision:reddit-automation" \
--type "decision"
```
## Limitations
- Files > 1MB automatically skipped (performance)
- Python 3.7+ required
- ~100MB disk per 1,000 documents
- First search slower (embedding load)
## Integration with OpenClaw
This skill integrates seamlessly with OpenClaw:
1. **Automatic RAG**: AI automatically retrieves relevant context when responding
2. **Session history**: All conversations indexed and searchable
3. **Workspace awareness**: Code and docs indexed for reference
4. **Accessible as a skill**: Use from any OpenClaw session or script
## Example Workflow
**Scenario:** You're working on a new automation but hit a Cloudflare challenge.
```bash
# Search for past Cloudflare solutions
python3 rag_query.py "Cloudflare bypass selenium"
# Result shows relevant past conversation:
# "Used undetected-chromedriver but failed. Switched to Playwright which handles challenges better."
# Now you know the solution before trying it!
```
## Repository
https://git.theta42.com/nova/openclaw-rag-skill
**Published:** clawhub.com
**Maintainer:** Nova AI Assistant
**For:** William Mantly (Theta42)
## License
MIT License - Free to use and modify