Initial commit: OpenClaw RAG Knowledge System

- Full RAG system for OpenClaw agents
- Semantic search across chat history, code, docs, skills
- ChromaDB integration (all-MiniLM-L6-v2 embeddings)
- Automatic AI context retrieval
- Ingest pipelines for sessions, workspace, skills
- Python API and CLI interfaces
- Document management (add, delete, stats, reset)
361
SKILL.md
Normal file
@@ -0,0 +1,361 @@
# OpenClaw RAG Knowledge System

**Retrieval-Augmented Generation for OpenClaw – Search chat history, code, docs, and skills with semantic understanding**

## Overview

This skill provides a complete RAG (Retrieval-Augmented Generation) system for OpenClaw. It indexes your entire knowledge base – chat transcripts, workspace code, skill documentation – and enables semantic search across everything.

**Key features:**
- 🧠 Semantic search across all conversations and code
- 📚 Automatic knowledge base management
- 🔍 Find past solutions, code patterns, decisions instantly
- 💾 Local ChromaDB storage (no API keys required)
- 🚀 Automatic AI integration – retrieves context transparently

## Installation

### Prerequisites

- Python 3.7+
- OpenClaw workspace

### Setup

```bash
# Navigate to the skill directory in your OpenClaw workspace
cd ~/.openclaw/workspace/skills/rag-openclaw

# Install ChromaDB (one-time)
pip3 install --user chromadb

# That's it!
```

## Quick Start

### 1. Index Your Knowledge

```bash
# Index all chat history
python3 ingest_sessions.py

# Index workspace code and docs
python3 ingest_docs.py workspace

# Index skill documentation
python3 ingest_docs.py skills
```

### 2. Search the Knowledge Base

```bash
# Interactive search mode
python3 rag_query.py -i

# Quick search
python3 rag_query.py "how to send SMS via voip.ms"

# Search by type
python3 rag_query.py "porkbun DNS" --type skill
python3 rag_query.py "chromedriver" --type workspace
python3 rag_query.py "Reddit automation" --type session
```

### 3. Check Statistics

```bash
# See what's indexed
python3 rag_manage.py stats
```

## Usage Examples

### Finding Past Solutions

Hit a problem? Search for how you solved it before:

```bash
python3 rag_query.py "cloudflare bypass selenium"
python3 rag_query.py "voip.ms SMS configuration"
python3 rag_query.py "porkbun update DNS record"
```

### Searching Through the Codebase

Find specific code or documentation:

```bash
python3 rag_query.py --type workspace "unifi gateway API"
python3 rag_query.py --type workspace "SMS client"
```

### Quick Reference

Access skill documentation without digging through files:

```bash
python3 rag_query.py --type skill "how to monitor UniFi"
python3 rag_query.py --type skill "Porkbun tool usage"
```

### Programmatic Use

From within Python scripts or OpenClaw sessions:

```python
import os
import sys

# Make the skill importable; adjust the path if the skill is installed elsewhere
sys.path.insert(0, os.path.expanduser('~/.openclaw/workspace/skills/rag-openclaw'))
from rag_query_wrapper import search_knowledge, format_for_ai

# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} relevant items")

# Format for AI consumption
context = format_for_ai(results)
print(context)
```

## Files Reference

| File | Purpose |
|------|---------|
| `rag_system.py` | Core RAG class (ChromaDB wrapper) |
| `ingest_sessions.py` | Index chat history |
| `ingest_docs.py` | Index workspace files & skills |
| `rag_query.py` | Search interface (CLI & interactive) |
| `rag_manage.py` | Document management (stats, delete, reset) |
| `rag_query_wrapper.py` | Simple Python API for programmatic use |
| `README.md` | Full documentation |

## How It Works

### Indexing

**Sessions:**
- Reads `~/.openclaw/agents/main/sessions/*.jsonl`
- Handles OpenClaw event format (session metadata, messages, tool calls)
- Chunks messages (20 per chunk, 5 message overlap)
- Extracts and formats thinking, tool calls, results
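
The overlapping chunking described above can be sketched as a sliding window. This is a minimal illustration, not the code in `ingest_sessions.py`: the helper name `chunk_messages` and its parameters are hypothetical, and messages are simplified to plain strings.

```python
from typing import List, Sequence


def chunk_messages(messages: Sequence[str],
                   chunk_size: int = 20,
                   overlap: int = 5) -> List[List[str]]:
    """Split messages into overlapping windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # each new chunk starts this many messages later
    chunks: List[List[str]] = []
    for start in range(0, len(messages), step):
        chunks.append(list(messages[start:start + chunk_size]))
        if start + chunk_size >= len(messages):
            break  # this window already reached the end of the session
    return chunks


messages = [f"msg-{i}" for i in range(50)]
chunks = chunk_messages(messages)
for chunk in chunks:
    print(chunk[0], "->", chunk[-1], f"({len(chunk)} messages)")
```

With the defaults, consecutive chunks share five messages, so a question asked at the end of one chunk still appears next to its answer at the start of the next.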

**Workspace:**
- Scans for `.py`, `.js`, `.ts`, `.md`, `.json`, `.yaml`, `.sh`, `.html`, `.css`
- Skips files > 1MB and binary files
- Chunks long documents for better retrieval
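
The workspace filtering rules above might look roughly like the following sketch. The function names are hypothetical, and the binary check (a NUL byte in the first 4 KB) is an assumed heuristic; the actual `ingest_docs.py` may detect binary files differently.

```python
import tempfile
from pathlib import Path
from typing import Iterator

INDEXED_EXTENSIONS = {".py", ".js", ".ts", ".md", ".json", ".yaml", ".sh", ".html", ".css"}
MAX_FILE_BYTES = 1024 * 1024  # files larger than 1 MB are skipped


def looks_binary(path: Path, probe_bytes: int = 4096) -> bool:
    """Assumed heuristic: a NUL byte near the start means the file is binary."""
    with path.open("rb") as handle:
        return b"\x00" in handle.read(probe_bytes)


def iter_indexable_files(root: Path) -> Iterator[Path]:
    """Yield files under root that pass the extension, size, and binary checks."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() not in INDEXED_EXTENSIONS:
            continue
        if path.stat().st_size > MAX_FILE_BYTES:
            continue
        if looks_binary(path):
            continue
        yield path


# Demonstrate the filter on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.md").write_text("# Notes\n", encoding="utf-8")
    (root / "tool.py").write_text("print('hi')\n", encoding="utf-8")
    (root / "image.png").write_bytes(b"\x89PNG\x00")               # wrong extension
    (root / "huge.json").write_bytes(b"x" * (MAX_FILE_BYTES + 1))  # too large
    (root / "blob.sh").write_bytes(b"\x00\x01\x02")                # binary content
    found = [p.name for p in iter_indexable_files(root)]
    print(found)
```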

**Skills:**
- Indexes all `SKILL.md` files
- Organized by skill name for easy reference

### Search

ChromaDB uses `all-MiniLM-L6-v2` embeddings to convert text to vectors. Texts with similar meanings end up close together in vector space, which enables search by *meaning*, not just *keywords*.
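
To make "similar meanings sit close together" concrete, here is a toy ranking over hand-written vectors. The three-dimensional vectors are invented for illustration (the real model produces 384-dimensional embeddings), and cosine similarity is used only as an example metric; the distance function ChromaDB applies depends on how the collection is configured.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float],
         doc_vecs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (document, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up vectors standing in for real embeddings
doc_vecs = {
    "send a text message": [0.9, 0.1, 0.0],
    "update a DNS record": [0.0, 0.2, 0.9],
    "restart the web server": [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "how to send SMS"

ranking = rank(query_vec, doc_vecs)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

The query shares no keyword with "send a text message", yet that document ranks first because its vector points in nearly the same direction.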

### Automatic Integration

When the AI responds, it automatically:
1. Searches the knowledge base for relevant context
2. Retrieves past conversations, code, or docs
3. Includes that context in the response

This happens transparently – the AI "remembers" your past work.
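
The retrieve-then-respond flow above can be sketched as a small prompt builder. Everything here is hypothetical: `build_augmented_prompt`, the `fake_search` stand-in, and the `source`/`text` result keys are assumptions made for the example, not the actual OpenClaw integration.

```python
from typing import Callable, Dict, List

SearchFn = Callable[[str, int], List[Dict[str, str]]]


def build_augmented_prompt(question: str, search: SearchFn, top_k: int = 3) -> str:
    """Prepend retrieved context to the question (hypothetical helper)."""
    hits = search(question, top_k)
    if not hits:
        return question  # nothing relevant found; pass the question through unchanged
    context_lines = [f"[{hit['source']}] {hit['text']}" for hit in hits]
    return "Relevant context:\n" + "\n".join(context_lines) + "\n\nQuestion: " + question


def fake_search(query: str, top_k: int) -> List[Dict[str, str]]:
    """Stand-in for a real knowledge-base lookup; returns canned results."""
    canned = [
        {"source": "session:example", "text": "Switched to Playwright for the Cloudflare challenge."},
        {"source": "skill:example", "text": "Usage notes for an example skill."},
    ]
    return canned[:top_k]


prompt = build_augmented_prompt("How did we get past Cloudflare?", fake_search, top_k=1)
print(prompt)
```

Passing the search function as a parameter keeps the sketch independent of any particular backend; in practice it would wrap a real query against the indexed collection.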

## Management

### View Statistics

```bash
python3 rag_manage.py stats
```

Output:
```
📊 OpenClaw RAG Statistics

Collection: openclaw_knowledge
Total Documents: 635

By Source:
session-001: 23
my-script.py: 5
porkbun: 12

By Type:
session: 500
workspace: 100
skill: 35
```
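
A per-source and per-type breakdown like the one above can be computed by counting metadata fields. The sketch below assumes each document carries `source` and `type` metadata keys, which is a guess based on the output format rather than a confirmed detail of `rag_manage.py`.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(metadatas: Iterable[Dict[str, str]]) -> Dict[str, Counter]:
    """Count documents per source and per type (hypothetical helper)."""
    by_source: Counter = Counter()
    by_type: Counter = Counter()
    for meta in metadatas:
        by_source[meta.get("source", "unknown")] += 1
        by_type[meta.get("type", "unknown")] += 1
    return {"by_source": by_source, "by_type": by_type}


# Sample metadata records, invented for the example
metadatas = [
    {"source": "session-001", "type": "session"},
    {"source": "session-001", "type": "session"},
    {"source": "my-script.py", "type": "workspace"},
    {"source": "porkbun", "type": "skill"},
]

summary = summarize(metadatas)
print(f"Total Documents: {sum(summary['by_type'].values())}")
for label, counts in (("By Source", summary["by_source"]), ("By Type", summary["by_type"])):
    print(f"\n{label}:")
    for name, count in counts.most_common():
        print(f"  {name}: {count}")
```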

### Delete Documents

```bash
# Delete all sessions
python3 rag_manage.py delete --by-type session

# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"

# Reset entire collection
python3 rag_manage.py reset
```

### Add Manual Document

```bash
python3 rag_manage.py add \
  --text "API endpoint: https://api.example.com/endpoint" \
  --source "api-docs:example.com" \
  --type "manual"
```

## Configuration

### Custom Session Directory

```bash
python3 ingest_sessions.py --sessions-dir /path/to/sessions
```

### Chunk Size Control

```bash
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
```

### Custom Collection

```python
from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")
```

## Data Types

| Type | Source Format | Description |
|------|--------------|-------------|
| `session` | `session:{key}` | Chat history transcripts |
| `workspace` | `relative/path/to/file` | Code, configs, docs |
| `skill` | `skill:{name}` | Skill documentation |
| `memory` | `MEMORY.md` | Long-term memory entries |
| `manual` | `{custom}` | Manually added docs |
| `api` | `api-docs:{name}` | API documentation |

## Performance

- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
- **Storage**: ~100MB per 1,000 documents
- **Indexing**: ~1,000 documents/minute
- **Search**: <100ms (after first query)

## Troubleshooting

### No Results Found

```bash
# Check what's indexed
python3 rag_manage.py stats

# Try a broader query
python3 rag_query.py "SMS"  # instead of "voip.ms SMS API endpoint"
```

### Slow First Search

The first search loads the embedding model, which takes about 1-2 seconds. Subsequent searches typically return in under 100ms.

### Duplicate ID Errors

```bash
# Reset and re-index (a reset clears everything, so re-index all sources)
python3 rag_manage.py reset
python3 ingest_sessions.py
python3 ingest_docs.py workspace
python3 ingest_docs.py skills
```

### ChromaDB Model Download

The first run downloads the embedding model (79MB), which takes 1-2 minutes. Let it complete.

## Best Practices

### Re-index Regularly

After significant work:

```bash
python3 ingest_sessions.py  # New conversations
python3 ingest_docs.py workspace  # New code/changes
```

### Use Specific Queries

```bash
# Better
python3 rag_query.py "voip.ms getSMS method"

# Too broad
python3 rag_query.py "SMS"
```

### Filter by Type

```bash
# Looking for code
python3 rag_query.py --type workspace "chromedriver"

# Looking for past conversations
python3 rag_query.py --type session "Reddit"
```

### Document Decisions

After important decisions, add them manually:

```bash
python3 rag_manage.py add \
  --text "Decision: Use Playwright for Reddit automation. Reason: Playwright handles Cloudflare challenges better" \
  --source "decision:reddit-automation" \
  --type "decision"
```

## Limitations

- Files > 1MB are skipped automatically (for performance)
- Python 3.7+ required
- ~100MB of disk per 1,000 documents
- The first search is slower (the embedding model must load)

## Integration with OpenClaw

This skill integrates with OpenClaw in four ways:

1. **Automatic RAG**: The AI automatically retrieves relevant context when responding
2. **Session history**: All conversations are indexed and searchable
3. **Workspace awareness**: Code and docs are indexed for reference
4. **Skill access**: Use it from any OpenClaw session or script

## Example Workflow

**Scenario:** You're working on a new automation but hit a Cloudflare challenge.

```bash
# Search for past Cloudflare solutions
python3 rag_query.py "Cloudflare bypass selenium"

# Result shows relevant past conversation:
# "Used undetected-chromedriver but failed. Switched to Playwright which handles challenges better."

# Now you know the solution before trying it!
```

## Repository

https://git.theta42.com/nova/openclaw-rag-skill

**Published:** clawhub.com
**Maintainer:** Nova AI Assistant
**For:** William Mantly (Theta42)

## License

MIT License - Free to use and modify