Initial commit: OpenClaw RAG Knowledge System
- Full RAG system for OpenClaw agents - Semantic search across chat history, code, docs, skills - ChromaDB integration (all-MiniLM-L6-v2 embeddings) - Automatic AI context retrieval - Ingest pipelines for sessions, workspace, skills - Python API and CLI interfaces - Document management (add, delete, stats, reset)
This commit is contained in:
294
README.md
Normal file
294
README.md
Normal file
@@ -0,0 +1,294 @@
|
||||
# OpenClaw RAG Knowledge System
|
||||
|
||||
Full-featured Retrieval-Augmented Generation (RAG) system for OpenClaw - search across chat history, code, documentation, and skills with semantic understanding.
|
||||
|
||||
## Features
|
||||
|
||||
- **Semantic Search**: Find relevant context by meaning, not just keywords
|
||||
- **Multi-Source Indexing**: Sessions, workspace files, skill documentation
|
||||
- **Local Vector Store**: ChromaDB with built-in embeddings (no API keys required)
|
||||
- **Automatic Integration**: AI automatically consults knowledge base when responding
|
||||
- **Type Filtering**: Search by document type (session, workspace, skill, memory)
|
||||
- **Management Tools**: Add/remove documents, view statistics, reset collection
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
# No external dependencies - just Python 3
|
||||
cd ~/.openclaw/workspace/rag
|
||||
python3 -m pip install --user chromadb
|
||||
```
|
||||
|
||||
### Index Your Data
|
||||
|
||||
```bash
|
||||
# Index all chat sessions
|
||||
python3 ingest_sessions.py
|
||||
|
||||
# Index workspace code and docs
|
||||
python3 ingest_docs.py workspace
|
||||
|
||||
# Index skill documentation
|
||||
python3 ingest_docs.py skills
|
||||
```
|
||||
|
||||
### Search the Knowledge Base
|
||||
|
||||
```bash
|
||||
# Interactive search mode
|
||||
python3 rag_query.py -i
|
||||
|
||||
# Quick search
|
||||
python3 rag_query.py "how to send SMS"
|
||||
|
||||
# Search by type
|
||||
python3 rag_query.py "voip.ms" --type session
|
||||
python3 rag_query.py "Porkbun DNS" --type skill
|
||||
```
|
||||
|
||||
### Integration in Python Code
|
||||
|
||||
```python
|
||||
import sys
|
||||
sys.path.insert(0, '/home/william/.openclaw/workspace/rag')
|
||||
from rag_query_wrapper import search_knowledge
|
||||
|
||||
# Search and get structured results
|
||||
results = search_knowledge("Reddit account automation")
|
||||
print(f"Found {results['count']} results")
|
||||
|
||||
# Format for AI consumption
|
||||
from rag_query_wrapper import format_for_ai
|
||||
context = format_for_ai(results)
|
||||
print(context)
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
rag/
|
||||
├── rag_system.py # Core RAG class (ChromaDB wrapper)
|
||||
├── ingest_sessions.py # Load chat history from sessions
|
||||
├── ingest_docs.py # Load workspace files & skill docs
|
||||
├── rag_query.py # Search the knowledge base
|
||||
├── rag_manage.py # Document management
|
||||
├── rag_query_wrapper.py # Simple Python API
|
||||
└── SKILL.md # OpenClaw skill documentation
|
||||
```
|
||||
|
||||
Data storage: `~/.openclaw/data/rag/` (ChromaDB persistent storage)
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### Find Past Solutions
|
||||
|
||||
When you encounter a problem, search for similar past issues:
|
||||
|
||||
```bash
|
||||
python3 rag_query.py "cloudflare bypass failed selenium"
|
||||
python3 rag_query.py "voip.ms SMS client"
|
||||
python3 rag_query.py "porkbun DNS API"
|
||||
```
|
||||
|
||||
### Search Through Codebase
|
||||
|
||||
Find code and documentation across your entire workspace:
|
||||
|
||||
```bash
|
||||
python3 rag_query.py --type workspace "chromedriver setup"
|
||||
python3 rag_query.py --type workspace "unifi gateway API"
|
||||
```
|
||||
|
||||
### Access Skill Documentation
|
||||
|
||||
Quick reference for any openclaw skill:
|
||||
|
||||
```bash
|
||||
python3 rag_query.py --type skill "how to check UniFi"
|
||||
python3 rag_query.py --type skill "Porkbun DNS management"
|
||||
```
|
||||
|
||||
### Manage Knowledge Base
|
||||
|
||||
```bash
|
||||
# View statistics
|
||||
python3 rag_manage.py stats
|
||||
|
||||
# Delete all sessions
|
||||
python3 rag_manage.py delete --by-type session
|
||||
|
||||
# Delete specific file
|
||||
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"
|
||||
```
|
||||
|
||||
## How It Works
|
||||
|
||||
### Document Ingestion
|
||||
|
||||
1. **Session transcripts**: Process chat history from `~/.openclaw/agents/main/sessions/*.jsonl`
|
||||
- Handles OpenClaw event format (session metadata, messages, tool calls)
|
||||
- Chunks messages into groups of 20 with overlap
|
||||
- Extracts and formats thinking, tool calls, and results
|
||||
|
||||
2. **Workspace files**: Scans workspace for code, docs, configs
|
||||
- Supports: `.py`, `.js`, `.ts`, `.md`, `.json`, `. yaml`, `.sh`, `.html`, `.css`
|
||||
- Skips files > 1MB and binary files
|
||||
- Chunking for long documents
|
||||
|
||||
3. **Skills**: Indexes all `SKILL.md` files
|
||||
- Captures skill documentation and usage examples
|
||||
- Organized by skill name
|
||||
|
||||
### Semantic Search
|
||||
|
||||
ChromaDB uses `all-MiniLM-L6-v2` embedding model (79MB) to convert text to vector representations. Similar meanings cluster together, enabling semantic search beyond keyword matching.
|
||||
|
||||
### Automatic RAG Integration
|
||||
|
||||
When the AI responds to a question that could benefit from context, it automatically:
|
||||
1. Searches the knowledge base
|
||||
2. Retrieves relevant past conversations, code, or docs
|
||||
3. Includes that context in the response
|
||||
|
||||
This happens transparently - the AI just "knows" about your past work.
|
||||
|
||||
## Configuration
|
||||
|
||||
### Custom Session Directory
|
||||
|
||||
```bash
|
||||
python3 ingest_sessions.py --sessions-dir /path/to/sessions
|
||||
```
|
||||
|
||||
### Chunk Size Control
|
||||
|
||||
```bash
|
||||
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
|
||||
```
|
||||
|
||||
### Custom Collection Name
|
||||
|
||||
```python
|
||||
from rag_system import RAGSystem
|
||||
rag = RAGSystem(collection_name="my_knowledge")
|
||||
```
|
||||
|
||||
## Data Types
|
||||
|
||||
| Type | Source | Description |
|
||||
|------|--------|-------------|
|
||||
| **session** | `session:{key}` | Chat history transcripts |
|
||||
| **workspace** | `relative/path` | Code, configs, docs |
|
||||
| **skill** | `skill:{name}` | Skill documentation |
|
||||
| **memory** | `MEMORY.md` | Long-term memory entries |
|
||||
| **manual** | `{custom}` | Manually added docs |
|
||||
| **api** | `api-docs:{name}` | API documentation |
|
||||
|
||||
## Performance
|
||||
|
||||
- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
|
||||
- **Storage**: ~100MB per 1,000 documents
|
||||
- **Indexing time**: ~1,000 docs/min
|
||||
- **Search time**: <100ms (after first query loads embeddings)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### No Results Found
|
||||
|
||||
- Check if anything is indexed: `python3 rag_manage.py stats`
|
||||
- Try broader queries or different wording
|
||||
- Try without filters: remove `--type` if using it
|
||||
|
||||
### Slow First Search
|
||||
|
||||
The first search after ingestion loads embeddings (~1-2 seconds). Subsequent searches are much faster.
|
||||
|
||||
### Memory Issues
|
||||
|
||||
Reset collection if needed:
|
||||
```bash
|
||||
python3 rag_manage.py reset
|
||||
```
|
||||
|
||||
### Duplicate ID Errors
|
||||
|
||||
If you see "Expected IDs to be unique" errors:
|
||||
1. Reset the collection
|
||||
2. Re-run ingestion
|
||||
3. The fix includes `chunk_index` in ID generation
|
||||
|
||||
### ChromaDB Download Stuck
|
||||
|
||||
On first run, ChromaDB downloads the embedding model (~79MB). This takes 1-2 minutes. Let it complete.
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Re-index Regularly
|
||||
|
||||
After significant work, re-ingest to keep knowledge current:
|
||||
```bash
|
||||
python3 ingest_sessions.py
|
||||
python3 ingest_docs.py workspace
|
||||
```
|
||||
|
||||
### Use Specific Queries
|
||||
|
||||
Better results with focused queries:
|
||||
```bash
|
||||
# Good
|
||||
python3 rag_query.py "voip.ms getSMS API method"
|
||||
|
||||
# Less specific
|
||||
python3 rag_query.py "API"
|
||||
```
|
||||
|
||||
### Filter by Type
|
||||
|
||||
When you know the data type:
|
||||
```bash
|
||||
# Looking for code
|
||||
python3 rag_query.py --type workspace "chromedriver"
|
||||
|
||||
# Looking for past conversations
|
||||
python3 rag_query.py --type session "SMS"
|
||||
```
|
||||
|
||||
### Document Decisions
|
||||
|
||||
After important decisions, add to knowledge base:
|
||||
```bash
|
||||
python3 rag_manage.py add \
|
||||
--text "Decision: Use Playwright not Selenium for Reddit automation. Reason: Better Cloudflare bypass handles. Date: 2026-02-11" \
|
||||
--source "decision:reddit-automation" \
|
||||
--type "decision"
|
||||
```
|
||||
|
||||
## Limitations
|
||||
|
||||
- Files > 1MB are automatically skipped (performance)
|
||||
- First search is slower (embedding load)
|
||||
- Requires ~100MB disk space per 1,000 documents
|
||||
- Python 3.7+ required
|
||||
|
||||
## License
|
||||
|
||||
MIT License - Free to use and modify
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions welcome! Areas for improvement:
|
||||
- API documentation indexing from external URLs
|
||||
- Automated re-indexing cron job
|
||||
- Better chunking strategies for long documents
|
||||
- Integration with external vector stores (Pinecone, Weaviate)
|
||||
|
||||
## Author
|
||||
|
||||
Nova AI Assistant for William Mantly (Theta42)
|
||||
|
||||
## Repository
|
||||
|
||||
https://git.theta42.com/nova/openclaw-rag-skill
|
||||
Published on: clawhub.com
|
||||
Reference in New Issue
Block a user