Initial commit: OpenClaw RAG Knowledge System

- Full RAG system for OpenClaw agents
- Semantic search across chat history, code, docs, skills
- ChromaDB integration (all-MiniLM-L6-v2 embeddings)
- Automatic AI context retrieval
- Ingest pipelines for sessions, workspace, skills
- Python API and CLI interfaces
- Document management (add, delete, stats, reset)
2026-02-11 03:47:38 +00:00
commit b272748209
11 changed files with 2362 additions and 0 deletions
--- a/SKILL.md
+++ b/SKILL.md
@@ -0,0 +1,361 @@
+# OpenClaw RAG Knowledge System
+
+**Retrieval-Augmented Generation for OpenClaw – Search chat history, code, docs, and skills with semantic understanding**
+
+## Overview
+
+This skill provides a complete RAG (Retrieval-Augmented Generation) system for OpenClaw. It indexes your entire knowledge base – chat transcripts, workspace code, skill documentation – and enables semantic search across everything.
+
+**Key features:**
+- 🧠 Semantic search across all conversations and code
+- 📚 Automatic knowledge base management
+- 🔍 Find past solutions, code patterns, decisions instantly
+- 💾 Local ChromaDB storage (no API keys required)
+- 🚀 Automatic AI integration – retrieves context transparently
+
+## Installation
+
+### Prerequisites
+
+- Python 3.7+
+- OpenClaw workspace
+
+### Setup
+
+```bash
+# Navigate to your OpenClaw workspace
+cd ~/.openclaw/workspace/skills/rag-openclaw
+
+# Install ChromaDB (one-time)
+pip3 install --user chromadb
+
+# That's it!
+```
+
+## Quick Start
+
+### 1. Index Your Knowledge
+
+```bash
+# Index all chat history
+python3 ingest_sessions.py
+
+# Index workspace code and docs
+python3 ingest_docs.py workspace
+
+# Index skill documentation
+python3 ingest_docs.py skills
+```
+
+### 2. Search the Knowledge Base
+
+```bash
+# Interactive search mode
+python3 rag_query.py -i
+
+# Quick search
+python3 rag_query.py "how to send SMS via voip.ms"
+
+# Search by type
+python3 rag_query.py "porkbun DNS" --type skill
+python3 rag_query.py "chromedriver" --type workspace
+python3 rag_query.py "Reddit automation" --type session
+```
+
+### 3. Check Statistics
+
+```bash
+# See what's indexed
+python3 rag_manage.py stats
+```
+
+## Usage Examples
+
+### Finding Past Solutions
+
+Hit a problem? Search for how you solved it before:
+
+```bash
+python3 rag_query.py "cloudflare bypass selenium"
+python3 rag_query.py "voip.ms SMS configuration"
+python3 rag_query.py "porkbun update DNS record"
+```
+
+### Searching Through Codebase
+
+Find specific code or documentation:
+
+```bash
+python3 rag_query.py --type workspace "unifi gateway API"
+python3 rag_query.py --type workspace "SMS client"
+```
+
+### Quick Reference
+
+Access skill documentation without digging through files:
+
+```bash
+python3 rag_query.py --type skill "how to monitor UniFi"
+python3 rag_query.py --type skill "Porkbun tool usage"
+```
+
+### Programmatic Use
+
+From within Python scripts or OpenClaw sessions:
+
+```python
+import os
+import sys
+
+# Adjust this path if your OpenClaw workspace lives somewhere else
+sys.path.insert(0, os.path.expanduser('~/.openclaw/workspace/skills/rag-openclaw'))
+from rag_query_wrapper import search_knowledge, format_for_ai
+
+# Search and get structured results
+results = search_knowledge("Reddit account automation")
+print(f"Found {results['count']} relevant items")
+
+# Format for AI consumption
+context = format_for_ai(results)
+print(context)
+```
+
+## Files Reference
+
+| File | Purpose |
+|------|---------|
+| `rag_system.py` | Core RAG class (ChromaDB wrapper) |
+| `ingest_sessions.py` | Index chat history |
+| `ingest_docs.py` | Index workspace files & skills |
+| `rag_query.py` | Search interface (CLI & interactive) |
+| `rag_manage.py` | Document management (stats, delete, reset) |
+| `rag_query_wrapper.py` | Simple Python API for programmatic use |
+| `README.md` | Full documentation |
+
+## How It Works
+
+### Indexing
+
+**Sessions:**
+- Reads `~/.openclaw/agents/main/sessions/*.jsonl`
+- Handles OpenClaw event format (session metadata, messages, tool calls)
+- Chunks messages (20 messages per chunk, with a 5-message overlap between consecutive chunks)
+- Extracts and formats thinking, tool calls, results
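The overlapping chunking described for sessions can be sketched as a sliding window. This is a minimal illustration, not the code in `ingest_sessions.py`: the function name `chunk_messages` and the plain-string message representation are assumptions; only the defaults (20 per chunk, 5 overlapping) come from the list above.

```python
from typing import List, Sequence


def chunk_messages(messages: Sequence[str], chunk_size: int = 20,
                   overlap: int = 5) -> List[List[str]]:
    """Split messages into windows of chunk_size that share `overlap` items.

    Hypothetical helper: the real ingest script may chunk differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(messages), step):
        chunks.append(list(messages[start:start + chunk_size]))
        if start + chunk_size >= len(messages):
            break  # this window already reached the end of the session
    return chunks


# 50 placeholder messages produce windows starting at 0, 15, and 30.
messages = [f"msg-{i}" for i in range(50)]
chunks = chunk_messages(messages)
```

The overlap means a question and its answer are less likely to be split across a chunk boundary, at the cost of storing some messages twice.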
+
+**Workspace:**
+- Scans for `.py`, `.js`, `.ts`, `.md`, `.json`, `.yaml`, `.sh`, `.html`, `.css`
+- Skips files > 1MB and binary files
+- Chunks long documents for better retrieval
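A rough sketch of the workspace filter described above. The suffix list and the 1MB limit come from the text; the names `should_index`, `looks_binary`, and `find_indexable` are hypothetical, and the NUL-byte check is only one possible binary heuristic, not necessarily what `ingest_docs.py` does.

```python
import tempfile
from pathlib import Path

INDEXED_SUFFIXES = {".py", ".js", ".ts", ".md", ".json", ".yaml", ".sh", ".html", ".css"}
MAX_BYTES = 1024 * 1024  # the 1MB limit stated in the text


def looks_binary(path: Path, probe_bytes: int = 1024) -> bool:
    """Assumed heuristic: a NUL byte in the first bytes means binary."""
    with path.open("rb") as handle:
        return b"\x00" in handle.read(probe_bytes)


def should_index(path: Path) -> bool:
    """True when the file passes the suffix, size, and binary checks."""
    if not path.is_file() or path.suffix.lower() not in INDEXED_SUFFIXES:
        return False
    if path.stat().st_size > MAX_BYTES:
        return False
    return not looks_binary(path)


def find_indexable(root: Path) -> list:
    """Recursively collect files under root that should be indexed."""
    return sorted(p for p in root.rglob("*") if should_index(p))


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.md").write_text("# notes")
    (root / "image.png").write_bytes(b"\x89PNG\x00")          # wrong suffix
    (root / "big.json").write_bytes(b"x" * (MAX_BYTES + 1))   # too large
    (root / "blob.py").write_bytes(b"\x00\x01")               # binary content
    found = [p.name for p in find_indexable(root)]
```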
+
+**Skills:**
+- Indexes all `SKILL.md` files
+- Organized by skill name for easy reference
+
+### Search
+
+ChromaDB uses `all-MiniLM-L6-v2` embeddings to convert text to vectors. Texts with similar meanings map to nearby vectors, so a query matches documents by *meaning*, not just by shared *keywords*.
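To illustrate what "nearby vectors" means, here is a toy ranking by cosine similarity. The 3-dimensional vectors are hand-made stand-ins (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and cosine similarity is just one common measure; which distance metric this skill's collection uses is not stated here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float],
         doc_vecs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (document, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented vectors: the first axis loosely stands for "messaging",
# the last for "DNS". They are not output from any real model.
toy_docs = {
    "send SMS via voip.ms": [0.9, 0.1, 0.0],
    "update DNS record": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of "text message API"
ranking = rank(toy_query, toy_docs)
```

The query shares no keywords with the SMS document, yet it ranks first because the vectors point in nearly the same direction.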
+
+### Automatic Integration
+
+When the AI responds, it automatically:
+1. Searches the knowledge base for relevant context
+2. Retrieves past conversations, code, or docs
+3. Includes that context in the response
+
+This happens transparently – the AI "remembers" your past work.
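The retrieve-then-respond flow above might look roughly like the following. This is a sketch under stated assumptions: `build_prompt` and `fake_retrieve` are hypothetical names, the result shape (`source` and `text` keys) is invented for illustration, and the actual hook that OpenClaw uses is not shown in this document.

```python
from typing import Callable, Dict, List


def build_prompt(question: str,
                 retrieve: Callable[[str], List[Dict[str, str]]],
                 max_items: int = 3) -> str:
    """Prepend up to max_items retrieved snippets to the user's question."""
    hits = retrieve(question)[:max_items]
    if not hits:
        return question  # nothing relevant found; pass the question through
    context_lines = [f"[{hit['source']}] {hit['text']}" for hit in hits]
    return ("Relevant context:\n" + "\n".join(context_lines)
            + "\n\nQuestion: " + question)


def fake_retrieve(query: str) -> List[Dict[str, str]]:
    """Stand-in for a real knowledge-base search; returns canned data."""
    return [{"source": "session:demo",
             "text": "Switched to Playwright for Cloudflare challenges."}]


prompt = build_prompt("How did we get past Cloudflare?", fake_retrieve)
```

In a real setup, `fake_retrieve` would be replaced by a call into the knowledge base, for example via the `search_knowledge` wrapper shown earlier.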
+
+## Management
+
+### View Statistics
+
+```bash
+python3 rag_manage.py stats
+```
+
+Output:
+```
+📊 OpenClaw RAG Statistics
+
+Collection: openclaw_knowledge
+Total Documents: 635
+
+By Source:
+  session-001: 23
+  my-script.py: 5
+  porkbun: 12
+
+By Type:
+  session: 500
+  workspace: 100
+  skill: 35
+```
+
+### Delete Documents
+
+```bash
+# Delete all sessions
+python3 rag_manage.py delete --by-type session
+
+# Delete specific file
+python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"
+
+# Reset entire collection
+python3 rag_manage.py reset
+```
+
+### Add Manual Document
+
+```bash
+python3 rag_manage.py add \
+  --text "API endpoint: https://api.example.com/endpoint" \
+  --source "api-docs:example.com" \
+  --type "manual"
+```
+
+## Configuration
+
+### Custom Session Directory
+
+```bash
+python3 ingest_sessions.py --sessions-dir /path/to/sessions
+```
+
+### Chunk Size Control
+
+```bash
+python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
+```
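For readers curious how such flags are commonly wired up, here is an `argparse` sketch. The flag names and the defaults (20, 5, and the sessions path) are taken from this document; the validation rules and the functions `build_parser` and `parse_args` are assumptions, not the contents of `ingest_sessions.py`.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(
        description="Index session transcripts (illustrative).")
    parser.add_argument(
        "--sessions-dir", type=Path,
        default=Path.home() / ".openclaw" / "agents" / "main" / "sessions")
    parser.add_argument("--chunk-size", type=int, default=20)
    parser.add_argument("--chunk-overlap", type=int, default=5)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse flags and reject combinations that cannot make progress."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.chunk_size <= 0:
        parser.error("--chunk-size must be positive")
    if not 0 <= args.chunk_overlap < args.chunk_size:
        parser.error("--chunk-overlap must be smaller than --chunk-size")
    return args


args = parse_args(["--chunk-size", "30", "--chunk-overlap", "10"])
```

Keeping the overlap strictly below the chunk size matters: if the two were equal, each new chunk would start where the previous one did and indexing would never advance.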
+
+### Custom Collection
+
+```python
+from rag_system import RAGSystem
+rag = RAGSystem(collection_name="my_knowledge")
+```
+
+## Data Types
+
+| Type | Source Format | Description |
+|------|--------------|-------------|
+| `session` | `session:{key}` | Chat history transcripts |
+| `workspace` | `relative/path/to/file` | Code, configs, docs |
+| `skill` | `skill:{name}` | Skill documentation |
+| `memory` | `MEMORY.md` | Long-term memory entries |
+| `manual` | `{custom}` | Manually added docs |
+| `api` | `api-docs:{name}` | API documentation |
+
+## Performance
+
+- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
+- **Storage**: ~100MB per 1,000 documents
+- **Indexing**: ~1,000 documents/minute
+- **Search**: <100ms (after first query)
+
+## Troubleshooting
+
+### No Results Found
+
+```bash
+# Check what's indexed
+python3 rag_manage.py stats
+
+# Try broader query
+python3 rag_query.py "SMS"  # instead of "voip.ms SMS API endpoint"
+```
+
+### Slow First Search
+
+The first search loads the embedding model (~1-2 seconds). Subsequent searches typically complete in under 100ms.
+
+### Duplicate ID Errors
+
+```bash
+# Reset and re-index
+python3 rag_manage.py reset
+python3 ingest_sessions.py
+python3 ingest_docs.py workspace
+```
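Duplicate ID errors generally mean the same document ID was added twice. One way to make re-indexing repeatable, assuming the ingest scripts choose their own IDs (this document does not say how they do), is to derive each ID deterministically from the chunk's type, source, and position. The helper `chunk_id` below is hypothetical.

```python
import hashlib


def chunk_id(doc_type: str, source: str, chunk_index: int) -> str:
    """Derive a stable 16-character ID from type, source, and chunk position."""
    key = f"{doc_type}|{source}|{chunk_index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


# The same chunk always maps to the same ID; a different chunk does not.
first = chunk_id("workspace", "scripts/voipms_sms_client.py", 0)
again = chunk_id("workspace", "scripts/voipms_sms_client.py", 0)
other = chunk_id("workspace", "scripts/voipms_sms_client.py", 1)
```

With stable IDs, a ChromaDB collection's `upsert` method could overwrite an existing entry instead of failing on it, which may avoid the need for a full reset. Whether this skill's scripts work that way is not confirmed here.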
+
+### ChromaDB Model Download
+
+The first run downloads the embedding model (79MB), which takes 1-2 minutes. Let the download finish rather than interrupting it.
+
+## Best Practices
+
+### Re-index Regularly
+
+After significant work:
+```bash
+python3 ingest_sessions.py  # New conversations
+python3 ingest_docs.py workspace  # New code/changes
+```
+
+### Use Specific Queries
+
+```bash
+# Better
+python3 rag_query.py "voip.ms getSMS method"
+
+# Too broad
+python3 rag_query.py "SMS"
+```
+
+### Filter by Type
+
+```bash
+# Looking for code
+python3 rag_query.py --type workspace "chromedriver"
+
+# Looking for past conversations
+python3 rag_query.py --type session "Reddit"
+```
+
+### Document Decisions
+
+After important decisions, add them manually:
+
+```bash
+python3 rag_manage.py add \
+  --text "Decision: Use Playwright for Reddit automation. Reason: handles Cloudflare challenges better" \
+  --source "decision:reddit-automation" \
+  --type "decision"
+```
+
+## Limitations
+
+- Files larger than 1MB are skipped automatically (for performance)
+- Python 3.7+ is required
+- Roughly 100MB of disk per 1,000 documents
+- The first search is slower while the embedding model loads
+
+## Integration with OpenClaw
+
+This skill integrates with OpenClaw in four ways:
+
+1. **Automatic RAG**: AI automatically retrieves relevant context when responding
+2. **Session history**: All conversations indexed and searchable
+3. **Workspace awareness**: Code and docs indexed for reference
+4. **Accessible as a skill**: Usable from any OpenClaw session or script
+
+## Example Workflow
+
+**Scenario:** You're working on a new automation but hit a Cloudflare challenge.
+
+```bash
+# Search for past Cloudflare solutions
+python3 rag_query.py "Cloudflare bypass selenium"
+
+# Result shows relevant past conversation:
+# "Used undetected-chromedriver but failed. Switched to Playwright which handles challenges better."
+
+# Now you know the solution before trying it!
+```
+
+## Repository
+
+https://git.theta42.com/nova/openclaw-rag-skill
+
+**Published:** clawhub.com
+**Maintainer:** Nova AI Assistant
+**For:** William Mantly (Theta42)
+
+## License
+
+MIT License - Free to use and modify