# OpenClaw RAG Knowledge System

**Retrieval-Augmented Generation for OpenClaw – Search chat history, code, docs, and skills with semantic understanding**

## Overview

This skill provides a complete RAG (Retrieval-Augmented Generation) system for OpenClaw. It indexes your entire knowledge base – chat transcripts, workspace code, skill documentation – and enables semantic search across everything.

**Key features:**
- 🧠 Semantic search across all conversations and code
- 📚 Automatic knowledge base management
- 🔍 Find past solutions, code patterns, decisions instantly
- 💾 Local ChromaDB storage (no API keys required)
- 🚀 Automatic AI integration – retrieves context transparently

## Installation

### Prerequisites

- Python 3.7+
- OpenClaw workspace

### Setup

```bash
# Navigate to your OpenClaw workspace
cd ~/.openclaw/workspace/skills/rag-openclaw

# Install ChromaDB (one-time)
pip3 install --user chromadb

# That's it!
```

## Quick Start

### 1. Index Your Knowledge

```bash
# Index all chat history
python3 ingest_sessions.py

# Index workspace code and docs
python3 ingest_docs.py workspace

# Index skill documentation
python3 ingest_docs.py skills
```

### 2. Search the Knowledge Base

```bash
# Interactive search mode
python3 rag_query.py -i

# Quick search
python3 rag_query.py "how to send SMS via voip.ms"

# Search by type
python3 rag_query.py "porkbun DNS" --type skill
python3 rag_query.py "chromedriver" --type workspace
python3 rag_query.py "Reddit automation" --type session
```

### 3. Check Statistics

```bash
# See what's indexed
python3 rag_manage.py stats
```

## Usage Examples

### Finding Past Solutions

Hit a problem? Search for how you solved it before:

```bash
python3 rag_query.py "cloudflare bypass selenium"
python3 rag_query.py "voip.ms SMS configuration"
python3 rag_query.py "porkbun update DNS record"
```

### Searching Through Codebase

Find specific code or documentation:

```bash
python3 rag_query.py --type workspace "unifi gateway API"
python3 rag_query.py --type workspace "SMS client"
```

### Quick Reference

Access skill documentation without digging through files:

```bash
python3 rag_query.py --type skill "how to monitor UniFi"
python3 rag_query.py --type skill "Porkbun tool usage"
```

### Programmatic Use

From within Python scripts or OpenClaw sessions:

```python
import os
import sys

# Adjust this path if your workspace lives elsewhere
sys.path.insert(0, os.path.expanduser('~/.openclaw/workspace/skills/rag-openclaw'))
from rag_query_wrapper import search_knowledge, format_for_ai

# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} relevant items")

# Format for AI consumption
context = format_for_ai(results)
print(context)
```

## Files Reference

| File | Purpose |
|------|---------|
| `rag_system.py` | Core RAG class (ChromaDB wrapper) |
| `ingest_sessions.py` | Index chat history |
| `ingest_docs.py` | Index workspace files & skills |
| `rag_query.py` | Search interface (CLI & interactive) |
| `rag_manage.py` | Document management (stats, delete, reset) |
| `rag_query_wrapper.py` | Simple Python API for programmatic use |
| `README.md` | Full documentation |

## How It Works

### Indexing

**Sessions:**
- Reads `~/.openclaw/agents/main/sessions/*.jsonl`
- Handles OpenClaw event format (session metadata, messages, tool calls)
- Chunks messages (20 per chunk, with a 5-message overlap)
- Extracts and formats thinking, tool calls, results
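
The overlapping chunking described above can be sketched as follows. This is a minimal illustration, not the code in `ingest_sessions.py`: the function name `chunk_messages` and its parameters are hypothetical, and only the defaults (20 messages per chunk, 5 shared between neighbours) come from the list above.

```python
from typing import List, Sequence


def chunk_messages(
    messages: Sequence[str], chunk_size: int = 20, overlap: int = 5
) -> List[List[str]]:
    """Split messages into windows of chunk_size that share `overlap` messages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(messages), step):
        chunks.append(list(messages[start:start + chunk_size]))
        if start + chunk_size >= len(messages):
            break  # this window already reached the end of the session
    return chunks


messages = [f"message {i}" for i in range(45)]
chunks = chunk_messages(messages)
print([len(chunk) for chunk in chunks])  # [20, 20, 15]
```

The overlap means a question and its answer are less likely to be split across two chunks with no shared context.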

**Workspace:**
- Scans for `.py`, `.js`, `.ts`, `.md`, `.json`, `.yaml`, `.sh`, `.html`, `.css`
- Skips files > 1MB and binary files
- Chunks long documents for better retrieval
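
A rough sketch of the file selection rules above, under stated assumptions: the names `iter_indexable_files` and `looks_binary` are hypothetical, and the NUL-byte check is one common heuristic for spotting binary files, not necessarily the one `ingest_docs.py` uses.

```python
from pathlib import Path
from typing import Iterator

INDEXED_EXTENSIONS = {".py", ".js", ".ts", ".md", ".json", ".yaml", ".sh", ".html", ".css"}
MAX_FILE_BYTES = 1024 * 1024  # the 1MB limit from the list above


def looks_binary(path: Path, sample_size: int = 1024) -> bool:
    """Assumed heuristic: a NUL byte near the start marks a binary file."""
    with path.open("rb") as handle:
        return b"\x00" in handle.read(sample_size)


def iter_indexable_files(root: Path) -> Iterator[Path]:
    """Yield files under root that pass the extension, size, and binary checks."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() not in INDEXED_EXTENSIONS:
            continue
        if path.stat().st_size > MAX_FILE_BYTES:
            continue
        if looks_binary(path):
            continue
        yield path
```

Calling `list(iter_indexable_files(Path.home() / ".openclaw" / "workspace"))` would then give the candidate files to chunk and index.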

**Skills:**
- Indexes all `SKILL.md` files
- Organized by skill name for easy reference

### Search

ChromaDB embeds text with the `all-MiniLM-L6-v2` model, converting each chunk into a vector. Texts with similar meanings land close together in vector space, so a search matches by *meaning*, not just *keywords*.
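
To make the idea concrete, here is a toy nearest-neighbour ranking. Everything in it is illustrative: the 3-dimensional vectors are hand-made stand-ins for the 384-dimensional embeddings the model produces, the helper names are hypothetical, and cosine similarity is used only as an example, since the distance metric the collection is configured with is not stated here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(
    query_vec: Sequence[float], doc_vecs: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Return the top_k documents most similar to the query vector."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hand-made vectors: the two messaging texts point in a similar direction.
docs = {
    "send a text message": [0.9, 0.1, 0.0],
    "update a DNS record": [0.0, 0.2, 0.9],
    "deliver an SMS": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # pretend embedding of "how to send SMS"

for name, score in rank(query, docs):
    print(f"{score:.3f}  {name}")
```

Both messaging texts outrank the DNS text even though "deliver an SMS" shares no keyword with "send a text message", which is the behaviour keyword search cannot give you.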

### Automatic Integration

When the AI responds, it automatically:
1. Searches the knowledge base for relevant context
2. Retrieves past conversations, code, or docs
3. Includes that context in the response

This happens transparently – the AI "remembers" your past work.
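
The three steps above amount to a retrieve-then-prompt loop. The sketch below shows that shape only; it is not OpenClaw's actual integration code. `build_augmented_prompt`, the `Retriever` type, and the `source`/`text` keys are all assumptions made for illustration, and `fake_retrieve` stands in for a real search call.

```python
from typing import Callable, Dict, List

# Hypothetical retriever signature: query in, list of {"source", "text"} dicts out.
Retriever = Callable[[str], List[Dict[str, str]]]


def build_augmented_prompt(question: str, retrieve: Retriever, max_items: int = 3) -> str:
    """Search for context, keep the top hits, and prepend them to the question."""
    hits = retrieve(question)[:max_items]
    if not hits:
        return question  # nothing relevant found: pass the question through unchanged
    context = "\n".join(f"[{hit['source']}] {hit['text']}" for hit in hits)
    return f"Relevant context:\n{context}\n\nQuestion: {question}"


def fake_retrieve(query: str) -> List[Dict[str, str]]:
    """Stand-in for a real knowledge base search."""
    return [
        {
            "source": "session:example",
            "text": "Switched to Playwright, which handles challenges better.",
        }
    ]


prompt = build_augmented_prompt("How did we get past Cloudflare?", fake_retrieve)
print(prompt)
```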

## Management

### View Statistics

```bash
python3 rag_manage.py stats
```

Output:
```
📊 OpenClaw RAG Statistics

Collection: openclaw_knowledge
Total Documents: 635

By Source:
  session-001: 23
  my-script.py: 5
  porkbun: 12

By Type:
  session: 500
  workspace: 100
  skill: 35
```
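
A report like the one above can be derived by counting document metadata. The following is a sketch under assumptions: the metadata keys `type` and `source` are inferred from the CLI flags, and `summarize` is a hypothetical helper rather than the code in `rag_manage.py`.

```python
from collections import Counter
from typing import Dict, List


def summarize(metadatas: List[Dict[str, str]]) -> Dict[str, object]:
    """Count documents overall, per type, and per source."""
    by_type = Counter(meta.get("type", "unknown") for meta in metadatas)
    by_source = Counter(meta.get("source", "unknown") for meta in metadatas)
    return {
        "total": len(metadatas),
        "by_type": dict(by_type),
        "by_source": dict(by_source),
    }


# Made-up metadata records for illustration only.
metadatas = [
    {"type": "session", "source": "session-001"},
    {"type": "session", "source": "session-001"},
    {"type": "session", "source": "session-002"},
    {"type": "workspace", "source": "my-script.py"},
    {"type": "workspace", "source": "my-script.py"},
    {"type": "skill", "source": "porkbun"},
]

summary = summarize(metadatas)
print(summary["total"], summary["by_type"])
```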

### Delete Documents

```bash
# Delete all sessions
python3 rag_manage.py delete --by-type session

# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"

# Reset entire collection
python3 rag_manage.py reset
```

### Add Manual Document

```bash
python3 rag_manage.py add \
  --text "API endpoint: https://api.example.com/endpoint" \
  --source "api-docs:example.com" \
  --type "manual"
```

## Configuration

### Custom Session Directory

```bash
python3 ingest_sessions.py --sessions-dir /path/to/sessions
```

### Chunk Size Control

```bash
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
```

### Custom Collection

```python
from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")
```

## Data Types

| Type | Source Format | Description |
|------|--------------|-------------|
| `session` | `session:{key}` | Chat history transcripts |
| `workspace` | `relative/path/to/file` | Code, configs, docs |
| `skill` | `skill:{name}` | Skill documentation |
| `memory` | `MEMORY.md` | Long-term memory entries |
| `manual` | `{custom}` | Manually added docs |
| `api` | `api-docs:{name}` | API documentation |

## Performance

- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
- **Storage**: ~100MB per 1,000 documents
- **Indexing**: ~1,000 documents/minute
- **Search**: <100ms (after first query)
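
For planning purposes, the rough figures above translate into a simple linear estimate. The helper below is hypothetical and assumes the ratios hold at any scale, which real workloads may not; treat the output as an order-of-magnitude guide.

```python
from typing import Dict

STORAGE_MB_PER_1000_DOCS = 100  # "~100MB per 1,000 documents"
DOCS_PER_MINUTE = 1000          # "~1,000 documents/minute"


def estimate_index_cost(num_documents: int) -> Dict[str, float]:
    """Linear estimate of disk usage and indexing time from the figures above."""
    storage_mb = num_documents * STORAGE_MB_PER_1000_DOCS / 1000
    indexing_seconds = num_documents * 60 / DOCS_PER_MINUTE
    return {"storage_mb": storage_mb, "indexing_seconds": indexing_seconds}


# 635 documents, the total shown in the sample statistics output.
estimate = estimate_index_cost(635)
print(f"{estimate['storage_mb']:.1f} MB, {estimate['indexing_seconds']:.1f} s")
# 63.5 MB, 38.1 s
```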

## Troubleshooting

### No Results Found

```bash
# Check what's indexed
python3 rag_manage.py stats

# Try broader query
python3 rag_query.py "SMS"  # instead of "voip.ms SMS API endpoint"
```

### Slow First Search

The first search loads the embedding model, which takes about 1-2 seconds. Subsequent searches typically finish in under 100ms.

### Duplicate ID Errors

```bash
# Reset and re-index
python3 rag_manage.py reset
python3 ingest_sessions.py
python3 ingest_docs.py workspace
```

### ChromaDB Model Download

The first run downloads the embedding model (79MB), which takes 1-2 minutes. Let the download finish.

## Best Practices

### Re-index Regularly

After significant work:
```bash
python3 ingest_sessions.py  # New conversations
python3 ingest_docs.py workspace  # New code/changes
```

### Use Specific Queries

```bash
# Better
python3 rag_query.py "voip.ms getSMS method"

# Too broad
python3 rag_query.py "SMS"
```

### Filter by Type

```bash
# Looking for code
python3 rag_query.py --type workspace "chromedriver"

# Looking for past conversations
python3 rag_query.py --type session "Reddit"
```

### Document Decisions

After important decisions, add them manually:

```bash
python3 rag_manage.py add \
  --text "Decision: Use Playwright for Reddit automation. Reason: handles Cloudflare challenges better" \
  --source "decision:reddit-automation" \
  --type "decision"
```

## Limitations

- Files > 1MB are skipped automatically (for performance)
- Python 3.7+ is required
- Uses ~100MB of disk per 1,000 documents
- The first search is slower (embedding model load)

## Integration with OpenClaw

This skill integrates with OpenClaw in four ways:

1. **Automatic RAG**: The AI retrieves relevant context automatically when responding
2. **Session history**: All conversations are indexed and searchable
3. **Workspace awareness**: Code and docs are indexed for reference
4. **Accessible as a skill**: Use it from any OpenClaw session or script

## Example Workflow

**Scenario:** You're working on a new automation but hit a Cloudflare challenge.

```bash
# Search for past Cloudflare solutions
python3 rag_query.py "Cloudflare bypass selenium"

# Result shows relevant past conversation:
# "Used undetected-chromedriver but failed. Switched to Playwright which handles challenges better."

# Now you know the solution before trying it!
```

## Repository

https://git.theta42.com/nova/openclaw-rag-skill

- **Published:** clawhub.com
- **Maintainer:** Nova AI Assistant
- **For:** William Mantly (Theta42)

## License

MIT License - Free to use and modify