Nova AI b272748209 Initial commit: OpenClaw RAG Knowledge System

- Full RAG system for OpenClaw agents
- Semantic search across chat history, code, docs, skills
- ChromaDB integration (all-MiniLM-L6-v2 embeddings)
- Automatic AI context retrieval
- Ingest pipelines for sessions, workspace, skills
- Python API and CLI interfaces
- Document management (add, delete, stats, reset)

2026-02-11 03:47:38 +00:00


OpenClaw RAG Knowledge System

Retrieval-Augmented Generation for OpenClaw – Search chat history, code, docs, and skills with semantic understanding

Overview

This skill provides a complete RAG (Retrieval-Augmented Generation) system for OpenClaw. It indexes your entire knowledge base – chat transcripts, workspace code, skill documentation – and enables semantic search across everything.

Key features:

🧠 Semantic search across all conversations and code
📚 Automatic knowledge base management
🔍 Find past solutions, code patterns, decisions instantly
💾 Local ChromaDB storage (no API keys required)
🚀 Automatic AI integration – retrieves context transparently

Installation

Prerequisites

Python 3.7+
OpenClaw workspace

Setup

# Navigate to the skill directory inside your OpenClaw workspace
cd ~/.openclaw/workspace/skills/rag-openclaw

# Install ChromaDB (one-time)
pip3 install --user chromadb

# That's it!

Quick Start

1. Index Your Knowledge

# Index all chat history
python3 ingest_sessions.py

# Index workspace code and docs
python3 ingest_docs.py workspace

# Index skill documentation
python3 ingest_docs.py skills

2. Search the Knowledge Base

# Interactive search mode
python3 rag_query.py -i

# Quick search
python3 rag_query.py "how to send SMS via voip.ms"

# Search by type
python3 rag_query.py "porkbun DNS" --type skill
python3 rag_query.py "chromedriver" --type workspace
python3 rag_query.py "Reddit automation" --type session

3. Check Statistics

# See what's indexed
python3 rag_manage.py stats

Usage Examples

Finding Past Solutions

Hit a problem? Search for how you solved it before:

python3 rag_query.py "cloudflare bypass selenium"
python3 rag_query.py "voip.ms SMS configuration"
python3 rag_query.py "porkbun update DNS record"

Searching Through Codebase

Find specific code or documentation:

python3 rag_query.py --type workspace "unifi gateway API"
python3 rag_query.py --type workspace "SMS client"

Quick Reference

Access skill documentation without digging through files:

python3 rag_query.py --type skill "how to monitor UniFi"
python3 rag_query.py --type skill "Porkbun tool usage"

Programmatic Use

From within Python scripts or OpenClaw sessions:

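import os
import sys

# Add the skill directory to the import path (adjust if your workspace lives elsewhere)
skill_dir = os.path.expanduser('~/.openclaw/workspace/skills/rag-openclaw')
sys.path.insert(0, skill_dir)
from rag_query_wrapper import search_knowledge, format_for_ai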

# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} relevant items")

# Format for AI consumption
context = format_for_ai(results)
print(context)

Files Reference

File	Purpose
`rag_system.py`	Core RAG class (ChromaDB wrapper)
`ingest_sessions.py`	Index chat history
`ingest_docs.py`	Index workspace files & skills
`rag_query.py`	Search interface (CLI & interactive)
`rag_manage.py`	Document management (stats, delete, reset)
`rag_query_wrapper.py`	Simple Python API for programmatic use
`README.md`	Full documentation

How It Works

Indexing

Sessions:

Reads ~/.openclaw/agents/main/sessions/*.jsonl
Handles OpenClaw event format (session metadata, messages, tool calls)
Chunks messages (20 per chunk, with a 5-message overlap between consecutive chunks)
Extracts and formats thinking, tool calls, results
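
The chunking step above can be pictured as a sliding window. The following is a minimal sketch, not the code in `ingest_sessions.py`: the helper name `chunk_messages` is hypothetical, and its defaults simply mirror the documented 20-message chunks with a 5-message overlap.

```python
from typing import List, Sequence


def chunk_messages(
    messages: Sequence[str], chunk_size: int = 20, overlap: int = 5
) -> List[List[str]]:
    """Split messages into overlapping windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(messages), step):
        chunks.append(list(messages[start:start + chunk_size]))
        if start + chunk_size >= len(messages):
            break  # this window already reached the end of the session
    return chunks


messages = [f"msg-{i}" for i in range(50)]
chunks = chunk_messages(messages)
for chunk in chunks:
    print(chunk[0], "to", chunk[-1], f"({len(chunk)} messages)")
```

Because each window starts 15 messages after the previous one, the last 5 messages of a chunk reappear at the start of the next, so a conversation that straddles a chunk boundary stays retrievable from either side.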

Workspace:

Scans for .py, .js, .ts, .md, .json, .yaml, .sh, .html, .css
Skips files > 1MB and binary files
Chunks long documents for better retrieval
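
The file-selection rules above could look roughly like this. It is a sketch under stated assumptions rather than the logic in `ingest_docs.py`: the function names are hypothetical, "1MB" is taken to mean 1,048,576 bytes, and the binary check is a simple NUL-byte heuristic.

```python
from pathlib import Path
from typing import Iterator

# Extensions taken from the list above.
INDEXED_EXTENSIONS = {".py", ".js", ".ts", ".md", ".json", ".yaml", ".sh", ".html", ".css"}
# Assumption: "1MB" means 1024 * 1024 bytes.
MAX_FILE_BYTES = 1024 * 1024


def looks_binary(path: Path, sample_size: int = 1024) -> bool:
    """Heuristic (an assumption): a NUL byte in the first bytes means binary."""
    with path.open("rb") as handle:
        return b"\x00" in handle.read(sample_size)


def iter_indexable_files(root: Path) -> Iterator[Path]:
    """Yield files under root that pass the extension, size, and binary checks."""
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in INDEXED_EXTENSIONS:
            continue
        if path.stat().st_size > MAX_FILE_BYTES:
            continue  # too large to index
        if looks_binary(path):
            continue  # right extension, but not text
        yield path
```

Calling `list(iter_indexable_files(Path.home() / ".openclaw" / "workspace"))` would then give the candidate files in a stable, sorted order.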

Skills:

Indexes all SKILL.md files
Organized by skill name for easy reference
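
As an illustration of organizing skills by name, the sketch below maps each `SKILL.md` to a `skill:{name}` source key (the format listed under Data Types). The helper `find_skill_docs` is hypothetical, and it assumes the skill name is the name of the directory that contains `SKILL.md`.

```python
from pathlib import Path
from typing import Dict


def find_skill_docs(skills_root: Path) -> Dict[str, Path]:
    """Map 'skill:{name}' source keys to SKILL.md paths (hypothetical helper).

    Assumption: the skill name is the directory that holds SKILL.md.
    """
    docs: Dict[str, Path] = {}
    for skill_md in sorted(skills_root.rglob("SKILL.md")):
        docs[f"skill:{skill_md.parent.name}"] = skill_md
    return docs
```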

Search

ChromaDB uses all-MiniLM-L6-v2 embeddings to convert text to vectors. Texts with similar meanings map to nearby vectors, so search matches on meaning rather than only on keywords.
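
To make the "nearby vectors" idea concrete, here is a toy ranking by cosine similarity. The three-dimensional vectors are made up for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and cosine similarity is only one common measure; which distance metric the collection actually uses depends on how it was configured.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only; real embeddings come from the model.
toy_embeddings = {
    "send a text message": [0.9, 0.1, 0.0],
    "SMS delivery via API": [0.8, 0.2, 0.1],
    "update a DNS record": [0.1, 0.9, 0.2],
}

query_text = "send a text message"
query_vector = toy_embeddings[query_text]
ranked = sorted(
    (text for text in toy_embeddings if text != query_text),
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
    reverse=True,
)
print(ranked)
```

The SMS entry ranks first even though it shares no keywords with the query, which is the behavior semantic search relies on.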

Automatic Integration

When the AI responds, it automatically:

Searches the knowledge base for relevant context
Retrieves past conversations, code, or docs
Includes that context in the response

This happens transparently – the AI "remembers" your past work.
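
The search, retrieve, and include steps can be sketched as a small prompt-building function. This is an illustration of the pattern only: `build_prompt` and `fake_retrieve` are hypothetical names, the prompt layout is an assumption, and the stub retriever stands in for a real knowledge-base search.

```python
from typing import Callable, List


def build_prompt(
    question: str, retrieve: Callable[[str], List[str]], max_snippets: int = 3
) -> str:
    """Prepend retrieved snippets to the question (hypothetical helper)."""
    snippets = retrieve(question)[:max_snippets]
    if not snippets:
        return question  # nothing relevant found; pass the question through
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    return f"Relevant context:\n{context}\n\nQuestion: {question}"


def fake_retrieve(query: str) -> List[str]:
    """Stand-in for a real knowledge-base search."""
    if "cloudflare" in query.lower():
        return ["Switched to Playwright, which handles challenges better."]
    return []


print(build_prompt("How did we get past Cloudflare?", fake_retrieve))
```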

Management

View Statistics

python3 rag_manage.py stats

Output:

📊 OpenClaw RAG Statistics

Collection: openclaw_knowledge
Total Documents: 635

By Source:
  session-001: 23
  my-script.py: 5
  porkbun: 12

By Type:
  session: 500
  workspace: 100
  skill: 35
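
Counts like these can be derived from per-document metadata. The sketch below assumes each document carries a metadata dict with `source` and `type` keys; that layout and the `summarize` helper are assumptions for illustration, not the confirmed internals of `rag_manage.py`.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize(metadatas: Iterable[Dict[str, str]]) -> Tuple[Counter, Counter]:
    """Count documents per source and per type (hypothetical helper)."""
    by_source: Counter = Counter()
    by_type: Counter = Counter()
    for meta in metadatas:
        by_source[meta.get("source", "unknown")] += 1
        by_type[meta.get("type", "unknown")] += 1
    return by_source, by_type


sample = [
    {"source": "session-001", "type": "session"},
    {"source": "session-001", "type": "session"},
    {"source": "my-script.py", "type": "workspace"},
    {"source": "porkbun", "type": "skill"},
]
by_source, by_type = summarize(sample)
print("Total Documents:", sum(by_type.values()))
for doc_type, count in by_type.most_common():
    print(f"  {doc_type}: {count}")
```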

Delete Documents

# Delete all sessions
python3 rag_manage.py delete --by-type session

# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"

# Reset entire collection
python3 rag_manage.py reset

Add Manual Document

python3 rag_manage.py add \
  --text "API endpoint: https://api.example.com/endpoint" \
  --source "api-docs:example.com" \
  --type "manual"

Configuration

Custom Session Directory

python3 ingest_sessions.py --sessions-dir /path/to/sessions

Chunk Size Control

python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10

Custom Collection

from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")

Data Types

Type	Source Format	Description
`session`	`session:{key}`	Chat history transcripts
`workspace`	`relative/path/to/file`	Code, configs, docs
`skill`	`skill:{name}`	Skill documentation
`memory`	`MEMORY.md`	Long-term memory entries
`manual`	`{custom}`	Manually added docs
`api`	`api-docs:{name}`	API documentation

Performance

Embedding model: all-MiniLM-L6-v2 (79MB, cached locally)
Storage: ~100MB per 1,000 documents
Indexing: ~1,000 documents/minute
Search: <100ms (after first query)
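
For planning purposes, the figures above translate into a quick estimate. This sketch simply scales the documented rates linearly, which is an assumption; real usage will vary with document size and hardware.

```python
from typing import Tuple


def estimate_index_cost(
    num_documents: int,
    mb_per_1000_docs: float = 100.0,  # storage figure from the list above
    docs_per_minute: float = 1000.0,  # indexing figure from the list above
) -> Tuple[float, float]:
    """Return (storage in MB, indexing time in minutes), scaled linearly."""
    storage_mb = num_documents * mb_per_1000_docs / 1000
    minutes = num_documents / docs_per_minute
    return storage_mb, minutes


storage_mb, minutes = estimate_index_cost(5000)
print(f"~{storage_mb:.0f} MB on disk, ~{minutes:.0f} min to index")
```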

Troubleshooting

No Results Found

# Check what's indexed
python3 rag_manage.py stats

# Try broader query
python3 rag_query.py "SMS"  # instead of "voip.ms SMS API endpoint"

Slow First Search

The first search loads the embedding model (~1-2 seconds). Subsequent searches typically complete in under 100ms.

Duplicate ID Errors

# Reset and re-index everything
python3 rag_manage.py reset
python3 ingest_sessions.py
python3 ingest_docs.py workspace
python3 ingest_docs.py skills
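
One plausible way such collisions arise is deterministic IDs: if a chunk's ID is derived from its source and position, ingesting the same file twice produces the same IDs. The sketch below illustrates that idea and a way to filter out already-stored IDs before adding. Both helpers are hypothetical, and the ID scheme is an assumption, not the scheme this skill is confirmed to use.

```python
import hashlib
from typing import Iterable, List, Set


def make_chunk_id(source: str, chunk_index: int) -> str:
    """Deterministic ID: the same source and index always give the same ID."""
    digest = hashlib.sha256(f"{source}#{chunk_index}".encode("utf-8")).hexdigest()
    return digest[:16]


def new_ids_only(candidate_ids: Iterable[str], existing_ids: Set[str]) -> List[str]:
    """Drop IDs already stored, plus repeats within the batch itself."""
    seen = set(existing_ids)
    fresh: List[str] = []
    for chunk_id in candidate_ids:
        if chunk_id not in seen:
            seen.add(chunk_id)
            fresh.append(chunk_id)
    return fresh


stored = {make_chunk_id("session:abc", i) for i in range(3)}
batch = [make_chunk_id("session:abc", i) for i in range(5)]
print(len(new_ids_only(batch, stored)), "new chunks to add")
```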

ChromaDB Model Download

First run downloads embedding model (79MB). Takes 1-2 minutes. Let it complete.

Best Practices

Re-index Regularly

After significant work:

python3 ingest_sessions.py  # New conversations
python3 ingest_docs.py workspace  # New code/changes

Use Specific Queries

# Better
python3 rag_query.py "voip.ms getSMS method"

# Too broad
python3 rag_query.py "SMS"

Filter by Type

# Looking for code
python3 rag_query.py --type workspace "chromedriver"

# Looking for past conversations
python3 rag_query.py --type session "Reddit"

Document Decisions

After important decisions, add them manually:

python3 rag_manage.py add \
  --text "Decision: Use Playwright for Reddit automation. Reason: handles Cloudflare challenges better" \
  --source "decision:reddit-automation" \
  --type "decision"

Limitations

Files > 1MB automatically skipped (performance)
Python 3.7+ required
~100MB disk per 1,000 documents
First search slower (embedding load)

Integration with OpenClaw

This skill integrates seamlessly with OpenClaw:

Automatic RAG: AI automatically retrieves relevant context when responding
Session history: All conversations indexed and searchable
Workspace awareness: Code and docs indexed for reference
Accessible as a skill: Use it from any OpenClaw session or script

Example Workflow

Scenario: You're working on a new automation but hit a Cloudflare challenge.

# Search for past Cloudflare solutions
python3 rag_query.py "Cloudflare bypass selenium"

# Result shows relevant past conversation:
# "Used undetected-chromedriver but failed. Switched to Playwright which handles challenges better."

# Now you know the solution before trying it!

Repository

https://git.theta42.com/nova/openclaw-rag-skill

Published: clawhub.com
Maintainer: Nova AI Assistant
For: William Mantly (Theta42)

License

MIT License - Free to use and modify
