Nova AI 3c9cee28d7 v1.0.3: Fix hard-coded paths, address security scan feedback
- Replace all absolute paths with dynamic resolution
- Add path portability and network behavior documentation
- Verify no custom network calls in codebase
- Update version to 1.0.3
2026-02-12 16:59:33 +00:00

OpenClaw RAG Knowledge System

Full-featured Retrieval-Augmented Generation (RAG) system for OpenClaw - search across chat history, code, documentation, and skills with semantic understanding.

Features

  • Semantic Search: Find relevant context by meaning, not just keywords
  • Multi-Source Indexing: Sessions, workspace files, skill documentation
  • Local Vector Store: ChromaDB with built-in embeddings (no API keys required)
  • Automatic Integration: AI automatically consults knowledge base when responding
  • Type Filtering: Search by document type (session, workspace, skill, memory)
  • Management Tools: Add/remove documents, view statistics, reset collection

Quick Start

Installation

# Install Python dependency
cd ~/.openclaw/workspace/rag
python3 -m pip install --user chromadb

No API keys required - This system is fully local:

  • Embeddings: all-MiniLM-L6-v2 (downloaded once, 79MB)
  • Vector store: ChromaDB (persistent disk storage)
  • Data location: ~/.openclaw/data/rag/ (auto-created)

All operations run offline; the only network access is the initial `pip install` of ChromaDB and the one-time embedding model download.
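
The data directory under `~/.openclaw/data/rag/` is resolved relative to the home directory rather than hard-coded. A minimal sketch of that resolution (the actual code in rag_system.py may differ in detail):

```python
from pathlib import Path

# Resolve ~/.openclaw/data/rag portably instead of hard-coding /home/<user>/...
DATA_DIR = Path.home() / ".openclaw" / "data" / "rag"

# ChromaDB persists its collection here; create it on first run
DATA_DIR.mkdir(parents=True, exist_ok=True)
print(DATA_DIR)
```

`Path.home()` works on any user account, which is what makes the skill portable between machines.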

Index Your Data

# Index all chat sessions
python3 ingest_sessions.py

# Index workspace code and docs
python3 ingest_docs.py workspace

# Index skill documentation
python3 ingest_docs.py skills

Search the Knowledge Base

# Interactive search mode
python3 rag_query.py -i

# Quick search
python3 rag_query.py "how to send SMS"

# Search by type
python3 rag_query.py "voip.ms" --type session
python3 rag_query.py "Porkbun DNS" --type skill

Integration in Python Code

import sys
from pathlib import Path
sys.path.insert(0, str(Path.home() / ".openclaw" / "workspace" / "rag"))
from rag_query_wrapper import search_knowledge

# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} results")

# Format for AI consumption
from rag_query_wrapper import format_for_ai
context = format_for_ai(results)
print(context)

Architecture

rag/
├── rag_system.py          # Core RAG class (ChromaDB wrapper)
├── ingest_sessions.py     # Load chat history from sessions
├── ingest_docs.py         # Load workspace files & skill docs
├── rag_query.py           # Search the knowledge base
├── rag_manage.py          # Document management
├── rag_query_wrapper.py   # Simple Python API
└── SKILL.md               # OpenClaw skill documentation

Data storage: ~/.openclaw/data/rag/ (ChromaDB persistent storage)

Usage Examples

Find Past Solutions

When you encounter a problem, search for similar past issues:

python3 rag_query.py "cloudflare bypass failed selenium"
python3 rag_query.py "voip.ms SMS client"
python3 rag_query.py "porkbun DNS API"

Search Through Codebase

Find code and documentation across your entire workspace:

python3 rag_query.py --type workspace "chromedriver setup"
python3 rag_query.py --type workspace "unifi gateway API"

Access Skill Documentation

Quick reference for any openclaw skill:

python3 rag_query.py --type skill "how to check UniFi"
python3 rag_query.py --type skill "Porkbun DNS management"

Manage Knowledge Base

# View statistics
python3 rag_manage.py stats

# Delete all sessions
python3 rag_manage.py delete --by-type session

# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"

How It Works

Document Ingestion

  1. Session transcripts: Process chat history from ~/.openclaw/agents/main/sessions/*.jsonl

    • Handles OpenClaw event format (session metadata, messages, tool calls)
    • Chunks messages into groups of 20 with overlap
    • Extracts and formats thinking, tool calls, and results
  2. Workspace files: Scans workspace for code, docs, configs

    • Supports: .py, .js, .ts, .md, .json, .yaml, .sh, .html, .css
    • Skips files > 1MB and binary files
    • Chunking for long documents
  3. Skills: Indexes all SKILL.md files

    • Captures skill documentation and usage examples
    • Organized by skill name
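
The overlapping chunking described in step 1 (groups of 20 messages, with overlap so context spanning a boundary is kept in both chunks) can be sketched as follows; the function name and the overlap of 5 are illustrative, not the actual ingest_sessions.py implementation:

```python
def chunk_messages(messages, chunk_size=20, overlap=5):
    """Split a message list into overlapping chunks so that context
    crossing a chunk boundary appears in both neighboring chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(messages), step):
        chunks.append(messages[start:start + chunk_size])
        if start + chunk_size >= len(messages):
            break
    return chunks

chunks = chunk_messages(list(range(50)), chunk_size=20, overlap=5)
print(len(chunks))  # 3 chunks: messages 0-19, 15-34, 30-49
```

The last 5 messages of each chunk repeat as the first 5 of the next, so a question-and-answer pair split across a boundary is still retrievable as one unit.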

ChromaDB uses all-MiniLM-L6-v2 embedding model (79MB) to convert text to vector representations. Similar meanings cluster together, enabling semantic search beyond keyword matching.
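
"Cluster together" here means nearby in vector space; closeness between two embeddings is typically measured with cosine similarity. A toy illustration with 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions; the vectors below are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of related / unrelated texts
send_sms = [0.9, 0.1, 0.0]
text_message = [0.8, 0.2, 0.1]
dns_record = [0.0, 0.1, 0.9]

print(cosine_similarity(send_sms, text_message))  # high: related meanings
print(cosine_similarity(send_sms, dns_record))    # low: unrelated meanings
```

This is why a query like "how to send SMS" can retrieve a chunk that never contains the word "SMS" but does discuss text messaging.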

Automatic RAG Integration

When the AI responds to a question that could benefit from context, it automatically:

  1. Searches the knowledge base
  2. Retrieves relevant past conversations, code, or docs
  3. Includes that context in the response

This happens transparently - the AI just "knows" about your past work.
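
The three steps above amount to a retrieve-then-answer loop. A hedged sketch with a stubbed search function (the real flow calls `search_knowledge` from rag_query_wrapper; only the `count` key is documented above, so the `results` list shape and the prompt format here are assumptions):

```python
def stub_search(query):
    """Stand-in for search_knowledge(); returns an assumed result shape."""
    return {
        "count": 1,
        "results": [
            {"source": "session:2026-02-11",
             "text": "Used voip.ms getSMS for inbound messages."},
        ],
    }

def build_prompt(question, search=stub_search):
    """Steps 1-3: search the knowledge base, retrieve hits,
    and prepend them as context to the question."""
    hits = search(question)
    context = "\n".join(f"[{r['source']}] {r['text']}" for r in hits["results"])
    return f"Context from knowledge base:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How did we receive SMS?")
print(prompt)
```

In practice `format_for_ai` plays the role of the context-joining step, producing text the AI can consume directly.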

Configuration

Custom Session Directory

python3 ingest_sessions.py --sessions-dir /path/to/sessions

Chunk Size Control

python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10

Custom Collection Name

from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")

Data Types

Type        Source            Description
session     session:{key}     Chat history transcripts
workspace   relative/path     Code, configs, docs
skill       skill:{name}      Skill documentation
memory      MEMORY.md         Long-term memory entries
manual      {custom}          Manually added docs
api         api-docs:{name}   API documentation

Performance

  • Embedding model: all-MiniLM-L6-v2 (79MB, cached locally)
  • Storage: ~100MB per 1,000 documents
  • Indexing time: ~1,000 docs/min
  • Search time: <100ms (after first query loads embeddings)

Troubleshooting

No Results Found

  • Check if anything is indexed: python3 rag_manage.py stats
  • Try broader queries or different wording
  • Try without filters: remove --type if using it

The first search after ingestion loads embeddings (~1-2 seconds). Subsequent searches are much faster.

Memory Issues

Reset collection if needed:

python3 rag_manage.py reset

Duplicate ID Errors

If you see "Expected IDs to be unique" errors:

  1. Reset the collection
  2. Re-run ingestion
  3. Current versions include chunk_index in ID generation, which prevents the collision
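
Including chunk_index in the document ID is what keeps IDs unique when one source file yields many chunks. A sketch of such an ID scheme (the exact format used by rag_system.py may differ):

```python
import hashlib

def make_doc_id(source, chunk_index):
    """Derive a stable, unique ID per (source, chunk) pair.
    Without chunk_index, every chunk of a file would share one ID."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]
    return f"{digest}-{chunk_index}"

ids = [make_doc_id("scripts/voipms_sms_client.py", i) for i in range(3)]
print(ids)  # same source prefix, distinct chunk suffixes
```

Re-ingesting the same file then overwrites the same IDs instead of raising "Expected IDs to be unique".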

ChromaDB Download Stuck

On first run, ChromaDB downloads the embedding model (~79MB). This takes 1-2 minutes. Let it complete.

Automatic Updates

Setup Scheduled Indexing

The RAG system includes an automatic update script that runs daily:

# Manual test
bash ~/.openclaw/workspace/scripts/rag-auto-update.sh

What it does:

  • Detects new/updated chat sessions and re-indexes them
  • Re-indexes workspace files (captures code changes)
  • Updates skill documentation
  • Maintains state to avoid re-processing unchanged files
  • Runs via cron at 4:00 AM UTC daily

Configuration:

# View cron job
openclaw cron list

# Edit schedule (if needed)
openclaw cron update <job-id> --schedule "{\"expr\":\"0 4 * * *\"}"

State tracking: ~/.openclaw/workspace/memory/rag-auto-state.json
Log file: ~/.openclaw/workspace/memory/rag-auto-update.log
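
The state file is what lets the updater skip unchanged files. One common approach, sketched here with modification times (the real rag-auto-update.sh may track state differently):

```python
import os
import tempfile

def needs_reindex(path, state):
    """True if the file is new or its mtime changed since the last index."""
    return state.get(path) != os.path.getmtime(path)

def record(path, state):
    """Remember the file's current mtime after indexing it."""
    state[path] = os.path.getmtime(path)

# Demo with a temporary file standing in for a workspace file
state = {}
fd, demo = tempfile.mkstemp(suffix=".py")
os.close(fd)

first = needs_reindex(demo, state)   # True: never seen before
record(demo, state)
second = needs_reindex(demo, state)  # False: unchanged since record
print(first, second)
os.unlink(demo)
```

Persisting `state` as JSON between runs gives exactly the rag-auto-state.json behavior described above.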

Best Practices

Automatic Update Enabled

The RAG system now automatically updates daily - no manual re-indexing needed.

After significant work, you can still manually update:

bash ~/.openclaw/workspace/scripts/rag-auto-update.sh

Use Specific Queries

Better results with focused queries:

# Good
python3 rag_query.py "voip.ms getSMS API method"

# Less specific
python3 rag_query.py "API"

Filter by Type

When you know the data type:

# Looking for code
python3 rag_query.py --type workspace "chromedriver"

# Looking for past conversations
python3 rag_query.py --type session "SMS"

Document Decisions

After important decisions, add to knowledge base:

python3 rag_manage.py add \
  --text "Decision: Use Playwright not Selenium for Reddit automation. Reason: Better Cloudflare bypass handles. Date: 2026-02-11" \
  --source "decision:reddit-automation" \
  --type "decision"

Limitations

  • Files > 1MB are automatically skipped (performance)
  • First search is slower (embedding load)
  • Requires ~100MB disk space per 1,000 documents
  • Python 3.7+ required

License

MIT License - Free to use and modify

Contributing

Contributions welcome! Areas for improvement:

  • API documentation indexing from external URLs
  • File system watch for automatic re-indexing
  • Better chunking strategies for long documents
  • Integration with external vector stores (Pinecone, Weaviate)

Documentation Files

  • CHANGELOG.md - Version history and changes
  • SKILL.md - OpenClaw skill integration guide
  • package.json - Skill metadata (no credentials required)
  • LICENSE - MIT License

Author

Nova AI Assistant for William Mantly (Theta42)

Repository

https://git.theta42.com/nova/openclaw-rag-skill

Published on: clawhub.com

Description
RAG Knowledge System for OpenClaw - Semantic search across chat history, code, docs, and skills