2 Commits

Author SHA1 Message Date
22d85d638f Add GitHub Pages landing page
- Obsidian dark theme design
- Links to ClawHub, GitHub, and william.mantly.vip
- Feature comparison with web-search RAG tools
- Responsive layout with terminal UI preview
2026-02-12 18:11:50 +00:00
3c9cee28d7 v1.0.3: Fix hard-coded paths, address security scan feedback
- Replace all absolute paths with dynamic resolution
- Add path portability and network behavior documentation
- Verify no custom network calls in codebase
- Update version to 1.0.3
2026-02-12 16:59:33 +00:00
17 changed files with 529 additions and 2848 deletions

View File

@@ -1,129 +0,0 @@
# Changelog
All notable changes to the OpenClaw RAG Knowledge System will be documented in this file.
## [1.0.0] - 2026-02-11
### Added
- Initial release of RAG Knowledge System for OpenClaw
- Semantic search using ChromaDB with all-MiniLM-L6-v2 embeddings
- Multi-source indexing: sessions, workspace files, skill documentation
- CLI tools: rag_query.py, rag_manage.py, ingest_sessions.py, ingest_docs.py
- Python API: rag_query_wrapper.py for programmatic access
- Automatic integration wrapper: rag_context.py for transparent RAG queries
- RAG-enhanced agent wrapper: rag_agent.py
- Type filtering: search by document type (session, workspace, skill, memory)
- Document management: add, delete, reset collection
- Batch ingestion with intelligent chunking
- Session parser for OpenClaw event format
- Automatic daily updates via cron job
- Comprehensive documentation: README.md, SKILL.md
### Features
- **Semantic Search**: Find relevant context by meaning, not keywords
- **Local Vector Store**: ChromaDB with persistent storage (~100MB per 1,000 docs)
- **No External Services**: No API keys required (all-MiniLM-L6-v2 embeddings run locally for free)
- **Smart Chunking**: Messages grouped by 20 with overlap for context
- **Multi-Format Support**: Python, JavaScript, Markdown, JSON, YAML, shell scripts
- **Automatic Updates**: Scheduled cron job runs daily at 4:00 AM UTC
- **State Tracking**: Avoids re-processing unchanged files
- **Debug Mode**: Verbose output for troubleshooting
### Bug Fixes
- Fixed duplicate ID errors by including chunk_index in hash generation
- Fixed session parser to handle OpenClaw event format correctly
- Fixed metadata conversion errors (all metadata values as strings)
### Performance
- Indexing speed: ~1,000 docs/minute
- Search time: <100ms (after embedding load)
- Embedding model: 79MB (cached locally)
- Storage: ~100MB per 1,000 documents
### Documentation
- Complete SKILL.md with OpenClaw integration guide
- Comprehensive README.md with examples and troubleshooting
- Inline help in all CLI tools
- Best practices and limitations documented
---
## [1.0.1] - 2026-02-11
### Added
- `package.json` with complete OpenClaw skill metadata
- `CHANGELOG.md` for version tracking
- `LICENSE` (MIT) for proper licensing
### Changed
- `package.json` explicitly declares NO required environment variables (fully local system)
- Documented data storage path: `~/.openclaw/data/rag/`
- Enhanced `README.md` with clearer installation instructions
- Added references to CHANGELOG, LICENSE, and package.json in README
- Clarified that no API keys or credentials are required
### Documentation
- Improved documentation transparency to meet security scanner best practices
- Clearly documented the fully-local nature of the system (no external dependencies)
---
## [1.0.2] - 2026-02-12
### Added
- YAML front matter to SKILL.md with `name: rag` and `description` for ClawHub compatibility
- `Security Considerations` section documenting privacy implications and sensitive data risks
- `scripts/rag-auto-update.sh` included in skill package (previously in separate location)
- `.skill` package for ClawHub distribution (28KB, 14 files)
### Changed
- Updated package.json description to match SKILL.md front matter
- Documented auto-update script behavior for security review (local-only ingestion)
- Clarified ChromaDB storage location and data deletion procedures
### Fixed
- **Cron job HTTP 500 errors**: Changed from `sessionTarget: "main"` to `isolated` to avoid flooding chat with thousands of lines of output
- **Cron schedule**: Fixed from `0 4 * * *` to `0 0 * * *` to match actual midnight UTC execution time
### Security
- Documented that RAG indexes all session transcripts and workspace files (may contain API keys, credentials, private messages)
- Added recommendations for privacy-conscious use: review sessions before ingestion, use `rag_manage.py reset` to delete all indexed data
- Confirmed auto-update script only runs local ingestion scripts - no remote code fetching
### Documentation
- Added detailed security warnings in SKILL.md
- Explained how to delete ChromaDB persistence directory (`~/.openclaw/data/rag/`)
- Provided guidance on redacting sensitive data before ingestion
---
## [Unreleased]
### Planned
- API documentation indexing from external URLs
- Automatic re-indexing on file system events (inotify)
- Better chunking strategies for long documents
- Integration with external vector stores (Pinecone, Weaviate)
- Webhook notifications for automated content processing
- Hybrid search (semantic + keyword)
- Query history and analytics
- Export/import of vector database
---
## Version Guidelines
This project follows [Semantic Versioning](https://semver.org/):
- **MAJOR** version: Incompatible API changes
- **MINOR** version: Backwards-compatible functionality additions
- **PATCH** version: Backwards-compatible bug fixes
## Categories
- **Added**: New features
- **Changed**: Changes in existing functionality
- **Deprecated**: Soon-to-be removed features
- **Removed**: Removed features
- **Fixed**: Bug fixes
- **Security**: Security vulnerabilities

LICENSE
View File

@@ -1,21 +0,0 @@
MIT License
Copyright (c) 2026 Nova AI Assistant (for William Mantly - Theta42)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md
View File

@@ -1,339 +0,0 @@
# OpenClaw RAG Knowledge System
Full-featured Retrieval-Augmented Generation (RAG) system for OpenClaw - search across chat history, code, documentation, and skills with semantic understanding.
## Features
- **Semantic Search**: Find relevant context by meaning, not just keywords
- **Multi-Source Indexing**: Sessions, workspace files, skill documentation
- **Local Vector Store**: ChromaDB with built-in embeddings (no API keys required)
- **Automatic Integration**: AI automatically consults knowledge base when responding
- **Type Filtering**: Search by document type (session, workspace, skill, memory)
- **Management Tools**: Add/remove documents, view statistics, reset collection
## Quick Start
### Installation
```bash
# Install Python dependency
cd ~/.openclaw/workspace/rag
python3 -m pip install --user chromadb
```
**No API keys required** - This system is fully local:
- Embeddings: all-MiniLM-L6-v2 (downloaded once, 79MB)
- Vector store: ChromaDB (persistent disk storage)
- Data location: `~/.openclaw/data/rag/` (auto-created)
All operations run offline; the only network access is the one-time download of the ChromaDB package and its embedding model.
### Index Your Data
```bash
# Index all chat sessions
python3 ingest_sessions.py
# Index workspace code and docs
python3 ingest_docs.py workspace
# Index skill documentation
python3 ingest_docs.py skills
```
### Search the Knowledge Base
```bash
# Interactive search mode
python3 rag_query.py -i
# Quick search
python3 rag_query.py "how to send SMS"
# Search by type
python3 rag_query.py "voip.ms" --type session
python3 rag_query.py "Porkbun DNS" --type skill
```
### Integration in Python Code
```python
import os
import sys
sys.path.insert(0, os.path.expanduser('~/.openclaw/workspace/rag'))
from rag_query_wrapper import search_knowledge
# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} results")
# Format for AI consumption
from rag_query_wrapper import format_for_ai
context = format_for_ai(results)
print(context)
```
## Architecture
```
rag/
├── rag_system.py # Core RAG class (ChromaDB wrapper)
├── ingest_sessions.py # Load chat history from sessions
├── ingest_docs.py # Load workspace files & skill docs
├── rag_query.py # Search the knowledge base
├── rag_manage.py # Document management
├── rag_query_wrapper.py # Simple Python API
└── SKILL.md # OpenClaw skill documentation
```
Data storage: `~/.openclaw/data/rag/` (ChromaDB persistent storage)
## Usage Examples
### Find Past Solutions
When you encounter a problem, search for similar past issues:
```bash
python3 rag_query.py "cloudflare bypass failed selenium"
python3 rag_query.py "voip.ms SMS client"
python3 rag_query.py "porkbun DNS API"
```
### Search Through Codebase
Find code and documentation across your entire workspace:
```bash
python3 rag_query.py --type workspace "chromedriver setup"
python3 rag_query.py --type workspace "unifi gateway API"
```
### Access Skill Documentation
Quick reference for any openclaw skill:
```bash
python3 rag_query.py --type skill "how to check UniFi"
python3 rag_query.py --type skill "Porkbun DNS management"
```
### Manage Knowledge Base
```bash
# View statistics
python3 rag_manage.py stats
# Delete all sessions
python3 rag_manage.py delete --by-type session
# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"
```
## How It Works
### Document Ingestion
1. **Session transcripts**: Process chat history from `~/.openclaw/agents/main/sessions/*.jsonl`
- Handles OpenClaw event format (session metadata, messages, tool calls)
- Chunks messages into groups of 20 with overlap
- Extracts and formats thinking, tool calls, and results
2. **Workspace files**: Scans workspace for code, docs, configs
   - Supports: `.py`, `.js`, `.ts`, `.md`, `.json`, `.yaml`, `.sh`, `.html`, `.css`
- Skips files > 1MB and binary files
- Chunking for long documents
3. **Skills**: Indexes all `SKILL.md` files
- Captures skill documentation and usage examples
- Organized by skill name
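The overlapping message chunking in step 1 can be sketched roughly as follows (an illustrative sketch, not the actual `ingest_sessions.py` implementation):

```python
def chunk_messages(messages, chunk_size=20, overlap=5):
    """Split a message list into overlapping chunks.

    Each chunk shares `overlap` trailing messages with the next
    chunk so conversational context carries across boundaries.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(messages), step):
        chunks.append(messages[start:start + chunk_size])
        if start + chunk_size >= len(messages):
            break  # final chunk reached the end of the list
    return chunks
```

With the documented defaults (20 messages per chunk, 5 overlap), each chunk repeats the last 5 messages of the previous one.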
### Semantic Search
ChromaDB uses the `all-MiniLM-L6-v2` embedding model (79MB) to convert text into vector representations. Similar meanings cluster together, enabling semantic search beyond keyword matching.
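To illustrate the idea with toy numbers (real all-MiniLM-L6-v2 vectors have 384 dimensions, and these 3-d vectors are invented for the example): "similar meaning" becomes a small angle between vectors, measured by cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" -- purely illustrative values.
send_sms     = [0.9, 0.1, 0.0]
text_message = [0.8, 0.2, 0.1]
dns_record   = [0.0, 0.1, 0.9]
```

A query embedded near `send_sms` scores high against `text_message` (related meaning) and low against `dns_record`, even with zero shared keywords.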
### Automatic RAG Integration
When the AI responds to a question that could benefit from context, it automatically:
1. Searches the knowledge base
2. Retrieves relevant past conversations, code, or docs
3. Includes that context in the response
This happens transparently - the AI just "knows" about your past work.
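Conceptually, those three steps are the standard retrieve-then-read pattern: prepend retrieved context to the question before the model answers. A minimal sketch (the dict fields and function name here are assumptions for illustration; the real `rag_context.py` hook may differ):

```python
def build_augmented_prompt(question, retrieved_docs, max_docs=3):
    """Prepend retrieved context to the user question.

    Sketch of the retrieve-then-read pattern: each retrieved doc
    is labeled with its type and source so the model can cite it.
    """
    context_lines = []
    for doc in retrieved_docs[:max_docs]:
        context_lines.append(f"[{doc['type']}:{doc['source']}] {doc['text']}")
    context = "\n".join(context_lines)
    return f"Relevant context:\n{context}\n\nQuestion: {question}"
```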
## Configuration
### Custom Session Directory
```bash
python3 ingest_sessions.py --sessions-dir /path/to/sessions
```
### Chunk Size Control
```bash
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
```
### Custom Collection Name
```python
from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")
```
## Data Types
| Type | Source | Description |
|------|--------|-------------|
| **session** | `session:{key}` | Chat history transcripts |
| **workspace** | `relative/path` | Code, configs, docs |
| **skill** | `skill:{name}` | Skill documentation |
| **memory** | `MEMORY.md` | Long-term memory entries |
| **manual** | `{custom}` | Manually added docs |
| **api** | `api-docs:{name}` | API documentation |
## Performance
- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
- **Storage**: ~100MB per 1,000 documents
- **Indexing time**: ~1,000 docs/min
- **Search time**: <100ms (after first query loads embeddings)
## Troubleshooting
### No Results Found
- Check if anything is indexed: `python3 rag_manage.py stats`
- Try broader queries or different wording
- Try without filters: remove `--type` if using it
### Slow First Search
The first search after ingestion loads embeddings (~1-2 seconds). Subsequent searches are much faster.
### Memory Issues
Reset collection if needed:
```bash
python3 rag_manage.py reset
```
### Duplicate ID Errors
If you see "Expected IDs to be unique" errors:
1. Reset the collection
2. Re-run ingestion
3. The fix includes `chunk_index` in ID generation
### ChromaDB Download Stuck
On first run, ChromaDB downloads the embedding model (~79MB). This takes 1-2 minutes. Let it complete.
## Automatic Updates
### Setup Scheduled Indexing
The RAG system includes an automatic update script that runs daily:
```bash
# Manual test
bash ~/.openclaw/workspace/scripts/rag-auto-update.sh
```
**What it does:**
- Detects new/updated chat sessions and re-indexes them
- Re-indexes workspace files (captures code changes)
- Updates skill documentation
- Maintains state to avoid re-processing unchanged files
- Runs via cron daily at midnight UTC (`0 0 * * *`)
**Configuration:**
```bash
# View cron job
openclaw cron list
# Edit schedule (if needed)
openclaw cron update <job-id> --schedule "{\"expr\":\"0 4 * * *\"}"
```
**State tracking:** `~/.openclaw/workspace/memory/rag-auto-state.json`
**Log file:** `~/.openclaw/workspace/memory/rag-auto-update.log`
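The state tracking could work roughly like this: record each file's modification time in the state JSON and skip files whose mtime is unchanged since the last run (a sketch under assumptions; the actual state format used by `rag-auto-update.sh` may differ):

```python
import json
import os

STATE_FILE = os.path.expanduser(
    "~/.openclaw/workspace/memory/rag-auto-state.json")

def load_state(path=STATE_FILE):
    """Load the previous run's {path: mtime} map, or empty."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}

def files_needing_reindex(paths, state):
    """Return files whose mtime changed since the last run,
    updating the state dict in place."""
    changed = []
    for p in paths:
        mtime = os.path.getmtime(p)
        if state.get(p) != mtime:
            changed.append(p)
            state[p] = mtime
    return changed
```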
## Best Practices
### Automatic Update Enabled
The RAG system now automatically updates daily - no manual re-indexing needed.
After significant work, you can still manually update:
```bash
bash ~/.openclaw/workspace/scripts/rag-auto-update.sh
```
### Use Specific Queries
Better results with focused queries:
```bash
# Good
python3 rag_query.py "voip.ms getSMS API method"
# Less specific
python3 rag_query.py "API"
```
### Filter by Type
When you know the data type:
```bash
# Looking for code
python3 rag_query.py --type workspace "chromedriver"
# Looking for past conversations
python3 rag_query.py --type session "SMS"
```
### Document Decisions
After important decisions, add to knowledge base:
```bash
python3 rag_manage.py add \
  --text "Decision: Use Playwright not Selenium for Reddit automation. Reason: Better handling of Cloudflare challenges. Date: 2026-02-11" \
--source "decision:reddit-automation" \
--type "decision"
```
## Limitations
- Files > 1MB are automatically skipped (performance)
- First search is slower (embedding load)
- Requires ~100MB disk space per 1,000 documents
- Python 3.7+ required
## License
MIT License - Free to use and modify
## Contributing
Contributions welcome! Areas for improvement:
- API documentation indexing from external URLs
- File system watch for automatic re-indexing
- Better chunking strategies for long documents
- Integration with external vector stores (Pinecone, Weaviate)
## Documentation Files
- **CHANGELOG.md** - Version history and changes
- **SKILL.md** - OpenClaw skill integration guide
- **package.json** - Skill metadata (no credentials required)
- **LICENSE** - MIT License
## Author
Nova AI Assistant for William Mantly (Theta42)
## Repository
https://git.theta42.com/nova/openclaw-rag-skill
Published on: clawhub.com

SKILL.md
View File

@@ -1,381 +0,0 @@
---
name: rag
description: Complete RAG (Retrieval-Augmented Generation) system for OpenClaw. Indexes chat sessions, workspace code, documentation, and skills into local ChromaDB for semantic search. Enables finding past solutions, code patterns, and decisions instantly. Uses local embeddings (all-MiniLM-L6-v2) with no API keys required. Automatically ingests and updates knowledge base from ~/.openclaw/agents/main/sessions and workspace files.
---
# OpenClaw RAG Knowledge System
**Retrieval-Augmented Generation for OpenClaw: search chat history, code, docs, and skills with semantic understanding**
## Overview
This skill provides a complete RAG (Retrieval-Augmented Generation) system for OpenClaw. It indexes your entire knowledge base (chat transcripts, workspace code, skill documentation) and enables semantic search across everything.
**Key features:**
- 🧠 Semantic search across all conversations and code
- 📚 Automatic knowledge base management
- 🔍 Find past solutions, code patterns, decisions instantly
- 💾 Local ChromaDB storage (no API keys required)
- 🚀 Automatic AI integration retrieves context transparently
## Installation
### Prerequisites
- Python 3.7+
- OpenClaw workspace
### Setup
```bash
# Navigate to your OpenClaw workspace
cd ~/.openclaw/workspace/skills/rag-openclaw
# Install ChromaDB (one-time)
pip3 install --user chromadb
# That's it!
```
## Quick Start
### 1. Index Your Knowledge
```bash
# Index all chat history
python3 ingest_sessions.py
# Index workspace code and docs
python3 ingest_docs.py workspace
# Index skill documentation
python3 ingest_docs.py skills
```
### 2. Search the Knowledge Base
```bash
# Interactive search mode
python3 rag_query.py -i
# Quick search
python3 rag_query.py "how to send SMS via voip.ms"
# Search by type
python3 rag_query.py "porkbun DNS" --type skill
python3 rag_query.py "chromedriver" --type workspace
python3 rag_query.py "Reddit automation" --type session
```
### 3. Check Statistics
```bash
# See what's indexed
python3 rag_manage.py stats
```
## Usage Examples
### Finding Past Solutions
Hit a problem? Search for how you solved it before:
```bash
python3 rag_query.py "cloudflare bypass selenium"
python3 rag_query.py "voip.ms SMS configuration"
python3 rag_query.py "porkbun update DNS record"
```
### Searching Through Codebase
Find specific code or documentation:
```bash
python3 rag_query.py --type workspace "unifi gateway API"
python3 rag_query.py --type workspace "SMS client"
```
### Quick Reference
Access skill documentation without digging through files:
```bash
python3 rag_query.py --type skill "how to monitor UniFi"
python3 rag_query.py --type skill "Porkbun tool usage"
```
### Programmatic Use
From within Python scripts or OpenClaw sessions:
```python
import os
import sys
sys.path.insert(0, os.path.expanduser('~/.openclaw/workspace/skills/rag-openclaw'))
from rag_query_wrapper import search_knowledge, format_for_ai
# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} relevant items")
# Format for AI consumption
context = format_for_ai(results)
print(context)
```
## Files Reference
| File | Purpose |
|------|---------|
| `rag_system.py` | Core RAG class (ChromaDB wrapper) |
| `ingest_sessions.py` | Index chat history |
| `ingest_docs.py` | Index workspace files & skills |
| `rag_query.py` | Search interface (CLI & interactive) |
| `rag_manage.py` | Document management (stats, delete, reset) |
| `rag_query_wrapper.py` | Simple Python API for programmatic use |
| `README.md` | Full documentation |
## How It Works
### Indexing
**Sessions:**
- Reads `~/.openclaw/agents/main/sessions/*.jsonl`
- Handles OpenClaw event format (session metadata, messages, tool calls)
- Chunks messages (20 per chunk, 5 message overlap)
- Extracts and formats thinking, tool calls, results
**Workspace:**
- Scans for `.py`, `.js`, `.ts`, `.md`, `.json`, `.yaml`, `.sh`, `.html`, `.css`
- Skips files > 1MB and binary files
- Chunks long documents for better retrieval
**Skills:**
- Indexes all `SKILL.md` files
- Organized by skill name for easy reference
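The session ingestion above boils down to reading JSONL event files, one JSON object per line, keeping only message events. A rough sketch (the `type`/`role`/`content` field names are assumptions for illustration, not the documented OpenClaw event schema):

```python
import json

def parse_session_lines(lines):
    """Yield (role, text) pairs from JSONL event lines.

    NOTE: the 'type', 'role', and 'content' field names are
    assumed for this sketch; the real OpenClaw event format
    may use different keys.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort
        if event.get("type") == "message":
            yield event.get("role", "unknown"), event.get("content", "")
```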
### Search
ChromaDB uses `all-MiniLM-L6-v2` embeddings to convert text to vectors. Similar meanings cluster together, enabling semantic search by *meaning* not just *keywords*.
### Automatic Integration
When the AI responds, it automatically:
1. Searches the knowledge base for relevant context
2. Retrieves past conversations, code, or docs
3. Includes that context in the response
This happens transparently: the AI "remembers" your past work.
## Management
### View Statistics
```bash
python3 rag_manage.py stats
```
Output:
```
📊 OpenClaw RAG Statistics
Collection: openclaw_knowledge
Total Documents: 635
By Source:
session-001: 23
my-script.py: 5
porkbun: 12
By Type:
session: 500
workspace: 100
skill: 35
```
### Delete Documents
```bash
# Delete all sessions
python3 rag_manage.py delete --by-type session
# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"
# Reset entire collection
python3 rag_manage.py reset
```
### Add Manual Document
```bash
python3 rag_manage.py add \
--text "API endpoint: https://api.example.com/endpoint" \
--source "api-docs:example.com" \
--type "manual"
```
## Configuration
### Custom Session Directory
```bash
python3 ingest_sessions.py --sessions-dir /path/to/sessions
```
### Chunk Size Control
```bash
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
```
### Custom Collection
```python
from rag_system import RAGSystem
rag = RAGSystem(collection_name="my_knowledge")
```
## Data Types
| Type | Source Format | Description |
|------|--------------|-------------|
| `session` | `session:{key}` | Chat history transcripts |
| `workspace` | `relative/path/to/file` | Code, configs, docs |
| `skill` | `skill:{name}` | Skill documentation |
| `memory` | `MEMORY.md` | Long-term memory entries |
| `manual` | `{custom}` | Manually added docs |
| `api` | `api-docs:{name}` | API documentation |
## Performance
- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
- **Storage**: ~100MB per 1,000 documents
- **Indexing**: ~1,000 documents/minute
- **Search**: <100ms (after first query)
## Troubleshooting
### No Results Found
```bash
# Check what's indexed
python3 rag_manage.py stats
# Try broader query
python3 rag_query.py "SMS" # instead of "voip.ms SMS API endpoint"
```
### Slow First Search
First search loads embeddings (~1-2 seconds). Subsequent searches are instant.
### Duplicate ID Errors
```bash
# Reset and re-index
python3 rag_manage.py reset
python3 ingest_sessions.py
python3 ingest_docs.py workspace
```
### ChromaDB Model Download
First run downloads embedding model (79MB). Takes 1-2 minutes. Let it complete.
## Best Practices
### Re-index Regularly
After significant work:
```bash
python3 ingest_sessions.py # New conversations
python3 ingest_docs.py workspace # New code/changes
```
### Use Specific Queries
```bash
# Better
python3 rag_query.py "voip.ms getSMS method"
# Too broad
python3 rag_query.py "SMS"
```
### Filter by Type
```bash
# Looking for code
python3 rag_query.py --type workspace "chromedriver"
# Looking for past conversations
python3 rag_query.py --type session "Reddit"
```
### Document Decisions
After important decisions, add them manually:
```bash
python3 rag_manage.py add \
  --text "Decision: Use Playwright for Reddit automation. Reason: handles Cloudflare challenges better" \
--source "decision:reddit-automation" \
--type "decision"
```
## Limitations
- Files > 1MB automatically skipped (performance)
- Python 3.7+ required
- ~100MB disk per 1,000 documents
- First search slower (embedding load)
## Integration with OpenClaw
This skill integrates seamlessly with OpenClaw:
1. **Automatic RAG**: AI automatically retrieves relevant context when responding
2. **Session history**: All conversations indexed and searchable
3. **Workspace awareness**: Code and docs indexed for reference
4. **Skill accessible**: Use from any OpenClaw session or script
## Security Considerations
**⚠️ Important Privacy Note:** This RAG system indexes local data, which may contain:
- API keys, tokens, or credentials in session transcripts
- Private messages or personal information
- Tool results with sensitive data
- Workspace configuration files
**Recommended:**
- Review session files before ingestion if concerned about privacy
- Consider redacting sensitive data from session files
- Use `rag_manage.py reset` to delete the entire index when needed
- The ChromaDB persistence at `~/.openclaw/data/rag/` can be deleted to remove all indexed data
- The auto-update script only runs local ingestion - no remote code fetching
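A minimal redaction pass could look like the sketch below: mask strings that look like API keys or bearer tokens before ingestion. The patterns are illustrative only and will not catch every secret format, so review the output manually.

```python
import re

# Illustrative patterns only -- extend for your own secret formats.
SECRET_PATTERNS = [
    re.compile(r"(?i)bearer\s+[a-z0-9._\-]+"),
    re.compile(r"(?i)api[_-]?key\s*[=:]\s*\S+"),
]

def redact(text, mask="[REDACTED]"):
    """Replace anything matching a secret pattern with `mask`."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(mask, text)
    return text
```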
## Example Workflow
**Scenario:** You're working on a new automation but hit a Cloudflare challenge.
```bash
# Search for past Cloudflare solutions
python3 rag_query.py "Cloudflare bypass selenium"
# Result shows relevant past conversation:
# "Used undetected-chromedriver but failed. Switched to Playwright which handles challenges better."
# Now you know the solution before trying it!
```
## Repository
https://git.theta42.com/nova/openclaw-rag-skill
**Published:** clawhub.com
**Maintainer:** Nova AI Assistant
**For:** William Mantly (Theta42)
## License
MIT License - Free to use and modify

index.html Normal file
View File

@@ -0,0 +1,529 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>OpenClaw RAG | Knowledge Base with Semantic Search</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
<style>
:root {
--bg-primary: #0D0D0D;
--bg-secondary: #1A1A1A;
--bg-tertiary: #252525;
--accent-orange: #FF5733;
--accent-orange-hover: #FF6B4C;
--text-primary: #FFFFFF;
--text-secondary: #B0B0B0;
--text-muted: #6B6B6B;
--border-color: #333333;
--terminal-bg: #0A0A0A;
--terminal-border: #1E3A2E;
}
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: 'Inter', sans-serif;
background-color: var(--bg-primary);
color: var(--text-primary);
line-height: 1.6;
min-height: 100vh;
}
/* Header */
header {
padding: 2rem 4rem;
display: flex;
justify-content: space-between;
align-items: center;
border-bottom: 1px solid var(--border-color);
}
.logo {
font-weight: 800;
font-size: 1.5rem;
background: linear-gradient(135deg, var(--text-primary), var(--accent-orange));
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
}
nav a {
color: var(--text-secondary);
text-decoration: none;
margin-left: 2rem;
font-size: 0.9rem;
transition: color 0.2s;
}
nav a:hover {
color: var(--accent-orange);
}
/* Hero Section */
.hero {
max-width: 1400px;
margin: 0 auto;
padding: 5rem 4rem;
display: grid;
grid-template-columns: 1fr 1fr;
gap: 4rem;
align-items: center;
}
.hero-text h1 {
font-size: 3.5rem;
font-weight: 800;
line-height: 1.1;
margin-bottom: 1.5rem;
}
.hero-text h1 span {
color: var(--accent-orange);
}
.hero-text p {
color: var(--text-secondary);
font-size: 1.2rem;
margin-bottom: 2rem;
}
.cta-group {
display: flex;
gap: 1rem;
}
.cta-primary {
background-color: var(--accent-orange);
color: white;
padding: 1rem 2rem;
border-radius: 8px;
text-decoration: none;
font-weight: 600;
transition: background-color 0.2s;
border: none;
cursor: pointer;
}
.cta-primary:hover {
background-color: var(--accent-orange-hover);
}
.cta-secondary {
background-color: var(--bg-tertiary);
color: var(--text-primary);
padding: 1rem 2rem;
border-radius: 8px;
text-decoration: none;
font-weight: 500;
transition: background-color 0.2s;
border: 1px solid var(--border-color);
}
.cta-secondary:hover {
background-color: var(--bg-secondary);
}
/* Terminal */
.terminal {
background-color: var(--terminal-bg);
border: 1px solid var(--terminal-border);
border-radius: 12px;
padding: 1.5rem;
font-family: 'JetBrains Mono', monospace;
font-size: 0.9rem;
}
.terminal-header {
display: flex;
gap: 0.5rem;
margin-bottom: 1rem;
}
.terminal-dot {
width: 12px;
height: 12px;
border-radius: 50%;
}
.dot-red { background-color: #FF5F56; }
.dot-yellow { background-color: #FFBD2E; }
.dot-green { background-color: #27C93F; }
.terminal-prompt {
color: var(--accent-orange);
display: flex;
gap: 0.5rem;
margin-bottom: 0.5rem;
}
.terminal-command {
color: var(--text-primary);
}
.terminal-output {
color: var(--text-secondary);
margin-top: 1rem;
}
.terminal-output .success {
color: #27C93F;
}
/* Features Section */
.features {
max-width: 1400px;
margin: 0 auto;
padding: 4rem;
}
.section-header {
margin-bottom: 3rem;
}
.section-header h2 {
font-size: 2.5rem;
font-weight: 700;
margin-bottom: 0.5rem;
}
.section-header p {
color: var(--text-secondary);
font-size: 1.1rem;
}
.feature-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(350px, 1fr));
gap: 1.5rem;
}
.feature-card {
background: rgba(26, 26, 26, 0.8);
border: 1px solid var(--border-color);
border-radius: 12px;
padding: 2rem;
transition: border-color 0.2s;
}
.feature-card:hover {
border-color: var(--accent-orange);
}
.feature-icon {
font-size: 2rem;
margin-bottom: 1rem;
}
.feature-card h3 {
font-size: 1.3rem;
font-weight: 600;
margin-bottom: 0.75rem;
}
.feature-card p {
color: var(--text-secondary);
font-size: 0.95rem;
line-height: 1.7;
}
/* Comparison Section */
.comparison {
max-width: 1400px;
margin: 0 auto;
padding: 4rem;
background: linear-gradient(180deg, transparent, rgba(255, 87, 51, 0.03));
}
.comparison-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 3rem;
margin-top: 2rem;
}
.comparison-card {
background: var(--bg-secondary);
border: 2px solid var(--border-color);
border-radius: 12px;
padding: 2rem;
}
.comparison-card.better {
border-color: var(--accent-orange);
box-shadow: 0 0 20px rgba(255, 87, 51, 0.1);
}
.comparison-card h3 {
font-size: 1.5rem;
font-weight: 700;
margin-bottom: 1.5rem;
}
.comparison-card.better h3 {
color: var(--accent-orange);
}
.feature-list {
list-style: none;
}
.feature-list li {
padding: 0.75rem 0;
padding-left: 1.5rem;
position: relative;
color: var(--text-secondary);
}
.feature-list li::before {
content: '✓';
position: absolute;
left: 0;
color: var(--accent-orange);
}
.feature-list li.missing {
color: var(--text-muted);
}
.feature-list li.missing::before {
content: '✗';
color: var(--text-muted);
}
/* Why Section */
.why-section {
max-width: 1400px;
margin: 0 auto;
padding: 4rem;
}
.why-cards {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
gap: 1.5rem;
margin-top: 2rem;
}
.why-card {
background: var(--bg-tertiary);
border-radius: 12px;
padding: 1.5rem;
border-left: 4px solid var(--accent-orange);
}
.why-card h4 {
font-size: 1.1rem;
font-weight: 600;
margin-bottom: 0.5rem;
}
.why-card p {
color: var(--text-secondary);
font-size: 0.9rem;
}
/* Footer */
footer {
border-top: 1px solid var(--border-color);
padding: 3rem 4rem;
margin-top: 4rem;
}
.footer-content {
max-width: 1400px;
margin: 0 auto;
display: flex;
justify-content: space-between;
align-items: center;
}
.footer-links a {
color: var(--text-secondary);
text-decoration: none;
margin-left: 2rem;
font-size: 0.9rem;
transition: color 0.2s;
}
.footer-links a:hover {
color: var(--accent-orange);
}
.footer-text {
color: var(--text-muted);
font-size: 0.85rem;
}
/* Responsive */
@media (max-width: 1024px) {
.hero {
grid-template-columns: 1fr;
text-align: center;
padding: 3rem 2rem;
}
.hero h1 {
font-size: 2.5rem;
}
.comparison-grid {
grid-template-columns: 1fr;
}
            header, footer, .features, .comparison, .why-section {
padding-left: 2rem;
padding-right: 2rem;
}
}
</style>
</head>
<body>
<header>
<div class="logo">OpenClaw RAG</div>
<nav>
<a href="#features">Features</a>
<a href="#comparison">Comparison</a>
<a href="#why">Why Us</a>
<a href="https://github.com/wmantly/openclaw-rag-skill">GitHub</a>
</nav>
</header>
<section class="hero">
<div class="hero-text">
<h1>Local Knowledge with<br><span>Semantic Search</span></h1>
<p>Index your entire OpenClaw workspace — chat history, code, documentation — and search by meaning, not keywords. Fully offline, zero dependencies, complete privacy.</p>
<div class="cta-group">
<a href="https://clawhub.ai/wmantly/openclaw-rag-skill" class="cta-primary">Install on ClawHub</a>
<a href="https://github.com/wmantly/openclaw-rag-skill" class="cta-secondary">View on GitHub</a>
</div>
</div>
<div class="terminal">
<div class="terminal-header">
<div class="terminal-dot dot-red"></div>
<div class="terminal-dot dot-yellow"></div>
<div class="terminal-dot dot-green"></div>
</div>
<div class="terminal-prompt">
<span class="terminal-command">$</span>
<span>python3 rag_query.py "how to send SMS"</span>
</div>
<div class="terminal-output">
<p>🔍 Found 3 relevant items:</p>
<br>
<p><strong>📄 Past Conversation (voipms-sms)</strong></p>
<p>Use voipms_sms_client.py with API endpoint...</p>
<p>API: https://voip.ms/api/v1/rest.php</p>
<br>
<p class="success">✅ Indexed 2,533 documents from your workspace</p>
</div>
</div>
</section>
<section id="features" class="features">
<div class="section-header">
<h2>Powerful Features</h2>
<p>Everything you need to build and maintain your knowledge base</p>
</div>
<div class="feature-grid">
<div class="feature-card">
<div class="feature-icon">🧠</div>
<h3>Semantic Search</h3>
<p>Find context by meaning, not just keywords. ChromaDB with all-MiniLM-L6-v2 embeddings automatically understands the intent behind your queries.</p>
</div>
<div class="feature-card">
<div class="feature-icon">📚</div>
<h3>Multi-Source Indexing</h3>
<p>Index chat sessions, Python/JavaScript code, Markdown docs, skill documentation — everything in your OpenClaw workspace becomes searchable.</p>
</div>
<div class="feature-card">
<div class="feature-icon">🔒</div>
<h3>Completely Offline</h3>
<p>No API keys, no external services, no cloud databases. All embeddings and storage live on your machine. Your data never leaves.</p>
</div>
<div class="feature-card">
<div class="feature-icon">🚀</div>
<h3>Automatic Integration</h3>
<p>The AI automatically retrieves relevant context from your knowledge base when responding. It "remembers" your past work transparently.</p>
</div>
<div class="feature-card">
<div class="feature-icon">📊</div>
<h3>Type Filtering</h3>
<p>Search by document type — sessions, workspace, skills, memory. Find exactly what you're looking for with precision.</p>
</div>
<div class="feature-card">
<div class="feature-icon">⚡</div>
<h3>Blazing Fast</h3>
<p>Indexing at ~1,000 docs/minute, search in under 100ms. The embedding model (79MB) caches locally after first run.</p>
</div>
</div>
</section>
<section id="comparison" class="comparison">
<div class="section-header">
<h2>Why OpenClaw RAG?</h2>
<p>Compare with web-search RAG tools</p>
</div>
<div class="comparison-grid">
<div class="comparison-card">
<h3>Web-Search RAG Tools</h3>
<ul class="feature-list">
<li class="missing">Searches the internet</li>
<li class="missing">Requires web connection</li>
<li class="missing">No knowledge of your code</li>
<li class="missing">Can't find past solutions</li>
<li class="missing">For research, not memory</li>
</ul>
</div>
<div class="comparison-card better">
<h3>✨ OpenClaw RAG</h3>
<ul class="feature-list">
<li>Indexes YOUR knowledge base</li>
<li>Fully offline after setup</li>
<li>Understands your entire workspace</li>
<li>Remembers your past solutions</li>
<li>For institutional memory &amp; productivity</li>
</ul>
</div>
</div>
</section>
<section id="why" class="why-section">
<div class="section-header">
<h2>Built for Developers, by Developers</h2>
<p>Different use cases for different problems</p>
</div>
<div class="why-cards">
<div class="why-card">
<h4>🌐 Use Web RAG for Research</h4>
<p>Need current information, multi-engine web searches, or gathering context from the internet? That's what web-search RAG tools do best.</p>
</div>
<div class="why-card">
<h4>💾 Use OpenClaw RAG for Memory</h4>
<p>Need to find past solutions, code patterns, decisions, or conversations? That's our specialty — we remember everything you've done.</p>
</div>
<div class="why-card">
<h4>🤝 Best of Both Worlds</h4>
<p>Use web-search RAG for research, then save the important findings to your knowledge base. OpenClaw RAG preserves what matters forever.</p>
</div>
</div>
</section>
<footer>
<div class="footer-content">
<p class="footer-text">OpenClaw RAG v1.0.3 · MIT License</p>
<div class="footer-links">
<a href="https://clawhub.ai/wmantly/openclaw-rag-skill">ClawHub</a>
<a href="https://github.com/wmantly/openclaw-rag-skill">GitHub</a>
<a href="https://william.mantly.vip">William Mantly</a>
</div>
</div>
</footer>
</body>
</html>


@@ -1,265 +0,0 @@
#!/usr/bin/env python3
"""
RAG Document Ingestor - Load workspace files, skills, docs into vector store
"""
import os
import json
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional
# Add parent directory to path
import sys
sys.path.insert(0, str(Path(__file__).parent))
from rag_system import RAGSystem
def read_file_safe(file_path: Path) -> Optional[str]:
    """Read file contents, returning None on failure"""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()
    except (OSError, UnicodeDecodeError) as e:
        print(f"  ⚠️ Error reading {file_path}: {e}")
        return None
def is_text_file(file_path: Path, max_size_mb: float = 1.0) -> bool:
"""Check if file is text and not too large"""
# Skip binary files
if file_path.suffix.lower() in ['.pyc', '.so', '.o', '.a', '.png', '.jpg', '.jpeg', '.gif', '.zip', '.tar', '.gz']:
return False
# Check size
try:
size_mb = file_path.stat().st_size / (1024 * 1024)
if size_mb > max_size_mb:
return False
    except OSError:
return False
return True
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> List[str]:
    """Split text into paragraph-aligned chunks with character overlap"""
    if len(text) <= max_chars:
        return [text]
    chunks = []
    # Split on blank lines for now (could be improved to split at sentence boundaries)
    paragraphs = text.split('\n\n')
    current_chunk = ""
    for para in paragraphs:
        if len(current_chunk) + len(para) + 2 <= max_chars:
            current_chunk += para + "\n\n"
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Carry the tail of the previous chunk forward so adjacent
            # chunks share `overlap` characters of context
            tail = current_chunk[-overlap:] if overlap else ""
            current_chunk = tail + para + "\n\n"
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
def ingest_workspace(
workspace_dir: str = None,
collection_name: str = "openclaw_knowledge",
file_patterns: List[str] = None
):
"""
Ingest workspace files into RAG system
Args:
workspace_dir: Path to workspace directory
collection_name: Name of the ChromaDB collection
file_patterns: List of file patterns to include (default: all)
"""
if workspace_dir is None:
workspace_dir = os.path.expanduser("~/.openclaw/workspace")
workspace_path = Path(workspace_dir)
if not workspace_path.exists():
print(f"❌ Workspace not found: {workspace_dir}")
return
print(f"🔍 Scanning workspace: {workspace_path}")
# Default file patterns
if file_patterns is None:
file_patterns = [
"*.md", "*.py", "*.js", "*.ts", "*.json", "*.yaml", "*.yml",
"*.txt", "*.sh", "*.html", "*.css"
]
# Find all matching files
all_files = []
for pattern in file_patterns:
for file_path in workspace_path.rglob(pattern):
if is_text_file(file_path):
all_files.append(file_path)
if not all_files:
print(f"⚠️ No files found")
return
print(f"✅ Found {len(all_files)} files\n")
# Initialize RAG
rag = RAGSystem(collection_name=collection_name)
total_chunks = 0
for file_path in all_files[:100]: # Limit to 100 files for testing
relative_path = file_path.relative_to(workspace_path)
print(f"\n📄 {relative_path}")
# Read file
content = read_file_safe(file_path)
if content is None:
continue
# Chunk if too large
if len(content) > 4000:
text_chunks = chunk_text(content)
print(f" Chunks: {len(text_chunks)}")
else:
text_chunks = [content]
print(f" Size: {len(content)} chars")
# Add each chunk
for i, chunk in enumerate(text_chunks):
metadata = {
"type": "workspace",
"source": str(relative_path),
"file_path": str(file_path),
"file_size": len(content),
"chunk_index": i,
"total_chunks": len(text_chunks),
"file_extension": file_path.suffix.lower(),
"ingested_at": datetime.now().isoformat()
}
doc_id = rag.add_document(chunk, metadata)
total_chunks += 1
print(f" ✅ Indexed {len(text_chunks)} chunk(s)")
print(f"\n📊 Summary:")
    print(f"  Files processed: {len(all_files[:100])}")
print(f" Total chunks indexed: {total_chunks}")
stats = rag.get_stats()
print(f" Total documents in collection: {stats['total_documents']}")
def ingest_skills(
skills_base_dir: str = None,
collection_name: str = "openclaw_knowledge"
):
"""
Ingest all SKILL.md files from skills directory
Args:
skills_base_dir: Base directory for skills
collection_name: Name of the ChromaDB collection
"""
# Default to OpenClaw skills dir
if skills_base_dir is None:
# Check both system and workspace skills
system_skills = Path("/usr/lib/node_modules/openclaw/skills")
workspace_skills = Path(os.path.expanduser("~/.openclaw/workspace/skills"))
skills_dirs = [d for d in [system_skills, workspace_skills] if d.exists()]
if not skills_dirs:
print(f"❌ No skills directories found")
return
else:
skills_dirs = [Path(skills_base_dir)]
print(f"🔍 Scanning for skills...")
# Find all SKILL.md files
skill_files = []
for skills_dir in skills_dirs:
# Direct SKILL.md files
for skill_file in skills_dir.rglob("SKILL.md"):
skill_files.append(skill_file)
if not skill_files:
print(f"⚠️ No SKILL.md files found")
return
print(f"✅ Found {len(skill_files)} skills\n")
# Initialize RAG
rag = RAGSystem(collection_name=collection_name)
total_chunks = 0
for skill_file in skill_files:
# Determine skill name from path
if skill_file.name == "SKILL.md":
skill_name = skill_file.parent.name
else:
skill_name = skill_file.stem
print(f"\n📜 {skill_name}")
content = read_file_safe(skill_file)
if content is None:
continue
# Chunk skill documentation
chunks = chunk_text(content, max_chars=3000, overlap=100)
for i, chunk in enumerate(chunks):
metadata = {
"type": "skill",
"source": f"skill:{skill_name}",
"skill_name": skill_name,
"file_path": str(skill_file),
"chunk_index": i,
"total_chunks": len(chunks),
"ingested_at": datetime.now().isoformat()
}
doc_id = rag.add_document(chunk, metadata)
total_chunks += 1
print(f" ✅ Indexed {len(chunks)} chunk(s)")
print(f"\n📊 Summary:")
print(f" Skills processed: {len(skill_files)}")
print(f" Total chunks indexed: {total_chunks}")
stats = rag.get_stats()
print(f" Total documents in collection: {stats['total_documents']}")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Ingest files into OpenClaw RAG")
parser.add_argument("type", choices=["workspace", "skills"], help="What to ingest")
parser.add_argument("--path", help="Path to workspace or skills directory")
parser.add_argument("--collection", default="openclaw_knowledge", help="Collection name")
args = parser.parse_args()
if args.type == "workspace":
ingest_workspace(workspace_dir=args.path, collection_name=args.collection)
elif args.type == "skills":
ingest_skills(skills_base_dir=args.path, collection_name=args.collection)
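
The paragraph-based chunking used by `ingest_docs.py` can be exercised in isolation. Below is a minimal, self-contained sketch (the function is re-declared here, under a hypothetical name, so the snippet runs without the module on `sys.path`); carrying a tail of the previous chunk forward is one reasonable reading of the `overlap` parameter:

```python
from typing import List

def chunk_paragraphs(text: str, max_chars: int = 200, overlap: int = 40) -> List[str]:
    """Greedy paragraph packing with a character-overlap tail between chunks."""
    if len(text) <= max_chars:
        return [text]
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) + 2 <= max_chars:
            current += para + "\n\n"
        else:
            if current:
                chunks.append(current.strip())
            # carry a tail forward so adjacent chunks share context
            current = (current[-overlap:] if overlap else "") + para + "\n\n"
    if current:
        chunks.append(current.strip())
    return chunks

doc = "\n\n".join(f"Paragraph {i} " + "x" * 80 for i in range(5))
chunks = chunk_paragraphs(doc, max_chars=200, overlap=40)
print(len(chunks))
```

Smaller `max_chars` values are used here only to make the behavior visible on a short document; the ingestors use 3000 to 4000 characters per chunk.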


@@ -1,289 +0,0 @@
#!/usr/bin/env python3
"""
RAG Session Ingestor - Load all session transcripts into vector store
Fixed to handle OpenClaw session event format
"""
import os
import json
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any
import sys
sys.path.insert(0, str(Path(__file__).parent))
from rag_system import RAGSystem
def parse_jsonl(file_path: Path) -> List[Dict]:
"""
Parse OpenClaw session JSONL format
Session files contain:
- Line 1: Session metadata (type: "session")
- Lines 2+: Events including messages, toolCalls, etc.
"""
messages = []
try:
with open(file_path, 'r', encoding='utf-8') as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if not line:
continue
try:
event = json.loads(line)
# Skip session metadata line
if line_num == 1 and event.get('type') == 'session':
continue
# Extract message events only
if event.get('type') == 'message':
msg_obj = event.get('message', {})
messages.append({
'role': msg_obj.get('role'),
'content': msg_obj.get('content'),
'timestamp': event.get('timestamp'),
'id': event.get('id'),
                            'sessionKey': event.get('sessionKey')  # usually absent at the event level; kept when present
})
                except json.JSONDecodeError:
continue
except Exception as e:
print(f"❌ Error reading {file_path}: {e}")
return messages
def extract_session_key(file_name: str) -> str:
"""Extract session key from filename"""
return file_name.replace('.jsonl', '')
def extract_session_metadata(session_data: List[Dict], session_key: str) -> Dict:
"""Extract metadata from session messages"""
if not session_data:
return {}
first_msg = session_data[0]
last_msg = session_data[-1]
return {
"start_time": first_msg.get("timestamp"),
"end_time": last_msg.get("timestamp"),
"total_messages": len(session_data),
"has_system": any(msg.get("role") == "system" for msg in session_data),
"has_user": any(msg.get("role") == "user" for msg in session_data),
"has_assistant": any(msg.get("role") == "assistant" for msg in session_data),
}
def format_content(content) -> str:
"""
Format message content from OpenClaw format to text
Content can be:
- String
- List of dicts with 'type' field (text, thinking, toolCall, toolResult)
"""
if isinstance(content, str):
return content
if isinstance(content, list):
texts = []
for item in content:
if not isinstance(item, dict):
continue
item_type = item.get('type', '')
if item_type == 'text':
texts.append(item.get('text', ''))
elif item_type == 'thinking':
# Skip reasoning, usually not useful for RAG
# texts.append(f"[Reasoning: {item.get('thinking', '')[:200]}]")
pass
elif item_type == 'toolCall':
tool_name = item.get('name', 'unknown')
args = str(item.get('arguments', ''))[:100]
texts.append(f"[Tool: {tool_name}({args})]")
elif item_type == 'toolResult':
result = str(item.get('text', item.get('result', ''))).strip()
# Truncate large tool results
if len(result) > 500:
result = result[:500] + "..."
texts.append(f"[Tool Result: {result}]")
return "\n".join(texts)
return str(content)[:500]
def chunk_messages(
messages: List[Dict],
context_window: int = 20,
overlap: int = 5
) -> List[Dict]:
"""
Chunk messages for better retrieval
Args:
messages: List of message objects
context_window: Messages per chunk
overlap: Message overlap between chunks
Returns:
List of {"text": str, "metadata": dict} chunks
"""
    chunks = []
    step = max(context_window - overlap, 1)
    for i in range(0, len(messages), step):
        window = messages[i:i + context_window]
        # Build text from messages
        text_parts = []
        for msg in window:
            role = msg.get("role", "unknown")
            content = msg.get("content", "")
            # Format content
            text = format_content(content)
            if text.strip():
                text_parts.append(f"{role.upper()}: {text}")
        text = "\n\n".join(text_parts)
        # Don't add empty chunks
        if not text.strip():
            continue
        # Metadata
        metadata = {
            "type": "session",
            "source": str(window[0].get("sessionKey") or window[0].get("id") or "unknown"),
            "chunk_index": i // step,
            "chunk_start_time": str(window[0].get("timestamp") or ""),
            "chunk_end_time": str(window[-1].get("timestamp") or ""),
            "message_count": len(window),
            "ingested_at": datetime.now().isoformat(),
            "date": str(window[0].get("timestamp") or datetime.now().isoformat())
        }
        chunks.append({
            "text": text,
            "metadata": metadata
        })
    return chunks
def ingest_sessions(
sessions_dir: str = None,
collection_name: str = "openclaw_knowledge",
chunk_size: int = 20,
chunk_overlap: int = 5
):
"""
Ingest all session transcripts into RAG system
Args:
sessions_dir: Directory containing session jsonl files
collection_name: Name of the ChromaDB collection
chunk_size: Messages per chunk
chunk_overlap: Message overlap between chunks
"""
if sessions_dir is None:
sessions_dir = os.path.expanduser("~/.openclaw/agents/main/sessions")
sessions_path = Path(sessions_dir)
if not sessions_path.exists():
print(f"❌ Sessions directory not found: {sessions_path}")
return
print(f"🔍 Finding session files in: {sessions_path}")
jsonl_files = list(sessions_path.glob("*.jsonl"))
if not jsonl_files:
print(f"⚠️ No jsonl files found in {sessions_path}")
return
print(f"✅ Found {len(jsonl_files)} session files\n")
rag = RAGSystem(collection_name=collection_name)
total_chunks = 0
total_messages = 0
skipped_empty = 0
for jsonl_file in sorted(jsonl_files):
session_key = extract_session_key(jsonl_file.name)
print(f"\n📄 Processing: {jsonl_file.name}")
messages = parse_jsonl(jsonl_file)
if not messages:
print(f" ⚠️ No messages, skipping")
skipped_empty += 1
continue
total_messages += len(messages)
# Extract session metadata
session_metadata = extract_session_metadata(messages, session_key)
print(f" Messages: {len(messages)}")
# Chunk messages
chunks = chunk_messages(messages, chunk_size, chunk_overlap)
if not chunks:
print(f" ⚠️ No valid chunks, skipping")
skipped_empty += 1
continue
print(f" Chunks: {len(chunks)}")
# Add to RAG
try:
ids = rag.add_documents_batch(chunks, batch_size=50)
total_chunks += len(chunks)
print(f" ✅ Indexed {len(chunks)} chunks")
except Exception as e:
print(f" ❌ Error: {e}")
# Summary
print(f"\n📊 Summary:")
print(f" Sessions processed: {len(jsonl_files)}")
print(f" Skipped (empty): {skipped_empty}")
print(f" Total messages: {total_messages}")
print(f" Total chunks indexed: {total_chunks}")
stats = rag.get_stats()
print(f" Total documents in collection: {stats['total_documents']}")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Ingest OpenClaw session transcripts into RAG")
parser.add_argument("--sessions-dir", help="Path to sessions directory (default: ~/.openclaw/agents/main/sessions)")
parser.add_argument("--chunk-size", type=int, default=20, help="Messages per chunk (default: 20)")
parser.add_argument("--chunk-overlap", type=int, default=5, help="Message overlap (default: 5)")
args = parser.parse_args()
ingest_sessions(
sessions_dir=args.sessions_dir,
chunk_size=args.chunk_size,
chunk_overlap=args.chunk_overlap
)
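
Session chunking above advances through the transcript by `context_window - overlap` messages per chunk, so consecutive chunks share a few messages of context. A small standalone sketch of that stride arithmetic (function name is illustrative):

```python
def window_starts(n_messages: int, window: int = 20, overlap: int = 5) -> list:
    """Start indices for overlapping message windows (stride = window - overlap)."""
    step = max(window - overlap, 1)  # guard against overlap >= window
    return list(range(0, n_messages, step))

starts = window_starts(50)
print(starts)
```

With the defaults (20-message windows, 5-message overlap), a 50-message session yields windows starting at messages 0, 15, 30, and 45.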


@@ -1,43 +0,0 @@
#!/bin/bash
# RAG Agent Launcher - Spawns an agent with automatic knowledge base access
# This spawns a sub-agent that has RAG automatically integrated
# The agent will query your knowledge base before responding to questions
set -euo pipefail

# Resolve paths dynamically instead of hard-coding a home directory
RAG_DIR="${RAG_DIR:-$HOME/.openclaw/workspace/rag}"
ORIGINAL_TASK="$*"

# Build a context-enhanced task by searching the knowledge base first
ENHANCED_TASK=$(python3 - "$RAG_DIR" "$ORIGINAL_TASK" <<'PYEOF'
import sys

sys.path.insert(0, sys.argv[1])
task = sys.argv[2]

from rag_system import RAGSystem

rag = RAGSystem()
# Find similar past conversations
results = rag.search(task, n_results=3)

context = ""
if results:
    context = "\n=== RELEVANT CONTEXT FROM KNOWLEDGE BASE ===\n"
    for r in results:
        meta = r.get("metadata", {})
        text = r.get("text", "")[:500]
        context += f"\n[{meta.get('type', 'unknown').upper()} - {meta.get('source', 'unknown')}]\n{text}\n"

print(f"""{context}
=== CURRENT TASK ===
{task}

Use the context above if relevant to help answer the question.""")
PYEOF
)

# Spawn the agent with the RAG-enhanced task
openclaw sessions spawn "$ENHANCED_TASK"
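
The context-injection step the launcher performs can also be sketched in plain Python. The `results` shape assumed here (a list of dicts with `text` and `metadata` keys) mirrors how the script reads the return value of `rag_system.RAGSystem.search`:

```python
def build_enhanced_task(task: str, results: list) -> str:
    """Prepend knowledge-base snippets to a task prompt; pass the task through if none."""
    if not results:
        return task
    context = "\n=== RELEVANT CONTEXT FROM KNOWLEDGE BASE ===\n"
    for r in results:
        meta = r.get("metadata", {})
        snippet = r.get("text", "")[:500]  # truncate long snippets
        context += f"\n[{meta.get('type', 'unknown').upper()} - {meta.get('source', 'unknown')}]\n{snippet}\n"
    return f"{context}\n=== CURRENT TASK ===\n{task}\n\nUse the context above if relevant."

prompt = build_enhanced_task(
    "how do I send SMS?",
    [{"text": "Use voipms_sms_client.py", "metadata": {"type": "session", "source": "voipms-sms"}}],
)
print(prompt)
```

The sub-agent then receives one flat string: retrieved context first, the actual task last, so the model reads the background before the question.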


@@ -1,52 +0,0 @@
{
"name": "rag-openclaw",
"version": "1.0.2",
"description": "RAG Knowledge System for OpenClaw - Semantic search across chat history, code, docs, and skills with automatic memory retrieval",
"homepage": "http://git.theta42.com/nova/openclaw-rag-skill",
"author": {
"name": "Nova AI",
"email": "nova@vm42.us"
},
"owner": "wmantly",
"openclaw": {
"always": false,
"capabilities": []
},
"environment": {
"required": {},
"optional": {},
"config": {
"paths": [
"~/.openclaw/data/rag/"
],
"help": "ChromaDB storage location. No configuration required - system auto-creates data directory on first use."
}
},
"install": {
"type": "instruction",
"steps": [
"1. Install Python dependency: pip3 install --user chromadb",
"2. Install location: ~/.openclaw/workspace/rag/ (created automatically)",
"3. Data storage: ~/.openclaw/data/rag/ (auto-created on first run)",
"4. No API keys or credentials required - fully local system"
]
},
"scripts": {
"ingest:sessions": "python3 ingest_sessions.py",
"ingest:workspace": "python3 ingest_docs.py workspace",
"ingest:skills": "python3 ingest_docs.py skills",
"search": "python3 rag_query.py",
"update": "bash scripts/rag-auto-update.sh",
"stats": "python3 rag_manage.py stats",
"manage": "python3 rag_manage.py"
},
"keywords": [
"rag",
"knowledge",
"semantic-search",
"chromadb",
"memory",
"retrieval-augmented-generation"
],
"license": "MIT"
}


@@ -1,213 +0,0 @@
#!/usr/bin/env python3
"""
RAG-Enhanced OpenClaw Agent
This agent automatically retrieves relevant context from the knowledge base
before responding, providing conversation history, code, and documentation context.
Usage:
python3 /path/to/rag_agent.py <user_message> <session_jsonl_file>
Or integrate into OpenClaw as an agent wrapper.
"""
import sys
import json
from pathlib import Path
# Add parent directory to import RAG system
current_dir = Path(__file__).parent
sys.path.insert(0, str(current_dir))
from rag_system import RAGSystem
def extract_user_query(messages: list) -> str:
"""
Extract the most recent user message from conversation history.
Args:
messages: List of message objects
Returns:
User query string
"""
# Find the last user message
for msg in reversed(messages):
role = msg.get('role')
if role == 'user':
content = msg.get('content', '')
# Handle different content formats
if isinstance(content, str):
return content
elif isinstance(content, list):
# Extract text from list format
text_parts = []
for item in content:
if isinstance(item, dict) and item.get('type') == 'text':
text_parts.append(item.get('text', ''))
return ' '.join(text_parts)
return ''
def search_relevant_context(query: str, rag: RAGSystem, max_results: int = 5) -> str:
"""
Search the knowledge base for relevant context.
Args:
query: User's question
rag: RAGSystem instance
max_results: Maximum results to return
Returns:
Formatted context string
"""
if not query or len(query) < 3:
return ''
try:
# Search for relevant context
results = rag.search(query, n_results=max_results)
if not results:
return ''
# Format the results
context_parts = []
context_parts.append(f"Found {len(results)} relevant context items:\n")
for i, result in enumerate(results, 1):
metadata = result.get('metadata', {})
doc_type = metadata.get('type', 'unknown')
source = metadata.get('source', 'unknown')
# Header based on type
if doc_type == 'session':
header = f"[Session Reference {i}]"
elif doc_type == 'workspace':
header = f"[Code/Docs {i}: {source}]"
elif doc_type == 'skill':
header = f"[Skill Reference {i}: {source}]"
else:
header = f"[Reference {i}]"
# Truncate long content
text = result.get('text', '')
if len(text) > 800:
text = text[:800] + "..."
context_parts.append(f"{header}\n{text}\n")
return '\n'.join(context_parts)
    except Exception:
        # Fail silently - RAG shouldn't break conversations
        return ''
def enhance_message_with_rag(
message_content: str,
conversation_history: list,
collection_name: str = "openclaw_knowledge"
) -> str:
"""
Enhance a user message with relevant RAG context.
This is the main integration point. Call this before sending messages to the LLM.
Args:
message_content: The current user message
conversation_history: Recent conversation messages
collection_name: ChromaDB collection name
Returns:
Enhanced message string with RAG context prepended
"""
try:
# Initialize RAG system
rag = RAGSystem(collection_name=collection_name)
# Extract user query
user_query = extract_user_query([{'role': 'user', 'content': message_content}] + conversation_history)
# Search for relevant context
context = search_relevant_context(user_query, rag, max_results=5)
if not context:
return message_content
# Prepend context to the message
enhanced = f"""[RAG CONTEXT - Retrieved from knowledge base:]
{context}
---
[CURRENT USER MESSAGE:]
{message_content}"""
return enhanced
    except Exception:
        # Fail silently - return original message if RAG fails
        return message_content
def get_response_with_rag(
user_message: str,
session_jsonl: str = None,
collection_name: str = "openclaw_knowledge"
) -> str:
"""
Get an AI response with automatic RAG-enhanced context.
This is a helper function that can be called from scripts.
Args:
user_message: The user's question
session_jsonl: Path to session file (for conversation history)
collection_name: ChromaDB collection name
Returns:
Enhanced message ready for LLM processing
"""
# Load conversation history if session file provided
conversation_history = []
if session_jsonl and Path(session_jsonl).exists():
try:
with open(session_jsonl, 'r') as f:
for line in f:
if line.strip():
event = json.loads(line)
if event.get('type') == 'message':
msg = event.get('message', {})
conversation_history.append(msg)
        except (OSError, json.JSONDecodeError):
            pass
# Enhance message
return enhance_message_with_rag(user_message, conversation_history, collection_name)
if __name__ == "__main__":
# Command-line interface for testing
if len(sys.argv) < 2:
print("Usage: python3 rag_agent.py <user_message> [session_jsonl]")
print("\nOr import and use:")
print(" from rag.rag_agent import enhance_message_with_rag")
print(" enhanced = enhance_message_with_rag(user_message, history)")
sys.exit(1)
user_message = sys.argv[1]
session_jsonl = sys.argv[2] if len(sys.argv) > 2 else None
# Get enhanced message
enhanced = get_response_with_rag(user_message, session_jsonl)
print("\n" + "="*80)
print("ENHANCED MESSAGE (Ready for LLM):")
print("="*80)
print(enhanced)
print("="*80 + "\n")
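
`extract_user_query` above normalizes both content formats OpenClaw produces: a plain string, or a list of typed parts where only `text` entries matter. A standalone sketch of that normalization (re-declared under an illustrative name so it runs on its own):

```python
def last_user_text(messages: list) -> str:
    """Return the most recent user message, flattening list-of-parts content."""
    for msg in reversed(messages):
        if msg.get("role") != "user":
            continue
        content = msg.get("content", "")
        if isinstance(content, str):
            return content
        if isinstance(content, list):
            # Keep only 'text' parts; tool calls, results, etc. are dropped
            return " ".join(
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            )
    return ""

history = [
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "answer"},
    {"role": "user", "content": [{"type": "text", "text": "second"}, {"type": "toolCall", "name": "x"}]},
]
print(last_user_text(history))
```

Iterating in reverse means the newest user turn wins, which is what the RAG search should key on.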


@@ -1,91 +0,0 @@
#!/usr/bin/env python3
"""
RAG Context Checker - Simple wrapper for automatic knowledge base lookup
Usage from within Python/OpenClaw sessions:
from rag_context import check_context
check_context("your question here")
This prints relevant context if found, otherwise silent.
"""
import sys
from pathlib import Path

# Resolve the rag directory relative to this file (no hard-coded paths)
sys.path.insert(0, str(Path(__file__).parent))
from rag_query_wrapper import search_knowledge, format_for_ai
def check_context(query: str, max_results: int = 5, min_score: float = 0.3) -> None:
"""
Check knowledge base for relevant context and print it.
This is designed as a simple call-and-forget function. If context exists,
it prints it. If not, it's silent.
Args:
query: The question or topic to search for
max_results: Maximum results to retrieve
min_score: Minimum similarity score (not used in current implementation)
Example:
>>> check_context("how to send SMS")
📚 Found 3 relevant items...
[prints context]
"""
if not query or len(query) < 3:
return
try:
results = search_knowledge(query, n_results=max_results)
if results and results.get('count', 0) > 0:
print("\n" + "="*80)
print("📚 RELEVANT CONTEXT FROM KNOWLEDGE BASE\n")
formatted = format_for_ai(results)
print(formatted)
print("="*80 + "\n")
except Exception as e:
# Fail silently - RAG errors shouldn't break conversations
pass
def get_context(query: str, max_results: int = 5) -> str:
"""
Get context as a string (for automated use).
Returns formatted context string, or empty string if no results.
Args:
query: Search query
max_results: Maximum results
Returns:
Formatted context string, or "" if no results
Example:
>>> context = get_context("Reddit automation")
>>> if context:
>>> print("Found relevant past work:")
>>> print(context)
"""
try:
results = search_knowledge(query, n_results=max_results)
if results and results.get('count', 0) > 0:
return format_for_ai(results)
except Exception as e:
pass
return ""
# Quick test when run directly
if __name__ == "__main__":
if len(sys.argv) > 1:
query = ' '.join(sys.argv[1:])
check_context(query)
else:
print("Usage: python3 rag_context.py <query>")
print("Or import: from rag_context import check_context")
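
Both helpers above follow the same fail-silent contract: a retrieval error must never break a conversation turn. That contract can be captured as a tiny decorator; `fail_silent` is a hypothetical name used only for this sketch:

```python
import functools

def fail_silent(default=""):
    """Return `default` instead of raising; lookups must never break a chat turn."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                return default
        return inner
    return wrap

@fail_silent(default="")
def flaky_lookup(query: str) -> str:
    # Simulates the vector store being unavailable
    raise RuntimeError("vector store unavailable")

print(repr(flaky_lookup("how to send SMS")))
```

The trade-off is that genuine bugs are also swallowed, which is why `check_context` stays a best-effort convenience rather than a load-bearing code path.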


@@ -1,218 +0,0 @@
#!/usr/bin/env python3
"""
RAG Manager - Manage the OpenClaw knowledge base (add/remove/stats)
"""
import sys
from pathlib import Path
# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent))
from rag_system import RAGSystem
def show_stats(collection_name: str = "openclaw_knowledge"):
"""Show collection statistics"""
print("📊 OpenClaw RAG Statistics\n")
rag = RAGSystem(collection_name=collection_name)
stats = rag.get_stats()
print(f"Collection: {stats['collection_name']}")
print(f"Storage: {stats['persist_directory']}")
print(f"Total Documents: {stats['total_documents']}\n")
if stats['source_distribution']:
print("By Source:")
for source, count in sorted(stats['source_distribution'].items())[:15]:
print(f" {source}: {count}")
print()
if stats['type_distribution']:
print("By Type:")
for doc_type, count in sorted(stats['type_distribution'].items()):
print(f" {doc_type}: {count}")
def add_manual_document(
text: str,
source: str,
doc_type: str = "manual",
collection_name: str = "openclaw_knowledge"
):
"""Manually add a document to the knowledge base"""
from datetime import datetime
metadata = {
"type": doc_type,
"source": source,
"added_at": datetime.now().isoformat()
}
rag = RAGSystem(collection_name=collection_name)
doc_id = rag.add_document(text, metadata)
print(f"✅ Document added: {doc_id}")
print(f" Source: {source}")
print(f" Type: {doc_type}")
print(f" Length: {len(text)} chars")
def delete_by_source(
source: str,
collection_name: str = "openclaw_knowledge"
):
"""Delete all documents from a specific source"""
rag = RAGSystem(collection_name=collection_name)
# Count matching docs first
results = rag.collection.get(where={"source": source})
count = len(results['ids'])
if count == 0:
print(f"⚠️ No documents found with source: {source}")
return
# Confirm
print(f"Found {count} documents from source: {source}")
confirm = input("Delete them? (yes/no): ").strip().lower()
if confirm not in ['yes', 'y']:
print("Cancelled")
return
# Delete
deleted = rag.delete_by_filter({"source": source})
print(f"✅ Deleted {deleted} documents")
def delete_by_type(
doc_type: str,
collection_name: str = "openclaw_knowledge"
):
"""Delete all documents of a specific type"""
rag = RAGSystem(collection_name=collection_name)
# Count matching docs first
results = rag.collection.get(where={"type": doc_type})
count = len(results['ids'])
if count == 0:
print(f"⚠️ No documents found with type: {doc_type}")
return
# Confirm
print(f"Found {count} documents of type: {doc_type}")
confirm = input("Delete them? (yes/no): ").strip().lower()
if confirm not in ['yes', 'y']:
print("Cancelled")
return
# Delete
deleted = rag.delete_by_filter({"type": doc_type})
print(f"✅ Deleted {deleted} documents")
def reset_collection(collection_name: str = "openclaw_knowledge"):
"""Delete all documents and reset the collection"""
print("⚠️ WARNING: This will delete ALL documents from the collection!")
# Double confirm
confirm1 = input("Type 'yes' to confirm: ").strip().lower()
if confirm1 != 'yes':
print("Cancelled")
return
confirm2 = input("Are you REALLY sure? This cannot be undone (type 'yes'): ").strip().lower()
if confirm2 != 'yes':
print("Cancelled")
return
rag = RAGSystem(collection_name=collection_name)
rag.reset_collection()
print("✅ Collection reset - all documents deleted")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Manage OpenClaw RAG knowledge base")
    parser.add_argument("action", choices=["stats", "add", "delete", "reset", "interactive"], help="Action to perform")
parser.add_argument("--collection", default="openclaw_knowledge", help="Collection name")
# Add arguments
parser.add_argument("--text", help="Document text (for add)")
parser.add_argument("--source", help="Document source (for add)")
parser.add_argument("--type", "--doc-type", help="Document type (for add)")
# Delete arguments
parser.add_argument("--by-source", help="Delete by source (for delete)")
parser.add_argument("--by-type", help="Delete by type (for delete)")
args = parser.parse_args()
if args.action == "stats":
show_stats(collection_name=args.collection)
elif args.action == "add":
if not args.text or not args.source:
print("❌ --text and --source required for add action")
sys.exit(1)
add_manual_document(
text=args.text,
source=args.source,
doc_type=args.type or "manual",
collection_name=args.collection
)
elif args.action == "delete":
if args.by_source:
delete_by_source(args.by_source, collection_name=args.collection)
elif args.by_type:
delete_by_type(args.by_type, collection_name=args.collection)
else:
print("❌ --by-source or --by-type required for delete action")
sys.exit(1)
elif args.action == "reset":
reset_collection(collection_name=args.collection)
elif args.action == "interactive":
print("🚀 OpenClaw RAG Manager - Interactive Mode\n")
while True:
print("\nActions:")
print(" 1. Show stats")
print(" 2. Add document")
print(" 3. Delete by source")
print(" 4. Delete by type")
print(" 5. Exit")
choice = input("\nChoose action (1-5): ").strip()
if choice == '1':
show_stats(collection_name=args.collection)
elif choice == '2':
text = input("Document text: ").strip()
source = input("Source: ").strip()
doc_type = input("Type (default: manual): ").strip() or "manual"
if text and source:
add_manual_document(text, source, doc_type, collection_name=args.collection)
elif choice == '3':
source = input("Source to delete: ").strip()
if source:
delete_by_source(source, collection_name=args.collection)
elif choice == '4':
doc_type = input("Type to delete: ").strip()
if doc_type:
delete_by_type(doc_type, collection_name=args.collection)
elif choice == '5':
print("👋 Goodbye!")
break
else:
print("❌ Invalid choice")


@@ -1,182 +0,0 @@
#!/usr/bin/env python3
"""
RAG Query - Search the OpenClaw knowledge base
"""
import sys
from pathlib import Path
# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent))
from rag_system import RAGSystem
def format_result(result: dict, index: int) -> str:
"""Format a single search result"""
metadata = result['metadata']
# Determine type
doc_type = metadata.get('type', 'unknown')
source = metadata.get('source', '?')
# Header based on type
if doc_type == 'session':
chunk_idx = metadata.get('chunk_index', '?')
header = f"\n📄 Session {source} (chunk {chunk_idx})"
elif doc_type == 'workspace':
header = f"\n📁 {source}"
elif doc_type == 'skill':
skill_name = metadata.get('skill_name', source)
header = f"\n📜 Skill: {skill_name}"
elif doc_type == 'memory':
header = f"\n🧠 Memory: {source}"
else:
header = f"\n🔹 {doc_type}: {source}"
# Format text (limit length)
text = result['text']
if len(text) > 1000:
text = text[:1000] + "..."
# Get date if available
info = []
if 'ingested_at' in metadata:
info.append(f"indexed {metadata['ingested_at'][:10]}")
# Chunk info
if 'chunk_index' in metadata and 'total_chunks' in metadata:
info.append(f"chunk {metadata['chunk_index']+1}/{metadata['total_chunks']}")
info_str = f" ({', '.join(info)})" if info else ""
return f"{header}{info_str}\n{text}"
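For illustration only, the type-to-header mapping above can be isolated into a standalone function (`pick_header` is a hypothetical helper, not part of this module, and the session chunk index is omitted for brevity):

```python
# Sketch of the header selection in format_result();
# pick_header is a hypothetical name, not part of rag_query.py
def pick_header(doc_type: str, source: str) -> str:
    headers = {
        "session": f"📄 Session {source}",
        "workspace": f"📁 {source}",
        "skill": f"📜 Skill: {source}",
        "memory": f"🧠 Memory: {source}",
    }
    # Unknown types fall through to a generic header
    return headers.get(doc_type, f"🔹 {doc_type}: {source}")

print(pick_header("skill", "weather"))   # 📜 Skill: weather
print(pick_header("note", "misc.md"))    # 🔹 note: misc.md
```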
def search(
query: str,
n_results: int = 10,
filters: dict = None,
collection_name: str = "openclaw_knowledge",
verbose: bool = True
) -> list:
"""
Search the RAG knowledge base
Args:
query: Search query
n_results: Number of results
filters: Metadata filters (e.g., {"type": "skill"})
collection_name: Collection name
verbose: Print results
Returns:
List of result dicts
"""
if verbose:
print(f"🔍 Query: {query}")
if filters:
print(f"🎯 Filters: {filters}")
print()
# Initialize RAG
rag = RAGSystem(collection_name=collection_name)
# Search
results = rag.search(query, n_results=n_results, filters=filters)
if not results:
if verbose:
print("❌ No results found")
return []
if verbose:
print(f"✅ Found {len(results)} results\n")
print("=" * 80)
for i, result in enumerate(results, 1):
print(format_result(result, i))
print("=" * 80)
return results
def interactive_search(collection_name: str = "openclaw_knowledge"):
"""Interactive search mode"""
print("🚀 OpenClaw RAG Search - Interactive Mode")
print("Type 'quit' or 'exit' to stop\n")
rag = RAGSystem(collection_name=collection_name)
# Show stats
stats = rag.get_stats()
print(f"📊 Collection: {stats['collection_name']}")
print(f" Total documents: {stats['total_documents']}")
print(f" Storage: {stats['persist_directory']}\n")
while True:
try:
query = input("\n🔍 Search query: ").strip()
if not query:
continue
if query.lower() in ['quit', 'exit', 'q']:
print("\n👋 Goodbye!")
break
# Parse filters if any
filters = None
if query.startswith("type:"):
parts = query.split(maxsplit=1)
if len(parts) > 1:
doc_type = parts[0].replace("type:", "")
query = parts[1]
filters = {"type": doc_type}
# Search
results = rag.search(query, n_results=10, filters=filters)
if results:
print(f"\n{len(results)} results:")
print("=" * 80)
for i, result in enumerate(results, 1):
print(format_result(result, i))
print("=" * 80)
else:
print("❌ No results found")
except KeyboardInterrupt:
print("\n\n👋 Goodbye!")
break
except Exception as e:
print(f"❌ Error: {e}")
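The `type:<t> <query>` prefix convention handled in the loop above can be sketched as a small standalone parser (`parse_query` is a hypothetical helper, not a function in this file):

```python
# Sketch of the "type:<t> <query>" prefix parsing used in interactive_search();
# parse_query is a hypothetical name, not part of rag_query.py
def parse_query(raw: str):
    filters = None
    if raw.startswith("type:"):
        parts = raw.split(maxsplit=1)
        if len(parts) > 1:
            # Everything after "type:" in the first token is the filter value
            filters = {"type": parts[0][len("type:"):]}
            raw = parts[1]
    return raw, filters

print(parse_query("type:skill weather api"))  # ('weather api', {'type': 'skill'})
print(parse_query("plain question"))          # ('plain question', None)
```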
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Search OpenClaw RAG knowledge base")
parser.add_argument("query", nargs="?", help="Search query (if not provided, enters interactive mode)")
parser.add_argument("-n", "--num-results", type=int, default=10, help="Number of results")
parser.add_argument("--type", help="Filter by document type (session, workspace, skill, memory)")
parser.add_argument("--collection", default="openclaw_knowledge", help="Collection name")
parser.add_argument("--interactive", "-i", action="store_true", help="Interactive mode")
args = parser.parse_args()
# Build filters
filters = None
if args.type:
filters = {"type": args.type}
if args.interactive or not args.query:
interactive_search(collection_name=args.collection)
else:
search(
query=args.query,
n_results=args.num_results,
filters=filters,
collection_name=args.collection
)


@@ -1,89 +0,0 @@
#!/usr/bin/env python3
"""
Quick RAG Query - Simple function to call from Python scripts or sessions
Usage:
from rag.rag_query_quick import search_context
results = search_context("your question here")
"""
import sys
from pathlib import Path
# Add parent directory
sys.path.insert(0, str(Path(__file__).parent))
from rag_system import RAGSystem
def search_context(
query: str,
n_results: int = 5,
collection_name: str = "openclaw_knowledge"
) -> str:
"""
Search the RAG knowledge base and return formatted results.
This is the simplest way to use RAG from within Python code.
Args:
query: Search question
n_results: Number of results to return
collection_name: ChromaDB collection name
Returns:
Formatted string with relevant context
Example:
>>> from rag.rag_query_quick import search_context
>>> context = search_context("How do I send SMS?")
>>> print(context)
"""
try:
rag = RAGSystem(collection_name=collection_name)
results = rag.search(query, n_results=n_results)
if not results:
return "No relevant context found in knowledge base."
output = []
output.append(f"🔍 Found {len(results)} relevant items:\n")
for i, result in enumerate(results, 1):
meta = result.get('metadata', {})
doc_type = meta.get('type', 'unknown')
source = meta.get('source', 'unknown')
# Format header
if doc_type == 'session':
header = f"📄 Session reference {i}"
elif doc_type == 'workspace':
header = f"📁 Code/Docs: {source}"
elif doc_type == 'skill':
header = f"📜 Skill: {source}"
else:
header = f"Reference {i}"
# Format content
text = result.get('text', '')
if len(text) > 600:
text = text[:600] + "..."
output.append(f"\n{header}\n{text}\n")
return '\n'.join(output)
except Exception as e:
return f"❌ RAG error: {e}"
# Test it when run directly
if __name__ == "__main__":
import sys
if len(sys.argv) < 2:
print("Usage: python3 rag_query_quick.py <query>")
sys.exit(1)
query = ' '.join(sys.argv[1:])
print(search_context(query))


@@ -1,119 +0,0 @@
#!/usr/bin/env python3
"""
RAG Query Wrapper - Simple function for the AI to call from within sessions
This is designed for automatic RAG integration. The AI can call this function
to retrieve relevant context from past conversations, code, and documentation.
Usage (from within Python script or session):
import sys
sys.path.insert(0, '/home/william/.openclaw/workspace/rag')
from rag_query_wrapper import search_knowledge
results = search_knowledge("your question")
print(results)
"""
import sys
from pathlib import Path
# Add RAG directory to path
rag_dir = Path(__file__).parent
sys.path.insert(0, str(rag_dir))
from rag_system import RAGSystem
def search_knowledge(query: str, n_results: int = 5) -> dict:
"""
Search the knowledge base and return structured results.
This is the primary function for automatic RAG integration.
Returns a structured dict with results for easy programmatic use.
Args:
query: Search query
n_results: Number of results to return
Returns:
dict with:
- query: the search query
- count: number of results found
- items: list of result dicts with text and metadata
"""
try:
rag = RAGSystem()
results = rag.search(query, n_results=n_results)
items = []
for result in results:
meta = result.get('metadata', {})
items.append({
'text': result.get('text', ''),
'type': meta.get('type', 'unknown'),
'source': meta.get('source', 'unknown'),
'chunk_index': meta.get('chunk_index', 0),
'date': meta.get('date', '')
})
return {
'query': query,
'count': len(items),
'items': items
}
except Exception as e:
return {
'query': query,
'count': 0,
'items': [],
'error': str(e)
}
def format_for_ai(results: dict) -> str:
"""
Format RAG results for AI consumption.
Args:
results: dict from search_knowledge()
Returns:
Formatted string suitable for insertion into AI context
"""
if results['count'] == 0:
return ""
output = [f"📚 Found {results['count']} relevant items from knowledge base:\n"]
for item in results['items']:
doc_type = item['type']
source = item['source']
text = item['text']
if doc_type == 'session':
header = f"📄 Past Conversation ({source})"
elif doc_type == 'workspace':
header = f"📁 Code/Documentation ({source})"
elif doc_type == 'skill':
header = f"📜 Skill Guide ({source})"
else:
header = f"🔹 Reference ({doc_type})"
# Truncate if too long
if len(text) > 700:
text = text[:700] + "..."
output.append(f"\n{header}\n{text}\n")
return '\n'.join(output)
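The 700-character truncation rule applied in `format_for_ai()` above, isolated for clarity (`truncate` is a hypothetical helper name; the module inlines this logic):

```python
# Sketch of the truncation in format_for_ai(); truncate is a
# hypothetical helper, not part of rag_query_wrapper.py
def truncate(text: str, limit: int = 700) -> str:
    return text if len(text) <= limit else text[:limit] + "..."

print(len(truncate("x" * 1000)))  # 703 (700 chars plus "...")
print(truncate("short"))          # short
```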
# Test function
def _test():
"""Quick test of RAG integration"""
results = search_knowledge("Reddit account automation", n_results=3)
print(format_for_ai(results))
if __name__ == "__main__":
_test()


@@ -1,289 +0,0 @@
#!/usr/bin/env python3
"""
OpenClaw RAG System - Core Library
Manages vector store, ingestion, and retrieval with ChromaDB
"""
import os
import json
import hashlib
from pathlib import Path
from typing import List, Dict, Optional
from datetime import datetime
try:
import chromadb
from chromadb.config import Settings
CHROMADB_AVAILABLE = True
except ImportError:
CHROMADB_AVAILABLE = False
class RAGSystem:
"""OpenClaw RAG System for knowledge retrieval"""
def __init__(
self,
persist_directory: str = None,
collection_name: str = "openclaw_knowledge",
embedding_model: str = "all-MiniLM-L6-v2"
):
"""
Initialize RAG system
Args:
persist_directory: Where ChromaDB stores data
collection_name: Name of the collection
embedding_model: Embedding model name (informational; ChromaDB's default embedding function is used)
"""
if not CHROMADB_AVAILABLE:
raise ImportError("chromadb not installed. Run: pip3 install chromadb")
self.collection_name = collection_name
# Default to ~/.openclaw/data/rag if not specified
if persist_directory is None:
persist_directory = os.path.expanduser("~/.openclaw/data/rag")
self.persist_directory = Path(persist_directory)
self.persist_directory.mkdir(parents=True, exist_ok=True)
# Initialize ChromaDB client
self.client = chromadb.PersistentClient(
path=str(self.persist_directory),
settings=Settings(
anonymized_telemetry=False,
allow_reset=True
)
)
# Get or create collection
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={
"created": datetime.now().isoformat(),
"description": "OpenClaw knowledge base"
}
)
def add_document(
self,
text: str,
metadata: Dict,
doc_id: Optional[str] = None
) -> str:
"""
Add a document to the vector store
Args:
text: Document content
metadata: Document metadata (type, source, date, etc.)
doc_id: Optional document ID (auto-generated if not provided)
Returns:
Document ID
"""
# Generate ID if not provided (include more context for uniqueness)
if doc_id is None:
unique_str = ":".join([
metadata.get('type', 'unknown'),
metadata.get('source', 'unknown'),
metadata.get('date', datetime.now().isoformat()),
str(metadata.get('chunk_index', '0')), # Convert to string!
text[:200]
])
doc_id = hashlib.md5(unique_str.encode()).hexdigest()
# Add to collection
self.collection.add(
documents=[text],
metadatas=[metadata],
ids=[doc_id]
)
return doc_id
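The deterministic ID scheme above can be reproduced in isolation (`make_doc_id` is a hypothetical name; the module inlines this logic, and this sketch defaults the date to an empty string so repeated runs stay deterministic):

```python
# Sketch of the doc-ID generation in add_document();
# make_doc_id is a hypothetical helper, not part of rag_system.py
import hashlib

def make_doc_id(metadata: dict, text: str) -> str:
    unique_str = ":".join([
        metadata.get("type", "unknown"),
        metadata.get("source", "unknown"),
        metadata.get("date", ""),  # empty default keeps IDs stable across runs
        str(metadata.get("chunk_index", "0")),  # chunk_index may be int; normalize
        text[:200],
    ])
    return hashlib.md5(unique_str.encode()).hexdigest()

meta = {"type": "test", "source": "a.md", "date": "2026-02-11", "chunk_index": 0}
print(make_doc_id(meta, "hello"))  # 32-char hex digest, stable across runs
```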
def add_documents_batch(
self,
documents: List[Dict],
batch_size: int = 100
) -> List[str]:
"""
Add multiple documents efficiently
Args:
documents: List of {"text": str, "metadata": dict, "id": optional} dicts
batch_size: Number of documents to add per batch
Returns:
List of document IDs
"""
all_ids = []
for i in range(0, len(documents), batch_size):
batch = documents[i:i + batch_size]
texts = [doc["text"] for doc in batch]
metadatas = [doc["metadata"] for doc in batch]
ids = [doc.get("id", hashlib.md5(
f"{doc['metadata'].get('type', 'unknown')}:{doc['metadata'].get('source', 'unknown')}:{doc['metadata'].get('date', '')}:{str(doc['metadata'].get('chunk_index', '0'))}:{doc['text'][:100]}".encode()
).hexdigest()) for doc in batch]
self.collection.add(
documents=texts,
metadatas=metadatas,
ids=ids
)
all_ids.extend(ids)
print(f"✅ Added batch {i//batch_size + 1}: {len(ids)} documents")
return all_ids
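The slicing loop above partitions the document list into fixed-size batches; the pattern in isolation:

```python
# The batch partitioning used in add_documents_batch(), in isolation
docs = list(range(250))
batch_size = 100
# Step through the list in batch_size strides; the last batch may be short
batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
print([len(b) for b in batches])  # [100, 100, 50]
```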
def search(
self,
query: str,
n_results: int = 10,
filters: Optional[Dict] = None
) -> List[Dict]:
"""
Search for relevant documents
Args:
query: Search query
n_results: Number of results to return
filters: Optional metadata filters
Returns:
List of {"text": str, "metadata": dict, "id": str, "score": float} dicts
"""
results = self.collection.query(
query_texts=[query],
n_results=n_results,
where=filters,
include=["documents", "metadatas", "distances"]
)
# Format results
formatted = []
for i, doc_id in enumerate(results['ids'][0]):
formatted.append({
"id": doc_id,
"text": results['documents'][0][i],
"metadata": results['metadatas'][0][i],
# ChromaDB returns distances (smaller = more similar);
# expose the raw distance plus a monotone similarity-style score
"distance": results['distances'][0][i],
"score": 1.0 / (1.0 + results['distances'][0][i])
})
return formatted
def delete_document(self, doc_id: str) -> bool:
"""Delete a document by ID"""
try:
self.collection.delete(ids=[doc_id])
return True
except Exception as e:
print(f"❌ Error deleting document {doc_id}: {e}")
return False
def delete_by_filter(self, filter_dict: Dict) -> int:
"""
Delete documents by metadata filter
Args:
filter_dict: Filter criteria (e.g., {"source": "session-2026-02-10"})
Returns:
Number of documents deleted
"""
# First, find matching IDs
results = self.collection.get(where=filter_dict)
if not results['ids']:
return 0
count = len(results['ids'])
self.collection.delete(ids=results['ids'])
print(f"✅ Deleted {count} documents matching filter")
return count
def get_stats(self) -> Dict:
"""Get statistics about the collection"""
count = self.collection.count()
# Sample a handful of documents (metadata only) to sketch the distribution
sample = self.collection.get(limit=10)
# Count by source/type over the sample (not the full collection)
source_counts = {}
type_counts = {}
for metadata in sample['metadatas']:
source = metadata.get('source', 'unknown')
doc_type = metadata.get('type', 'unknown')
source_counts[source] = source_counts.get(source, 0) + 1
type_counts[doc_type] = type_counts.get(doc_type, 0) + 1
return {
"total_documents": count,
"collection_name": self.collection_name,
"persist_directory": str(self.persist_directory),
"source_distribution": source_counts,
"type_distribution": type_counts
}
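The source/type tallying in `get_stats()` is a plain dictionary count; a standalone sketch with fabricated sample metadata:

```python
# The tallying logic from get_stats(), with fabricated sample metadata
metadatas = [
    {"type": "skill", "source": "weather/SKILL.md"},
    {"type": "skill", "source": "sms/SKILL.md"},
    {"type": "session", "source": "2026-02-10.jsonl"},
]
type_counts = {}
for m in metadatas:
    t = m.get("type", "unknown")
    type_counts[t] = type_counts.get(t, 0) + 1
print(type_counts)  # {'skill': 2, 'session': 1}
```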
def reset_collection(self):
"""Delete all documents and reset the collection"""
# ChromaDB rejects an empty `where` filter, so drop and recreate the collection
self.client.delete_collection(self.collection_name)
self.collection = self.client.get_or_create_collection(name=self.collection_name)
print("✅ Collection reset - all documents deleted")
def close(self):
"""Close the connection"""
# ChromaDB PersistentClient doesn't need explicit close
pass
def main():
"""Test the RAG system"""
print("🚀 Testing OpenClaw RAG System...\n")
# Initialize
rag = RAGSystem()
print(f"✅ Initialized RAG system")
print(f" Collection: {rag.collection_name}")
print(f" Storage: {rag.persist_directory}\n")
# Add test document
test_doc = {
"text": "OpenClaw is a personal AI assistant with tools for automation, messaging, and infrastructure management. It supports Discord, Telegram, SMS via VoIP.ms, and more.",
"metadata": {
"type": "test",
"source": "test-initialization",
"date": datetime.now().isoformat()
}
}
doc_id = rag.add_document(test_doc["text"], test_doc["metadata"])
print(f"✅ Added test document: {doc_id}\n")
# Search
results = rag.search(
query="What messaging platforms does OpenClaw support?",
n_results=5
)
print("🔍 Search Results:")
for i, result in enumerate(results, 1):
print(f"\n{i}. [{result['metadata'].get('source', '?')}]")
print(f" {result['text'][:200]}...")
# Stats
stats = rag.get_stats()
print(f"\n📊 Stats:")
print(f" Total documents: {stats['total_documents']}")
if __name__ == "__main__":
main()


@@ -1,128 +0,0 @@
#!/bin/bash
# RAG Automatic Ingestion - Daily update of knowledge base
# This script checks for new content and intelligently updates the RAG system
set -e
# Paths (use $HOME rather than a hard-coded home directory)
RAG_DIR="$HOME/.openclaw/workspace/rag"
STATE_FILE="$HOME/.openclaw/workspace/memory/rag-auto-state.json"
LOG_FILE="$HOME/.openclaw/workspace/memory/rag-auto-update.log"
# Timestamp
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
DATE=$(date +%Y-%m-%d)
# Create memory directory
mkdir -p "$(dirname "$STATE_FILE")"
# Initialize state if needed
if [ ! -f "$STATE_FILE" ]; then
cat > "$STATE_FILE" << EOF
{
"lastSessionIndex": 0,
"lastWorkspaceIndex": 0,
"lastSkillsIndex": 0,
"updatedAt": "$TIMESTAMP"
}
EOF
fi
# log function
log() {
echo "[$TIMESTAMP] $1" | tee -a "$LOG_FILE"
}
# Get latest session file modification time (epoch seconds; "0" if none found)
latest_session_time() {
find ~/.openclaw/agents/main/sessions -name "*.jsonl" -type f -printf '%T@\n' 2>/dev/null | sort -rn | head -1 | cut -d. -f1 | grep . || echo "0"
}
log "=== RAG Auto-Update Started ==="
# Get current stats
SESSION_COUNT=$(find ~/.openclaw/agents/main/sessions -name "*.jsonl" | wc -l)
WORKSPACE_COUNT=$(find ~/.openclaw/workspace -type f \( -name "*.py" -o -name "*.js" -o -name "*.md" -o -name "*.json" \) | wc -l)
LATEST_SESSION=$(latest_session_time)
# Read last indexed timestamp
LAST_SESSION_INDEX=$(python3 -c "import json; print(json.load(open('$STATE_FILE')).get('lastSessionIndex', 0))" 2>/dev/null || echo "0")
log "Current status:"
log " Sessions: $SESSION_COUNT files"
log " Workspace: $WORKSPACE_COUNT searchable files"
log " Latest session: $LATEST_SESSION"
log " Last indexed: $LAST_SESSION_INDEX"
# Update sessions if new ones exist
if [ "$LATEST_SESSION" -gt "$LAST_SESSION_INDEX" ]; then
log "✓ New/updated sessions detected, re-indexing..."
cd "$RAG_DIR"
# Test the command directly: with `set -e`, a separate `$?` check never runs
if python3 ingest_sessions.py --sessions-dir ~/.openclaw/agents/main/sessions >> "$LOG_FILE" 2>&1; then
log "✅ Sessions re-indexed successfully"
else
log "❌ Session indexing failed (see log)"
exit 1
fi
else
log "✓ Sessions up to date (no new files)"
fi
# Update workspace (always do it - captures code changes)
log "Re-indexing workspace files..."
cd "$RAG_DIR"
if python3 ingest_docs.py workspace >> "$LOG_FILE" 2>&1; then
log "✅ Workspace re-indexed successfully"
else
log "❌ Workspace indexing failed (see log)"
exit 1
fi
# Update skills
log "Re-indexing skills..."
cd "$RAG_DIR"
if python3 ingest_docs.py skills >> "$LOG_FILE" 2>&1; then
log "✅ Skills re-indexed successfully"
else
log "❌ Skills indexing failed (see log)"
exit 1
fi
# Get document count
# Get document count (rag_system.py exposes the RAGSystem class, not get_collection)
DOC_COUNT=$(cd "$RAG_DIR" && python3 -c "
import sys
sys.path.insert(0, '.')
from rag_system import RAGSystem
print(RAGSystem().collection.count())
" 2>/dev/null || echo "unknown")
# Update state
python3 << EOF
import json
doc_count = "$DOC_COUNT"
state = {
"lastSessionIndex": $LATEST_SESSION,
"lastWorkspaceIndex": $(date +%s),
"lastSkillsIndex": $(date +%s),
"updatedAt": "$TIMESTAMP",
# Keep valid JSON even when the count could not be read ("unknown")
"totalDocuments": int(doc_count) if doc_count.isdigit() else None,
"sessionCount": $SESSION_COUNT
}
with open('$STATE_FILE', 'w') as f:
    json.dump(state, f, indent=2)
EOF
log "=== RAG Auto-Update Complete ==="
log "Total documents in knowledge base: $DOC_COUNT"
log "Next run: $(date -u -d '+24 hours' +%Y-%m-%dT%H:%M:%SZ)"
exit 0