Add GitHub Pages landing page
- Obsidian dark theme design
- Links to ClawHub, GitHub, and william.mantly.vip
- Feature comparison with web-search RAG tools
- Responsive layout with terminal UI preview
CHANGELOG.md
# Changelog

All notable changes to the OpenClaw RAG Knowledge System will be documented in this file.

## [1.0.0] - 2026-02-11

### Added

- Initial release of RAG Knowledge System for OpenClaw
- Semantic search using ChromaDB with all-MiniLM-L6-v2 embeddings
- Multi-source indexing: sessions, workspace files, skill documentation
- CLI tools: rag_query.py, rag_manage.py, ingest_sessions.py, ingest_docs.py
- Python API: rag_query_wrapper.py for programmatic access
- Automatic integration wrapper: rag_context.py for transparent RAG queries
- RAG-enhanced agent wrapper: rag_agent.py
- Type filtering: search by document type (session, workspace, skill, memory)
- Document management: add, delete, reset collection
- Batch ingestion with intelligent chunking
- Session parser for OpenClaw event format
- Automatic daily updates via cron job
- Comprehensive documentation: README.md, SKILL.md
### Features

- **Semantic Search**: Find relevant context by meaning, not keywords
- **Local Vector Store**: ChromaDB with persistent storage (~100MB per 1,000 docs)
- **Zero Dependencies**: No API keys required (all-MiniLM-L6-v2 is free and local)
- **Smart Chunking**: Messages grouped by 20 with overlap for context
- **Multi-Format Support**: Python, JavaScript, Markdown, JSON, YAML, shell scripts
- **Automatic Updates**: Scheduled cron job runs daily at 4:00 AM UTC
- **State Tracking**: Avoids re-processing unchanged files
- **Debug Mode**: Verbose output for troubleshooting
### Bug Fixes

- Fixed duplicate ID errors by including chunk_index in hash generation
- Fixed session parser to handle the OpenClaw event format correctly
- Fixed metadata conversion errors (all metadata values are now stored as strings)
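The duplicate-ID fix above amounts to folding the chunk index into the document hash. A minimal sketch of the idea (the actual ingestion code may differ):

```python
import hashlib

def make_doc_id(source: str, chunk_index: int, text: str) -> str:
    # Including chunk_index means two chunks with identical text from the
    # same source still get distinct IDs, avoiding "IDs must be unique" errors.
    payload = f"{source}:{chunk_index}:{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]
```

Without `chunk_index` in the payload, repeated boilerplate (identical chunks in one session) would hash to the same ID.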
### Performance

- Indexing speed: ~1,000 docs/minute
- Search time: <100ms (after embedding load)
- Embedding model: 79MB (cached locally)
- Storage: ~100MB per 1,000 documents
### Documentation

- Complete SKILL.md with OpenClaw integration guide
- Comprehensive README.md with examples and troubleshooting
- Inline help in all CLI tools
- Best practices and limitations documented

---
## [1.0.1] - 2026-02-11

### Added

- `package.json` with complete OpenClaw skill metadata
- `CHANGELOG.md` for version tracking
- `LICENSE` (MIT) for proper licensing

### Changed

- `package.json` explicitly declares NO required environment variables (fully local system)
- Documented data storage path: `~/.openclaw/data/rag/`
- Enhanced `README.md` with clearer installation instructions
- Added references to CHANGELOG, LICENSE, and package.json in README
- Clarified that no API keys or credentials are required

### Documentation

- Improved documentation transparency to meet security scanner best practices
- Clearly documented the fully local nature of the system (no external dependencies)

---
## [1.0.2] - 2026-02-12

### Added

- YAML front matter to SKILL.md with `name: rag` and `description` for ClawHub compatibility
- `Security Considerations` section documenting privacy implications and sensitive data risks
- `scripts/rag-auto-update.sh` included in skill package (previously in separate location)
- `.skill` package for ClawHub distribution (28KB, 14 files)

### Changed

- Updated package.json description to match SKILL.md front matter
- Documented auto-update script behavior for security review (local-only ingestion)
- Clarified ChromaDB storage location and data deletion procedures

### Fixed

- **Cron job HTTP 500 errors**: Changed from `sessionTarget: "main"` to `isolated` to avoid flooding chat with thousands of lines of output
- **Cron schedule**: Fixed from `0 4 * * *` to `0 0 * * *` to match actual midnight UTC execution time

### Security

- Documented that RAG indexes all session transcripts and workspace files (may contain API keys, credentials, private messages)
- Added recommendations for privacy-conscious use: review sessions before ingestion, use `rag_manage.py reset` to delete all indexed data
- Confirmed auto-update script only runs local ingestion scripts - no remote code fetching

### Documentation

- Added detailed security warnings in SKILL.md
- Explained how to delete ChromaDB persistence directory (`~/.openclaw/data/rag/`)
- Provided guidance on redacting sensitive data before ingestion

---

## [1.0.3] - 2026-02-12

### Fixed

- **Hard-coded paths**: Replaced all absolute paths with dynamic resolution
  - `rag_context.py`: Now uses `os.path.dirname(os.path.abspath(__file__))`
  - `scripts/rag-auto-update.sh`: Uses `$HOME`, `OPENCLAW_DIR`, and relative paths
  - Removed hard-coded `/home/william/.openclaw/` references
  - All scripts now portable across different user environments

### Changed

- **Documentation**: Updated SKILL.md with path portability notes
  - Documented that all paths use dynamic resolution
  - Confirmed no custom network calls or external telemetry
  - Added "Network Calls" section addressing security scan concerns
- **rag_query_wrapper.py**: Removed hard-coded path example from docstring

### Security

- Verified: `rag_system.py` has no network calls (only imports chromadb)
- Verified: `scripts/rag-auto-update.sh` has no network activity
- Confirmed: ChromaDB telemetry is disabled (`anonymized_telemetry=False`)
- Confirmed: All processing and storage is local-only

### Addressed Feedback

- Fixed ClawHub security scan concerns about hard-coded paths
- Fixed concerns about missing code review (rag_system.py is fully auditable)
- Documented network behavior (only model download by ChromaDB on first run)

---

## [Unreleased]

### Planned

- API documentation indexing from external URLs
- Automatic re-indexing on file system events (inotify)
- Better chunking strategies for long documents
- Integration with external vector stores (Pinecone, Weaviate)
- Webhook notifications for automated content processing
- Hybrid search (semantic + keyword)
- Query history and analytics
- Export/import of vector database

---
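The dynamic path resolution described in 1.0.3 boils down to deriving every location from the script file and the home directory instead of a fixed user path. A sketch of the pattern (function and variable names here are illustrative, not the exact ones in the scripts):

```python
import os
from pathlib import Path

def resolve_paths(script_file: str, env: dict) -> dict:
    # Same idea as the rag_context.py fix: derive the script directory with
    # os.path.dirname(os.path.abspath(__file__)) instead of hard-coding it.
    script_dir = os.path.dirname(os.path.abspath(script_file))
    # Same idea as rag-auto-update.sh: honor $OPENCLAW_DIR, fall back to $HOME.
    openclaw_dir = Path(env.get("OPENCLAW_DIR", str(Path.home() / ".openclaw")))
    return {"script_dir": script_dir, "rag_data": openclaw_dir / "data" / "rag"}
```

In the real scripts this is done at import time with `__file__` and `os.environ`; the function form here just makes the derivation explicit.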
## Version Guidelines

This project follows [Semantic Versioning](https://semver.org/):

- **MAJOR** version: Incompatible API changes
- **MINOR** version: Backwards-compatible functionality additions
- **PATCH** version: Backwards-compatible bug fixes

## Categories

- **Added**: New features
- **Changed**: Changes in existing functionality
- **Deprecated**: Soon-to-be removed features
- **Removed**: Removed features
- **Fixed**: Bug fixes
- **Security**: Security vulnerabilities
LICENSE
MIT License

Copyright (c) 2026 Nova AI Assistant (for William Mantly - Theta42)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
# OpenClaw RAG Knowledge System

Full-featured Retrieval-Augmented Generation (RAG) system for OpenClaw - search across chat history, code, documentation, and skills with semantic understanding.

## Features

- **Semantic Search**: Find relevant context by meaning, not just keywords
- **Multi-Source Indexing**: Sessions, workspace files, skill documentation
- **Local Vector Store**: ChromaDB with built-in embeddings (no API keys required)
- **Automatic Integration**: AI automatically consults the knowledge base when responding
- **Type Filtering**: Search by document type (session, workspace, skill, memory)
- **Management Tools**: Add/remove documents, view statistics, reset the collection
## Quick Start

### Installation

```bash
# Install the Python dependency
cd ~/.openclaw/workspace/rag
python3 -m pip install --user chromadb
```

**No API keys required** - this system is fully local:

- Embeddings: all-MiniLM-L6-v2 (downloaded once, 79MB)
- Vector store: ChromaDB (persistent disk storage)
- Data location: `~/.openclaw/data/rag/` (auto-created)

All operations run offline with no external dependencies besides the initial embedding model download.
### Index Your Data

```bash
# Index all chat sessions
python3 ingest_sessions.py

# Index workspace code and docs
python3 ingest_docs.py workspace

# Index skill documentation
python3 ingest_docs.py skills
```

### Search the Knowledge Base

```bash
# Interactive search mode
python3 rag_query.py -i

# Quick search
python3 rag_query.py "how to send SMS"

# Search by type
python3 rag_query.py "voip.ms" --type session
python3 rag_query.py "Porkbun DNS" --type skill
```
### Integration in Python Code

```python
import os
import sys

# Resolve the RAG directory from the home directory instead of a fixed user path
sys.path.insert(0, os.path.expanduser('~/.openclaw/workspace/rag'))
from rag_query_wrapper import search_knowledge

# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} results")

# Format for AI consumption
from rag_query_wrapper import format_for_ai
context = format_for_ai(results)
print(context)
```
## Architecture

```
rag/
├── rag_system.py          # Core RAG class (ChromaDB wrapper)
├── ingest_sessions.py     # Load chat history from sessions
├── ingest_docs.py         # Load workspace files & skill docs
├── rag_query.py           # Search the knowledge base
├── rag_manage.py          # Document management
├── rag_query_wrapper.py   # Simple Python API
└── SKILL.md               # OpenClaw skill documentation
```

Data storage: `~/.openclaw/data/rag/` (ChromaDB persistent storage)
## Usage Examples

### Find Past Solutions

When you encounter a problem, search for similar past issues:

```bash
python3 rag_query.py "cloudflare bypass failed selenium"
python3 rag_query.py "voip.ms SMS client"
python3 rag_query.py "porkbun DNS API"
```

### Search Through Codebase

Find code and documentation across your entire workspace:

```bash
python3 rag_query.py --type workspace "chromedriver setup"
python3 rag_query.py --type workspace "unifi gateway API"
```

### Access Skill Documentation

Quick reference for any OpenClaw skill:

```bash
python3 rag_query.py --type skill "how to check UniFi"
python3 rag_query.py --type skill "Porkbun DNS management"
```

### Manage Knowledge Base

```bash
# View statistics
python3 rag_manage.py stats

# Delete all sessions
python3 rag_manage.py delete --by-type session

# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"
```
## How It Works

### Document Ingestion

1. **Session transcripts**: Process chat history from `~/.openclaw/agents/main/sessions/*.jsonl`
   - Handles the OpenClaw event format (session metadata, messages, tool calls)
   - Chunks messages into groups of 20 with overlap
   - Extracts and formats thinking, tool calls, and results

2. **Workspace files**: Scans the workspace for code, docs, configs
   - Supports: `.py`, `.js`, `.ts`, `.md`, `.json`, `.yaml`, `.sh`, `.html`, `.css`
   - Skips files > 1MB and binary files
   - Chunks long documents

3. **Skills**: Indexes all `SKILL.md` files
   - Captures skill documentation and usage examples
   - Organized by skill name
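The grouping-with-overlap step can be sketched as follows. This is an illustration of the chunking described above, with assumed defaults of 20 messages per chunk and 5 messages of overlap, not the exact ingestion code:

```python
def chunk_messages(messages, chunk_size=20, overlap=5):
    # Overlapping windows: the last `overlap` messages of one chunk reappear
    # at the start of the next, so context spanning a boundary is preserved.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(messages), step):
        chunks.append(messages[start:start + chunk_size])
        if start + chunk_size >= len(messages):
            break
    return chunks
```

For a 50-message session this yields chunks covering messages 0-19, 15-34, and 30-49.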
### Semantic Search

ChromaDB uses the `all-MiniLM-L6-v2` embedding model (79MB) to convert text into vector representations. Similar meanings cluster together, enabling semantic search beyond keyword matching.

### Automatic RAG Integration

When the AI responds to a question that could benefit from context, it automatically:

1. Searches the knowledge base
2. Retrieves relevant past conversations, code, or docs
3. Includes that context in the response

This happens transparently - the AI just "knows" about your past work.
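The three steps above can be outlined in a few lines. `search` and `llm` here are hypothetical stand-ins, not the actual OpenClaw interfaces:

```python
def answer_with_context(question, search, llm):
    # 1. Search the knowledge base
    hits = search(question)
    # 2. Retrieve the relevant snippets
    context = "\n".join(hit["text"] for hit in hits)
    # 3. Include that context in the prompt the model answers from
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```

The user never sees this assembly step; only the final answer reaches the chat.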
## Configuration

### Custom Session Directory

```bash
python3 ingest_sessions.py --sessions-dir /path/to/sessions
```

### Chunk Size Control

```bash
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
```

### Custom Collection Name

```python
from rag_system import RAGSystem

rag = RAGSystem(collection_name="my_knowledge")
```
## Data Types

| Type | Source | Description |
|------|--------|-------------|
| **session** | `session:{key}` | Chat history transcripts |
| **workspace** | `relative/path` | Code, configs, docs |
| **skill** | `skill:{name}` | Skill documentation |
| **memory** | `MEMORY.md` | Long-term memory entries |
| **manual** | `{custom}` | Manually added docs |
| **api** | `api-docs:{name}` | API documentation |
## Performance

- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
- **Storage**: ~100MB per 1,000 documents
- **Indexing time**: ~1,000 docs/min
- **Search time**: <100ms (after the first query loads embeddings)
## Troubleshooting

### No Results Found

- Check if anything is indexed: `python3 rag_manage.py stats`
- Try broader queries or different wording
- Try without filters: remove `--type` if you are using it

### Slow First Search

The first search after ingestion loads embeddings (~1-2 seconds). Subsequent searches are much faster.

### Memory Issues

Reset the collection if needed:

```bash
python3 rag_manage.py reset
```

### Duplicate ID Errors

If you see "Expected IDs to be unique" errors:

1. Reset the collection
2. Re-run ingestion
3. The fix includes `chunk_index` in ID generation

### ChromaDB Download Stuck

On first run, ChromaDB downloads the embedding model (~79MB). This takes 1-2 minutes. Let it complete.
## Automatic Updates

### Setup Scheduled Indexing

The RAG system includes an automatic update script that runs daily:

```bash
# Manual test
bash ~/.openclaw/workspace/scripts/rag-auto-update.sh
```

**What it does:**
- Detects new/updated chat sessions and re-indexes them
- Re-indexes workspace files (captures code changes)
- Updates skill documentation
- Maintains state to avoid re-processing unchanged files
- Runs via cron at 4:00 AM UTC daily

**Configuration:**
```bash
# View cron job
openclaw cron list

# Edit schedule (if needed)
openclaw cron update <job-id> --schedule "{\"expr\":\"0 4 * * *\"}"
```

**State tracking:** `~/.openclaw/workspace/memory/rag-auto-state.json`
**Log file:** `~/.openclaw/workspace/memory/rag-auto-update.log`
## Best Practices

### Automatic Update Enabled

The RAG system now updates automatically every day - no manual re-indexing needed.

After significant work, you can still update manually:

```bash
bash ~/.openclaw/workspace/scripts/rag-auto-update.sh
```

### Use Specific Queries

Focused queries give better results:

```bash
# Good
python3 rag_query.py "voip.ms getSMS API method"

# Less specific
python3 rag_query.py "API"
```

### Filter by Type

When you know the data type:

```bash
# Looking for code
python3 rag_query.py --type workspace "chromedriver"

# Looking for past conversations
python3 rag_query.py --type session "SMS"
```

### Document Decisions

After important decisions, add them to the knowledge base:

```bash
python3 rag_manage.py add \
  --text "Decision: Use Playwright, not Selenium, for Reddit automation. Reason: handles Cloudflare challenges better. Date: 2026-02-11" \
  --source "decision:reddit-automation" \
  --type "decision"
```
## Limitations

- Files > 1MB are automatically skipped (performance)
- First search is slower (embedding load)
- Requires ~100MB disk space per 1,000 documents
- Python 3.7+ required

## License

MIT License - free to use and modify

## Contributing

Contributions welcome! Areas for improvement:

- API documentation indexing from external URLs
- File system watch for automatic re-indexing
- Better chunking strategies for long documents
- Integration with external vector stores (Pinecone, Weaviate)

## Documentation Files

- **CHANGELOG.md** - Version history and changes
- **SKILL.md** - OpenClaw skill integration guide
- **package.json** - Skill metadata (no credentials required)
- **LICENSE** - MIT License

## Author

Nova AI Assistant for William Mantly (Theta42)

## Repository

https://git.theta42.com/nova/openclaw-rag-skill

Published on: clawhub.com
SKILL.md
---
name: rag
description: Complete RAG (Retrieval-Augmented Generation) system for OpenClaw. Indexes chat sessions, workspace code, documentation, and skills into local ChromaDB for semantic search. Enables finding past solutions, code patterns, and decisions instantly. Uses local embeddings (all-MiniLM-L6-v2) with no API keys required. Automatically ingests and updates knowledge base from ~/.openclaw/agents/main/sessions and workspace files.
---

# OpenClaw RAG Knowledge System

**Retrieval-Augmented Generation for OpenClaw – Search chat history, code, docs, and skills with semantic understanding**

## Overview

This skill provides a complete RAG (Retrieval-Augmented Generation) system for OpenClaw. It indexes your entire knowledge base – chat transcripts, workspace code, skill documentation – and enables semantic search across everything.

**Key features:**
- 🧠 Semantic search across all conversations and code
- 📚 Automatic knowledge base management
- 🔍 Find past solutions, code patterns, decisions instantly
- 💾 Local ChromaDB storage (no API keys required)
- 🚀 Automatic AI integration – retrieves context transparently
## Installation

### Prerequisites

- Python 3.7+
- OpenClaw workspace

### Setup

```bash
# Navigate to your OpenClaw workspace
cd ~/.openclaw/workspace/skills/rag-openclaw

# Install ChromaDB (one-time)
pip3 install --user chromadb

# That's it!
```
## Quick Start

### 1. Index Your Knowledge

```bash
# Index all chat history
python3 ingest_sessions.py

# Index workspace code and docs
python3 ingest_docs.py workspace

# Index skill documentation
python3 ingest_docs.py skills
```

### 2. Search the Knowledge Base

```bash
# Interactive search mode
python3 rag_query.py -i

# Quick search
python3 rag_query.py "how to send SMS via voip.ms"

# Search by type
python3 rag_query.py "porkbun DNS" --type skill
python3 rag_query.py "chromedriver" --type workspace
python3 rag_query.py "Reddit automation" --type session
```

### 3. Check Statistics

```bash
# See what's indexed
python3 rag_manage.py stats
```
## Usage Examples

### Finding Past Solutions

Hit a problem? Search for how you solved it before:

```bash
python3 rag_query.py "cloudflare bypass selenium"
python3 rag_query.py "voip.ms SMS configuration"
python3 rag_query.py "porkbun update DNS record"
```

### Searching Through Codebase

Find specific code or documentation:

```bash
python3 rag_query.py --type workspace "unifi gateway API"
python3 rag_query.py --type workspace "SMS client"
```

### Quick Reference

Access skill documentation without digging through files:

```bash
python3 rag_query.py --type skill "how to monitor UniFi"
python3 rag_query.py --type skill "Porkbun tool usage"
```
### Programmatic Use

From within Python scripts or OpenClaw sessions:

```python
import os
import sys

# Resolve the skill directory from the home directory instead of a fixed user path
sys.path.insert(0, os.path.expanduser('~/.openclaw/workspace/skills/rag-openclaw'))
from rag_query_wrapper import search_knowledge, format_for_ai

# Search and get structured results
results = search_knowledge("Reddit account automation")
print(f"Found {results['count']} relevant items")

# Format for AI consumption
context = format_for_ai(results)
print(context)
```
## Files Reference

| File | Purpose |
|------|---------|
| `rag_system.py` | Core RAG class (ChromaDB wrapper) |
| `ingest_sessions.py` | Index chat history |
| `ingest_docs.py` | Index workspace files & skills |
| `rag_query.py` | Search interface (CLI & interactive) |
| `rag_manage.py` | Document management (stats, delete, reset) |
| `rag_query_wrapper.py` | Simple Python API for programmatic use |
| `README.md` | Full documentation |
## How It Works

### Indexing

**Sessions:**
- Reads `~/.openclaw/agents/main/sessions/*.jsonl`
- Handles OpenClaw event format (session metadata, messages, tool calls)
- Chunks messages (20 per chunk, 5-message overlap)
- Extracts and formats thinking, tool calls, results

**Workspace:**
- Scans for `.py`, `.js`, `.ts`, `.md`, `.json`, `.yaml`, `.sh`, `.html`, `.css`
- Skips files > 1MB and binary files
- Chunks long documents for better retrieval

**Skills:**
- Indexes all `SKILL.md` files
- Organized by skill name for easy reference

### Search

ChromaDB uses `all-MiniLM-L6-v2` embeddings to convert text to vectors. Similar meanings cluster together, enabling semantic search by *meaning*, not just *keywords*.
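Concretely, "similar meanings cluster together" means the model maps each text to a 384-dimensional vector, and related texts score high on similarity measures such as cosine similarity. A toy illustration of that measure (not ChromaDB's internal code):

```python
import math

def cosine_sim(a, b):
    # Vectors pointing the same way (similar meaning) score near 1;
    # orthogonal (unrelated) vectors score near 0.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

A query is embedded the same way, and the stored vectors closest to it are returned as search hits.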
### Automatic Integration

When the AI responds, it automatically:

1. Searches the knowledge base for relevant context
2. Retrieves past conversations, code, or docs
3. Includes that context in the response

This happens transparently – the AI "remembers" your past work.
## Management

### View Statistics

```bash
python3 rag_manage.py stats
```

Output:

```
📊 OpenClaw RAG Statistics

Collection: openclaw_knowledge
Total Documents: 635

By Source:
  session-001: 23
  my-script.py: 5
  porkbun: 12

By Type:
  session: 500
  workspace: 100
  skill: 35
```

### Delete Documents

```bash
# Delete all sessions
python3 rag_manage.py delete --by-type session

# Delete specific file
python3 rag_manage.py delete --by-source "scripts/voipms_sms_client.py"

# Reset entire collection
python3 rag_manage.py reset
```

### Add Manual Document

```bash
python3 rag_manage.py add \
  --text "API endpoint: https://api.example.com/endpoint" \
  --source "api-docs:example.com" \
  --type "manual"
```
## Configuration

### Custom Session Directory

```bash
python3 ingest_sessions.py --sessions-dir /path/to/sessions
```

### Chunk Size Control

```bash
python3 ingest_sessions.py --chunk-size 30 --chunk-overlap 10
```

### Custom Collection

```python
from rag_system import RAGSystem

rag = RAGSystem(collection_name="my_knowledge")
```
## Data Types

| Type | Source Format | Description |
|------|--------------|-------------|
| `session` | `session:{key}` | Chat history transcripts |
| `workspace` | `relative/path/to/file` | Code, configs, docs |
| `skill` | `skill:{name}` | Skill documentation |
| `memory` | `MEMORY.md` | Long-term memory entries |
| `manual` | `{custom}` | Manually added docs |
| `api` | `api-docs:{name}` | API documentation |
## Performance

- **Embedding model**: `all-MiniLM-L6-v2` (79MB, cached locally)
- **Storage**: ~100MB per 1,000 documents
- **Indexing**: ~1,000 documents/minute
- **Search**: <100ms (after the first query)
## Troubleshooting

### No Results Found

```bash
# Check what's indexed
python3 rag_manage.py stats

# Try a broader query
python3 rag_query.py "SMS"   # instead of "voip.ms SMS API endpoint"
```

### Slow First Search

The first search loads embeddings (~1-2 seconds). Subsequent searches are nearly instant.

### Duplicate ID Errors

```bash
# Reset and re-index
python3 rag_manage.py reset
python3 ingest_sessions.py
python3 ingest_docs.py workspace
```

### ChromaDB Model Download

The first run downloads the embedding model (79MB). This takes 1-2 minutes. Let it complete.
## Best Practices

### Re-index Regularly

After significant work:

```bash
python3 ingest_sessions.py        # New conversations
python3 ingest_docs.py workspace  # New code/changes
```

### Use Specific Queries

```bash
# Better
python3 rag_query.py "voip.ms getSMS method"

# Too broad
python3 rag_query.py "SMS"
```

### Filter by Type

```bash
# Looking for code
python3 rag_query.py --type workspace "chromedriver"

# Looking for past conversations
python3 rag_query.py --type session "Reddit"
```

### Document Decisions

After important decisions, add them manually:

```bash
python3 rag_manage.py add \
  --text "Decision: Use Playwright for Reddit automation. Reason: handles Cloudflare challenges better" \
  --source "decision:reddit-automation" \
  --type "decision"
```
## Limitations

- Files > 1MB are automatically skipped (performance)
- Python 3.7+ required
- ~100MB of disk per 1,000 documents
- First search is slower (embedding load)

## Integration with OpenClaw

This skill integrates seamlessly with OpenClaw:

1. **Automatic RAG**: The AI automatically retrieves relevant context when responding
2. **Session history**: All conversations indexed and searchable
3. **Workspace awareness**: Code and docs indexed for reference
4. **Skill accessible**: Use from any OpenClaw session or script
## Security Considerations

**⚠️ Important Privacy Note:** This RAG system indexes local data, which may contain:

- API keys, tokens, or credentials in session transcripts
- Private messages or personal information
- Tool results with sensitive data
- Workspace configuration files

**Recommended:**

- Review session files before ingestion if you are concerned about privacy
- Consider redacting sensitive data from session files
- Use `rag_manage.py reset` to delete the entire index when needed
- The ChromaDB persistence directory at `~/.openclaw/data/rag/` can be deleted to remove all indexed data
- The auto-update script only runs local ingestion - no remote code fetching

**Path Portability:**

All scripts use dynamic path resolution (`os.path.expanduser()`, `Path(__file__).parent`) for portability across different user environments. No hard-coded absolute paths remain in the codebase.

**Network Calls:**

- The embedding model (all-MiniLM-L6-v2) is downloaded automatically by ChromaDB on first use
- No custom network calls, HTTP requests, or sub-process network operations
- No telemetry or data uploaded to external services (ChromaDB telemetry disabled)
- All processing and storage is local-only
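The redaction step recommended above can be sketched in a few lines of Python. This is an illustrative helper, not part of the skill; the secret patterns shown are assumptions and will not catch every credential format.

```python
import re

# Hypothetical redaction pass to run over session transcripts before
# ingestion. The patterns below are illustrative examples, not exhaustive.
PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),           # OpenAI-style API keys
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._\-]+"), # bearer tokens
    re.compile(r"ghp_[A-Za-z0-9]{36}"),           # GitHub personal tokens
]

def redact(text: str) -> str:
    """Replace anything matching a known secret pattern with [REDACTED]."""
    for pattern in PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

print(redact("auth: Bearer abc.def-123"))
```

Running the redacted output through the normal ingestion scripts keeps the index free of those secrets while preserving the surrounding conversation.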
## Example Workflow

**Scenario:** You're working on a new automation but hit a Cloudflare challenge.

```bash
# Search for past Cloudflare solutions
python3 rag_query.py "Cloudflare bypass selenium"

# Result shows a relevant past conversation:
# "Used undetected-chromedriver but failed. Switched to Playwright which handles challenges better."

# Now you know the solution before trying it!
```

## Repository

https://git.theta42.com/nova/openclaw-rag-skill

**Published:** clawhub.com
**Maintainer:** Nova AI Assistant
**For:** William Mantly (Theta42)

## License

MIT License - Free to use and modify
529 index.html Normal file
@@ -0,0 +1,529 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>OpenClaw RAG | Knowledge Base with Semantic Search</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
<style>
:root {
    --bg-primary: #0D0D0D;
    --bg-secondary: #1A1A1A;
    --bg-tertiary: #252525;
    --accent-orange: #FF5733;
    --accent-orange-hover: #FF6B4C;
    --text-primary: #FFFFFF;
    --text-secondary: #B0B0B0;
    --text-muted: #6B6B6B;
    --border-color: #333333;
    --terminal-bg: #0A0A0A;
    --terminal-border: #1E3A2E;
}

* {
    margin: 0;
    padding: 0;
    box-sizing: border-box;
}

body {
    font-family: 'Inter', sans-serif;
    background-color: var(--bg-primary);
    color: var(--text-primary);
    line-height: 1.6;
    min-height: 100vh;
}

/* Header */
header {
    padding: 2rem 4rem;
    display: flex;
    justify-content: space-between;
    align-items: center;
    border-bottom: 1px solid var(--border-color);
}

.logo {
    font-weight: 800;
    font-size: 1.5rem;
    background: linear-gradient(135deg, var(--text-primary), var(--accent-orange));
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
    background-clip: text;
}

nav a {
    color: var(--text-secondary);
    text-decoration: none;
    margin-left: 2rem;
    font-size: 0.9rem;
    transition: color 0.2s;
}

nav a:hover {
    color: var(--accent-orange);
}

/* Hero Section */
.hero {
    max-width: 1400px;
    margin: 0 auto;
    padding: 5rem 4rem;
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 4rem;
    align-items: center;
}

.hero-text h1 {
    font-size: 3.5rem;
    font-weight: 800;
    line-height: 1.1;
    margin-bottom: 1.5rem;
}

.hero-text h1 span {
    color: var(--accent-orange);
}

.hero-text p {
    color: var(--text-secondary);
    font-size: 1.2rem;
    margin-bottom: 2rem;
}

.cta-group {
    display: flex;
    gap: 1rem;
}

.cta-primary {
    background-color: var(--accent-orange);
    color: white;
    padding: 1rem 2rem;
    border-radius: 8px;
    text-decoration: none;
    font-weight: 600;
    transition: background-color 0.2s;
    border: none;
    cursor: pointer;
}

.cta-primary:hover {
    background-color: var(--accent-orange-hover);
}

.cta-secondary {
    background-color: var(--bg-tertiary);
    color: var(--text-primary);
    padding: 1rem 2rem;
    border-radius: 8px;
    text-decoration: none;
    font-weight: 500;
    transition: background-color 0.2s;
    border: 1px solid var(--border-color);
}

.cta-secondary:hover {
    background-color: var(--bg-secondary);
}

/* Terminal */
.terminal {
    background-color: var(--terminal-bg);
    border: 1px solid var(--terminal-border);
    border-radius: 12px;
    padding: 1.5rem;
    font-family: 'JetBrains Mono', monospace;
    font-size: 0.9rem;
}

.terminal-header {
    display: flex;
    gap: 0.5rem;
    margin-bottom: 1rem;
}

.terminal-dot {
    width: 12px;
    height: 12px;
    border-radius: 50%;
}

.dot-red { background-color: #FF5F56; }
.dot-yellow { background-color: #FFBD2E; }
.dot-green { background-color: #27C93F; }

.terminal-prompt {
    color: var(--accent-orange);
    display: flex;
    gap: 0.5rem;
    margin-bottom: 0.5rem;
}

.terminal-command {
    color: var(--text-primary);
}

.terminal-output {
    color: var(--text-secondary);
    margin-top: 1rem;
}

.terminal-output .success {
    color: #27C93F;
}

/* Features Section */
.features {
    max-width: 1400px;
    margin: 0 auto;
    padding: 4rem;
}

.section-header {
    margin-bottom: 3rem;
}

.section-header h2 {
    font-size: 2.5rem;
    font-weight: 700;
    margin-bottom: 0.5rem;
}

.section-header p {
    color: var(--text-secondary);
    font-size: 1.1rem;
}

.feature-grid {
    display: grid;
    grid-template-columns: repeat(auto-fit, minmax(350px, 1fr));
    gap: 1.5rem;
}

.feature-card {
    background: rgba(26, 26, 26, 0.8);
    border: 1px solid var(--border-color);
    border-radius: 12px;
    padding: 2rem;
    transition: border-color 0.2s;
}

.feature-card:hover {
    border-color: var(--accent-orange);
}

.feature-icon {
    font-size: 2rem;
    margin-bottom: 1rem;
}

.feature-card h3 {
    font-size: 1.3rem;
    font-weight: 600;
    margin-bottom: 0.75rem;
}

.feature-card p {
    color: var(--text-secondary);
    font-size: 0.95rem;
    line-height: 1.7;
}

/* Comparison Section */
.comparison {
    max-width: 1400px;
    margin: 0 auto;
    padding: 4rem;
    background: linear-gradient(180deg, transparent, rgba(255, 87, 51, 0.03));
}

.comparison-grid {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 3rem;
    margin-top: 2rem;
}

.comparison-card {
    background: var(--bg-secondary);
    border: 2px solid var(--border-color);
    border-radius: 12px;
    padding: 2rem;
}

.comparison-card.better {
    border-color: var(--accent-orange);
    box-shadow: 0 0 20px rgba(255, 87, 51, 0.1);
}

.comparison-card h3 {
    font-size: 1.5rem;
    font-weight: 700;
    margin-bottom: 1.5rem;
}

.comparison-card.better h3 {
    color: var(--accent-orange);
}

.feature-list {
    list-style: none;
}

.feature-list li {
    padding: 0.75rem 0;
    padding-left: 1.5rem;
    position: relative;
    color: var(--text-secondary);
}

.feature-list li::before {
    content: '✓';
    position: absolute;
    left: 0;
    color: var(--accent-orange);
}

.feature-list li.missing {
    color: var(--text-muted);
}

.feature-list li.missing::before {
    content: '✗';
    color: var(--text-muted);
}

/* Why Section */
.why-section {
    max-width: 1400px;
    margin: 0 auto;
    padding: 4rem;
}

.why-cards {
    display: grid;
    grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
    gap: 1.5rem;
    margin-top: 2rem;
}

.why-card {
    background: var(--bg-tertiary);
    border-radius: 12px;
    padding: 1.5rem;
    border-left: 4px solid var(--accent-orange);
}

.why-card h4 {
    font-size: 1.1rem;
    font-weight: 600;
    margin-bottom: 0.5rem;
}

.why-card p {
    color: var(--text-secondary);
    font-size: 0.9rem;
}

/* Footer */
footer {
    border-top: 1px solid var(--border-color);
    padding: 3rem 4rem;
    margin-top: 4rem;
}

.footer-content {
    max-width: 1400px;
    margin: 0 auto;
    display: flex;
    justify-content: space-between;
    align-items: center;
}

.footer-links a {
    color: var(--text-secondary);
    text-decoration: none;
    margin-left: 2rem;
    font-size: 0.9rem;
    transition: color 0.2s;
}

.footer-links a:hover {
    color: var(--accent-orange);
}

.footer-text {
    color: var(--text-muted);
    font-size: 0.85rem;
}

/* Responsive */
@media (max-width: 1024px) {
    .hero {
        grid-template-columns: 1fr;
        text-align: center;
        padding: 3rem 2rem;
    }

    .hero h1 {
        font-size: 2.5rem;
    }

    .comparison-grid {
        grid-template-columns: 1fr;
    }

    header, footer, .features, .comparison, .why-section {
        padding-left: 2rem;
        padding-right: 2rem;
    }
}
</style>
</head>
<body>
<header>
    <div class="logo">OpenClaw RAG</div>
    <nav>
        <a href="#features">Features</a>
        <a href="#comparison">Comparison</a>
        <a href="#why">Why Us</a>
        <a href="https://github.com/wmantly/openclaw-rag-skill">GitHub</a>
    </nav>
</header>

<section class="hero">
    <div class="hero-text">
        <h1>Local Knowledge with<br><span>Semantic Search</span></h1>
        <p>Index your entire OpenClaw workspace — chat history, code, documentation — and search by meaning, not keywords. Fully offline, zero dependencies, complete privacy.</p>
        <div class="cta-group">
            <a href="https://clawhub.ai/wmantly/openclaw-rag-skill" class="cta-primary">Install on ClawHub</a>
            <a href="https://github.com/wmantly/openclaw-rag-skill" class="cta-secondary">View on GitHub</a>
        </div>
    </div>
    <div class="terminal">
        <div class="terminal-header">
            <div class="terminal-dot dot-red"></div>
            <div class="terminal-dot dot-yellow"></div>
            <div class="terminal-dot dot-green"></div>
        </div>
        <div class="terminal-prompt">
            <span class="terminal-command">$</span>
            <span>python3 rag_query.py "how to send SMS"</span>
        </div>
        <div class="terminal-output">
            <p>🔍 Found 3 relevant items:</p>
            <br>
            <p><strong>📄 Past Conversation (voipms-sms)</strong></p>
            <p>Use voipms_sms_client.py with API endpoint...</p>
            <p>API: https://voip.ms/api/v1/rest.php</p>
            <br>
            <p class="success">✅ Indexed 2,533 documents from your workspace</p>
        </div>
    </div>
</section>

<section id="features" class="features">
    <div class="section-header">
        <h2>Powerful Features</h2>
        <p>Everything you need to build and maintain your knowledge base</p>
    </div>
    <div class="feature-grid">
        <div class="feature-card">
            <div class="feature-icon">🧠</div>
            <h3>Semantic Search</h3>
            <p>Find context by meaning, not just keywords. ChromaDB with all-MiniLM-L6-v2 embeddings automatically understands the intent behind your queries.</p>
        </div>
        <div class="feature-card">
            <div class="feature-icon">📚</div>
            <h3>Multi-Source Indexing</h3>
            <p>Index chat sessions, Python/JavaScript code, Markdown docs, skill documentation — everything in your OpenClaw workspace becomes searchable.</p>
        </div>
        <div class="feature-card">
            <div class="feature-icon">🔒</div>
            <h3>Completely Offline</h3>
            <p>No API keys, no external services, no cloud databases. All embeddings and storage live on your machine. Your data never leaves.</p>
        </div>
        <div class="feature-card">
            <div class="feature-icon">🚀</div>
            <h3>Automatic Integration</h3>
            <p>The AI automatically retrieves relevant context from your knowledge base when responding. It "remembers" your past work transparently.</p>
        </div>
        <div class="feature-card">
            <div class="feature-icon">📊</div>
            <h3>Type Filtering</h3>
            <p>Search by document type — sessions, workspace, skills, memory. Find exactly what you're looking for with precision.</p>
        </div>
        <div class="feature-card">
            <div class="feature-icon">⚡</div>
            <h3>Blazing Fast</h3>
            <p>Indexing at ~1,000 docs/minute, search in under 100ms. The embedding model (79MB) caches locally after first run.</p>
        </div>
    </div>
</section>

<section id="comparison" class="comparison">
    <div class="section-header">
        <h2>Why OpenClaw RAG?</h2>
        <p>Compare with web-search RAG tools</p>
    </div>
    <div class="comparison-grid">
        <div class="comparison-card">
            <h3>Web-Search RAG Tools</h3>
            <ul class="feature-list">
                <li class="missing">Searches the internet</li>
                <li class="missing">Requires web connection</li>
                <li class="missing">No knowledge of your code</li>
                <li class="missing">Can't find past solutions</li>
                <li class="missing">For research, not memory</li>
            </ul>
        </div>
        <div class="comparison-card better">
            <h3>✨ OpenClaw RAG</h3>
            <ul class="feature-list">
                <li>Indexes YOUR knowledge base</li>
                <li>Fully offline after setup</li>
                <li>Understands your entire workspace</li>
                <li>Remembers your past solutions</li>
                <li>For institutional memory &amp; productivity</li>
            </ul>
        </div>
    </div>
</section>

<section id="why" class="why-section">
    <div class="section-header">
        <h2>Built for Developers, by Developers</h2>
        <p>Different use cases for different problems</p>
    </div>
    <div class="why-cards">
        <div class="why-card">
            <h4>🌐 Use Web RAG for Research</h4>
            <p>Need current information, multi-engine web searches, or gathering context from the internet? That's what web-search RAG tools do best.</p>
        </div>
        <div class="why-card">
            <h4>💾 Use OpenClaw RAG for Memory</h4>
            <p>Need to find past solutions, code patterns, decisions, or conversations? That's our specialty — we remember everything you've done.</p>
        </div>
        <div class="why-card">
            <h4>🤝 Best of Both Worlds</h4>
            <p>Use web-search RAG for research, then save the important findings to your knowledge base. OpenClaw RAG preserves what matters forever.</p>
        </div>
    </div>
</section>

<footer>
    <div class="footer-content">
        <p class="footer-text">OpenClaw RAG v1.0.3 · MIT License</p>
        <div class="footer-links">
            <a href="https://clawhub.ai/wmantly/openclaw-rag-skill">ClawHub</a>
            <a href="https://github.com/wmantly/openclaw-rag-skill">GitHub</a>
            <a href="https://william.mantly.vip">William Mantly</a>
        </div>
    </div>
</footer>
</body>
</html>
265 ingest_docs.py
@@ -1,265 +0,0 @@
#!/usr/bin/env python3
"""
RAG Document Ingestor - Load workspace files, skills, and docs into the vector store
"""

import os
import sys
from pathlib import Path
from datetime import datetime
from typing import List, Optional

# Make sibling modules importable when run as a script
sys.path.insert(0, str(Path(__file__).parent))

from rag_system import RAGSystem


def read_file_safe(file_path: Path) -> Optional[str]:
    """Read a file as UTF-8, returning None on error."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()
    except Exception as e:
        print(f"  ⚠️ Error reading {file_path}: {e}")
        return None

def is_text_file(file_path: Path, max_size_mb: float = 1.0) -> bool:
    """Check that a file is text-like and not too large."""
    # Skip binary files by extension
    if file_path.suffix.lower() in ['.pyc', '.so', '.o', '.a', '.png', '.jpg', '.jpeg', '.gif', '.zip', '.tar', '.gz']:
        return False

    # Skip files above the size limit (or files we cannot stat)
    try:
        size_mb = file_path.stat().st_size / (1024 * 1024)
        if size_mb > max_size_mb:
            return False
    except OSError:
        return False

    return True

def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> List[str]:
    """Split text into paragraph-aligned chunks.

    Note: the overlap parameter is accepted for API symmetry but not yet
    applied; chunks are split on blank lines without overlap, and a single
    paragraph longer than max_chars becomes its own oversized chunk.
    """
    if len(text) <= max_chars:
        return [text]

    # Simple splitting on blank lines (could be improved to split at sentences)
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""

    for para in paragraphs:
        if len(current_chunk) + len(para) + 2 <= max_chars:
            current_chunk += para + "\n\n"
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para + "\n\n"

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

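The paragraph-based splitter above is easy to sanity-check. This condensed copy (overlap omitted, mirroring the parameter the original accepts but does not yet use) runs on its own:

```python
from typing import List

def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    # Condensed copy of chunk_text above, for a self-contained demo.
    if len(text) <= max_chars:
        return [text]
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) + 2 <= max_chars:
            current += para + "\n\n"
        else:
            if current:
                chunks.append(current.strip())
            current = para + "\n\n"
    if current:
        chunks.append(current.strip())
    return chunks

# 100 short paragraphs, split into ~500-character chunks
doc = "\n\n".join(f"paragraph {i} " + "x" * 50 for i in range(100))
chunks = chunk_text(doc, max_chars=500)
print(len(chunks), max(len(c) for c in chunks))
```

Every chunk stays within the limit as long as no single paragraph exceeds it, and no text is lost across chunk boundaries.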
def ingest_workspace(
    workspace_dir: str = None,
    collection_name: str = "openclaw_knowledge",
    file_patterns: List[str] = None
):
    """
    Ingest workspace files into the RAG system

    Args:
        workspace_dir: Path to workspace directory
        collection_name: Name of the ChromaDB collection
        file_patterns: List of file patterns to include (default: common text formats)
    """
    if workspace_dir is None:
        workspace_dir = os.path.expanduser("~/.openclaw/workspace")

    workspace_path = Path(workspace_dir)

    if not workspace_path.exists():
        print(f"❌ Workspace not found: {workspace_dir}")
        return

    print(f"🔍 Scanning workspace: {workspace_path}")

    # Default file patterns
    if file_patterns is None:
        file_patterns = [
            "*.md", "*.py", "*.js", "*.ts", "*.json", "*.yaml", "*.yml",
            "*.txt", "*.sh", "*.html", "*.css"
        ]

    # Find all matching files
    all_files = []
    for pattern in file_patterns:
        for file_path in workspace_path.rglob(pattern):
            if is_text_file(file_path):
                all_files.append(file_path)

    if not all_files:
        print("⚠️ No files found")
        return

    print(f"✅ Found {len(all_files)} files\n")

    # Initialize RAG
    rag = RAGSystem(collection_name=collection_name)

    total_chunks = 0
    files_to_process = all_files[:100]  # Limit to 100 files for testing

    for file_path in files_to_process:
        relative_path = file_path.relative_to(workspace_path)
        print(f"\n📄 {relative_path}")

        # Read file
        content = read_file_safe(file_path)
        if content is None:
            continue

        # Chunk if too large
        if len(content) > 4000:
            text_chunks = chunk_text(content)
            print(f"  Chunks: {len(text_chunks)}")
        else:
            text_chunks = [content]
            print(f"  Size: {len(content)} chars")

        # Add each chunk
        for i, chunk in enumerate(text_chunks):
            metadata = {
                "type": "workspace",
                "source": str(relative_path),
                "file_path": str(file_path),
                "file_size": len(content),
                "chunk_index": i,
                "total_chunks": len(text_chunks),
                "file_extension": file_path.suffix.lower(),
                "ingested_at": datetime.now().isoformat()
            }
            rag.add_document(chunk, metadata)
            total_chunks += 1

        print(f"  ✅ Indexed {len(text_chunks)} chunk(s)")

    print("\n📊 Summary:")
    print(f"  Files processed: {len(files_to_process)}")
    print(f"  Total chunks indexed: {total_chunks}")

    stats = rag.get_stats()
    print(f"  Total documents in collection: {stats['total_documents']}")

def ingest_skills(
    skills_base_dir: str = None,
    collection_name: str = "openclaw_knowledge"
):
    """
    Ingest all SKILL.md files from the skills directories

    Args:
        skills_base_dir: Base directory for skills
        collection_name: Name of the ChromaDB collection
    """
    # Default to the OpenClaw skills dirs
    if skills_base_dir is None:
        # Check both system and workspace skills
        system_skills = Path("/usr/lib/node_modules/openclaw/skills")
        workspace_skills = Path(os.path.expanduser("~/.openclaw/workspace/skills"))

        skills_dirs = [d for d in [system_skills, workspace_skills] if d.exists()]

        if not skills_dirs:
            print("❌ No skills directories found")
            return
    else:
        skills_dirs = [Path(skills_base_dir)]

    print("🔍 Scanning for skills...")

    # Find all SKILL.md files
    skill_files = []
    for skills_dir in skills_dirs:
        for skill_file in skills_dir.rglob("SKILL.md"):
            skill_files.append(skill_file)

    if not skill_files:
        print("⚠️ No SKILL.md files found")
        return

    print(f"✅ Found {len(skill_files)} skills\n")

    # Initialize RAG
    rag = RAGSystem(collection_name=collection_name)

    total_chunks = 0

    for skill_file in skill_files:
        # A skill takes its name from the directory containing SKILL.md
        skill_name = skill_file.parent.name

        print(f"\n📜 {skill_name}")

        content = read_file_safe(skill_file)
        if content is None:
            continue

        # Chunk skill documentation
        chunks = chunk_text(content, max_chars=3000, overlap=100)

        for i, chunk in enumerate(chunks):
            metadata = {
                "type": "skill",
                "source": f"skill:{skill_name}",
                "skill_name": skill_name,
                "file_path": str(skill_file),
                "chunk_index": i,
                "total_chunks": len(chunks),
                "ingested_at": datetime.now().isoformat()
            }
            rag.add_document(chunk, metadata)
            total_chunks += 1

        print(f"  ✅ Indexed {len(chunks)} chunk(s)")

    print("\n📊 Summary:")
    print(f"  Skills processed: {len(skill_files)}")
    print(f"  Total chunks indexed: {total_chunks}")

    stats = rag.get_stats()
    print(f"  Total documents in collection: {stats['total_documents']}")

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Ingest files into OpenClaw RAG")
    parser.add_argument("type", choices=["workspace", "skills"], help="What to ingest")
    parser.add_argument("--path", help="Path to workspace or skills directory")
    parser.add_argument("--collection", default="openclaw_knowledge", help="Collection name")

    args = parser.parse_args()

    if args.type == "workspace":
        ingest_workspace(workspace_dir=args.path, collection_name=args.collection)
    elif args.type == "skills":
        ingest_skills(skills_base_dir=args.path, collection_name=args.collection)
@@ -1,289 +0,0 @@
#!/usr/bin/env python3
"""
RAG Session Ingestor - Load all session transcripts into the vector store
Handles the OpenClaw session event format
"""

import os
import json
import sys
from pathlib import Path
from datetime import datetime
from typing import List, Dict

sys.path.insert(0, str(Path(__file__).parent))

from rag_system import RAGSystem

def parse_jsonl(file_path: Path) -> List[Dict]:
    """
    Parse the OpenClaw session JSONL format

    Session files contain:
    - Line 1: Session metadata (type: "session")
    - Lines 2+: Events including messages, tool calls, etc.
    """
    messages = []

    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue

                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # Skip malformed lines

                # Skip the session metadata line
                if line_num == 1 and event.get('type') == 'session':
                    continue

                # Extract message events only
                if event.get('type') == 'message':
                    msg_obj = event.get('message', {})
                    messages.append({
                        'role': msg_obj.get('role'),
                        'content': msg_obj.get('content'),
                        'timestamp': event.get('timestamp'),
                        'id': event.get('id'),
                        'sessionKey': event.get('sessionKey')  # Usually absent, but checked
                    })
    except Exception as e:
        print(f"❌ Error reading {file_path}: {e}")

    return messages

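The event format handled above can be exercised on a synthetic session. The loop below is a condensed copy of the parsing logic, fed two message events behind a metadata line:

```python
import json

# A synthetic three-line session file: metadata, then two message events.
sample = "\n".join([
    json.dumps({"type": "session", "id": "demo"}),
    json.dumps({"type": "message", "timestamp": 1,
                "message": {"role": "user", "content": "hi"}}),
    json.dumps({"type": "message", "timestamp": 2,
                "message": {"role": "assistant", "content": "hello"}}),
])

# Condensed version of the loop in parse_jsonl: skip the metadata line,
# keep only message events.
messages = []
for line_num, line in enumerate(sample.splitlines(), 1):
    event = json.loads(line)
    if line_num == 1 and event.get("type") == "session":
        continue
    if event.get("type") == "message":
        msg = event.get("message", {})
        messages.append({"role": msg.get("role"), "content": msg.get("content")})

print([m["role"] for m in messages])  # → ['user', 'assistant']
```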
def extract_session_key(file_name: str) -> str:
    """Extract the session key from a filename"""
    return file_name.replace('.jsonl', '')


def extract_session_metadata(session_data: List[Dict], session_key: str) -> Dict:
    """Extract metadata from session messages"""
    if not session_data:
        return {}

    first_msg = session_data[0]
    last_msg = session_data[-1]

    return {
        "start_time": first_msg.get("timestamp"),
        "end_time": last_msg.get("timestamp"),
        "total_messages": len(session_data),
        "has_system": any(msg.get("role") == "system" for msg in session_data),
        "has_user": any(msg.get("role") == "user" for msg in session_data),
        "has_assistant": any(msg.get("role") == "assistant" for msg in session_data),
    }

def format_content(content) -> str:
    """
    Format message content from the OpenClaw format to plain text

    Content can be:
    - String
    - List of dicts with a 'type' field (text, thinking, toolCall, toolResult)
    """
    if isinstance(content, str):
        return content

    if isinstance(content, list):
        texts = []

        for item in content:
            if not isinstance(item, dict):
                continue

            item_type = item.get('type', '')

            if item_type == 'text':
                texts.append(item.get('text', ''))
            elif item_type == 'thinking':
                # Skip reasoning, usually not useful for RAG
                pass
            elif item_type == 'toolCall':
                tool_name = item.get('name', 'unknown')
                args = str(item.get('arguments', ''))[:100]
                texts.append(f"[Tool: {tool_name}({args})]")
            elif item_type == 'toolResult':
                result = str(item.get('text', item.get('result', ''))).strip()
                # Truncate large tool results
                if len(result) > 500:
                    result = result[:500] + "..."
                texts.append(f"[Tool Result: {result}]")

        return "\n".join(texts)

    return str(content)[:500]

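The flattening behavior above is worth a quick demonstration. This condensed copy (text and toolCall branches only; thinking parts are dropped, as in the original) shows how a typed content list becomes plain text:

```python
def format_content(content) -> str:
    # Condensed copy of format_content above: flatten OpenClaw message
    # content (string or list of typed parts) into plain text.
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        texts = []
        for item in content:
            if not isinstance(item, dict):
                continue
            kind = item.get("type", "")
            if kind == "text":
                texts.append(item.get("text", ""))
            elif kind == "toolCall":
                texts.append(f"[Tool: {item.get('name', 'unknown')}]")
            # 'thinking' parts are intentionally dropped
        return "\n".join(texts)
    return str(content)[:500]

parts = [{"type": "text", "text": "run it"},
         {"type": "thinking", "thinking": "skipped"},
         {"type": "toolCall", "name": "bash"}]
print(format_content(parts))  # → run it\n[Tool: bash]
```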
def chunk_messages(
    messages: List[Dict],
    context_window: int = 20,
    overlap: int = 5
) -> List[Dict]:
    """
    Chunk messages into overlapping windows for better retrieval

    Args:
        messages: List of message objects
        context_window: Messages per chunk
        overlap: Message overlap between chunks

    Returns:
        List of {"text": str, "metadata": dict} chunks
    """
    chunks = []
    step = max(1, context_window - overlap)  # Guard against overlap >= window

    for i in range(0, len(messages), step):
        window = messages[i:i + context_window]

        # Build text from messages
        text_parts = []
        for msg in window:
            role = msg.get("role", "unknown")
            content = msg.get("content", "")

            # Format content
            text = format_content(content)
            if text.strip():
                text_parts.append(f"{role.upper()}: {text}")

        text = "\n\n".join(text_parts)

        # Don't add empty chunks
        if not text.strip():
            continue

        # Metadata (fall back to "unknown" when no session key is present)
        metadata = {
            "type": "session",
            "source": str(window[0].get("sessionKey") or window[0].get("id") or "unknown"),
            "chunk_index": int(i // step),
            "chunk_start_time": str(window[0].get("timestamp") or ""),
            "chunk_end_time": str(window[-1].get("timestamp") or ""),
            "message_count": int(len(window)),
            "ingested_at": datetime.now().isoformat(),
            "date": str(window[0].get("timestamp") or datetime.now().isoformat())
        }

        chunks.append({
            "text": text,
            "metadata": metadata
        })

    return chunks

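The window arithmetic used by the chunker is easy to sanity-check on its own: with 20 messages per chunk and an overlap of 5, each window start advances by 15, so consecutive windows share their last/first 5 messages.

```python
# Sliding-window index math, as used when chunking session messages.
context_window, overlap = 20, 5
n_messages = 50

step = context_window - overlap  # 15
starts = list(range(0, n_messages, step))
windows = [(s, min(s + context_window, n_messages)) for s in starts]
print(windows)  # → [(0, 20), (15, 35), (30, 50), (45, 50)]
```

Note the final window is shorter; the chunker simply emits whatever messages remain.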
def ingest_sessions(
    sessions_dir: str = None,
    collection_name: str = "openclaw_knowledge",
    chunk_size: int = 20,
    chunk_overlap: int = 5
):
    """
    Ingest all session transcripts into the RAG system

    Args:
        sessions_dir: Directory containing session jsonl files
        collection_name: Name of the ChromaDB collection
        chunk_size: Messages per chunk
        chunk_overlap: Message overlap between chunks
    """
    if sessions_dir is None:
        sessions_dir = os.path.expanduser("~/.openclaw/agents/main/sessions")

    sessions_path = Path(sessions_dir)

    if not sessions_path.exists():
        print(f"❌ Sessions directory not found: {sessions_path}")
        return

    print(f"🔍 Finding session files in: {sessions_path}")

    jsonl_files = list(sessions_path.glob("*.jsonl"))

    if not jsonl_files:
        print(f"⚠️ No jsonl files found in {sessions_path}")
        return

    print(f"✅ Found {len(jsonl_files)} session files\n")

    rag = RAGSystem(collection_name=collection_name)

    total_chunks = 0
    total_messages = 0
    skipped_empty = 0

    for jsonl_file in sorted(jsonl_files):
        session_key = extract_session_key(jsonl_file.name)

        print(f"\n📄 Processing: {jsonl_file.name}")

        messages = parse_jsonl(jsonl_file)

        if not messages:
            print("  ⚠️ No messages, skipping")
            skipped_empty += 1
            continue

        total_messages += len(messages)

        # Extract session metadata
        session_metadata = extract_session_metadata(messages, session_key)
        print(f"  Messages: {len(messages)}")

        # Chunk messages
        chunks = chunk_messages(messages, chunk_size, chunk_overlap)

        if not chunks:
            print("  ⚠️ No valid chunks, skipping")
            skipped_empty += 1
            continue

        print(f"  Chunks: {len(chunks)}")

        # Add to RAG
        try:
            ids = rag.add_documents_batch(chunks, batch_size=50)
            total_chunks += len(chunks)
            print(f"  ✅ Indexed {len(chunks)} chunks")
        except Exception as e:
            print(f"  ❌ Error: {e}")

    # Summary
    print("\n📊 Summary:")
    print(f"  Sessions processed: {len(jsonl_files)}")
    print(f"  Skipped (empty): {skipped_empty}")
    print(f"  Total messages: {total_messages}")
    print(f"  Total chunks indexed: {total_chunks}")

    stats = rag.get_stats()
    print(f"  Total documents in collection: {stats['total_documents']}")

if __name__ == "__main__":
|
||||
import argparse
|
||||
|
||||
parser = argparse.ArgumentParser(description="Ingest OpenClaw session transcripts into RAG")
|
||||
parser.add_argument("--sessions-dir", help="Path to sessions directory (default: ~/.openclaw/agents/main/sessions)")
|
||||
parser.add_argument("--chunk-size", type=int, default=20, help="Messages per chunk (default: 20)")
|
||||
parser.add_argument("--chunk-overlap", type=int, default=5, help="Message overlap (default: 5)")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
ingest_sessions(
|
||||
sessions_dir=args.sessions_dir,
|
||||
chunk_size=args.chunk_size,
|
||||
chunk_overlap=args.chunk_overlap
|
||||
)
|
||||
@@ -1,43 +0,0 @@
#!/bin/bash
# RAG Agent Launcher - Spawns an agent with automatic knowledge base access

# This spawns a sub-agent that has RAG automatically integrated
# The agent will query your knowledge base before responding to questions

SESSION_SPAWN_COMMAND='python3 -c "
import sys
sys.path.insert(0, \"/home/william/.openclaw/workspace/rag\")

# Add RAG context to system prompt
ORIGINAL_TASK=\"$@\"

# Search for relevant context
from rag_system import RAGSystem
rag = RAGSystem()

# Find similar past conversations
results = rag.search(ORIGINAL_TASK, n_results=3)

if results:
    context = \"\\n=== RELEVANT CONTEXT FROM KNOWLEDGE BASE ===\\n\"
    for i, r in enumerate(results, 1):
        meta = r.get(\"metadata\", {})
        text = r.get(\"text\", \"\")[:500]
        doc_type = meta.get(\"type\", \"unknown\")
        source = meta.get(\"source\", \"unknown\")
        context += f\"\\n[{doc_type.upper()} - {source}]\\n{text}\\n\"
else:
    context = \"\"

# Respond with context-aware task
print(f\"\"\"{context}

=== CURRENT TASK ===
{ORIGINAL_TASK}

Use the context above if relevant to help answer the question.\"\"\")"'

# Spawn the agent with RAG context
/home/william/.local/bin/openclaw sessions spawn "$SESSION_SPAWN_COMMAND"
52
package.json
@@ -1,52 +0,0 @@
{
  "name": "rag-openclaw",
  "version": "1.0.3",
  "description": "RAG Knowledge System for OpenClaw - Semantic search across chat history, code, docs, and skills with automatic memory retrieval",
  "homepage": "http://git.theta42.com/nova/openclaw-rag-skill",
  "author": {
    "name": "Nova AI",
    "email": "nova@vm42.us"
  },
  "owner": "wmantly",
  "openclaw": {
    "always": false,
    "capabilities": []
  },
  "environment": {
    "required": {},
    "optional": {},
    "config": {
      "paths": [
        "~/.openclaw/data/rag/"
      ],
      "help": "ChromaDB storage location. No configuration required - system auto-creates data directory on first use."
    }
  },
  "install": {
    "type": "instruction",
    "steps": [
      "1. Install Python dependency: pip3 install --user chromadb",
      "2. Install location: ~/.openclaw/workspace/rag/ (created automatically)",
      "3. Data storage: ~/.openclaw/data/rag/ (auto-created on first run)",
      "4. No API keys or credentials required - fully local system"
    ]
  },
  "scripts": {
    "ingest:sessions": "python3 ingest_sessions.py",
    "ingest:workspace": "python3 ingest_docs.py workspace",
    "ingest:skills": "python3 ingest_docs.py skills",
    "search": "python3 rag_query.py",
    "update": "bash scripts/rag-auto-update.sh",
    "stats": "python3 rag_manage.py stats",
    "manage": "python3 rag_manage.py"
  },
  "keywords": [
    "rag",
    "knowledge",
    "semantic-search",
    "chromadb",
    "memory",
    "retrieval-augmented-generation"
  ],
  "license": "MIT"
}
213
rag_agent.py
@@ -1,213 +0,0 @@
#!/usr/bin/env python3
"""
RAG-Enhanced OpenClaw Agent

This agent automatically retrieves relevant context from the knowledge base
before responding, providing conversation history, code, and documentation context.

Usage:
    python3 /path/to/rag_agent.py <user_message> <session_jsonl_file>

Or integrate into OpenClaw as an agent wrapper.
"""

import sys
import json
from pathlib import Path

# Add parent directory to import RAG system
current_dir = Path(__file__).parent
sys.path.insert(0, str(current_dir))

from rag_system import RAGSystem


def extract_user_query(messages: list) -> str:
    """
    Extract the most recent user message from conversation history.

    Args:
        messages: List of message objects

    Returns:
        User query string
    """
    # Find the last user message
    for msg in reversed(messages):
        role = msg.get('role')

        if role == 'user':
            content = msg.get('content', '')

            # Handle different content formats
            if isinstance(content, str):
                return content
            elif isinstance(content, list):
                # Extract text from list format
                text_parts = []
                for item in content:
                    if isinstance(item, dict) and item.get('type') == 'text':
                        text_parts.append(item.get('text', ''))
                return ' '.join(text_parts)

    return ''


def search_relevant_context(query: str, rag: RAGSystem, max_results: int = 5) -> str:
    """
    Search the knowledge base for relevant context.

    Args:
        query: User's question
        rag: RAGSystem instance
        max_results: Maximum results to return

    Returns:
        Formatted context string
    """
    if not query or len(query) < 3:
        return ''

    try:
        # Search for relevant context
        results = rag.search(query, n_results=max_results)

        if not results:
            return ''

        # Format the results
        context_parts = []
        context_parts.append(f"Found {len(results)} relevant context items:\n")

        for i, result in enumerate(results, 1):
            metadata = result.get('metadata', {})
            doc_type = metadata.get('type', 'unknown')
            source = metadata.get('source', 'unknown')

            # Header based on type
            if doc_type == 'session':
                header = f"[Session Reference {i}]"
            elif doc_type == 'workspace':
                header = f"[Code/Docs {i}: {source}]"
            elif doc_type == 'skill':
                header = f"[Skill Reference {i}: {source}]"
            else:
                header = f"[Reference {i}]"

            # Truncate long content
            text = result.get('text', '')
            if len(text) > 800:
                text = text[:800] + "..."

            context_parts.append(f"{header}\n{text}\n")

        return '\n'.join(context_parts)

    except Exception:
        # Fail silently - RAG shouldn't break conversations
        return ''


def enhance_message_with_rag(
    message_content: str,
    conversation_history: list,
    collection_name: str = "openclaw_knowledge"
) -> str:
    """
    Enhance a user message with relevant RAG context.

    This is the main integration point. Call this before sending messages to the LLM.

    Args:
        message_content: The current user message
        conversation_history: Recent conversation messages
        collection_name: ChromaDB collection name

    Returns:
        Enhanced message string with RAG context prepended
    """
    try:
        # Initialize RAG system
        rag = RAGSystem(collection_name=collection_name)

        # Extract user query (append the current message last, since
        # extract_user_query walks the list in reverse)
        user_query = extract_user_query(conversation_history + [{'role': 'user', 'content': message_content}])

        # Search for relevant context
        context = search_relevant_context(user_query, rag, max_results=5)

        if not context:
            return message_content

        # Prepend context to the message
        enhanced = f"""[RAG CONTEXT - Retrieved from knowledge base:]
{context}

---

[CURRENT USER MESSAGE:]
{message_content}"""

        return enhanced

    except Exception:
        # Fail silently - return original message if RAG fails
        return message_content


def get_response_with_rag(
    user_message: str,
    session_jsonl: str = None,
    collection_name: str = "openclaw_knowledge"
) -> str:
    """
    Get an AI response with automatic RAG-enhanced context.

    This is a helper function that can be called from scripts.

    Args:
        user_message: The user's question
        session_jsonl: Path to session file (for conversation history)
        collection_name: ChromaDB collection name

    Returns:
        Enhanced message ready for LLM processing
    """
    # Load conversation history if session file provided
    conversation_history = []
    if session_jsonl and Path(session_jsonl).exists():
        try:
            with open(session_jsonl, 'r') as f:
                for line in f:
                    if line.strip():
                        event = json.loads(line)
                        if event.get('type') == 'message':
                            msg = event.get('message', {})
                            conversation_history.append(msg)
        except (OSError, json.JSONDecodeError):
            pass

    # Enhance message
    return enhance_message_with_rag(user_message, conversation_history, collection_name)


if __name__ == "__main__":
    # Command-line interface for testing
    if len(sys.argv) < 2:
        print("Usage: python3 rag_agent.py <user_message> [session_jsonl]")
        print("\nOr import and use:")
        print("  from rag.rag_agent import enhance_message_with_rag")
        print("  enhanced = enhance_message_with_rag(user_message, history)")
        sys.exit(1)

    user_message = sys.argv[1]
    session_jsonl = sys.argv[2] if len(sys.argv) > 2 else None

    # Get enhanced message
    enhanced = get_response_with_rag(user_message, session_jsonl)

    print("\n" + "="*80)
    print("ENHANCED MESSAGE (Ready for LLM):")
    print("="*80)
    print(enhanced)
    print("="*80 + "\n")
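The query-extraction logic walks the message list newest-first and flattens list-style content down to its text parts. A minimal standalone replica of that rule, run against hypothetical sample messages (the helper name `last_user_text` and the sample history are invented for illustration):

```python
def last_user_text(messages):
    # Mirror of extract_user_query: the newest user message wins;
    # list-style content is flattened to its 'text' items only.
    for msg in reversed(messages):
        if msg.get('role') == 'user':
            content = msg.get('content', '')
            if isinstance(content, str):
                return content
            if isinstance(content, list):
                return ' '.join(
                    item.get('text', '') for item in content
                    if isinstance(item, dict) and item.get('type') == 'text'
                )
    return ''

history = [
    {'role': 'user', 'content': 'old question'},
    {'role': 'assistant', 'content': 'answer'},
    {'role': 'user', 'content': [{'type': 'text', 'text': 'latest'},
                                 {'type': 'image', 'url': 'x'}]},
]
```

Note that non-text items (such as the image entry above) are dropped entirely, so the search query is always plain text.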
@@ -1,93 +0,0 @@
#!/usr/bin/env python3
"""
RAG Context Checker - Simple wrapper for automatic knowledge base lookup

Usage from within Python/OpenClaw sessions:
    from rag_context import check_context
    check_context("your question here")

This prints relevant context if found, otherwise stays silent.
"""

import sys
import os

# Use current directory for imports (skill directory)
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from rag_query_wrapper import search_knowledge, format_for_ai


def check_context(query: str, max_results: int = 5, min_score: float = 0.3) -> None:
    """
    Check knowledge base for relevant context and print it.

    This is designed as a simple call-and-forget function. If context exists,
    it prints it. If not, it stays silent.

    Args:
        query: The question or topic to search for
        max_results: Maximum results to retrieve
        min_score: Minimum similarity score (not used in current implementation)

    Example:
        >>> check_context("how to send SMS")
        📚 Found 3 relevant items...
        [prints context]
    """
    if not query or len(query) < 3:
        return

    try:
        results = search_knowledge(query, n_results=max_results)

        if results and results.get('count', 0) > 0:
            print("\n" + "="*80)
            print("📚 RELEVANT CONTEXT FROM KNOWLEDGE BASE\n")
            formatted = format_for_ai(results)
            print(formatted)
            print("="*80 + "\n")

    except Exception:
        # Fail silently - RAG errors shouldn't break conversations
        pass


def get_context(query: str, max_results: int = 5) -> str:
    """
    Get context as a string (for automated use).

    Returns a formatted context string, or an empty string if no results.

    Args:
        query: Search query
        max_results: Maximum results

    Returns:
        Formatted context string, or "" if no results

    Example:
        >>> context = get_context("Reddit automation")
        >>> if context:
        ...     print("Found relevant past work:")
        ...     print(context)
    """
    try:
        results = search_knowledge(query, n_results=max_results)

        if results and results.get('count', 0) > 0:
            return format_for_ai(results)

    except Exception:
        pass

    return ""


# Quick test when run directly
if __name__ == "__main__":
    if len(sys.argv) > 1:
        query = ' '.join(sys.argv[1:])
        check_context(query)
    else:
        print("Usage: python3 rag_context.py <query>")
        print("Or import: from rag_context import check_context")
218
rag_manage.py
@@ -1,218 +0,0 @@
#!/usr/bin/env python3
"""
RAG Manager - Manage the OpenClaw knowledge base (add/remove/stats)
"""

import sys
from pathlib import Path

# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent))

from rag_system import RAGSystem


def show_stats(collection_name: str = "openclaw_knowledge"):
    """Show collection statistics"""
    print("📊 OpenClaw RAG Statistics\n")

    rag = RAGSystem(collection_name=collection_name)
    stats = rag.get_stats()

    print(f"Collection: {stats['collection_name']}")
    print(f"Storage: {stats['persist_directory']}")
    print(f"Total Documents: {stats['total_documents']}\n")

    if stats['source_distribution']:
        print("By Source:")
        for source, count in sorted(stats['source_distribution'].items())[:15]:
            print(f"  {source}: {count}")
        print()

    if stats['type_distribution']:
        print("By Type:")
        for doc_type, count in sorted(stats['type_distribution'].items()):
            print(f"  {doc_type}: {count}")


def add_manual_document(
    text: str,
    source: str,
    doc_type: str = "manual",
    collection_name: str = "openclaw_knowledge"
):
    """Manually add a document to the knowledge base"""
    from datetime import datetime

    metadata = {
        "type": doc_type,
        "source": source,
        "added_at": datetime.now().isoformat()
    }

    rag = RAGSystem(collection_name=collection_name)
    doc_id = rag.add_document(text, metadata)

    print(f"✅ Document added: {doc_id}")
    print(f"  Source: {source}")
    print(f"  Type: {doc_type}")
    print(f"  Length: {len(text)} chars")


def delete_by_source(
    source: str,
    collection_name: str = "openclaw_knowledge"
):
    """Delete all documents from a specific source"""
    rag = RAGSystem(collection_name=collection_name)

    # Count matching docs first
    results = rag.collection.get(where={"source": source})
    count = len(results['ids'])

    if count == 0:
        print(f"⚠️ No documents found with source: {source}")
        return

    # Confirm
    print(f"Found {count} documents from source: {source}")
    confirm = input("Delete them? (yes/no): ").strip().lower()

    if confirm not in ['yes', 'y']:
        print("Cancelled")
        return

    # Delete
    deleted = rag.delete_by_filter({"source": source})
    print(f"✅ Deleted {deleted} documents")


def delete_by_type(
    doc_type: str,
    collection_name: str = "openclaw_knowledge"
):
    """Delete all documents of a specific type"""
    rag = RAGSystem(collection_name=collection_name)

    # Count matching docs first
    results = rag.collection.get(where={"type": doc_type})
    count = len(results['ids'])

    if count == 0:
        print(f"⚠️ No documents found with type: {doc_type}")
        return

    # Confirm
    print(f"Found {count} documents of type: {doc_type}")
    confirm = input("Delete them? (yes/no): ").strip().lower()

    if confirm not in ['yes', 'y']:
        print("Cancelled")
        return

    # Delete
    deleted = rag.delete_by_filter({"type": doc_type})
    print(f"✅ Deleted {deleted} documents")


def reset_collection(collection_name: str = "openclaw_knowledge"):
    """Delete all documents and reset the collection"""
    print("⚠️ WARNING: This will delete ALL documents from the collection!")

    # Double confirm
    confirm1 = input("Type 'yes' to confirm: ").strip().lower()
    if confirm1 != 'yes':
        print("Cancelled")
        return

    confirm2 = input("Are you REALLY sure? This cannot be undone (type 'yes'): ").strip().lower()
    if confirm2 != 'yes':
        print("Cancelled")
        return

    rag = RAGSystem(collection_name=collection_name)
    rag.reset_collection()

    print("✅ Collection reset - all documents deleted")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Manage OpenClaw RAG knowledge base")
    parser.add_argument("action", choices=["stats", "add", "delete", "reset", "interactive"], help="Action to perform")
    parser.add_argument("--collection", default="openclaw_knowledge", help="Collection name")

    # Add arguments
    parser.add_argument("--text", help="Document text (for add)")
    parser.add_argument("--source", help="Document source (for add)")
    parser.add_argument("--type", "--doc-type", help="Document type (for add)")

    # Delete arguments
    parser.add_argument("--by-source", help="Delete by source (for delete)")
    parser.add_argument("--by-type", help="Delete by type (for delete)")

    args = parser.parse_args()

    if args.action == "stats":
        show_stats(collection_name=args.collection)

    elif args.action == "add":
        if not args.text or not args.source:
            print("❌ --text and --source required for add action")
            sys.exit(1)

        add_manual_document(
            text=args.text,
            source=args.source,
            doc_type=args.type or "manual",
            collection_name=args.collection
        )

    elif args.action == "delete":
        if args.by_source:
            delete_by_source(args.by_source, collection_name=args.collection)
        elif args.by_type:
            delete_by_type(args.by_type, collection_name=args.collection)
        else:
            print("❌ --by-source or --by-type required for delete action")
            sys.exit(1)

    elif args.action == "reset":
        reset_collection(collection_name=args.collection)

    elif args.action == "interactive":
        print("🚀 OpenClaw RAG Manager - Interactive Mode\n")

        while True:
            print("\nActions:")
            print("  1. Show stats")
            print("  2. Add document")
            print("  3. Delete by source")
            print("  4. Delete by type")
            print("  5. Exit")

            choice = input("\nChoose action (1-5): ").strip()

            if choice == '1':
                show_stats(collection_name=args.collection)
            elif choice == '2':
                text = input("Document text: ").strip()
                source = input("Source: ").strip()
                doc_type = input("Type (default: manual): ").strip() or "manual"

                if text and source:
                    add_manual_document(text, source, doc_type, collection_name=args.collection)
            elif choice == '3':
                source = input("Source to delete: ").strip()
                if source:
                    delete_by_source(source, collection_name=args.collection)
            elif choice == '4':
                doc_type = input("Type to delete: ").strip()
                if doc_type:
                    delete_by_type(doc_type, collection_name=args.collection)
            elif choice == '5':
                print("👋 Goodbye!")
                break
            else:
                print("❌ Invalid choice")
182
rag_query.py
@@ -1,182 +0,0 @@
#!/usr/bin/env python3
"""
RAG Query - Search the OpenClaw knowledge base
"""

import sys
from pathlib import Path

# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent))

from rag_system import RAGSystem


def format_result(result: dict, index: int) -> str:
    """Format a single search result"""
    metadata = result['metadata']

    # Determine type
    doc_type = metadata.get('type', 'unknown')
    source = metadata.get('source', '?')

    # Header based on type
    if doc_type == 'session':
        chunk_idx = metadata.get('chunk_index', '?')
        header = f"\n📄 Session {source} (chunk {chunk_idx})"
    elif doc_type == 'workspace':
        header = f"\n📁 {source}"
    elif doc_type == 'skill':
        skill_name = metadata.get('skill_name', source)
        header = f"\n📜 Skill: {skill_name}"
    elif doc_type == 'memory':
        header = f"\n🧠 Memory: {source}"
    else:
        header = f"\n🔹 {doc_type}: {source}"

    # Format text (limit length)
    text = result['text']
    if len(text) > 1000:
        text = text[:1000] + "..."

    # Get date if available
    info = []
    if 'ingested_at' in metadata:
        info.append(f"indexed {metadata['ingested_at'][:10]}")

    # Chunk info
    if 'chunk_index' in metadata and 'total_chunks' in metadata:
        info.append(f"chunk {metadata['chunk_index']+1}/{metadata['total_chunks']}")

    info_str = f" ({', '.join(info)})" if info else ""

    return f"{header}{info_str}\n{text}"


def search(
    query: str,
    n_results: int = 10,
    filters: dict = None,
    collection_name: str = "openclaw_knowledge",
    verbose: bool = True
) -> list:
    """
    Search the RAG knowledge base

    Args:
        query: Search query
        n_results: Number of results
        filters: Metadata filters (e.g., {"type": "skill"})
        collection_name: Collection name
        verbose: Print results

    Returns:
        List of result dicts
    """
    if verbose:
        print(f"🔍 Query: {query}")
        if filters:
            print(f"🎯 Filters: {filters}")
        print()

    # Initialize RAG
    rag = RAGSystem(collection_name=collection_name)

    # Search
    results = rag.search(query, n_results=n_results, filters=filters)

    if not results:
        if verbose:
            print("❌ No results found")
        return []

    if verbose:
        print(f"✅ Found {len(results)} results\n")
        print("=" * 80)

        for i, result in enumerate(results, 1):
            print(format_result(result, i))
            print("=" * 80)

    return results


def interactive_search(collection_name: str = "openclaw_knowledge"):
    """Interactive search mode"""
    print("🚀 OpenClaw RAG Search - Interactive Mode")
    print("Type 'quit' or 'exit' to stop\n")

    rag = RAGSystem(collection_name=collection_name)

    # Show stats
    stats = rag.get_stats()
    print(f"📊 Collection: {stats['collection_name']}")
    print(f"  Total documents: {stats['total_documents']}")
    print(f"  Storage: {stats['persist_directory']}\n")

    while True:
        try:
            query = input("\n🔍 Search query: ").strip()

            if not query:
                continue

            if query.lower() in ['quit', 'exit', 'q']:
                print("\n👋 Goodbye!")
                break

            # Parse filters if any
            filters = None
            if query.startswith("type:"):
                parts = query.split(maxsplit=1)
                if len(parts) > 1:
                    doc_type = parts[0].replace("type:", "")
                    query = parts[1]
                    filters = {"type": doc_type}

            # Search
            results = rag.search(query, n_results=10, filters=filters)

            if results:
                print(f"\n✅ {len(results)} results:")
                print("=" * 80)

                for i, result in enumerate(results, 1):
                    print(format_result(result, i))
                    print("=" * 80)
            else:
                print("❌ No results found")

        except KeyboardInterrupt:
            print("\n\n👋 Goodbye!")
            break
        except Exception as e:
            print(f"❌ Error: {e}")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Search OpenClaw RAG knowledge base")
    parser.add_argument("query", nargs="?", help="Search query (if not provided, enters interactive mode)")
    parser.add_argument("-n", "--num-results", type=int, default=10, help="Number of results")
    parser.add_argument("--type", help="Filter by document type (session, workspace, skill, memory)")
    parser.add_argument("--collection", default="openclaw_knowledge", help="Collection name")
    parser.add_argument("--interactive", "-i", action="store_true", help="Interactive mode")

    args = parser.parse_args()

    # Build filters
    filters = None
    if args.type:
        filters = {"type": args.type}

    if args.interactive or not args.query:
        interactive_search(collection_name=args.collection)
    else:
        search(
            query=args.query,
            n_results=args.num_results,
            filters=filters,
            collection_name=args.collection
        )
@@ -1,89 +0,0 @@
#!/usr/bin/env python3
"""
Quick RAG Query - Simple function to call from Python scripts or sessions

Usage:
    from rag.rag_query_quick import search_context
    results = search_context("your question here")
"""

import sys
from pathlib import Path

# Add parent directory
sys.path.insert(0, str(Path(__file__).parent))

from rag_system import RAGSystem


def search_context(
    query: str,
    n_results: int = 5,
    collection_name: str = "openclaw_knowledge"
) -> str:
    """
    Search the RAG knowledge base and return formatted results.

    This is the simplest way to use RAG from within Python code.

    Args:
        query: Search question
        n_results: Number of results to return
        collection_name: ChromaDB collection name

    Returns:
        Formatted string with relevant context

    Example:
        >>> from rag.rag_query_quick import search_context
        >>> context = search_context("How do I send SMS?")
        >>> print(context)
    """
    try:
        rag = RAGSystem(collection_name=collection_name)
        results = rag.search(query, n_results=n_results)

        if not results:
            return "No relevant context found in knowledge base."

        output = []
        output.append(f"🔍 Found {len(results)} relevant items:\n")

        for i, result in enumerate(results, 1):
            meta = result.get('metadata', {})
            doc_type = meta.get('type', 'unknown')
            source = meta.get('source', 'unknown')

            # Format header
            if doc_type == 'session':
                header = f"📄 Session reference {i}"
            elif doc_type == 'workspace':
                header = f"📁 Code/Docs: {source}"
            elif doc_type == 'skill':
                header = f"📜 Skill: {source}"
            else:
                header = f"Reference {i}"

            # Format content
            text = result.get('text', '')
            if len(text) > 600:
                text = text[:600] + "..."

            output.append(f"\n{header}\n{text}\n")

        return '\n'.join(output)

    except Exception as e:
        return f"❌ RAG error: {e}"


# Test it when run directly
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python3 rag_query_quick.py <query>")
        sys.exit(1)

    query = ' '.join(sys.argv[1:])
    print(search_context(query))
@@ -1,117 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
RAG Query Wrapper - Simple function for the AI to call from within sessions
|
||||
|
||||
This is designed for automatic RAG integration. The AI can call this function
|
||||
to retrieve relevant context from past conversations, code, and documentation.
|
||||
|
||||
Usage (from within Python script or session):
|
||||
from rag_query_wrapper import search_knowledge
|
||||
results = search_knowledge("your question")
|
||||
print(results)
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Add RAG directory to path
|
||||
rag_dir = Path(__file__).parent
|
||||
sys.path.insert(0, str(rag_dir))
|
from rag_system import RAGSystem


def search_knowledge(query: str, n_results: int = 5) -> dict:
    """
    Search the knowledge base and return structured results.

    This is the primary function for automatic RAG integration.
    Returns a structured dict with results for easy programmatic use.

    Args:
        query: Search query
        n_results: Number of results to return

    Returns:
        dict with:
        - query: the search query
        - count: number of results found
        - items: list of result dicts with text and metadata
    """
    try:
        rag = RAGSystem()
        results = rag.search(query, n_results=n_results)

        items = []
        for result in results:
            meta = result.get('metadata', {})
            items.append({
                'text': result.get('text', ''),
                'type': meta.get('type', 'unknown'),
                'source': meta.get('source', 'unknown'),
                'chunk_index': meta.get('chunk_index', 0),
                'date': meta.get('date', '')
            })

        return {
            'query': query,
            'count': len(items),
            'items': items
        }

    except Exception as e:
        return {
            'query': query,
            'count': 0,
            'items': [],
            'error': str(e)
        }


def format_for_ai(results: dict) -> str:
    """
    Format RAG results for AI consumption.

    Args:
        results: dict from search_knowledge()

    Returns:
        Formatted string suitable for insertion into AI context
    """
    if results['count'] == 0:
        return ""

    output = [f"📚 Found {results['count']} relevant items from knowledge base:\n"]

    for item in results['items']:
        doc_type = item['type']
        source = item['source']
        text = item['text']

        if doc_type == 'session':
            header = f"📄 Past Conversation ({source})"
        elif doc_type == 'workspace':
            header = f"📁 Code/Documentation ({source})"
        elif doc_type == 'skill':
            header = f"📜 Skill Guide ({source})"
        else:
            header = f"🔹 Reference ({doc_type})"

        # Truncate if too long
        if len(text) > 700:
            text = text[:700] + "..."

        output.append(f"\n{header}\n{text}\n")

    return '\n'.join(output)


# Test function
def _test():
    """Quick test of RAG integration"""
    results = search_knowledge("Reddit account automation", n_results=3)
    print(format_for_ai(results))


if __name__ == "__main__":
    _test()
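Because `search_knowledge()` always returns the same dict shape (including an `error` key on failure), callers can branch on structure alone. The sketch below is a hypothetical consumer of that contract; the `summarize` helper and the sample data are illustrative, not part of the module.

```python
# Hypothetical consumer of the search_knowledge() result contract.
def summarize(results: dict) -> str:
    if results.get('error'):
        return f"RAG lookup failed: {results['error']}"
    if results['count'] == 0:
        return "No relevant context found."
    hits = [f"{r['type']}:{r['source']}" for r in results['items']]
    return f"{results['count']} hits: " + ", ".join(hits)


# Made-up sample matching the documented schema.
sample = {
    'query': 'messaging platforms',
    'count': 2,
    'items': [
        {'text': 'SMS via VoIP.ms...', 'type': 'session',
         'source': 's1', 'chunk_index': 0, 'date': ''},
        {'text': 'Telegram setup...', 'type': 'skill',
         'source': 'sms-guide', 'chunk_index': 0, 'date': ''},
    ],
}
print(summarize(sample))  # → 2 hits: session:s1, skill:sms-guide
```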
289 rag_system.py
@@ -1,289 +0,0 @@
#!/usr/bin/env python3
"""
OpenClaw RAG System - Core Library
Manages vector store, ingestion, and retrieval with ChromaDB
"""

import os
import json
import hashlib
from pathlib import Path
from typing import List, Dict, Optional
from datetime import datetime

try:
    import chromadb
    from chromadb.config import Settings
    CHROMADB_AVAILABLE = True
except ImportError:
    CHROMADB_AVAILABLE = False


class RAGSystem:
    """OpenClaw RAG System for knowledge retrieval"""

    def __init__(
        self,
        persist_directory: str = None,
        collection_name: str = "openclaw_knowledge",
        embedding_model: str = "all-MiniLM-L6-v2"
    ):
        """
        Initialize RAG system

        Args:
            persist_directory: Where ChromaDB stores data
            collection_name: Name of the collection
            embedding_model: Embedding model name (ChromaDB handles this)
        """
        if not CHROMADB_AVAILABLE:
            raise ImportError("chromadb not installed. Run: pip3 install chromadb")

        self.collection_name = collection_name

        # Default to ~/.openclaw/data/rag if not specified
        if persist_directory is None:
            persist_directory = os.path.expanduser("~/.openclaw/data/rag")

        self.persist_directory = Path(persist_directory)
        self.persist_directory.mkdir(parents=True, exist_ok=True)

        # Initialize ChromaDB client
        self.client = chromadb.PersistentClient(
            path=str(self.persist_directory),
            settings=Settings(
                anonymized_telemetry=False,
                allow_reset=True
            )
        )

        # Get or create collection
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={
                "created": datetime.now().isoformat(),
                "description": "OpenClaw knowledge base"
            }
        )

    def add_document(
        self,
        text: str,
        metadata: Dict,
        doc_id: Optional[str] = None
    ) -> str:
        """
        Add a document to the vector store

        Args:
            text: Document content
            metadata: Document metadata (type, source, date, etc.)
            doc_id: Optional document ID (auto-generated if not provided)

        Returns:
            Document ID
        """
        # Generate ID if not provided (include more context for uniqueness)
        if doc_id is None:
            unique_str = ":".join([
                metadata.get('type', 'unknown'),
                metadata.get('source', 'unknown'),
                metadata.get('date', datetime.now().isoformat()),
                str(metadata.get('chunk_index', '0')),  # may be an int; convert to string
                text[:200]
            ])
            doc_id = hashlib.md5(unique_str.encode()).hexdigest()

        # Add to collection
        self.collection.add(
            documents=[text],
            metadatas=[metadata],
            ids=[doc_id]
        )

        return doc_id

    def add_documents_batch(
        self,
        documents: List[Dict],
        batch_size: int = 100
    ) -> List[str]:
        """
        Add multiple documents efficiently

        Args:
            documents: List of {"text": str, "metadata": dict, "id": optional} dicts
            batch_size: Number of documents to add per batch

        Returns:
            List of document IDs
        """
        all_ids = []

        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]

            texts = [doc["text"] for doc in batch]
            metadatas = [doc["metadata"] for doc in batch]
            ids = [doc.get("id", hashlib.md5(
                f"{doc['metadata'].get('type', 'unknown')}:"
                f"{doc['metadata'].get('source', 'unknown')}:"
                f"{doc['metadata'].get('date', '')}:"
                f"{str(doc['metadata'].get('chunk_index', '0'))}:"
                f"{doc['text'][:100]}".encode()
            ).hexdigest()) for doc in batch]

            self.collection.add(
                documents=texts,
                metadatas=metadatas,
                ids=ids
            )

            all_ids.extend(ids)
            print(f"✅ Added batch {i//batch_size + 1}: {len(ids)} documents")

        return all_ids

    def search(
        self,
        query: str,
        n_results: int = 10,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """
        Search for relevant documents

        Args:
            query: Search query
            n_results: Number of results to return
            filters: Optional metadata filters

        Returns:
            List of {"text": str, "metadata": dict, "id": str, "score": float} dicts
        """
        results = self.collection.query(
            query_texts=[query],
            n_results=n_results,
            where=filters
        )

        # Format results
        formatted = []
        for i, doc_id in enumerate(results['ids'][0]):
            formatted.append({
                "id": doc_id,
                "text": results['documents'][0][i],
                "metadata": results['metadatas'][0][i],
                # Note: query() is not configured to return similarity scores here,
                # so approximate a score from rank order (1.0 = best match)
                "score": 1.0 - (i / len(results['ids'][0]))
            })

        return formatted

    def delete_document(self, doc_id: str) -> bool:
        """Delete a document by ID"""
        try:
            self.collection.delete(ids=[doc_id])
            return True
        except Exception as e:
            print(f"❌ Error deleting document {doc_id}: {e}")
            return False

    def delete_by_filter(self, filter_dict: Dict) -> int:
        """
        Delete documents by metadata filter

        Args:
            filter_dict: Filter criteria (e.g., {"source": "session-2026-02-10"})

        Returns:
            Number of documents deleted
        """
        # First, find matching IDs
        results = self.collection.get(where=filter_dict)

        if not results['ids']:
            return 0

        count = len(results['ids'])
        self.collection.delete(ids=results['ids'])

        print(f"✅ Deleted {count} documents matching filter")
        return count

    def get_stats(self) -> Dict:
        """Get statistics about the collection"""
        count = self.collection.count()

        # Get a sample to understand metadata structure
        sample = self.collection.get(limit=10)

        # Count by source/type (over the sample only, not the full collection)
        source_counts = {}
        type_counts = {}

        for metadata in sample['metadatas']:
            source = metadata.get('source', 'unknown')
            doc_type = metadata.get('type', 'unknown')

            source_counts[source] = source_counts.get(source, 0) + 1
            type_counts[doc_type] = type_counts.get(doc_type, 0) + 1

        return {
            "total_documents": count,
            "collection_name": self.collection_name,
            "persist_directory": str(self.persist_directory),
            "source_distribution": source_counts,
            "type_distribution": type_counts
        }

    def reset_collection(self):
        """Delete all documents and reset the collection"""
        # ChromaDB rejects an empty `where` filter, so delete by explicit IDs
        existing = self.collection.get()
        if existing['ids']:
            self.collection.delete(ids=existing['ids'])
        print("✅ Collection reset - all documents deleted")

    def close(self):
        """Close the connection"""
        # ChromaDB PersistentClient doesn't need explicit close
        pass


def main():
    """Test the RAG system"""
    print("🚀 Testing OpenClaw RAG System...\n")

    # Initialize
    rag = RAGSystem()
    print(f"✅ Initialized RAG system")
    print(f"   Collection: {rag.collection_name}")
    print(f"   Storage: {rag.persist_directory}\n")

    # Add test document
    test_doc = {
        "text": "OpenClaw is a personal AI assistant with tools for automation, messaging, and infrastructure management. It supports Discord, Telegram, SMS via VoIP.ms, and more.",
        "metadata": {
            "type": "test",
            "source": "test-initialization",
            "date": datetime.now().isoformat()
        }
    }

    doc_id = rag.add_document(test_doc["text"], test_doc["metadata"])
    print(f"✅ Added test document: {doc_id}\n")

    # Search
    results = rag.search(
        query="What messaging platforms does OpenClaw support?",
        n_results=5
    )

    print("🔍 Search Results:")
    for i, result in enumerate(results, 1):
        print(f"\n{i}. [{result['metadata'].get('source', '?')}]")
        print(f"   {result['text'][:200]}...")

    # Stats
    stats = rag.get_stats()
    print(f"\n📊 Stats:")
    print(f"   Total documents: {stats['total_documents']}")


if __name__ == "__main__":
    main()
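The auto-generated IDs in `add_document` are a deterministic MD5 over type, source, date, chunk index, and a text prefix, so re-ingesting an unchanged chunk produces the same ID instead of a duplicate. The sketch below reproduces that scheme outside the class; the metadata values are made-up examples, and a fixed `date` is supplied so the ID stays stable (the real code falls back to `datetime.now()` when no date is given).

```python
import hashlib
from datetime import datetime

# Made-up metadata for illustration; real values come from the ingestion scripts.
meta = {'type': 'workspace', 'source': 'scripts/deploy.py',
        'date': '2026-02-10', 'chunk_index': 2}
text = "def deploy():\n    ..."

unique_str = ":".join([
    meta.get('type', 'unknown'),
    meta.get('source', 'unknown'),
    meta.get('date', datetime.now().isoformat()),
    str(meta.get('chunk_index', '0')),  # chunk_index may be an int; stringify it
    text[:200],
])
doc_id = hashlib.md5(unique_str.encode()).hexdigest()
print(doc_id)  # stable 32-character hex ID for this exact chunk
```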
@@ -1,134 +0,0 @@
#!/bin/bash
# RAG Automatic Ingestion - Daily update of knowledge base
# This script checks for new content and intelligently updates the RAG system

set -e

# Use dynamic paths for portability
HOME="${HOME:-$(cd ~ && pwd)}"
OPENCLAW_DIR="${OPENCLAW_DIR:-$HOME/.openclaw}"
WORKSPACE_DIR="${OPENCLAW_DIR}/workspace"

# Paths
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RAG_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
STATE_FILE="$WORKSPACE_DIR/memory/rag-auto-state.json"
LOG_FILE="$WORKSPACE_DIR/memory/rag-auto-update.log"

# Timestamp
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
DATE=$(date +%Y-%m-%d)

# Create memory directory
mkdir -p "$(dirname "$STATE_FILE")"

# Initialize state if needed
if [ ! -f "$STATE_FILE" ]; then
    cat > "$STATE_FILE" << EOF
{
  "lastSessionIndex": 0,
  "lastWorkspaceIndex": 0,
  "lastSkillsIndex": 0,
  "updatedAt": "$TIMESTAMP"
}
EOF
fi

# Log function
log() {
    echo "[$TIMESTAMP] $1" | tee -a "$LOG_FILE"
}

# Get latest session file modification time
latest_session_time() {
    find "$OPENCLAW_DIR/agents/main/sessions" -name "*.jsonl" -type f -printf '%T@\n' 2>/dev/null | sort -rn | head -1 | cut -d. -f1 || echo "0"
}

log "=== RAG Auto-Update Started ==="

# Get current stats
SESSION_COUNT=$(find "$OPENCLAW_DIR/agents/main/sessions" -name "*.jsonl" 2>/dev/null | wc -l)
WORKSPACE_COUNT=$(find "$WORKSPACE_DIR" -type f \( -name "*.py" -o -name "*.js" -o -name "*.md" -o -name "*.json" \) 2>/dev/null | wc -l)
LATEST_SESSION=$(latest_session_time)

# Read last indexed timestamp
LAST_SESSION_INDEX=$(python3 -c "import json; print(json.load(open('$STATE_FILE')).get('lastSessionIndex', 0))" 2>/dev/null || echo "0")

log "Current status:"
log "  Sessions: $SESSION_COUNT files"
log "  Workspace: $WORKSPACE_COUNT searchable files"
log "  Latest session: $LATEST_SESSION"
log "  Last indexed: $LAST_SESSION_INDEX"

# Update sessions if new ones exist
if [ "$LATEST_SESSION" -gt "$LAST_SESSION_INDEX" ]; then
    log "✓ New/updated sessions detected, re-indexing..."

    cd "$RAG_DIR"
    # Test the command directly: with `set -e`, a separate `$?` check would
    # never run because a failing command exits the script first.
    if python3 ingest_sessions.py --sessions-dir "$OPENCLAW_DIR/agents/main/sessions" >> "$LOG_FILE" 2>&1; then
        log "✅ Sessions re-indexed successfully"
    else
        log "❌ Session indexing failed (see log)"
        exit 1
    fi
else
    log "✓ Sessions up to date (no new files)"
fi

# Update workspace (always do it - captures code changes)
log "Re-indexing workspace files..."
cd "$RAG_DIR"
if python3 ingest_docs.py workspace >> "$LOG_FILE" 2>&1; then
    log "✅ Workspace re-indexed successfully"
else
    log "❌ Workspace indexing failed (see log)"
    exit 1
fi

# Update skills
log "Re-indexing skills..."
cd "$RAG_DIR"
if python3 ingest_docs.py skills >> "$LOG_FILE" 2>&1; then
    log "✅ Skills re-indexed successfully"
else
    log "❌ Skills indexing failed (see log)"
    exit 1
fi

# Get document count (fall back to 0 so the JSON state below stays valid)
DOC_COUNT=$(cd "$RAG_DIR" && python3 -c "
import sys
sys.path.insert(0, '.')
from rag_system import RAGSystem
rag = RAGSystem()
print(rag.collection.count())
" 2>/dev/null || echo "0")

# Update state
python3 << EOF
import json

state = {
    "lastSessionIndex": $LATEST_SESSION,
    "lastWorkspaceIndex": $(date +%s),
    "lastSkillsIndex": $(date +%s),
    "updatedAt": "$TIMESTAMP",
    "totalDocuments": $DOC_COUNT,
    "sessionCount": $SESSION_COUNT
}

with open('$STATE_FILE', 'w') as f:
    json.dump(state, f, indent=2)
EOF

log "=== RAG Auto-Update Complete ==="
log "Total documents in knowledge base: $DOC_COUNT"
log "Next run: $(date -u -d '+24 hours' +%Y-%m-%dT%H:%M:%SZ)"

exit 0
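The changelog notes this script runs daily at 4:00 AM UTC via cron. A crontab entry along these lines would do it; the script path is an assumption (it depends on where the RAG skill's `scripts/` directory actually lives), and the log path matches `LOG_FILE` above.

```shell
# Hypothetical crontab entry (edit path to wherever the script is installed).
# Runs daily at 04:00; the machine's cron runs in local time, so set the
# system timezone to UTC (or use CRON_TZ=UTC where supported) for 4:00 AM UTC.
0 4 * * * $HOME/.openclaw/workspace/rag/scripts/rag-auto-update.sh >> $HOME/.openclaw/workspace/memory/rag-auto-update.log 2>&1
```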