Arxivkb

Local arXiv paper manager with semantic search.

Rating
4.5 (275 reviews)
Downloads
43,741 downloads
Version
1.0.0

Complete Documentation

ArXivKB — Science Knowledge Base

Why This Skill?

🏠 100% local — crawls arXiv's free API, embeds with Ollama (nomic-embed-text), indexes in FAISS + SQLite. No cloud cost.

🔍 Semantic search on paper content — FAISS indexes PDF chunks (not just abstracts), so you find papers by what they contain.

📂 arXiv category-based — tracks official arXiv categories (155 available, 8 groups). Papers are selected for crawling by category, not by free-text queries.

🧹 Auto-cleanup — configurable expiry deletes old papers, PDFs, and chunks.

Install

bash
python3 scripts/install.py

Works on macOS and Linux. Installs Python deps (faiss-cpu, pdfplumber, tiktoken, arxiv, numpy), pulls nomic-embed-text via Ollama, creates data directories and DB.

Prerequisites

  • Ollama — must be installed and running (ollama serve)
  • Python 3.10+

Quick Start

bash
# 1. Add arXiv categories to track
akb categories add cs.AI cs.CV cs.LG

# 2. Browse all available categories
akb categories browse

# 3. Ingest recent papers (last 7 days)
akb ingest

# 4. Check stats
akb stats

Categories

bash
akb categories list                    # Show enabled categories
akb categories browse                  # Browse all 155 arXiv categories
akb categories browse robotics         # Filter by keyword
akb categories add cs.AI cs.RO         # Enable categories
akb categories delete cs.AI            # Disable a category

Categories are official arXiv codes (e.g. cs.AI, eess.IV, q-fin.ST). The full taxonomy is built in.

Ingestion

bash
akb ingest                    # Crawl, download PDFs, chunk, embed
akb ingest --days 14          # Look back 14 days
akb ingest --dry-run          # Preview only
akb ingest --no-pdf           # Index abstracts only (faster)

Pipeline: arXiv API → PDF download → text extraction (pdfplumber) → chunking (tiktoken, 500 tokens, 50 overlap) → embedding (Ollama nomic-embed-text) → FAISS + SQLite.
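
The chunking step can be pictured as a sliding window over the token sequence. The sketch below is a minimal illustration, not the skill's actual `pdf_processor.py`: `chunk_tokens` is a hypothetical helper, and it accepts any list of token IDs (in the real pipeline these would come from tiktoken). With 500-token chunks and 50 tokens of overlap, each window starts 450 tokens after the previous one.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], size: int = 500, overlap: int = 50) -> List[List[int]]:
    """Split a token sequence into windows of `size` tokens sharing `overlap` tokens.

    Hypothetical helper for illustration; the defaults mirror the documented
    500-token / 50-overlap settings.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # 450 with the documented defaults
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks


if __name__ == "__main__":
    fake_tokens = list(range(1200))  # stand-in for real tokenizer output
    for i, chunk in enumerate(chunk_tokens(fake_tokens)):
        print(i, chunk[0], chunk[-1], len(chunk))
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of embedding roughly 10% more tokens.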

Paper Details

bash
akb paper 2401.12345    # Show title, abstract, categories, PDF status

Statistics

bash
akb stats   # Papers, chunks, categories, DB size

Expiry & Cleanup

bash
akb expire               # Delete papers older than 90 days (default)
akb expire --days 30     # Override: delete papers older than 30 days
akb expire --days 30 -y  # Skip confirmation
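
What an expiry pass might do against the SQLite database can be sketched as follows. This is an assumption-laden illustration, not the skill's code: `expire_papers` and `make_demo_db` are hypothetical, paper age is assumed to be measured from `created_at` stored as ISO-8601 text, the arXiv IDs are made up, and removal of PDF files and FAISS vectors is left out.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def expire_papers(conn, days=90, now=None):
    """Delete papers (and their chunks) older than `days`; return the removed arXiv IDs."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).isoformat(timespec="seconds")
    rows = conn.execute(
        "SELECT id, arxiv_id FROM papers WHERE created_at < ? ORDER BY id", (cutoff,)
    ).fetchall()
    with conn:  # commit all deletions as one transaction
        for paper_id, _ in rows:
            conn.execute("DELETE FROM chunks WHERE paper_id = ?", (paper_id,))
            conn.execute("DELETE FROM papers WHERE id = ?", (paper_id,))
    return [arxiv_id for _, arxiv_id in rows]


def make_demo_db(now):
    """Build a tiny in-memory database with one old and one recent paper."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, arxiv_id TEXT, created_at TEXT)")
    conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, paper_id INTEGER, text TEXT)")
    for paper_id, arxiv_id, age_days in [(1, "0000.00001", 120), (2, "0000.00002", 10)]:
        created = (now - timedelta(days=age_days)).isoformat(timespec="seconds")
        conn.execute("INSERT INTO papers VALUES (?, ?, ?)", (paper_id, arxiv_id, created))
        conn.execute("INSERT INTO chunks (paper_id, text) VALUES (?, ?)", (paper_id, "chunk text"))
    conn.commit()
    return conn


if __name__ == "__main__":
    now = datetime(2026, 3, 10, tzinfo=timezone.utc)
    conn = make_demo_db(now)
    print(expire_papers(conn, days=90, now=now))
```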

Configuration

No config file needed. Defaults:

| Setting | Default | Override |
|---------|---------|----------|
| Data directory | `~/workspace/arxivkb` | `ARXIVKB_DATA_DIR` env or `--data-dir` |
| Ollama endpoint | `http://localhost:11434` | — (hardcoded) |
| Embedding model | `nomic-embed-text` (768d) | — (hardcoded) |
| Chunk size | 500 tokens, 50 overlap | — |
| Expiry | 90 days | `--days` flag |
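
How the data directory might be resolved from these settings can be sketched as below. `resolve_data_dir` is a hypothetical helper, and the precedence shown (the `--data-dir` flag over the `ARXIVKB_DATA_DIR` variable over the default) is an assumption; the documentation only says that both overrides exist.

```python
import argparse
import os
from pathlib import Path

DEFAULT_DATA_DIR = "~/workspace/arxivkb"


def resolve_data_dir(argv=None, environ=None) -> Path:
    """Return the data directory: --data-dir, else ARXIVKB_DATA_DIR, else the default.

    The flag-over-environment precedence is assumed, not taken from the skill's source.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--data-dir")
    # Ignore any other CLI arguments; only --data-dir matters here.
    args, _unknown = parser.parse_known_args(argv)
    raw = args.data_dir or environ.get("ARXIVKB_DATA_DIR") or DEFAULT_DATA_DIR
    return Path(raw).expanduser()


if __name__ == "__main__":
    print(resolve_data_dir())
```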

Data Layout

text
~/workspace/arxivkb/
├── arxivkb.db           # SQLite: papers, chunks, translations, categories
├── pdfs/                  # Downloaded PDF files ({arxiv_id}.pdf)
└── faiss/
    └── arxivkb.faiss    # FAISS IndexFlatIP (chunk embeddings)

DB Schema

  • papers: id, arxiv_id, title, abstract, categories, published, status, created_at
  • chunks: id, paper_id, section, chunk_index, text, faiss_id, created_at
  • translations: paper_id, language, abstract, created_at (PK: paper_id+language)
  • categories: code, description, group_name, enabled, added_at (155 entries)
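
The four tables listed above could be created with DDL along these lines. The column names come from the list; the column types, defaults, foreign keys, the uniqueness of `arxiv_id`, and the `init_db` helper are all assumptions made for illustration, so the real `db.py` may differ.

```python
import sqlite3

# Column names follow the documented schema; types and constraints are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id          INTEGER PRIMARY KEY,
    arxiv_id    TEXT UNIQUE NOT NULL,
    title       TEXT,
    abstract    TEXT,
    categories  TEXT,
    published   TEXT,
    status      TEXT,
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS chunks (
    id          INTEGER PRIMARY KEY,
    paper_id    INTEGER NOT NULL REFERENCES papers(id),
    section     TEXT,
    chunk_index INTEGER,
    text        TEXT,
    faiss_id    INTEGER,
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS translations (
    paper_id    INTEGER NOT NULL REFERENCES papers(id),
    language    TEXT NOT NULL,
    abstract    TEXT,
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (paper_id, language)
);
CREATE TABLE IF NOT EXISTS categories (
    code        TEXT PRIMARY KEY,
    description TEXT,
    group_name  TEXT,
    enabled     INTEGER DEFAULT 0,
    added_at    TEXT
);
"""


def init_db(path=":memory:"):
    """Open (or create) the database and make sure all four tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = init_db()
    print([row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")])
```

The `faiss_id` column is what ties a row in `chunks` to its vector in the FAISS index, so a search hit can be mapped back to its text and paper.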

💬 Chat Commands (OpenClaw Agent)

When this skill is installed, the agent recognizes /akb as a shortcut:

| Command | Action |
|---------|--------|
| `/akb list` | Show enabled categories |
| `/akb add cs.AI cs.RO` | Enable categories for crawling |
| `/akb remove cs.AI` | Disable a category |
| `/akb browse` | Browse all 155 arXiv categories |
| `/akb browse robotics` | Filter categories by keyword |
| `/akb stats` | Show paper/chunk/category counts |
| `/akb help` | Show available commands |

The agent runs these via the `akb` CLI internally.

📱 PrivateApp Dashboard

A companion PWA dashboard is available. It provides:

  • Semantic search across paper content
  • Paper detail with abstract translation (on-demand via LLM)
  • Inline PDF viewing
  • Category browser
  • Stats (papers, chunks, categories)

Architecture

text
scripts/
├── cli.py             # CLI — categories, ingest, paper, stats, expire
├── db.py              # SQLite schema + CRUD
├── arxiv_crawler.py   # arXiv API search + PDF download
├── arxiv_taxonomy.py  # Full arXiv category taxonomy (155 categories)
├── pdf_processor.py   # PDF text extraction + tiktoken chunking
├── embed.py           # Ollama nomic-embed-text (768d, normalized)
├── faiss_index.py     # FAISS IndexFlatIP manager
├── search.py          # Semantic search: query → FAISS → group by paper
└── install.py         # One-command installer
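
The "query → FAISS → group by paper" flow in `search.py` can be illustrated without FAISS. Because the embeddings are normalized and the index is an inner-product index (`IndexFlatIP`), the score of a chunk is its cosine similarity to the query. The sketch below swaps FAISS for a brute-force loop in pure Python; `search` and `normalize` are hypothetical names, and ranking each paper by its best-scoring chunk is an assumption about how results are grouped.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    chunk_to_paper: Dict[int, str],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Rank papers by the inner product of the query with their best-matching chunk."""
    q = normalize(query_vec)
    # Brute-force inner product against every chunk, as a flat index would do.
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), idx)
        for idx, v in enumerate(chunk_vecs)
    ]
    scored.sort(reverse=True)
    best: Dict[str, float] = {}
    for score, idx in scored[:top_k]:
        paper = chunk_to_paper[idx]
        if paper not in best:  # hits are sorted, so the first one per paper is its best
            best[paper] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    owners = {0: "paper-A", 1: "paper-A", 2: "paper-B"}
    print(search([2.0, 0.0], vecs, owners, top_k=3))
```

Grouping matters because several chunks of the same PDF often match one query; collapsing them keeps a single long paper from filling the whole result list.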

Installation

bash
openclaw install arxivkb


Tags

#devops_and-cloud

Quick Info

Category Development
Model Claude 3.5
Complexity One-Click
Author camopel
Last Updated 3/10/2026