Why direct corpus interaction?
Index-based retrieval (dense or sparse) suffers from semantic smoothing
(blurring fine-grained entity/lexical distinctions), limited controllability
(the agent can't enforce exact filters or iteratively refine results), and
redundant re-retrieval in multi-hop settings. By executing exact-string shell
pipelines (e.g. rg -F), GrepSeek preserves lexical precision, isolates rare
symbolic patterns and exact entity names, and composes multi-stage retrieval
programs for compositional reasoning — while needing no embedding index (only
the ~14 GB raw corpus; no offline indexing).
Training
- Initialized from:
alireza7/GrepSeek-Qwen3.5-9B-SFT (cold-start SFT on alireza7/GrepSeek-ColdStart-SFT-10k; base Qwen/Qwen3.5-9B).
- RL: GRPO, group size n=5, reward = token-F1 × binary format gate (only structurally valid
<think>/<tool_call>/<tool_response>/<answer> trajectories get non-zero reward), 200 steps, LR 5e-6, batch 256, KL disabled, Ulysses SP=2, on 4×A100-80GB. Trained only on NQ + HotpotQA.
The model emits <tool_call> shell commands that must be executed against the
corpus and returned as <tool_response> turns. You need the corpus
(PeterJinGo/wiki-18-corpus),
a tool-calling vLLM server, and the GrepSeek inference harness — all in the
code repo.
Usage
git clone https://github.com/alirezasalemi7/grepseek && cd grepseek
# env: TRAINING_ENV.md · corpus: cold_start_sft/download_corpus.py
# 1. serve this checkpoint
MODEL_PATH=alireza7/GrepSeek-Qwen3.5-9B-GRPO bash rl/serve_rl.sh # -> http://localhost:10730/v1
# 2a. generation on your own questions
GREPSEEK_CORPUS_ROOT=/path/to/wiki_18_corpus \
bash inference/run_inference.sh --base_url http://localhost:10730/v1 \
--model grepseek --temperature 0.6 --input my_questions.jsonl --out_dir out
# 2b. reproduce the benchmark eval (token-F1 / EM on the Search-R1 suite)
GREPSEEK_CORPUS_ROOT=/path/to/wiki_18_corpus \
bash inference/run_inference.sh --base_url http://localhost:10730/v1 \
--model grepseek --temperature 0.6 --datasets all --out_dir eval
The inference harness also ships the semantics-preserving sharded-parallel
execution engine (+ persistent search daemon) that accelerates corpus search by
up to 7.6× while remaining byte-exact with sequential grep.
Results (token-level F1)
Trained only on NQ + HotpotQA (marked *); the other five are out-of-distribution.
GrepSeek gets the best micro-average and wins 4/7 benchmarks.
Table with columns: NQ*, TriviaQA, PopQA, HotpotQA*, 2Wiki, MuSiQue, Bamboogle, micro-avg | NQ* | TriviaQA | PopQA | HotpotQA* | 2Wiki | MuSiQue | Bamboogle | micro-avg |
|---|
| Search-R1 (Qwen3-Emb-4B, best baseline) | 0.5067 | 0.7693 | 0.5101 | 0.5591 | 0.4299 | 0.2878 | 0.6989 | 0.5441 |
| GrepSeek (this model) |
Micro-average EM = 0.4948 (also best overall; full EM table in the paper). Gains
are largest on multi-hop tasks (HotpotQA, 2Wiki, MuSiQue) that reward exact
entity disambiguation and iterative evidence aggregation.
Limitations
Because retrieval is purely lexical, GrepSeek is weaker on surface-form
variation / long-tail queries — e.g. PopQA (diacritics, name variants) — and
grep has no semantic relevance ranking, so an authoritative passage can be
buried behind earlier file-order matches. Dense retrieval remains advantageous on
heavily semantic or paraphrase-driven queries.
License
Inherits the license of the base model Qwen/Qwen3.5-9B — confirm and update the
license field above if needed.
Citation
@misc{salemi2026grepseektrainingsearchagents,
title={GrepSeek: Training Search Agents for Direct Corpus Interaction},
author={Alireza Salemi and Chang Zeng and Atharva Nijasure and Jui-Hui Chung and Razieh Rahimi and Fernando Diaz and Hamed Zamani},
year={2026},
eprint={2605.29307},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.29307},
}