alireza7

GrepSeek-Qwen3.5-9B-GRPO


License: apache-2.0

Why direct corpus interaction?

Index-based retrieval (dense or sparse) suffers from three problems: semantic smoothing, which blurs fine-grained entity and lexical distinctions; limited controllability, since the agent cannot enforce exact filters or iteratively refine results; and redundant re-retrieval in multi-hop settings. By executing exact-string shell pipelines (e.g. rg -F), GrepSeek preserves lexical precision, isolates rare symbolic patterns and exact entity names, and composes multi-stage retrieval programs for compositional reasoning. It needs no embedding index and no offline indexing, only the ~14 GB raw corpus.
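To make the idea of a multi-stage exact-string pipeline concrete, here is a minimal pure-Python sketch. It is a stand-in for chaining fixed-string filters such as `rg -F`, not the GrepSeek harness itself; the function name `fixed_string_filter` and the toy passages are invented for illustration.

```python
from typing import Iterable, Iterator


def fixed_string_filter(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Yield lines containing `needle` as a literal substring (no regex, no stemming)."""
    for line in lines:
        if needle in line:
            yield line


# Toy passages invented for illustration; the real corpus is far larger.
corpus = [
    "Paris is the capital of France.",
    "Paris, Texas is a city in Lamar County.",
    "Paris Hilton is an American media personality.",
    "Lamar County is located in Texas.",
]

# Stage 1 keeps passages naming "Paris"; stage 2 keeps only those that also
# contain "Texas". Chaining generators mirrors piping one filter into the next.
stage1 = fixed_string_filter(corpus, "Paris")
stage2 = list(fixed_string_filter(stage1, "Texas"))
print(stage2)
```

Because matching is literal, a needle such as `a.c` matches only the three characters `a.c`, never `abc`, which is the precision property the paragraph above describes.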

Training

Initialized from: alireza7/GrepSeek-Qwen3.5-9B-SFT (cold-start SFT on alireza7/GrepSeek-ColdStart-SFT-10k; base Qwen/Qwen3.5-9B).
RL: GRPO with group size n=5; reward = token-F1 × binary format gate (only structurally valid <think>/<tool_call>/<tool_response>/<answer> trajectories receive non-zero reward); 200 steps, LR 5e-6, batch size 256, KL disabled, Ulysses sequence parallelism (SP=2), on 4×A100-80GB. Trained only on NQ + HotpotQA.
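The reward described above can be sketched as follows. This is an illustration under stated assumptions, not the training code: the SQuAD-style normalization in `normalize` and the specific structural checks in `format_gate` (balanced tags, exactly one closing `<answer>` block) are guesses at what "token-F1" and "structurally valid" mean here.

```python
import re
import string
from collections import Counter

TAGS = ("think", "tool_call", "tool_response", "answer")


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace (assumed)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over the bag of tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def format_gate(trajectory: str) -> int:
    """Return 1 if tags are balanced and one <answer> block ends the trajectory (assumed rule)."""
    for tag in TAGS:
        if trajectory.count(f"<{tag}>") != trajectory.count(f"</{tag}>"):
            return 0
    answers = re.findall(r"<answer>(.*?)</answer>", trajectory, flags=re.DOTALL)
    if len(answers) != 1 or not trajectory.rstrip().endswith("</answer>"):
        return 0
    return 1


def reward(trajectory: str, reference: str) -> float:
    """Token-F1 of the final answer, multiplied by the binary format gate."""
    if format_gate(trajectory) == 0:
        return 0.0
    answer = re.search(r"<answer>(.*?)</answer>", trajectory, flags=re.DOTALL).group(1)
    return token_f1(answer, reference)
```

The multiplicative gate means a malformed trajectory earns zero even when its answer text is correct, so the policy cannot trade structure for accuracy.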

⚠️ A tool-using agent, not a standalone chatbot

The model emits <tool_call> shell commands that must be executed against the corpus and returned as <tool_response> turns. You need the corpus (PeterJinGo/wiki-18-corpus), a tool-calling vLLM server, and the GrepSeek inference harness — all in the code repo.
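The turn structure can be sketched roughly as below. This assumes the `<tool_call>` payload is a plain shell command string, which the card does not confirm (the real harness may use a structured payload); `step`, `fake_execute`, and the file name `corpus.jsonl` are hypothetical names for illustration.

```python
import re
from typing import Callable, Optional

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_call(model_output: str) -> Optional[str]:
    """Return the payload of the first <tool_call> block, or None if absent."""
    match = TOOL_CALL.search(model_output)
    return match.group(1).strip() if match else None


def step(model_output: str, execute: Callable[[str], str]) -> Optional[str]:
    """Run the requested command and wrap its output as the next turn.

    Returns None when the model produced no tool call (e.g. it answered).
    """
    command = extract_tool_call(model_output)
    if command is None:
        return None
    return f"<tool_response>{execute(command)}</tool_response>"


# Hypothetical executor: a real harness would run the command against the corpus.
def fake_execute(command: str) -> str:
    return f"(output of: {command})"


turn = step(
    "<think>look it up</think><tool_call>rg -F 'Marie Curie' corpus.jsonl</tool_call>",
    fake_execute,
)
print(turn)
```

Passing the executor in as a function keeps the parsing logic testable without touching a shell; in practice the repo's inference harness handles execution.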

Usage

```bash
git clone https://github.com/alirezasalemi7/grepseek && cd grepseek
# env: TRAINING_ENV.md  ·  corpus: cold_start_sft/download_corpus.py

# 1. serve this checkpoint
MODEL_PATH=alireza7/GrepSeek-Qwen3.5-9B-GRPO bash rl/serve_rl.sh        # -> http://localhost:10730/v1

# 2a. generation on your own questions
GREPSEEK_CORPUS_ROOT=/path/to/wiki_18_corpus \
  bash inference/run_inference.sh --base_url http://localhost:10730/v1 \
    --model grepseek --temperature 0.6 --input my_questions.jsonl --out_dir out

# 2b. reproduce the benchmark eval (token-F1 / EM on the Search-R1 suite)
GREPSEEK_CORPUS_ROOT=/path/to/wiki_18_corpus \
  bash inference/run_inference.sh --base_url http://localhost:10730/v1 \
    --model grepseek --temperature 0.6 --datasets all --out_dir eval
```

The inference harness also ships the semantics-preserving sharded-parallel execution engine (+ persistent search daemon) that accelerates corpus search by up to 7.6× while remaining byte-exact with sequential grep.
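One way to see why sharded search can stay identical to sequential search is the sketch below. It is a simplified illustration, not the shipped engine: it assumes contiguous shards and in-order concatenation, and it uses threads only to show the ordering argument (it is not expected to reproduce the reported speedup). The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_sequential(lines: list[str], needle: str) -> list[str]:
    """Baseline: scan every line in file order."""
    return [line for line in lines if needle in line]


def search_sharded(lines: list[str], needle: str, num_shards: int = 4) -> list[str]:
    """Search contiguous shards in parallel, then concatenate in shard order."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # pool.map yields results in submission order, not completion order,
        # so hits come back in the same order a sequential scan would give.
        results = pool.map(lambda shard: search_sequential(shard, needle), shards)
        return [hit for shard_hits in results for hit in shard_hits]


# Toy corpus invented for illustration.
lines = [f"passage {i} mentions {'Curie' if i % 7 == 0 else 'Bohr'}" for i in range(100)]
print(search_sharded(lines, "Curie") == search_sequential(lines, "Curie"))
```

The key design point is that each shard is a contiguous slice and results are merged by shard index, so no hit is reordered, dropped, or duplicated relative to the sequential scan.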

Results (token-level F1)

Trained only on NQ + HotpotQA (marked *); the other five are out-of-distribution. GrepSeek gets the best micro-average and wins 4/7 benchmarks.

	NQ*	TriviaQA	PopQA	HotpotQA*	2Wiki	MuSiQue	Bamboogle	micro-avg
Search-R1 (Qwen3-Emb-4B, best baseline)	0.5067	0.7693	0.5101	0.5591	0.4299	0.2878	0.6989	0.5441
GrepSeek (this model)

Micro-average EM = 0.4948 (also best overall; full EM table in the paper). Gains are largest on multi-hop tasks (HotpotQA, 2Wiki, MuSiQue) that reward exact entity disambiguation and iterative evidence aggregation.
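For readers unfamiliar with the term, the sketch below contrasts micro- and macro-averaging. It assumes the usual definition of micro-average (pooling all examples, so larger benchmarks weigh more); whether the paper weights in exactly this way is an assumption, and the dataset names, scores, and sizes are toy values, not benchmark figures.

```python
def micro_average(per_dataset: dict[str, tuple[float, int]]) -> float:
    """per_dataset maps name -> (mean score, number of examples)."""
    total = sum(n for _, n in per_dataset.values())
    return sum(score * n for score, n in per_dataset.values()) / total


def macro_average(per_dataset: dict[str, tuple[float, int]]) -> float:
    """Unweighted mean of the per-dataset scores."""
    return sum(score for score, _ in per_dataset.values()) / len(per_dataset)


# Toy values for illustration only.
toy = {"large_set": (0.50, 900), "small_set": (0.70, 100)}
print(round(micro_average(toy), 4), round(macro_average(toy), 4))
```

With these toy values the micro-average sits closer to the large set's score, which is why a micro-averaged headline number is dominated by the biggest benchmarks.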

Limitations

Because retrieval is purely lexical, GrepSeek is weaker on surface-form variation and long-tail queries, e.g. PopQA (diacritics, name variants). grep also has no semantic relevance ranking, so an authoritative passage can be buried behind earlier file-order matches. Dense retrieval remains advantageous on heavily semantic or paraphrase-driven queries.
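The diacritics failure mode is easy to reproduce with a plain substring test. The sketch below is illustrative only: the passage and query are invented, and the `fold` helper shows what accent-insensitive matching would look like, not something GrepSeek is stated to do.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip combining marks and case so accented and plain spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()


passage = "Beyonc\u00e9 Knowles was born in Houston."  # contains the accented spelling
query = "Beyonce"  # the unaccented spelling a user might type

print(query in passage)              # exact-string match misses the passage
print(fold(query) in fold(passage))  # accent-folded comparison finds it
```

An exact-string search returns nothing for the unaccented query, so the agent has to guess the right surface form or try several variants, which is the cost the paragraph above describes.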

License

Inherits the license of the base model Qwen/Qwen3.5-9B — confirm and update the license field above if needed.

Citation

```bibtex
@misc{salemi2026grepseektrainingsearchagents,
      title={GrepSeek: Training Search Agents for Direct Corpus Interaction},
      author={Alireza Salemi and Chang Zeng and Atharva Nijasure and Jui-Hui Chung and Razieh Rahimi and Fernando Diaz and Hamed Zamani},
      year={2026},
      eprint={2605.29307},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.29307},
}
```
