Anugya/text2cypher-smollm2 API & Inference Endpoint

Model Details

Base model: HuggingFaceTB/SmolLM2-135M-Instruct
Model type: Causal Language Model
Language: English
License: Apache 2.0
Finetuned by: Anugya Sahu

Training Data

Dataset: RomanTeucher/text2cypher-curated
Splits: 1,000 training samples, 75 validation samples, 50 test samples
Each sample contains a graph schema, a natural language question, and a target Cypher query
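As a minimal sketch, one sample can be turned into a single training string by reusing the prompt layout shown under How to Use. The field names `schema`, `question`, and `cypher` are hypothetical and may not match the dataset's actual column names. Placing the target query on the line after `### Cypher:` is also an assumption.

```python
def build_prompt(schema: str, question: str) -> str:
    """Format a schema and question using the prompt layout from this card."""
    return f"### Schema:\n{schema}\n\n### Question:\n{question}\n\n### Cypher:"


def build_training_text(sample: dict) -> str:
    """Append the target query to the prompt to form one training string.

    The keys "schema", "question", and "cypher" are hypothetical names.
    """
    prompt = build_prompt(sample["schema"], sample["question"])
    return f"{prompt}\n{sample['cypher']}"


# Illustrative sample, not taken from the dataset
sample = {
    "schema": "Movie {title, year}, Person {name}, (Person)-[:DIRECTED]->(Movie)",
    "question": "Which movies were released in 2010?",
    "cypher": "MATCH (m:Movie) WHERE m.year = 2010 RETURN m.title",
}
text = build_training_text(sample)
print(text)
```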

How to Use

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Anugya/text2cypher-smollm2")
tokenizer = AutoTokenizer.from_pretrained("Anugya/text2cypher-smollm2")
tokenizer.pad_token = tokenizer.eos_token

schema = "Movie {title, year}, Person {name}, (Person)-[:DIRECTED]->(Movie)"
question = "Which movies did Christopher Nolan direct before 2010?"

prompt = f"""### Schema:
{schema}

### Question:
{question}

### Cypher:"""

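# Tokenize the prompt and decode greedily (do_sample=False) for repeatable output
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# Drop the prompt tokens so that only the newly generated Cypher is decoded
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))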

Training Details

Method: Full fine-tune (all weights updated, no LoRA)
Epochs: 3
Learning rate: 2e-4
Batch size: 4
Max token length: 256
Hardware: CPU (Apple M-series)
Precision: float32
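To put these settings in perspective, the sketch below works out the number of optimizer steps they imply for the 1,000-sample training split. It assumes no gradient accumulation and that a final partial batch is kept, neither of which is stated in this card. `TrainConfig` is a hypothetical helper, not part of the training code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in this card (hypothetical container)."""
    train_samples: int = 1000
    epochs: int = 3
    batch_size: int = 4
    learning_rate: float = 2e-4
    max_length: int = 256
    grad_accum_steps: int = 1  # assumption: not stated in the card

    @property
    def steps_per_epoch(self) -> int:
        # One optimizer step per effective batch; a partial batch counts as a step
        effective_batch = self.batch_size * self.grad_accum_steps
        return math.ceil(self.train_samples / effective_batch)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs

    @property
    def max_tokens_seen(self) -> int:
        # Upper bound: every sample truncated or padded to max_length
        return self.train_samples * self.epochs * self.max_length


cfg = TrainConfig()
print(cfg.steps_per_epoch, cfg.total_steps, cfg.max_tokens_seen)
# prints: 250 750 768000
```

Under these assumptions the run is at most 750 optimizer steps over at most 768,000 tokens, which is small even for a 135M-parameter model.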

Evaluation

Evaluated on 50 test samples using:

Exact Match: strict string comparison after lowercasing and stripping surrounding whitespace
Token F1: harmonic mean of precision and recall over the tokens shared by the prediction and the ground truth
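The two metrics can be sketched as follows. This is one common formulation, not necessarily the exact evaluation script used here: it assumes whitespace tokenization and counts repeated tokens with a multiset intersection. The function names are illustrative.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace."""
    return text.strip().lower()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


pred = "MATCH (m:Movie) RETURN m.title"
gold = "MATCH (m:Movie) RETURN m.year"
print(exact_match(pred, gold), round(token_f1(pred, gold), 3))
```

Token F1 gives partial credit when a query is nearly right, but a high score does not guarantee that the query runs or returns the correct rows.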

Limitations

135M-parameter model: output often looks like valid Cypher but is frequently incorrect
Generated queries were not validated by executing them against a real Neo4j database
May struggle with complex schemas or multi-hop queries
Trained on CPU for only 3 epochs on 1,000 samples; more data or longer training would likely improve results
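Because outputs are not validated by execution, a cheap pre-check can catch some errors before a query reaches a database. The sketch below is a heuristic, not a Cypher parser. It assumes the schema string follows the format used under How to Use, and it only flags node labels and relationship types that do not appear in the schema. It does not check properties, syntax, or semantics, and it inspects only the first label in patterns such as `(n:A:B)`. The helper names are hypothetical.

```python
import re


def schema_names(schema: str) -> tuple[set[str], set[str]]:
    """Extract node labels ("Movie {…}") and relationship types ("[:DIRECTED]")."""
    labels = set(re.findall(r"(\w+)\s*\{", schema))
    rel_types = set(re.findall(r"\[\s*:\s*(\w+)", schema))
    return labels, rel_types


def unknown_names(query: str, schema: str) -> set[str]:
    """Return labels and relationship types used in the query but absent from the schema."""
    labels, rel_types = schema_names(schema)
    # "(m:Movie" or "(:Movie" -> Movie
    used_labels = set(re.findall(r"\(\s*\w*\s*:\s*(\w+)", query))
    # "[r:DIRECTED" or "[:DIRECTED" -> DIRECTED
    used_rels = set(re.findall(r"\[\s*\w*\s*:\s*(\w+)", query))
    return (used_labels - labels) | (used_rels - rel_types)


schema = "Movie {title, year}, Person {name}, (Person)-[:DIRECTED]->(Movie)"
good = "MATCH (p:Person)-[:DIRECTED]->(m:Movie) WHERE m.year < 2010 RETURN m.title"
bad = "MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) RETURN m.title"
print(unknown_names(good, schema))
print(sorted(unknown_names(bad, schema)))
```

An empty result only means the query uses names the schema knows about. Executing the query against a test database remains the only reliable check.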
