Overview
Base Model: unsloth/llama-3.2-3b-Instruct
Training Method
- QLoRA
- Rank (r): 16
- Alpha: 16
- 3 training epochs
- AdamW 8-bit optimizer
- Unsloth + Transformers
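In standard LoRA the adapter update is multiplied by alpha / r, so the rank and alpha listed above give a scaling factor of 1.0. The sketch below only records the listed hyperparameters and computes that factor. The `QLoRAHyperparams` class, its field names, and the `adamw_8bit` identifier are hypothetical and are not taken from the training notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QLoRAHyperparams:
    """Hypothetical container for the hyperparameters listed above."""
    rank: int = 16
    alpha: int = 16
    epochs: int = 3
    optimizer: str = "adamw_8bit"  # assumed identifier for the AdamW 8-bit optimizer

    @property
    def scaling(self) -> float:
        # Standard LoRA scaling: the low-rank update is multiplied by alpha / r.
        return self.alpha / self.rank


params = QLoRAHyperparams()
print(params.scaling)
```

With alpha equal to the rank, the adapter output is applied without extra amplification or damping.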
Dataset
Capabilities
For each courtroom turn, Themis Judge generates:
- Structured score updates
  - Legal Application
  - Issue Relevance
  - Argument Flow
  - Bench Handling
- Judicial responses
- Speaker-switch decisions
- Internal judge notes
The target use case is an interactive moot court simulator where the model acts as the presiding judge.
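One way a simulator might consume these per-turn outputs is to keep a running scoreboard per side and use the speaker-switch decision to pick the next speaker. The sketch below is an assumption about such a loop, not the Themis platform's implementation. The `apply_turn` helper and the `RESPONDENT` side label are hypothetical; the `delta_scores` and `speaker_switch` field names follow the Output Schema section.

```python
CRITERIA = ("legal_application", "issue_relevance", "argument_flow", "bench_handling")


def apply_turn(totals, speaker, judge_output):
    """Add one turn's delta_scores to the speaker's totals and return the next speaker."""
    side_totals = totals.setdefault(speaker, {c: 0 for c in CRITERIA})
    for criterion, delta in judge_output["delta_scores"].items():
        if delta is not None:  # a null delta leaves the running total unchanged
            side_totals[criterion] += delta
    if judge_output["speaker_switch"]:
        # Assumed two-sided hearing: hand over to the other side.
        return "RESPONDENT" if speaker == "PETITIONER" else "PETITIONER"
    return speaker


totals = {}
turn = {
    "delta_scores": {
        "legal_application": -3,
        "issue_relevance": 2,
        "argument_flow": 1,
        "bench_handling": None,
    },
    "speaker_switch": False,
}
next_speaker = apply_turn(totals, "PETITIONER", turn)
print(next_speaker, totals)
```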
Usage
Load Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "snowsadh/themis-judge-3b",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("snowsadh/themis-judge-3b")
Inference
import json
prompt = """### Case Summary:
{case_summary}
### Legal Issue:
{legal_issue}
### Relevant Laws:
{relevant_laws}
### Side:
{side}
### Opposing Counsel Argument:
{opposing_last_argument}
### Previous Judge Response:
{judge_last_response}
### CurrentArgument:
{current_argument}
### Judge Response:
"""
inputs = tokenizer(
    prompt.format(
        case_summary="...",
        legal_issue="...",
        relevant_laws="...",
        side="PETITIONER",
        opposing_last_argument="None",
        judge_last_response="None",
        current_argument="My Lords, the counsel humbly submits that..."
    ),
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(json.loads(response))
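Calling `json.loads` directly raises an exception if the generation contains any text around the JSON object or is cut off at the token limit. A more forgiving parse could look like the sketch below, which uses only the standard library. The `extract_first_json` helper is hypothetical and not part of the model's tooling.

```python
import json


def extract_first_json(text):
    """Return the first JSON object found in text, or None if nothing parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None


sample = 'Judge: {"speaker_switch": false, "judge_notes": "..."} end of turn'
print(extract_first_json(sample))
```

Returning None instead of raising lets the caller decide whether to retry the generation or skip the turn.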
Output Schema
{
  "delta_scores": {
    "legal_application": -3,
    "issue_relevance": 2,
    "argument_flow": 1,
    "bench_handling": null
  },
  "judge_response": "Counsel, what is your submission on...",
  "speaker_switch": false,
  "judge_notes": "..."
}
Each delta score ranges from -3 to +3 per criterion. bench_handling is null for unprompted submissions, that is, turns that do not respond to a question from the bench. A speaker_switch value of false means the judge asked a question and the same counsel responds next.
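The constraints above can be checked mechanically before a parsed output is used. The following is a minimal sketch of such a check. The `validate_judge_output` helper is hypothetical, and two of its rules are assumptions rather than documented guarantees: that scores are integers, and that only bench_handling may be null.

```python
SCORE_KEYS = ("legal_application", "issue_relevance", "argument_flow", "bench_handling")


def validate_judge_output(obj):
    """Return a list of schema problems; an empty list means the object looks valid."""
    problems = []
    scores = obj.get("delta_scores")
    if not isinstance(scores, dict):
        problems.append("delta_scores must be an object")
        scores = {}
    for key in SCORE_KEYS:
        if key not in scores:
            problems.append(f"missing score: {key}")
            continue
        value = scores[key]
        if value is None:
            # Assumption: null is only meaningful for bench_handling.
            if key != "bench_handling":
                problems.append(f"{key} must not be null")
        elif isinstance(value, bool) or not isinstance(value, int) or not -3 <= value <= 3:
            problems.append(f"{key} must be an integer in [-3, 3]")
    for key in ("judge_response", "judge_notes"):
        if not isinstance(obj.get(key), str):
            problems.append(f"{key} must be a string")
    if not isinstance(obj.get("speaker_switch"), bool):
        problems.append("speaker_switch must be a boolean")
    return problems


example = {
    "delta_scores": {
        "legal_application": -3,
        "issue_relevance": 2,
        "argument_flow": 1,
        "bench_handling": None,
    },
    "judge_response": "Counsel, what is your submission on...",
    "speaker_switch": False,
    "judge_notes": "...",
}
print(validate_judge_output(example))
```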
Evaluation
Evaluation was performed on a held-out set of 51 examples.
Structured Output Reliability
| Model | Parse Failures |
|---|---|
| Base Llama 3.2 3B | 51 / 51 |
| Themis Judge | 0 / 51 |
The base model was unable to reliably generate the target judicial evaluation schema.
After fine-tuning, Themis Judge produced valid structured outputs across the entire evaluation set.
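The exact criterion for a parse failure is not stated here. A minimal sketch, assuming a failure means the raw generation does not parse with `json.loads` into a JSON object, could count failures as follows. The `count_parse_failures` helper is hypothetical and is not taken from the evaluation notebook.

```python
import json


def count_parse_failures(raw_outputs):
    """Count generations that do not parse into a JSON object."""
    failures = 0
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            failures += 1
            continue
        if not isinstance(parsed, dict):
            # Valid JSON that is not an object (e.g. a list) still misses the schema.
            failures += 1
    return failures


outputs = ['{"speaker_switch": true}', "The bench will now hear...", "[1, 2, 3]"]
print(f"{count_parse_failures(outputs)} / {len(outputs)}")
```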
Score Prediction
Mean Absolute Error (MAE) was measured on the four scoring dimensions.
| Criterion | MAE |
|---|---|
| Legal Application | 1.7255 |
| Issue Relevance | 0.9608 |
| Argument Flow | 0.5490 |
| Bench Handling | 1.0000 |
Average MAE (unweighted mean of the four criteria): 1.0588
Scores range from -3 to +3; an MAE of ~1.06 indicates moderate alignment with the gold standard scorer.
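For a single criterion, MAE is the mean of the absolute differences between predicted and gold scores. The sketch below shows that computation. How null bench_handling values were handled in the actual evaluation is not stated, so skipping any pair with a null on either side is an assumption, and the `mean_absolute_error` helper is hypothetical.

```python
def mean_absolute_error(predicted, gold):
    """MAE over paired scores, skipping pairs where either value is None."""
    errors = [
        abs(p - g)
        for p, g in zip(predicted, gold)
        if p is not None and g is not None
    ]
    if not errors:
        raise ValueError("no comparable score pairs")
    return sum(errors) / len(errors)


predicted = [3, 0, None, -1]
gold = [2, 0, 2, -3]
print(mean_absolute_error(predicted, gold))
```

In this toy input the third pair is skipped because the prediction is null, leaving absolute errors of 1, 0, and 2.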
The primary objective of evaluation was to verify structured judicial behaviour, schema adherence, and score prediction consistency rather than optimize a single benchmark metric.
Gold-standard scores were generated by DeepSeek V4 Flash, so the MAE figures measure agreement with an LLM scorer rather than with human judges. Human evaluation is pending.
Training Metrics
- Final Training Loss: 0.8544
- Final Evaluation Loss: 0.8986
The training and evaluation losses remained closely aligned throughout fine-tuning, suggesting stable convergence. The final gap of 0.044 (0.8986 - 0.8544) suggests no significant overfitting.
Example Outputs
Example 1
Case Summary
Two juveniles were accused of murdering a three-and-a-half-year-old child. Courts differed on whether each juvenile should be tried as an adult.
Legal Issue
Whether the Juvenile Justice Board correctly assessed the juveniles' capacity under Section 15 of the Juvenile Justice Act.
Argument (Petitioner)
The victim's father has suffered a direct injury and should have standing to challenge the differing treatment of the juveniles. Principles from Maneka Gandhi and Vishaka support a broader interpretation of standing in the interests of justice.
Model Output
{
  "delta_scores": {
    "legal_application": 0,
    "issue_relevance": 3,
    "argument_flow": 3,
    "bench_handling": 3
  },
  "judge_response": "The bench will now hear the respondent on the locus standi of the victim's father.",
  "speaker_switch": true,
  "judge_notes": "Counsel handled the question well, but the Maneka Gandhi and Vishaka citation is a stretch — needs to address why those cases are directly applicable to a Juvenile Justice Act provision."
}
Example 2
Case Summary
A hotel worker sought implementation of Karnataka's Menstrual Leave Policy across establishments in the unorganized sector.
Legal Issue
Whether the State is obligated to implement and enforce the policy to ensure workplace dignity and gender equity.
Argument (Petitioner)
The policy flows from Articles 21 and 15(3), and Vishaka demonstrates that workplace dignity obligations may exist even before detailed legislation is enacted.
Model Output
{
  "delta_scores": {
    "legal_application": 3,
    "issue_relevance": 3,
    "argument_flow": 3,
    "bench_handling": null
  },
  "judge_response": "Counsel, you rely on Vishaka for the right to health and dignity, but the policy itself was enacted by the State — does that not create a conflict between the State's legislative action and its obligation to enforce it?",
  "speaker_switch": false,
  "judge_notes": "Strong on Article 21 and Vishaka, but didn't address the policy's legislative origin."
}
Example 3
This example, a commercial arbitration dispute, falls outside the primary training domain of constitutional law and illustrates how the model generalizes to another area of law.
Case Summary
Sterlite Technologies and BSNL disputed payments relating to optical fiber cables laid in a common trench.
Legal Issue
Whether the arbitral tribunal's contractual interpretation was perverse and warranted interference under Section 34.
Argument (Petitioner)
The pre-bid clarification applied only to pre-existing government-owned ducts, though the petitioner conceded that subsequent ducts in the trench could practically be viewed as existing ducts.
Model Output
{
  "delta_scores": {
    "legal_application": 0,
    "issue_relevance": 3,
    "argument_flow": 2,
    "bench_handling": null
  },
  "judge_response": "Counsel, you concede that the pre-bid clarification does not mandate a reduction in service cost, but you argue that the practical engineering reality of the ducts justifies it. How do you reconcile this with the arbitrator's finding that the clarification was a binding interpretation of the contract?",
  "speaker_switch": false,
  "judge_notes": "Counsel conceded the pre-bid clarification does not mandate reduction but argued practical engineering justifies it - weak legal basis."
}
Reproducibility
Dataset: Multi-Turn Legal Argumentation
Training Notebook: Judge : Llama 3.2 QLoRA SFT
Evaluation Notebook: Evaluating Themis Judge
Limitations
- Limited to courtroom-style legal argument evaluation.
- Trained on 504 examples.
- Focused primarily on Indian constitutional law contexts.
- May generate legally incorrect reasoning.
- Not suitable for legal advice or real-world judicial decision making.
Future Work
Future work will focus on integration of Themis Judge into the Themis moot court platform, including real-time argument evaluation, automated speaker management, and end-to-end courtroom simulation workflows.
Themis Repository: Themis
Setup & Usage Guide: README.md
This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.