snowsadh

themis-judge-3b

License: apache-2.0

Overview

Base Model: unsloth/llama-3.2-3b-Instruct

Training Method

QLoRA
Rank (r): 16
Alpha: 16
3 training epochs
AdamW 8-bit optimizer
Unsloth + Transformers
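
For reference, under the standard LoRA formulation the low-rank update is scaled by alpha / r, so the values listed above give a scaling factor of 1. This is an assumption: the card does not state whether a variant such as rank-stabilized scaling was used, and the symbols W, B, and A are introduced here only for illustration.

```latex
W' = W + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},
\qquad \frac{\alpha}{r} = \frac{16}{16} = 1
```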

Dataset

Multi-Turn Legal Argumentation
504 examples
56 constitutional law cases
9 argument categories

Capabilities

For each courtroom turn, Themis Judge generates:

Structured score updates
- Legal Application
- Issue Relevance
- Argument Flow
- Bench Handling
Judicial responses
Speaker-switch decisions
Internal judge notes

The target use case is an interactive moot court simulator where the model acts as the presiding judge.
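
A simulator built around the model would need to keep some per-session state between turns. The following is a minimal sketch of that bookkeeping, not the Themis platform's code: the `MootSession` class, the `RESPONDENT` side name, and the choice to accumulate deltas into running totals are all assumptions made for illustration. The field names match the Output Schema section of this card.

```python
from dataclasses import dataclass, field

# Criterion names as they appear in the model's "delta_scores" object.
CRITERIA = ("legal_application", "issue_relevance", "argument_flow", "bench_handling")
# "RESPONDENT" is an assumed name for the second side.
SIDES = ("PETITIONER", "RESPONDENT")


def _empty_totals():
    return {side: dict.fromkeys(CRITERIA, 0) for side in SIDES}


@dataclass
class MootSession:
    """Hypothetical session state for one simulated hearing."""
    speaker: str = "PETITIONER"
    totals: dict = field(default_factory=_empty_totals)
    last_judge_response: str = "None"

    def apply_turn(self, judge_output: dict) -> None:
        # Add each non-null delta to the current speaker's running total.
        for name in CRITERIA:
            delta = judge_output["delta_scores"].get(name)
            if delta is not None:
                self.totals[self.speaker][name] += delta
        # Keep the judge's reply so it can be fed into the next prompt.
        self.last_judge_response = judge_output["judge_response"]
        # speaker_switch true hands the floor to the other side.
        if judge_output["speaker_switch"]:
            self.speaker = SIDES[1 - SIDES.index(self.speaker)]
```

After each generation, the parsed output would be passed to `apply_turn`, and `speaker` and `last_judge_response` would then be used to fill the next prompt.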

Usage

Load Model

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "snowsadh/themis-judge-3b",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("snowsadh/themis-judge-3b")

Inference

python
import json

prompt = """### Case Summary:
{case_summary}

### Legal Issue:
{legal_issue}

### Relevant Laws:
{relevant_laws}

### Side:
{side}

### Opposing Counsel Argument:
{opposing_last_argument}

### Previous Judge Response:
{judge_last_response}

### CurrentArgument:
{current_argument}

### Judge Response:
"""

inputs = tokenizer(prompt.format(
    case_summary="...",
    legal_issue="...",
    relevant_laws="...",
    side="PETITIONER",
    opposing_last_argument="None",
    judge_last_response="None",
    current_argument="My Lords, the counsel humbly submits that..."
), return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

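response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
).strip()

# Sampled output can be truncated at max_new_tokens or malformed, so parse defensively.
try:
    print(json.loads(response))
except json.JSONDecodeError:
    print("Could not parse model output as JSON:", response)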
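
Generated text may contain extra characters after the JSON object, for example if generation continues past the closing brace. The helper below is a sketch of one way to recover the first complete object using the standard library's `json.JSONDecoder.raw_decode`. The function name `extract_first_json` is hypothetical and is not part of this model's tooling.

```python
import json
from typing import Optional


def extract_first_json(text: str) -> Optional[dict]:
    """Return the first JSON object found in `text`, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores what follows.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Try the next opening brace, if any.
        start = text.find("{", start + 1)
    return None
```

A `None` result signals a truncated or malformed generation, which a caller could handle by retrying with a larger `max_new_tokens`.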

Output Schema

json
{
  "delta_scores": {
    "legal_application": -3,
    "issue_relevance": 2,
    "argument_flow": 1,
    "bench_handling": null
  },
  "judge_response": "Counsel, what is your submission on...",
  "speaker_switch": false,
  "judge_notes": "..."
}

Scores range from -3 to +3 per criterion. bench_handling is null for unprompted submissions. speaker_switch: false means the judge asked a question and the same counsel responds next.
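
The constraints above can be checked before a parsed output is used downstream. This is a minimal validation sketch under two assumptions that the card implies but does not state outright: scores are integers, and only `bench_handling` may be null. The function name `validate_judge_output` is hypothetical.

```python
from typing import List

SCORE_KEYS = ("legal_application", "issue_relevance", "argument_flow", "bench_handling")


def validate_judge_output(obj: dict) -> List[str]:
    """Return a list of schema problems; an empty list means the output looks valid."""
    errors = []
    scores = obj.get("delta_scores")
    if not isinstance(scores, dict):
        errors.append("delta_scores must be an object")
    else:
        for key in SCORE_KEYS:
            if key not in scores:
                errors.append(f"missing score: {key}")
                continue
            value = scores[key]
            if value is None:
                # Assumption: only bench_handling is allowed to be null.
                if key != "bench_handling":
                    errors.append(f"{key} must not be null")
                continue
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or not -3 <= value <= 3:
                errors.append(f"{key} must be an integer from -3 to 3")
    for key, expected in (("judge_response", str), ("speaker_switch", bool), ("judge_notes", str)):
        if not isinstance(obj.get(key), expected):
            errors.append(f"{key} must be of type {expected.__name__}")
    return errors
```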

Evaluation

Evaluation was performed on a held-out set of 51 examples.

Structured Output Reliability

| Model | Parse Failures |
| --- | --- |
| Base Llama 3.2 3B | 51 / 51 |
| Themis Judge | 0 / 51 |

The base model did not produce a parseable output in the target judicial evaluation schema for any of the 51 examples. After fine-tuning, Themis Judge produced valid structured outputs across the entire evaluation set.
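
The card does not spell out how a parse failure was counted. A plausible reading, sketched here as an assumption, is that an output fails when `json.loads` cannot parse it. The function name `count_parse_failures` is hypothetical, and the evaluation notebook may apply stricter schema checks.

```python
import json


def count_parse_failures(outputs):
    """Count generated strings that are not valid JSON."""
    failures = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures
```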

Score Prediction

Mean Absolute Error (MAE) was measured on the four scoring dimensions.

| Criterion | MAE |
| --- | --- |
| Legal Application | 1.7255 |
| Issue Relevance | 0.9608 |
| Argument Flow | 0.5490 |
| Bench Handling | 1.0000 |

Average MAE: 1.0589

Scores range from -3 to +3, so an MAE of about 1.06 indicates moderate alignment with the gold-standard scorer.
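
As an illustration of how such figures can be computed, the sketch below averages absolute errors per criterion. Skipping pairs where either score is null is an assumption about how `bench_handling` was handled, the function name `per_criterion_mae` is hypothetical, and the evaluation notebook may differ.

```python
def per_criterion_mae(predictions, references):
    """MAE per criterion, skipping pairs where either score is null."""
    sums, counts = {}, {}
    for pred, ref in zip(predictions, references):
        for name, ref_score in ref.items():
            pred_score = pred.get(name)
            if ref_score is None or pred_score is None:
                continue
            sums[name] = sums.get(name, 0.0) + abs(pred_score - ref_score)
            counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}


# Unweighted mean of the four per-criterion values reported in the table above.
reported = [1.7255, 0.9608, 0.5490, 1.0000]
average_mae = sum(reported) / len(reported)
```

The unweighted mean of the four rounded table values is close to the reported average, which suggests the average was taken across criteria rather than across individual scores.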

The primary objective of the evaluation was to verify structured judicial behaviour, schema adherence, and score prediction consistency rather than to optimize a single benchmark metric. Gold-standard scores were generated by DeepSeek V4 Flash. Human evaluation is pending.

Training Metrics

Final Training Loss: 0.8544
Final Evaluation Loss: 0.8986

The training and evaluation losses remained closely aligned throughout fine-tuning, suggesting stable convergence. The final gap of 0.044 (0.8986 - 0.8544) suggests no significant overfitting.
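
The gap can be reproduced directly from the two final losses. The relative figure is an extra derived here for context and is not reported in the card.

```python
train_loss = 0.8544
eval_loss = 0.8986

gap = eval_loss - train_loss       # absolute difference between the two losses
relative_gap = gap / train_loss    # gap as a fraction of the training loss

print(f"gap={gap:.4f}, relative={relative_gap:.1%}")
# prints: gap=0.0442, relative=5.2%
```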

Example Outputs

Example 1

Case Summary

Two juveniles were accused of murdering a three-and-a-half-year-old child. Courts differed on whether each juvenile should be tried as an adult.

Legal Issue

Whether the Juvenile Justice Board correctly assessed the juveniles' capacity under Section 15 of the Juvenile Justice Act.

Argument (Petitioner)

The victim's father has suffered a direct injury and should have standing to challenge the differing treatment of the juveniles. Principles from Maneka Gandhi and Vishaka support a broader interpretation of standing in the interests of justice.

Model Output

json
{
  "delta_scores": {
    "legal_application": 0,
    "issue_relevance": 3,
    "argument_flow": 3,
    "bench_handling": 3
  },
  "judge_response": "The bench will now hear the respondent on the locus standi of the victim's father.",
  "speaker_switch": true,
  "judge_notes": "Counsel handled the question well, but the Maneka Gandhi and Vishaka citation is a stretch — needs to address why those cases are directly applicable to a Juvenile Justice Act provision."
}

Example 2

Case Summary

A hotel worker sought implementation of Karnataka's Menstrual Leave Policy across establishments in the unorganized sector.

Legal Issue

Whether the State is obligated to implement and enforce the policy to ensure workplace dignity and gender equity.

Argument (Petitioner)

The policy flows from Articles 21 and 15(3), and Vishaka demonstrates that workplace dignity obligations may exist even before detailed legislation is enacted.

Model Output

json
{
  "delta_scores": {
    "legal_application": 3,
    "issue_relevance": 3,
    "argument_flow": 3,
    "bench_handling": null
  },
  "judge_response": "Counsel, you rely on Vishaka for the right to health and dignity, but the policy itself was enacted by the State — does that not create a conflict between the State's legislative action and its obligation to enforce it?",
  "speaker_switch": false,
  "judge_notes": "Strong on Article 21 and Vishaka, but didn't address the policy's legislative origin."
}

Example 3

This example falls outside the primary training domain (constitutional law) and illustrates the model's behaviour on an arbitration dispute.

Case Summary

Sterlite Technologies and BSNL disputed payments relating to optical fiber cables laid in a common trench.

Legal Issue

Whether the arbitral tribunal's contractual interpretation was perverse and warranted interference under Section 34.

Argument (Petitioner)

The pre-bid clarification applied only to pre-existing government-owned ducts, though the petitioner conceded that subsequent ducts in the trench could practically be viewed as existing ducts.

Model Output

json
{
  "delta_scores": {
    "legal_application": 0,
    "issue_relevance": 3,
    "argument_flow": 2,
    "bench_handling": null
  },
  "judge_response": "Counsel, you concede that the pre-bid clarification does not mandate a reduction in service cost, but you argue that the practical engineering reality of the ducts justifies it. How do you reconcile this with the arbitrator's finding that the clarification was a binding interpretation of the contract?",
  "speaker_switch": false,
  "judge_notes": "Counsel conceded the pre-bid clarification does not mandate reduction but argued practical engineering justifies it - weak legal basis."
}

Reproducibility

Dataset: Multi-Turn Legal Argumentation

Training Notebook: Judge : Llama 3.2 QLoRA SFT

Evaluation Notebook: Evaluating Themis Judge

Limitations

Limited to courtroom-style legal argument evaluation.
Trained on 504 examples.
Focused primarily on Indian constitutional law contexts.
May generate legally incorrect reasoning.
Not suitable for legal advice or real-world judicial decision making.

Future Work

Future work will focus on integration of Themis Judge into the Themis moot court platform, including real-time argument evaluation, automated speaker management, and end-to-end courtroom simulation workflows.

Themis Repository: Themis

Setup & Usage Guide: README.md

This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.
