abercrombie-grpo API & Inference Endpoint

Links

Training environment (Prime Intellect, public): smolclaims/abercrombie
Base model: Qwen/Qwen3.5-4B
Benchmark: LegalBench Abercrombie

Results

On the 95-row LegalBench Abercrombie held-out test set (non-thinking inference, greedy decoding):

Category	Base Qwen3.5-4B	+ Abercrombie-GRPO LoRA	Delta (pp)
Generic	89%	100%	+11
Descriptive	100%	74%	-26
Suggestive	5%	26%	+21
Arbitrary	0%	47%	+47
Fanciful	5%	95%	+90
Overall	40.0%	68.4%	+28.4

Mean ordinal distance (lower is better): 1.09 -> 0.53, roughly halved.
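
The ordinal-distance figure can be read as the average number of steps between the predicted and true label on the spectrum. The sketch below is one plausible way to compute it, assuming the ordering Generic < Descriptive < Suggestive < Arbitrary < Fanciful and an absolute rank difference; the function names are illustrative and are not taken from the evaluation code.

```python
# Abercrombie spectrum, least to most distinctive (ordering assumed for this sketch).
SPECTRUM = ["Generic", "Descriptive", "Suggestive", "Arbitrary", "Fanciful"]
RANK = {label: i for i, label in enumerate(SPECTRUM)}

def ordinal_distance(predicted: str, gold: str) -> int:
    """Number of spectrum steps between the predicted and the true label."""
    return abs(RANK[predicted] - RANK[gold])

def mean_ordinal_distance(pairs) -> float:
    """Average ordinal distance over (predicted, gold) pairs."""
    pairs = list(pairs)
    return sum(ordinal_distance(p, g) for p, g in pairs) / len(pairs)

# Toy example: one exact hit, one near miss, one far miss -> (0 + 1 + 3) / 3.
example = [
    ("Fanciful", "Fanciful"),
    ("Descriptive", "Suggestive"),
    ("Generic", "Arbitrary"),
]
print(mean_ordinal_distance(example))
```

Under this definition an exact match contributes 0 and the worst possible miss (Generic vs. Fanciful) contributes 4.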

Output format

The model is trained to emit exactly six lines and nothing else:

```markdown
Q1: [Yes/No]
Q2: [Yes/No]
Q3: [Yes/No]
Q4: [Yes/No]
Q5: [Yes/No]
FINAL_CLASSIFICATION: [Generic/Descriptive/Suggestive/Arbitrary/Fanciful]
```
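
Because the format is strict, downstream code can validate it before trusting a prediction. The following is a minimal parsing sketch, assuming the model writes a bare `Yes` or `No` after each `Qn:` (no brackets) and one of the five labels after `FINAL_CLASSIFICATION:`; `parse_output` is a hypothetical helper, not part of the released code.

```python
import re

LABELS = ("Generic", "Descriptive", "Suggestive", "Arbitrary", "Fanciful")

# One pattern per expected line, in order: Q1..Q5, then the final label.
LINE_PATTERNS = [re.compile(rf"Q{i}: (Yes|No)") for i in range(1, 6)]
LINE_PATTERNS.append(re.compile(r"FINAL_CLASSIFICATION: (" + "|".join(LABELS) + ")"))

def parse_output(text: str):
    """Return ({"Q1": bool, ..., "Q5": bool}, final_label), or None on a format violation."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if len(lines) != 6:
        return None
    values = []
    for pattern, line in zip(LINE_PATTERNS, lines):
        match = pattern.fullmatch(line)
        if match is None:
            return None
        values.append(match.group(1))
    answers = {f"Q{i + 1}": values[i] == "Yes" for i in range(5)}
    return answers, values[5]

sample = "Q1: Yes\nQ2: No\nQ3: No\nQ4: No\nQ5: No\nFINAL_CLASSIFICATION: Fanciful"
print(parse_output(sample))
```

Returning `None` rather than raising keeps the helper easy to use in batch evaluation, where malformed outputs are usually counted rather than treated as fatal.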

Each Qn is a doctrinal sub-question. Q1 = coined term test, Q2 = semantic relationship, Q3 = imagination test, Q4 = immediate conveyance, Q5 = genus test. The routing rule (Q1=Yes -> Fanciful, else Q2=No -> Arbitrary, else Q5=Yes -> Generic, else Q4=Yes -> Descriptive, else Q3=Yes -> Suggestive) is baked into the system prompt.
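
The routing rule is a first-match-wins cascade, which can be written out directly. This sketch mirrors the rule as stated above; the function name `route` and the `None` fallback for the case where no branch applies are choices made for this example.

```python
from typing import Optional

def route(q1: bool, q2: bool, q3: bool, q4: bool, q5: bool) -> Optional[str]:
    """Apply the routing rule in order; the first matching branch wins."""
    if q1:
        return "Fanciful"      # coined term
    if not q2:
        return "Arbitrary"     # real word, no semantic link to the goods
    if q5:
        return "Generic"       # names the category itself
    if q4:
        return "Descriptive"   # immediately conveys a feature
    if q3:
        return "Suggestive"    # requires imagination
    return None                # the rule does not cover this combination

print(route(q1=False, q2=True, q3=True, q4=False, q5=False))  # Suggestive
```

Note that Q2 must be Yes for the cascade to reach Generic, Descriptive, or Suggestive, and that Q4 is checked before Q3.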

Usage

1. Install

```bash
pip install transformers accelerate peft torch
```

2. System prompt (required - do not modify)

```python
SYSTEM_PROMPT = """You are a trademark distinctiveness classifier. Given a mark and the goods or services it identifies, classify the mark on the Abercrombie spectrum: Generic, Descriptive, Suggestive, Arbitrary, or Fanciful.

Answer five questions about the mark, then provide a final classification. Evaluate each question in relation to the specific goods or services and the relevant purchasing public. Treat the mark as a whole; do not decompose compound marks into separate components.

Q1 - Coined Term Test. Is the mark an invented term created solely for trademark use, with no prior independent meaning?

Q2 - Semantic Relationship Test. Does the mark's ordinary dictionary meaning have any plausible semantic relationship to the goods or services?

Q3 - Imagination Test. Must the consumer use imagination, thought, or a multi-step mental process to connect the mark to the nature of the goods or services?

Q4 - Immediate Conveyance Test. Does the mark immediately convey an idea of a feature, quality, function, ingredient, or characteristic of the goods or services to the relevant purchasing public?

Q5 - Genus Test. Does the relevant purchasing public understand the mark primarily as the name of the general category of goods or services, rather than as an indicator of source?

When Q2=Yes and Q5=No, exactly one of Q3 or Q4 must be Yes: a semantically-related, non-generic mark is either descriptively immediate or suggestively imaginative, never neither.

Apply this routing rule to determine the final classification:
- If Q1 = Yes, classify as Fanciful
- Else if Q2 = No, classify as Arbitrary
- Else if Q5 = Yes, classify as Generic
- Else if Q4 = Yes, classify as Descriptive
- Else if Q3 = Yes, classify as Suggestive

Respond in exactly this format with no other text:
Q1: [Yes/No]
Q2: [Yes/No]
Q3: [Yes/No]
Q4: [Yes/No]
Q5: [Yes/No]
FINAL_CLASSIFICATION: [Generic/Descriptive/Suggestive/Arbitrary/Fanciful]"""
```

3. Load and run

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE  = "Qwen/Qwen3.5-4B"
LORA  = "DoodDood/abercrombie-grpo"
dtype = torch.bfloat16

# Load the base model first, then attach the LoRA adapter on top of it.
tok   = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=dtype, device_map="auto")
model = PeftModel.from_pretrained(model, LORA)
model.eval()

def classify(mark_and_goods: str) -> str:
    """Return the model's raw six-line answer for one mark/goods sentence."""
    msgs = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": mark_and_goods},
    ]
    # Build the prompt with thinking disabled, as the adapter requires.
    prompt = tok.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True,
        enable_thinking=False,
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Greedy decoding; 128 new tokens leaves ample room for six short lines.
        out = model.generate(
            **inputs, max_new_tokens=128, do_sample=False,
            pad_token_id=tok.eos_token_id,
        )
    # Decode only the newly generated tokens, skipping the prompt.
    return tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Input format: `The mark "X" for Y.` (matches LegalBench phrasing)
print(classify('The mark "Kodak" for cameras.'))
# Expected: Q1: Yes, Q2-Q5: No, FINAL_CLASSIFICATION: Fanciful

print(classify('The mark "Apple" for personal computers.'))
# Expected: Q1: No, Q2: No, ..., FINAL_CLASSIFICATION: Arbitrary

print(classify('The mark "Salt" for packages of sodium chloride.'))
# Expected: Q1: No, Q2: Yes, ..., Q5: Yes, FINAL_CLASSIFICATION: Generic
# (the routing rule only reaches Generic when Q2 is Yes; Q2: No would route to Arbitrary)
```

Important caveats

Don't modify the system prompt. The model was trained against this exact prompt, including the Q-numbering and routing rule, so changes are likely to degrade output.
Always use enable_thinking=False. The adapter was trained on non-thinking forward passes; thinking-mode inference produces unreliable outputs.
Use greedy decoding only (do_sample=False). Sampling adds noise to a strict-format task.
Phrase the input as: The mark "X" for Y. This matches the LegalBench surface form the model was trained on. Other phrasings may work but are not guaranteed.
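
When marks and goods arrive as separate fields, a small helper keeps the input in the recommended surface form. This is only a convenience sketch; `format_input` is a hypothetical name, and the whitespace and trailing-period handling are assumptions about what counts as clean input.

```python
def format_input(mark: str, goods: str) -> str:
    """Build the recommended input sentence: The mark "X" for Y."""
    mark = mark.strip()
    # Drop a trailing period so the sentence ends with exactly one.
    goods = goods.strip().rstrip(".")
    return f'The mark "{mark}" for {goods}.'

print(format_input("Kodak", "cameras"))
print(format_input(" Apple ", "personal computers."))
```

The resulting string is what would be passed as the user message.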

Method

Trained on Prime Intellect's hosted RL service with the Verifiers framework, using a custom synthetic dataset of 2,100 marks balanced across the 5 classes. A generator blacklist excludes every LegalBench test mark, so there is no train/test contamination.
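
A blacklist of this kind can be approximated with a normalized set lookup. The sketch below is an illustration only: the actual generator's matching logic is not described here, and `normalize` and `filter_blacklisted` are hypothetical helpers that assume case and punctuation differences should not let a test mark slip through.

```python
import unicodedata

def normalize(mark: str) -> str:
    """Case-fold and drop punctuation/whitespace so 'Coca-Cola' matches 'coca cola'."""
    folded = unicodedata.normalize("NFKC", mark).casefold()
    return "".join(ch for ch in folded if ch.isalnum())

def filter_blacklisted(candidates, test_marks):
    """Keep only candidate marks whose normalized form is not a test mark."""
    blocked = {normalize(m) for m in test_marks}
    return [c for c in candidates if normalize(c) not in blocked]

# Made-up marks for illustration; "Zyntrel" is not from any dataset.
kept = filter_blacklisted(["Kodak", "coca cola", "Zyntrel"], ["KODAK", "Coca-Cola"])
print(kept)
```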

Reward stack (5 functions, weights 1.0 / 0.3 / 0.2 / 0.15 / 0.3):

Ordinal accuracy on the final label - distance-based, dominant signal.
Decisive Q - the dispositive sub-element for the true label only.
Consistency bonus - gated on correct answer AND matching decisive Q.
Routing consistency - stated FINAL matches own self-routing.
Routed truth - own Q-chain decomposition lands on the true label.
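
To make the reward stack concrete, here is a rough sketch of how the two routing-based components and the weighted sum might look. It assumes the weights are listed in the same order as the functions, that each component scores in [0, 1], and that the total is a plain weighted sum; the real reward functions live in the public environment and may differ in all of these details.

```python
# Weights assumed to follow the order in which the functions are listed.
WEIGHTS = {
    "ordinal_accuracy": 1.0,
    "decisive_q": 0.3,
    "consistency_bonus": 0.2,
    "routing_consistency": 0.15,
    "routed_truth": 0.3,
}

def route(a):
    """Routing rule from the system prompt; `a` maps "Q1".."Q5" to booleans."""
    if a["Q1"]:
        return "Fanciful"
    if not a["Q2"]:
        return "Arbitrary"
    if a["Q5"]:
        return "Generic"
    if a["Q4"]:
        return "Descriptive"
    if a["Q3"]:
        return "Suggestive"
    return None

def routing_consistency(answers, stated_final):
    """1.0 when the stated FINAL label matches what the model's own answers route to."""
    return 1.0 if route(answers) == stated_final else 0.0

def routed_truth(answers, gold):
    """1.0 when the model's own answers route to the true label."""
    return 1.0 if route(answers) == gold else 0.0

def total_reward(components):
    """Weighted sum over component scores; missing components count as 0."""
    return sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())

# The answers route to Descriptive, but the stated final label says Suggestive.
answers = {"Q1": False, "Q2": True, "Q3": False, "Q4": True, "Q5": False}
scores = {
    "routing_consistency": routing_consistency(answers, "Suggestive"),
    "routed_truth": routed_truth(answers, "Descriptive"),
}
print(scores, total_reward(scores))
```

In this toy case the Q-chain is right but the stated label contradicts it, so only the routed-truth term pays out.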

300 steps, batch 128, 16 rollouts/example, LoRA r=16. Total compute: ~$12.
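
The 16 rollouts per example matter because GRPO-style training scores each rollout relative to the others sampled for the same prompt. The sketch below shows the commonly described group normalization (subtract the group mean, divide by the group standard deviation); it is an assumption that this run uses exactly this variant, and the `eps` term and the use of the population standard deviation are choices made for this example.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of 4 rollouts (the run described above uses 16 per example).
print(group_advantages([1.0, 1.0, 0.0, 0.0]))
print(group_advantages([0.5, 0.5, 0.5, 0.5]))
```

A group in which every rollout earns the same reward yields zero advantages and so contributes no learning signal under this scheme.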

The full environment, reward functions, and synthetic training data are public at the Prime Intellect env page.
