License: apache-2.0

The Problem

Financial institutions process thousands of customer complaints daily across mobile apps, websites, contact centres, email, and regulatory portals. These complaints arrive as free-form text — often incomplete, ambiguous, or written by customers who do not know which banking product or issue category applies to their situation.

The result is predictable: complaints get routed to the wrong team, require manual review and reassignment, and take longer to resolve than they should. Traditional classification models handle one label at a time and struggle with the nuanced language of consumer finance. A rule-based keyword system breaks down the moment a customer phrases something slightly differently.

This model addresses that by treating complaint categorisation as a structured generation task — the model reads the complaint narrative and produces all four required ticket fields in a single inference step.


What the Model Does

Given a customer complaint narrative, the model outputs a structured JSON object containing:

json

{
  "product": "Checking or savings account",
  "sub_product": "Checking account",
  "issue": "Unauthorized transactions or other transaction problem",
  "sub_issue": "Debit card issue"
}

These four fields map to the CFPB Consumer Complaint taxonomy and can be consumed directly by complaint management systems, business rule engines, and routing workflows, with no manual classification required.

Example

Input complaint:

"I reported fraudulent transactions on my debit card and the bank reversed my provisional credit without explaining the investigation outcome. I have been trying to reach someone for three weeks and keep getting transferred."

Model output:

json

{
  "product": "Checking or savings account",
  "sub_product": "Checking account",
  "issue": "Unauthorized transactions or other transaction problem",
  "sub_issue": "Debit card issue"
}

Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Fine-tuning method | LoRA (Low-Rank Adaptation) via PEFT |
| Training hardware | AMD Instinct MI300X (192 GB VRAM) |
| Training backend | ROCm 7.2.4 / HIP |
| Model precision | bfloat16 |
| Task type | Structured JSON generation (causal LM) |
| Output format | JSON with 4 fields: product, sub_product, issue, sub_issue |

Training Configuration

LoRA Adapter

| Parameter | Value |
| --- | --- |
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Trainable parameters | ~1% of total model parameters |

The base model weights are fully frozen. Only the LoRA adapter matrices are updated during training, making this efficient both in compute and storage — the saved adapter is significantly smaller than the full model.
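
To make the storage argument concrete, the following is a minimal sketch of how a rank-16 adapter compares in size with the frozen weight matrix it sits beside. The width `D_MODEL = 4096` is a hypothetical round number chosen for illustration, not a value read from the Qwen2.5-7B configuration, and the ratio shown is for one targeted projection only, not the whole-model fraction quoted in the table.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (frozen weight parameters, LoRA adapter parameters) for one linear layer."""
    full = d_in * d_out                 # the frozen W matrix
    adapter = rank * (d_in + d_out)     # A is (rank x d_in), B is (d_out x rank)
    return full, adapter


RANK, ALPHA = 16, 32   # values from the LoRA Adapter table
D_MODEL = 4096         # hypothetical layer width, for illustration only

full, adapter = lora_param_counts(D_MODEL, D_MODEL, RANK)
scaling = ALPHA / RANK  # standard LoRA scales the low-rank update by alpha / r
ratio = adapter / full

print(f"frozen: {full:,}  adapter: {adapter:,}  ratio: {ratio:.2%}  scaling: {scaling}")
```

Under these assumed dimensions the adapter holds less than 1% of the parameters of the matrix it adapts, which is why only the adapter needs to be saved and shipped.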

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Epochs | 5 (with early stopping, patience=3) |
| Batch size per device | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Learning rate | 1e-4 |
| Optimiser | AdamW (PyTorch native) |
| LR scheduler | Linear |
| Precision | bf16 |
| Max sequence length | 1024 tokens |

Early stopping was applied with a patience of 3 evaluation checkpoints. Prior experiments on smaller model variants showed validation loss plateauing around epoch 2–3, so early stopping prevents wasted compute without sacrificing quality.
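
The patience rule and the effective batch size can be sketched in a few lines. This is an illustration of the logic only, not the training script used for this model. The validation losses are made-up numbers, and `NUM_DEVICES = 1` is an assumption based on the single GPU described in this card.

```python
from typing import Optional, Sequence


def early_stop_index(val_losses: Sequence[float], patience: int = 3) -> Optional[int]:
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    stale = 0  # consecutive evaluations without improvement
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


# Made-up validation losses: improvement stalls after the third evaluation.
losses = [1.20, 0.90, 0.85, 0.86, 0.87, 0.88, 0.80]
stop_at = early_stop_index(losses, patience=3)

PER_DEVICE_BATCH, GRAD_ACCUM, NUM_DEVICES = 8, 4, 1  # NUM_DEVICES is an assumption
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES

print(stop_at, effective_batch)
```

With a Hugging Face `Trainer`, the same behaviour is usually obtained through `EarlyStoppingCallback(early_stopping_patience=3)`; whether that callback was used here is not stated in this card.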

Dataset

Source: CFPB Consumer Complaint Database (formatted as multi-turn chat JSONL)

Splits used:

| Split | Size |
| --- | --- |
| Train | Full dataset (no cap) |
| Validation | 500 |
| Test | 500 |

Sampling strategy: Training data was sampled using proportional stratification by product × issue combination. This ensures that long-tail complaint categories — which would appear only once or twice in a random 500-sample draw — receive proportional representation. Without this, the model sees most issue labels fewer than 3 times, which is insufficient for reliable generation.
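
A minimal sketch of proportional stratified sampling by product × issue follows. It is an assumption about how such a sampler could be written, not the author's code. The records are toy data, and the largest-remainder rounding is one of several reasonable ways to turn fractional quotas into whole counts.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, n, seed=0):
    """Draw n records with each (product, issue) stratum represented proportionally."""
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["product"], rec["issue"])].append(rec)

    total = len(records)
    exact = {key: n * len(items) / total for key, items in strata.items()}
    quota = {key: int(share) for key, share in exact.items()}

    # Hand out the remaining slots to the strata with the largest fractional parts.
    leftover = n - sum(quota.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for key in by_remainder[:leftover]:
        quota[key] += 1

    rng = random.Random(seed)
    sample = []
    for key, items in strata.items():
        sample.extend(rng.sample(items, min(quota[key], len(items))))
    return sample


# Toy records; the product and issue strings are placeholders.
records = (
    [{"product": "Credit card", "issue": "Fees"}] * 62
    + [{"product": "Credit card", "issue": "Billing"}] * 28
    + [{"product": "Mortgage", "issue": "Payments"}] * 10
)
sample = stratified_sample(records, n=10)
counts = Counter((r["product"], r["issue"]) for r in sample)
print(counts)
```

Strictly proportional quotas can still round to zero for strata much smaller than `total / n`. A floor of one example per stratum is a common variation when every label must appear.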

Chat template: Qwen's built-in apply_chat_template was used to format each example into a single training string with <|im_start|> / <|im_end|> special tokens. The assistant turn (the JSON output) was included in full — no generation prompt was added at training time.
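
For orientation, the layout of a training string can be approximated as below. This is a hand-written approximation of the ChatML-style format, not the tokenizer's actual template: in practice `tokenizer.apply_chat_template` produces the string and may differ in details such as default system prompts. The helper name `to_chatml` is hypothetical.

```python
def to_chatml(messages):
    """Approximate a ChatML-style training string from a list of role/content dicts."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


example = [
    {"role": "system", "content": "You are a banking complaint classification assistant."},
    {"role": "user", "content": "My debit card was charged twice."},
    # At training time the assistant turn (the target JSON) is part of the string.
    {"role": "assistant", "content": '{"product": "Checking or savings account"}'},
]
training_text = to_chatml(example)
print(training_text)
```

At inference time the assistant turn is omitted and a generation prompt (an open assistant header) is appended instead, so the model continues from that point.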


Inference with Constrained Decoding

At inference time, the pipeline runs in two passes, free generation followed by post-hoc label snapping (referred to here as constrained decoding):

  1. Pass 1 — Standard greedy decoding generates the JSON output.
  2. Pass 2 — Each field value is snapped to the nearest canonical CFPB label using TF-IDF cosine similarity (unigram + bigram features).

This matters because the CFPB taxonomy contains 80+ canonical issue strings with very similar phrasing. A model that generates "unauthorized transaction" when the canonical label is "unauthorized transactions or other transaction problem" would score zero on exact match — but is semantically correct. The constrained decoder corrects these surface-level mismatches without changing the underlying prediction.

Jaccard similarity was evaluated as an alternative snapping strategy but proved insufficient for near-duplicate labels (e.g., "problem with fees" vs "other fee") where single-word differences produce high Jaccard overlap. TF-IDF on bigrams separates these reliably.
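
Pass 2 can be sketched with the standard library alone. This is an illustrative reimplementation under stated assumptions, not the decoder shipped with this model. `LabelSnapper` is a hypothetical name, the label list is a tiny subset built from strings that appear in this card, and the smoothed IDF formula is one common choice. With scikit-learn, `TfidfVectorizer(ngram_range=(1, 2))` plus cosine similarity would be the more usual route.

```python
import math
import re
from collections import Counter


def ngrams(text):
    """Lowercased unigrams plus bigrams of a string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


class LabelSnapper:
    """Snap free text to the closest canonical label by TF-IDF cosine similarity."""

    def __init__(self, labels):
        self.labels = list(labels)
        term_counts = [Counter(ngrams(label)) for label in self.labels]
        n = len(self.labels)
        doc_freq = Counter(term for counts in term_counts for term in counts)
        # Smoothed inverse document frequency, always positive.
        self.idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
        self.vectors = [self._weigh(counts) for counts in term_counts]

    def _weigh(self, counts):
        """L2-normalised TF-IDF vector restricted to the label vocabulary."""
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def snap(self, value):
        query = self._weigh(Counter(ngrams(value)))
        scores = [sum(w * vec.get(t, 0.0) for t, w in query.items()) for vec in self.vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        # Leave the value untouched when nothing overlaps with any label.
        return self.labels[best] if scores[best] > 0 else value


# Illustrative subset only; the real taxonomy has 80+ issue strings.
ISSUE_LABELS = [
    "Unauthorized transactions or other transaction problem",
    "Problem with fees",
    "Other fee",
    "Debit card issue",
]
snapper = LabelSnapper(ISSUE_LABELS)
print(snapper.snap("unauthorized transaction"))
```

In a full pipeline, one snapper per field (product, sub_product, issue, sub_issue) would be fitted on that field's canonical labels and applied to the parsed JSON values.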


Evaluation Results

Evaluated on 500 held-out test examples from the CFPB dataset. Baseline is the unmodified Qwen2.5-7B-Instruct base model with no fine-tuning.

Primary Metrics — Structured JSON Extraction

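| Metric | Baseline | Fine-tuned | Δ |
| --- | --- | --- | --- |
| Exact JSON Match | 0.0000 | 0.2280 | +0.2280 |
| Avg Field Accuracy | 0.0030 | 0.5925 | +0.5895 |
| Micro F1 | 0.0030 | 0.5925 | +0.5895 |
| Macro F1 | 0.0008 | 0.2395 | +0.2387 |
| Weighted F1 | 0.0059 | 0.5814 | +0.5755 |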

Per-field accuracy:

| Field | Baseline | Fine-tuned | Δ |
| --- | --- | --- | --- |
| product | 0.010 | 0.910 | +0.900 |
| sub_product | 0.002 | 0.628 | +0.626 |
| issue | 0.000 | 0.336 | +0.336 |
| sub_issue | 0.000 | 0.496 | +0.496 |

product accuracy of 91% is expected — the CFPB product taxonomy has around a dozen top-level categories and the model learns them well. issue at 33.6% reflects the genuine difficulty of the field: 80+ canonical strings with overlapping phrasing, many appearing infrequently even in the full training set.
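
The relationship between the two tables can be checked directly: the reported Avg Field Accuracy is the unweighted mean of the four per-field accuracies. The sketch below also shows one plausible way to compute exact match and per-field accuracy from parsed predictions. The function names and toy records are hypothetical, and the evaluation script used for this card may differ, for example by comparing raw JSON strings.

```python
FIELDS = ("product", "sub_product", "issue", "sub_issue")


def field_accuracy(preds, refs):
    """Fraction of examples where each field matches the reference exactly."""
    return {
        f: sum(p.get(f) == r[f] for p, r in zip(preds, refs)) / len(refs)
        for f in FIELDS
    }


def exact_match(preds, refs):
    """Fraction of examples where all four fields match."""
    hits = sum(all(p.get(f) == r[f] for f in FIELDS) for p, r in zip(preds, refs))
    return hits / len(refs)


# Fine-tuned per-field accuracies as reported in the table above.
reported = {"product": 0.910, "sub_product": 0.628, "issue": 0.336, "sub_issue": 0.496}
avg_reported = round(sum(reported.values()) / len(reported), 4)

# Toy predictions: the second one gets the issue field wrong.
refs = [
    {"product": "A", "sub_product": "a1", "issue": "i1", "sub_issue": "s1"},
    {"product": "B", "sub_product": "b1", "issue": "i2", "sub_issue": "s2"},
]
preds = [
    {"product": "A", "sub_product": "a1", "issue": "i1", "sub_issue": "s1"},
    {"product": "B", "sub_product": "b1", "issue": "i9", "sub_issue": "s2"},
]
toy_fields = field_accuracy(preds, refs)
toy_exact = exact_match(preds, refs)
print(avg_reported, toy_fields, toy_exact)
```

Because each field carries exactly one label per example, micro-averaged F1 coincides with accuracy, which is consistent with the Micro F1 and Avg Field Accuracy rows showing the same value.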

Secondary Metrics — Generative Quality

These metrics measure output fluency and n-gram overlap. They are secondary to the structured metrics above, but confirm the model is generating coherent, well-formed text.

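| Metric | Baseline | Fine-tuned | Δ |
| --- | --- | --- | --- |
| ROUGE-1 | 0.4592 | 0.7035 | +0.2443 |
| ROUGE-2 | 0.2523 | 0.6049 | +0.3526 |
| ROUGE-L | 0.4258 | 0.6915 | +0.2657 |
| BLEU | 0.0003 | 0.1905 | +0.1902 |
| SacreBLEU | 19.77 | 65.07 | +45.30 |
| METEOR | 0.1133 | 0.6309 | +0.5176 |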

The SacreBLEU jump from 19.77 to 65.07 and METEOR from 0.11 to 0.63 indicate the fine-tuned model is generating outputs that are not just structurally similar to references, but lexically aligned — which for this task means using the correct canonical CFPB terminology consistently.


How to Use

Load the adapter

python

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model_name = "Qwen/Qwen2.5-7B-Instruct"
adapter_path = "your-hf-username/qwen2.5-7b-cfpb-complaint-categorisation"

tokenizer = AutoTokenizer.from_pretrained(adapter_path)

# Load the frozen base model in bfloat16, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_path)
model.eval()

Run inference

python

def categorise_complaint(complaint_text: str, model, tokenizer) -> str:
    """Return the model's raw JSON string for one complaint narrative."""
    messages = [
        {
            "role": "system",
            "content": (
                "You are a banking complaint classification assistant. "
                "Given a consumer complaint narrative, extract the CFPB ticket fields "
                "as a JSON object with keys: product, sub_product, issue, sub_issue."
            ),
        },
        {
            "role": "user",
            "content": complaint_text,
        },
    ]
    # Render the chat and append the assistant header so the model starts its reply.
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,  # greedy decoding
            pad_token_id=tokenizer.eos_token_id,
        )
    # Drop the prompt tokens and decode only the newly generated text.
    prompt_len = inputs["input_ids"].shape[1]
    generated = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
    return generated


complaint = """
I reported fraudulent transactions on my debit card and the bank reversed
my provisional credit without explaining the investigation outcome.
"""
result = categorise_complaint(complaint, model, tokenizer)
print(result)
# {"product": "Checking or savings account", "sub_product": "Checking account",
# "issue": "Unauthorized transactions or other transaction problem",
# "sub_issue": "Debit card issue"}
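
Since `categorise_complaint` returns the decoded text rather than a parsed object, downstream code will typically want to validate it before routing. The helper below is a minimal sketch of that step. `parse_ticket` and `REQUIRED_FIELDS` are hypothetical names, and returning `None` on malformed output is one possible policy among several.

```python
import json

REQUIRED_FIELDS = ("product", "sub_product", "issue", "sub_issue")


def parse_ticket(raw: str):
    """Parse model output into a dict with the four ticket fields, or None if invalid."""
    # Tolerate stray text around the JSON object by slicing to the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(not isinstance(data.get(key), str) for key in REQUIRED_FIELDS):
        return None
    return {key: data[key] for key in REQUIRED_FIELDS}


raw_output = (
    '{"product": "Checking or savings account", "sub_product": "Checking account", '
    '"issue": "Unauthorized transactions or other transaction problem", '
    '"sub_issue": "Debit card issue"}'
)
ticket = parse_ticket(raw_output)
print(ticket["product"] if ticket else "needs manual review")
```

Outputs that fail validation can be sent to manual review rather than routed automatically.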

Dependencies

markdown

transformers==4.44.0
peft==0.12.0
accelerate==0.34.0
datasets==2.21.0
torch (ROCm-compatible build for AMD, or standard CUDA build)
scikit-learn
rouge-score
sacrebleu
nltk

Limitations

  • CFPB taxonomy only. The model is trained on and constrained to CFPB Consumer Complaint Database labels. It is not a general-purpose complaint classifier and should not be used with complaint taxonomies from other regulatory bodies or internal systems without retraining.
  • Issue field accuracy. The issue field (33.6% accuracy) is the weakest link. The CFPB issue taxonomy contains 80+ canonical strings with overlapping phrasing. Expanding training data and further tuning the constrained decoder are the most direct paths to improvement.
  • English language only. All training data is in English. Performance on non-English complaints is untested and likely poor.
  • Context length. Complaints longer than 1024 tokens will be truncated. Most CFPB complaints are well within this limit, but very long narratives may lose relevant context.

Intended Use

This model is intended for use by:

  • Banking operations teams automating first-touch complaint categorisation
  • Compliance teams processing regulatory complaint filings
  • Contact centre platforms routing incoming complaints before agent assignment
  • Research teams studying LLM adaptation for financial NLP tasks

It is not intended for consumer-facing deployment without human review of outputs, or for use in jurisdictions where automated complaint classification decisions have legal or regulatory implications without appropriate oversight.


Training Infrastructure

Trained on an AMD Instinct MI300X GPU (192 GB HBM3 VRAM) running ROCm 7.2.4. The training stack is fully ROCm-native; bitsandbytes (CUDA-only) is not used. Model precision is bfloat16, a compute type natively supported by the CDNA3 architecture.


Citation

If you use this model in research or production, please cite the CFPB Consumer Complaint Database as the data source:

markdown

Consumer Financial Protection Bureau (CFPB)
Consumer Complaint Database
https://www.consumerfinance.gov/data-research/consumer-complaints/

Model provider

aryachakraborty

