deepakdsoni/antahkarana-7B API & Inference Endpoint

📦 Model family

| Model | Description |
|---|---|
| antahkarana-v1 | Original architecture and v1 vision models; the most stable continual learner tested (the only one with positive backward transfer) |
| antahkarana-v2 | Accuracy-recovering v2 (36.5M parameters); matches SOTA accuracy with ~3× less forgetting |
| antahkarana-7B | The architecture scaled to a 7B language model |

At a glance — what makes this different

Standard fine-tuning suffers from catastrophic forgetting: teach a model a new task and it loses the old one. Antahkarana-7B is trained with a small set of cognitive "faculties," each derived from a Vedic concept and implemented as a concrete mechanism:

| Faculty (Vedic) | Mechanism (ML) | What it does |
|---|---|---|
| saṃskāra | Fisher-importance consolidation + decay over LoRA | protects what mattered for old domains → don't forget |
| vijñāna-smṛti | dark-knowledge / exemplar replay | rehearses past domains while learning new ones |
| pramāṇa | calibrated-confidence gate | abstains ("I'm not sure") instead of hallucinating |
| manas / buddhi | two decorrelated views, cross-teaching | safe self-learning from unlabeled data (research track) |
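
As a rough illustration of the saṃskāra row, the sketch below assumes an EWC-style formulation: a per-parameter Fisher importance Ω is estimated from squared gradients after each domain, older importance is decayed, and a quadratic penalty pulls important parameters back toward their consolidated values. The function names, the `decay` and `lam` values, and the toy numbers are hypothetical; the card does not publish the exact formula.

```python
# Hypothetical sketch of Fisher-importance consolidation with decay.
# Plain Python lists stand in for the LoRA parameter tensors.

def estimate_fisher(per_example_grads):
    """Diagonal Fisher estimate: mean of squared per-example gradients."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]

def update_importance(omega, fisher, decay=0.9):
    """Decay old importance, then add the importance from the newest domain."""
    return [decay * o + f for o, f in zip(omega, fisher)]

def consolidation_penalty(params, anchor, omega, lam=1.0):
    """Quadratic penalty: moving high-importance parameters costs more."""
    return 0.5 * lam * sum(o * (p - a) ** 2 for p, a, o in zip(params, anchor, omega))

# Toy example with two parameters; only the first mattered for the old domain.
grads = [[2.0, 0.0], [2.0, 0.0]]
omega = update_importance([0.0, 0.0], estimate_fisher(grads))
anchor = [1.0, 1.0]

print(consolidation_penalty([1.5, 1.0], anchor, omega))  # moves the important parameter
print(consolidation_penalty([1.0, 1.5], anchor, omega))  # moves the unimportant one
```

In training, a term like this would be added to the task loss on the new domain, so updates are steered toward parameters that earlier domains did not rely on.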

How it works

The borrowed mind (Mistral-7B) stays frozen as the stable core (śruti). A small trainable instrument (chitta = LoRA, ~0.2% of parameters) learns new domains, guided by the faculties, and the pramāṇa gate decides whether to answer or abstain:

[Figure: Antahkarana-7B architecture]

Measured outcome (continual instruction-tuning, 4 domains, 3 seeds): ~3.8× less forgetting than naive LoRA, with higher and far more stable accuracy.

[Figure: Antahkarana-7B vs naive LoRA]

The journey: from a 2,500-year-old architecture to a 7B model

This model is the production endpoint of a multi-stage research-to-engineering program.

1. The architecture. The Vedic tradition describes the mind as an antaḥkaraṇa — an "inner instrument" of distinct faculties (chitta/memory, manas/perception, buddhi/discernment, ahaṃkāra/identity, plus pramāṇa/valid knowledge and the guṇa dynamics). Each faculty was mapped to a concrete, testable ML mechanism.

2. Research validation (vision, 36–52M params). The mechanisms were first proven on continual-learning image benchmarks (Split-CIFAR-100, Split-Tiny-ImageNet) against the field's standard methods (EWC, ER, DER++): the architecture was the most stable learner tested and the only one with positive backward transfer, with a clean ablation showing each Vedic-derived component adds value.

3. Scaling on a frozen modern backbone (E1–E2). On a frozen ViT-B/16, the consolidation works in adapter space, matching the SOTA (DER++) on accuracy while forgetting less, and extends to the harder class-incremental setting with label-free novelty detection (avidyā).

4. Self-learning and memory (E-S, śruti/smṛti/nidrā). The model learns from unlabeled data via decorrelated co-training and reaches near-supervised accuracy from ~2% labels; a complementary study showed an external "smṛti" memory + periodic "sleep" consolidation retains knowledge ~2.4× better than holding it in weights.

5. The 7B model (E7). The architecture was ported to language: frozen Mistral-7B + LoRA + saṃskāra + vijñāna-smṛti + pramāṇa, continually instruction-tuned across four domains with checkpointing, then merged into the standalone 7B model published here.
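
The continual loop in stage 5 can be pictured roughly as follows. This is a sketch under assumptions, not the released training script: `ReplayBuffer`, `train_step`, the buffer capacity, and the replay batch size are all hypothetical, and reservoir sampling is one common exemplar-selection policy that the card does not confirm.

```python
import random

class ReplayBuffer:
    """Fixed-size exemplar store filled by reservoir sampling (assumed policy)."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Each example seen so far ends up stored with equal probability.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

def train_step(batch):
    """Placeholder for one optimizer step (task loss plus consolidation penalty)."""
    return len(batch)

def continual_train(domains, buffer, replay_k=2):
    """Learn domains in order, mixing replayed exemplars into every batch."""
    steps = 0
    for name, batches in domains:
        for batch in batches:
            mixed = list(batch) + buffer.sample(replay_k)
            train_step(mixed)
            steps += 1
            for example in batch:
                buffer.add(example)
        # A per-domain checkpoint for `name` would be saved here so training can resume.
    return steps

# Toy stream with two domains; strings stand in for tokenized examples.
domains = [
    ("ag_news", [["a1", "a2"], ["a3", "a4"]]),
    ("dbpedia", [["d1", "d2"]]),
]
buf = ReplayBuffer(capacity=3)
total_steps = continual_train(domains, buf)
print(total_steps, len(buf.items))
```

The buffer size stays fixed no matter how many domains arrive, which is what keeps rehearsal cheap as the stream grows.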

Results

Continual instruction-tuning — naive LoRA vs Antaḥkaraṇa-LoRA (3-seed mean ± std)

Four text-classification domains learned in sequence (AG News → DBpedia → Emotion → SST-2), each with its own label space, so forgetting is meaningful.

| Metric | naive LoRA | Antaḥkaraṇa (this model) |
|---|---|---|
| Final accuracy ↑ | 0.849 ± 0.029 | 0.882 ± 0.003 |
| Forgetting ↓ | 0.053 ± 0.032 | 0.014 ± 0.009 (~3.8× less) |
| Confidence on known domains | 0.841 | 0.954 |
| Known − unknown confidence gap ↑ | 0.467 | 0.494 |
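
For readers unfamiliar with the two headline metrics, the sketch below assumes the definitions commonly used in continual learning: final accuracy is the mean accuracy over all domains after the last one is learned, and forgetting is the mean drop from each earlier domain's best accuracy to its final accuracy. The card does not spell out its exact formulas, and the accuracy matrix here is made-up illustrative data, not the model's results.

```python
def final_accuracy(acc):
    """Mean accuracy over all domains after training on the last domain."""
    last = acc[-1]
    return sum(last) / len(last)

def forgetting(acc):
    """Mean drop from each earlier domain's best accuracy to its final accuracy."""
    n = len(acc)
    drops = []
    for j in range(n - 1):  # the last domain has had no chance to be forgotten
        best = max(acc[i][j] for i in range(j, n - 1))  # best score before the final stage
        drops.append(best - acc[-1][j])
    return sum(drops) / len(drops)

# acc[i][j] = accuracy on domain j after training on domain i (illustrative numbers only)
acc = [
    [0.90, 0.00, 0.00],
    [0.85, 0.95, 0.00],
    [0.80, 0.90, 0.92],
]
print(round(final_accuracy(acc), 3), round(forgetting(acc), 3))
```

Applied to the table, the ratio of mean forgetting is 0.053 / 0.014 ≈ 3.8, which is where the "~3.8× less" figure comes from.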

Live deployment test (this merged model)

General language preserved — correct world-knowledge answers (e.g. capital of Japan → Tokyo; a fluent one-sentence definition of photosynthesis).
Continual retention: 8/8 correct across all four domains, including the first one learned — no catastrophic forgetting, demonstrated live.
pramāṇa abstention — on a factually neutral input (no sentiment to extract), confidence drops to 0.53 and the model abstains rather than guessing; on clear inputs it stays 0.97–0.99 and answers.
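
A confidence gate of this kind can be sketched as a threshold on the probability of the top label. The sketch assumes that confidence is the maximum softmax probability over the candidate labels and uses a hypothetical threshold of 0.7; the card reports only the observed confidences (0.53 on the neutral input, 0.97 to 0.99 on clear ones), not the gate's actual rule or threshold.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of label logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_answer(labels, logits, threshold=0.7):
    """Return (label, confidence), or ("abstain", confidence) below the threshold."""
    probs = softmax(logits)
    conf = max(probs)
    label = labels[probs.index(conf)]
    return (label if conf >= threshold else "abstain", conf)

labels = ["negative", "positive"]
print(gated_answer(labels, [-2.0, 2.0]))  # well-separated logits: answers
print(gated_answer(labels, [0.0, 0.12]))  # near-tie between labels: abstains
```

A fixed threshold is the simplest possible gate; the calibration work listed in the roadmap (conformal or ensemble methods) would replace this rule rather than tune it.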

Why this is an innovation in today's AI

Most of modern AI is static: a model is trained once, frozen, and shipped. Teaching it something new means expensive retraining, and naive fine-tuning overwrites old knowledge (catastrophic forgetting). The field's strongest continual-learning methods buy stability only by trading away accuracy, or vice versa.

Antaḥkaraṇa breaks that trade-off. Across a rigorous benchmark vs the standard methods (EWC, LwF, ER, DER++), it is the only method that lands in the "ideal corner" — high accuracy and very low forgetting — matching the SOTA's accuracy while forgetting ~3× less:

[Figure: Accuracy vs forgetting frontier]

That combination is what makes a model genuinely lifelong: it can keep learning in deployment without expensive retraining and without losing what it already knew — while the pramāṇa gate lets it say "I don't know" instead of hallucinating. A static, occasionally-confident model becomes a living, honest one. That is the shift the architecture is reaching for.

Potential — and where it needs to adapt

What this architecture could unlock:

Lifelong enterprise models — absorb new products, policies, and data continuously, without retraining the base or forgetting prior knowledge.
Trustworthy / high-stakes AI — calibrated abstention (pramāṇa) for medical, legal, and financial settings where "I'm not sure" is safer than a confident guess.
Label-efficient & self-learning — learns from unlabeled data (co-training), reaching near-supervised accuracy from as little as ~2% labels — cutting annotation cost dramatically.
Personal / on-device AI — a tiny adapter (~160 MB) + external memory personalizes a frozen base to a user, privacy-preserving, with no full retraining.
Agentic memory — the śruti (stable core) / smṛti (external memory) / nidrā (sleep-consolidation) design gives agents that accumulate experience over time.

Where it still needs to adapt (honest roadmap):

Beyond classification — the LLM evaluation here is classification framed as generation; it needs extension to open-ended instruction-following and longer, more realistic domain streams.
Sharper pramāṇa — the abstention gate works but is over-confident on adversarial nonsense; it needs stronger calibration (e.g. conformal / ensemble methods) at scale.
Scale & breadth — validated on 4 domains and 7B; next is longer continual streams, established continual-LLM benchmarks, and larger models (13B–70B).
Self-learning + memory at LLM scale — co-training and the smṛti/nidrā memory are proven in vision and small setups; integrating them into the LLM continual loop is the next build.
Conditional compute — a guṇa-driven mixture-of-experts / early-exit layer (efficiency) is designed but not yet implemented.

Usage

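```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged checkpoint in bfloat16. Recent Transformers releases accept
# `dtype`; older ones expect `torch_dtype` instead.
tok = AutoTokenizer.from_pretrained("deepakdsoni/antahkarana-7B")
model = AutoModelForCausalLM.from_pretrained(
    "deepakdsoni/antahkarana-7B", dtype=torch.bfloat16, device_map="auto")

# Classification is framed as generation: the label is produced after "Answer:".
prompt = ("Classify the sentiment of this movie review (negative, positive).\n"
          "Text: a heartfelt, beautifully acted triumph.\nAnswer:")
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=4, pad_token_id=tok.eos_token_id)
# The decoded string contains the prompt followed by the generated label.
print(tok.decode(out[0], skip_special_tokens=True))
```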

Requires a GPU with ~15 GB of memory for bf16 inference; 4-bit quantization (bitsandbytes) runs in ~5 GB.
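
The two memory figures follow from simple arithmetic on the weight count, and the 4-bit path is typically reached through `BitsAndBytesConfig` in Transformers. The sketch below assumes roughly 7.24B parameters (the Mistral-7B-v0.1 size) and an NF4 configuration. The helper names are hypothetical, and the NF4 settings are a common choice rather than something the card specifies. The estimate covers weights only, so real usage is higher once activations, the KV cache, and quantization metadata are included.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def load_4bit(model_id="deepakdsoni/antahkarana-7B"):
    """Load the model in 4-bit NF4; needs a CUDA GPU and the bitsandbytes package."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # assumed; the card only says "4-bit"
        bnb_4bit_compute_dtype=torch.bfloat16,  # matrix multiplies still run in bf16
    )
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant, device_map="auto")
    return tok, model

N_PARAMS = 7.24e9  # approximate parameter count of Mistral-7B-v0.1 (assumption)
bf16_gb = round(weight_memory_gb(N_PARAMS, 16), 1)
int4_gb = round(weight_memory_gb(N_PARAMS, 4), 1)
print(bf16_gb, int4_gb)
```

Calling `tok, model = load_4bit()` then gives objects that can be used exactly like the bf16 ones in the usage snippet.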

Training details


| Setting | Value |
|---|---|
| Base model | `mistralai/Mistral-7B-v0.1` (frozen) |
| Adapter | LoRA (r=16, α=32) on `q/k/v/o_proj`; ~13.6M trainable parameters (0.19%) |
| Method | saṃskāra (Fisher Ω + decay) on LoRA · vijñāna-smṛti exemplar replay · pramāṇa confidence gate |
| Curriculum | 4 classification domains in sequence, per-task checkpointing (resumable) |
| Merge | LoRA folded into base via `merge_and_unload` → standalone full-weights 7B |
| Precision | bfloat16 |
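
The trainable-parameter figure can be checked by hand. The sketch assumes the published Mistral-7B-v0.1 dimensions (hidden size 4096, 32 layers, 8 key/value heads of dimension 128) and a total of about 7.24B parameters; these come from the base model's configuration rather than from this card.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) to each adapted linear layer."""
    return r * (d_in + d_out)

# Assumed Mistral-7B-v0.1 shapes; RANK is the r=16 stated in the card.
HIDDEN, KV_DIM, LAYERS, RANK = 4096, 8 * 128, 32, 16

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj (grouped-query attention, smaller output)
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
trainable = per_layer * LAYERS
fraction = trainable / 7.24e9  # approximate base parameter count (assumption)

print(trainable, f"{fraction:.2%}")
```

Under these assumptions the count lands on roughly 13.6M and 0.19%, matching the Adapter row. In standard LoRA, α=32 only scales the update by α/r = 2 and adds no parameters.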

To continue lifelong learning (adding new domains with saṃskāra protection), use the LoRA adapter and resume workflow rather than this merged checkpoint, because merging flattens the LoRA structure.

Limitations & honest notes

Continual evaluation is on classification framed as generation (clean, measurable), not open-ended instruction following — a natural next extension.
The pramāṇa gate is not perfect: it abstains well on genuinely under-determined input but can still be over-confident on adversarial nonsense; the robust evidence is the calibration AUROC and the in-distribution-vs-unfamiliar confidence gap across many examples.
The model inherits the capabilities, biases, and knowledge cutoff of Mistral-7B-v0.1.

License & attribution

Released under the Apache License 2.0. This is a derivative work of Mistral-7B-v0.1 (© Mistral AI, Apache-2.0) — see the NOTICE file. The base 7B weights were used as a frozen foundation and were not trained from scratch. The Antaḥkaraṇa architecture, continual training, and merging are the contribution of the author.

Built on the Upaniṣads, Sāṃkhya, Yoga, Nyāya, and modern ML (PyTorch · Transformers · PEFT).

Citation

```bibtex
@misc{antahkarana7b2026,
  title  = {Antahkarana-7B: Lifelong Learning with a Vedic-Derived Cognitive Architecture},
  author = {Deepak Soni},
  year   = {2026},
  note   = {Built on Mistral-7B-v0.1 (Apache-2.0)},
  url    = {https://huggingface.co/deepakdsoni/antahkarana-7B}
}
```
