Techno-1

C2OptimisedAssembly

License: apache-2.0

Uploaded finetuned model

  • Developed by: Techno-1
  • License: apache-2.0
  • Finetuned from model: unsloth/Qwen3.5-2B

Training was completed on a free T4 GPU Google Colab instance, using the Unsloth template linked here: https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb

Chat Template:

{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '### Instruction:\n' + message['content'] + '\n\n' }}
{% elif message['role'] == 'assistant' %}
{{ '### Response:\n' + message['content'] + eos_token + '\n' }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
{{ '### Response:\n' }}
{% endif %}
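
To make the template's behaviour concrete, here is a minimal C sketch of the same branching logic. The names `Message`, `append`, and `render_prompt` are hypothetical helpers written for this illustration, and the sketch ignores any whitespace Jinja itself may emit between tags. One thing it makes visible: a message whose role is neither `user` nor `assistant` (for example `system`) produces no text at all under this template.

```c
#include <stddef.h>
#include <string.h>

/* One chat turn. Hypothetical type for this sketch only. */
typedef struct {
    const char *role;
    const char *content;
} Message;

/* Append src to the NUL-terminated string in dst (capacity cap bytes).
   Returns 0 on success, -1 if the result would not fit. */
static int append(char *dst, size_t cap, const char *src) {
    size_t used = strlen(dst), need = strlen(src);
    if (used + need + 1 > cap) return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Mirrors the template's if/elif structure. Returns 0 on success, -1 on overflow. */
int render_prompt(const Message *msgs, size_t count, const char *eos_token,
                  int add_generation_prompt, char *out, size_t cap) {
    if (cap == 0) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        if (strcmp(msgs[i].role, "user") == 0) {
            if (append(out, cap, "### Instruction:\n") ||
                append(out, cap, msgs[i].content) ||
                append(out, cap, "\n\n")) return -1;
        } else if (strcmp(msgs[i].role, "assistant") == 0) {
            if (append(out, cap, "### Response:\n") ||
                append(out, cap, msgs[i].content) ||
                append(out, cap, eos_token) ||
                append(out, cap, "\n")) return -1;
        }
        /* Any other role, including "system", adds nothing to the prompt. */
    }
    if (add_generation_prompt && append(out, cap, "### Response:\n")) return -1;
    return 0;
}
```

If the system prompt is meant to reach the model through this template, it would have to be placed inside the user message content, since the template has no branch for it.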

System Prompt:

You are an expert systems programmer and compiler engineer. Your goal is to translate C code into high-performance, hardware-specific x86-64 AVX2 assembly. You value register efficiency, branchless execution, and correct usage of SIMD instructions like VMAXPS and VADDPS.

Prompt:

\### Instruction:
Optimize the vector addition function using AVX2. Assume n is a multiple of 8.
\### Input:
void vec_add(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
\### Response:

Max tokens:

256

The rest of the settings were left at their defaults.

Analysis:

This question was a control test to see how the fine-tuning affected the model in general. The code is logically very similar to an example in the training data, though the syntax and layout are slightly different.

The fine-tuned model output code that matched the kind of response the training data presented as 'correct'. The base model, by comparison, produced confident hallucinations in its response, claiming that VMAXPS (maximum) is preferred over VADDPS (addition) for performance. According to Gemini Flash Lite Extended this is incorrect, because MAX and ADD are different mathematical operations: one returns the larger of two values and the other their sum, so they are not interchangeable.
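
For reference, a correct AVX2-style version of the control task can be sketched with compiler intrinsics rather than hand-written assembly. This is not the output of either model, just a minimal illustration. It assumes, as the prompt does, that `n` is a multiple of 8, and it falls back to a scalar loop when the file is not compiled with AVX2 enabled (for example without `-mavx2` on GCC or Clang). The `const` qualifiers are an addition for this sketch.

```c
#include <assert.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* c[i] = a[i] + b[i] for i in [0, n). n must be a multiple of 8. */
void vec_add(const float *a, const float *b, float *c, int n) {
    assert(n >= 0 && n % 8 == 0);
#if defined(__AVX2__)
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* unaligned load of 8 floats */
        __m256 vb = _mm256_loadu_ps(b + i);
        /* _mm256_add_ps is the intrinsic for VADDPS: element-wise sum. */
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
#else
    /* Scalar fallback so the sketch still builds without AVX2 support. */
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
#endif
}
```

Swapping `_mm256_add_ps` for `_mm256_max_ps` (the VMAXPS intrinsic) would still compile, but it would store the larger of each pair instead of the sum, which is why the base model's claim cannot be right for this function.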

One issue both models shared was repeating themselves. I'm not certain whether they were never emitting an end-of-sequence token and simply rambling, or whether the chat template was failing to stop generation properly.
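
One way to work around the repetition, whichever cause is behind it, is to cut the generated text at the first stop marker after generation. The sketch below is a hypothetical post-processing helper, not part of any inference library. The stop strings shown in the usage note are placeholders, and the model's real end-of-sequence token string would need to be substituted.

```c
#include <stddef.h>
#include <string.h>

/* Truncate `text` in place at the earliest occurrence of any stop string.
   Returns the length of the text after truncation. */
size_t truncate_at_stop(char *text, const char *const *stops, size_t n_stops) {
    char *cut = NULL;
    for (size_t i = 0; i < n_stops; i++) {
        if (stops[i][0] == '\0') continue;       /* an empty stop would match at offset 0 */
        char *hit = strstr(text, stops[i]);
        if (hit && (!cut || hit < cut)) cut = hit; /* keep the earliest match */
    }
    if (cut) *cut = '\0';
    return strlen(text);
}
```

Called with stops such as `"<eos>"` (placeholder) and `"### Instruction:"`, this keeps only the first response. Checking whether the raw output contains the end-of-sequence string at all would also help tell the two explanations above apart.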

This indicates that fine-tuning can improve the model's ability to correctly apply memorised, lookup-style optimisations when given a prompt in C.

However, this result is well within the scope of the data seen during fine-tuning, and so lies quite firmly on the training data manifold.

An alternative setup would change the task. Instead of taking C code and outputting optimised assembly directly, the model could be trained on the C code paired with its already compiled assembly as input, with a further optimised version of that assembly as the target. This would take advantage of the LLM's context about the original C logic, and about what will or won't be run together, to optimise further than the compiler through contextual understanding of the code. The more advanced optimisations, such as the contextual ones, and the synthetic data generation as a whole would be done by a more advanced model. If this could be done reliably and scaled up to cover a huge number of optimisation patterns that leverage the LLM's contextual understanding, then even pattern-matching optimisations onto existing C codebases could be useful. Some optimisations simply don't occur because compilers are sometimes too cautious, and less experienced human programmers are unsure how to override them.
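
A well-known case of this compiler caution is pointer aliasing. In the plain `vec_add` signature the compiler has to allow for `c` overlapping `a` or `b`, which may lead it to add runtime overlap checks or generate more conservative code. The C99 `restrict` qualifier is the programmer's promise that the arrays do not overlap. The sketch below is an illustrative example chosen for this write-up, not something taken from the training data.

```c
#include <stddef.h>

/* `restrict` tells the compiler that a, b and c never overlap, so it is
   free to vectorise the loop without guarding against aliasing.
   Calling this with overlapping arrays is undefined behaviour. */
void vec_add_restrict(const float *restrict a, const float *restrict b,
                      float *restrict c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Recognising that a call site never passes overlapping buffers is exactly the kind of contextual fact a model could flag and a compiler cannot assume on its own.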

The models were given this:

Prompt:

\### Instruction:
Optimize the vector subtraction and multiplication function using AVX2. Assume n is a multiple of 8.
\### Input:
void vec_sub_mul(float* a, float* b, float* c, float* d, int n) {
for (int i = 0; i < n; i++) {
d[i] = (a[i] - b[i]) * c[i];
}
}

The rest of the configuration remained the same.

This prompt was out of distribution, but it should have been solvable by combining ideas from two pieces of code that were firmly within the distribution.

Unfortunately, both the original and fine-tuned models hallucinated on this prompt, indicating that although the models have learnt some of the general syntax of assembly, they haven't yet learnt the underlying logic and valid patterns.
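
For comparison, the computation the second prompt asks for is a subtract followed by a multiply on eight floats at a time. The intrinsics sketch below shows one correct form. It is a reference written for this write-up, not either model's output. It assumes `n` is a multiple of 8, as the prompt does, and uses a scalar fallback when AVX2 is not enabled at compile time.

```c
#include <assert.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* d[i] = (a[i] - b[i]) * c[i] for i in [0, n). n must be a multiple of 8. */
void vec_sub_mul(const float *a, const float *b, const float *c, float *d, int n) {
    assert(n >= 0 && n % 8 == 0);
#if defined(__AVX2__)
    for (int i = 0; i < n; i += 8) {
        /* VSUBPS: element-wise a - b */
        __m256 diff = _mm256_sub_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i));
        /* VMULPS: element-wise (a - b) * c */
        __m256 prod = _mm256_mul_ps(diff, _mm256_loadu_ps(c + i));
        _mm256_storeu_ps(d + i, prod);
    }
#else
    /* Scalar fallback so the sketch still builds without AVX2 support. */
    for (int i = 0; i < n; i++) d[i] = (a[i] - b[i]) * c[i];
#endif
}
```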

This is somewhat to be expected for non-thinking models with only 2 billion parameters that are fine-tuned only with LoRA adapters. Getting new logic out of them with the LoRA setup used here may not be impossible, but it would be quite difficult. LoRA adapters add only a small number of trainable parameters alongside the frozen pretrained weights and do not make the network any deeper. Without extra thinking tokens, the number of logical steps the model can complete is therefore severely limited: it stays close to what the original model could already do before its weights were frozen, plus whatever the adapters can adjust. In future, experiments with longer-thinking models, generally larger models, deeper tuning, or LoRA applied to more layers may improve the logic and generalisation performance of this sort of fine-tuning.
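
For context, the standard LoRA formulation (assumed here, since the rank and scaling used in this run are not stated above) keeps each adapted weight matrix $W_0$ frozen and learns a low-rank correction next to it:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ are trained, so the number of new parameters per matrix is $r(d + k)$ rather than $dk$, and the depth of the network is unchanged.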

As it stands, reinforcement learning on the current model configuration would probably be quite difficult, due to the lack of basic logical generalisation in the problem domain.

A useful model in this configuration could still potentially be tuned to act as a categorical identifier of inefficiencies and optimisations, instead of writing the whole thing itself. It could be fine-tuned to memorise patterns of inefficiency in either binary or C code, which after detection could be fixed by a human or a smarter model.

Overall, though the experiment wasn't a full success, useful information was gained and possible future directions were identified.

This qwen3_5 model was trained 2x faster with Unsloth and Huggingface's TRL library.

  • Model provider: Techno-1
  • Base model: unsloth/Qwen3.5-2B
  • Input modalities: Video, Text, Image
  • Output modalities: Text
