Marcoson320/codeparrot-ds-from-scratch API & Inference Endpoint

Training configuration

Item	Value
Architecture	GPT-2 (12 layers, 12 heads, 768 hidden)
Parameters	124,242,432 (~124M)
Context length	128 tokens
Tokenizer	`huggingface-course/code-search-net-tokenizer` (BPE, vocab=50,000)
Training data	`huggingface-course/codeparrot-ds-train` (1 epoch, 16.7M chunks)
Validation data	`huggingface-course/codeparrot-ds-valid`
Optimizer	AdamW
Learning rate	5e-4, cosine schedule, 1,000 warmup steps
Weight decay	0.1
Effective batch size	256 (per_device_bs=64 × grad_accum=2 × world_size=2)
Mixed precision	fp16
Training steps	65,243 (1 epoch)
Training time	~19 hours
Hardware	2 × AMD Radeon Instinct MI50 (gfx906) via PyTorch ROCm + DDP
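
The numbers in the table can be cross-checked with a little arithmetic. The sketch below is not the training script; it recomputes the parameter count, the effective batch size, and the number of chunks seen, and evaluates a learning-rate curve. Two things are assumptions rather than facts from the table: that the position-embedding table keeps GPT-2's default 1,024 slots (`N_POSITIONS`), and that the schedule has the usual linear-warmup-then-cosine-decay-to-zero shape. The helper names `gpt2_param_count` and `cosine_lr` are made up for illustration.

```python
import math

# Values taken from the configuration table.
VOCAB_SIZE = 50_000
HIDDEN = 768
LAYERS = 12
# Assumption: the position table keeps GPT-2's default of 1,024 slots,
# even though the training chunks are only 128 tokens long.
N_POSITIONS = 1_024


def gpt2_param_count(vocab, hidden, layers, n_positions):
    """Count GPT-2 parameters, with the output head tied to the token embeddings."""
    embeddings = vocab * hidden + n_positions * hidden
    layer_norm = 2 * hidden  # weight + bias
    # Fused QKV projection plus the attention output projection (weights + biases).
    attention = (hidden * 3 * hidden + 3 * hidden) + (hidden * hidden + hidden)
    # Two MLP projections with a 4x expansion (weights + biases).
    mlp = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
    block = 2 * layer_norm + attention + mlp
    return embeddings + layers * block + layer_norm  # plus the final layer norm


def cosine_lr(step, base_lr=5e-4, warmup=1_000, total=65_243):
    """Assumed schedule: linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


params = gpt2_param_count(VOCAB_SIZE, HIDDEN, LAYERS, N_POSITIONS)
effective_batch = 64 * 2 * 2  # per-device batch x grad accumulation x GPUs
chunks_seen = 65_243 * effective_batch

print(f"parameters:      {params:,}")
print(f"effective batch: {effective_batch}")
print(f"chunks seen:     {chunks_seen:,}")
print(f"lr at step 500:  {cosine_lr(500):.2e}")
```

Under these assumptions the count comes out to exactly 124,242,432, and 65,243 steps at batch size 256 give 16,702,208 chunks, which agrees with the "16.7M chunks" figure for one epoch.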

Training results

step	epoch	train_loss	eval_loss
5,000	0.077	2.677	1.752
10,000	0.153	1.685	1.520
15,000	0.230	1.529	1.415
20,000	0.307	1.447	1.347
25,000	0.383	1.386	1.295
30,000	0.460	1.334	1.247
35,000	0.537	1.288	1.204
60,000	0.920	—	1.054
65,243	1.000	1.106	1.051
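
The eval losses are easier to interpret as perplexities. A minimal sketch, assuming `eval_loss` is the mean per-token cross-entropy in nats (the usual convention for causal language-model losses, but not stated in this README); the `perplexity` helper is only illustrative:

```python
import math

# (step, eval_loss) pairs copied from the results table.
eval_losses = [
    (5_000, 1.752),
    (10_000, 1.520),
    (15_000, 1.415),
    (20_000, 1.347),
    (25_000, 1.295),
    (30_000, 1.247),
    (35_000, 1.204),
    (60_000, 1.054),
    (65_243, 1.051),
]


def perplexity(loss):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(loss)


for step, loss in eval_losses:
    print(f"step {step:>6,}: eval_loss={loss:.3f}  perplexity={perplexity(loss):.2f}")
```

Read this way, the final eval loss of 1.051 corresponds to a per-token perplexity of roughly 2.9 over the tokenizer's 50,000-entry vocabulary.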

Usage

```python
from transformers import pipeline

# device=0 selects the first GPU.
pipe = pipeline("text-generation", model="Marcoson320/codeparrot-ds-from-scratch", device=0)

prompt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
"""

print(pipe(prompt, max_new_tokens=64, num_return_sequences=1)[0]["generated_text"])
```

Example output:

```python
# create scatter plot with x, y
axScatter = fig.add_subplot(111)
axScatter.scatter(x, y, s=50, marker="d", c="red", alpha=0.7)
```

Known limitations

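Small model + ctx=128: continuations easily fall into repetition loops. Passing `repetition_penalty=1.2` or `no_repeat_ngram_size=3` at inference time mitigates this.
Limited API coverage: the training data covers pandas/sklearn/matplotlib/seaborn, but the model may fail to produce less common API calls.
No instruction tuning: this is a pure continuation model and cannot hold a conversation.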
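
The repetition mitigation from the first limitation can be sketched as a small wrapper around the pipeline call. `repetition_penalty` and `no_repeat_ngram_size` are standard `transformers` generation arguments; the helper names `build_generation_kwargs` and `complete` are hypothetical and only serve this example.

```python
def build_generation_kwargs(max_new_tokens=64, repetition_penalty=1.2,
                            no_repeat_ngram_size=None):
    """Collect generation settings (hypothetical helper, not part of the model)."""
    kwargs = {"max_new_tokens": max_new_tokens, "num_return_sequences": 1}
    if repetition_penalty is not None:
        # Values above 1.0 down-weight tokens that have already been generated.
        kwargs["repetition_penalty"] = repetition_penalty
    if no_repeat_ngram_size is not None:
        # Forbid any n-gram of this size from appearing twice in the output.
        kwargs["no_repeat_ngram_size"] = no_repeat_ngram_size
    return kwargs


def complete(pipe, prompt, **overrides):
    """Call a text-generation pipeline with the settings above and return the text."""
    outputs = pipe(prompt, **build_generation_kwargs(**overrides))
    return outputs[0]["generated_text"]
```

With a pipeline created as in the usage section, `complete(pipe, prompt)` applies the penalty of 1.2, and `complete(pipe, prompt, repetition_penalty=None, no_repeat_ngram_size=3)` switches to the n-gram constraint instead. Which of the two works better for a given prompt is something to try empirically.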

Acknowledgements

Training recipe from the HuggingFace LLM Course, Chapter 7.6
Datasets curated by the HuggingFace Course team
Tokenizer reused from `huggingface-course/code-search-net-tokenizer`

License

Apache-2.0
