GuppyLM
A tiny, ~9M-parameter language model trained from scratch that roleplays as a fish named Guppy. An educational project designed to demystify LLM training.
Listed Apr 2026
About GuppyLM
GuppyLM is a tiny, ~9M parameter language model trained entirely from scratch to demonstrate that building your own LLM requires no PhD or massive GPU cluster. It roleplays as a fish named Guppy, speaking in short lowercase sentences about water, food, light, and tank life. The entire pipeline — data generation, tokenizer training, model architecture, training loop, and inference — runs in a single Google Colab notebook in about 5 minutes on a free T4 GPU.
- ~9M parameter vanilla transformer with 6 layers, 384 hidden dim, 6 attention heads, and a 4,096-token BPE vocabulary — intentionally simple with no GQA, RoPE, or SwiGLU.
- 60K synthetic training samples across 60 conversation topics (greetings, food, bubbles, dreams, jokes, and more), generated via template composition with randomized components.
- Train in Colab by setting the runtime to T4 GPU and running all cells — the notebook downloads the dataset, trains the tokenizer, trains the model, and tests it automatically.
- Chat locally by installing `torch` and `tokenizers` via pip, then running `python -m guppylm chat` from the command line.
- Pre-trained model on HuggingFace (`arman-bd/guppylm-9M`) lets you skip training and chat immediately via a dedicated Colab notebook.
- Open dataset on HuggingFace (`arman-bd/guppylm-60k-generic`) with 57K train / 3K test samples in a simple input/output/category JSON format, loadable via the `datasets` library.
- Single-turn inference design keeps outputs reliable within the 128-token context window; the fish personality is baked into the weights rather than a system prompt.
- MIT licensed source code with a clean project structure covering config, model, dataset, training loop, data generation, evaluation, and inference modules.
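A rough back-of-envelope for the ~9M figure, using the hyperparameters above. The FFN expansion ratio, bias-free projections, and tied input/output embeddings are assumptions for illustration; the repo's actual config may differ:

```python
# Rough parameter count for a vanilla decoder-only transformer.
# vocab/d_model/layers/context come from the description; the rest are guesses.
vocab, d_model, n_layers, n_ctx = 4096, 384, 6, 128
ffn_mult = 2  # ASSUMPTION: 2x FFN expansion (a 4x FFN would land nearer 12M)

token_emb = vocab * d_model               # shared with the LM head if tied
pos_emb = n_ctx * d_model                 # learned absolute positions (no RoPE)
attn = 4 * d_model * d_model              # Q, K, V, output projections per layer
ffn = 2 * d_model * (ffn_mult * d_model)  # up- and down-projection per layer

total = token_emb + pos_emb + n_layers * (attn + ffn)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # 8,699,904 parameters (~8.7M)
```

Under these assumptions the count lands at roughly 8.7M, consistent with the "~9M" headline once layer norms and biases are added back.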
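The template-composition idea behind the 60K synthetic samples can be sketched like this. The topics, phrasings, and counts below are illustrative, but the input/output/category record layout matches the dataset description:

```python
import json
import random

# Illustrative templates; the real generator composes 60 topics this way,
# with many more randomized components per topic.
TEMPLATES = {
    "greetings": {
        "inputs": ["hello guppy", "hi there little fish", "good morning guppy"],
        "outputs": ["hi hi! the water is nice today", "hello! i was chasing a bubble"],
    },
    "food": {
        "inputs": ["are you hungry?", "what do you eat?"],
        "outputs": ["flakes! i love flakes", "i ate a tiny flake. it was the best flake"],
    },
}

def make_sample(rng: random.Random) -> dict:
    """Compose one input/output/category record from randomized components."""
    category = rng.choice(sorted(TEMPLATES))
    topic = TEMPLATES[category]
    return {
        "input": rng.choice(topic["inputs"]),
        "output": rng.choice(topic["outputs"]),
        "category": category,
    }

rng = random.Random(0)
dataset = [make_sample(rng) for _ in range(5)]
print(json.dumps(dataset[0]))
```

Scaling the topic table to 60 categories and sampling 60K times yields a dataset in the same JSON shape the HuggingFace release uses.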
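The single-turn design boils down to a plain generation loop capped at the context window. A minimal sketch with a stand-in for the model (function names here are hypothetical; the actual logic lives in the repo's inference module):

```python
from typing import Callable, List

def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_ctx: int = 128,
             eos: int = 0) -> List[int]:
    """Greedy single-turn generation: extend the prompt until EOS or the
    128-token context window is full. No multi-turn history is carried over."""
    ids = list(prompt)
    while len(ids) < max_ctx:
        tok = next_token(ids)
        if tok == eos:
            break
        ids.append(tok)
    return ids

# Stand-in "model": emits token 7 until the sequence reaches 5 tokens, then EOS.
def toy_model(ids: List[int]) -> int:
    return 7 if len(ids) < 5 else 0

print(generate(toy_model, [1, 2]))  # [1, 2, 7, 7, 7]
```

Because each chat turn starts from a fresh prompt, the model never has to reason over long histories, which is what keeps a 9M-parameter model's outputs reliable.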
Pricing
Open Source (MIT)
Fully free and open-source under the MIT license. Train, modify, and distribute freely.
- Full source code access
- Train from scratch in Google Colab
- Pre-trained model on HuggingFace
- 60K open dataset on HuggingFace
- Local CLI chat interface
Capabilities
Key Features
- ~9M parameter vanilla transformer architecture
- Trained from scratch in ~5 minutes on a free T4 GPU
- 60K synthetic training samples across 60 conversation topics
- BPE tokenizer with 4,096 vocab size
- Pre-trained model available on HuggingFace
- Open dataset on HuggingFace (`arman-bd/guppylm-60k-generic`)
- Google Colab notebooks for training and inference
- Local CLI chat interface
- Single-turn inference design
- MIT licensed open-source code
