# GuppyLM

> A ~9M parameter tiny language model trained from scratch that roleplays as a fish named Guppy, designed as an educational project to demystify LLM training.

GuppyLM is a tiny, ~9M parameter language model trained entirely from scratch to demonstrate that building your own LLM requires no PhD or massive GPU cluster. It roleplays as a fish named Guppy, speaking in short lowercase sentences about water, food, light, and tank life. The entire pipeline (data generation, tokenizer training, model architecture, training loop, and inference) runs in a single Google Colab notebook in about 5 minutes on a free T4 GPU.

- **~9M parameter vanilla transformer** with 6 layers, a 384-dimensional hidden state, 6 attention heads, and a 4,096-token BPE vocabulary; intentionally simple, with no GQA, RoPE, or SwiGLU.
- **60K synthetic training samples** across 60 conversation topics (greetings, food, bubbles, dreams, jokes, and more), generated via template composition with randomized components.
- **Train in Colab** by setting the runtime to T4 GPU and running all cells; the notebook downloads the dataset, trains the tokenizer, trains the model, and tests it automatically.
- **Chat locally** by installing `torch` and `tokenizers` via pip, then running `python -m guppylm chat` from the command line.
- **Pre-trained model on HuggingFace** (`arman-bd/guppylm-9M`) lets you skip training and chat immediately via a dedicated Colab notebook.
- **Open dataset on HuggingFace** (`arman-bd/guppylm-60k-generic`) with 57K train / 3K test samples in a simple input/output/category JSON format, loadable via the `datasets` library.
- **Single-turn inference** keeps outputs reliable within the 128-token context window; the fish personality is baked into the weights rather than supplied through a system prompt.
- **MIT-licensed** source code with a clean project structure covering config, model, dataset, training loop, data generation, evaluation, and inference modules.
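As a rough illustration of the architecture described above, here is a minimal PyTorch sketch of a decoder-only transformer with the stated dimensions. The layer count, hidden size, head count, vocabulary size, and 128-token context come from the project description; the FFN width (2× hidden) and embedding weight tying are assumptions chosen so the parameter count lands near ~9M, and this is not the actual GuppyLM implementation.

```python
import torch
import torch.nn as nn


class TinyGuppyConfig:
    """Dimensions stated in the README; d_ff is an assumption."""
    vocab_size = 4096    # BPE vocabulary size
    n_layers = 6
    d_model = 384
    n_heads = 6
    max_seq_len = 128    # single-turn context window
    d_ff = 2 * 384       # assumed FFN width (not stated in the README)


class TinyGuppyLM(nn.Module):
    def __init__(self, cfg: TinyGuppyConfig):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.pos_emb = nn.Embedding(cfg.max_seq_len, cfg.d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=cfg.d_model,
            nhead=cfg.n_heads,
            dim_feedforward=cfg.d_ff,
            batch_first=True,
            norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=cfg.n_layers)
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
        # Weight tying between input embedding and output head (assumption).
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask so each token attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)


cfg = TinyGuppyConfig()
model = TinyGuppyLM(cfg)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")

logits = model(torch.randint(0, cfg.vocab_size, (2, 16)))
print(logits.shape)  # (batch, sequence, vocab) logits
```

With these assumed settings the sketch comes out to roughly 8.7M parameters, close to the stated ~9M; a wider FFN without weight tying would push it past 12M, which is why those two choices are hedged above.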
## Features

- ~9M parameter vanilla transformer architecture
- Trained from scratch in ~5 minutes on a free T4 GPU
- 60K synthetic training samples across 60 conversation topics
- BPE tokenizer with a 4,096-token vocabulary
- Pre-trained model available on HuggingFace
- Open dataset on HuggingFace (`guppylm-60k-generic`)
- Google Colab notebooks for training and inference
- Local CLI chat interface
- Single-turn inference design
- MIT-licensed open-source code

## Integrations

HuggingFace Hub, Google Colab, PyTorch, tokenizers (HuggingFace)

## Platforms

CLI, API, DEVELOPER_SDK

## Pricing

Open Source

## Links

- Website: https://github.com/arman-bd/guppylm
- Documentation: https://github.com/arman-bd/guppylm/blob/main/README.md
- Repository: https://github.com/arman-bd/guppylm
- EveryDev.ai: https://www.everydev.ai/tools/guppylm
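The "template composition with randomized components" technique used to build the 60K-sample dataset can be sketched as follows. The input/output/category record shape matches the dataset description; the specific templates, slot fillers, and category names here are invented for illustration and are not the project's real data.

```python
import random

# Hypothetical templates: each category maps to (input, output-template)
# pairs, where {slot} placeholders are filled with random components.
TEMPLATES = {
    "greetings": [
        ("hi guppy!", "blub blub! hi friend! the {thing} is so {adj} today!"),
        ("how are you?", "i am {adj}! just swimming near the {thing}!"),
    ],
    "food": [
        ("are you hungry?", "yes! i love {food} flakes! {food} is so {adj}!"),
    ],
}

# Hypothetical slot fillers, randomized per sample.
SLOTS = {
    "thing": ["water", "light", "gravel", "plant"],
    "adj": ["nice", "shiny", "warm", "bubbly"],
    "food": ["fish", "shrimp", "algae"],
}


def make_sample(rng: random.Random) -> dict:
    """Pick a category and template, then fill each {slot} randomly."""
    category = rng.choice(sorted(TEMPLATES))
    inp, out_template = rng.choice(TEMPLATES[category])
    # str.format ignores unused keyword arguments, so every slot can
    # be offered to every template.
    out = out_template.format(**{k: rng.choice(v) for k, v in SLOTS.items()})
    return {"input": inp, "output": out, "category": category}


rng = random.Random(0)
dataset = [make_sample(rng) for _ in range(5)]
for sample in dataset:
    print(sample)
```

Scaling the template and slot inventories up across 60 topics yields tens of thousands of distinct samples from a small amount of hand-written material, which is how a dataset of this size stays cheap to produce.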