# GuppyLM

> A ~9M parameter tiny language model trained from scratch that roleplays as a fish named Guppy, designed as an educational project to demystify LLM training.

GuppyLM is a tiny, ~9M parameter language model trained entirely from scratch to demonstrate that building your own LLM requires no PhD or massive GPU cluster. It roleplays as a fish named Guppy, speaking in short lowercase sentences about water, food, light, and tank life. The entire pipeline (data generation, tokenizer training, model architecture, training loop, and inference) runs in a single Google Colab notebook in about 5 minutes on a free T4 GPU.

- **~9M parameter vanilla transformer** with 6 layers, a 384-dimensional hidden state, 6 attention heads, and a 4,096-token BPE vocabulary; intentionally simple, with no GQA, RoPE, or SwiGLU.
- **60K synthetic training samples** across 60 conversation topics (greetings, food, bubbles, dreams, jokes, and more), generated via template composition with randomized components.
- **Train in Colab** by setting the runtime to T4 GPU and running all cells; the notebook downloads the dataset, trains the tokenizer, trains the model, and tests it automatically.
- **Chat locally** by installing `torch` and `tokenizers` via pip, then running `python -m guppylm chat` from the command line.
- **Pre-trained model on HuggingFace** (`arman-bd/guppylm-9M`) lets you skip training and chat immediately via a dedicated Colab notebook.
- **Open dataset on HuggingFace** (`arman-bd/guppylm-60k-generic`) with 57K train / 3K test samples in a simple input/output/category JSON format, loadable via the `datasets` library.
- **Single-turn inference** keeps outputs reliable within the 128-token context window; the fish personality is baked into the weights rather than supplied through a system prompt.
- **MIT-licensed** source code with a clean project structure covering config, model, dataset, training loop, data generation, evaluation, and inference modules.
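As a rough illustration of the architecture described above, here is a minimal PyTorch sketch of a decoder-only transformer with the stated dimensions. The layer count, hidden size, head count, vocabulary size, and 128-token context come from the project description; the FFN width (2× hidden) and embedding weight tying are assumptions chosen so the parameter count lands near ~9M, and this is not the actual GuppyLM implementation.

```python
import torch
import torch.nn as nn


class TinyGuppyConfig:
    """Dimensions stated in the README; d_ff is an assumption."""
    vocab_size = 4096    # BPE vocabulary size
    n_layers = 6
    d_model = 384
    n_heads = 6
    max_seq_len = 128    # single-turn context window
    d_ff = 2 * 384       # assumed FFN width (not stated in the README)


class TinyGuppyLM(nn.Module):
    def __init__(self, cfg: TinyGuppyConfig):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.pos_emb = nn.Embedding(cfg.max_seq_len, cfg.d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=cfg.d_model,
            nhead=cfg.n_heads,
            dim_feedforward=cfg.d_ff,
            batch_first=True,
            norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=cfg.n_layers)
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
        # Weight tying between input embedding and output head (assumption).
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask so each token attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)


cfg = TinyGuppyConfig()
model = TinyGuppyLM(cfg)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")

logits = model(torch.randint(0, cfg.vocab_size, (2, 16)))
print(logits.shape)  # (batch, sequence, vocab) logits
```

With these assumed settings the sketch comes out to roughly 8.7M parameters, close to the stated ~9M; a wider FFN without weight tying would push it past 12M, which is why those two choices are hedged above.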
## Features

- ~9M parameter vanilla transformer architecture
- Trained from scratch in ~5 minutes on a free T4 GPU
- 60K synthetic training samples across 60 conversation topics
- BPE tokenizer with a 4,096-token vocabulary
- Pre-trained model available on HuggingFace
- Open dataset on HuggingFace (`guppylm-60k-generic`)
- Google Colab notebooks for training and inference
- Local CLI chat interface
- Single-turn inference design
- MIT-licensed open-source code

## Integrations

HuggingFace Hub, Google Colab, PyTorch, tokenizers (HuggingFace)

## Platforms

CLI, API, DEVELOPER_SDK

## Pricing

Open Source

## Links

- Website: https://github.com/arman-bd/guppylm
- Documentation: https://github.com/arman-bd/guppylm/blob/main/README.md
- Repository: https://github.com/arman-bd/guppylm
- EveryDev.ai: https://www.everydev.ai/tools/guppylm
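The "template composition with randomized components" technique used to build the 60K-sample dataset can be sketched as follows. The input/output/category record shape matches the dataset description; the specific templates, slot fillers, and category names here are invented for illustration and are not the project's real data.

```python
import random

# Hypothetical templates: each category maps to (input, output-template)
# pairs, where {slot} placeholders are filled with random components.
TEMPLATES = {
    "greetings": [
        ("hi guppy!", "blub blub! hi friend! the {thing} is so {adj} today!"),
        ("how are you?", "i am {adj}! just swimming near the {thing}!"),
    ],
    "food": [
        ("are you hungry?", "yes! i love {food} flakes! {food} is so {adj}!"),
    ],
}

# Hypothetical slot fillers, randomized per sample.
SLOTS = {
    "thing": ["water", "light", "gravel", "plant"],
    "adj": ["nice", "shiny", "warm", "bubbly"],
    "food": ["fish", "shrimp", "algae"],
}


def make_sample(rng: random.Random) -> dict:
    """Pick a category and template, then fill each {slot} randomly."""
    category = rng.choice(sorted(TEMPLATES))
    inp, out_template = rng.choice(TEMPLATES[category])
    # str.format ignores unused keyword arguments, so every slot can
    # be offered to every template.
    out = out_template.format(**{k: rng.choice(v) for k, v in SLOTS.items()})
    return {"input": inp, "output": out, "category": category}


rng = random.Random(0)
dataset = [make_sample(rng) for _ in range(5)]
for sample in dataset:
    print(sample)
```

Scaling the template and slot inventories up across 60 topics yields tens of thousands of distinct samples from a small amount of hand-written material, which is how a dataset of this size stays cheap to produce.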