An open-source framework for researching and developing foundation models, with full reproducibility of every step from raw data to final model.
About Marin
Marin is an open-source framework built by the marin-community organization for the research and development of foundation models. It operates as an open lab where every step of the model-building process—data curation, training, evaluation, and even failed experiments—is recorded and shared publicly in real time. The project is licensed under Apache 2.0 and hosted on GitHub, with documentation available on ReadTheDocs.
What It Is
Marin is a Python-based framework designed to make foundation model research fully reproducible and transparent. Rather than sharing only final model weights, Marin captures the entire provenance graph: raw data sources, tokenization pipelines, training configurations, hyperparameter choices, and evaluation results. It targets researchers and practitioners who want to train language models with Llama-, DeepSeek-, or Qwen-style architectures from scratch, and who want every decision to be auditable and replicable.
How the Experiment Workflow Works
Marin structures research as a directed acyclic graph of steps, similar to a Makefile, where each step can depend on prior steps and is executed in topological order. The lifecycle of an experiment follows a defined pattern:
- A GitHub issue is created to preregister the experiment with hypotheses and goals.
- A pull request is submitted with code that reproduces the experiment.
- The code defines a provenance graph that is executed, with results summarized in a WandB report.
This means every experiment—including those that failed—is traceable through a GitHub issue, a PR, executable code, and a WandB run. Example experiments tracked this way include comparisons of z-loss impact, optimizer sweeps (AdamW vs. alternatives), BERT vs. fastText as quality filters, and MoE vs. dense model efficiency.
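To make the Makefile analogy concrete, the following is a minimal sketch of a dependency graph executed in topological order. It uses only Python's standard-library `graphlib` and is not Marin's own API: the step names, `run_order`, `execute`, and `make_action` are all hypothetical and chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical step names (not taken from Marin); each step maps to the
# set of steps it depends on.
steps = {
    "download_raw": set(),
    "filter_quality": {"download_raw"},
    "tokenize": {"filter_quality"},
    "train": {"tokenize"},
    "evaluate": {"train"},
}


def run_order(graph):
    """Return an execution order in which every step follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def make_action(name):
    """Build a placeholder action that records which parent outputs it received."""
    def action(inputs):
        # A real step would read its inputs and write an artifact;
        # this stand-in only records lineage.
        return {"step": name, "parents": sorted(inputs)}
    return action


def execute(graph, actions):
    """Run each step's callable in dependency order and collect the outputs."""
    results = {}
    for name in run_order(graph):
        inputs = {dep: results[dep] for dep in graph[name]}
        results[name] = actions[name](inputs)
    return results


actions = {name: make_action(name) for name in steps}
results = execute(steps, actions)
```

Because each result records its parents, the `results` dictionary doubles as a small lineage record. `TopologicalSorter` raises `graphlib.CycleError` if the graph contains a cycle, so a misdeclared dependency is caught before any step runs.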
Models Trained with Marin
The marin-community has used the framework to train and release several models:
- Marin-8B-Base: The project claims this was the first open-source 8B parameter model to outperform Llama 3.1 8B, beating it on 14 out of 19 standard benchmarks.
- Marin-8B-Instruct: A fine-tuned instruction-following variant available to try on Together AI.
- Marin-32B-Base: The project states this beats OLMo 2 32B Base on 14/19 standard benchmarks and is competitive with Gemma 3 27B PT and Qwen 2.5 32B Base.
All training scripts, execution graphs, and WandB reports for these models are publicly linked from the project homepage.
Core Capabilities
Marin covers the full pipeline for language model development:
- Data curation: filtering, transformation, and quality scoring of raw datasets
- Tokenization: configurable tokenization pipelines (e.g., Llama 3 tokenizer)
- Training: supports TPU pods (including multislice TPU) and GPU multi-node setups
- Evaluation: integrates with EleutherAI's lm-evaluation-harness for in-loop eval during training
- Speedrun competition: a community benchmark inspired by the nanogpt speedrun, where participants compete to train models to a target quality within a compute budget
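As a rough illustration of the data curation stage listed above, the sketch below scores documents and keeps those above a threshold. It does not use Marin's API: `score_quality`, `filter_documents`, and the 0.8 threshold are hypothetical, and the scorer is a toy heuristic standing in for a learned quality classifier.

```python
from typing import Callable, Iterable, Iterator


def score_quality(text: str) -> float:
    """Toy stand-in for a learned classifier: share of letters and whitespace."""
    if not text:
        return 0.0
    clean = sum(ch.isalpha() or ch.isspace() for ch in text)
    return clean / len(text)


def filter_documents(
    docs: Iterable[str],
    scorer: Callable[[str], float] = score_quality,
    threshold: float = 0.8,  # assumed cutoff, chosen only for this example
) -> Iterator[str]:
    """Yield only the documents whose quality score meets the threshold."""
    for doc in docs:
        if scorer(doc) >= threshold:
            yield doc


docs = ["Clean readable sentence", "$$$ 1234 !!! ###"]
kept = list(filter_documents(docs))
```

Passing the scorer as a parameter means a different quality model can be swapped in without touching the filtering loop.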
Current Status and Community
As of May 2026, the repository shows active development with 983 stars, 116 forks, and 578 open issues. The project acknowledges support from the Google TPU Research Cloud program. Community participation happens via Discord and a mailing list, and the project explicitly invites contributions across architecture experiments, training algorithms, datasets, and evaluations. Agent skill guides (e.g., for adding new datasets) are included in the repository under .agents/skills/.
Pricing
Open Source
Fully free and open-source under Apache License 2.0. Free to use, modify, and distribute.
- Full framework source code
- Data curation and tokenization pipelines
- Language model training on TPU and GPU
- In-loop evaluation with lm-evaluation-harness
- WandB integration
Capabilities
Key Features
- Full reproducibility of every training step
- Provenance graph execution (DAG-based, like a Makefile)
- Data curation, filtering, transformation, and tokenization pipelines
- Language model training on TPU pods and multi-node GPUs
- In-loop evaluation with lm-evaluation-harness
- WandB integration for experiment reporting
- GitHub issue-based experiment preregistration
- Speedrun competition for efficient training methods
- Perplexity Gap Dashboard for analysis
- Agent skill guides for common tasks
