Needle
A 26M parameter open-source function-call model distilled from Gemini, designed to run on tiny consumer devices like phones, watches, and glasses.
About Needle
Needle is a 26-million-parameter "Simple Attention Network" (SAN) developed by Cactus Compute, distilled from Gemini 3.1 and optimized for single-shot function calling on extremely resource-constrained devices. The model weights are fully open on HuggingFace under the Cactus-Compute/needle repository, and the project is MIT-licensed. According to the repository, in production Needle runs on the Cactus runtime at 6,000 tokens/sec prefill and 1,200 tokens/sec decode.
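Taken at face value, those throughput figures imply a rough per-call latency budget. The prompt and output sizes in this sketch are illustrative assumptions, not measurements from the repository:

```python
# Back-of-envelope latency from the quoted throughput figures
# (6,000 tok/s prefill, 1,200 tok/s decode on the Cactus runtime).
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200

def call_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds to prefill the prompt plus decode the function call."""
    return prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S

# e.g. a 512-token tool schema + user query, and a ~60-token call
# (hypothetical sizes chosen for illustration):
print(f"{call_latency_s(512, 60):.3f} s")  # ≈ 0.135 s
```

At these rates even a fairly large tool schema stays comfortably under interactive latency on-device, which is the point of the design.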
What It Is
Needle is a tiny language model purpose-built for tool/function calling on consumer hardware — phones, smartwatches, AR glasses, and similar edge devices. Rather than being a general-purpose conversational model, it is specifically post-trained on a 2-billion-token single-shot function call dataset to excel at structured output generation for agentic pipelines. The architecture uses a 12-layer encoder with grouped-query attention (GQA) and RoPE, cross-attending into an 8-layer decoder, with zero-centered RMSNorm (ZCRMSNorm) and gated residuals throughout, and a BPE vocabulary of 8,192 tokens at d=512.
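The two less common ingredients, ZCRMSNorm and gated residuals, can be sketched as follows. This assumes the widespread zero-centered convention (the learnable gain is stored as g, applied as 1 + g, and initialized to zero) and a sigmoid residual gate; the repository may define either one differently:

```python
import numpy as np

def zc_rms_norm(x: np.ndarray, g: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Zero-centered RMSNorm sketch: normalize by RMS, scale by (1 + g).

    With g initialized to zero this starts as plain RMS normalization,
    which is the usual motivation for the zero-centered parameterization.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * (1.0 + g)

def gated_residual(x: np.ndarray, sublayer_out: np.ndarray,
                   gate_logit: float) -> np.ndarray:
    """Mix a sublayer's output into the residual stream via a learned gate.

    A scalar sigmoid gate is assumed here purely for illustration.
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid
    return x + gate * sublayer_out
```

With g = 0 and a strongly negative gate logit, each block starts out close to the identity, a common trick for stabilizing small-model training.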
Architecture and Training
The Simple Attention Network design deliberately omits feed-forward network (FFN) layers in the encoder, keeping the parameter count at 26M while retaining cross-attention between encoder and decoder stacks. Key training details from the repository:
- Pretraining: 200B tokens on 16 TPU v6e chips over approximately 27 hours
- Post-training: 2B tokens of single-shot function call data in approximately 45 minutes
- Dataset generation: Synthesized via Gemini; generation scripts are open-sourced alongside the weights
The repository notes that Needle beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function call benchmarks for personal AI, while acknowledging those models have broader conversational scope and capacity.
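A back-of-envelope parameter count is broadly consistent with the quoted 26M. The GQA head split, the tied output head, and the assumption that decoder sublayers are attention-only are illustrative guesses, not details from the repository:

```python
# Rough parameter count from the stated figures: d=512, vocab 8,192,
# 12 encoder layers (no FFN), 8 decoder layers with cross-attention.
d, vocab = 512, 8192
kv_dim = 128                    # assumed GQA key/value width (2 heads of 64)

attn = d * d + 2 * d * kv_dim + d * d   # Q, K, V, O projections
embed = vocab * d                        # token embeddings (output head assumed tied)
encoder = 12 * attn                      # self-attention only, no FFN
decoder = 8 * (attn + attn)              # self-attention + cross-attention
total = embed + encoder + decoder
print(f"{total / 1e6:.1f}M")             # prints 22.5M
```

The count lands a little under 26M, leaving plausible room for norms, residual gates, and any decoder FFNs not modeled here.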
Setup and Workflow
Getting started requires cloning the repository and running the provided setup script:
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
This opens a Gradio web UI at localhost:7860 for interactive testing and one-click finetuning. The CLI exposes commands for inference (needle run), finetuning on custom JSONL data (needle finetune), full training runs, pretraining, evaluation, tokenization, synthetic data generation via Gemini, and TPU management. Weights are auto-downloaded on first use.
Finetuning for Custom Tools
A key design goal is local finetuning accessibility. The playground UI generates synthetic training data via the Gemini API, trains the model, evaluates it, and bundles the result — all from a single command. For CLI-based finetuning, users supply a JSONL file of tool definitions and examples. The repository explicitly targets Mac and PC users for local finetuning, reflecting the model's consumer-device orientation.
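As a minimal sketch, a finetuning record for a personal-AI tool might look like the JSONL line below. Every field name here is hypothetical; consult the repository's finetuning docs for the schema `needle finetune` actually expects:

```python
import json

# Hypothetical JSONL record: a tool definition, a user query, and the
# single-shot function call the model should emit. One JSON object per line.
record = {
    "tools": [{
        "name": "set_alarm",
        "description": "Set an alarm on the device",
        "parameters": {"time": {"type": "string", "description": "HH:MM, 24h"}},
    }],
    "query": "Wake me up at 6:30 tomorrow",
    "call": {"name": "set_alarm", "arguments": {"time": "06:30"}},
}

with open("finetune.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```

Records like this pair each tool schema with a concrete query-to-call example, which is the shape single-shot function-call training data generally takes.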
Current Status
The repository was created in February 2026 and last updated in May 2026, and had 1,270 stars and 54 forks as of the May 2026 update. The project is described as "an experimental run for Simple Attention Networks" and is positioned as a research and production prototype rather than a finished product. The authors caution that small models can be finicky and recommend testing and finetuning on specific tool sets before deployment.
Pricing
Open Source
Fully open-source under MIT License. Free to use, modify, and distribute.
- 26M parameter Needle model weights (open on HuggingFace)
- Full source code on GitHub
- CLI for inference, finetuning, training, and evaluation
- Gradio web UI playground
- Synthetic data generation via Gemini API
Capabilities
Key Features
- 26M parameter Simple Attention Network (SAN) architecture
- Single-shot function/tool calling
- Encoder-decoder with cross-attention, GQA, RoPE, ZCRMSNorm
- 6,000 tok/sec prefill and 1,200 tok/sec decode on the Cactus runtime
- Pretrained on 200B tokens, post-trained on 2B function call tokens
- Fully open weights on HuggingFace
- Local finetuning on Mac/PC
- Gradio web UI playground for testing and finetuning
- CLI for inference, finetuning, training, evaluation, and TPU management
- Synthetic training data generation via Gemini API
- Python API for inference
- MIT licensed open-source codebase
