Needle
A 26M parameter open-source function-call model distilled from Gemini, designed to run on tiny consumer devices like phones, watches, and glasses.
About Needle
Needle is a 26-million-parameter "Simple Attention Network" (SAN) developed by Cactus Compute, distilled from Gemini 3.1 and optimized for single-shot function calling on extremely resource-constrained devices. The model weights are fully open on HuggingFace under the Cactus-Compute/needle repository, and the project is MIT-licensed. According to the repository, in production Needle runs on the Cactus runtime at 6,000 tokens/sec prefill and 1,200 tokens/sec decode.
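Taken at face value, those throughput figures imply a rough per-call latency budget. The prompt and output sizes in this sketch are illustrative assumptions, not measurements from the repository:

```python
# Back-of-envelope latency from the quoted throughput figures
# (6,000 tok/s prefill, 1,200 tok/s decode on the Cactus runtime).
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200

def call_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds to prefill the prompt plus decode the function call."""
    return prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S

# e.g. a 512-token tool schema + user query, and a ~60-token call
# (hypothetical sizes chosen for illustration):
print(f"{call_latency_s(512, 60):.3f} s")  # ≈ 0.135 s
```

At these rates even a fairly large tool schema stays comfortably under interactive latency on-device, which is the point of the design.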
What It Is
Needle is a tiny language model purpose-built for tool/function calling on consumer hardware — phones, smartwatches, AR glasses, and similar edge devices. Rather than being a general-purpose conversational model, it is specifically post-trained on a 2-billion-token single-shot function call dataset to excel at structured output generation for agentic pipelines. The architecture uses a 12-layer encoder with grouped-query attention (GQA) and RoPE, cross-attending into an 8-layer decoder, with zero-centered RMSNorm (ZCRMSNorm) and gated residuals throughout, and a BPE vocabulary of 8,192 tokens at d=512.
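The two less common ingredients, ZCRMSNorm and gated residuals, can be sketched as follows. This assumes the widespread zero-centered convention (the learnable gain is stored as g, applied as 1 + g, and initialized to zero) and a sigmoid residual gate; the repository may define either one differently:

```python
import numpy as np

def zc_rms_norm(x: np.ndarray, g: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Zero-centered RMSNorm sketch: normalize by RMS, scale by (1 + g).

    With g initialized to zero this starts as plain RMS normalization,
    which is the usual motivation for the zero-centered parameterization.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * (1.0 + g)

def gated_residual(x: np.ndarray, sublayer_out: np.ndarray,
                   gate_logit: float) -> np.ndarray:
    """Mix a sublayer's output into the residual stream via a learned gate.

    A scalar sigmoid gate is assumed here purely for illustration.
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid
    return x + gate * sublayer_out
```

With g = 0 and a strongly negative gate logit, each block starts out close to the identity, a common trick for stabilizing small-model training.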
Architecture and Training
The Simple Attention Network design deliberately omits feed-forward network (FFN) layers in the encoder, keeping the parameter count at 26M while retaining cross-attention between encoder and decoder stacks. Key training details from the repository:
- Pretraining: 200B tokens on 16 TPU v6e chips over approximately 27 hours
- Post-training: 2B tokens of single-shot function call data in approximately 45 minutes
- Dataset generation: Synthesized via Gemini; generation scripts are open-sourced alongside the weights
The repository notes that Needle beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function call benchmarks for personal AI, while acknowledging those models have broader conversational scope and capacity.
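A back-of-envelope parameter count is broadly consistent with the quoted 26M. The GQA head split, the tied output head, and the assumption that decoder sublayers are attention-only are illustrative guesses, not details from the repository:

```python
# Rough parameter count from the stated figures: d=512, vocab 8,192,
# 12 encoder layers (no FFN), 8 decoder layers with cross-attention.
d, vocab = 512, 8192
kv_dim = 128                    # assumed GQA key/value width (2 heads of 64)

attn = d * d + 2 * d * kv_dim + d * d   # Q, K, V, O projections
embed = vocab * d                        # token embeddings (output head assumed tied)
encoder = 12 * attn                      # self-attention only, no FFN
decoder = 8 * (attn + attn)              # self-attention + cross-attention
total = embed + encoder + decoder
print(f"{total / 1e6:.1f}M")             # prints 22.5M
```

The count lands a little under 26M, leaving plausible room for norms, residual gates, and any decoder FFNs not modeled here.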
Setup and Workflow
Getting started requires cloning the repository and running the provided setup script:
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
This opens a Gradio web UI at localhost:7860 for interactive testing and one-click finetuning. The CLI exposes commands for inference (needle run), finetuning on custom JSONL data (needle finetune), full training runs, pretraining, evaluation, tokenization, synthetic data generation via Gemini, and TPU management. Weights are auto-downloaded on first use.
Finetuning for Custom Tools
A key design goal is local finetuning accessibility. The playground UI generates synthetic training data via the Gemini API, trains the model, evaluates it, and bundles the result — all from a single command. For CLI-based finetuning, users supply a JSONL file of tool definitions and examples. The repository explicitly targets Mac and PC users for local finetuning, reflecting the model's consumer-device orientation.
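As a minimal sketch, a finetuning record for a personal-AI tool might look like the JSONL line below. Every field name here is hypothetical; consult the repository's finetuning docs for the schema `needle finetune` actually expects:

```python
import json

# Hypothetical JSONL record: a tool definition, a user query, and the
# single-shot function call the model should emit. One JSON object per line.
record = {
    "tools": [{
        "name": "set_alarm",
        "description": "Set an alarm on the device",
        "parameters": {"time": {"type": "string", "description": "HH:MM, 24h"}},
    }],
    "query": "Wake me up at 6:30 tomorrow",
    "call": {"name": "set_alarm", "arguments": {"time": "06:30"}},
}

with open("finetune.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```

Records like this pair each tool schema with a concrete query-to-call example, which is the shape single-shot function-call training data generally takes.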
Current Status
The repository was created in February 2026 and last updated in May 2026, and had 1,270 stars and 54 forks as of the May 2026 update. The project is described as "an experimental run for Simple Attention Networks" and is positioned as a research and production prototype rather than a finished product. The authors caution that small models can be finicky and recommend testing and finetuning on specific tool sets before deployment.
Pricing
Open Source
Fully open-source under MIT License. Free to use, modify, and distribute.
- 26M parameter Needle model weights (open on HuggingFace)
- Full source code on GitHub
- CLI for inference, finetuning, training, and evaluation
- Gradio web UI playground
- Synthetic data generation via Gemini API
Capabilities
Key Features
- 26M parameter Simple Attention Network (SAN) architecture
- Single-shot function/tool calling
- Encoder-decoder with cross-attention, GQA, RoPE, ZCRMSNorm
- 6,000 tok/sec prefill and 1,200 tok/sec decode on the Cactus runtime
- Pretrained on 200B tokens, post-trained on 2B function call tokens
- Fully open weights on HuggingFace
- Local finetuning on Mac/PC
- Gradio web UI playground for testing and finetuning
- CLI for inference, finetuning, training, evaluation, and TPU management
- Synthetic training data generation via Gemini API
- Python API for inference
- MIT licensed open-source codebase
