
    whichllm

    Local Inference

A CLI tool that auto-detects your GPU, CPU, and RAM, then ranks the local LLMs from HuggingFace that actually fit and perform well on your hardware.


    At a Glance

    Pricing
    Open Source

    Fully free and open-source under the MIT License. Install via pip, uv, or Homebrew.


    Available On

    macOS
    Linux
    API
    CLI

    Resources

Website · Docs · GitHub · llms.txt

    Topics

Local Inference · Command Line Assistants · Model Management

    Alternatives

apfel · RamaLama · Axolotl
Developer
Andyyyy64

    Listed May 2026

    About whichllm

    whichllm is an open-source command-line tool that helps users find the best local large language model for their specific hardware. Built in Python and published on PyPI under the MIT license, it auto-detects NVIDIA, AMD, Apple Silicon, and CPU-only configurations, then ranks models from HuggingFace using real benchmark data rather than parameter count alone. The project reached v0.5.2 as of May 2026 and has accumulated over 500 GitHub stars since its March 2026 creation.

    What It Is

    whichllm sits in the local-inference tooling category: it answers the question "which model should I actually run?" rather than just "which model fits in my VRAM?" It fetches live model data from the HuggingFace API, merges scores from multiple benchmark sources (LiveBench, Artificial Analysis, Aider, Chatbot Arena ELO, Open LLM Leaderboard, and a multimodal/vision index), and produces a ranked list with estimated token-per-second speeds. The result is a single terminal command that outputs a ranked table or JSON for scripting.
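
Conceptually, the merge step is a confidence-weighted average over whatever benchmark scores exist for a model. The sketch below is a simplified illustration: the source names come from this page, but the confidence weights and the merge_scores helper are assumptions, not whichllm's actual code.

```python
# Illustrative sketch of confidence-weighted benchmark merging.
# The per-source weights below are hypothetical, not whichllm's real values.
SOURCE_CONFIDENCE = {
    "livebench": 1.0,
    "artificial_analysis": 0.9,
    "aider": 0.9,
    "arena_elo": 0.8,
    "open_llm_leaderboard": 0.7,
}

def merge_scores(scores: dict[str, float]) -> float:
    """Weighted average of per-source scores, each normalized to 0-100."""
    pairs = [(score, SOURCE_CONFIDENCE[src])
             for src, score in scores.items() if src in SOURCE_CONFIDENCE]
    if not pairs:
        return 0.0
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)
```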

    How the Ranking Engine Works

    The scoring system assigns each model a 0–100 score built from several weighted factors:

    • Benchmark quality — merged from LiveBench, Artificial Analysis, Aider, vision benchmarks, Arena ELO, and Open LLM Leaderboard, weighted by source confidence
    • Model size — log₂-scaled as a world-knowledge proxy; MoE models use total params for quality but active params for speed
    • Quantization penalty — lower-bit quants are discounted multiplicatively
    • Evidence confidence — scores tagged direct, variant, base, interpolated, or self-reported and discounted accordingly (×0.55 for self-reported, ×1.0 for direct)
    • Runtime fit — full GPU, partial offload (×0.72), or CPU-only (×0.50)
    • Speed gate — ±8 points based on usability relative to a fit-dependent tok/s floor
    • Source trust — official-org bonus, known-repackager penalty
    • Popularity — downloads/likes as a tie-breaker, weight shrinks as evidence strengthens

    Inheritance is rejected when a model's parameter count diverges more than 2× from its family's dominant member, preventing small forks from borrowing a large base model's benchmark score.
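
To make the interaction of these factors concrete, here is a minimal sketch of how the documented multipliers could compose into a final score. The constants (×1.0/×0.55 evidence grades, ×0.72/×0.50 runtime-fit penalties, the ±8 speed gate, the 2× divergence cutoff) come from this page; the function shapes, the clamping, and treating the speed gate as a hard ±8 step are assumptions.

```python
# Sketch only: composes the documented factors; whichllm's real engine also
# folds in model size, quantization, source trust, and popularity terms.
EVIDENCE = {"direct": 1.0, "self_reported": 0.55}    # grades from this page
RUNTIME_FIT = {"full_gpu": 1.0, "partial_offload": 0.72, "cpu_only": 0.50}

def composite_score(quality: float, evidence: str, fit: str,
                    tok_s: float, tok_s_floor: float) -> float:
    """quality: merged benchmark score on a 0-100 scale."""
    score = quality * EVIDENCE[evidence] * RUNTIME_FIT[fit]
    score += 8.0 if tok_s >= tok_s_floor else -8.0   # speed gate
    return max(0.0, min(100.0, score))

def may_inherit(model_params: float, family_params: float) -> bool:
    """Allow score inheritance only within 2x of the family's dominant member."""
    return 0.5 <= model_params / family_params <= 2.0
```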

    Key Commands and Workflow

    The tool is designed around a single-command workflow with optional flags for deeper control:

    • whichllm — auto-detect hardware and show ranked models
    • whichllm --gpu "RTX 4090" — simulate any GPU before purchasing
    • whichllm run — download and start an interactive chat with the best model, using uv for isolated environment setup
    • whichllm snippet "qwen 7b" — print a copy-paste Python code snippet for any model
    • whichllm plan "llama 3 70b" — reverse lookup: what GPU do I need?
    • whichllm hardware — display detected hardware info only
    • --json flag — pipe-friendly JSON output for scripting with jq

    Supported model formats include GGUF (via llama-cpp-python), AWQ/GPTQ (via transformers + autoawq/auto-gptq), and FP16/BF16 (via transformers).
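
Because --json emits machine-readable output, the ranking is scriptable from Python as well as jq. A minimal sketch, assuming whichllm is on PATH; the JSON schema isn't documented on this page, so the field names below are hypothetical and should be checked against real output first.

```python
import json
import subprocess

# Run whichllm in JSON mode and parse the result (requires whichllm on PATH).
proc = subprocess.run(["whichllm", "--json"],
                      capture_output=True, text=True, check=True)
ranking = json.loads(proc.stdout)

# "name" and "score" are hypothetical field names; inspect the actual JSON
# once (e.g. whichllm --json | jq .) to confirm the schema.
for entry in ranking[:5]:
    print(entry["name"], entry["score"])
```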

    Architecture and Data Pipeline

    The project is structured into four main layers: CLI (cli.py via Typer), hardware detection (hardware/), model fetching and benchmarking (models/), and the ranking engine (engine/). Hardware detection covers NVIDIA via nvidia-ml-py, AMD via dbgpu/ROCm, Apple Silicon via Metal, and CPU/RAM/disk via standard system calls. Model data is cached at ~/.cache/whichllm/ with a 6-hour TTL for model lists and 24-hour TTL for benchmark data, with curated frozen fallbacks for offline or rate-limited use. VRAM estimation accounts for weights, GQA KV cache, activations, and framework overhead (~500 MB).
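
The components of the VRAM estimate are listed above without formulas, so here is the standard arithmetic as a sketch: the GQA KV-cache term is the usual 2 x layers x kv-heads x head-dim x context calculation, while the activation term is a placeholder assumption rather than whichllm's actual estimator.

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float, n_layers: int,
                     n_kv_heads: int, head_dim: int, seq_len: int,
                     kv_bytes: float = 2.0) -> float:
    """Rough VRAM estimate in GiB for a dense transformer.

    bytes_per_param reflects quantization (e.g. ~0.55 for a 4-bit GGUF,
    2.0 for FP16/BF16).
    """
    weights = params_b * 1e9 * bytes_per_param
    # GQA KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * bytes.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * kv_bytes
    activations = 0.05 * weights        # placeholder assumption
    overhead = 500 * 1024**2            # ~500 MB framework overhead (from this page)
    return (weights + kv_cache + activations + overhead) / 1024**3

# Example: an 8B model at FP16 with a Llama-3-8B-like config and 8k context.
print(f"{estimate_vram_gb(8, 2.0, 32, 8, 128, 8192):.1f} GiB")
```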

    Update: v0.5.2

    The latest release, v0.5.2, was published on May 15, 2026, with the repository last pushed the same day. The project was created in March 2026 and has moved quickly through five minor versions. The GitHub repository lists Python 3.11+ as the minimum requirement and supports installation via uvx, Homebrew, or pip. Active development is signaled by 8 open issues and ongoing benchmark source integration work.



    Pricing

Open Source (MIT)

    • Auto hardware detection
    • Benchmark-aware LLM ranking
    • GPU simulation
    • whichllm run for instant model chat
    • whichllm snippet for Python code generation

    Capabilities

    Key Features

    • Auto-detect NVIDIA, AMD, Apple Silicon, and CPU-only hardware
    • Benchmark-aware ranking using LiveBench, Artificial Analysis, Aider, Chatbot Arena ELO, and Open LLM Leaderboard
    • GPU simulation with --gpu flag for pre-purchase planning
    • One-command model download and interactive chat via whichllm run
    • Copy-paste Python code snippet generation via whichllm snippet
    • Reverse hardware lookup via whichllm plan
    • JSON output for scripting and pipelines
    • Task profiles: general, coding, vision, math
    • Live HuggingFace API data with local cache (6h/24h TTL)
    • Supports GGUF, AWQ, GPTQ, FP16, BF16 model formats
    • Evidence-graded scoring with confidence dampening
    • Recency-aware benchmark demotion to prevent stale leaderboard bias
    • Offline fallback with curated frozen benchmark data
    • Ollama integration via JSON pipe

    Integrations

    HuggingFace API
    Ollama
    llama-cpp-python
    transformers
    autoawq
    auto-gptq
    nvidia-ml-py
    uv
    Homebrew
    PyPI


    Developer

    Andyyyy64

    Andyyyy64 builds open-source developer tooling focused on local AI inference. The whichllm project auto-detects hardware and ranks local LLMs using real benchmark data, making it easier for developers to choose and run models without manual research. The project is written in Python and distributed via PyPI, Homebrew, and uv.

Website · GitHub
    1 tool in directory

    Similar Tools


    apfel

    A free, open-source CLI tool that unlocks Apple's on-device LLM on macOS 26+ as a terminal command, OpenAI-compatible HTTP server, and interactive chat.


    RamaLama

    An open-source CLI tool that simplifies running and serving AI models locally using OCI containers, with automatic GPU detection and multi-registry support.


    Axolotl

    Open-source tool for fine-tuning LLMs faster and at scale, supporting multi-GPU training, LoRA, FSDP, and a wide range of model architectures.


    Related Topics

    Local Inference

    Tools and platforms for running AI inference locally without cloud dependence.

    101 tools

    Command Line Assistants

    AI-powered command-line assistants that help developers navigate, search, and execute terminal commands with intelligent suggestions and context awareness.

    128 tools

    Model Management

    Tools for managing, versioning, and deploying AI models.

    35 tools