# RamaLama

> An open-source CLI tool that simplifies running and serving AI models locally using OCI containers, with automatic GPU detection and multi-registry support.

RamaLama is an open-source tool that simplifies running and serving AI models locally for inference, from any model source, using the familiar approach of OCI containers. It eliminates manual host configuration by automatically detecting GPUs and pulling the appropriate accelerated container image. Engineers can apply container-centric development patterns to AI models, treating them much as Podman and Docker treat container images.
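A minimal sketch of that workflow, assuming `ramalama` is already installed and that `granite` resolves via the default shortnames.conf:

```shell
# Start an interactive chatbot in a rootless container; on first run,
# RamaLama detects available GPU support and pulls a matching
# accelerated OCI image before launching the model.
ramalama run granite

# Alternatively, expose the model as a REST API endpoint instead of a
# chat prompt (see `ramalama serve --help` for port and web-UI options).
ramalama serve granite
```

Both commands run the model inside an isolated container with the security defaults described below (no network, read-only mounts, auto-cleanup).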

- **Automatic GPU Detection** – *On first run, RamaLama inspects your system for GPU support (NVIDIA CUDA, AMD ROCm, Intel ARC, Apple Silicon, Ascend NPU, Moore Threads) and pulls the correct accelerated OCI image automatically.*
- **Multi-Registry Transport Support** – *Pull models from HuggingFace, Ollama, ModelScope, OCI Container Registries (quay.io, Docker Hub), and the RamaLama Labs Container Registry using simple URI prefixes.*
- **Secure Rootless Containers** – *AI models run in rootless containers with read-only volume mounts, no network access (`--network=none`), auto-cleanup (`--rm`), dropped Linux capabilities, and no new privileges.*
- **Chatbot and REST API Serving** – *Use `ramalama run` to start an interactive chatbot or `ramalama serve` to expose a REST API endpoint with an optional web UI on a configurable port.*
- **RAG Support** – *Generate Retrieval Augmented Generation vector databases from PDF, DOCX, PPTX, XLSX, HTML, AsciiDoc, and Markdown files and package them as OCI images for use with `ramalama run --rag`.*
- **Model Conversion** – *Convert models between formats (e.g., Ollama to OCI, Safetensors to GGUF) using `ramalama convert` with optional quantization.*
- **Shortname Aliases** – *Use short, memorable names like `granite`, `mistral`, or `tiny` instead of full registry URIs via configurable shortnames.conf files.*
- **Multiple Inference Runtimes** – *Supports llama.cpp, vLLM, and MLX (Apple Silicon only) runtimes, selectable via `--runtime` flag.*
- **Cross-Platform Installation** – *Install via PyPI (`pip install ramalama`), DNF on Fedora, a macOS `.pkg` installer, or a one-line curl script on Linux and macOS.*
- **Benchmarking and Perplexity** – *Evaluate model performance with `ramalama bench` and measure prediction quality with `ramalama perplexity`.*
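A few of the features above in command form. The transport prefixes and model names here are illustrative placeholders; consult `ramalama --help` and the project docs for the exact syntax supported by your version:

```shell
# Pull models from different sources using URI prefixes.
ramalama pull ollama://tinyllama
ramalama pull huggingface://ibm-granite/granite-3b-code-base-GGUF

# Convert a model into an OCI image so it can be pushed to a
# container registry alongside your application images.
ramalama convert ollama://tinyllama oci://quay.io/myorg/tinyllama:latest

# Evaluate performance and prediction quality.
ramalama bench tiny
ramalama perplexity tiny
```

The registry path under `quay.io/myorg/` is a hypothetical example; substitute a repository you can push to.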

## Features
- Automatic GPU detection and accelerated container image selection
- Multi-registry model transport (HuggingFace, Ollama, ModelScope, OCI)
- Rootless container isolation with no network access and auto-cleanup
- Interactive chatbot mode via ramalama run
- REST API serving via ramalama serve with optional web UI
- RAG (Retrieval Augmented Generation) data generation and OCI packaging
- Model conversion between formats (Ollama to OCI, Safetensors to GGUF)
- Shortname aliases for common models
- Support for llama.cpp, vLLM, and MLX inference runtimes
- Model benchmarking and perplexity calculation
- Push/pull models to/from remote registries
- Cross-platform: Linux, macOS, Windows (via Podman or Docker on WSL2)

## Integrations
Podman, Docker, HuggingFace, Ollama, ModelScope, quay.io, Docker Hub, llama.cpp, vLLM, MLX, Pulp, Artifactory, OCI Container Registries

## Platforms
Windows, macOS, Linux, CLI, API

## Pricing
Open Source

## Version
v0.19.0

## Links
- Website: https://ramalama.ai
- Documentation: https://github.com/containers/ramalama/tree/main/docs
- Repository: https://github.com/containers/ramalama
- EveryDev.ai: https://www.everydev.ai/tools/ramalama
