scikit-learn
An open-source Python library providing simple and efficient tools for predictive data analysis, including classification, regression, clustering, and more.
About scikit-learn
scikit-learn is a free, open-source machine learning library for Python, released under the BSD 3-Clause license. Built on top of NumPy, SciPy, and matplotlib, it provides a consistent, accessible API for a wide range of supervised and unsupervised learning tasks. The project is hosted on GitHub with over 66,000 stars and is actively maintained by a community of contributors with financial support from organizations including Probabl, INRIA, Microsoft, NVIDIA, and others.
What It Is
scikit-learn is a Python-based machine learning toolkit that covers the full predictive modeling workflow — from data preprocessing and feature extraction through model training, evaluation, and selection. It is designed to be accessible to practitioners at all levels while remaining flexible enough for advanced research and production use. The library is distributed under the BSD 3-Clause license, making it free to use, modify, and redistribute in commercial and non-commercial contexts.
Core Capabilities
scikit-learn organizes its functionality into six major areas:
- Classification — identifying which category an object belongs to, with algorithms including gradient boosting, nearest neighbors, random forest, and logistic regression; applications include spam detection and image recognition.
- Regression — predicting continuous-valued attributes using gradient boosting, nearest neighbors, random forest, ridge regression, and more; applications include drug response modeling and stock price prediction.
- Clustering — automatic grouping of similar objects using k-Means, HDBSCAN, hierarchical clustering, and others; applications include customer segmentation.
- Dimensionality reduction — reducing the number of variables via PCA, feature selection, and non-negative matrix factorization.
- Model selection — comparing, validating, and tuning models through grid search, cross-validation, and evaluation metrics.
- Preprocessing — feature extraction and normalization for transforming raw input data (including text) into formats suitable for ML algorithms.
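As a rough illustration of how several of these areas fit together, the sketch below runs preprocessing, dimensionality reduction, clustering, classification, and model selection on the small iris dataset bundled with the library. The dataset, the parameter grid, and the variable names are choices made for this example only; regression is not shown but follows the same pattern with a regressor in place of the classifier.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Small bundled dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Preprocessing: standardize features to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: project the 4 features onto 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Clustering: group samples into 3 clusters without using the labels.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Classification + model selection: tune a random forest with a
# cross-validated grid search, then score it on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, None]},  # illustrative grid
    cv=5,
)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)

print("PCA output shape:", X_2d.shape)
print("clusters found:", len(set(cluster_labels)))
print("best parameters:", search.best_params_)
print("held-out accuracy:", round(test_accuracy, 3))
```

The grid here is deliberately tiny so the example runs in a few seconds; a real search would usually cover more values and more parameters.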
Architecture and Dependencies
scikit-learn is built on the scientific Python stack: NumPy for array operations and SciPy for numerical routines, with matplotlib used by its plotting utilities. This tight integration means it works naturally within the broader Python data science ecosystem, including pandas for data manipulation and Jupyter notebooks for interactive analysis. The library exposes a consistent estimator API: every estimator implements fit, predictors add predict, and transformers add transform. That shared interface makes it straightforward to compose pipelines and swap algorithms.
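A minimal sketch of what composing a pipeline and swapping algorithms can look like in practice is shown here. The step names "scale" and "model", the iris dataset, and the two classifiers are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# A pipeline chains transformers (fit/transform) with a final estimator
# (fit/predict) and behaves like a single estimator itself.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Cross-validation refits the whole pipeline on each training fold,
# so the scaler never sees the held-out fold.
scores_lr = cross_val_score(pipe, X, y, cv=5)

# Swap the final step for a different algorithm; the rest is unchanged.
pipe.set_params(model=GradientBoostingClassifier(random_state=0))
scores_gb = cross_val_score(pipe, X, y, cv=5)

print(f"logistic regression: {scores_lr.mean():.3f}")
print(f"gradient boosting:   {scores_gb.mean():.3f}")
```

Because the pipeline is itself an estimator, it can be passed to the same cross-validation and grid search tools as a single model.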
Update: Release 1.8.0
The current stable release is 1.8.0, published in December 2025. Recent history shows the cadence: minor versions arrive roughly every six months (1.7.0 in June 2025, 1.8.0 in December 2025), with patch releases in between (1.7.1 in July 2025, 1.7.2 in September 2025). Development on version 1.9 is ongoing, with a release candidate (1.9.0rc1) already available. The changelog and release highlights are published alongside each release on the official documentation site.
Community and Governance
scikit-learn operates as a community-driven open-source project with a published governance model and roadmap. The project maintains active channels on Discord, GitHub Discussions, Stack Overflow, a mailing list, and social platforms including Bluesky, Mastodon, LinkedIn, YouTube, Facebook, Instagram, and TikTok. The homepage features testimonials from organizations such as INRIA and Spotify; these are endorsements curated by the project itself rather than independent reviews. Development and maintenance are financially supported by the sponsors listed on the project's about page.
Pricing
Open Source
Completely free and open-source under the BSD 3-Clause license. Free to use, modify, and distribute.
- All classification, regression, clustering, and dimensionality reduction algorithms
- Model selection and evaluation tools
- Preprocessing and feature extraction
- Full source code access
- BSD 3-Clause license for commercial use
Capabilities
Key Features
- Classification algorithms (gradient boosting, random forest, logistic regression, nearest neighbors)
- Regression algorithms (ridge, gradient boosting, random forest)
- Clustering (k-Means, HDBSCAN, hierarchical clustering)
- Dimensionality reduction (PCA, feature selection, NMF)
- Model selection (grid search, cross-validation, evaluation metrics)
- Preprocessing and feature extraction
- Consistent estimator API (fit/predict/transform)
- Pipeline composition
- BSD 3-Clause open-source license
