Friday, May 22, 2026

Python Library Categorisation from a DS and ML Perspective

 Python libraries in Data Science (DS) and Machine Learning (ML) are categorized by their specific role in the end-to-end model pipeline: data ingestion, manipulation, visualization, algorithm training, and production deployment. 


1. Data Processing & Manipulation

These libraries handle the heavy lifting of data cleaning, restructuring, and numerical operations.
  • NumPy: The foundation of scientific computing. Provides support for large, multi-dimensional arrays and high-level mathematical functions.
  • Pandas: Essential for data wrangling. Offers DataFrame structures to easily manipulate tabular data, handle missing values, and merge datasets.
  • SciPy: Built on NumPy, it provides modules for optimization, integration, linear algebra, and statistics. 
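
As a rough illustration of how these three libraries divide the work, the sketch below uses NumPy for vectorized array math, Pandas for filling a missing value and merging two tables, and SciPy for a one-dimensional optimization. The table contents and column names (store_id, revenue, city) are made up for the example.

```python
import numpy as np
import pandas as pd
from scipy import optimize

# NumPy: vectorized arithmetic on a 2-D array (no explicit Python loops).
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
column_means = matrix.mean(axis=0)  # mean of each column

# Pandas: two small tables with hypothetical data, one value missing.
sales = pd.DataFrame({"store_id": [1, 2, 3], "revenue": [100.0, np.nan, 300.0]})
stores = pd.DataFrame({"store_id": [1, 2, 3], "city": ["Pune", "Delhi", "Mumbai"]})

# Fill the missing revenue with the column mean, then merge on the shared key.
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())
merged = sales.merge(stores, on="store_id", how="left")

# SciPy: find the minimum of a simple quadratic, f(x) = (x - 3)^2.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

print(column_means)
print(merged)
print(result.x)
```

Mean imputation is used here only because it is short; whether it is appropriate depends on the dataset.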

2. Exploratory Data Analysis (EDA) & Visualization

These tools help uncover data distributions and correlations, and tell a story with data.
  • Matplotlib: The foundational plotting library for static, animated, and interactive visualizations.
  • Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive and informative statistical graphics.
  • Plotly: Ideal for interactive and publication-ready graphs that can be embedded in web applications. 
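
A minimal Matplotlib sketch of a typical first EDA step, assuming synthetic, normally distributed data in place of a real dataset: a histogram to inspect the distribution and a scatter plot to look for a relationship between consecutive values. The "Agg" backend is selected so the script also runs on machines without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data standing in for a real numeric column.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=0.0, scale=1.0, size=500)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: distribution of the values.
ax_hist.hist(values, bins=30)
ax_hist.set_title("Distribution")

# Right panel: each value plotted against the next one.
ax_scatter.scatter(values[:-1], values[1:], s=5)
ax_scatter.set_title("Lag-1 scatter")

fig.tight_layout()

# Render to an in-memory PNG; fig.savefig("eda.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

Seaborn and Plotly offer higher-level equivalents of these two plots; the Matplotlib version is shown because the other two build on or complement it.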

3. Traditional Machine Learning

Libraries focused on classical statistical learning, classification, regression, and clustering.
  • Scikit-Learn: The gold standard for classical ML. Provides implementations of SVMs, Random Forests, K-Means, and dimensionality reduction (PCA), along with preprocessing and model-evaluation utilities.
  • XGBoost: Highly optimized and scalable library designed for gradient-boosted decision trees, heavily utilized for tabular data competitions.
  • LightGBM: A fast, distributed gradient boosting framework by Microsoft, known for its high performance and low memory usage. 
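
The usual classical-ML workflow (split, fit, predict, score) can be sketched with Scikit-Learn as follows. The Iris dataset and the hyperparameters are chosen only to keep the example short, not as a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes.
features, labels = load_iris(return_X_y=True)

# Hold out 25% of the rows for evaluation; stratify to keep class balance.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit a Random Forest on the training split.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score on the held-out split.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

XGBoost and LightGBM both ship Scikit-Learn-compatible estimators (XGBClassifier and LGBMClassifier), so in a sketch like this one they can generally be swapped in for RandomForestClassifier with the same fit/predict calls.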

4. Deep Learning & AI

Frameworks tailored for building, training, and deploying neural networks on GPUs/TPUs.
  • PyTorch: Originally developed by Meta and now governed by the PyTorch Foundation, widely preferred in AI research and production for its dynamic computation graph and intuitive Pythonic feel.
  • TensorFlow: Developed by Google, a comprehensive ecosystem for scaling deep learning models from research to production.
  • Keras: A high-level neural-network API that ships with TensorFlow as tf.keras and, since Keras 3, can also run on JAX and PyTorch backends, allowing fast prototyping of neural networks.
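
To make the "dynamic computation graph" idea concrete, here is a minimal PyTorch training loop, assuming a toy problem (fitting the line y = 2x + 1) rather than a real neural network. The graph is rebuilt on every forward pass, and loss.backward() computes gradients through it.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the random weight initialization repeatable

# Toy data: 64 points on the line y = 2x + 1.
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = 2.0 * x + 1.0

# A single linear layer is enough to represent a straight line.
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(x), y).item()

for _ in range(200):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(x), y)  # forward pass builds the graph
    loss.backward()              # backward pass computes gradients
    optimizer.step()             # gradient-descent update

final_loss = loss_fn(model(x), y).item()
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

The same loop structure carries over to larger models; moving to a GPU would additionally require placing the model and tensors on the device, which is omitted here.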

5. Specialized Libraries

Libraries built to tackle domain-specific DS/ML tasks.
  • Hugging Face Transformers: The industry standard for pretrained transformer models in Natural Language Processing (NLP) and Large Language Models (LLMs), with support for state-of-the-art vision and audio models as well.
  • OpenCV: The premier library for Computer Vision, used for image processing and video analytics.
  • SciPy (scipy.stats): The statistics submodule of SciPy, listed separately here for its probability distributions, statistical tests, and frequency analysis.
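
As a small example of the statistical tooling in SciPy, the sketch below runs a two-sample t-test and evaluates a distribution function. The two groups are synthetic samples generated for the example, not real measurements.

```python
import numpy as np
from scipy import stats

# Two synthetic groups whose true means differ by 1.0.
rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.0, size=200)

# Statistical test: independent two-sample t-test on the group means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Probability distribution: P(X <= 0) for a standard normal variable.
prob_below_zero = stats.norm.cdf(0.0)

print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print(f"P(X <= 0) = {prob_below_zero}")
```

A small p-value here only reflects how the synthetic data was generated; with real data, the test's assumptions (independence, roughly normal groups) should be checked first.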

6. MLOps & Deployment

Libraries to track experiments, package models, and deploy them in production.
  • MLflow: Manages the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
  • Streamlit: Turns data scripts into shareable web apps in minutes, perfect for creating quick ML user interfaces.
  • BentoML: A unified model serving framework to package and deploy ML models into scalable endpoints. 
