Friday, May 22, 2026

Python Library Categorisation from a DS and ML Perspective

 Python libraries in Data Science (DS) and Machine Learning (ML) are categorized by their specific role in the end-to-end model pipeline: data ingestion, manipulation, visualization, algorithm training, and production deployment. 


1. Data Processing & Manipulation

These libraries handle the heavy lifting of data cleaning, restructuring, and numerical operations.
  • NumPy: The foundation of scientific computing. Provides support for large, multi-dimensional arrays and high-level mathematical functions.
  • Pandas: Essential for data wrangling. Offers DataFrame structures to easily manipulate tabular data, handle missing values, and merge datasets.
  • SciPy: Built on NumPy, it provides modules for optimization, integration, linear algebra, and statistics. 
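
As a rough illustration of how these three libraries fit together, the sketch below fills a missing value with Pandas, does vectorised arithmetic with NumPy, and fits a straight line with SciPy. The column names, the toy values, and the helper function line are made up for this example.

```python
import numpy as np
import pandas as pd
from scipy import optimize

# Hypothetical toy data with one missing height value.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, np.nan, 180.0],
    "weight_kg": [50.0, 58.0, 66.0, 74.0],
})

# Pandas: fill the missing value with the column mean.
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# NumPy: vectorised arithmetic on the underlying arrays.
heights_m = df["height_cm"].to_numpy() / 100.0
bmi = df["weight_kg"].to_numpy() / heights_m ** 2


def line(x, a, b):
    """Straight-line model used for the least-squares fit."""
    return a * x + b


# SciPy: least-squares fit of weight = a * height + b.
params, _ = optimize.curve_fit(
    line, df["height_cm"].to_numpy(), df["weight_kg"].to_numpy()
)
slope, intercept = params
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

In practice the cleaning step usually stays in Pandas, and arrays are handed to NumPy or SciPy only when a numerical routine needs them.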

2. Exploratory Data Analysis (EDA) & Visualization

These tools help uncover data distributions and correlations, and help tell a story with data.
  • Matplotlib: The foundational plotting library for static, animated, and interactive visualizations.
  • Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive and informative statistical graphics.
  • Plotly: Ideal for interactive and publication-ready graphs that can be embedded in web applications. 
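
A minimal EDA sketch using Matplotlib alone might look like the following. The data is randomly generated for illustration, and the Agg backend is chosen only so the script runs without a display; Seaborn and Plotly offer higher-level equivalents of the same two plots.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: one normally distributed variable and one noisy linear pair.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=0.0, scale=1.0, size=500)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + rng.normal(scale=2.0, size=200)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: shows the distribution of a single variable.
ax_hist.hist(values, bins=30)
ax_hist.set_title("Distribution")
ax_hist.set_xlabel("value")
ax_hist.set_ylabel("count")

# Scatter plot: shows the relationship between two variables.
ax_scatter.scatter(x, y, s=8)
ax_scatter.set_title("Correlation")
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")

fig.tight_layout()
# fig.savefig("eda.png")  # uncomment to write the figure to disk
```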

3. Traditional Machine Learning

Libraries focused on classical statistical learning, classification, regression, and clustering.
  • Scikit-Learn: The gold standard for classical ML. Contains algorithms for SVMs, Random Forests, K-Means, dimensionality reduction (PCA), and preprocessing.
  • XGBoost: Highly optimized and scalable library designed for gradient-boosted decision trees, heavily utilized for tabular data competitions.
  • LightGBM: A fast, distributed gradient boosting framework by Microsoft, known for its high performance and low memory usage. 
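
To make the classical workflow concrete, here is a small Scikit-Learn sketch that scales features, trains a Random Forest, and scores it on held-out data. The choice of the built-in Iris dataset, the split ratio, and the hyperparameters are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset (150 samples, 4 features, 3 classes).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the data for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing and the model so both are fitted on training data only.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

XGBoost and LightGBM both ship Scikit-Learn-compatible estimators (XGBClassifier and LGBMClassifier), so they could be swapped into the same pipeline in place of the Random Forest.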

4. Deep Learning & AI

Frameworks tailored for building, training, and deploying neural networks on GPUs/TPUs.
  • PyTorch: Originally developed by Meta and now governed by the PyTorch Foundation, widely preferred in AI research and production for its dynamic computation graph and intuitive Pythonic feel.
  • TensorFlow: Developed by Google, a comprehensive ecosystem for scaling deep learning models from research to production.
  • Keras: A high-level neural network API for fast prototyping. It ships with TensorFlow as tf.keras and, since Keras 3, can also run on JAX or PyTorch backends.
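
The training loop that all of these frameworks share (forward pass, loss, backward pass, parameter update) can be sketched in PyTorch as below. The toy regression data, learning rate, and step count are assumptions for illustration; a real model would use a deeper network and mini-batches.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy regression data: y = 3x + 1 plus a little noise.
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = 3.0 * x + 1.0 + 0.05 * torch.randn_like(x)

# A single linear layer is the smallest possible "network".
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

losses = []
for _ in range(200):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)    # forward pass and loss
    loss.backward()                # backward pass (autograd)
    optimizer.step()               # parameter update
    losses.append(loss.item())

print(f"weight={model.weight.item():.2f}, bias={model.bias.item():.2f}")
```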

5. Specialized Libraries

Libraries built to tackle domain-specific DS/ML tasks.
  • Hugging Face Transformers: A widely used library of pretrained models for Natural Language Processing (NLP) and Large Language Models (LLMs), which also covers state-of-the-art image and audio models.
  • OpenCV: The premier library for Computer Vision, used for image processing and video analytics.
  • SciPy (scipy.stats): The statistics submodule of SciPy (see section 1), used specifically for probability distributions, statistical tests, and frequency statistics.
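
As a small example of the statistical side, the sketch below uses scipy.stats to run a two-sample test and to query a probability distribution. The two groups are synthetic samples generated for illustration, and Welch's t-test is just one reasonable choice of test here.

```python
import numpy as np
from scipy import stats

# Hypothetical samples: two groups whose true means differ by 1.0.
rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.0, size=200)

# Statistical test: Welch's t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Probability distribution: upper-tail probability P(Z > 1.96)
# for a standard normal variable.
tail_prob = stats.norm.sf(1.96)

print(f"t={t_stat:.2f}, p={p_value:.3g}, tail={tail_prob:.4f}")
```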

6. MLOps & Deployment

Libraries to track experiments, package models, and deploy them in production.
  • MLflow: Manages the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
  • Streamlit: Turns data scripts into shareable web apps in minutes, perfect for creating quick ML user interfaces.
  • BentoML: A unified model serving framework to package and deploy ML models into scalable endpoints. 
