Machine Learning: Datasets for Regression with Python

Gui-Rong Liu
Author
Dr. G.R. (Guirong) Liu is an expert in computational mechanics who currently serves as a Professor of Aerospace Engineering and Engineering Mechanics at the University of Cincinnati. A prolific researcher and educator, he is recognized for pioneering work in meshfree methods, smoothed finite element methods (S-FEM), and particle-based simulations such as smoothed particle hydrodynamics (SPH). These innovations have significantly advanced simulation techniques in both solid and fluid mechanics. In recent years, Dr. Liu has authored several textbooks and reference works for courses and research in areas such as artificial intelligence, machine learning, mathematics, computational methods, mechanics of materials, solid mechanics, engineering mechanics, applied mechanics, and fluid dynamics.

Synopsis

In “What Is Machine Learning” (https://sci-en-tech.com/ebooks/), we introduced machine learning (ML) at a high level and summarized its core concepts. A central takeaway was that modern ML models are fundamentally data-driven. Consequently, building a reliable model depends critically on a well-prepared dataset; thorough understanding, careful examination, and appropriate treatment of data are essential for both effective training and rigorous testing. Building on this, in “Machine Learning: Datasets for Classification with Python” (https://sci-en-tech.com/ebooks/), we discussed the creation, examination, and preprocessing of data specifically for classification tasks.

This booklet focuses on several major publicly available datasets used for training supervised regression ML models:

· California Housing Dataset
· Diabetes Dataset
· Airfoil Self-Noise Dataset
· Concrete Compressive Strength Dataset
· Energy Efficiency Dataset
· Bike Sharing Dataset
· Give Me Some Credit Dataset
· Superconductivity Dataset

Using these datasets, we provide practical demonstrations of how to load data in Python (using scikit-learn or pandas), inspect their structure, and visualize representative samples. These demonstrations illustrate the fundamental principles of data management within a typical machine learning workflow. In particular, we address the following concepts, techniques, rules, and procedures:

· Data Normalization
· Special Mapping and Transformations
· Feature Extraction from Temporal Data
· Encoding Cyclical Features
· One-Hot Encoding
· Feature Importance Examination
· Feature Reduction via PCA
· Split-Fit-Transform Rule

These concepts are essential for improving the computational stability of machine learning models and for ensuring effective generalization to unseen data.
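As a rough illustration of the loading, normalization, and split-fit-transform ideas named above, the following minimal sketch uses the Diabetes dataset bundled with scikit-learn. It is not taken from the booklet. It assumes that the "Split-Fit-Transform Rule" means splitting the data first, fitting the scaler on the training portion only, and then transforming both portions. The split ratio and random seed are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the bundled Diabetes regression dataset as pandas objects.
# Note: scikit-learn ships these features already centered and scaled,
# so the scaling below only illustrates the order of operations.
X, y = load_diabetes(return_X_y=True, as_frame=True)
print(X.shape)   # inspect structure: (n_samples, n_features)
print(X.head())  # inspect a few representative samples

# 1) Split first, so the test set stays unseen (ratio and seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2) Fit the scaler on the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3) Transform the test data with the statistics learned from training data.
X_test_scaled = scaler.transform(X_test)

# Training features now have (approximately) zero mean and unit variance.
print(np.round(X_train_scaled.mean(axis=0), 6))
```

Fitting the scaler before splitting would let test-set statistics leak into the training features, which is the situation this ordering is meant to avoid.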
High-quality data are the cornerstone of high-quality machine learning: if the underlying data are flawed, the resulting model will inevitably reflect those flaws, regardless of the training algorithm employed. This booklet presents the major techniques for preparing datasets ready for training supervised regression machine learning models. Although these techniques are presented in the context of regression models, they are also applicable to classification models, with appropriate consideration of differences in the target variable. Note: This booklet focuses exclusively on data preparation; the training of ML models will be addressed in subsequent volumes.
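To make two of the listed techniques more concrete, here is a small sketch of temporal feature extraction, cyclical encoding, and one-hot encoding with pandas and NumPy. The tiny table, its column names (`timestamp`, `season`), and its values are hypothetical and are not drawn from any of the datasets above. The sine/cosine mapping is one common way to encode a cyclical feature and is not necessarily the booklet's exact approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four timestamps and a categorical column.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2011-01-01 00:00",
                "2011-01-01 06:00",
                "2011-01-01 12:00",
                "2011-01-01 18:00",
            ]
        ),
        "season": ["winter", "winter", "spring", "summer"],
    }
)

# Feature extraction from temporal data: pull the hour out of the timestamp.
df["hour"] = df["timestamp"].dt.hour

# Cyclical encoding: map the hour onto a circle so that 23:00 and 00:00
# end up close together instead of 23 units apart.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# One-hot encoding: replace the categorical column with indicator columns.
encoded = pd.get_dummies(df, columns=["season"], prefix="season")
print(encoded.columns.tolist())
```

Using both the sine and the cosine component keeps every hour distinguishable; either component alone maps two different hours to the same value.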