Some Nuances Related to ML: AI Foundations #2
Artificial Intelligence (AI) and Machine Learning (ML) are reshaping the way technology interacts with the world. Building on the foundations laid out in our previous article, this installment explores nuanced concepts like data preparation, different types of learning, and challenges in ML development.
The Cornerstone of Machine Learning: Data Preparation
The quality of an ML model heavily depends on how well the data is prepared. A common practice is to split the dataset into:
- Training Set: 80% of the data is used to train the model.
- Test Set: 20% of the data evaluates the model’s performance.
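As a sketch in plain Python (with a made-up dataset standing in for real records), an 80/20 split amounts to shuffling the data and slicing it:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    # Shuffle a copy so the split is random but reproducible
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    # First n_test examples form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(100))           # stand-in for real records
train, test = train_test_split(examples)
print(len(train), len(test))          # 80 20
```

Shuffling before slicing matters: if the data is ordered (say, by date or by label), a naive slice would give the model a test set that looks nothing like its training set.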
The Machine Learning Process
1. Data Pre-Processing
- Import the data
- Clean the data
- Split into training set and test set
2. Modelling
- Build the model
- Train the model
- Make predictions
3. Evaluation
- Calculate performance metrics
- Draw a conclusion from the results
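The three phases above can be sketched end to end in plain Python. Everything below is illustrative: the synthetic dataset, the "dirty" record, and the mean-predictor baseline are stand-ins, not a recommended model.

```python
import random

# 1) Data pre-processing: import (here, synthesize), clean, split
raw = [(x, 2 * x + random.Random(x).uniform(-1, 1)) for x in range(50)]
raw.append(None)                              # a "dirty" record
data = [r for r in raw if r is not None]      # clean: drop missing rows
random.Random(0).shuffle(data)
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# 2) Modelling: a trivial baseline that always predicts the mean target
mean_y = sum(y for _, y in train) / len(train)
def predict(x):
    return mean_y

# 3) Evaluation: mean squared error on the held-out test set
mse = sum((predict(x) - y) ** 2 for x, y in test) / len(test)
print(round(mse, 2))
```

The point is the shape of the workflow, not the model: any real model slots into step 2 without changing steps 1 and 3.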
Feature Scaling
Feature scaling ensures that numerical columns with large value ranges do not dominate those with small ones. Two common methods are:
- Normalization: Rescales values to fit within the range [0, 1].
- Standardization: Centers data around the mean with a standard deviation of 1, with most values typically falling between -3 and 3.
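Both methods are easy to sketch in plain Python (the `ages` column below is illustrative):

```python
def normalize(values):
    # Min-max normalization: rescale values into [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Standardization: zero mean, unit standard deviation
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

ages = [18, 25, 32, 47, 61]
print(normalize(ages))      # values in [0, 1]
print(standardize(ages))    # zero mean, unit variance
```

A practical caveat: the scaling parameters (min/max, mean/std) should be computed on the training set only and then reused on the test set, so no information from the test data leaks into training.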
Dimensions Matter: Feature Engineering and Reduction
Feature Engineering
Creating or selecting meaningful features is critical to a model’s success. The process includes:
- Feature Selection: Identifying the most impactful features.
- Feature Extraction: Combining or transforming features to enhance performance.
- Dimensionality Reduction: Reducing the number of input variables using algorithms like Principal Component Analysis (PCA). Benefits include reduced computation time, lower memory requirements, and sometimes better accuracy.
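As a hedged sketch (NumPy assumed, random input purely illustrative), PCA can be implemented by projecting centered data onto the top eigenvectors of its covariance matrix:

```python
import numpy as np

def pca(X, n_components):
    # Center the data so each feature has zero mean
    Xc = X - X.mean(axis=0)
    # Eigen-decomposition of the feature covariance matrix
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    # eigh returns eigenvalues in ascending order; take the top ones
    order = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # illustrative 5-feature dataset
Z = pca(X, 2)
print(Z.shape)                    # (100, 2): 5 features reduced to 2
```

The first returned component captures the most variance, the second the most remaining variance, and so on; dropping low-variance components is what saves computation and memory.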
Types of Learning in Machine Learning
Supervised Learning
The most common type of ML, where the system learns from labeled data. Predicting house prices or classifying emails as spam are classic examples.
Unsupervised Learning
Here, the system learns patterns from unlabeled data. Tasks include:
- Clustering: Grouping similar data points (e.g., customer segmentation).
- Dimensionality Reduction: Simplifying data without losing important information.
- Anomaly Detection: Identifying rare or unusual data points.
- Association Rule Learning: Discovering interesting relationships between variables.
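A minimal clustering sketch, assuming one-dimensional points (e.g. customer spend) and a simplified first-k initialization; real k-means implementations use smarter initialization such as k-means++:

```python
def kmeans_1d(points, k=2, iters=10):
    # Simplified k-means on 1-D points; centroids start at the first k points
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [1.0, 2.0, 1.5, 10.0, 11.0, 10.5]   # two obvious spending groups
centroids, clusters = kmeans_1d(spend)
print(sorted(centroids))                     # one center per group
```

No labels were provided anywhere: the algorithm discovers the two groups purely from how the points sit relative to each other, which is exactly what makes this unsupervised.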
Semi-Supervised Learning
This hybrid approach combines a small amount of labeled data with a large set of unlabeled data. For instance, photo organization tools use a few labeled images to categorize thousands of unlabeled ones.
Reinforcement Learning
Unlike other types of ML, reinforcement learning involves an agent learning through interaction with its environment. The agent receives rewards for correct actions and penalties for mistakes, gradually developing an optimal strategy, or policy.
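The reward-and-penalty loop can be sketched with tabular Q-learning on a toy five-state corridor, where the agent earns a reward only at the rightmost state. All parameters below (state count, learning rate `alpha`, discount `gamma`, exploration rate `eps`) are illustrative choices:

```python
import random

def q_learning(n_states=5, episodes=200, alpha=0.5, gamma=0.9, eps=0.1):
    # Q[state][action]; action 0 = left, 1 = right; goal = rightmost state
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy: mostly exploit the best-known action,
            # occasionally explore a random one
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            r = 1.0 if s2 == n_states - 1 else 0.0   # reward at the goal
            # nudge the estimate toward reward + discounted future value
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

random.seed(0)
Q = q_learning()
# The learned policy should point right in every non-terminal state
print([0 if q[0] > q[1] else 1 for q in Q[:-1]])
```

The learned Q-table is the policy: in each state, the agent simply picks the action with the higher value, and that greedy rule here leads straight to the goal.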
Learning Styles: Batch vs. Online
Batch Learning
- Processes all training data at once.
- Suitable for scenarios where data does not change frequently.
- Requires retraining from scratch to adapt to new data.
Online Learning
- Processes data incrementally, either individually or in small batches.
- Adaptable to real-time updates.
- Controlled by the learning rate, which determines how quickly the model adapts to new data. A high learning rate enables rapid adaptation but risks forgetting older information, while a low rate provides stability.
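A minimal online-learning sketch: a linear model updated one example at a time with stochastic gradient descent, where `lr` is the learning rate described above. The stream of `y = 2x + 1` points is synthetic:

```python
def sgd_step(w, b, x, y, lr):
    # One incremental update of a linear model y_hat = w*x + b.
    # A higher lr adapts faster but is less stable; lower is the reverse.
    err = (w * x + b) - y
    return w - lr * err * x, b - lr * err

w, b = 0.0, 0.0
stream = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]  # y = 2x + 1
for _ in range(500):                # examples arriving repeatedly over time
    for x, y in stream:
        w, b = sgd_step(w, b, x, y, lr=0.1)
print(round(w, 2), round(b, 2))     # should approach 2 and 1
```

Unlike batch learning, nothing here ever needs the whole dataset in memory at once: each `(x, y)` pair is consumed, used for one update, and can be discarded.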
Approaches to Generalization
Instance-Based Learning
The model memorizes examples and uses similarity measures for predictions. It’s simple but can struggle with scalability.
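A 1-nearest-neighbor sketch, using Euclidean distance as the similarity measure (the tiny labeled dataset is illustrative):

```python
def predict_1nn(train, query):
    # "Memorize" the examples; predict the label of the closest one
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    return min(train, key=lambda ex: dist(ex[0], query))[1]

examples = [((1.0, 1.0), "cat"), ((5.0, 5.0), "dog"), ((1.2, 0.8), "cat")]
print(predict_1nn(examples, (0.9, 1.1)))   # nearest example is a "cat"
```

The scalability problem is visible in the code: every prediction scans the entire training set, so prediction cost grows linearly with the number of memorized examples.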
Model-Based Learning
The system builds a generalized model from training data, such as a linear regression line. It relies on optimizing a utility function or minimizing a cost function, such as the distance between predicted and actual values.
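For simple linear regression the cost-minimizing line can even be written in closed form; the sketch below minimizes mean squared error (the distance between predicted and actual values) over a made-up dataset:

```python
def fit_line(xs, ys):
    # Choose the slope and intercept that minimize mean squared error
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])   # y = 2x + 1
print(slope, intercept)
```

Once fitted, the two numbers `slope` and `intercept` are the whole model: the training examples can be thrown away, which is the key contrast with instance-based learning.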
Common Challenges in Machine Learning
1. Overfitting: The model performs well on training data but fails to generalize. Solutions include:
- Regularization: Simplifies the model to avoid capturing noise.
- Cross-Validation: Ensures performance consistency across different data subsets.
2. Underfitting: The model is too simple to capture data patterns. Addressed by increasing model complexity or adding relevant features.
3. Poor Data Quality: “Garbage in, garbage out” highlights the need for clean, representative data.
4. Imbalanced Data: When some categories are heavily over- or under-represented, the model learns a skewed picture; the training set should reflect the real-world distribution.
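Cross-validation, mentioned above as a guard against overfitting, boils down to simple index bookkeeping: each fold takes one turn as the validation set while the rest train the model. The sketch below assumes equal-sized folds, so `n` should be divisible by `k`:

```python
def kfold_indices(n, k):
    # Split indices 0..n-1 into k folds; yield (train, test) index pairs
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in set(test)]
        yield train, test

for train_idx, test_idx in kfold_indices(10, 5):
    print(len(train_idx), len(test_idx))   # 8 2 each time
```

If the model's score varies wildly from fold to fold, that inconsistency itself is a warning sign of overfitting or of too little data.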
FAQs to Solidify Understanding
- What is a labeled training set? A dataset where each example includes input features and a corresponding output label.
- What are common supervised tasks? Classification and regression.
- How does online learning differ from batch learning? Online learning updates the model incrementally, while batch learning requires retraining with all data.
- What is the purpose of a test set? To evaluate how well the model generalizes to new, unseen data.
Understanding these nuances strengthens your ability to navigate and apply Machine Learning. Whether you’re tuning hyperparameters or exploring new datasets, a solid grasp of foundational concepts is the first step to success in AI.