Some Nuances Related to ML: AI Foundations #2
Artificial Intelligence (AI) and Machine Learning (ML) are reshaping the way technology interacts with the world. Building on the foundations laid out in our previous article, this installment explores nuanced concepts like data preparation, different types of learning, and challenges in ML development.
The Cornerstone of Machine Learning: Data Preparation
The quality of an ML model heavily depends on how well the data is prepared. A common practice is to split the dataset into:
- Training Set: 80% of the data is used to train the model.
- Test Set: 20% of the data evaluates the model’s performance.
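As a sketch in plain Python (with a made-up dataset standing in for real records), an 80/20 split amounts to shuffling the data and slicing it:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    # Shuffle a copy so the split is random but reproducible
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    # First n_test examples form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(100))           # stand-in for real records
train, test = train_test_split(examples)
print(len(train), len(test))          # 80 20
```

Shuffling before slicing matters: if the data is ordered (say, by date or by label), a naive slice would give the model a test set that looks nothing like its training set.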
The Machine Learning Process
1. Data Pre-Processing
- Import the data
- Clean the data
- Split into training set and test set
2. Modelling
- Build the model
- Train the model
- Make predictions
3. Evaluation
- Calculate performance metrics
- Draw a conclusion from the results
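The three phases above can be sketched end to end in plain Python. Everything below is illustrative: the synthetic dataset, the "dirty" record, and the mean-predictor baseline are stand-ins, not a recommended model.

```python
import random

# 1) Data pre-processing: import (here, synthesize), clean, split
raw = [(x, 2 * x + random.Random(x).uniform(-1, 1)) for x in range(50)]
raw.append(None)                              # a "dirty" record
data = [r for r in raw if r is not None]      # clean: drop missing rows
random.Random(0).shuffle(data)
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# 2) Modelling: a trivial baseline that always predicts the mean target
mean_y = sum(y for _, y in train) / len(train)
def predict(x):
    return mean_y

# 3) Evaluation: mean squared error on the held-out test set
mse = sum((predict(x) - y) ** 2 for x, y in test) / len(test)
print(round(mse, 2))
```

The point is the shape of the workflow, not the model: any real model slots into step 2 without changing steps 1 and 3.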
Feature Scaling
Feature scaling ensures that numerical columns with large value ranges do not dominate those with small ones. Two common methods are:
- Normalization: Rescales values to fit within the range [0, 1].
- Standardization: Centers data around the mean with a standard deviation of 1, with most values typically falling between -3 and 3.
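Both methods are easy to sketch in plain Python (the `ages` column below is illustrative):

```python
def normalize(values):
    # Min-max normalization: rescale values into [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Standardization: zero mean, unit standard deviation
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

ages = [18, 25, 32, 47, 61]
print(normalize(ages))      # values in [0, 1]
print(standardize(ages))    # zero mean, unit variance
```

A practical caveat: the scaling parameters (min/max, mean/std) should be computed on the training set only and then reused on the test set, so no information from the test data leaks into training.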
Dimensions Matter: Feature Engineering and Reduction
Feature Engineering
Creating or selecting meaningful features is critical to a model’s success. The process includes:
- Feature Selection: Identifying the most impactful features.
- Feature Extraction: Combining or transforming features to enhance performance.
- Dimensionality Reduction: Reducing the number of input variables using algorithms like Principal Component Analysis (PCA). Benefits include reduced computation time, lower memory requirements, and sometimes better accuracy.
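As a hedged sketch (NumPy assumed, random input purely illustrative), PCA can be implemented by projecting centered data onto the top eigenvectors of its covariance matrix:

```python
import numpy as np

def pca(X, n_components):
    # Center the data so each feature has zero mean
    Xc = X - X.mean(axis=0)
    # Eigen-decomposition of the feature covariance matrix
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    # eigh returns eigenvalues in ascending order; take the top ones
    order = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # illustrative 5-feature dataset
Z = pca(X, 2)
print(Z.shape)                    # (100, 2): 5 features reduced to 2
```

The first returned component captures the most variance, the second the most remaining variance, and so on; dropping low-variance components is what saves computation and memory.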
Types of Learning in Machine Learning
Supervised Learning
The most common type of ML, where the system learns from labeled data. Predicting house prices or classifying emails as spam are classic examples.
Unsupervised Learning
Here, the system learns patterns from unlabeled data. Tasks include:
- Clustering: Grouping similar data points (e.g., customer segmentation).
- Dimensionality Reduction: Simplifying data without losing important information.
- Anomaly Detection: Identifying rare or unusual data points.
- Association Rule Learning: Discovering interesting relationships between variables.
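A minimal clustering sketch, assuming one-dimensional points (e.g. customer spend) and a simplified first-k initialization; real k-means implementations use smarter initialization such as k-means++:

```python
def kmeans_1d(points, k=2, iters=10):
    # Simplified k-means on 1-D points; centroids start at the first k points
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [1.0, 2.0, 1.5, 10.0, 11.0, 10.5]   # two obvious spending groups
centroids, clusters = kmeans_1d(spend)
print(sorted(centroids))                     # one center per group
```

No labels were provided anywhere: the algorithm discovers the two groups purely from how the points sit relative to each other, which is exactly what makes this unsupervised.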
Semi-Supervised Learning
This hybrid approach combines a small amount of labeled data with a large set of unlabeled data. For instance, photo organization tools use a few labeled images to categorize thousands of unlabeled ones.
Reinforcement Learning
Unlike other types of ML, reinforcement learning involves an agent learning through interaction with its environment. The agent receives rewards for correct actions and penalties for mistakes, gradually developing an optimal strategy, or policy.
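The reward-and-penalty loop can be sketched with tabular Q-learning on a toy five-state corridor, where the agent earns a reward only at the rightmost state. All parameters below (state count, learning rate `alpha`, discount `gamma`, exploration rate `eps`) are illustrative choices:

```python
import random

def q_learning(n_states=5, episodes=200, alpha=0.5, gamma=0.9, eps=0.1):
    # Q[state][action]; action 0 = left, 1 = right; goal = rightmost state
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy: mostly exploit the best-known action,
            # occasionally explore a random one
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            r = 1.0 if s2 == n_states - 1 else 0.0   # reward at the goal
            # nudge the estimate toward reward + discounted future value
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

random.seed(0)
Q = q_learning()
# The learned policy should point right in every non-terminal state
print([0 if q[0] > q[1] else 1 for q in Q[:-1]])
```

The learned Q-table is the policy: in each state, the agent simply picks the action with the higher value, and that greedy rule here leads straight to the goal.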
Learning Styles: Batch vs. Online
Batch Learning
- Processes all training data at once.
- Suitable for scenarios where data does not change frequently.
- Requires retraining from scratch to adapt to new data.
Online Learning
- Processes data incrementally, either individually or in small batches.
- Adaptable to real-time updates.
- Controlled by the learning rate, which determines how quickly the model adapts to new data. A high learning rate enables rapid adaptation but risks forgetting older information, while a low rate provides stability.
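A minimal online-learning sketch: a linear model updated one example at a time with stochastic gradient descent, where `lr` is the learning rate described above. The stream of `y = 2x + 1` points is synthetic:

```python
def sgd_step(w, b, x, y, lr):
    # One incremental update of a linear model y_hat = w*x + b.
    # A higher lr adapts faster but is less stable; lower is the reverse.
    err = (w * x + b) - y
    return w - lr * err * x, b - lr * err

w, b = 0.0, 0.0
stream = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]  # y = 2x + 1
for _ in range(500):                # examples arriving repeatedly over time
    for x, y in stream:
        w, b = sgd_step(w, b, x, y, lr=0.1)
print(round(w, 2), round(b, 2))     # should approach 2 and 1
```

Unlike batch learning, nothing here ever needs the whole dataset in memory at once: each `(x, y)` pair is consumed, used for one update, and can be discarded.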
Approaches to Generalization
Instance-Based Learning
The model memorizes examples and uses similarity measures for predictions. It’s simple but can struggle with scalability.
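A 1-nearest-neighbor sketch, using Euclidean distance as the similarity measure (the tiny labeled dataset is illustrative):

```python
def predict_1nn(train, query):
    # "Memorize" the examples; predict the label of the closest one
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    return min(train, key=lambda ex: dist(ex[0], query))[1]

examples = [((1.0, 1.0), "cat"), ((5.0, 5.0), "dog"), ((1.2, 0.8), "cat")]
print(predict_1nn(examples, (0.9, 1.1)))   # nearest example is a "cat"
```

The scalability problem is visible in the code: every prediction scans the entire training set, so prediction cost grows linearly with the number of memorized examples.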
Model-Based Learning
The system builds a generalized model from training data, such as a linear regression line. It relies on optimizing a utility function or minimizing a cost function, such as the distance between predicted and actual values.
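For simple linear regression the cost-minimizing line can even be written in closed form; the sketch below minimizes mean squared error (the distance between predicted and actual values) over a made-up dataset:

```python
def fit_line(xs, ys):
    # Choose the slope and intercept that minimize mean squared error
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])   # y = 2x + 1
print(slope, intercept)
```

Once fitted, the two numbers `slope` and `intercept` are the whole model: the training examples can be thrown away, which is the key contrast with instance-based learning.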
Common Challenges in Machine Learning
1. Overfitting: The model performs well on training data but fails to generalize. Solutions include:
- Regularization: Simplifies the model to avoid capturing noise.
- Cross-Validation: Ensures performance consistency across different data subsets.
2. Underfitting: The model is too simple to capture data patterns. Addressed by increasing model complexity or adding relevant features.
3. Poor Data Quality: “Garbage in, garbage out” highlights the need for clean, representative data.
4. Imbalanced Data: When some categories are heavily over- or under-represented, the model learns a skewed picture; the training set should reflect the real-world distribution.
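Cross-validation, mentioned above as a guard against overfitting, boils down to simple index bookkeeping: each fold takes one turn as the validation set while the rest train the model. The sketch below assumes equal-sized folds, so `n` should be divisible by `k`:

```python
def kfold_indices(n, k):
    # Split indices 0..n-1 into k folds; yield (train, test) index pairs
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in set(test)]
        yield train, test

for train_idx, test_idx in kfold_indices(10, 5):
    print(len(train_idx), len(test_idx))   # 8 2 each time
```

If the model's score varies wildly from fold to fold, that inconsistency itself is a warning sign of overfitting or of too little data.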
FAQs to Solidify Understanding
- What is a labeled training set? A dataset where each example includes input features and a corresponding output label.
- What are common supervised tasks? Classification and regression.
- How does online learning differ from batch learning? Online learning updates the model incrementally, while batch learning requires retraining with all data.
- What is the purpose of a test set? To evaluate how well the model generalizes to new, unseen data.
Understanding these nuances strengthens your ability to navigate and apply Machine Learning. Whether you’re tuning hyperparameters or exploring new datasets, a solid grasp of foundational concepts is the first step to success in AI.