Table of Content
- Fight Overfitting in Machine learning
- Get More Data
- Reduce Model Complexity
- Use Regularization
- Validate
- Combine Models
- Conclusion
Fight Overfitting in Machine learning
Machine learning is a powerful tool for creating predictive models from data, but with great power comes the great potential for error. Complex models can detect subtle patterns, but they are also prone to becoming attached to particularities in the data used to create them: a tendency known as overfitting. Taking steps to fight overfitting is necessary to develop predictive models that make accurate predictions on new data, especially when using complex models like neural networks or decision trees.
Get More Data
A predictive model is only as powerful as the data used to train it. The more data you use, the better the model will tend to perform when making predictions on new data. A small data set may not exhibit all the patterns present in the population of interest, which can lead to models that don’t generalize well. Including more data in the training set can reduce overfitting and improve model accuracy. Specific models like neural networks tend to be extraordinarily data-hungry and may require thousands or even millions of examples to get good performance.
Reduce Model Complexity
It’s not always possible to get more data, due to financial considerations, time constraints, and other factors. Reducing the model’s complexity is another way to combat overfitting and has the additional benefit of making it faster to train. Many machine-learning models have parameters you can adjust that affect complexity and the tendency to overfit. For instance, tree-based models typically have a setting for maximum tree depth, so reducing tree depth will simplify the model and the propensity to overfit. The specific names of parameters for controlling complexity will vary depending on the programming libraries you use, but many models have arguments you can adjust to constrain their complexity.
Use Regularization
Regularization reduces overfitting by adding a penalty for model complexity. For example, linear regression assigns a numeric weight to each column or feature in the training data and then uses those weights to make predictions. The larger a column’s weight, the more influence it has on the projections. If a feature receives an enormous weight, it could have too much impact on the predictions and cause the model to overfit. Ridge regression is an alternative to the standard linear regression that adds a penalty for large weights, which constrains the model and reduces overfitting. Read the documentation for the model you are using to see if it offers parameters to control regularization; common arguments that control regularization include lambda, alpha, l1, and l2.
Validate
Although you can never be sure how well a model will perform on new data, it is essential to simulate performance so that you can iterate and improve the model. Validation describes setting some of the data aside when building a model and using that held-out data to test the model’s performance. If you don’t hold some data back to check performance and instead assess the model using the training data itself, it has a good chance of overfitting.
Also, be wary of performing feature engineering or preprocessing on the entire data set before splitting it into training and validation sets, as you may inadvertently introduce some information from the latter set into the former. If that happens, your validation scores will probably not reflect the model’s true performance.
Combine Models
Combining two or more models, known as ensembling, is a technique to reduce overfitting and boost predictive power. Different models detect different patterns so averaging them together dampens the impact of overfitting by any single model. Ensembling can be as simple as running the same model several times and averaging the results together, although combining a diverse set of models will tend to give your predictions more resistance to overfitting.
Conclusion
Overfitting is one of the most pervasive problems for creating machine learning models that make useful predictions. Powerful models like boosted decision trees and neural networks can discover complex relationships in data, but they are also prone to overfitting. Taking steps to fight overfitting is essential to make models that generalize well to new data. If you are interested in reading more about Machine Learning please check out this introduction article.