Binary Classification Model for Edible Mushrooms Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

Dataset Used: Mushroom Data Set

Dataset ML Model: Binary classification with categorical attributes

Dataset Reference:

One potential source of performance benchmarks:

INTRODUCTION: This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981), G. H. Lincoff (Pres.), New York: Alfred A. Knopf (pp. 500-525). Each species is identified as definitely edible or definitely poisonous. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom.

CONCLUSION: All algorithms except Naive Bayes and Support Vector Machine scored 100% accuracy on the training data with a 70%/30% training/validation split. Those same eight algorithms also scored 100% accuracy on the validation dataset.

I then changed the training/validation split to 50%/50%, so the algorithms had less training data to work with. The same eight algorithms still turned in 100% accuracy on the larger validation dataset. With a 30%/70% split, the Stochastic Gradient Boosting model dropped out of the race for predictive perfection. With a 20%/80% split, Random Forest was the only model that maintained a perfect prediction score on both the training and validation datasets.
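
The split-ratio experiment described above can be sketched roughly as follows. This is a minimal illustration and not the code from the original report: it assumes scikit-learn and NumPy are available, uses a small synthetic categorical table in place of the real mushroom file, invents a labelling rule for that table, and compares only two of the algorithms. The seed, the model settings, and all variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the mushroom data (an assumption, not the real file):
# 2000 rows, 5 categorical attributes coded as single letters.
rng = np.random.default_rng(7)
X_raw = rng.choice(list("abcd"), size=(2000, 5))

# Hypothetical labelling rule: poisonous ("p") when attribute 0 is "a",
# or when attributes 1 and 2 are both "b"; edible ("e") otherwise.
is_poisonous = (X_raw[:, 0] == "a") | ((X_raw[:, 1] == "b") & (X_raw[:, 2] == "b"))
y = np.where(is_poisonous, "p", "e")

# One-hot encode the categorical attributes so every model can consume them.
X = OneHotEncoder(handle_unknown="ignore").fit_transform(X_raw)

# Two illustrative models; the original study compared more algorithms.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=7),
}

# Shrink the training share step by step and record both accuracies.
results = {}
for train_share in (0.7, 0.5, 0.3, 0.2):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=train_share, stratify=y, random_state=7
    )
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[(train_share, name)] = (
            accuracy_score(y_train, model.predict(X_train)),
            accuracy_score(y_val, model.predict(X_val)),
        )

for (train_share, name), (train_acc, val_acc) in results.items():
    print(f"train share {train_share:.0%}  {name}: "
          f"train={train_acc:.3f}  validation={val_acc:.3f}")
```

Fitting the encoder on the full table before splitting is a simplification that only lets the encoder see the set of category values; a stricter setup would wrap the encoder and the model in a pipeline and fit both on the training portion alone.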

Future studies could examine whether the machine learning algorithms can be trained with fewer features and still maintain the high prediction accuracy. For now, the Random Forest algorithm appears to be the best-performing model for determining whether mushroom species are edible.
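
One possible starting point for the fewer-features question is to rank each categorical attribute by how much it reduces class entropy (information gain). The sketch below is an assumption about how such a study might begin, not a method taken from the original report. The toy rows and the column names (`odor`, `cap_color`, `habitat`) are hypothetical and exist only to make the example runnable.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, column):
    """Entropy reduction obtained by splitting on one categorical column."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[column]].append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy rows: (odor, cap_color, habitat); values are illustrative only.
rows = [
    ("foul", "brown", "woods"),
    ("foul", "white", "grass"),
    ("none", "brown", "woods"),
    ("none", "white", "grass"),
    ("none", "brown", "grass"),
    ("foul", "white", "woods"),
]
labels = ["p", "p", "e", "e", "e", "p"]
names = ["odor", "cap_color", "habitat"]

# Rank attributes from most to least informative.
ranking = sorted(
    ((information_gain(rows, labels, i), name) for i, name in enumerate(names)),
    reverse=True,
)
for gain, name in ranking:
    print(f"{name}: {gain:.3f} bits")
```

In this toy table the `odor` column separates the two classes completely by construction, so it ranks first. On real data, the top-ranked attributes would be candidates for a reduced feature set, which could then be re-evaluated with the same training/validation splits.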

The HTML formatted report can be found here on GitHub.