IMDb Movie Rating Prediction Using a Random Forest Classification Approach

Rifqy Rosyidah Ilmi

doi:10.35335/25rz6180

Authors

Rifqy Rosyidah Ilmi, Universitas Sunan Gresik, Indonesia

Keywords:

IMDb, Machine Learning, Random Forest, SMOTE

Abstract

Accurate movie rating prediction supports audience choice and analytical decision-making in the digital film industry, and the large-scale metadata available from IMDb makes it possible to apply machine learning to the analysis of rating patterns. This study investigates how effectively a Random Forest classification model predicts IMDb movie rating categories from structured attributes: genre, movie duration, content rating, actor popularity, and user review statistics. Data preprocessing consisted of handling missing values, removing duplicates, encoding categorical variables, normalizing numerical features, and partitioning the dataset into training and testing subsets. To mitigate class imbalance among rating categories, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training data only. In the experimental evaluation, the proposed model achieves an overall accuracy of 0.78, with balanced precision, recall, and F1-score values across all classes. Confusion matrix analysis shows that classification errors occur predominantly between neighboring rating categories, which is consistent with the inherent subjectivity of movie ratings. Feature importance analysis identifies genre, duration, content rating, and user engagement indicators as the most influential predictors. These results indicate that Random Forest offers a robust and interpretable baseline for IMDb rating prediction and provides useful insights for future research on movie analytics and recommendation.
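The oversampling step described in the abstract can be illustrated with a minimal, standard-library sketch of the SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest same-class neighbors. This is not the study's implementation, which the abstract does not specify. The function names `smote` and `balance_training_set`, the class labels, and the two normalized features in the toy data are all hypothetical and chosen only for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between same-class samples.

    minority: list of numeric feature vectors that all belong to one class.
    """
    if n_new <= 0:
        return []
    if len(minority) < 2:
        raise ValueError("need at least two samples of a class to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other samples of this class by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance_training_set(X_train, y_train, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    by_class = {}
    for x, y in zip(X_train, y_train):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X_train), list(y_train)
    for label, rows in by_class.items():
        extra = smote(rows, target - len(rows), k=k, seed=seed)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


# Hypothetical training split: two features already scaled to [0, 1]
# (for example, normalized duration and a normalized engagement score).
X_train = [
    [0.30, 0.20], [0.35, 0.30], [0.40, 0.25], [0.45, 0.40], [0.50, 0.50],
    [0.80, 0.90], [0.85, 0.80],
]
y_train = ["mid", "mid", "mid", "mid", "mid", "high", "high"]

# Only the training split is resampled; the test split keeps its original distribution.
X_bal, y_bal = balance_training_set(X_train, y_train, k=1)
```

Resampling only the training split keeps synthetic points out of the test set, so the reported metrics still reflect the original class distribution. Because the neighbor search relies on Euclidean distance, it is assumed here that numerical features are normalized before oversampling, in line with the preprocessing order given in the abstract.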

