IMDb Movie Rating Prediction Using a Random Forest Classification Approach

Authors

  • Rifqy Rosyidah Ilmi, Universitas Sunan Gresik, Indonesia

DOI:

https://doi.org/10.35335/25rz6180

Keywords:

IMDb, Machine Learning, Random Forest, SMOTE

Abstract

Accurate movie rating prediction helps serve audience preferences and supports analytical decision-making in the digital film industry. The large-scale metadata available from IMDb makes it possible to study rating patterns with machine learning techniques. This study evaluates a Random Forest classification model for predicting IMDb movie rating categories from structured attributes: genre, movie duration, content rating, actor popularity, and user review statistics. Data preprocessing consisted of handling missing values, removing duplicates, encoding categorical variables, normalizing numerical features, and partitioning the dataset into training and testing subsets. To mitigate class imbalance among rating categories, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training data. In the experimental evaluation, the model achieves an overall accuracy of 0.78, with balanced precision, recall, and F1-score values across all classes. Confusion matrix analysis shows that misclassifications occur predominantly between neighboring rating categories, which reflects the inherent subjectivity of movie ratings. Feature importance analysis identifies genre, duration, content rating, and user engagement indicators as the most influential predictors. These results indicate that Random Forest is a robust and interpretable baseline for IMDb rating prediction and offers a starting point for future movie analytics and recommendation research.
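The abstract does not describe how SMOTE was implemented, so the following is only a minimal sketch of the general idea: each synthetic sample is placed on the line segment between a minority-class sample and one of its k nearest minority-class neighbours, and only the training split is oversampled. The function names (`smote`, `oversample_training_set`), the toy features, and the class labels are hypothetical and are not taken from the study.

```python
import math
import random
from collections import Counter


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows from a list of minority-class feature vectors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base row (itself excluded)
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def oversample_training_set(X_train, y_train, k=5, seed=0):
    """Oversample every class up to the size of the largest class (training data only)."""
    counts = Counter(y_train)
    target = max(counts.values())
    X_out, y_out = list(X_train), list(y_train)
    for label, count in counts.items():
        if count < target:
            rows = [x for x, y in zip(X_train, y_train) if y == label]
            new_rows = smote(rows, target - count, k=k, seed=seed)
            X_out.extend(new_rows)
            y_out.extend([label] * len(new_rows))
    return X_out, y_out


# Toy training split with hypothetical, already-normalized features and labels
X_train = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.3, 0.2], [0.25, 0.3], [0.2, 0.3],
           [0.8, 0.9], [0.9, 0.8]]
y_train = ["medium"] * 6 + ["high"] * 2

X_bal, y_bal = oversample_training_set(X_train, y_train, k=5, seed=42)
print(Counter(y_bal))  # class counts after balancing
```

The test split is left untouched in this sketch, so evaluation metrics would still reflect the original class distribution. In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred over hand-written code like this.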

Published

2026-06-20

How to Cite

IMDb Movie Rating Prediction Using a Random Forest Classification Approach. (2026). Vertex, 15(2), 18–24. https://doi.org/10.35335/25rz6180
