IMDb Movie Rating Prediction Using a Random Forest Classification Approach
DOI:
https://doi.org/10.35335/25rz6180
Keywords:
IMDb, Machine Learning, Random Forest, SMOTE
Abstract
Accurate movie rating prediction is essential for supporting audience preferences and analytical decision-making in the digital film industry. The availability of large-scale metadata from IMDb provides valuable opportunities for applying machine learning techniques to analyze rating patterns. This study investigates the effectiveness of a Random Forest classification model for predicting IMDb movie rating categories based on structured attributes, including genre, movie duration, content rating, actor popularity, and user review statistics. Data preprocessing involved handling missing values, removing duplicates, encoding categorical variables, normalizing numerical features, and partitioning the dataset into training and testing subsets. To mitigate class imbalance among rating categories, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training data. Experimental evaluation demonstrates that the proposed model achieves an overall accuracy of 0.78, accompanied by balanced precision, recall, and F1-score values across all classes. Confusion matrix analysis shows that classification errors predominantly occur between neighboring rating categories, reflecting the inherent subjectivity of movie ratings. Furthermore, feature importance analysis highlights genre, duration, content rating, and user engagement indicators as the most influential predictors. These results indicate that Random Forest offers a robust and interpretable baseline model for IMDb rating prediction and provides meaningful insights for future movie analytics and recommendation research.
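The abstract states that SMOTE was applied to the training data only, after the train/test split. As a rough illustration of that step, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' implementation: the function names (`smote`, `balance_training_set`), the neighbour count `k`, the category labels, and the toy feature values are all assumptions made up for this example, and the abstract does not say which library or parameters were used.

```python
import math
import random
from collections import Counter


def smote(X, y, minority_label, n_new, k=5, seed=0):
    """Create n_new synthetic samples for one class by interpolating between
    a sample of that class and one of its k nearest same-class neighbours."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest same-class neighbours (excluding the base).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance_training_set(X_train, y_train, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    counts = Counter(y_train)
    target = max(counts.values())
    X_bal, y_bal = list(X_train), list(y_train)
    for label, count in counts.items():
        if count < target:
            new_rows = smote(X_train, y_train, label, target - count, k, seed)
            X_bal.extend(new_rows)
            y_bal.extend([label] * len(new_rows))
    return X_bal, y_bal


# Toy, made-up training split: two normalized features, three rating categories.
X_train = [
    [0.10, 0.20], [0.15, 0.25], [0.20, 0.10],
    [0.12, 0.18], [0.18, 0.22], [0.14, 0.12],
    [0.50, 0.55], [0.55, 0.60], [0.60, 0.50],
    [0.90, 0.85], [0.95, 0.80],
]
y_train = ["medium"] * 6 + ["low"] * 3 + ["high"] * 2

X_bal, y_bal = balance_training_set(X_train, y_train, k=2)
print(Counter(y_train))  # class counts before oversampling
print(Counter(y_bal))    # class counts after oversampling
```

Only the training split is passed to `balance_training_set`; the test split would be left untouched so that evaluation reflects the original class distribution. In practice a maintained implementation such as `imblearn.over_sampling.SMOTE` from the imbalanced-learn package would normally be preferred over a hand-written version like this one.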
License
Copyright (c) 2026 Rifqy Rosyidah Ilmi (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
