The Number of Clusters in Hybrid Predictive Models: Does It Really Matter?

Łapczyński, Mariusz; Jefmański, Bartłomiej

doi:10.5604/01.3001.0013.9131

Article details

Journal

Przegląd Statystyczny

2019 | vol. 66 | no. 3 | pp. 228-238

Abstract

Researchers have long attempted to combine various analytical tools to build predictive models. It is possible to combine tools of the same type (ensemble models, committees) or tools of different types (hybrid models). Hybrid models are used in areas such as customer relationship management (CRM), web usage mining, medical sciences, petroleum geology and anomaly detection in computer networks. Our hybrid model was created as a sequential combination of cluster analysis and decision trees. In the first step of the procedure, objects were grouped into clusters using the k-means algorithm. The second step involved building a decision tree model with a new independent variable indicating the cluster to which each object belonged. The analysis was based on 14 data sets collected from publicly accessible repositories. The performance of the models was assessed using measures derived from the confusion matrix, including accuracy, precision, recall, F-measure, and the lift in the first and second deciles. We sought a relationship between the number of clusters and the quality of hybrid predictive models. To our knowledge, no similar studies have been conducted so far. Our research demonstrates that in some cases building hybrid models can improve the performance of predictive models. The models with the highest performance measures required a relatively large number of clusters (from 9 to 15). (original abstract)
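The two-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the value K = 4, the tiny Gini-based tree, and all function names are assumptions made for illustration. The cluster label is treated here as an integer-coded feature with threshold splits, which is a simplification of a nominal cluster-membership variable. The measures are computed on the training data only to keep the example short.

```python
import random
from collections import Counter

def kmeans(points, k, iters=50, seed=0):
    """Lloyd-style k-means on tuples of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: sum((a - b) ** 2
                      for a, b in zip(p, centroids[c]))) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new.append(tuple(sum(col) / len(members) for col in zip(*members))
                       if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return centroids, labels

def gini(ys):
    n = len(ys)
    return 1.0 - sum((c / n) ** 2 for c in Counter(ys).values())

def build_tree(X, y, depth=0, max_depth=3):
    """Small CART-style tree: binary threshold splits chosen by Gini impurity."""
    majority = Counter(y).most_common(1)[0][0]
    if depth == max_depth or len(set(y)) == 1:
        return majority
    best = None
    for j in range(len(X[0])):
        # Excluding the largest value keeps both sides of the split non-empty.
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:
        return majority
    _, j, t = best
    li = [i for i, row in enumerate(X) if row[j] <= t]
    ri = [i for i, row in enumerate(X) if row[j] > t]
    return (j, t,
            build_tree([X[i] for i in li], [y[i] for i in li], depth + 1, max_depth),
            build_tree([X[i] for i in ri], [y[i] for i in ri], depth + 1, max_depth))

def predict(tree, row):
    # Internal nodes are tuples (feature, threshold, left, right); leaves are class labels.
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def confusion_measures(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f_measure": f}

# Hypothetical synthetic data: four Gaussian blobs with XOR-like class labels.
rng = random.Random(1)
X, y = [], []
for (cx, cy), label in [((0, 0), 0), ((0, 5), 1), ((5, 0), 1), ((5, 5), 0)]:
    for _ in range(25):
        X.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
        y.append(label)

K = 4                                                # assumed number of clusters
_, clusters = kmeans(X, K)                           # step 1: k-means grouping
X_hybrid = [row + (c,) for row, c in zip(X, clusters)]  # add cluster label as a feature
tree = build_tree(X_hybrid, y)                       # step 2: decision tree
y_pred = [predict(tree, row) for row in X_hybrid]
measures = confusion_measures(y, y_pred)             # training-set measures only
print(measures)
```

Varying K in this sketch and comparing the measures on held-out data would mimic, in a very reduced form, the kind of comparison the abstract describes. The lift in the first and second deciles is omitted here because it requires predicted class probabilities rather than hard labels.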

Keywords

Cluster analysis; Forecasting models; K-means method; Decision tree

Authors

author

Mariusz Łapczyński

Cracow University of Economics, Poland

author

Bartłomiej Jefmański

Wrocław University of Economics and Business, Poland

References

Asuncion A., Newman D., (2007), UCI machine learning repository, http://archive.ics.uci.edu.
Blattberg R., Kim B. D., Neslin S., (2008), Database Marketing - Analyzing and Managing Customers, 1st ed., Springer, New York. DOI: 10.1007/978-0-387-72579-6.
Bose I., Chen X., (2009), Hybrid Models Using Unsupervised Clustering for Prediction of Customer Churn, Journal of Organizational Computing and Electronic Commerce, 19(2), 133-151, DOI: 10.1080/10919390902821291.
Breiman L., Friedman J., Olshen R., Stone C., (1984), Classification and Regression Trees, 1st ed. Wadsworth statistics / probability series, Wadsworth Publishing Company, Belmont, California.
Chu B. H., Tsai M. S., Ho C. S., (2007), Toward a Hybrid Data Mining Model for Customer Retention, Knowledge-Based Systems, 20(8), 703-718. DOI: 10.1016/j.knosys.2006.10.003.
Everitt B., Landau S., Leese M. D. S., (2011), Cluster Analysis, 5th ed. Wiley Series in Probability and Statistics, John Wiley & Sons, Chichester, West Sussex. DOI: 10.1002/9780470977811.
Ferraretti D., Lamma E., Gamberoni G., Febo M., Di Cuia R., (2011), Integrating Clustering and Classification Techniques: A Case Study for Reservoir Facies Prediction, [in:] Ryżko D., Gawrysiak P., Rybiński H., Kryszkiewicz M., Emerging Intelligent Technologies in Industry, Springer, Berlin, Heidelberg, 21-34. DOI: 10.1007/978-3-642-22732-5_3.
Gaddam S., Phoha V., Balagani K., (2007), K-means + ID3: A Novel Method for Supervised Anomaly Detection by Cascading K-means Clustering and ID3 Decision Tree Learning Methods, IEEE Transactions on Knowledge and Data Engineering, 19(3), 345-354. DOI: 10.1109/TKDE.2007.44.
Khan D., Mohamudally N., (2011), An Integration of k-means and Decision Tree (ID3) Towards a More Efficient Data Mining Algorithm, Journal of Computing, 3(12), 76-82, https://sites.google.com/site/journalofcomputing/volume-3-issue-12-december-2011.
Łapczyński M., Jefmański B., (2013), Impact of Cluster Validity Measures on Performance of Hybrid Models Based on K-means and Decision Trees, [in:] Perner P., (ed.), Advances in Data Mining, Ibai Publishing, Fockendorf, 153-162.
Łapczyński M., Surma J., (2012), Hybrid Predictive Models for Optimizing Marketing Banner Ad Campaign in Online Social Network, [in:] Stahlbock R., (ed), Proceedings of the 2012 International Conference on Data Mining (DMIN), CSREA Press, Las Vegas, Nevada, 140-146.
Li Y., Deng Z., Qian Q., Xu R., (2011), Churn Forecast Based on Two-step Classification in Security Industry, Intelligent Information Management, 3(4), 160-165. DOI: 10.4236/iim.2011.34019.
Lloyd S., (1982), Least Squares Quantization in PCM, IEEE Transactions on Information Theory, 28(2), 129-137, Institute of Electrical and Electronics Engineers (IEEE). DOI: 10.1109/TIT.1982.1056489.
Shouman M., Turner T., Stocker R., (2012), Integrating Decision Tree and K-Means Clustering with Different Initial Centroid Selection Methods in the Diagnosis of Heart Disease Patients, [in:] Stahlbock R., (ed), Proceedings of the 2012 International Conference on Data Mining (DMIN), CSREA Press, Las Vegas, Nevada, 24-30.
Walesiak M., Dudek A., (2011), clusterSim: Searching for Optimal Clustering Procedure for a Data Set, https://cran.r-project.org/web/packages/clusterSim. R package version 0.47-3.

Document type

Bibliography

Identifiers

DOI

10.5604/01.3001.0013.9131

YADDA identifier

bwmeta1.element.ekon-element-000171595043
