Wpływ liczby skupień na jakość predykcyjnych modeli hybrydowych

Łapczyński, Mariusz; Jefmański, Bartłomiej

Article - details

Journal

Handel Wewnętrzny

2014 | no. 1, vol. 2 | 140-150

Article title

Wpływ liczby skupień na jakość predykcyjnych modeli hybrydowych

Authors

Mariusz Łapczyński, Bartłomiej Jefmański

Content

Full texts:

http://handelue.home.pl/ibrkk/pliki/hw/archiwum/handel_wew_1-2014_tom-2.pdf [remote]

Title variants

Impact of Clusters' Number on Performance of Hybrid Predictive Models

Publication languages

Abstracts

The aim of the study is to present the impact of the number of clusters on the quality of hybrid predictive models. The authors built hybrid models by combining the k-means method with CART classification trees. The analysis used several datasets drawn from publicly available repositories and thematically related to the broadly understood marketing activity of companies. The optimal number of clusters was chosen using the Caliński-Harabasz, Krzanowski-Lai, Davies-Bouldin and Hartigan indices and the gap statistic. Classification tree models were then built, treating an object's cluster membership as an additional independent variable. The solutions were evaluated using popular measures: accuracy, sensitivity, precision, the G-mean and the F-measure, as well as the lift value for the first decile of the test set. (original abstract, translated from Polish)

The goal of this paper is to present the impact of the number of clusters on the performance of hybrid predictive models. The authors constructed a hybrid model combining the k-means algorithm with decision trees (CART algorithm). The study is based on several datasets downloaded from popular repositories and related to the marketing activity of companies. In the first stage, objects were clustered using the k-means algorithm. To choose the optimal number of clusters, the authors used popular cluster validity measures: the Calinski-Harabasz index, the Krzanowski-Lai index, the Davies-Bouldin index, the Hartigan index and the gap statistic. In the second stage, the CART algorithm was applied, treating the cluster membership of objects as a new independent variable. The performance of the models was evaluated using popular measures such as accuracy, precision, recall, G-mean, F-measure and lift in the first decile. (original abstract)
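The evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the record does not give the exact formulas they used. The sketch assumes binary labels with 1 as the positive class, takes the G-mean as the geometric mean of recall and specificity, takes the F-measure as the balanced F1 variant, and defines lift in the first decile as the positive rate among the top 10% of cases ranked by model score divided by the overall positive rate. The function names are hypothetical.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_measures(y_true, y_pred):
    """Accuracy, precision, recall, G-mean and F-measure (definitions assumed)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Assumption: G-mean is the geometric mean of recall and specificity.
    g_mean = sqrt(recall * specificity)
    # Assumption: F-measure is the balanced F1 score.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


def lift_first_decile(y_true, scores):
    """Positive rate in the top 10% of cases by score, relative to the base rate."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, len(ranked) // 10)
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate if base_rate else 0.0
```

In a comparison like the one the abstract describes, these functions would be applied to test-set predictions of a tree built with and without the cluster-membership variable, once for each candidate number of clusters.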

Keywords

Cluster analysis; Classification trees; Measures of clustering quality; Decision tree

Journal

Handel Wewnętrzny

Year

2014

Issue

no. 1, vol. 2

Pages

140-150

Physical description

Contributors

author

Mariusz Łapczyński

Uniwersytet Ekonomiczny w Krakowie

author

Bartłomiej Jefmański

Uniwersytet Ekonomiczny we Wrocławiu

Bibliography

Ball G., Hall D. (1965), ISODATA, a novel method of data analysis and pattern classification, Tech. rept. NTIS AD 699616, Stanford Research Institute, Menlo Park.
Blake C.L., Merz C.J. (1998), Churn Data Set, UCI Repository of Machine Learning Databases, University of California, Department of Information and Computer Science, Irvine, CA, http://www.sgi.com/tech/mlc/db.
Breiman L. et al. (1984), Classification and Regression Trees, Wadsworth International Group, Belmont, CA.
Frank A., Asuncion A. (2010), UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science, http://archive.ics.uci.edu/ml
Friedman H.P., Rubin J. (1967), On Some Invariant Criteria for Grouping Data, "Journal of the American Statistical Association", Vol. 62, No. 320.
Hartigan J.A. (1975), Clustering Algorithms, Wiley, New York, London, Sydney, Toronto.
Hartigan J.A., Wong M.A. (1979), A K-means clustering algorithm, "Applied Statistics", Vol. 28, No. 1.
Łapczyński M., Surma J. (2012), Hybrid Predictive Models for Optimizing Marketing Banner Ad Campaign in On-line Social Network, (in:) Stahlbock R., Weiss G.M., Proceedings of the 2012 International Conference on Data Mining, CSREA Press, Las Vegas, Nevada, USA.
Lloyd S. (1982), Least squares quantization in PCM, "IEEE Transactions on Information Theory", Vol. 28, Iss. 2.
MacQueen J.B. (1967), Some Methods for Classification and Analysis of Multivariate Observations, "Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability", University of California Press.
Moro S., Laureano R., Cortez P. (2011), Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology, (in:) Novais P. et al. (eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, Guimarães, Portugal, October.
Steinhaus H. (1956), Sur la division des corps matériels en parties, "Bulletin de l'Académie Polonaise des Sciences", Vol. IV, No. 12.
van der Putten P., van Someren M. (eds.) (2000), CoIL Challenge 2000: The Insurance Company Case, Sentient Machine Research, Amsterdam; also a Leiden Institute of Advanced Computer Science Technical Report 2000-09, June 22.
Walesiak M., Dudek A. (2013), Package 'clusterSim', http://cran.r-project.org/web/packages/clusterSim/clusterSim.pdf [accessed: 15.01.2013].

Document type

Bibliography

Identifiers

YADDA identifier

bwmeta1.element.ekon-element-000171310031
