The Use of Data Mining Models in Solving the Problem of Imbalanced Classes Based on the Example of an Online Marketing Campaign

Łapczyński, Mariusz; Surma, Jerzy

doi:10.15611/ekt.2015.3.01

Article details

Journal

Ekonometria / Uniwersytet Ekonomiczny we Wrocławiu

2015 | No. 3 (49) | pp. 9-19

Article title

The Use of Data Mining Models in Solving the Problem of Imbalanced Classes Based on the Example of an Online Marketing Campaign

Authors

Mariusz Łapczyński, Jerzy Surma

Content

Full texts:

http://www.dbc.wroc.pl/publication/35076 [remote]

Title variants

Wykorzystanie modeli data mining w rozwiązywaniu problemu niezrównoważonych klas na przykładzie kampanii marketingowych w Internecie

Publication languages

Abstracts

When building predictive models in analytical CRM, researchers often encounter the problem of imbalanced classes (a skewed distribution of the dependent variable): the number of observations in one category of the dependent variable is much lower than the number in the other category. The problem arises in areas such as churn analysis, customer acquisition models, and cross-selling and up-selling models. The purpose of the paper is to present a predictive model built to predict the response of Internet users to banner advertising. The dataset used in the study came from an online social network that offers advertisers banner campaigns targeting its users. The advertising campaign of a cosmetics company was carried out in the autumn of 2010 and was mainly targeted at young women. Each user of the service was described by 115 independent variables: 3 demographic variables (sex, age, education) and 112 variables describing the user's online activity. The problem of imbalanced classes arose during model building because few users clicked on the banner ad: of 81,000 cases, only 207 were positive reactions to the banner, approximately 0.25% of all cases. Two popular data mining tools were used in the study: C&RT decision trees and Random Forest. The second goal of the paper is to compare the performance of the predictive models built with these two tools. (original abstract)

While building predictive models in analytical CRM, researchers very often encounter the problem of imbalanced classes (imbalanced samples): the number of observations belonging to one category of the dependent variable is much smaller than the number belonging to the other category. This applies to areas such as churn analysis, customer acquisition, and cross-selling and up-selling. The purpose of the article is to present a predictive model that, based on the available explanatory variables, estimates the probability that an Internet user will respond to a banner ad. The dataset used in the study comes from an online social network that routinely offers advertisers banner campaigns aimed at its users. In the autumn of 2010, an advertising campaign for a cosmetics company was carried out, aimed mainly at young women. A user of the service was described by 115 explanatory variables: 3 demographic variables (sex, age, education) and 112 variables characterizing the user's activity on the site. During the analysis, a problem of strongly imbalanced samples arose, caused by the small share of users who clicked on the ad. There were only 207 positive responses to the banner out of more than 81 thousand page views, so category "1" accounted for approximately 0.25% of the dependent variable. The study used two popular data mining tools: C&RT classification trees and Random Forest. (original abstract, translated from Polish)
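
The scale of the imbalance quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the two figures given in the abstract (about 81,000 cases and 207 positive responses), computes the positive-class share, and shows why plain accuracy is a poor yardstick here. The function names and the inverse-frequency weighting are assumptions introduced for this example; the abstract does not say how the authors rebalanced the classes.

```python
"""Class-imbalance arithmetic for the figures quoted in the abstract."""

# Figures from the abstract; the case count is a rounded value there.
N_CASES = 81_000   # observed cases
N_POSITIVE = 207   # clicks on the banner (positive class)


def positive_rate(n_positive, n_cases):
    """Share of cases that belong to the positive (minority) class."""
    return n_positive / n_cases


def majority_baseline_accuracy(n_positive, n_cases):
    """Accuracy of a trivial model that always predicts 'no click'."""
    return (n_cases - n_positive) / n_cases


def inverse_frequency_weights(n_positive, n_cases):
    """Per-class weights n_cases / (2 * class_count).

    Hypothetical rebalancing choice for illustration only; it is not
    taken from the paper.
    """
    n_negative = n_cases - n_positive
    return {0: n_cases / (2 * n_negative), 1: n_cases / (2 * n_positive)}


if __name__ == "__main__":
    rate = positive_rate(N_POSITIVE, N_CASES)
    baseline = majority_baseline_accuracy(N_POSITIVE, N_CASES)
    weights = inverse_frequency_weights(N_POSITIVE, N_CASES)
    print(f"positive rate: {rate:.4%}")
    print(f"always-negative accuracy: {baseline:.4%}")
    print(f"class weights: {weights}")
```

A model that never predicts a click is right for more than 99.7% of cases while finding none of the 207 responders, which is why imbalanced problems of this kind are usually judged on measures tied to the minority class rather than on overall accuracy.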

Keywords

Econometrics; Data Mining; Classification trees; Decision tree

Ekonometria; Data Mining; Drzewa klasyfikacyjne; Drzewo decyzyjne

Journal

Ekonometria / Uniwersytet Ekonomiczny we Wrocławiu

Year

2015

Issue

No. 3 (49)

Pages

9-19

Physical description

Contributors

author

Mariusz Łapczyński

Cracow University of Economics, Poland

author

Jerzy Surma

Warsaw School of Economics, Poland

Bibliography

Breiman L., 2001, Random Forests, Machine Learning, 45, Kluwer Academic Publishers, pp. 5-32.
Breiman L., Friedman J.H., Olshen R.A., Stone C.J., 1984, Classification and Regression Trees, Chapman and Hall, London.
Breiman L., Cutler A., Random Forests, paper downloaded from stat-www.berkeley.edu, (15.10.2007).
Buntine W., 1993, Tree classification software, NASA, Washington, Technology 2002: The Third National Technology Transfer Conference and Exposition, Volume 1, pp. 289-298.
Chen C., Liaw A., Breiman L., 2004, Using random forest to learn unbalanced data, Technical Report, No. 666, Statistics Department, University of California at Berkeley.
Chipman H.A., George E.I., McCulloch R.E., 1998, Bayesian CART model search, Journal of the American Statistical Association, September, Vol. 93, No. 443, pp. 935-960.
Chiu S., Tavella D., 2008, Data Mining and Market Intelligence for Optimal Marketing Returns, Elsevier, Amsterdam.
Crawford S.L., 1989, Extension to the CART algorithm, International Journal of Man-Machine Studies, Vol. 31, pp. 197-217.
Goldfarb A., Tucker C., 2011, Online display advertising: targeting and intrusiveness, Marketing Science, Vol. 30, No. 3, May-June, pp. 389-404.
Hollis N., 2005, Ten years of learning on how online advertising builds brands, Journal of Advertising Research, Vol. 45, No. 2, pp. 255-268.
Ling C.X., Sheng V.S., 2008, Cost-Sensitive Learning and the Class Imbalance Problem, [in:] Encyclopedia of Machine Learning, ed. C. Sammut, Springer Verlag, Berlin, pp. 167-168.
Loh W-Y., Vanichsetakul N., 1988, Tree-structured classification via generalized discriminant analysis, Journal of the American Statistical Association, September, Vol. 83, No. 403, pp. 715-725.
Raskutti B., Kowalczyk A., 2004, Extreme rebalancing for SVMs: a case study, SIGKDD Explorations, Vol. 6, Issue 1, pp. 60-69.
Surma J., Furmanek A., 2011, Data mining in on-line social network for marketing response analysis, The Third IEEE International Conference on Social Computing (SocialCom2011), MIT, Cambridge, pp. 537-540.

Document type

Bibliography

Identifiers

DOI

10.15611/ekt.2015.3.01

YADDA identifier

bwmeta1.element.ekon-element-000171409819
