2020 | vol. 20, iss. 1 | 232--247
Article title

The Influence of Unbalanced Economic Data on Feature Selection and Quality of Classifiers

Authors
Publication languages
EN
Abstracts
EN
Research background: The successful learning of classifiers depends on the quality of the data. Modeling is especially difficult when the data are unbalanced or contain many irrelevant variables. This is the case in many applications where the classification of rare events is the overarching goal, e.g. in bankruptcy prediction, churn analysis or fraud detection. The problem of irrelevant variables accompanies situations where the specification of the model is not known a priori, i.e. typical conditions for data mining analysts.
Purpose: The purpose of this paper is to compare combinations of the most popular strategies for handling unbalanced data with feature selection methods representing filters, wrappers and embedded methods.
Research methodology: In the empirical study, we use real datasets with additionally introduced irrelevant variables. In this way, we are able to recognize which methods correctly eliminate the irrelevant variables.
Results: Having carried out the experiment, we conclude that over-sampling does not work in combination with feature selection. Some recommendations of the most promising methods are also given.
Novelty: Many solutions have been proposed in the literature concerning both unbalanced data and feature selection; the innovative focus of our interest is to examine their interactions. (original abstract)
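The experimental design described in the abstract, adding irrelevant noise variables to a dataset, balancing the classes by over-sampling, and then checking whether a filter-style feature selection method still ranks the informative variable first, can be sketched in plain Python. All names, data and the scoring criterion here are illustrative assumptions, not the paper's actual datasets or methods:

```python
import random

random.seed(0)

# Toy data (illustrative): one informative feature plus two irrelevant
# noise features, with roughly a 9:1 class imbalance.
def make_data(n=200, minority_frac=0.1):
    X, y = [], []
    for i in range(n):
        label = 1 if i < n * minority_frac else 0
        informative = random.gauss(2.0 if label else 0.0, 1.0)
        noise1 = random.gauss(0.0, 1.0)
        noise2 = random.gauss(0.0, 1.0)
        X.append([informative, noise1, noise2])
        y.append(label)
    return X, y

# Random over-sampling: duplicate minority rows until the classes balance.
def oversample(X, y):
    minority = [(x, t) for x, t in zip(X, y) if t == 1]
    majority = [(x, t) for x, t in zip(X, y) if t == 0]
    extra = [random.choice(minority)
             for _ in range(len(majority) - len(minority))]
    rows = majority + minority + extra
    return [r[0] for r in rows], [r[1] for r in rows]

# Filter-style feature score: absolute difference of class means per
# feature (a crude univariate relevance criterion, for illustration only).
def filter_scores(X, y):
    scores = []
    for j in range(len(X[0])):
        col1 = [x[j] for x, t in zip(X, y) if t == 1]
        col0 = [x[j] for x, t in zip(X, y) if t == 0]
        scores.append(abs(sum(col1) / len(col1) - sum(col0) / len(col0)))
    return scores

X, y = make_data()
Xb, yb = oversample(X, y)
scores = filter_scores(Xb, yb)
# The informative feature (index 0) should outscore the noise features.
```

In this sketch the irrelevant variables are known by construction, which mirrors the paper's idea of injecting irrelevant variables into real datasets so that the correctness of each feature selection method can be judged directly.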
Pages
232--247
Contributors
  • Opole University of Technology, Poland
Bibliography
  • Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
  • Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
  • Chawla, N.V., Japkowicz, N., Kołcz, A. (2004). Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6 (1), 1-6.
  • Chen, C., Liaw, A., Breiman, L. (2004). Using random forest to learn imbalanced data. University of California, Berkeley, 110, 1-12.
  • Dua, D., Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. Retrieved from: http://archive.ics.uci.edu/ml (17.06.2019).
  • Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861-874.
  • Fayyad, U., Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (pp. 1022-1027).
  • Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F. (2011). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42 (4), 463-484.
  • Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (2006). Feature Extraction: Foundations and Applications. New York: Springer.
  • Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46, 389-422.
  • Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220-239.
  • Japkowicz, N., Shah, M. (2011). Evaluating learning algorithms: a classification perspective. Cambridge University Press.
  • King, G., Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9, 137-163.
  • Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. Proceedings of European Conference on Machine Learning (pp. 171-182).
  • Kubus, M. (2015). Rekurencyjna eliminacja cech w metodach dyskryminacji. Prace Naukowe Uniwersytetu Ekonomicznego we Wrocławiu, 384. Taksonomia, 24, 154-162. DOI: 10.15611/pn.2015.384.16.
  • Kubus, M. (2016). Lokalna ocena mocy dyskryminacyjnej zmiennych. Prace Naukowe Uniwersytetu Ekonomicznego we Wrocławiu, 427, Taksonomia 27, 143-152. DOI: 10.15611/pn.2016.427.15.
  • Longadge, R., Dongre, S.S., Malik, L. (2013). Class Imbalance Problem in Data Mining: Review. International Journal of Computer Science and Network, 2 (1), 83-87.
  • Menardi, G., Torelli, N. (2014). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28, 92-122.
  • Pociecha, J., Pawełek, B., Baryła, M., Augustyn, S. (2014). Statystyczne metody prognozowania bankructwa w zmieniającej się koniunkturze gospodarczej. Kraków: Fundacja Uniwersytetu Ekonomicznego w Krakowie.
  • Tomek, I. (1976). Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6, 769-772.
  • Tsamardinos, I., Aliferis, C.F. (2003). Towards principled feature selection: relevancy, filters and wrappers. In Proceedings of the Workshop on Artificial Intelligence and Statistics.
  • Weiss, G. (2004). Mining with rarity: A unifying framework. SIGKDD Explorations, 6 (1), 7-19.
  • Yu, L., Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, 1205-1224.
  • Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67 (2), 301-320.
Document type
Bibliography
Identifiers
YADDA identifier
bwmeta1.element.ekon-element-000171599277