

2015 | 5 | 169--179
Article title

Transformation of Nominal Features Into Numeric in Supervised Multi-Class Problems Based on the Weight of Evidence Parameter

Title variants
Publication languages
EN
Abstracts
EN
Machine learning has received increased interest from both the scientific community and industry. Most machine learning algorithms rely on distance metrics that can only be applied to numeric data. This becomes a problem in complex datasets containing heterogeneous data with both numeric and nominal (i.e. categorical) features; hence the need to transform nominal data into numeric form. Weight of evidence (WoE) is one parameter that can be used to transform nominal features into numeric ones. In this paper we describe a method that uses WoE to transform the features. Although the applicability of this method has been researched to some extent, in this paper we extend it to multi-class problems, which is a novelty. We compare it with the method that generates dummy features, and test both methods on binary and multi-class classification problems with different machine learning algorithms. Our experiments show that the WoE-based transformation generates fewer features than the technique based on dummy-feature generation, while also improving classification accuracy, reducing memory complexity and shortening execution time. Be that as it may, we also point out some of its weaknesses and make recommendations on when to use the method based on dummy-feature generation instead. (original abstract)
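The WoE transform described in the abstract maps each category of a nominal feature to a single number, the log-odds of that category given the target; for multi-class problems it can be applied one-vs-rest, turning one nominal column into one numeric column per class. A minimal sketch of the idea, assuming a binary target and additive smoothing (the function names, the `eps` smoothing term and the one-vs-rest extension are illustrative assumptions, not the authors' exact formulation):

```python
import math
from collections import Counter

def woe_encode(values, labels, positive=1, eps=0.5):
    """Map each category c to WoE(c) = ln( P(c | positive) / P(c | negative) ).

    `eps` is an additive smoothing term that keeps categories seen in only
    one class from producing a division by zero or log of zero.
    """
    pos_total = sum(1 for y in labels if y == positive)
    neg_total = len(labels) - pos_total
    pos = Counter(v for v, y in zip(values, labels) if y == positive)
    neg = Counter(v for v, y in zip(values, labels) if y != positive)
    woe = {}
    for c in set(values):
        p = (pos[c] + eps) / (pos_total + eps)  # smoothed P(c | positive)
        n = (neg[c] + eps) / (neg_total + eps)  # smoothed P(c | negative)
        woe[c] = math.log(p / n)
    return woe

def woe_encode_multiclass(values, labels):
    """One-vs-rest extension: one WoE mapping (numeric column) per class,
    so a single nominal feature becomes k numeric features for k classes."""
    return {cls: woe_encode(values, labels, positive=cls)
            for cls in set(labels)}
```

Note that a nominal feature with m categories yields one numeric column (or k for k classes) regardless of m, whereas dummy-feature (one-hot) encoding yields m columns — which is the source of the feature-count, memory and runtime savings the abstract reports.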
Year
Volume
5
Pages
169--179
Physical description
Contributors
  • Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, Macedonia
  • Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, Macedonia
  • Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, Macedonia
  • Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, Macedonia
Bibliography
  • C. Shearer, "The crisp-dm model: the new blueprint for data mining," Journal of Data Warehousing, vol. 5, no. 4, pp. 13-19, 2000.
  • R. Anderson, The credit scoring toolkit: theory and practice for retail credit risk management and decision automation. Oxford: Oxford University Press, 2007. ISBN 9780199226405
  • M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The weka data mining software: An update," SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10-18, Nov. 2009. doi: 10.1145/1656274.1656278. [Online]. Available: http://doi.acm.org/10.1145/1656274.1656278
  • E. Tuv and G. Runger, "Scoring levels of categorical variables with heterogeneous data," Intelligent Systems, IEEE, vol. 19, no. 2, pp. 14-19, Mar 2004. doi: 10.1109/MIS.2004.1274906
  • M. Hofmann and R. Klinkenberg, Eds., RapidMiner: data mining use cases and business analytics applications, ser. Chapman & Hall/CRC data mining and knowledge discovery series. Boca Raton: CRC Press, 2014, no. 33. ISBN 9781482205497
  • T. W. Miller, Modeling techniques in predictive analytics: business problems and solutions with R. Upper Saddle River, New Jersey: Pearson Education, Inc, 2014. ISBN 9780133412932
  • M. Deza, Encyclopedia of Distances. Dordrecht; New York: Springer Verlag, 2009. ISBN 9783642002335
  • D. W. Goodall, "A new similarity index based on probability," Biometrics, vol. 22, no. 4, pp. 882-907, 1966. [Online]. Available: http://www.jstor.org/stable/2528080
  • C. Li and G. Biswas, "Unsupervised learning with mixed numeric and nominal data," Knowledge and Data Engineering, IEEE Transactions on, vol. 14, no. 4, pp. 673-690, Jul 2002. doi: 10.1109/TKDE.2002.1019208
  • S. Robertson, "Understanding inverse document frequency: on theoretical arguments for idf," Journal of Documentation, vol. 60, no. 5, pp. 503-520, 2004. doi: 10.1108/00220410410560582. [Online]. Available: http://dx.doi.org/10.1108/00220410410560582
  • H. C. Wu, R. W. P. Luk, K. F. Wong, and K. L. Kwok, "Interpreting tf-idf term weights as making relevance decisions," ACM Trans. Inf. Syst., vol. 26, no. 3, pp. 13:1-13:37, Jun. 2008. doi: 10.1145/1361684.1361686. [Online]. Available: http://doi.acm.org/10.1145/1361684.1361686
  • T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Machine Learning: ECML-98, ser. Lecture Notes in Computer Science, C. Nédellec and C. Rouveirol, Eds. Springer Berlin Heidelberg, 1998, vol. 1398, pp. 137-142. ISBN 978-3-540-64417-0. [Online]. Available: http://dx.doi.org/10.1007/BFb0026683
  • I. J. Good, Probability and the Weighing of Evidence. C. Griffin & Co., London, UK, 1950.
  • E. P. Smith, I. Lipkovich, and K. Ye, "Weight-of-evidence (WOE): Quantitative estimation of probability of impairment for individual and multiple lines of evidence," Human and Ecological Risk Assessment: An International Journal, vol. 8, no. 7, pp. 1585-1596, 2002. doi: 10.1080/20028091057493. [Online]. Available: http://dx.doi.org/10.1080/20028091057493
  • E. Zdravevski, P. Lameski, and A. Kulakov, "Weight of evidence as a tool for attribute transformation in the preprocessing stage of supervised learning algorithms," in Neural Networks (IJCNN), The 2011 International Joint Conference on, July 2011. doi: 10.1109/IJCNN.2011.6033219. ISSN 2161-4393, pp. 181-188.
  • E. Zdravevski, P. Lameski, A. Kulakov, and D. Gjorgjevikj, "Feature selection and allocation to diverse subsets for multi-label learning problems with large datasets," in Computer Science and Information Systems (FedCSIS), 2014 Federated Conference on, Sept 2014. doi: 10.15439/2014F500, pp. 387-394.
  • D. B. Suits, "Use of dummy variables in regression equations," Journal of the American Statistical Association, vol. 52, no. 280, 1957.
  • M. A. Hardy, Regression with Dummy Variables, ser. Sage university papers series. Newbury Park: Sage Publications, 1993, no. 07-093. ISBN 0803951280
  • N. Chater and M. Oaksford, Eds., The Probabilistic Mind: Prospects for Bayesian Cognitive Science. Oxford; New York: Oxford University Press, 2008. ISBN 9780199216093
  • E. Zdravevski, P. Lameski, and A. Kulakov, "Towards a general technique for transformation of nominal features into numeric features in supervised learning," in Proceedings of the 9th Conference for Informatics and Information Technology (CIIT 2012). Faculty of Computer Science and Engineering (FCSE) and Computer Society of Macedonia, 2012.
  • E. L. Allwein, R. E. Schapire, and Y. Singer, "Reducing multiclass to binary: A unifying approach for margin classifiers," The Journal of Machine Learning Research, vol. 1, pp. 113-141, 2001.
  • R. Rifkin and A. Klautau, "In defense of one-vs-all classification," J. Mach. Learn. Res., vol. 5, pp. 101-141, Dec. 2004. [Online]. Available: http://dl.acm.org/citation.cfm?id=1005332.1005336
  • E. Zdravevski, P. Lameski, and A. Kulakov, "Advanced transformations for nominal and categorical data into numeric data in supervised learning problems," in Proceedings of the 10th Conference for Informatics and Information Technology (CIIT 2013). Faculty of Computer Science and Engineering (FCSE) and Computer Society of Macedonia, 2013.
  • K. Bache and M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
  • J. Huang, J. Lu, and C. Ling, "Comparing naive bayes, decision trees, and svm with auc and accuracy," in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, Nov 2003. doi: 10.1109/ICDM.2003.1250975, pp. 553-556.
  • J. Huang and C. Ling, "Using auc and accuracy in evaluating learning algorithms," Knowledge and Data Engineering, IEEE Transactions on, vol. 17, no. 3, pp. 299-310, March 2005. doi: 10.1109/TKDE.2005.50
  • C. Ferri, J. Hernández-Orallo, and R. Modroiu, "An experimental comparison of performance measures for classification," Pattern Recognition Letters, vol. 30, no. 1, pp. 27-38, 2009. doi: 10.1016/j.patrec.2008.08.010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865508002687
  • R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, ser. IJCAI'95. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1995. ISBN 1-55860-363-8 pp. 1137-1143. [Online]. Available: http://dl.acm.org/citation.cfm?id=1643031.1643047
  • I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157-1182, Mar. 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=944919.944968
  • "Pacific-Asia knowledge discovery and data mining competition 2010," http://sede.neurotech.com.br/PAKDD2010/, accessed: 2015-06-05.
  • D. Olson, "Data set balancing," in Data Mining and Knowledge Management, ser. Lecture Notes in Computer Science, Y. Shi, W. Xu, and Z. Chen, Eds. Springer Berlin Heidelberg, 2005, vol. 3327, pp. 71-80. ISBN 978-3-540-23987-1. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-30537-8_8
Document type
Bibliography
Identifiers
YADDA identifier
bwmeta1.element.ekon-element-000171419340
