2015 | 6 | 25--30
Article title

The Serialization of Heterogeneous Documents

Title variants
Publication languages
EN
Abstracts
EN
Tasks involving the analysis of natural language are typically conducted on a corpus or corpora of plain text. However, it is rare that a document is unstructured and freeform in its entirety. Documents such as corporate disclosures, medical journals and other knowledge-rich archives contain structured and loosely structured information that can be used in a variety of important text mining tasks. In this paper we propose a syntactical preprocessing architecture to serialize presentation-oriented documents to a machine-readable format that aspires to preserve the document structure, contents and metadata. We introduce a hybrid pipeline architecture, discussing the various processes and the future research direction that could potentially lead to a holistic representation of heterogeneous documents. (original abstract)
Year
Volume
6
Pages
25--30
Physical description
Authors
  • Ulster University, Jordanstown, United Kingdom
  • Ulster University, Jordanstown, United Kingdom
author
  • Ulster University, Jordanstown, United Kingdom
Bibliography
  • Comeau, D. C., Liu, H., Doğan, R. I., & Wilbur, W. J. (2014). Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus.
  • Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55-60).
  • Liu, M., Xu, W., Ran, Q., & Li, Y. (2015). Using Natural Language Processing Technology to Analyze Teachers' Written Feedback on Chinese Students' English Essays.
  • Douglas, S., Hurst, M., & Quinn, D. (1995). Using natural language processing for identifying and interpreting tables in plain text. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (pp. 535-546).
  • Clark, C., & Divvala, S. Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers.
  • Ding, L., Zhou, L., Finin, T., & Joshi, A. (2005). How the semantic web is being used: An analysis of FOAF documents. In System Sciences, 2005. HICSS'05. Proceedings of the 38th Annual Hawaii International Conference on (pp. 113c-113c). IEEE.
  • Li, X., Li, F., & Chen, X. (2015, April). Distributed GIS framework design based on XML and Web Service. In 2015 International Conference on Intelligent Systems Research and Mechatronics Engineering. Atlantis Press.
  • Hwang, C. G., Yoon, C. P., & Lee, D. (2015). Exchange of Data for Big Data in Hybrid Cloud Environment.
  • Niu, Z., Yang, C., & Zhang, Y. (2014). A design of cross-terminal web system based on JSON and REST. In Software Engineering and Service Science (ICSESS), 2014 5th IEEE International Conference on (pp. 904-907). IEEE.
  • Smith, B. (2015). Creating JSON. In Beginning JSON (pp. 49-67). Apress.
  • Ben-Kiki, O., Evans, C., & Ingerson, B. (2005). YAML Ain't Markup Language (YAML™) Version 1.1. yaml.org, Tech. Rep.
  • Eriksson, M., & Hallberg, V. (2011). Comparison between JSON and YAML for data serialization. The School of Computer Science and Engineering Royal Institute of Technology.
  • Bird, S. (2006). NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 69-72). Association for Computational Linguistics.
  • Smutz, C., & Stavrou, A. (2012, December). Malicious PDF detection using metadata and structural features. In Proceedings of the 28th Annual Computer Security Applications Conference (pp. 239-248). ACM.
  • Khusro, S., Latif, A., & Ullah, I. (2014). On methods and tools of table detection, extraction and annotation in PDF documents. Journal of Information Science.
Document type
Bibliography
Identifiers
YADDA identifier
bwmeta1.element.ekon-element-000171422708
