2015 | 6 | 25--30
Article title

The Serialization of Heterogeneous Documents

Title variants
Publication languages
Tasks involving the analysis of natural language are typically conducted on a corpus or corpora of plain text. However, it is rare that a document is unstructured and freeform in its entirety. Documents such as corporate disclosures, medical journals, and other knowledge-rich archives contain structured and loosely structured information that can be used in a variety of important text mining tasks. In this paper we propose a syntactical preprocessing architecture to serialize presentation-oriented documents to a machine-readable format that aspires to preserve the document structure, contents, and metadata. We introduce a hybrid pipeline architecture, discussing the various processes and the future research direction that could potentially lead to a holistic representation of heterogeneous documents. (original abstract)
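To make the idea of a serialization that preserves structure, contents, and metadata more concrete, the following is a minimal Python sketch. It is not the authors' pipeline: the class names (`Document`, `Section`, `Table`), the `serialize` function, the choice of JSON as the target format, and the sample values are all assumptions made purely for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class Table:
    """A structured element: caption, header row, and data rows (hypothetical schema)."""
    caption: str
    header: List[str]
    rows: List[List[str]]


@dataclass
class Section:
    """A document section holding free text and any tables found within it."""
    heading: str
    paragraphs: List[str] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)


@dataclass
class Document:
    """Top-level container: document-level metadata plus an ordered list of sections."""
    metadata: Dict[str, str]
    sections: List[Section] = field(default_factory=list)


def serialize(doc: Document) -> str:
    """Convert the nested dataclasses to plain dicts and lists, then dump them as JSON."""
    return json.dumps(asdict(doc), indent=2, ensure_ascii=False)


# Illustrative input only; the values below are made up for the example.
doc = Document(
    metadata={"title": "Annual report", "source_format": "pdf"},
    sections=[
        Section(
            heading="Results",
            paragraphs=["Revenue grew."],
            tables=[
                Table(
                    caption="Revenue by year",
                    header=["Year", "Revenue"],
                    rows=[["2014", "10"], ["2015", "12"]],
                )
            ],
        )
    ],
)

serialized = serialize(doc)
restored = json.loads(serialized)
```

Keeping tables as explicit header and row lists, rather than flattening them into text, is one way the structured parts of a document could stay recoverable after serialization; a YAML or XML target would follow the same nesting.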
Physical description
  • Ulster University, Jordanstown, United Kingdom
  • Ulster University, Jordanstown, United Kingdom
  • Ulster University, Jordanstown, United Kingdom
Document type
YADDA identifier