Bridging the gap within text-data analytics: a computer environment for data analysis in linguistic research

Carlos Periñán-Pascual

Authors

Carlos Periñán-Pascual, Universitat Politècnica de València

Keywords:

text-data analytics, corpus linguistics, text mining, software, DAMIEN

Abstract

Since computer technology became widely available at universities during the last quarter of the twentieth century, language researchers have successfully employed software to analyse usage patterns in corpora. However, although software has proliferated for different disciplines within text-data analytics, e.g. corpus linguistics, statistics, natural language processing and text mining, this article demonstrates that any computer environment intended to support advanced linguistic research more effectively should be grounded in a user-centred approach that holistically integrates cross-disciplinary methods and techniques in a linguist-friendly manner. To this end, I examine not only the tasks that derive from linguists' needs and goals but also the technologies that appropriately deal with the properties of linguistic data. This research results in the implementation of DAMIEN, an online workbench designed to conduct linguistic experiments on corpora.

Author Biography

Carlos Periñán-Pascual, Universitat Politècnica de València

Carlos Periñán-Pascual studied English Language and Literature at Universitat de València and received his Ph.D. degree in English Philology at UNED in Madrid (Spain). Since his doctoral dissertation on the resolution of word-sense disambiguation in machine translation, his main research interests have included knowledge engineering, natural language understanding and computational linguistics. More particularly, his research has been focused on the cognitive and computational treatment of lexical information, constructional meaning, conceptual representation, and reasoning, among many other tasks. Since 2004, he has been the director of FunGramKB, a lexico-conceptual knowledge base, together with a suite of tools, for the automatic processing of language. His scientific production includes over 50 peer-reviewed publications in the fields of linguistics, natural language processing and artificial intelligence. He has been the principal investigator in four funded research projects as well as the chair of the organizing committee in many scientific events, including international workshops and conferences. He is currently an associate professor in the Applied Linguistics Department at Universitat Politècnica de València, Spain.
