Bridging the gap within text-data analytics: a computer environment for data analysis in linguistic research
Keywords:
text-data analytics, corpus linguistics, text mining, software, DAMIEN
Abstract
Since computer technology became widely available at universities during the last quarter of the twentieth century, language researchers have successfully employed software to analyse usage patterns in corpora. However, despite the proliferation of software for different disciplines within text-data analytics (e.g. corpus linguistics, statistics, natural language processing and text mining), this article demonstrates that any computer environment intended to support advanced linguistic research more effectively should be grounded in a user-centred approach that holistically integrates cross-disciplinary methods and techniques in a linguist-friendly manner. To this end, I examine not only the tasks derived from linguists' needs and goals but also the technologies that appropriately deal with the properties of linguistic data. This research results in the implementation of DAMIEN, an online workbench designed to conduct linguistic experiments on corpora.
Revista de Lenguas para fines específicos is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.