Bridging the gap within text-data analytics: a computer environment for data analysis in linguistic research
Keywords:
text-data analytics, corpus linguistics, text mining, software, DAMIEN
Abstract
Since computer technology became widely available at universities during the last quarter of the twentieth century, language researchers have successfully employed software to analyse usage patterns in corpora. However, although software has proliferated for different disciplines within text-data analytics (e.g. corpus linguistics, statistics, natural language processing and text mining), this article demonstrates that any computer environment intended to support advanced linguistic research more effectively should be grounded in a user-centred approach that holistically integrates cross-disciplinary methods and techniques in a linguist-friendly manner. To this end, I examine not only the tasks derived from linguists' needs and goals but also the technologies that deal appropriately with the properties of linguistic data. This research results in the implementation of DAMIEN, an online workbench designed to conduct linguistic experiments on corpora.
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
Revista de Lenguas para fines específicos is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.