The representativeness threshold for the CETA subcorpus of the Coruña Corpus

Elena Alfaya-Lamas; Menchu Garrote Espantoso

Authors

Elena Alfaya-Lamas Universidade da Coruña https://orcid.org/0000-0001-6628-6257
Menchu Garrote Espantoso Universidade da Coruña https://orcid.org/0000-0001-7918-2780

Keywords:

representativeness, ReCor, specialized corpus, Zipf's Law, N-gram, Coruña Corpus, CETA, astronomy

Abstract

The concept of representativeness is the main characteristic distinguishing specialised corpora from other sets of texts. The Coruña Corpus of English Scientific Writing currently comprises four published subcorpora (astronomy, life sciences, history, and philosophy) and three others under compilation (physics, chemistry, and linguistics). In this paper we assess, a posteriori, the lexical density of the text samples in CETA, the Corpus of English Texts on Astronomy, by means of the ReCor tool. The study is motivated by the following question: does quantitative representativeness analysis using ReCor provide, as a cross-check, further validation of previous research on the representativeness of CETA? Previous work (Crespo and Moskowich, 2010) has indicated that the CETA corpus is well designed and valid for the purposes for which it was intended. Here we suggest metrics to quantify these findings. The main contribution of this study is to offer quantitative results obtained with the ReCor tool, which allow data triangulation and thereby help ensure overall data quality. The results show that the ReCor analysis supports previous findings, and thus corroborate that CETA is representative of the language of its time and register.

Author Biographies

Elena Alfaya-Lamas, Universidade da Coruña

Elena Alfaya-Lamas obtained an MA in Germanic Philology from the University of Santiago de Compostela in 1994 and a PhD in English Historical Linguistics in 2002. From 1998 to 2000 she was a postgraduate worker and scholarship holder in the Department of Linguistics of the University of Edinburgh. In November 2001 she became an Associate Lecturer at CESUGA-University College Dublin. In October 2003 she obtained a position as an “Isidro Parga Pondal” researcher at the University of A Coruña, and in October 2004 she became a Lecturer and Researcher in the area of Information and Documentation Science at the same university.

Her main research interests are historical linguistics, cognitive linguistics, discourse analysis, gender studies and mind-consciousness studies. She studied for years with the Mindfulness Association and the Kagyu lineage, developing the range of skills necessary to teach Mindfulness and meeting the Mindfulness-Based Interventions: Teaching Assessment Criteria (MBI:TAC) of the Universities of Bangor, Exeter and Oxford. She currently co-heads the Mindfulness Association in Spain.

She teaches Informational Behaviour, Historical Archives and Records Management, Scientific Research Techniques and Digital and Information Management.

Menchu Garrote Espantoso, Universidade da Coruña

Menchu Garrote graduated in Information and Documentation from the Faculty of Humanities and Documentation of the Universidade da Coruña in 2019. She was awarded the Premio Extraordinario Fin de Estudios (Universidade da Coruña) and the Premio Excelencia Académica de Galicia (Xunta de Galicia). She is currently studying for the Master's Degree in Historical Heritage: Research and Management at the Toledo Campus of the Universidad de Castilla-La Mancha. Her first contact with research came when she obtained a collaboration grant for complementary training in the university departments of the UDC's own centres during the 2018/19 academic year. Supervised by Dr. Alfaya-Lamas, she began research on the Coruña Corpus, designed by the MUSTE research group of the UDC.

References

Biber, D. (1993). “Using Register-diversified Corpora for General Language Studies”. Computational Linguistics, 19 (2), 219-241.

Biber, D., Conrad, S. & Reppen, R. (1998a). Preface. In: D. Biber, S. Conrad & R. Reppen (eds.), Corpus Linguistics: Investigating Language Structure and Use (pp. ix-x). Cambridge: Cambridge University Press.

Biber, D., Conrad, S. & Reppen, R. (1998b). Introduction: Goals and Methods of the Corpus-based Approach. In: D. Biber, S. Conrad & R. Reppen (eds.), Corpus Linguistics: Investigating Language Structure and Use (pp. 1-18). Cambridge: Cambridge University Press.

Booth, A. D. (1967). “A Law of Occurrences for Words of Low Frequency”. Information and Control, 10 (4), 386-393.

Corpas, G. & Seghiri, M. (2010). “Size Matters: A Quantitative Approach to Corpus Representativeness”. In: R. Rabadán (ed.), Lengua, traducción, recepción. En honor de Julio César Santoyo (pp. 112-146). Secretar: Universidad de Alicante.

Crespo, B. & Moskowich-Spiegel, I. (2010). “CETA in the Context of the Coruña Corpus”. Literary and Linguistic Computing, 25(2), 153-164.

Francis, W. N. (1982). Problems of Assembling and Computerizing Large Corpora. In: S. Johansson (ed.) et al. Computer Corpora in English Language Research (pp. 7-24). Norway: Norwegian Computing Centre for the Humanities.

Moskowich-Spiegel, I., Lareo, I., Camiña, G. & Crespo, B. (comps.) (2012). Corpus of English Texts on Astronomy. Amsterdam: John Benjamins.

Moskowich-Spiegel, I. (2011). “The Golden Rule of Divine Philosophy: Exemplified in the Coruña Corpus of English Scientific Writing”. Revista de Lenguas para Fines Específicos, 17, 167-197.

Moskowich, I. & Crespo García, B. (eds.) (2012). Astronomy ‘playne and simple’: The Writing of Science between 1700 and 1900. Amsterdam: John Benjamins.

Moyotl-Hernández, E. & Macías-Pérez, M. (2016). “Método para autocompletar consultas basado en cadenas de Markov y la ley de Zipf”. Research in Computing Science, 115, 157-170.

Parapar, J. & Moskowich-Spiegel, I. (2007). “The Coruña Corpus Tool”. Revista de Procesamiento del Lenguaje Natural, 39, 289-290.

Sidorov, G. (2013). “N-gramas sintácticos no-continuos”. Polibits, 48, 69-78.

Seghiri, M. (2011). “Metodología protocolizada de compilación de un corpus de seguros de viajes: aspectos de diseño y representatividad”. Revista de Lingüística Teórica y Aplicada, 49 (2), 13-30.

Seghiri, M. (2014). “Too Big or not too Big: Establishing the Minimum Size for a Legal ad hoc Corpus”. Hermes: Journal of Language and Communication in Business, 27 (53), 85-98.

Seghiri, M. (2015). Determinación de la representatividad cuantitativa de un corpus ad hoc bilingüe (inglés-español) de manuales de instrucciones generales de lectores electrónicos. In: M. T. Sánchez (ed.), Corpus-based Translation and Interpreting Studies: From description to application (pp. 125-146). Frankfurt: Frank & Timme.

Sinclair, J. (1991). Glossary. In: J. Sinclair (ed.) Corpus, Concordance, Collocation (pp. 169-176). Oxford: Oxford University Press.

Torruella, J. & Llisterri, J. (1999). Diseño de corpus textuales y orales. In: J. M. Blecua (ed.) et al. Filología e informática. Nuevas tecnologías en los estudios filológicos (pp. 45-77). Barcelona: Universidad Autónoma de Barcelona.