The representativeness threshold for the CETA subcorpus of the Coruña Corpus
Keywords:
representativeness, ReCor, specialized corpus, Zipf's Law, N-gram, Coruña Corpus, CETA, astronomy
Abstract
Representativeness is the main characteristic that distinguishes specialised corpora from other sets of texts. The Coruña Corpus of English Scientific Writing currently comprises four published subcorpora (astronomy, life sciences, history, and philosophy) plus three others under compilation (physics, chemistry and linguistics). In this paper we aim to assess, a posteriori, the lexical density of the text samples in CETA, the Corpus of English Texts on Astronomy, by means of the ReCor tool. The study is motivated by the following question: does quantitative representativeness analysis using ReCor provide, as a cross-check, further validation of previous research on the representativeness of CETA? Previous work (Crespo and Moskowich, 2010) has indicated that the CETA corpus is well designed and valid for the purposes for which it was intended. Here we suggest metrics to quantify these findings. The main contribution of this study is to offer quantitative results obtained with the ReCor tool, which allow data triangulation and thus help to ensure overall data quality. The results of the ReCor analysis support previous findings, and we are thus able to verify that CETA is indeed representative of the language of its time and register.
References
Biber, D. (1993). “Using Register-diversified Corpora for General Language Studies”. Computational Linguistics, 19 (2), 219-241.
Biber, D., Conrad, S. & Reppen, R. (1998a). Preface. In: D. Biber, S. Conrad & R. Reppen (eds.), Corpus Linguistics: Investigating Language Structure and Use (pp. ix-x). Cambridge: Cambridge University Press.
Biber, D., Conrad, S. & Reppen, R. (1998b). Introduction: Goals and Methods of the Corpus-based Approach. In: D. Biber, S. Conrad & R. Reppen (eds.), Corpus Linguistics: Investigating Language Structure and Use (pp. 1-18). Cambridge: Cambridge University Press.
Booth, A. D. (1967). “A Law of Occurrences for Words of Low Frequency”. Information and Control, 10 (4), 386-393.
Corpas, G. & Seghiri, M. (2010). “Size Matters: A Quantitative Approach to Corpus Representativeness”. In R. Rabadán (ed.), Lengua, traducción, recepción. En honor de Julio César Santoyo (pp. 112-146). Secretar: Universidad de Alicante.
Crespo, B. & Moskowich-Spiegel, I. (2010). “CETA in the Context of the Coruña Corpus”. Literary and Linguistic Computing, 25(2), 153-164.
Francis, W. N. (1982). Problems of Assembling and Computerizing Large Corpora. In S. Johansson (ed.) et al. Computer Corpora in English Language Research (pp. 7-24). Norway: Norwegian Computing Centre for the Humanities.
Moskowich-Spiegel, I., Lareo, I., Camiña, G. & Crespo, B. (comps.) (2012). Corpus of English Texts on Astronomy. Amsterdam: John Benjamins.
Moskowich-Spiegel, I. (2011). “The Golden Rule of Divine Philosophy: Exemplified in the Coruña Corpus of English Scientific Writing”. Revista de Lenguas para Fines Específicos, 17, 167-197.
Moskowich, I. & Crespo García, B. (eds.) (2012). Astronomy ‘playne and simple’: The Writing of Science between 1700 and 1900. Amsterdam: John Benjamins.
Moyotl-Hernández, E. & Macías-Pérez, M. (2016). “Método para autocompletar consultas basado en cadenas de Markov y la ley de Zipf”. Research in Computing Science, 115, 157-170.
Parapar, J. & Moskowich-Spiegel, I. (2007). “The Coruña Corpus Tool”. Revista de Procesamiento del Lenguaje Natural, 39, 289-290.
Sidorov, G. (2013). “N-gramas sintácticos no-continuos”. Polibits, 48, 69-78.
Seghiri, M. (2011). “Metodología protocolizada de compilación de un corpus de seguros de viajes: aspectos de diseño y representatividad”. Revista de Lingüística Teórica y Aplicada, 49 (2), 13-30.
Seghiri, M. (2014). “Too Big or not too Big: Establishing the Minimum Size for a Legal ad hoc Corpus”. Hermes: Journal of Language and Communication in Business, 27 (53), 85-98.
Seghiri, M. (2015). Determinación de la representatividad cuantitativa de un corpus ad hoc bilingüe (inglés-español) de manuales de instrucciones generales de lectores electrónicos. In M. T. Sánchez (ed.), Corpus-based Translation and Interpreting Studies: From description to application (pp. 125-146). Frankfurt: Frank & Timme.
Sinclair, J. (1991). Glossary. In: J. Sinclair (ed.) Corpus, Concordance, Collocation (pp. 169-176). Oxford: Oxford University Press.
Torruella, J. & Llisterri, J. (1999). Diseño de corpus textuales y orales. In: J. M. Blecua (ed.) et al. Filología e informática. Nuevas tecnologías en los estudios filológicos (pp. 45-77). Barcelona: Universidad Autónoma de Barcelona.
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
Revista de Lenguas para Fines Específicos is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.