Technologies

Building the Resource

Eurasian languages benefit from hundreds of years of traditional text-based historical-linguistic work. In the absence of a comparable tradition, digital methods are the best means available for systematically compiling and exploring the existing textual data on language families and language contact in Pre-Colonial South America.

The backbone of this project consists in bringing together most of the extant historical record for some 30 target languages, a task that builds on work already begun by local colleagues. To do so, we will visit and liaise with archives in Chile, Argentina and Europe to digitise some 150 documents and create machine-readable text from them.

Our next step is to carefully annotate the texts for metadata and linguistic features through lemmatisation, part-of-speech tagging, sound-spelling equivalences, and morphological tagging.

Using standard annotation categories from the Text Encoding Initiative (TEI, an XML-based encoding standard) and bespoke tools developed in-house, we will create a searchable database for the textual material.

These methods will allow the research team, and future users, to draw links between individual related features over time and across languages, turning back the clock as far as the data allow in order to probe the relationships between the languages.

Text Archiving

The project team will identify existing documentary material for the target languages, compiling it from published and archival sources.

Metadata Gathering

We will produce a detailed account of the context of documentation for all texts, including known native and non-native contributors, dates and locations.
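
As a rough illustration of what such a metadata record might contain, the following is a minimal Python sketch. The class names, field names and placeholder values are assumptions made for this example only; they do not describe the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Contributor:
    """A person involved in producing a text (hypothetical structure)."""
    name: str
    native_speaker: Optional[bool] = None  # None when this is not known


@dataclass
class TextMetadata:
    """Context of documentation for one text (hypothetical structure)."""
    text_id: str
    title: str
    # Dates are kept as strings because historical dates are often approximate.
    date: Optional[str] = None
    location: Optional[str] = None
    contributors: list[Contributor] = field(default_factory=list)

    def native_contributors(self) -> list[Contributor]:
        """Return only contributors known to be native speakers."""
        return [c for c in self.contributors if c.native_speaker]


# Placeholder values only, not real project data.
record = TextMetadata(
    text_id="text-001",
    title="Placeholder title",
    date="unknown",
    location="unknown",
    contributors=[
        Contributor("Contributor A", native_speaker=True),
        Contributor("Contributor B", native_speaker=False),
        Contributor("Contributor C"),  # native status unknown
    ],
)
native_names = [c.name for c in record.native_contributors()]
```

Keeping "unknown" distinct from "non-native" (here via `None`) avoids recording a guess as a fact when the sources are silent.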

Digitisation & OCR

Texts and metadata will be made machine-readable through direct transcription and semi-automated text recognition using Transkribus.

TEI Tagging

All target texts will be lemmatised and part-of-speech tagged. Where possible, further morphosyntactic and phonological tagging will be provided.
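
To give a sense of what lemma and part-of-speech tagging can look like in TEI, here is a minimal Python sketch that wraps tokens in TEI `<w>` elements carrying `lemma` and `pos` attributes. The helper name `tag_sentence`, the tag values and the token forms are placeholders assumed for this example; the project's actual tagset and in-house tools may differ.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements without a namespace prefix.
ET.register_namespace("", TEI_NS)


def tag_sentence(tokens):
    """Build a TEI <s> element from (form, lemma, pos) triples."""
    s = ET.Element(f"{{{TEI_NS}}}s")
    for form, lemma, pos in tokens:
        w = ET.SubElement(s, f"{{{TEI_NS}}}w", {"lemma": lemma, "pos": pos})
        w.text = form
    return s


# Placeholder tokens, not real language data.
tokens = [("formA", "lemmaA", "NOUN"), ("formB", "lemmaB", "VERB")]
sentence = tag_sentence(tokens)
xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)
```

Each word keeps its attested spelling as element text, while the normalised lemma and the part of speech sit in attributes, so the original orthography is never overwritten by the analysis.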

Database Construction

Tagged texts will be placed in a searchable database that will allow researchers and the general public to access full texts and search and correlate all the available TEI tags.
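
The kind of tag-based search described here can be sketched in a few lines of Python. This is an illustrative in-memory index only, assuming that words are encoded as TEI `<w>` elements with `lemma` and `pos` attributes; the function names `build_index` and `search`, the document identifiers and the token values are all hypothetical, and the real database may be built quite differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"


def build_index(documents):
    """Map (attribute, value) pairs to the tokens that carry them.

    Each token is recorded as (doc_id, position, form).
    """
    index = defaultdict(list)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for position, w in enumerate(root.iter(f"{TEI}w")):
            for attr in ("lemma", "pos"):
                value = w.get(attr)
                if value is not None:
                    index[(attr, value)].append((doc_id, position, w.text))
    return index


def search(index, **criteria):
    """Return the tokens that match every attribute=value criterion."""
    hits = [set(index.get(item, [])) for item in criteria.items()]
    return set.intersection(*hits) if hits else set()


# Placeholder documents, not real language data.
documents = {
    "doc1": '<text xmlns="http://www.tei-c.org/ns/1.0"><s>'
            '<w lemma="lemmaA" pos="NOUN">formA</w>'
            '<w lemma="lemmaB" pos="VERB">formB</w></s></text>',
    "doc2": '<text xmlns="http://www.tei-c.org/ns/1.0"><s>'
            '<w lemma="lemmaA" pos="NOUN">formC</w></s></text>',
}
index = build_index(documents)
nouns = search(index, pos="NOUN")
lemma_a_in_all = search(index, lemma="lemmaA", pos="NOUN")
```

Searching on a lemma rather than a surface form is what lets variant spellings of the same word, such as `formA` and `formC` above, be retrieved together across documents.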