Finno-Ugric Languages and the Internet

Finno-Ugric Languages and the Internet is a research project within the Kone Foundation Language Programme 2012-2016. The programme supports the documentation and use of small Finno-Ugric languages, Finnish, and all minority languages in Finland. The project started at the beginning of 2013, and its aim is to build an automated system that searches the Internet for text written in small Uralic languages. The text found will be processed into corpora and a list of links. The corpora will also serve as source material for linguistic research.

The research is conducted in the Department of Modern Languages at the University of Helsinki and is led by Research Director Krister Lindén. The project is funded by the Kone Foundation and supported by the National Library of Finland. It is also carried out as part of the international CLARIN cooperation, which is represented in Finland by the FIN-CLARIN consortium.

Web harvesting

During the project, the Internet will be crawled to find sites written in small Uralic languages. For this purpose, a prototype of an automated system is being built to maintain a list of links to the discovered sites. From this list, it is possible to build web portals through which the sites written in each language can be reached more easily. In this way, we can help the users of endangered languages to find each other and to uphold their linguistic culture.

The Internet Archive has built a web crawler that harvests entire web sites. In contrast, the prototype that will be built during this project will harvest only a small part of the sites found. Furthermore, we want to store only the textual material of the Finno-Ugric sites, ignoring, for example, metadata and the numerous pictures and videos. To determine how existing open source web crawlers could be modified to suit these requirements, we studied how they operate. To harvest Uralic sites, we modify and use Heritrix, an open source web crawler created by the Internet Archive.
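
The idea of keeping only the textual material of a page can be illustrated with a minimal Python sketch. It is not the project's Heritrix-based implementation; the class name TextOnlyExtractor and the function extract_text are hypothetical, and the sketch assumes that dropping script and style content and all tag attributes is an acceptable approximation of "text only".

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Collects text content and skips script and style blocks (hypothetical helper)."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        # Attributes (image sources, meta content, ...) are ignored entirely.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        '<html><head><meta name="x" content="y">'
        "<style>p{color:red}</style></head>"
        '<body><p>Tere tulemast</p><img src="a.png">'
        "<script>var a=1;</script><p>Hyvää päivää</p></body></html>"
    )
    print(extract_text(page))
```

A real crawler would additionally have to deal with character encodings, navigation menus and other repeated page furniture, none of which this sketch attempts.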

The source code of the prototype will eventually be released under an open source license so that others can use it to build lists of links and text corpora in the languages of their choice. We will also negotiate with organizations that could maintain the system after the end of the project.

Language identification

On the Internet, there are a large number of text documents that contain little or no metadata that might help identify the language used in the document. To identify the sites written in Uralic languages, we will use a language identifier, which will be built to identify as many languages as possible. Building the language models and evaluating the performance of the language identifier requires text corpora in all the languages that one wishes to identify. As part of this project, we will survey existing text corpora from which language models could be built. When searching for the language corpora, we will utilize the information on language resources provided by both the Virtual Language Observatory of CLARIN and the META-SHARE infrastructure. We also aspire to increase the information they hold on the Uralic languages.
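
How language models built from corpora can be used to identify the language of a text can be sketched as follows. The text above does not describe the method the project's identifier uses; this sketch assumes a simple character trigram model with add-one smoothing, which is one common approach. The function names and the one-sentence training samples are illustrative assumptions, and real models would need far larger corpora.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lower-case the text, normalise whitespace and return its character n-grams."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_model(corpus_text, n=3):
    """A language model here is just a table of n-gram counts."""
    return Counter(char_ngrams(corpus_text, n))


def score(text, model, vocab_size, n=3):
    """Log-probability of the text under the model, with add-one smoothing."""
    total = sum(model.values())
    return sum(
        math.log((model[gram] + 1) / (total + vocab_size))
        for gram in char_ngrams(text, n)
    )


def identify(text, models, n=3):
    """Return the label of the model that gives the text the highest score."""
    vocabulary = set()
    for model in models.values():
        vocabulary.update(model)
    vocab_size = len(vocabulary) + 1  # one extra slot for unseen n-grams
    return max(models, key=lambda lang: score(text, models[lang], vocab_size, n))


# Toy training data (assumption: a single sentence per language, for illustration only).
MODELS = {
    "fin": train_model("tämä on suomen kieltä ja se on kaunis kieli"),
    "est": train_model("see on eesti keel ja see on ilus keel"),
    "hun": train_model("ez magyar nyelv és ez szép nyelv"),
}

if __name__ == "__main__":
    for sample in ("suomen kieli on kaunis", "eesti keel on ilus", "szép magyar nyelv"):
        print(sample, "->", identify(sample, MODELS))
```

Evaluating such an identifier requires held-out text in every language it covers, which is why corpora are needed for evaluation as well as for building the models.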

The source code of the language identifier will be released under an open source license and will be made available to all national libraries as well as to research projects that use language technology methods.

Language corpora

The automated system will create sentence, clause and word corpora for each small Uralic language found on the Internet. The polishing and verification of the corpora can be automated using existing methods of language technology. To make the text corpora available to all linguists, we aim to publish them in the Language Bank of Finland developed by FIN-CLARIN. The corpora will, if possible, be published under a Creative Commons CC0 license.
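
As a rough illustration of how sentence and word corpora could be derived from harvested text, the following Python sketch splits documents into sentences, drops exact duplicate sentences and counts word forms. The splitting rules are naive assumptions (sentence boundaries at ., ! or ? followed by whitespace; words as runs of Unicode word characters, which also admits digits), the function names are hypothetical, and clause segmentation is left out because it needs language-specific rules.

```python
import re
from collections import Counter

# Naive rules, chosen for illustration only.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
WORD = re.compile(r"\w+(?:[-']\w+)*")


def split_sentences(text):
    """Split text at whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def build_corpora(documents):
    """Return (unique sentences in first-seen order, word frequency table)."""
    seen = set()
    sentences = []
    word_counts = Counter()
    for document in documents:
        for sentence in split_sentences(document):
            if sentence in seen:
                continue  # skip verbatim repeats, e.g. text copied between pages
            seen.add(sentence)
            sentences.append(sentence)
            word_counts.update(w.lower() for w in WORD.findall(sentence))
    return sentences, word_counts


if __name__ == "__main__":
    docs = ["Tere! See on eesti keel.", "See on eesti keel. Hyvää päivää."]
    sentences, words = build_corpora(docs)
    print(sentences)
    print(words.most_common(3))
```

Removing duplicates at the sentence level is one simple form of the polishing mentioned above; abbreviations and other cases where a full stop does not end a sentence would need more careful handling.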

Statistics

Part of the work of the research project is to produce statistics about the distribution of Uralic languages on the Internet. The results collected during the development phase may not be consistent, but eventually, when the system is operating, we can track changes in, for example, the number of Uralic sites over the years.
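
Tracking the number of sites per language over the years amounts to simple counting once each discovered site is recorded with a year and a language. The sketch below assumes a hypothetical record format of (year, language label, site) tuples; the function names, the language labels and the example.org host names are made up for illustration and do not come from the project.

```python
from collections import defaultdict


def sites_per_language_by_year(records):
    """Count distinct sites for each (language, year) pair.

    records: iterable of (year, language_label, site) tuples (assumed format).
    """
    sites = defaultdict(set)
    for year, language, site in records:
        sites[(language, year)].add(site)
    return {key: len(found) for key, found in sites.items()}


def yearly_change(counts, language, year):
    """Difference in site count between the given year and the previous one."""
    return counts.get((language, year), 0) - counts.get((language, year - 1), 0)


if __name__ == "__main__":
    # Fabricated example records; the language labels are placeholders.
    records = [
        (2014, "lang_a", "a1.example.org"),
        (2014, "lang_a", "a2.example.org"),
        (2014, "lang_b", "b1.example.org"),
        (2015, "lang_a", "a1.example.org"),
        (2015, "lang_a", "a2.example.org"),
        (2015, "lang_a", "a3.example.org"),
        (2015, "lang_a", "a3.example.org"),  # the same site seen twice counts once
    ]
    counts = sites_per_language_by_year(records)
    print(counts)
    print(yearly_change(counts, "lang_a", 2015))
```

Counting distinct sites rather than harvested pages keeps the figures comparable between crawls that visit a site a different number of times.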