Finno-Ugric Languages and the Internet

Finno-Ugric Languages and the Internet was a research project within the Kone Foundation Language Programme 2012-2016. The programme aimed to support the documentation and use of small Finno-Ugric languages, Finnish, and all minority languages in Finland. The project started at the beginning of 2013 and built an automated system that searches the Internet for text written in small Uralic languages. The text was processed into corpora and a list of links, and the collected corpora serve as source material for linguistic research. The research continued until the end of 2018.

The research was conducted in the Department of Digital Humanities at the University of Helsinki and was led by Research Director Krister Lindén. The project was funded by the Kone Foundation and supported by the National Library of Finland. It was also carried out as part of the international CLARIN cooperation, represented in Finland by the FIN-CLARIN consortium.

Web harvesting

During the project, the Internet was crawled to find sites written in small Uralic languages. For this purpose, a prototype of an automated system was built to maintain a list of links to the discovered sites. From this list, it is possible to build web portals through which the sites written in different languages can be reached more easily. In this way, we can help the users of endangered languages find each other and uphold their linguistic culture.

The Internet Archive has built a web crawler that harvests entire web sites. In contrast, the prototype built during this project harvested only a small part of the sites found. Furthermore, we wanted to store only the textual material of the Finno-Ugric sites, ignoring, for example, metadata and the numerous pictures and videos. To determine how existing open source web crawlers could be modified to suit these requirements, we studied how they operate. To harvest the Uralic sites, we modified and used Heritrix, an open source web crawler created by the Internet Archive.

The source code of the main components of the prototype was released under an open source license so that others can use it to build lists of links and text corpora in the languages of their choice.
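
The idea of keeping only the textual material of a page, as described above, can be illustrated with a short sketch. This is not the project's code (the prototype was based on Heritrix); it is a minimal Python example using only the standard library, and the names TextExtractor and extract_text are hypothetical. It assumes that dropping all tags and the contents of script, style and title elements is an acceptable approximation of "text only".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; tags, images, scripts and styles are dropped."""

    # Elements whose text content is not part of the running text (assumption).
    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the textual content of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Pictures and videos carry no text nodes, so they disappear without special handling; metadata stored in tag attributes is ignored for the same reason.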

Language identification

On the Internet, there are a large number of text documents with little or no metadata that might help identify the language used in the document. To identify the sites written in Uralic languages, we used a language identifier built to identify as many languages as possible.

The source code of the language identifier was released under an open source license.
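
To give a sense of how a language can be identified from the text alone, without metadata, the following is a minimal sketch of one common approach: comparing character trigram frequencies against per-language profiles. It is an assumption made for illustration and does not claim to reproduce the method of the project's identifier. The function names, the smoothing value floor and the two toy training sentences are all hypothetical; a usable identifier would be trained on far more text and far more languages.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples):
    """Build a relative-frequency trigram profile for each language.

    samples maps a language code to training text (toy data in this sketch).
    """
    models = {}
    for lang, text in samples.items():
        counts = Counter(char_ngrams(text))
        total = sum(counts.values())
        models[lang] = {gram: count / total for gram, count in counts.items()}
    return models


def identify(text, models, floor=1e-6):
    """Return the language whose profile gives the text the highest log-likelihood.

    Trigrams unseen in a profile get the small probability `floor` (assumption).
    Returns None when the text is too short to yield any trigram.
    """
    grams = char_ngrams(text)
    if not grams:
        return None
    best_lang, best_score = None, -math.inf
    for lang, model in models.items():
        score = sum(math.log(model.get(gram, floor)) for gram in grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data, for illustration only.
SAMPLES = {
    "fin": "tämä on suomenkielinen lause ja se on kirjoitettu suomeksi",
    "eng": "this is an english sentence and it is written in english",
}
```

With these profiles, identify("se on kirjoitettu suomeksi", train(SAMPLES)) selects "fin", because every trigram of the input occurs in the Finnish sample while most are absent from the English one.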

Language corpora

The automated system created sentence, clause and word corpora for each small Uralic language found on the Internet. The polishing and verification of the corpora can be automated with existing language technology methods. To make the text corpora available to all linguists, we published them in the Language Bank of Finland, developed by FIN-CLARIN.
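
As a rough illustration of how sentence and word corpora can be derived from harvested text, the sketch below splits text with simple regular expressions. It is an assumption for illustration, not the project's pipeline: the function names are hypothetical, sentence boundaries are guessed from final punctuation only, and clause splitting is left out because it needs language-specific rules.

```python
import re

# A sentence is assumed to end at ".", "!" or "?" followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A word is a run of letters or digits, optionally joined by "-" or "'".
WORD = re.compile(r"\w+(?:[-']\w+)*")


def split_sentences(text):
    """Split text into sentences at sentence-final punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def split_words(sentence):
    """Return the word tokens of a sentence, without punctuation."""
    return WORD.findall(sentence)


def build_corpora(text):
    """Return a sentence corpus and a word corpus for one text."""
    sentences = split_sentences(text)
    words = [word for s in sentences for word in split_words(s)]
    return {"sentences": sentences, "words": words}
```

Abbreviations and ordinal numbers written with a full stop would be split incorrectly by this rule, which is one reason the polishing and verification step mentioned above is needed.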