To webmasters

Finno-Ugric Languages and the Internet is a research project that crawls the Internet to find sites written in Uralic languages. To harvest the texts from such sites, we use Heritrix, the web crawler created by the Internet Archive. Our web crawler is therefore called sukibot_heritrix.

We do not intend to disturb the operation of any web server, and our crawler obeys robots.txt. If a problem arises nevertheless, we are happy to add your site to our "do not crawl" list. You can contact us by email at either heidi.jauhiainen (at) helsinki.fi or tommi.jauhiainen (at) helsinki.fi.
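For webmasters who would rather exclude the crawler through robots.txt, the following is a minimal sketch of how such a rule could be checked locally with Python's standard urllib.robotparser module. The robots.txt content, the example.org URLs, and the helper name is_allowed are illustrative assumptions; in particular, it is assumed here that the crawler is matched in robots.txt by the token sukibot_heritrix, which this page does not state explicitly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only. The user-agent token
# "sukibot_heritrix" is assumed to be the one the crawler matches on.
ROBOTS_TXT = """\
User-agent: sukibot_heritrix
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("sukibot_heritrix", "otherbot"):
        for url in ("https://example.org/page.html",
                    "https://example.org/private/notes.html"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

This only shows how Python's parser interprets the rules. Heritrix has its own robots.txt handling, so the email addresses above remain the reliable way to have a site excluded.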

The project is part of the Kone Foundation Language Programme 2012-2016. The purpose of this programme is to support the documentation and use of small Finno-Ugric languages, Finnish, and all minority languages in Finland. The Finno-Ugric Languages and the Internet project started at the beginning of 2013, and its aim is to search the Internet for text written in small Uralic languages. The text will be processed into corpora and a list of links, and the collected corpora will serve as source material for linguistic research. We also work in close cooperation with the Language Bank of Finland and the National Library of Finland to collect Finnish data for research. The research will continue at least until the end of 2018, and during 2018 the domains crawled in 2014-15 (.fi, .se, .no, .ee, .ru, .hu, and .lv) will be recrawled in order to study changes in pages written in Uralic languages.

The research is conducted in the Department of Digital Humanities at the University of Helsinki. It is funded by the Kone Foundation and supported by the National Library of Finland. It is also carried out as part of the international CLARIN cooperation, represented in Finland by the FIN-CLARIN consortium.