Results of the project

The 29 sentence corpora created within the project can now be found in the Korp service provided by the Language Bank of Finland under the name Wanca 2016. The metadata of the project can be found in Metashare.

The Finno-Ugric links collected in this project can be found on the site we call Wanca (from Proto-Uralic *wanča ‘root’). At the moment, Wanca consists of 56 674 links to 895 sites containing pages written in 27 of the smaller Uralic languages. When initially identifying the language of the links we found, we did not want to miss any potential texts written in small Uralic languages. Therefore, the site contains many links that turned out not to contain the target languages. We are re-identifying the links with a stricter filter, but automatic language identification cannot identify the language of every web page correctly.

Native speakers and scholars are invited to create a personal account and help us verify the language labels given to the pages by our automatic language identifier. The verified links will be used for new crawls and to further improve our language identifier. After signing up, language experts can apply for expert rights by sending us a message through our contact form.

Web pages are changing all the time: old pages disappear or are moved to another location and new ones are created. In October 2017, we recrawled the links we had and found that 46% of the links and 36% of the sites had disappeared.

            Before     After
Links       103 911    56 674
Sites       1 399      895
Languages   28         27

When looking into individual languages, we noticed that both links previously identified as Votic were gone. Further investigation showed that one of the links was unavailable at the time of the crawl. The page turns out to be written in Estonian, but when it was first found, all that could be automatically detected in between all the JavaScript was machine names (a typical cause of false positives). The other page has numbers written in various Finnic languages including Votic, but the page (now?) contains so much English text that it was considered to be written in English, even though we accept all pages where as little as 2% of the text is written in one of the Uralic languages. Further statistics can be found on this page.
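
The 2% acceptance rule described above can be illustrated with a minimal Python sketch. It is not the project's actual language identifier: the function name, the input format (a mapping from language code to the share of the page's text) and the small set of target language codes are assumptions made only for this example.

```python
# Minimal sketch of the 2% acceptance rule; not the project's real identifier.
# ASSUMPTIONS: the input is a mapping from ISO 639-3 code to the share of the
# page's text in that language, and TARGET_LANGUAGES is a small hypothetical
# subset of the target languages (Votic, Livonian, Veps).

URALIC_SHARE_THRESHOLD = 0.02  # "as little as 2% of the text"
TARGET_LANGUAGES = {"vot", "liv", "vep"}


def label_page(shares, targets=TARGET_LANGUAGES, threshold=URALIC_SHARE_THRESHOLD):
    """Return a language label for a page, or None if there is no text.

    A page is labelled with a target language when that language reaches
    the threshold share; otherwise the overall dominant language is used.
    """
    if not shares:
        return None
    accepted = {
        lang: share
        for lang, share in shares.items()
        if lang in targets and share >= threshold
    }
    if accepted:
        # Pick the target language with the largest share.
        return max(accepted, key=accepted.get)
    # No target language reaches the threshold: fall back to the dominant one.
    return max(shares, key=shares.get)


# A page with only a trace of Votic is labelled English ...
mostly_english = label_page({"eng": 0.99, "vot": 0.01})
# ... while a page with 10% Votic is kept as a Votic page.
some_votic = label_page({"eng": 0.90, "vot": 0.10})
```

Under these assumptions, a page with only a few Votic numerals among a large amount of English text falls below the threshold and is labelled English, which matches the behaviour described for the second Votic link.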

Some of the sites containing pages written in Uralic languages are linked to each other. The connections between the sites that, after the October 2017 recrawl, are still considered to contain Uralic languages can be explored on this page.