An algorithm designed to expand Wikipedia in all languages
29.05.17 - An EPFL researcher has created a system that scans Wikipedia for important articles that are missing in other languages. This project could help expand the online encyclopedia’s coverage in minority languages, such as Romansh.
With 40 million articles in 293 languages, Wikipedia is the largest encyclopedia ever made. The 5.4 million pages in English are particularly varied, covering 60 times more topics than the Encyclopaedia Britannica. But not all languages enjoy such depth of coverage. “Information that some language groups need has not been translated,” says Robert West, a researcher at EPFL’s Data Science Lab. “For example, global warming is a crucial issue in Madagascar, yet there are no articles on this topic in Malagasy.”
Closer to home, only 3,400 articles are available in Romansh versus 1.8 million in French and over two million in German. And it’s hard for Wikipedia editors to know which of the millions of pages they should translate in order to really make a difference. That’s where Robert West comes in: he used machine learning to identify and rank the most important articles missing in each language. But determining how relevant a given topic is for a culture is more complex than it appears.
To help the machines assess how important an article would be in Romansh, for example, it was necessary to calculate how many views a missing article should theoretically generate. “Taylor Swift and Pokémon may be popular, but do they really count as important?” says West. “To avoid ethnocentric biases, we predicted page statistics by taking all languages into account, and the machine learning algorithms then figured out the weighting to apply to each language. For example, Japanese is more important than English when it comes to predicting the impact of a page in Chinese.”
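The idea of learning one weight per source language can be illustrated with a toy example. The sketch below is not the researchers' actual model, which the article does not specify; it swaps in plain linear regression without an intercept, fitted by gradient descent. All figures are invented log10 page-view counts, and the names `fit_language_weights`, `predict_views`, `topic_a` and `topic_b` are hypothetical. Articles that already exist in the target language serve as training data, and the learned weights are then used to rank articles that are still missing.

```python
# Toy sketch: learn how much each source language's page views matter
# for predicting views in a target language, then rank missing articles.
# Assumption: a linear model with no intercept stands in for the real system.

def fit_language_weights(rows, targets, steps=2000, lr=0.05):
    """Fit one weight per source language by least-squares gradient descent."""
    n_features = len(rows[0])
    n = len(rows)
    weights = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for x, y in zip(rows, targets):
            err = sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xi in enumerate(x):
                grad[j] += err * xi / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def predict_views(weights, x):
    """Predicted log10 views in the target language."""
    return sum(w * xi for w, xi in zip(weights, x))


# Invented training data: (log10 views in English, log10 views in Japanese)
# for articles that already exist in the target language.
train_rows = [
    (4.0, 2.0), (2.0, 4.0), (3.0, 3.0), (5.0, 1.0),
    (1.0, 5.0), (2.0, 2.0), (4.0, 4.0),
]
# Invented targets, constructed as 0.2 * English + 0.8 * Japanese, so the
# fit should recover a larger weight for Japanese than for English.
train_targets = [2.4, 3.6, 3.0, 1.8, 4.2, 2.0, 4.0]

weights = fit_language_weights(train_rows, train_targets)

# Hypothetical articles missing in the target language.
missing = {"topic_a": (5.0, 2.0), "topic_b": (2.0, 5.0)}
ranked = sorted(
    ((name, predict_views(weights, x)) for name, x in missing.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: predicted log10 views = {score:.2f}")
```

In this toy setup, the article that is popular in Japanese ends up ahead of the one that is popular in English, because the fit assigns Japanese the larger weight. A real system would need many more languages and features, and very likely a more flexible model.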
Once the algorithms have come up with the most neutral ranking possible, the lists of missing topics are displayed on a new platform called Wikipedia GapFinder. Volunteer editors are given recommended topics based on their languages and interests. With help from a translation tool provided on the platform, the humans then finish the job – artificial intelligence is not yet ready to take over the whole process. “Human intervention is still required to meet Wikipedia’s publication standards, since machine translation is not yet up to scratch,” adds West.
The platform, which was developed together with Stanford University and the Wikimedia Foundation, is open to the public and can publish 200 new articles per week. That’s not much compared to the 7,000 texts published daily on Wikipedia, but the focus is on quality, not quantity. West is working on a second project that uses data mining to find the key paragraphs in an article. This process, once mastered, will make it even easier to expand the online encyclopedia’s content in local languages.