Collaborative Machine Translation for Wikipedia

Meta-Wiki · 12 min read · original

This document proposes a long-term strategy that combines several technologies to offer a machine translation system based on collaborative principles, for use in Wikipedia or on other websites whose text content may change.

An alternative approach to collaborative machine translation for Wikipedia is described at Community Wishlist/Wishes/Wikipedia Machine Translation Project. Translated articles would appear in the Web search results of people searching in their own language, which could more than double Wikipedia's reads and readers.

Errors in machine translations would be corrected through an innovative post-editing system in which Wikimedians correct errors at scale, with likely errors in other articles flagged or corrected at the same time. Advances in machine translation in the 2020s and work on mw:MinT have made such a system feasible.

The technical mechanisms and challenges are described or addressed on the wish page, created in 2024, not here on this page, created in 2013.

In general terms:

At annotation level for monolingual users who want to help disambiguate:

At annotation level for translators:

For readers: after selecting a different content language in the wiki they are visiting, the following will happen:

Semantic multilingual dictionary

Currently in Wikidata there are two proposals for enabling Wiktionary support as linked data (1 and 2). If inflections are supported, Wikidata/Wiktionary could be the base to feed any machine translation engine.

Once the system is in place, it will be possible to create language pairs from existing information, which could then be used to build translation dictionaries for Apertium. See, for example, the existing Apertium language pair files (XML).
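As a rough illustration of that conversion, the following sketch turns a list of lemma pairs into an XML document that loosely follows the layout of an Apertium bilingual dictionary (`sdefs`, then a `section` of `e`/`p`/`l`/`r` entries with `s` tags). The function name `build_bidix`, the input tuple layout and the example entries are assumptions made for illustration. In the proposal the pairs would come from Wikidata/Wiktionary, whereas here they are hard-coded, and real Apertium dictionaries carry more information than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: (source lemma, source tags, target lemma, target tags).
# In the proposal these would be extracted from Wikidata/Wiktionary.
EXAMPLE_ENTRIES = [
    ("dog", ("n",), "perro", ("n", "m")),
    ("big", ("adj",), "grande", ("adj", "mf")),
]


def build_bidix(entries):
    """Serialise lemma pairs as a simplified bilingual-dictionary XML string."""
    root = ET.Element("dictionary")

    # Declare every grammatical symbol used by the entries.
    sdefs = ET.SubElement(root, "sdefs")
    symbols = sorted({tag for _, st, _, tt in entries for tag in (*st, *tt)})
    for symbol in symbols:
        ET.SubElement(sdefs, "sdef", n=symbol)

    section = ET.SubElement(root, "section", id="main", type="standard")
    for src, src_tags, tgt, tgt_tags in entries:
        pair = ET.SubElement(ET.SubElement(section, "e"), "p")
        for side, lemma, tags in (("l", src, src_tags), ("r", tgt, tgt_tags)):
            node = ET.SubElement(pair, side)
            node.text = lemma  # lemma first, then its symbol tags
            for tag in tags:
                ET.SubElement(node, "s", n=tag)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_bidix(EXAMPLE_ENTRIES))
```

Keeping the input as plain tuples means that the extraction step can change without touching the serialiser.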

Semantic annotation

For semantic annotation of entities there is DBpedia Spotlight. However, a specific solution should be implemented to cater for the needs of machine translation, or those capabilities should be added to the Apertium toolbox. Essentially, it would need monolingual morphology selection and sense selection. Later on, manually selected translation rules should also be stored as annotations. DBpedia Spotlight uses KEA and Lucene/OpenNLP with UIMA.
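One possible way to store such selections is as stand-off annotations that point at character offsets in the source text. The sketch below is only an assumption about what a record could contain: the `Annotation` class, its field names, the morphology string and the sense identifier are all hypothetical and do not correspond to any existing DBpedia Spotlight or Apertium format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Annotation:
    """Stand-off record: the text itself is left untouched."""
    start: int        # character offset where the annotated span begins
    end: int          # character offset just past the span
    lemma: str        # selected dictionary form
    morphology: str   # selected morphological analysis (hypothetical notation)
    sense_id: str     # selected sense (hypothetical identifier)


def annotate(text, surface, lemma, morphology, sense_id):
    """Annotate the first occurrence of `surface` in `text`."""
    start = text.index(surface)  # raises ValueError if the span is absent
    return Annotation(start, start + len(surface), lemma, morphology, sense_id)


if __name__ == "__main__":
    sentence = "She sat on the bank."
    note = annotate(sentence, "bank", "bank", "n.sg", "sense:bank-riverside")
    print(json.dumps(asdict(note)))
```

Because the record only references offsets, several contributors could annotate the same revision of a text without editing it.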

Other software: Maui is another tool for annotating text with vocabulary concepts. It provides pre-trained concept schemes and can also be trained for new concepts. It works quite well for phrase-based and even topic-based tagging and annotation, including when integrated with Solr facets. The same can be achieved with Apache Stanbol, using a variety of engines.

Machine translation platform

For this kind of project, a rule-based machine translation system is preferred, because total control over the whole process is wanted and minority languages should be accounted for (which is not easy with statistical MT, where parallel corpora may not exist). Once parallel corpora have been developed, statistical methods would be implemented to improve the translation, either by fine-tuning the transfer rules or by feeding a statistical engine (this could be Moses, using a multi-engine translation synthesiser).

In the open source world, Apertium offers a reliable rule-based MT toolchain. It should be noted, however, that it currently works around XML dictionaries. On the plus side, it has a thriving community, with 11 GSoC projects running this year. Moses is supported mainly by the EuroMatrix project, among other international organisations, and is funded by the European Commission.

There is a current effort to convert the Apertium translation rules into Wikimedia templates, and then use DBpedia extraction templates to convert such templates into linked data. An example of these upcoming rule translation templates could be:

{{translation_rule
 | pair = en-es
 | phrase type = NP <!-- this could be determined from either
 source_head or target_head, but it's nicer to have it -->
 | source = determiner adjective noun
 | target = determiner noun adjective
 | alignment = 1 3 2 <!-- not necessary in this example, but would be
 if there were more than one of each PoS -->
 <!-- this could also be written as 1-1 3-2 2-3 -- it would be nice to
 be able to use that convention, to import statistically derived rules,
 but it's only necessary to know one set of positions when writing by
 hand -->
 | source head = 3
 | target head = 2 <!-- not necessary with alignment -->
 | source example = the big dog
 | target example = el perro grande
 | target attributes = {{attribs | definiteness = 1 | gender = 2, 1 |
 number = 2, 1}} <!-- the actual attributes would be those used in
 wiktionary -->
 }}
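To make the reordering in the template concrete, here is a minimal sketch of how such a rule could be applied once parsed. It assumes that `alignment` lists, for each target position, the 1-based source position it is taken from, which is one reading of the `1 3 2` value in the template. The class, the function and the three-word lexicon are hypothetical, and agreement (the `target attributes` field) is not handled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationRule:
    pair: str
    source: tuple[str, ...]     # expected part-of-speech sequence
    target: tuple[str, ...]     # part-of-speech sequence after transfer
    alignment: tuple[int, ...]  # assumed: source position for each target slot


# Values copied from the template example.
NP_RULE = TranslationRule(
    pair="en-es",
    source=("determiner", "adjective", "noun"),
    target=("determiner", "noun", "adjective"),
    alignment=(1, 3, 2),
)

# Toy word-for-word lexicon, for illustration only.
TOY_LEXICON = {"the": "el", "big": "grande", "dog": "perro"}


def apply_rule(rule, tagged_tokens, lexicon):
    """Reorder and translate (word, pos) tokens, or return None if no match."""
    if tuple(pos for _, pos in tagged_tokens) != rule.source:
        return None
    reordered = [tagged_tokens[i - 1][0] for i in rule.alignment]
    return " ".join(lexicon[word] for word in reordered)


if __name__ == "__main__":
    phrase = [("the", "determiner"), ("big", "adjective"), ("dog", "noun")]
    print(apply_rule(NP_RULE, phrase, TOY_LEXICON))  # el perro grande
```

A rule that does not match returns `None`, so a caller can try the next rule or fall back to word-for-word output.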

An example of a DBpedia mapping template: Mapping_en:Elementbox

At some point, it may become possible to work with these translation rules directly as linked data, storing them in a Wikibase repository or in another triple store.

Translatewiki.net already offers a translation interface that could eventually be expanded to support semantic tagging of source and translated text. As for the transfer rules, there is a GSoC project idea to build a visual interface for Apertium transfer rules. Although this editor is going to be a Qt interface (i.e. not web-based), it will give a first indication of how such an interface should be built.

Multilingual content synchronizer

CoSyne is a past EU project that devoted three years to the topic of "multilingual synchronisation of wikis".[1]

The prototype, which ran on top of MediaWiki, aimed to recognise which parts of a wiki article are already present in both languages and which are not; the missing parts would then be translated and inserted automatically into the other language version. The main components are cross-lingual textual entailment (to recognise which parts of the article are already in both languages) and MT. The prototype used an SMT engine, but because the engine is called as a web service, any MT engine could be used.
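The overall flow can be sketched as follows. This is not CoSyne's implementation: the function `synchronise` and its parameters are assumptions, the entailment check and the MT engine are passed in as plain callables, and the stand-ins used in the demo are a toy lookup table. Deciding where a new segment belongs in the target article is left out, so new segments are simply appended.

```python
def synchronise(source_segments, target_segments, is_covered, translate):
    """Return the target segments plus translations of uncovered source ones.

    is_covered(segment, target_segments) stands in for cross-lingual
    textual entailment; translate(segment) stands in for any MT engine.
    """
    result = list(target_segments)
    for segment in source_segments:
        if not is_covered(segment, result):
            result.append(translate(segment))
    return result


# Toy stand-ins, for illustration only.
TOY_MT = {
    "Dogs bark.": "Los perros ladran.",
    "Cats sleep.": "Los gatos duermen.",
}


def toy_translate(segment):
    return TOY_MT[segment]


def toy_is_covered(segment, target_segments):
    # A real system would compare meaning, not exact strings.
    return toy_translate(segment) in target_segments


if __name__ == "__main__":
    english = ["Dogs bark.", "Cats sleep."]
    spanish = ["Los perros ladran."]
    print(synchronise(english, spanish, toy_is_covered, toy_translate))
```

Passing the two components in as callables mirrors the point made above about the web service: either one can be swapped without changing the loop.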

The project ended in 2013. As of 2015, no outputs are known and the CoSyne website is down.