August 2013
In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences, while others "retweet" translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only its public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at