WMT'15 newstest dataset: .sgm formatting


Which scripts are used, and how, to convert the WMT newstest datasets from the .sgm format to plain text (like the europarl dataset)?

For example, the newstest archive at http://www.statmt.org/wmt15/test.tgz contains, once extracted, files such as newstest2015-ende-ref.de.sgm.

How do I make that similar to the europarl dataset where each line represents a sentence with no formatting?

Note:

I have found a script in the moses directory (linked from the wmt site) called wrap-xml.perl. The test section mentions that it is used to convert to the .sgm format, but the script itself contains no documentation, and I am clueless in Perl.
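
To make the goal concrete, here is a minimal Python sketch of the conversion I am after. It assumes each sentence sits inside a `<seg id="...">...</seg>` element and that the file is UTF-8 with XML-style entities such as `&amp;`. The function name `sgm_to_lines` and the script name in the usage comment are hypothetical, and this is not the official WMT or Moses tooling.

```python
import html
import re
import sys

# Matches one <seg ...>...</seg> element; DOTALL lets a segment span lines.
SEG_RE = re.compile(r"<seg[^>]*>(.*?)</seg>", re.IGNORECASE | re.DOTALL)


def sgm_to_lines(sgm_text):
    """Return the text of each <seg> element, one sentence per list item.

    Hypothetical helper: assumes the sentences are wrapped in <seg> tags.
    """
    sentences = []
    for match in SEG_RE.finditer(sgm_text):
        # Decode entities such as &amp; and collapse runs of whitespace.
        text = html.unescape(match.group(1))
        sentences.append(" ".join(text.split()))
    return sentences


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name):
    #   python sgm_to_text.py newstest2015-ende-ref.de.sgm > newstest2015.de
    with open(sys.argv[1], encoding="utf-8") as f:
        for line in sgm_to_lines(f.read()):
            print(line)
```

A regular expression is used instead of an XML parser because .sgm files are not guaranteed to be well-formed XML. All other markup (document and paragraph tags) is simply dropped.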
