I want to extract data from the wikilinks returned by the mwparserfromhell library. For instance, I want to parse the following string:
[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]
If I split the string using the character |, it doesn't work as there is a link inside the description of the image that uses the | as well: [[Maria Skłodowska-Curie Museum|Birthplace]].
I'm using a regex to first replace all links in the string before splitting it. It works (in this case), but it doesn't feel clean (see the code below). Is there a better way to extract information from such a string?
import re
wiki_code = "[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]"
# Remove "[[File:" at the beginning of the string
prefix = "[[File:"
if wiki_code.startswith(prefix):
    wiki_code = wiki_code[len(prefix):]
# Remove "]]" at the end of the string
suffix = "]]"
if wiki_code.endswith(suffix):
    wiki_code = wiki_code[:-len(suffix)]
# Replace each inner link with its label (the part after the last pipe)
link_pattern = re.compile(r'\[\[.*?\]\]')
matches = link_pattern.findall(wiki_code)
for match in matches:
    content = match[2:-2]
    arr = content.split("|")
    label = arr[-1]
    wiki_code = wiki_code.replace(match, label)
# Only the outer pipes are left now, so a plain split is safe
print(wiki_code.split("|"))
The links returned by .filter_wikilinks() are instances of the Wikilink class, which have title and text properties.
title returns the title of the link: File:Warszawa, ul. Freta 16 20170516 002.jpg
text returns the rest of the link: thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].
These are returned as Wikicode objects.
Since the actual caption is always the last fragment, first you need to find the other fragments with the following regex:
([^\[\]|]*\|)+
(): a group of
[^\[\]|]*: 0 or more characters that are not square brackets or pipes
\|: a literal pipe
+: 1 or more
Everything else, from the ending index of the last match until the end of the string, is the last fragment.
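The regex step above can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the function name split_options_and_caption is hypothetical, and the text value is hard-coded to stand in for str(link.text) so that the snippet runs without mwparserfromhell installed.

```python
import re

# Stands in for str(link.text); hard-coded here as an assumption so the
# example does not depend on mwparserfromhell.
text = (
    "thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] "
    "of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]]."
)

# One or more "fragment followed by a pipe", where a fragment contains
# no square brackets and no pipes.
OPTIONS_PREFIX = re.compile(r"([^\[\]|]*\|)+")


def split_options_and_caption(text):
    """Return (options, caption), assuming the caption is the last fragment."""
    match = OPTIONS_PREFIX.match(text)
    end = match.end() if match else 0
    # The matched prefix ends with a pipe, so drop the trailing empty piece.
    options = text[:end].split("|")[:-1]
    return options, text[end:]


options, caption = split_options_and_caption(text)
print(options)
print(caption)
```

Because the pattern refuses to cross a square bracket, the match stops right before the first nested link, and the pipe inside [[Maria Skłodowska-Curie Museum|Birthplace]] is left alone.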
When the caption is not the last fragment
For such edge cases we can parse the text property again using itertools functions:
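One way this might look is sketched below. It is an illustration under stated assumptions, not necessarily the approach originally intended here: the helper names split_top_level and guess_caption are hypothetical, KNOWN_OPTIONS is a deliberately incomplete sample of image option keywords, and the input is a plain string standing in for str(link.text). The idea is to track bracket depth with itertools.accumulate and group characters with itertools.groupby, so that only pipes outside any nested link act as separators.

```python
from itertools import accumulate, groupby

# Illustrative, incomplete set of image option keywords (an assumption).
KNOWN_OPTIONS = {
    "thumb", "thumbnail", "frame", "frameless", "border",
    "left", "right", "center", "none", "upright",
}


def split_top_level(text):
    """Split text on pipes that are not nested inside square brackets.

    Limitation of this sketch: consecutive top-level pipes are treated as
    a single separator, so empty fragments are not preserved.
    """
    # Running bracket depth after each character: '[' opens, ']' closes.
    depths = accumulate(
        1 if ch == "[" else -1 if ch == "]" else 0 for ch in text
    )
    # Flag each character: True only for a pipe seen at depth 0.
    flagged = ((ch, ch == "|" and depth == 0) for ch, depth in zip(text, depths))
    return [
        "".join(ch for ch, _ in group)
        for is_separator, group in groupby(flagged, key=lambda pair: pair[1])
        if not is_separator
    ]


def guess_caption(fragments):
    """Return the first fragment that does not look like an image option."""
    for fragment in fragments:
        name = fragment.split("=", 1)[0].strip()
        if name not in KNOWN_OPTIONS and not name.endswith("px"):
            return fragment
    return None


text = "thumb|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie|upright=1.18"
fragments = split_top_level(text)
print(fragments)
print(guess_caption(fragments))
```

The keyword check is only a heuristic: a caption that happens to equal an option name, or that ends in "px", would be misclassified, so the set and the rule should be adapted to the data at hand.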