I want to print all the tokens in a file that are labelled with a given morphological tag. So far I have written the code shown below.
def index(filepath, string):
    import re
    pattern = re.compile(r'(\w+)+')
    StringList = []
    StringList.append(string)
    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = set(m.group(1) for m in pattern.finditer(line))
            matches = [keyword for keyword in StringList if keyword in words]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)
    StringList.clear()

index('deneme.txt', '+Noun')
The output looks like this. I can find Noun in the token and print the line number, but I can't print the part I want: only the word part that comes before the + sign.
Noun 1
Noun 2
Noun 3
Noun 4
Noun 5
Noun 6
Noun 7
The lines in my file look like this:
Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc oluş+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karşı+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylaş+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num aşama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc
Türkiye+Noun+Gen ekonomik+Adj ve+Conj insani+Adj potansiyel+Noun+P3sg ,+Punc güç+Noun+With savun+Verb+Inf2 kapasite+Noun+P3sg ,+Punc ulus+Noun+A3pl+InBetween çatış+Verb+Inf2+A3pl+Gen önle+Verb+Pass+Inf2+P3sg ve+Conj barış+Noun+P3sg inşa+Noun çaba+Noun+A3pl+P3sg+Dat aktif+Adj katılım+Noun+P3sg+Gen yanısıra+PostpPCGen ,+Punc fark+Noun+With kültür+Noun ve+Conj gelenek+Noun+A3pl+Dat ait+PostpPCDat seçkin+Adj özellik+Noun+A3pl+Acc birleş+Verb+Caus+PresPart bir+Num bünye+Noun+Dat sahip+Noun ol+Verb+Inf2+P3sg ,+Punc kendi+Pron+P3sg bölge+Noun+P3sg+Loc ve+Conj öte+Noun+P3sg+Loc önem+Noun+With rol+Noun oyna+Verb+Inf2+P3sg+Acc sağla+Verb+Fut değer+Noun+With özellik+Noun+A3pl+Cop .+Punc
Türkiye+Noun ,+Punc bu+Det önem+Noun+With katkı+Noun+Acc yap+Verb+Able+Inf1 için+PostpPCGen yeterli+Adj donanım+Noun+P3sg haiz+Adj bir+Num ülke+Noun+Cop ve+Conj gelecek+Noun nesil+Noun+A3pl için+PostpPCGen daha+Noun i+Noun+Acc bir+Num dünya+Noun oluş+Verb+Caus+Inf1 amaç+Noun+P3sg+Ins ,+Punc dost+Noun+A3pl+P3pl ve+Conj müttefik+Adj+A3pl+P3sg+Ins yakın+Noun bir+Num biçim+Noun+Loc çalış+Verb+Inf2+Dat devam+Noun et+Verb+Fut+Cop .+Punc
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj ilişki+Noun+A3pl
club+Noun toplantı+Noun+A3pl+P3sg
Türkiye+Noun -+Punc At+Noun gümrük+Noun işbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlaşma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklık+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj geliş+Verb+Inf2+P3sg+Acc sağla+Verb+Inf1 üzere+PostpPCNom ortaklık+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayılı+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc
club+Noun toplantı+Noun+A3pl+P3sg
nispi+Adj
nisbi+Adj
görece+Adj+With
izafi+Adj
obur+Adj
I want to get the tokens for whatever tag I write. For example, when I write +Adj, I want to get all the tokens that include +Adj (for example nispi, izafi, ...).
I think your approach to using regexes needs some improvement.
Note that each input line contains a number of "tokens", e.g. terörizm+Noun+Gen. As you can see, each token contains a word followed by one or more classification symbols, each preceded by a + char. So: split each line into tokens on whitespace, then split each token on the + char; the first part is the word, and the remaining parts (after +) are classification symbols.

A good habit is to strip the terminating blank chars (at least \n).

Note also that your code contains StringList, so you are aware of the case that this function may look for one or more classification words.

I programmed it in a slightly different way: the argument to look for (lookFor) is a list of words, which is converted into a set (lookForSet), and the classification symbols of each token are collected into another set (wordSet).

The decision whether to print a word (the first word from a token) is based on whether at least one of its classification symbols can be found in lookForSet. To put it another way: whether lookForSet and wordSet have some common elements (set intersection).

The whole script then follows directly from these steps.
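The steps above can be sketched as follows. This is a minimal sketch, not necessarily the exact script the output was produced with: the names lookFor, lookForSet and wordSet come from the description, while find_tokens, the UTF-8 encoding and the tolerance for a leading + in the tags are assumptions made for illustration.

```python
def find_tokens(lines, lookFor):
    """Yield (lineno, word, matched symbols) for every matching token.

    find_tokens is a hypothetical helper name; lines is any iterable
    of strings, for example an open file.
    """
    # Assumption: accept both 'Adj' and '+Adj' by dropping a leading '+'.
    lookForSet = {tag.lstrip('+') for tag in lookFor}
    for lineno, line in enumerate(lines, start=1):
        # strip() removes the terminating '\n'; split() cuts on whitespace.
        for token in line.strip().split():
            # First part is the word, the rest are classification symbols.
            word, *symbols = token.split('+')
            wordSet = set(symbols)
            common = lookForSet & wordSet  # set intersection
            if common:
                yield lineno, word, sorted(common)


def index(filepath, lookFor):
    """Print line number, word and matched symbols for each matching token."""
    # Assumption: the file is UTF-8 encoded (it contains Turkish letters).
    with open(filepath, encoding='utf-8') as f:
        for lineno, word, matched in find_tokens(f, lookFor):
            print("{:<4} {:<15} {}".format(lineno, word, ','.join(matched)))
```

Calling index('deneme.txt', ['Adj']) would then print one line per token tagged with Adj, with the word itself rather than the tag. A token whose word part is itself a + sign would need special handling, which this sketch does not attempt.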
A piece of output from it, for my input data (a slightly shortened version of yours), contains for each matching token the word and which words from lookFor have been found in this token.
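To see the token splitting and the set intersection in isolation, here is a small check on a single token taken from the question's data. The variable names follow the description above; the choice of token and of the tags to look for is only illustrative.

```python
token = 'silah+Noun+A3pl+P3sg+Gen'

# Split on '+': the first part is the word, the rest are symbols.
word, *symbols = token.split('+')
wordSet = set(symbols)

# Tags we are interested in (illustrative choice).
lookForSet = {'Noun', 'Adj'}

# Non-empty intersection means the token should be reported.
common = lookForSet & wordSet
print(word, sorted(common))  # silah ['Noun']
```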