Print all the tokens in the file that are labelled with the morphological tag

Question

291 Views Asked by erkevarol At 27 October 2018 at 15:20

I want to print all the tokens in a file that are labelled with a given morphological tag. So far I have written the code shown below.

def index(filepath, string):

    import re
    pattern = re.compile(r'(\w+)+')
    StringList = []
    StringList.append(string)

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = set(m.group(1) for m in pattern.finditer(line))
            matches = [keyword for keyword in StringList if keyword in words]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)

    StringList.clear()



index('deneme.txt', '+Noun')

The output looks like this. I can find the tag and the line number, but I can't print the part I actually want: only the word that comes before the + sign.

Noun            1
Noun            2
Noun            3
Noun            4
Noun            5
Noun            6
Noun            7

The lines in my file look like this:

Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc oluş+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karşı+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylaş+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num aşama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc 
Türkiye+Noun+Gen ekonomik+Adj ve+Conj insani+Adj potansiyel+Noun+P3sg ,+Punc güç+Noun+With savun+Verb+Inf2 kapasite+Noun+P3sg ,+Punc ulus+Noun+A3pl+InBetween çatış+Verb+Inf2+A3pl+Gen önle+Verb+Pass+Inf2+P3sg ve+Conj barış+Noun+P3sg inşa+Noun çaba+Noun+A3pl+P3sg+Dat aktif+Adj katılım+Noun+P3sg+Gen yanısıra+PostpPCGen ,+Punc fark+Noun+With kültür+Noun ve+Conj gelenek+Noun+A3pl+Dat ait+PostpPCDat seçkin+Adj özellik+Noun+A3pl+Acc birleş+Verb+Caus+PresPart bir+Num bünye+Noun+Dat sahip+Noun ol+Verb+Inf2+P3sg ,+Punc kendi+Pron+P3sg bölge+Noun+P3sg+Loc ve+Conj öte+Noun+P3sg+Loc önem+Noun+With rol+Noun oyna+Verb+Inf2+P3sg+Acc sağla+Verb+Fut değer+Noun+With özellik+Noun+A3pl+Cop .+Punc 
Türkiye+Noun ,+Punc bu+Det önem+Noun+With katkı+Noun+Acc yap+Verb+Able+Inf1 için+PostpPCGen yeterli+Adj donanım+Noun+P3sg haiz+Adj bir+Num ülke+Noun+Cop ve+Conj gelecek+Noun nesil+Noun+A3pl için+PostpPCGen daha+Noun i+Noun+Acc bir+Num dünya+Noun oluş+Verb+Caus+Inf1 amaç+Noun+P3sg+Ins ,+Punc dost+Noun+A3pl+P3pl ve+Conj müttefik+Adj+A3pl+P3sg+Ins yakın+Noun bir+Num biçim+Noun+Loc çalış+Verb+Inf2+Dat devam+Noun et+Verb+Fut+Cop .+Punc 
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj ilişki+Noun+A3pl 
club+Noun toplantı+Noun+A3pl+P3sg 
Türkiye+Noun -+Punc At+Noun gümrük+Noun işbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlaşma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklık+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj geliş+Verb+Inf2+P3sg+Acc sağla+Verb+Inf1 üzere+PostpPCNom ortaklık+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayılı+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc 
club+Noun toplantı+Noun+A3pl+P3sg 
nispi+Adj 
nisbi+Adj 
görece+Adj+With 
izafi+Adj 
obur+Adj

I want to get the tokens for whatever tag I pass in. For example, when I write +Adj, I want to get all the tokens that include +Adj (nispi, izafi, and so on).

There are 2 best solutions below

Jongware On 27 October 2018 at 16:02

Splitting on \w+ removed the + part from what you are looking for, so I split on the spaces in between instead. Then it was just a case of wrestling the for and in into the right order for the list comprehension.

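def index(filepath, string):
    StringList = [string]  # tags to look for, e.g. ['+Adj']

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            # split() without arguments splits on any whitespace
            # and also drops the trailing newline
            words = line.split()
            # keep every token that contains one of the tags
            matches = [word for keyword in StringList for word in words if keyword in word]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)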


index('deneme.txt', '+Adj')

Which leads to the result:

küresel+Adj,karşı+Adj+P3sg+Loc,samimi+Adj 1
ekonomik+Adj,insani+Adj,aktif+Adj,seçkin+Adj 2
yeterli+Adj,haiz+Adj,müttefik+Adj+A3pl+P3sg+Ins 3
kurumsal+Adj    4
sayılı+Adj      6
nispi+Adj       8
nisbi+Adj       9
görece+Adj+With 10
izafi+Adj       11
obur+Adj        12

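I removed the line StringList.clear() as it gave an error, presumably because lists only have a clear() method from Python 3.3 on. The list is local to the function, so clearing it is not needed anyway.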

Works with both Python 2.7 and 3.6+, although the extended Unicode characters in your text will throw off the alignment using 2.7.
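
If only the word in front of the first + is wanted, as the question asks, a variant along these lines might do. It is a sketch rather than a drop-in replacement: the name words_with_tag is made up here, it takes any iterable of lines instead of a file path, it needs Python 3, and it compares whole tags, so that +Inf would not also match +Inf2 the way the substring test keyword in word does.

```python
def words_with_tag(lines, tag):
    """Yield (lineno, word) for every token carrying exactly this tag.

    words_with_tag is a hypothetical helper, not part of the code above.
    tag may be given with or without the leading '+'.
    """
    tag = tag.lstrip('+')
    for lineno, line in enumerate(lines, start=1):
        for token in line.split():
            # first piece is the word itself, the rest are its tags
            word, *tags = token.split('+')
            if tag in tags:
                yield lineno, word


# Two short sample lines built from tokens in the question
sample = ["küresel+Adj düzey+Noun+Loc karşı+Adj+P3sg+Loc\n",
          "savun+Verb+Inf2 nispi+Adj \n"]

for lineno, word in words_with_tag(sample, '+Adj'):
    print(lineno, word)
```

An open file object can be passed in place of sample, since iterating over a file yields its lines.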

Valdi_Bo On 27 October 2018 at 17:59 (accepted answer)

I think your approach to using regexes needs some improvement.

Note that each input line contains a number of "tokens", e.g. terörizm+Noun+Gen. As you can see, it contains:

the first word - actual word from text,
a number of classification symbols, each preceded with a + char.

So:

each line should be split into tokens, on a sequence of blank chars,
each token should be split into words, on + char,
the first from these words is the "actual" word,
the remaining words (without +) are classification symbols.
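
To make these steps concrete, here is roughly how they play out on a short sample line built from tokens in the question. Plain str.split is used here for brevity; the variable names are only for illustration.

```python
line = "terörizm+Noun+Gen ve+Conj kitle+Noun \n"

parsed = []
for token in line.split():        # split the line on runs of blank chars
    parts = token.split('+')      # split the token on the + char
    actual_word = parts[0]        # the first piece is the actual word
    symbols = parts[1:]           # the rest are classification symbols
    parsed.append((actual_word, symbols))

print(parsed)
```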

A good habit is to strip the terminating blank chars (at least \n).

Note also that your code contains StringList, so you apparently anticipated that this function may need to look for more than one classification symbol.

I programmed it in a slightly different way:

The second parameter (lookFor) is a list of words, which is converted into a set (lookForSet).
The set of words (result of splitting of a token, minus the first word) is also converted into a set.

The decision whether to print a word (the first word from a token) is based on whether at least one of its classification symbols can be found in lookForSet. To put it another way - whether lookForSet and wordSet have some common elements (set intersection).
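
A small illustration of that test, using three tokens from the question. The names mirror the ones described above; the sample values and the results dictionary are only there for demonstration.

```python
lookForSet = {'Noun', 'Gen'}

results = {}
for token in ['terörizm+Noun+Gen', 've+Conj', 'kitle+Noun']:
    words = token.split('+')
    word1 = words.pop(0)            # the actual word
    wordSet = set(words)            # its classification symbols
    common = lookForSet & wordSet   # same as lookForSet.intersection(wordSet)
    results[word1] = sorted(common)
    # an empty set is falsy, so "if common:" is true only for a real overlap
    if common:
        print(word1, sorted(common))
```

Only terörizm and kitle are printed; ve has no symbol in common with lookForSet.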

So the whole script can look like below:

import re

def index(fileName, lookFor):
    lookForSet = set(lookFor)  # Set of classification symbols to look for
    pat1 = re.compile(r'\s+')  # Regex to split line into tokens
    pat2 = re.compile(r'\+')   # Regex to split a token into words
    with open(fileName) as f:
        for lineNo, line in enumerate(f, start=1):
            line = line.rstrip()
            tokens = pat1.split(line)
            for token in tokens:
                words = pat2.split(token)
                word1 = words.pop(0)  # Initial word
                wordSet = set(words)  # Classification words
                commonWords = lookForSet.intersection(wordSet)
                if commonWords:
                    print("{:3}: {:<15} {}".format(lineNo, word1, ', '.join(commonWords)))

index('lines.txt', ['Noun', 'Gen'])

A piece of its output for my input data (a slightly shortened version of yours) is shown below:

1: Türkiye         Noun
1: terörizm        Noun, Gen
1: kitle           Noun
1: imha            Noun
2: Türkiye         Noun, Gen
2: potansiyel      Noun

It contains:

the number of the source line,
the first word of a token,
which classification words from lookFor have been found in this token.
