CYK algorithm implementation

1.5k Views Asked by At

I'm trying to implement the CYK pseudo code provided by wikipedia. The example sentence I input should be outputting True, however it is outputting false. I think I'm getting issues on the indexing considering the provided example starts from 1.

Code:

def is_in_language(self, tokens):
    n = len(tokens)
    rules = self.grammar.lhs_to_rules
    table = defaultdict(lambda: defaultdict(dict))
    #Initialize dictionary table[row][column][nonterminal r] = boolean
    for row in range(n+1):
        for col in range(n+1):
            for r in rules:
                table[row][col][r] = False

    for i in range(n):
        nonTerminalList = self.grammar.rhs_to_rules[(tokens[i],)]
        print(nonTerminalList)
        for nonTerminal in nonTerminalList:
            (r,right,prob) = nonTerminal
            table[0][i][r] =  True


    for l in range(2,n+1):
        for s in range(n-l+1):
            for p in range(l-1+1):
                for B in rules:
                    for C in rules:
                        AList = self.grammar.rhs_to_rules[B,C]
                        if(len(AList) > 0):
                            for A in AList:
                                (leftA, rightBC, prob) = A


                                try:
                                    if(table[p][s][B] and table[l-p][s+p][C]):
                                        table[l][s][leftA] = True
                                except:
                                    pass

    print(table[n][0][self.grammar.startsymbol])
    return table
1

There are 1 best solutions below

0
Sandipan Dey On

The following code in python implements the CYK dynamic programming algorithm (described here), which can be used to solve the membership problem for CFG, i.e., given an input string w and a CFG grammar G in chomosky normal form (CNF) it can find whether w is in L(G) in O(n^3|w|) time.

import numpy as np
import pandas as pd

def is_in_cartesian_prod(x, y, r):
    return r in [i+j for i in x.split(',') for j in y.split(',')]

def accept_CYK(w, G, S):
    if w == 'ε':
        return 'ε' in G[S]
    n = len(w)
    DP_table = [['']*n for _ in range(n)]
    for i in range(n):
        for lhs in G.keys():
            for rhs in G[lhs]:
                 if w[i] == rhs: # rules of the form A -> a
                    DP_table[i][i] = lhs if not DP_table[i][i] else DP_table[i][i] + ',' + lhs
                    
    for l in range(2, n+1):       # span
        for i in range(n-l+1):    # start
            j = i+l-1             # right
            for k in range(i, j): # partition
                for lhs in G.keys():
                    for rhs in G[lhs]:
                        if len(rhs) == 2: #rules of form A -> BC
                            if is_in_cartesian_prod(DP_table[i][k], DP_table[k+1][j], rhs):
                                if not lhs in DP_table[i][j]:
                                    DP_table[i][j] = lhs if not DP_table[i][j] else DP_table[i][j] + ',' + lhs

    return S in DP_table[0][n-1]  

Now, let's use the above implementation of algorithm for the following simple CFG G (already in CNF):

S -> AB | BC

A -> BA | a

B -> CC | b

C -> AB | a

and the input string w = baaba to test membership of w in L(G).

# let's define the grammar productions and symbols first 
NTs = ['S', 'A', 'B', 'C', 'D']
Ts = ['a', 'b']
rules = ['S -> AB | BC', 'A -> BA | a', 'B -> CC | b', 'C -> AB | a'] #, 'D -> ϵ']
G = get_grammar(rules)
print(G)
# {'S': ['AB', 'BC'], 'A': ['BA', 'a'], 'B': ['CC', 'b'], 'C': ['AB', 'a']}

# now check if the string w is a member of G
accept_CYK('baaba', G, 'S')
# True

enter image description here

The following animation shows how the DP table gets constructed:

enter image description here

A probabilistic version of this DP algorithm with back-pointers can be used for the PCFGs to construct parse trees for natural language sentences in NLP.