Whoosh library, weird behavior of Sequence query with wildcards


Problem

I am trying to implement behaviour similar to the way the Sphinx search engine handles phrases with wildcards. For this I use the Whoosh library. But when I use sequence queries with short words (2 characters long) and wildcards, I get an error:

    289     return [Span(pos) for pos in self.value_as("positions")]
    290 else:
--> 291     raise Exception("Field does not support spans")

Exception: Field does not support spans

I noticed this happens when I add a lot of documents to the index; it doesn't happen with a small number of documents.

I want to be able to search with queries like:

  • "найденный" AND ("по* мест* проживания" OR "рядом с домом")

(The Russian roughly reads: "found" AND ("at place of residence" OR "near the house").) Here "по* мест* проживания" causes the error. When I reduce it to "по*" it runs fine, but if I change it to "по* дороге" ("along the road") I still get the same error.
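To make the semantics I am after concrete, here is a minimal plain-Python sketch of phrase matching with wildcards. It does not use Whoosh at all: the function name `phrase_matches`, the regex-based tokenisation, and the use of `fnmatchcase` for the wildcards are assumptions made only for this illustration, not how Whoosh or Sphinx implement it.

```python
import re
from fnmatch import fnmatchcase


def phrase_matches(pattern: str, text: str) -> bool:
    """Return True if the pattern's tokens match consecutive words of text.

    Hypothetical helper for illustration only; each pattern token may
    contain shell-style wildcards such as a trailing '*'.
    """
    # Assumption: lowercase and split on non-word characters,
    # roughly what a simple analyzer without a stoplist would do.
    words = re.findall(r"\w+", text.lower())
    parts = pattern.lower().split()
    if not parts:
        return False

    n = len(parts)
    # Slide a window of n consecutive words over the text and require
    # every word in the window to match the pattern token at the same offset.
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if all(fnmatchcase(word, part) for word, part in zip(window, parts)):
            return True
    return False


if __name__ == "__main__":
    doc = "Найденный по месту проживания"
    print(phrase_matches("по* мест* проживания", doc))
    print(phrase_matches("по* дороге", doc))
```

With this definition, "по* мест* проживания" should match a document containing "по месту проживания" as adjacent words, while the same words in a different order should not match. That is the result I expect from the Sequence query.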

Code and expected results

from pprint import pprint

from whoosh.fields import Schema, TEXT, NUMERIC
from whoosh.qparser import QueryParser, PhrasePlugin, SequencePlugin, OperatorsPlugin
from whoosh import analysis
from whoosh.filedb.filestore import RamStorage


analyzer = analysis.StandardAnalyzer(minsize=None, stoplist=None)
schema = Schema(item_id=NUMERIC(stored=True, bits=64), type=NUMERIC(stored=True), content=TEXT(analyzer=analyzer, stored=True, phrase=True))
storage = RamStorage()

    
ix = storage.create_index(schema)
writer = ix.writer()

# get_db() is a project-specific helper (not shown) that gives access to the items
with get_db() as db:
    for item in db["items"][0:500]:
        writer.add_document(
            item_id=item["id"], type=item["type"], content=item["content"]
        )
writer.commit(optimize=True)

parser = QueryParser("content", schema=schema)
op = OperatorsPlugin(
    And="AND", Or="OR", AndNot="ANT", Not=None, AndMaybe=None, Require=None
)
parser.remove_plugin_class(PhrasePlugin)
parser.add_plugin(SequencePlugin())
parser.replace_plugin(op)


with ix.searcher() as searcher:
    query = '"найденный" AND ("по* мест* проживания" OR "рядом с домом")'
    query = parser.parse(query, debug=True)
    hits = searcher.search(query, terms=True, limit=None)
    pprint(list(hits))

I expect to get a list of hits, but I get "Exception: Field does not support spans" instead.

My content is text of variable length in different languages. Queries may also be in different languages.
