Stop OpenSearch from counting misspelled words (from fuzziness) in the score when a correct word is matched

Asked by Calebmer on 19 March 2024 at 17:03

I have the following OpenSearch query:

{
  "bool": {
    "minimum_should_match": 1,
    "should": [
      {
        "multi_match": {
          "query": "collections",
          "type": "most_fields",
          "fields": ["title^1.8", "body"],
          "fuzziness": "AUTO",
          "prefix_length": 1
        }
      },
      {
        "multi_match": {
          "query": "collections",
          "type": "most_fields",
          "fields": [
            "title._2gram^1.8",
            "title._3gram^1.8",
            "body._2gram",
            "body._3gram"
          ]
        }
      }
    ]
  }
}

Both title and body are, more or less, search-as-you-type fields using the english language analyzer. The goal of this query is to allow fuzzy matching on individual words, but not on the 2gram or 3gram word combinations.

This query returns a hit which, with explain: true, gives the following explanation:

19.278 sum of:
├── 2.650 weight(body:collect in 7650) [PerFieldSimilarity], result of:
│   └─ 2.650 score(freq=23.0), computed as boost * idf * tf from:
│      ├─ 2.200 boost
│      │
│      ├─ 1.240 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
│      │  ├─ 14964.000 n, number of documents containing term
│      │  └─ 51694.000 N, total number of documents with field
│      │
│      └─ 0.972 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
│         ├─── 23.000 freq, occurrences of term within document
│         ├──── 1.200 k1, term saturation parameter
│         ├──── 0.750 b, length normalization parameter
│         ├─ 1176.000 dl, length of field (approximate)
│         └─ 2839.879 avgdl, average length of field
│
├── 1.458 weight(body:collector in 7650) [PerFieldSimilarity], result of:
│   └─ 1.458 score(freq=2.0), computed as boost * idf * tf from:
│      ├─ 1.571 boost
│      │
│      ├─ 1.240 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
│      │  ├─ 14964.000 n, number of documents containing term
│      │  └─ 51694.000 N, total number of documents with field
│      │
│      └─ 0.748 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
│         ├──── 2.000 freq, occurrences of term within document
│         ├──── 1.200 k1, term saturation parameter
│         ├──── 0.750 b, length normalization parameter
│         ├─ 1176.000 dl, length of field (approximate)
│         └─ 2839.879 avgdl, average length of field
│
├── 1.165 weight(body:correct in 7650) [PerFieldSimilarity], result of:
│   └─ 1.165 score(freq=1.0), computed as boost * idf * tf from:
│      ├─ 1.571 boost
│      │
│      ├─ 1.240 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
│      │  ├─ 14964.000 n, number of documents containing term
│      │  └─ 51694.000 N, total number of documents with field
│      │
│      └─ 0.598 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
│         ├──── 1.000 freq, occurrences of term within document
│         ├──── 1.200 k1, term saturation parameter
│         ├──── 0.750 b, length normalization parameter
│         ├─ 1176.000 dl, length of field (approximate)
│         └─ 2839.879 avgdl, average length of field
│
└─ 14.006 sum of:
   └─ 14.006 weight(title:collect in 7650) [PerFieldSimilarity], result of:
      └─ 14.006 score(freq=1.0), computed as boost * idf * tf from:
         ├─ 3.960 boost
         │
         ├─ 6.737 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
         │  ├──── 67.000 n, number of documents containing term
         │  └─ 56901.000 N, total number of documents with field
         │
         └─ 0.525 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
            ├─ 1.000 freq, occurrences of term within document
            ├─ 1.200 k1, term saturation parameter
            ├─ 0.750 b, length normalization parameter
            ├─ 2.000 dl, length of field
            └─ 2.976 avgdl, average length of field
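
As a sanity check on the explanation above, here is a minimal Python sketch that recomputes each leaf score from the boost, idf, and tf formulas printed in the explain output. The function names are made up for this sketch; the inputs are copied from the tree, and because the displayed boosts are rounded to three decimals, the results only agree with the displayed scores to within rounding.

```python
import math

def bm25_idf(n, N):
    # idf formula as printed in the explain output
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf formula as printed in the explain output
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def leaf_score(boost, n, N, freq, dl, avgdl):
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)

# (boost, n, N, freq, dl, avgdl), copied from the explanation above
leaves = {
    "body:collect":   (2.200, 14964, 51694, 23, 1176, 2839.879),
    "body:collector": (1.571, 14964, 51694, 2, 1176, 2839.879),
    "body:correct":   (1.571, 14964, 51694, 1, 1176, 2839.879),
    "title:collect":  (3.960, 67, 56901, 1, 2, 2.976),
}

scores = {term: leaf_score(*args) for term, args in leaves.items()}
for term, score in scores.items():
    print(f"{term:<16} {score:.3f}")
```

Note that all three body terms report the same n in the explain output, so the fuzzy matches differ from the exact match only in boost and term frequency.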

There’s a body match on “collect”, “collector”, and “correct”. There’s a title match on “collect”.

What’s surprising to me is that the fuzzy matches of “collector” and “correct” add to the score. This places this hit above other, better matches which only match “collect” (“collect” is the stemmed version of the queried word “collections”).

If the exact word already matched, I don’t want fuzzy matches to add to the score. So the 1.458 from matching “collector” and the 1.165 from matching “correct” should be zeroed out, since “collect” matched. The final score should be 16.655, not 19.278.

I thought this might be related to "type": "most_fields". I still want to sum the scores when both body and title match, so I tried splitting them into separate clauses:
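
To make the rule I'm after concrete, here is a small Python sketch of the scoring I would like, applied to the leaf scores from the explanation. This only models the desired arithmetic and is not an OpenSearch feature; the `desired_score` function, the per-field grouping, and the labelling of “collector” and “correct” as fuzzy are assumptions of the sketch.

```python
from collections import defaultdict

# (field, term, score, is_fuzzy), copied from the explanation above.
# Which terms count as fuzzy is this sketch's reading of the output.
leaves = [
    ("body", "collect", 2.650, False),
    ("body", "collector", 1.458, True),
    ("body", "correct", 1.165, True),
    ("title", "collect", 14.006, False),
]

def desired_score(leaves):
    """Sum leaf scores, dropping fuzzy matches in any field that has an exact match."""
    by_field = defaultdict(list)
    for field, _term, score, is_fuzzy in leaves:
        by_field[field].append((score, is_fuzzy))

    total = 0.0
    for matches in by_field.values():
        has_exact = any(not is_fuzzy for _, is_fuzzy in matches)
        for score, is_fuzzy in matches:
            if is_fuzzy and has_exact:
                continue  # the exact term matched, so ignore the fuzzy variants
            total += score
    return total

actual = sum(score for _, _, score, _ in leaves)
wanted = desired_score(leaves)
print(f"actual={actual:.3f} wanted={wanted:.3f}")
```

Because the leaf scores are rounded, the two sums can differ from 19.278 and 16.655 in the last digit. A document that only matched a fuzzy variant would keep its fuzzy score under this rule.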

{
  "bool": {
    "minimum_should_match": 1,
    "should": [
      {
        "multi_match": {
          "query": "collections",
          "type": "best_fields",
          "fields": ["title^1.8"],
          "fuzziness": "AUTO",
          "prefix_length": 1
        }
      },
      {
        "multi_match": {
          "query": "collections",
          "type": "best_fields",
          "fields": ["body"],
          "fuzziness": "AUTO",
          "prefix_length": 1
        }
      },
      {
        "multi_match": {
          "query": "collections",
          "type": "most_fields",
          "fields": [
            "title._2gram^1.8",
            "title._3gram^1.8",
            "body._2gram",
            "body._3gram"
          ]
        }
      }
    ]
  }
}

…but that gives the same result. (I realize now that multi_match with a single field is unnecessary. It could be match.)
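
For reference, the two single-field clauses could be written with match roughly as follows. This is a sketch that builds the clauses as Python dicts; `fuzzy_match` is a hypothetical helper, and moving the `^1.8` field boost into the `boost` parameter is my assumption about the equivalent form.

```python
import json

def fuzzy_match(field, text, boost=1.0):
    # Hypothetical helper: a single-field fuzzy match clause with the
    # same fuzziness settings as the multi_match clauses above.
    return {
        "match": {
            field: {
                "query": text,
                "fuzziness": "AUTO",
                "prefix_length": 1,
                "boost": boost,
            }
        }
    }

should = [
    fuzzy_match("title", "collections", boost=1.8),
    fuzzy_match("body", "collections"),
]
print(json.dumps(should, indent=2))
```

The 2gram and 3gram multi_match clause would stay as it is, since it still spans several fields.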
