How to handle Full Text Search Index across multiple languages in a single table?

35 Views Asked by At

I have a table in Postgres where I store articles for multiple languages. For a full-text search, I have a separate column that is enabled for full-text search using GIN indexes. Depending on the article language, I store the corresponding body for the corresponding dictionary: German, English, Spanish etc

However, when I want to query the database, it makes a significant difference for which dictionary I search. Actually, I want to query all languages and not only specific ones. So I am curious about how I can achieve that.

This is my table:

create table if not exists public.article
(
    id                        serial constraint "pk_article_id" primary key,
    title                     varchar(1024),
    body                      text,
    lang                      varchar(2),
    fts_dict                  regconfig generated always as (
                                  CASE
                                      WHEN ((lang)::text = 'en'::text) THEN 'english'::regconfig
                                      WHEN ((lang)::text = 'de'::text) THEN 'german'::regconfig
                                      WHEN ((lang)::text = 'es'::text) THEN 'spanish'::regconfig
                                      ELSE 'simple'::regconfig
                                      END) stored,
    fts_body                  tsvector generated always as (
                                  CASE
                                      WHEN ((lang)::text = 'en'::text) THEN to_tsvector('english'::regconfig, body)
                                      WHEN ((lang)::text = 'de'::text) THEN to_tsvector('german'::regconfig, body)
                                      WHEN ((lang)::text = 'es'::text) THEN to_tsvector('spanish'::regconfig, body)
                                      ELSE to_tsvector('simple'::regconfig, body)
                                      END) stored
);

And here comes the GIN-Index:

create index if not exists idx_article_fts_body
    on public.article using gin (fts_body);

So this table contains article in several languages. I have an auto-generated column that stores the body as tsvector in its specific dictionary, depending on the original language.

While searching for articles, I come across, that the dictionary matters heavily:

SELECT a.id, a.lang, a.title, a.body
FROM article a
         INNER JOIN summaries s ON a.id = s.article_id
WHERE a.fts_body @@ plainto_tsquery('german', 'Grünheide')

delivers only all articles with a german locale, but not the english or spanish articles containing 'Grünheide'.

If I want to include all english or spanish articles, that contain the same word "grünheide", I need to query for the corresponding dictionary additionally:

SELECT a.id, a.lang, a.title, a.body
FROM article a
         INNER JOIN summaries s ON a.id = s.article_id
WHERE a.fts_body @@ plainto_tsquery('english', 'Grünheide')

SELECT a.id, a.lang, a.title, a.body
FROM article a
         INNER JOIN summaries s ON a.id = s.article_id
WHERE a.fts_body @@ plainto_tsquery('spanish', 'Grünheide')

Omitting the dictionary in the query seems to default to "simple" or "english", which makes the result similar to if I had asked for "english".

My question is: do I really need to iterate over all dictionary-types (languages) and re-query again and again to get the full list of all articles that contains a specific term?

Or is there any other technique I should know?

1

There are 1 best solutions below

1
JGH On

You must use the same language in the query as for the ts_vector

SELECT a.id, a.lang, a.title, a.body
FROM article a
         INNER JOIN summaries s ON a.id = s.article_id
WHERE a.fts_body @@ plainto_tsquery(a.fts_dict, 'Grünheide')