A sample document and the tsvector built from it:
select *
from unnest(to_tsvector('english', 'something wide this more wider and wider social-economy wide somethings'));
| lexeme | positions |
|---|---|
| economi | 10 |
| social | 9 |
| social-economi | 8 |
| someth | 1,12 |
| wide | 2,11 |
| wider | 5,7 |
How do I interpret these positions on a client? The docs say that:
A position normally indicates the source word's location in the document.
(emphasis is mine)
This 'normally' part is what confuses me.
A location, in my understanding, is the index of a word's starting character in the document. But here it seems to be just the lexeme's ordinal index, not a character index or even a word-order index.
I need to highlight the lexemes in a source document exactly as they are processed by Postgres (without ts_headline). Using the sample above, the word wide should be found at positions 2 and 11. However, since Postgres uses stemming, stop-word dictionaries, etc., the real word locations are different.
So how do I 'map' a lexeme's positions to the source document to highlight them on the client? I need Postgres to just return the data it uses internally so that I can interpret it somehow.
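For context, this is roughly how I pull that internal data (the lexemes and their positions) to the client right now. A minimal sketch assuming the Npgsql driver and the same query as above; the connection string is just a placeholder:

using System;
using Npgsql;

// Fetch the lexemes and positions exactly as Postgres stores them,
// using the same query as at the top of the question.
// "Host=localhost;Database=test" is only a placeholder connection string.
await using var conn = new NpgsqlConnection("Host=localhost;Database=test");
await conn.OpenAsync();

const string sql = @"
    select lexeme, positions
    from unnest(to_tsvector('english',
        'something wide this more wider and wider social-economy wide somethings'))";

await using var cmd = new NpgsqlCommand(sql, conn);
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    var lexeme = reader.GetString(0);                    // text
    var positions = reader.GetFieldValue<short[]>(1);    // smallint[]
    Console.WriteLine($"{lexeme}: {string.Join(",", positions)}");
}

This prints exactly the lexeme/positions table from the top of the question, just on the client side.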
Basically, in the end I need to see something like this:
something wide this more wider and wider social-economy wide somethings
My first approach was to split the source document into tokens by whitespace on the client, retrieve the tsvector, extract the lexemes, and compare each token with a lexeme using something like string.StartsWith (currently in C#). But the problem is that some lexemes differ slightly from the source document's words; note the 'economi' lexeme vs the 'economy' token in the source document. Also, in the real project some extra synonyms are used, so string.StartsWith won't work.
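Here is a minimal sketch of that first approach (the lexeme list is hard-coded from the sample tsvector above), which also shows where StartsWith breaks down:

using System;
using System.Linq;

// Tokens come from splitting the source on spaces and hyphens;
// the lexemes are hard-coded here from the tsvector sample above.
var source = "something wide this more wider and wider social-economy wide somethings";
var lexemes = new[] { "economi", "social", "social-economi", "someth", "wide", "wider" };

var tokens = source.Split(new[] { ' ', '-' }, StringSplitOptions.RemoveEmptyEntries);

foreach (var token in tokens)
{
    // Pair the token with the first lexeme it "starts with".
    // 'economy' finds nothing because its stem is 'economi';
    // stop words (this, more, and) have no lexeme at all.
    var match = lexemes.FirstOrDefault(l => token.StartsWith(l, StringComparison.OrdinalIgnoreCase));
    Console.WriteLine($"{token,-12} -> {(match ?? "(no lexeme matched)")}");
}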
That's why I need real character positions. Is it possible to get them somehow?
UPDATE_01
Here is a sample of how I tried to split the source on the client to map the locations (using C#):
using System;
using System.Linq;

var source = "something wide this more wider and wider social-economy wide somethings";

// Split on spaces and hyphens, number the tokens (1-based) and sort them
// alphabetically so the list is comparable with the tsvector above.
var words = source
    .Split(new[] { ' ', '-' }, StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
    .Select((w, i) => new { Word = w, Pos = i + 1 })
    .OrderBy(w => w.Word);

foreach (var w in words)
    Console.WriteLine($"{w.Word} | {w.Pos}");
| Word | Pos |
|---|---|
| and | 6 |
| economy | 9 |
| more | 4 |
| social | 8 |
| something | 1 |
| somethings | 11 |
| this | 3 |
| wide | 2 |
| wide | 10 |
| wider | 5 |
| wider | 7 |
Some of the positions do match, but most of the others do not.
I see that Postgres first treats social-economi(y) as one word at position 8, then repeats its parts as separate words: social at 9 and economi at 10. Every word that follows the compound therefore ends up shifted by +1 compared with my split. As @KasbolatKumakhov said, we can use the \W regex pattern to replace all non-word characters with spaces and split on that.
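To check that reading, here is a rough sketch (for this sample only, not a general tokenizer) that numbers the words the way Postgres seems to: every whitespace-separated word, stop words included, takes the next position, and the hyphenated compound takes one position as a whole plus one per part:

using System;
using System.Text.RegularExpressions;

var source = "something wide this more wider and wider social-economy wide somethings";

// My approximation of Postgres's numbering for this sample only:
// each whitespace-separated word gets the next position (stop words too),
// and a hyphenated compound takes one position itself plus one per part.
var pos = 0;

foreach (var word in Regex.Split(source, @"\s+"))
{
    pos++;
    Console.WriteLine($"{pos,2}  {word}");          // the word, or the compound as a whole

    if (!word.Contains('-'))
        continue;

    foreach (var part in word.Split('-', StringSplitOptions.RemoveEmptyEntries))
    {
        pos++;
        Console.WriteLine($"{pos,2}  {part}");      // each part of the compound
    }
}

That reproduces positions 1–12 exactly as in the tsvector above: the stop words (this, more, and) still consume positions 3, 4 and 6 even though they don't appear in the vector, and the extra position for the compound as a whole is what shifts everything after it by one compared with my plain split.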