A sample document and the tsvector built from it:
select *
from unnest(to_tsvector('english', 'something wide this more wider and wider social-economy wide somethings'));
| lexeme | positions |
|---|---|
| economi | 10 |
| social | 9 |
| social-economi | 8 |
| someth | 1,12 |
| wide | 2,11 |
| wider | 5,7 |
How do I interpret these positions on a client? The docs say that:
A position normally indicates the source word's location in the document.
(emphasis is mine)
This 'normally' part is what confuses me.
A location, in my understanding, is the index of a word's starting character in the document. But here it seems to be just the lexeme's ordinal index, not a character index or even a word-order index.
I need to highlight the lexemes in a source document exactly as they are processed by Postgres (without ts_headline). Using the sample above, the word wide should be found at positions 2 and 11. However, since Postgres uses stemming, stop-word dictionaries, etc., the real word locations are different.
So how do I 'map' a lexeme's positions to the source document to highlight them on the client? I need Postgres to just return the data it uses internally so that I can interpret it somehow.
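For context, this is roughly how I pull that internal data (the lexemes and their positions) to the client right now. A minimal sketch assuming the Npgsql driver and the same query as above; the connection string is just a placeholder:

using System;
using Npgsql;

// Fetch the lexemes and positions exactly as Postgres stores them,
// using the same query as at the top of the question.
// "Host=localhost;Database=test" is only a placeholder connection string.
await using var conn = new NpgsqlConnection("Host=localhost;Database=test");
await conn.OpenAsync();

const string sql = @"
    select lexeme, positions
    from unnest(to_tsvector('english',
        'something wide this more wider and wider social-economy wide somethings'))";

await using var cmd = new NpgsqlCommand(sql, conn);
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    var lexeme = reader.GetString(0);                    // text
    var positions = reader.GetFieldValue<short[]>(1);    // smallint[]
    Console.WriteLine($"{lexeme}: {string.Join(",", positions)}");
}

This prints exactly the lexeme/positions table from the top of the question, just on the client side.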
Basically, in the end I need to see something like this:
something wide this more wider and wider social-economy wide somethings
My first approach was to split the source document into tokens by whitespace on the client, retrieve the tsvector, extract the lexemes, and compare each token with a lexeme using something like string.StartsWith (currently in C#). But the problem is that some lexemes differ slightly from the source document's words; note the 'economi' lexeme vs the 'economy' token in the source document. Also, in the real project some extra synonyms are used, so string.StartsWith won't work.
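Here is a minimal sketch of that first approach (the lexeme list is hard-coded from the sample tsvector above), which also shows where StartsWith breaks down:

using System;
using System.Linq;

// Tokens come from splitting the source on spaces and hyphens;
// the lexemes are hard-coded here from the tsvector sample above.
var source = "something wide this more wider and wider social-economy wide somethings";
var lexemes = new[] { "economi", "social", "social-economi", "someth", "wide", "wider" };

var tokens = source.Split(new[] { ' ', '-' }, StringSplitOptions.RemoveEmptyEntries);

foreach (var token in tokens)
{
    // Pair the token with the first lexeme it "starts with".
    // 'economy' finds nothing because its stem is 'economi';
    // stop words (this, more, and) have no lexeme at all.
    var match = lexemes.FirstOrDefault(l => token.StartsWith(l, StringComparison.OrdinalIgnoreCase));
    Console.WriteLine($"{token,-12} -> {(match ?? "(no lexeme matched)")}");
}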
That's why I need real character positions. Is it possible to get them somehow?
UPDATE_01
Here is a sample of how I tried to split the source on the client to map the locations (using C#):
using System;
using System.Linq;

var source = "something wide this more wider and wider social-economy wide somethings";

// Split on spaces and hyphens, number the tokens (1-based) and sort them
// alphabetically so the list is comparable with the tsvector above.
var words = source
    .Split(new[] { ' ', '-' }, StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
    .Select((w, i) => new { Word = w, Pos = i + 1 })
    .OrderBy(w => w.Word);

foreach (var w in words)
    Console.WriteLine($"{w.Word} | {w.Pos}");
| Word | Pos |
|---|---|
| and | 6 |
| economy | 9 |
| more | 4 |
| social | 8 |
| something | 1 |
| somethings | 11 |
| this | 3 |
| wide | 2 |
| wide | 10 |
| wider | 5 |
| wider | 7 |
Some of the positions do match, but most of the others do not.
I see that Postgres first treats social-economi(y) as one word at position 8, then repeats its parts as separate words: social at 9 and economi at 10. Every word that follows the compound therefore ends up shifted by +1 compared with my split. As @KasbolatKumakhov said, we can use the \W regex pattern to replace all non-word characters with spaces and split on that.
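To check that reading, here is a rough sketch (for this sample only, not a general tokenizer) that numbers the words the way Postgres seems to: every whitespace-separated word, stop words included, takes the next position, and the hyphenated compound takes one position as a whole plus one per part:

using System;
using System.Text.RegularExpressions;

var source = "something wide this more wider and wider social-economy wide somethings";

// My approximation of Postgres's numbering for this sample only:
// each whitespace-separated word gets the next position (stop words too),
// and a hyphenated compound takes one position itself plus one per part.
var pos = 0;

foreach (var word in Regex.Split(source, @"\s+"))
{
    pos++;
    Console.WriteLine($"{pos,2}  {word}");          // the word, or the compound as a whole

    if (!word.Contains('-'))
        continue;

    foreach (var part in word.Split('-', StringSplitOptions.RemoveEmptyEntries))
    {
        pos++;
        Console.WriteLine($"{pos,2}  {part}");      // each part of the compound
    }
}

That reproduces positions 1–12 exactly as in the tsvector above: the stop words (this, more, and) still consume positions 3, 4 and 6 even though they don't appear in the vector, and the extra position for the compound as a whole is what shifts everything after it by one compared with my plain split.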