I have two tables keywords and posts in my PieCloudDB Database.
Each topic can be expressed by one or more keywords. If a keyword of a certain topic exists in the content of a post (case insensitive) then the post has this topic.
For example:
| topic_id | keyword |
|---|---|
| 1 | basketball |
| 2 | music |
| 3 | food |
| 4 | war |
| post_id | content |
|---|---|
| 1 | A typhoon warning has been issued in southern Japan |
| 2 | We are going to play neither basketball nor volleyball |
| 3 | I am indulging in both the delightful music and delectable food |
| 4 | That basketball player fouled again |
Now I want to find the topics of each post according to the following rules:
If the post does not have keywords from any topic, its topic should be "
Vague!".If the post has at least one keyword of any topic, its topic should be a string of the IDs of its topics sorted in ascending order and separated by commas ','.
For the above example data, the results should be:
| post_id | topics |
|---|---|
| 1 | Vague! |
| 2 | 1 |
| 3 | 2,3 |
| 4 | 1 |
SELECT post_id, COALESCE(array_to_string(array_agg(DISTINCT topic_id ORDER BY topic_id), ','), 'Vague!') AS topic
FROM (
SELECT p.post_id, k.topic_id
FROM Posts p
LEFT JOIN Keywords k
ON LOWER(content) LIKE '% ' || keyword || ' %' OR content LIKE keyword || ' %' OR content LIKE '% ' || keyword
) a
GROUP BY post_id
ORDER BY post_id
I tried this query but the results I got were not exactly correct. I don't know why the output of post 1 is null:
| post_id | topics |
|---|---|
| 1 | |
| 2 | 1 |
| 3 | 2,3 |
| 4 | 1 |
Can anyone give me a correct answer?
(If you don’t know the database I use, you can use PostgreSQL instead.)
Simple regex without index support
Without index support (only sensible for trivial cardinalities!) this is fastest in Postgres, while doing exactly what you ask for:
~*is the case-insensitive regular-expression match operator. See:Concerning my regex, I cite the manual:
This covers the start (
^) and end ($) of the string implicitly.\W(as suggested in another answer) matches any non-word character, and is wrong for the task.)Note how I apply
ORDER BYonce in a subquery instead of per-aggregate, because that's faster. See:In this constellation, a simple
COALESCEcatches the case of no matches.FTS index for strict matching
The simple (naive) approach above scales with O(N*M), i.e. terribly with a non-trivial number of rows in each table. Typically, you want index support.
While strictly matching keywords, the best index should be a Full Text Search index with the
'simple'dictionary, and a query that can actually use that index:FTS index for matching English words
To match natural language words with built-in stemming, use a matching dictionary,
'english'in the example:fiddle
For fuzzy matching consider a trigram index. See:
Related: