I have a main table whose rows are lists of string tags:
["A", "B", "C", "D"]
["A", "C", "D", "G"]
["A", "F", "G", "H"]
["A", "B", "G", "H"]
...
When I create a new row and insert the first tag (for example "A"), I want to be suggested the most frequent tags related to it, based on the existing rows.
In other words, for each tag (for example "A") I want to know the frequency of its related tags, and get a list of related tags ordered from most to least frequent.
For example:
"A".get_most_frequently_related_tags()
= {"G": 3, "B": 2, "C": 2, "D": 2, "H": 2, "F": 1}
My approach is to iterate over the main table and dynamically create a new table with these contents:
[ tag, related_tag, freq ]
[ "A", "B", 2 ]
[ "A", "G", 3 ]
[ "A", "H", 2 ]
...
and then select only the rows with tag "A" to extract a hash of [related_tag: freq] pairs ordered by frequency.
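A minimal sketch of this idea (the names are just illustrative): count every ordered (tag, related_tag) pair across all rows, flatten the counts into a [tag, related_tag, freq] table, then filter on tag "A":

```python
from collections import Counter
from itertools import permutations

rows = [
    ["A", "B", "C", "D"],
    ["A", "C", "D", "G"],
    ["A", "F", "G", "H"],
    ["A", "B", "G", "H"],
]

# Count every ordered (tag, related_tag) pair across all rows.
pair_freq = Counter(
    (tag, related) for row in rows for tag, related in permutations(row, 2)
)

# Materialize the counts as [tag, related_tag, freq] rows.
pair_table = [[tag, related, freq] for (tag, related), freq in pair_freq.items()]

# Select only the rows whose tag is "A", ordered by descending frequency.
a_related = dict(
    sorted(
        ((related, freq) for tag, related, freq in pair_table if tag == "A"),
        key=lambda item: -item[1],
    )
)
print(a_related)  # {'G': 3, 'B': 2, 'C': 2, 'D': 2, 'H': 2, 'F': 1}
```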
Is that the best approach? I don't know if there's a better algorithm (or one using machine learning?)...
Instead of a new table with one row per (tag, related_tag) pair, I suggest a mapping with one row per tag, where each row maps the tag to the whole list of its related tags (and their frequencies).
Most programming languages have a standard "map" in their standard library: in C++, it's std::map or std::unordered_map; in Java, it's the interface java.util.Map, implemented as java.util.HashMap or java.util.TreeMap; in Python, it's dict. Here is a solution in Python. The map is implemented with collections.defaultdict, and it maps each tag to a collections.Counter, the Python tool of choice to count frequencies.
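A sketch of that solution (the function name is just illustrative): each row contributes one co-occurrence per ordered pair of distinct tags, and Counter.most_common gives the frequency-ordered list directly.

```python
from collections import defaultdict, Counter

def build_related_tags(rows):
    """Map each tag to a Counter of the tags it co-occurs with."""
    related = defaultdict(Counter)
    for row in rows:
        for tag in row:
            for other in row:
                if other != tag:
                    related[tag][other] += 1
    return related

rows = [
    ["A", "B", "C", "D"],
    ["A", "C", "D", "G"],
    ["A", "F", "G", "H"],
    ["A", "B", "G", "H"],
]

related = build_related_tags(rows)
print(related["A"].most_common())
# [('G', 3), ('B', 2), ('C', 2), ('D', 2), ('H', 2), ('F', 1)]
```

This is a single pass over the table, and lookups afterwards are O(1) per tag; when a new tag is typed, related[tag].most_common(n) returns the top n suggestions without re-scanning the data.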