Get frequency of related tags in a table - calculated table or ml?

78 Views Asked by At

I have a main table of multiple string tags:

["A", "B", "C", "D"]
["A", "C", "D", "G"]
["A", "F", "G", "H"]
["A", "B", "G", "H"]
...

When I create a new row and insert the first tag (by example "A"), I want to get suggested the most frequent tags related to it by looking in the existing rows.

In other words, I want to know for each tag (by example "A"), the frequency of related tags and get a list of related tags ordered by most frequents.

For example:

"A".get_most_frequently_related_tags()
= {"G": 3, "B": 2, "C": 2, "H": 2}

My approach is to iterate the main table and create dinamically a new table with this contents:

[ tag, related_tag, freq ]
[ "A", "B", 2 ]
[ "A", "G", 3 ]
[ "A", "H", 2 ]
...

and then select only rows with tag "A" to extract an hash of ordered [related_tag: freq].

Is that the best approach? I don't know if there's a better algorithm (or using machine learning?)...

2

There are 2 best solutions below

0
Stef On

Instead of a new table with one row per pair (tag, related_tag), I suggest a mapping with one row per tag, but this row maps the tag to the whole list of all its related tags (and their frequencies).

Most programming languages have a standard "map" in their standard library: in C++, it's std::map or std::unordered_map; in Java, it's the interface java.util.Map, implemented as java.util.HashMap or java.util.TreeMap; in python, it's dict.

Here is a solution in python. The map is implemented with collections.defaultdict, and it maps each tag to a collections.Counter, the python tool of choice to count frequencies.

from collections import Counter, defaultdict

table = [
    ["A", "B", "C", "D"],
    ["A", "C", "D", "G"],
    ["A", "F", "G", "H"],
    ["A", "B", "G", "H"],
]

def build_frequency_table(table):
    freqtable = defaultdict(Counter)
    for row in table:
        for tag in row:
            freqtable[tag].update(row)
    for c,freq in freqtable.items():
        del freq[c]
    return freqtable

freqtable = build_frequency_table(table)
print( freqtable )
# defaultdict(<class 'collections.Counter'>,
#   {'A': Counter({'G': 3, 'B': 2, 'C': 2, 'D': 2, 'H': 2, 'F': 1}),
#    'B': Counter({'A': 2, 'C': 1, 'D': 1, 'G': 1, 'H': 1}),
#    'C': Counter({'A': 2, 'D': 2, 'B': 1, 'G': 1}),
#    'D': Counter({'A': 2, 'C': 2, 'B': 1, 'G': 1}),
#    'G': Counter({'A': 3, 'H': 2, 'C': 1, 'D': 1, 'F': 1, 'B': 1}),
#    'F': Counter({'A': 1, 'G': 1, 'H': 1}),
#    'H': Counter({'A': 2, 'G': 2, 'F': 1, 'B': 1})})

print(freqtable['A'].most_common())
# [('G', 3), ('B', 2), ('C', 2), ('D', 2), ('H', 2), ('F', 1)]
0
Astrid E. On

I've had a go at finding a solution for this in C#. I cannot defend this approach performance-wise, but 1) it serves the purpose (at least for inputs that are not too large); and 2) I found it to be an interesting challenge personally.

As in Stef's answer, a dictionary is created and may be used to look up any wanted tag to see all of the tag's related tags, ordered by frequency.

I've placed the dictionary creation inside an extension method:

public static IDictionary<string, List<(string Tag, int Count)>> AsRelatedTagWithFrequencyMap
    (this IEnumerable<IEnumerable<string>> relatedTags)
{
    return relatedTags
        .SelectMany(row => row
            .Select(targetTag =>
                (TargetTag: targetTag,
                RelatedTags: row.Where(tag => tag != targetTag))))
        .GroupBy(relations => relations.TargetTag)
        .ToDictionary(
            grouping => grouping.Key,
            grouping => grouping
                .SelectMany(relations => relations.RelatedTags)
                .GroupBy(relatedTag => relatedTag)
                .Select(grouping => (RelatedTag: grouping.Key, Count: grouping.Count()))
                .OrderByDescending(relatedTag => relatedTag.Count)
                .ToList());
}

It is used as follows:

var tagsUsedWithTags = new List<string[]>
{
    new[] { "A", "B", "C", "D" },
    new[] { "A", "C", "D", "G" },
    new[] { "A", "F", "G", "H" },
    new[] { "A", "B", "G", "H" }
};

var relatedTagsOfTag = tagsUsedWithTags.AsRelatedTagWithFrequencyMap();

Printing the dictionary content:

foreach (var relation in relatedTagsOfTag)
{
    Console.WriteLine($"{relation.Key}: [ {string.Join(", ", relation.Value.Select(related => $"({related.Tag}: {related.Count})"))} ]");
}

A: [ (G: 3), (B: 2), (C: 2), (D: 2), (H: 2), (F: 1) ]
B: [ (A: 2), (C: 1), (D: 1), (G: 1), (H: 1) ]
C: [ (A: 2), (D: 2), (B: 1), (G: 1) ]
D: [ (A: 2), (C: 2), (B: 1), (G: 1) ]
F: [ (A: 1), (G: 1), (H: 1) ]
G: [ (A: 3), (H: 2), (C: 1), (D: 1), (F: 1), (B: 1) ]
H: [ (A: 2), (G: 2), (F: 1), (B: 1) ]