In my dataset, I have product descriptions that appear as:
- Product A, Product A, Product A
and in other rows as
- Product A, Product B, Product A, Product B
and in some rows, as just
- Product A
Initially, my dataset had strings in the format:
- Product A, Product B, Product A, Product B, Product A, Product B
and
- Product A, Product A, Product A
Since I wanted just one instance of each product, I resolved this issue by using the following code:
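library(stringr)
# Length of each string
df$lengths <- str_length(df$items)
# Keep the first third of each string (only fits strings that repeat three times)
df$new_items <- str_sub(df$items, 1, df$lengths/3)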
Is there a way to solve the above problem by modifying this code?
df <-
structure(list(Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)",
"Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32)",
"Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)",
"samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s(3+32)",
"Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)"
)), class = "data.frame", row.names = c(NA, -6L))
EDIT:
If the comma-separated strings do not always contain identical elements, more complex solutions are in order:
Data:
Solution 1: A regex solution based on negative character class, negative lookahead, and backreference -- basically, a one-liner:
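A minimal sketch of what such a regex could look like in base R (not necessarily the exact pattern intended above). It assumes the elements are separated by a comma followed by a single space. The sample vector x is modelled on the strings in the question, and the helper name dedupe_regex is made up for illustration. The pattern grabs one whole element with the negated character class [^,]+, refers back to it with \\1, and uses a negative lookahead to skip any element that occurs again later in the same string:

```r
# Hypothetical sample data, modelled on the strings in the question
x <- c("Product A, Product B, Product A, Product B",
       "Product A, Product A, Product A",
       "Product A")

# Illustrative helper (the name is made up): keep an element only where it
# is NOT followed by another copy of itself later in the same string.
dedupe_regex <- function(s) {
  pattern <- paste0(
    "(?:^|(?<=, ))",          # start of string, or right after ", "
    "([^,]+)(?=,|$)",         # one whole element (negated character class)
    "(?!.*, \\1(?:,|$))"      # negative lookahead with a backreference
  )
  # perl = TRUE is needed for lookarounds; regmatches() returns full matches
  hits <- regmatches(s, gregexpr(pattern, s, perl = TRUE))
  vapply(hits, toString, character(1))
}

dedupe_regex(x)
```

Because the lookahead discards every occurrence that has a later twin, it is the last occurrence of each element that survives, so the output order follows last appearances ("B, A, B" would give "A, B"). Elements that differ only in spacing or case are treated as different.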
Solution 2: Based on tidyr functionality
Before edit:
This solution is based on the assumption that the comma-separated elements in the strings are always identical:
Data:
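The rows below are copied from the question's df. If every element in a string really is identical, keeping the text before the first comma is enough. This is a minimal base R sketch of that idea; sub() with this pattern is one possible way to do it and may not be the exact call originally used, and new_name is an illustrative column name:

```r
# Three rows copied from the question's data
df <- data.frame(Product_name = c(
  "Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)",
  "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)",
  "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)"
))

# Drop everything from the first comma onwards, keeping the first element.
# new_name is an illustrative column name, not one from the question.
df$new_name <- sub(",.*", "", df$Product_name)

df$new_name
```

Unlike dividing the string length by a fixed number, this does not depend on how many times the element is repeated.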