I have a program that produces roughly 5 million lines of logs per hour (~1400 lines/second), which works out to about 650 MB per hour. About 99.9999% of it comes from a library we use (Apache Spark / Databricks).
How can I run some kind of fuzzy duplicate identifier over this file to find the most repeated log lines, so I can silence them in my log4j2 config?
Most of these logs are useless even when something does go wrong, and they're causing a lot of problems with disk space, log processing (Grafana), alerts, etc.
I saw that Pandas can fuzzy-detect duplicates, but that seems aimed at much more structured data.
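
To be concrete, this is roughly what I mean by "fuzzy": mask out the variable parts of each line so similar lines collapse into one template, then count the templates. The sketch below is just what I'm imagining (the masking regexes are made-up placeholders, not tuned to my actual log format); I'm hoping there's a more established tool or approach for this.

```python
import re
import sys
from collections import Counter

# Placeholder masks for the variable parts of a log line
# (timestamps, long hex IDs, plain numbers) so that lines differing
# only in those parts count as the same template.
MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?"), "<TS>"),
    (re.compile(r"\b[0-9a-f]{8,}\b", re.IGNORECASE), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]

def normalize(line: str) -> str:
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line.strip()

counts = Counter()
with open(sys.argv[1], errors="replace") as f:
    for line in f:
        counts[normalize(line)] += 1

# Print the 50 most common templates with their counts,
# so I know which loggers to silence in log4j2.
for template, count in counts.most_common(50):
    print(f"{count:>10}  {template}")
```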