Find most repeated log lines in a large log file, fuzzy match


I have a program that produces about ~5 million lines of logs per hour (~1,400 lines/second), roughly 650 MB. About 99.9999% of it comes from a library we use (Apache Spark / Databricks).

How can I run some kind of fuzzy duplicate detection on this file to identify the most repeated log lines, so I can silence them in my log4j2 config?

Most of these logs are useless even when something does go wrong, and the volume is causing a lot of problems with disk space, log processing (Grafana), alerts, etc.

I saw pandas-based fuzzy duplicate detection, but it seems designed for much more structured data.
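One possible approach, as a minimal sketch: if "fuzzy" duplicates differ mainly in timestamps, numbers, and IDs, masking those tokens and counting the normalized lines is often enough to surface the noisiest templates without pandas or true fuzzy matching. The file name and the masking regexes below are illustrative assumptions, not anything from the original question.

```python
import re
from collections import Counter

# Replace timestamps, hex IDs, and plain numbers with placeholders so that
# lines differing only in those values collapse into one template.
# These patterns are assumptions; adjust them to the actual log format.
MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?"), "<TS>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def normalize(line: str) -> str:
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()

# Stream the file line by line so a multi-GB log never has to fit in memory.
counts = Counter()
with open("app.log", encoding="utf-8", errors="replace") as f:  # hypothetical path
    for line in f:
        counts[normalize(line)] += 1

# Show the 20 most frequent normalized templates with their counts.
for template, count in counts.most_common(20):
    print(f"{count:>10}  {template}")
```

The top templates from a run like this can then be traced back to their logger names and silenced with per-logger level overrides in the log4j2 config.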
