I have a program that produces roughly 5 million lines of logs per hour (~1400 lines/second), which works out to about 650 MB per hour. About 99.9999% of it comes from a library we use (Apache Spark / Databricks).
How can I run some kind of fuzzy duplicate identifier over this file to find the most repeated log lines, so I can silence them in my log4j2 config?
Most of these logs are useless even when something does go wrong, and they're causing a lot of problems with disk space, log processing (Grafana), alerts, etc.
I saw that Pandas can fuzzy-detect duplicates, but that seems aimed at much more structured data.
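
To be concrete, this is roughly what I mean by "fuzzy": mask out the variable parts of each line so similar lines collapse into one template, then count the templates. The sketch below is just what I'm imagining (the masking regexes are made-up placeholders, not tuned to my actual log format); I'm hoping there's a more established tool or approach for this.

```python
import re
import sys
from collections import Counter

# Placeholder masks for the variable parts of a log line
# (timestamps, long hex IDs, plain numbers) so that lines differing
# only in those parts count as the same template.
MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?"), "<TS>"),
    (re.compile(r"\b[0-9a-f]{8,}\b", re.IGNORECASE), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]

def normalize(line: str) -> str:
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line.strip()

counts = Counter()
with open(sys.argv[1], errors="replace") as f:
    for line in f:
        counts[normalize(line)] += 1

# Print the 50 most common templates with their counts,
# so I know which loggers to silence in log4j2.
for template, count in counts.most_common(50):
    print(f"{count:>10}  {template}")
```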