How can I share sensitive data between programs while still being able to compare it with other local data?

Context

As part of my studies, I am writing a bot in Python 3 that detects scam messages. One of the problems I am facing is the detection of fraudulent websites.
Currently, I have a list of domain names saved in a CSV file, containing both known safe domains (discord.com, google.com, etc.) and known fraudulent domains (free-nitro.ru, etc.).

To share this list between my personal computer and my server, I regularly "deploy" it over FTP. But since my bot also uses GitHub and a MySQL database, I'm looking for a better way to synchronize this list of domain names without making it publicly accessible.
I feel like I'm looking for a miracle solution that doesn't exist, but I don't want to overestimate my knowledge, so I'm coming to you for advice. Thanks in advance!

Solutions I have considered:

  • Put the domain names in a MySQL table
    Advantages: no public access, live synchronization
    Disadvantages: my scam detection script should be able to work offline

  • Hash the domain names before putting them on git
    Advantages: no public access, easy to do, supports equality comparison
    Disadvantages: does not support similarity comparison, which is an important part of the program

  • Hash domain names with locality-sensitive hashing
    Advantages: no easy public access, supports equality and similarity comparison
    Disadvantages: similarity scores are less precise than with plaintext, and the server cannot hash a new string without knowing at least the random seed, so the public access problem comes back
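
To illustrate the second option, here is a minimal sketch of why an ordinary cryptographic hash keeps equality checks but loses similarity. The helper names (`hash_domain`, `is_known`, `similarity`) and the use of SHA-256 and `difflib` are assumptions made for this example, not part of the bot described above.

```python
import hashlib
from difflib import SequenceMatcher


def hash_domain(domain: str) -> str:
    """Return the SHA-256 hex digest of a normalized domain name."""
    normalized = domain.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Plain string similarity in [0, 1], used here as a stand-in metric."""
    return SequenceMatcher(None, a, b).ratio()


# Only the digests would be committed to git, never the plaintext list.
known_hashes = {hash_domain(d) for d in ("discord.com", "free-nitro.ru")}


def is_known(domain: str) -> bool:
    """Equality lookup still works on hashed data."""
    return hash_domain(domain) in known_hashes


# A lookalike domain is close to the original in plaintext...
plain_similarity = similarity("discord.com", "disc0rd.com")
# ...but the two digests are unrelated, so the closeness is lost.
hashed_similarity = similarity(
    hash_domain("discord.com"), hash_domain("disc0rd.com")
)
```

An exact match such as `is_known("Discord.com")` succeeds, while the lookalike `disc0rd.com` is simply reported as unknown, which is the limitation noted in the list above.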

My opinion

It seems to me that the last solution, with LSH, is the one that causes the fewest problems. But it is far from satisfying, and I hope to find something better. I reimplemented the LSH algorithm here (from this notebook). The similarity coefficients I get are between 10% and 40% lower than those obtained with the current plaintext method.
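
For readers unfamiliar with the idea, the following is a rough MinHash-style sketch, not the implementation from the notebook mentioned above. The shingle size, the number of hash functions, the use of `blake2b`, and all function names are assumptions chosen for illustration. It shows the two drawbacks discussed: the similarity is only an estimate of the exact value, and every instance needs the same `seed` to produce comparable signatures.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Split a string into its set of overlapping k-character substrings."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text: str, num_hashes: int = 64, seed: int = 0) -> list[int]:
    """Keep, for each seeded hash function, the smallest hash over all shingles."""
    signature = []
    for i in range(num_hashes):
        salt = f"{seed}:{i}:".encode("utf-8")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + s.encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text)
        ))
    return signature


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity on plaintext shingles, for comparison."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Only signatures would be stored; the seed must be shared by all instances.
sig_real = minhash_signature("discord.com", seed=42)
sig_fake = minhash_signature("disc0rd.com", seed=42)
estimate = estimated_similarity(sig_real, sig_fake)
exact = exact_jaccard("discord.com", "disc0rd.com")
```

With more hash functions the estimate tends to move closer to the exact value, at the cost of longer signatures. Signatures built with different seeds cannot be meaningfully compared, which is why the seed ends up being as sensitive as the list itself.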

EDIT: to clarify, since my intentions may not have been clear enough (I’m sorry, English is not my native language and I’m bad at explaining things lol): the database and GitHub are just convenient ways to share information between my different bot instances. I could have one running locally on my PC, one on my VPS, another God knows where… which is why I don’t want FTP or any kind of synchronisation process that depends on an IP address and/or a fixed destination folder. Ideally, I’d like to be able to take my program at any time, download it wherever I want (with git clone) and just run it.
Please tell me if this isn’t clear enough, thanks :)

There is 1 solution below

Z_runner On

In the end, I think I'll use yet another solution. I'm thinking of storing the domain names in the MySQL database, but having my script only synchronize from it while keeping a local CSV copy.

In short, the workflow I'm imagining:

  • I edit my SQL table when I want to add or remove items
  • When the bot is launched, the script connects to the DB and retrieves all the rows from the table
  • Once the data is retrieved, the script saves it to a CSV file and then runs the rest of the program
  • If no internet connection is available at launch, the synchronization with the DB is skipped and only the CSV file is used.
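
The workflow above could be sketched roughly as follows. This is only an outline under some assumptions: `load_domains`, `fetch_remote`, and the two-column `(domain, label)` row layout are hypothetical names chosen for the example. `fetch_remote` stands for whatever function runs the SELECT query through the MySQL client library in use, and the broad `except` should be narrowed to that library's error types.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable

Row = tuple[str, str]  # (domain, label), an assumed layout


def load_domains(
    fetch_remote: Callable[[], Iterable[Row]],
    cache_path: Path,
) -> list[Row]:
    """Try to sync from the database; fall back to the local CSV if that fails."""
    try:
        rows = list(fetch_remote())
    except Exception:
        # Assumption: any failure here means the DB is unreachable.
        rows = None

    if rows is not None:
        # Write to a temporary file first so a crash cannot leave a
        # half-written cache behind, then swap it into place.
        tmp_path = cache_path.with_suffix(".tmp")
        with tmp_path.open("w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        tmp_path.replace(cache_path)
        return rows

    # Offline: use whatever was cached on a previous run.
    with cache_path.open(newline="", encoding="utf-8") as f:
        return [(domain, label) for domain, label in csv.reader(f)]
```

On the very first run without a connection there is no CSV yet, so this sketch raises `FileNotFoundError`; the bot would need to decide whether to stop or to start with an empty list in that case. The CSV file itself would have to stay out of the git repository (for example through `.gitignore`) for the list to remain private.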

This way I get no public access, automatic synchronization, and offline access after the first start, and I keep support for similarity comparison since nothing is hashed.

If you think you can improve on my idea, I'm interested!