I have a big list, jsn, that contains many string elements, some of which may be duplicates. I need to compare every element with the others for similarity and add the indices of duplicate items to a dubs list, so that I can remove those items from jsn.
Because jsn is large, I decided to use threading to speed up the second for loop and cut the waiting time.
But Thread/Process does not work as I expected.
The code below, which uses Thread, does nothing for performance, and the dubs list is still empty after the threads have been joined.
I also tried it without .join(), but I still got an empty dubs list and no change in performance.
The main problem: the dubs list is empty before I start deleting the duplicates.
from threading import Thread
from multiprocessing import Process
from difflib import SequenceMatcher

# Search for duplicates in the list
def finddubs(jsn, dubs, a):
    for b in range(len(jsn)):
        if (jsn[a] == jsn[b]) or (SequenceMatcher(None, jsn[a], jsn[b]).ratio() > 40):
            dubs.append(b)  # add the duplicate element's index to the dubs list

# Start threading
threads = []
for a in range(len(jsn)):
    t = Thread(target=finddubs, args=(jsn, dubs, a))
    threads.append(t)
    t.start()
for thr in threads:
    thr.join()

# Delete duplicate list items
for d in dubs:
    k = int(d)
    del jsn[k]
Without threading, the code works.
You need to use multiprocessing instead of threading if you want to speed up your computations. Please read about the GIL for detailed information on the topic.
An example of how multiprocessing can be used for this task:
A modification of your code that should work:
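As an illustration only, here is a minimal sketch of how the comparison could be spread over worker processes with multiprocessing.Pool. It is not the code from the answer: the helper names (later_duplicates, find_duplicate_indices, drop_indices), the sample data, and the 0.9 threshold are all assumptions. Two things it relies on are certain: SequenceMatcher.ratio() returns a float between 0 and 1, so a test like `> 40` can never be true, and each worker process gets its own copy of the arguments, so results have to be returned to the parent rather than appended to a shared list.

```python
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool


def later_duplicates(a, items, threshold):
    """Return the indices b > a whose string equals or resembles items[a]."""
    return [
        b
        for b in range(a + 1, len(items))  # skip a itself and earlier pairs
        if items[a] == items[b]
        or SequenceMatcher(None, items[a], items[b]).ratio() > threshold
    ]


def find_duplicate_indices(items, threshold=0.9, processes=None):
    """Check every index in a pool of processes; return the indices to drop."""
    # threshold=0.9 is an assumed value; tune it for your data.
    worker = partial(later_duplicates, items=items, threshold=threshold)
    with Pool(processes) as pool:
        per_item = pool.map(worker, range(len(items)))
    # Workers return their findings; the parent merges them into one set.
    return {b for found in per_item for b in found}


def drop_indices(items, indices):
    """Build a new list instead of deleting by index while iterating."""
    return [s for i, s in enumerate(items) if i not in indices]


if __name__ == "__main__":
    # Hypothetical sample data standing in for the real jsn list.
    jsn = ["apple pie", "apple pie", "banana", "apple pies"]
    dubs = find_duplicate_indices(jsn)
    jsn = drop_indices(jsn, dubs)
    print(jsn)
```

Only later indices are compared against each item, so the first occurrence of every group is kept, and building a new list avoids the index shifting that `del jsn[k]` causes inside a loop. The `if __name__ == "__main__":` guard is needed on platforms where multiprocessing starts workers by importing the main module.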