I was curious about the performance benefits on dictionary lookups when using Python's sys.intern. As a toy example to measure that, I implemented the following code snippets:
Without interning:
import random
from uuid import uuid4
keys = [str(uuid4()) for _ in range(1_000_000)]
values = [random.random() for _ in range(1_000_000)]
my_dict = dict(zip(keys, values))
keys_sample = random.choices(keys, k=50_000)
def get_values(d, ks):
return [d[k] for k in ks]
Test results using ipython:
%timeit get_values(my_dict, keys_sample)
8.92 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
With interning:
import sys
import random
from uuid import uuid4
keys = [sys.intern(str(uuid4())) for _ in range(1_000_000)]
values = [random.random() for _ in range(1_000_000)]
my_dict = dict(zip(keys, values))
keys_sample = random.choices(keys, k=50_000)
def get_values(d, ks):
return [d[k] for k in ks]
Test results:
%timeit get_values(my_dict, keys_sample)
8.83 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So, no meaningful difference between both cases. I tried pumping up the dict size and sampling, but the results stayed on par. Am I using sys.intern incorrectly? Or is the testing flawed?
I assume you're referring to this section of the Python manual:
(Source.)
In this benchmark, you are running into two optimizations of Python's string handling which make it impossible to gain any performance from intern().
random.choice()on the list of strings, the sampled list is a list of references to the same strings. When Python compares two references to the same string, no string comparison is done - two references pointing to the same string must be equal. In other words, ifa is b, thena == b.a is not b, Python does one other check to avoid a string compare. It checks the hash of a and b, and ifhash(a) != hash(b), then the strings cannot be equal. Also, hashes are cached within the string object to avoid repeatedly computing them.So there are two cases during this dictionary lookup. Either it is comparing two of the same strings, in which case the reference equality shortcut prevents a string comparison, or it is comparing two different strings, in which case the hash comparison shortcut has a high probability of preventing a string comparison.
This gives us a path to create a (somewhat contrived) benchmark where
sys.intern()does help. Imagine you have a server which receives strings from a socket and compares them to a hard-coded dictionary. In this case, the strings from the socket are not going to have reference equality with the strings from the dictionary.To simulate this case, I have created a function which copies each of the strings in keys_sample, so that Python cannot apply optimization #1. It also prefixes each string with a thousand a's, to make string comparisons as expensive as possible.
Results: