I want to keep a list of words (let's say 20k). The only purpose of it is to check if the given word exists there. So, it's going be something known as a hash set.
-- dict here is just an example, it's going to be loadaed from a text file
dict = {["foo"] = true, ["bar"] = true, ...}
function dict.contains(self, word) do
return self[word]
end
Is this option good enough? Or maybe this one would be better:
-- hash(string) is an exisiting external function
dict = {hash("foo") = true, hash("bar") = true, ... up to 20k}
function dict.contains(self, word) do
return self[hash(word)]
end
I am inclining to vote for the 2nd option but thinking of internal implementation of lua hashes makes me a little bit confused. If it is just an address of an interned string, probably the 1st option is quite good one (in terms of memory consumption and better performance)? But from the other side, it should be good for statically defined tables (with interned strings). While I am going to load keys, maybe the 2nd options is still better?
Hash sets in Lua
Yes, a table with string keys and truthy values (usually
true) is the idiomatic, and most likely most efficient, way to implement a hash set of strings (or numbers, or "pointers" (like functions or tables, which are compared by reference)) in Lua.Method - data name collisions
This however:
is a bad idea, because suddenly
dict:contains("contains")will be truthy (it will be thecontainsfunction).Similarly,
dict.wordwill suddenly be truthy as well, even though you probably want all accesses to go throughcontains.I'd recommend just directly doing something like
if words[word] thenrather than unnecessarily complicating this by forcing it into an OOP style when it is not needed or useful. If you do want this, encapsulate the internal "set" of words, so store it in a field likedict._wordsto not have the methods conflict with the actual data you're storing.Premature optimization concerns
It most probably is - try it and see. "20k" is a very small number of words for Lua to handle. I see no reason why the first solution shouldn't be fully satisfying.
In asymptotical terms, accessing this table - both to insert new strings, or to get strings - has expected constant time complexity.
Using the optimized hash table implementation provided by your Lua implementation is most probably a better idea than rolling your own.
So far you have provided no reason why it should be better, and barring exceptional circumstances, I don't think it will be. In many aspects, it's very blatantly worse:
Yes, on Lua 5.2 and earlier, all strings are interned and there is no way to turn this off. What this practically means is that different strings have different addresses, same strings reuse the same memory. String comparison is a simple address comparison and thus constant time. This is what enables table access to always be O(1) instead of O(length of string).
So indeed, on Lua 5.2 and earlier, the first option is most certainly better, and by a lot.
On Lua 5.3 and later, things get a bit more complicated: Very "long" strings (think thousands of characters) are not interned anymore, but instead are hashed on-demand, meaning comparison for these long strings is O(n), as is table access.
Now there are two "exceptional circumstances" where the custom hashing could be justified, but I consider it very unlikely that you find yourself in either one:
If anything, this is another plus point for using Lua's implementation.
No.
TL;DR: Don't optimize prematurely; you'd most likely end up with more complex, less efficient code. Simply use a table with string keys.