Why do these two functions not return the same result?

78 Views Asked by At

I'm trying to update a function to improve performance. The old function takes ~32s and the new one takes ~350ms for the same call, but right now they are not returning the same result, and I'm not sure why.

Function parameters:

  • md5Hashes: array of strings
  • excludedTextId: string

What the function is supposed to do:

It is using the following mongoose schema:

const sentenceFeatureSchema = new mongoose.Schema({
    text_id: { type: String, required: true },
    md5_hash: { type: String, required: true },
    offset: { type: Number, required: true },
    length: { type: Number, required: true },
    num_words: { type: Number, required: true },
});

Given an array of hashes, query mongodb for all documents which match any of the md5 hashes in the given array. If excludedTextId is provided, do not include those documents which have the same textId. The hashes and textids are not exclusive, meaning that any document can have any combination of textid and hash (there can even be multiple documents which have both the same id and hash, but will differ in other values).

Note: It should only be querying for the data it needs, because there are a lot of documents in the system, so trying to query for more than we need and then using only what we need must be avoided.

Problem

My new implementation is not returning the same result as the original one. I believe the new implementation is correct and the old one is not, but I may be wrong. Either way, I'd like to understand exactly what and why they are returning different results.

Original function

const getMatchingHashSentences = async (md5Hashes, excludedTextId = null) => {
    try {
        const entries = [];

        for (const md5Hash of md5Hashes) {
            const queryConditions = { md5_hash: md5Hash };
            if (excludedTextId) {
                queryConditions.text_id = { $ne: excludedTextId };
            }
            const result = await SentenceFeature.find(queryConditions);
            entries.push(...result);
        }
        logger.debug(`entries length: ${entries.length}`);
        return entries;
    } catch (error) {
        throw error;
    }
};

New function

I tried to make it into a single db request.

const getMatchingHashSentences = async (md5Hashes, excludedTextId = null) => {
    try {
        const query = SentenceFeature.find({ md5_hash: { $in: md5Hashes } });

        if (excludedTextId) {
            query.where("text_id").ne(excludedTextId);
        }

        const results = await query.exec();
        return results;
    } catch (error) {
        throw error;
    }
};

EDIT

As was requested, here is demo code that will create 10 elements and use both functions to query for them:

demo directory: mkdir demo && cd demo

barebones npm project: npm init -y && npm i mongoose mongodb

create demo.js:

const mongoose = require("mongoose");

// Define schema

const sentenceFeatureSchema = new mongoose.Schema({
    text_id: { type: String, required: true },
    md5_hash: { type: String, required: true },
    offset: { type: Number, required: true },
    length: { type: Number, required: true },
    num_words: { type: Number, required: true },
});

// Define model
const SentenceFeature = mongoose.model("features", sentenceFeatureSchema);

const getMatchingHashSentences = async (md5Hashes, excludedTextId = null) => {
    try {
        const entries = [];

        for (const md5Hash of md5Hashes) {
            const queryConditions = { md5_hash: md5Hash };
            if (excludedTextId) {
                queryConditions.text_id = { $ne: excludedTextId };
            }
            const result = await SentenceFeature.find(queryConditions);
            entries.push(...result);
        }
        return entries;
    } catch (error) {
        throw error;
    }
};

const getMatchingHashSentences2 = async (md5Hashes, excludedTextId = null) => {
    try {
        const query = SentenceFeature.find({ md5_hash: { $in: md5Hashes } });

        if (excludedTextId) {
            query.where("text_id").ne(excludedTextId);
        }

        const results = await query.exec();
        return results;
    } catch (error) {
        throw error;
    }
};

// Connect to MongoDB
mongoose
    .connect("mongodb://localhost:27017/mydatabase", { useNewUrlParser: true, useUnifiedTopology: true })
    .then(() => {
        console.log("Connected to MongoDB");

        const feature = new Feature({
            text_id: "1",
            md5_hash: "h1",
            offset: 0,
            length: 0,
            num_words: 0,
        });

        feature
            .save()
            .then(() => {
                console.log("Entry saved successfully");
            })
            .catch((err) => {
                console.error("Error saving entry:", err);
            });
    })
    .catch((err) => {
        console.error("Error connecting to MongoDB:", err);
    });

async function main() {
    try {
        // Connect to MongoDB
        await mongoose.connect("mongodb://localhost:27017/mydatabase", {
            useNewUrlParser: true,
            useUnifiedTopology: true,
        });
        console.log("Connected to MongoDB");

        // Save 10 features

        for (let i = 1; i <= 10; i++) {
            const feature = new SentenceFeature({
                text_id: `${i}`,
                md5_hash: `h${i}`,
                offset: 0,
                length: 0,
                num_words: 0,
            });
            await feature.save();
            console.log(`Saved Feature ${i}`);
        }

        // Query for saved features
        const oldres = await getMatchingHashSentences(["h1", "h3", "h6"], "3");
        console.log("old:", oldres);

        const newres = await getMatchingHashSentences2(["h1", "h3", "h6"], "3");
        console.log("new:", newres);
    } catch (error) {
        console.error("Error:", error);
    } finally {
        // Close connection
        await mongoose.disconnect();
        console.log("Disconnected from MongoDB");
    }
}

main();

// OUTPUT:

// Saved Feature 1
// Saved Feature 2
// Saved Feature 3
// Saved Feature 4
// Saved Feature 5
// Saved Feature 6
// Saved Feature 7
// Saved Feature 8
// Saved Feature 9
// Saved Feature 10
// old: [
//   {
//     _id: new ObjectId('6605aa98790f41e69f3f966b'),
//     text_id: '1',
//     md5_hash: 'h1',
//     offset: 0,
//     length: 0,
//     num_words: 0,
//     __v: 0
//   },
//   {
//     _id: new ObjectId('6605aa98790f41e69f3f9676'),
//     text_id: '6',
//     md5_hash: 'h6',
//     offset: 0,
//     length: 0,
//     num_words: 0,
//     __v: 0
//   }
// ]
// new: [
//   {
//     _id: new ObjectId('6605aa98790f41e69f3f966b'),
//     text_id: '1',
//     md5_hash: 'h1',
//     offset: 0,
//     length: 0,
//     num_words: 0,
//     __v: 0
//   },
//   {
//     _id: new ObjectId('6605aa98790f41e69f3f9676'),
//     text_id: '6',
//     md5_hash: 'h6',
//     offset: 0,
//     length: 0,
//     num_words: 0,
//     __v: 0
//   }
// ]
// Disconnected from MongoDB

start up one-time docker container for mogodb:
docker run --rm -d -p 27017:27017 --name my-mongodb mongo

finally, run the demo: node demo.js

You will notice that the output of the 2 functions is the same for such a small example, which is normal. There must be some strange edge case that results in the difference of output when running the queries on larger amounts of data.

1

There are 1 best solutions below

1
no_rasora On BEST ANSWER

I ended up doing a bunch of manual testing, and in the end the findings are as follows:

  • The original function does not work as it should based on the requirement, because it sometimes produces duplicate results. (I did not investigate exactly in which cases, but it has to do with overlap between the different calls, because it would never occur when using a one element hash array).

  • The new function is much faster and works correctly. Keep in mind that there are multiple ways of implementing it, like constructing the whole query condition first and applying it in the end, or like I did it, creating a query object first and conditionally modifying it later (further reading).