I want to create a corpus for a machine learning task. I have a small textual dataset and want to crawl similar sentences from the web. I used the sentence_transformers package with a pretrained BERT model, doc2vec, and spaCy similarity to measure similarity. I set the threshold to 85%, but the sentences with a similarity score above the threshold weren't really relevant. How can I crawl similar sentences from the web in Python?
How to crawl semantically similar sentences
155 Views Asked by Laure At
There is 1 best solution below
I think you should train a big model on a big corpus and then use that model to generate random sentences. The gensim library has several corpora that you can use to find similar sentences or to train a model that generates similar sentences; here is how to do it.
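As a rough illustration of the "find similar sentences in a corpus" step, the sketch below scores candidate sentences against a small seed set and keeps those above a threshold. It is not the gensim workflow referred to above: it uses a plain TF-IDF cosine similarity written with the standard library only, as a stand-in for doc2vec or sentence_transformers embeddings. The helper names (`filter_similar`, `tfidf_vector`, and so on), the example sentences, and the 0.3 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(sentences):
    """Compute a smoothed inverse document frequency for every token."""
    n = len(sentences)
    df = Counter()
    for sentence in sentences:
        df.update(set(tokenize(sentence)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf_vector(sentence, idf):
    """Return a sparse TF-IDF vector as a dict; unknown tokens are skipped."""
    counts = Counter(tokenize(sentence))
    return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_similar(seeds, candidates, threshold):
    """Keep candidates whose best similarity to any seed reaches the threshold."""
    idf = build_idf(seeds + candidates)
    seed_vecs = [tfidf_vector(s, idf) for s in seeds]
    kept = []
    for cand in candidates:
        vec = tfidf_vector(cand, idf)
        score = max(cosine(vec, sv) for sv in seed_vecs)
        if score >= threshold:
            kept.append((cand, score))
    # Highest-scoring candidates first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical seed dataset and crawled candidates, for illustration only.
seeds = [
    "The battery drains quickly on this phone.",
    "Screen brightness is too low outdoors.",
]
candidates = [
    "This phone battery drains very quickly.",
    "The stock market fell sharply today.",
    "The screen is too dim outdoors.",
]

results = filter_similar(seeds, candidates, threshold=0.3)
for sentence, score in results:
    print(f"{score:.2f}  {sentence}")
```

A lexical measure like this only rewards shared words, so it can serve as a cheap first-pass filter on crawled text before a more expensive embedding model is applied. The threshold is data-dependent; inspecting the ranked scores on a handful of known-relevant and known-irrelevant sentences is usually more informative than fixing a value such as 85% in advance.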