I have a corpus of newspaper articles in a .txt file, and I'm trying to split the sentences from it to a .csv in order to annotate each sentence.
I was told to use NLTK for this purpose, and I found the following code for sentence splitting:
import nltk
from nltk.tokenize import sent_tokenize
sent_tokenize("Here is my first sentence. And that's a second one.")
However, I'm wondering:
- How does one use a .txt file as an input for the tokenizer (so that I don't have to just copy and paste everything), and
- How does one output a .csv file instead of just printing the sentences in my terminal.
Reading a .txt file & tokenizing its sentences

Assuming the .txt file is located in the same folder as your Python script, you can read the .txt file and tokenize its sentences using NLTK as shown below:
Writing a list of sentence tokens to a .csv file

There are a number of options for writing a .csv file. Pick whichever is more convenient (e.g. if you already have pandas loaded, use the pandas option).
To write a .csv file using the pandas module:
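A sketch of the pandas option. The sample sentences list, the "sentence" column name, and the sentences.csv filename are illustrative choices, not requirements:

```python
import pandas as pd

# A list of sentence tokens, as produced by sent_tokenize (sample values)
sentences = ["Here is my first sentence.", "And that's a second one."]

# One sentence per row, under a "sentence" column
df = pd.DataFrame({"sentence": sentences})

# index=False keeps pandas' row numbers out of the file
df.to_csv("sentences.csv", index=False)
```

A one-column layout like this is convenient for annotation: each row is a sentence, and your annotations can go in additional columns later.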
To write a .csv file using the numpy module:
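A sketch of the numpy option (again with a sample sentences list and an assumed sentences.csv filename). Note that numpy.savetxt does not quote fields, so sentences containing commas are safer with the pandas or csv options:

```python
import numpy as np

# A list of sentence tokens, as produced by sent_tokenize (sample values)
sentences = ["Here is my first sentence.", "And that's a second one."]

# savetxt writes one item per line; fmt="%s" writes the strings as-is
np.savetxt("sentences.csv", sentences, fmt="%s")
```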
To write a .csv file using the csv module:
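A sketch of the standard-library csv option, which needs no third-party packages (the sentences list, header row, and sentences.csv filename are illustrative):

```python
import csv

# A list of sentence tokens, as produced by sent_tokenize (sample values)
sentences = ["Here is my first sentence.", "And that's a second one."]

# newline="" is the documented way to open a file for the csv module
with open("sentences.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence"])               # optional header row
    writer.writerows([s] for s in sentences)    # one sentence per row
```

The csv module handles quoting automatically, so sentences containing commas or quotation marks round-trip correctly.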