I am using openNLP to annotate words within sentences throughout a text. As a final result, I would like each word's ID to match its position within its sentence, with the numbering restarting from 1 each time a new sentence begins. Here is what I have so far:
library(NLP)      # as.String(), annotate()
library(openNLP)  # Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()
# create the string
string <- "Last morning, I went to the lake and sat. My dog is the cutest."
ex_string <- as.String(string)
# annotate sentences and words
init_s_w <- annotate(ex_string, list(Maxent_Sent_Token_Annotator(probs = TRUE),
                                     Maxent_Word_Token_Annotator(probs = TRUE)))
init_s_w
| id | type | start | end |
|---|---|---|---|
| 1 | sentence | 1 | 41 |
| 2 | sentence | 43 | 63 |
| 3 | word | 1 | 4 |
| 4 | word | 6 | 12 |
| 5 | word | 13 | 13 |
| 6 | word | 15 | 15 |
| 7 | word | 17 | 20 |
| 8 | word | 22 | 23 |
| 9 | word | 25 | 27 |
| 10 | word | 29 | 32 |
| 11 | word | 34 | 36 |
| 12 | word | 38 | 40 |
| 13 | word | 41 | 41 |
| 14 | word | 43 | 44 |
| 15 | word | 46 | 48 |
| 16 | word | 50 | 51 |
| 17 | word | 53 | 55 |
| 18 | word | 57 | 62 |
| 19 | word | 63 | 63 |
Here is what I want:
| id | type | start | end |
|---|---|---|---|
| 1 | sentence | 1 | 41 |
| 2 | sentence | 43 | 63 |
| 1 | word | 1 | 4 |
| 2 | word | 6 | 12 |
| 3 | word | 13 | 13 |
| 4 | word | 15 | 15 |
| 5 | word | 17 | 20 |
| 6 | word | 22 | 23 |
| 7 | word | 25 | 27 |
| 8 | word | 29 | 32 |
| 9 | word | 34 | 36 |
| 10 | word | 38 | 40 |
| 11 | word | 41 | 41 |
| 1 | word | 43 | 44 |
| 2 | word | 46 | 48 |
| 3 | word | 50 | 51 |
| 4 | word | 53 | 55 |
| 5 | word | 57 | 62 |
| 6 | word | 63 | 63 |
By manipulating your input table:
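One way this could look, as a minimal base R sketch. It assumes the annotation has been turned into a plain data frame (with the real object, something like `as.data.frame(init_s_w)` should give a comparable table); here the table from the question is hard-coded as a stand-in named `ann` so the block runs without openNLP. Each word is assigned to the sentence whose start position precedes it, and the counter then restarts within each sentence.

```r
# Stand-in for the annotation table shown in the question
# (assumption: with the real object you would start from as.data.frame(init_s_w))
ann <- data.frame(
  id    = 1:19,
  type  = c(rep("sentence", 2), rep("word", 17)),
  start = c(1, 43, 1, 6, 13, 15, 17, 22, 25, 29, 34, 38, 41, 43, 46, 50, 53, 57, 63),
  end   = c(41, 63, 4, 12, 13, 15, 20, 23, 27, 32, 36, 40, 41, 44, 48, 51, 55, 62, 63),
  stringsAsFactors = FALSE
)

sents   <- ann[ann$type == "sentence", ]
is_word <- ann$type == "word"

# For each word, index of the last sentence starting at or before the word
# (sentence starts must be sorted in increasing order)
sent_of_word <- findInterval(ann$start[is_word], sents$start)

# Restart the counter at 1 within each sentence
ann$id[is_word] <- ave(seq_along(sent_of_word), sent_of_word, FUN = seq_along)

ann
```

Sentence rows keep their original IDs; only the word rows are renumbered.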
By starting from the beginning and building a more comprehensive dataset:
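A possible sketch of this second route, again in base R. The span positions are hard-coded from the question's output as an assumption so the block runs on its own; with the real annotators they would be read from `init_s_w` instead. The column names `sentence_id`, `word_id` and `token` are illustrative choices, not anything openNLP produces. The result is one row per word, carrying its sentence, its position within that sentence, and the token text.

```r
string <- "Last morning, I went to the lake and sat. My dog is the cutest."

# Span positions copied from the question's annotation output
# (assumption: in practice, extract these from the sentence and word annotations)
sent_start <- c(1, 43)
word_start <- c(1, 6, 13, 15, 17, 22, 25, 29, 34, 38, 41, 43, 46, 50, 53, 57, 63)
word_end   <- c(4, 12, 13, 15, 20, 23, 27, 32, 36, 40, 41, 44, 48, 51, 55, 62, 63)

words <- data.frame(
  sentence_id = findInterval(word_start, sent_start),  # sentence each word falls in
  start       = word_start,
  end         = word_end,
  token       = substring(string, word_start, word_end),  # text covered by each span
  stringsAsFactors = FALSE
)

# Position of each word within its own sentence, restarting at 1
words$word_id <- ave(words$start, words$sentence_id, FUN = seq_along)

words[, c("sentence_id", "word_id", "token", "start", "end")]
```

Keeping `sentence_id` alongside `word_id` means the pair identifies each word uniquely, which the renumbered `id` column alone no longer does.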