I created a dataset for Google NLP entity extraction and uploaded the input data (train, test, and validation JSONL files) in the required NLP format to a Google Cloud Storage bucket.
Sample Annotation:
{
"annotations": [{
"text_extraction": {
"text_segment": {
"end_offset": 10,
"start_offset": 0
}
},
"display_name": "Name"
}],
"text_snippet": {
"content": "JJ's Pizza\n "
}
}
{
"annotations": [{
"text_extraction": {
"text_segment": {
"end_offset": 9,
"start_offset": 0
}
},
"display_name": "City"
}],
"text_snippet": {
"content": "San Francisco\n "
}
}
Here is the input text for which I want to predict the labels "Name", "City" and "State":
Best J J's Pizza in San Francisco, CA
The result is shown in the following screenshot.
I expected the predicted results to be the following:
Name: JJ's Pizza
City: San Francisco
State: CA

According to the sample annotation you provided, you're setting the whole text_snippet to be a name (or whatever field you want to extract). This can lead the model to assume that the entire text is that entity.
It would be better to have training data similar to the samples in the documentation, where there is a large chunk of text and the entities to be extracted are annotated within it.
As an example, let's say that from these text snippets I tell the model that the italic part is an entity named a, while the bold part is an entity called b:
Then, when the model reads Best JJ Pizza, it thinks the whole text is a single entity (we trained the model with this assumption), and it will just choose the one that matches best (in this case, it would likely say it's an a entity).
However, if I provide the following text sample (also annotated so that italic is entity a and bold is entity b):
You can see how you're training the model to find the entities within a piece of text, and it will try to extract them according to the context.
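To make the idea of annotating entities inside a longer text more concrete, here is a minimal Python sketch that builds one JSONL line in the same shape as the sample annotation from the question. The helper name build_example is hypothetical, and the end offset is assumed to be start plus length, which matches the first sample (a 10-character name with end_offset 10); check this against the documentation before relying on it.

```python
import json

def build_example(text, entities):
    """Return one JSONL line annotating each (label, substring) pair found in text.

    Hypothetical helper: offsets are computed from the first occurrence of each
    substring, and end_offset is assumed to be start_offset + len(substring).
    """
    annotations = []
    for label, substring in entities:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        annotations.append({
            "text_extraction": {
                "text_segment": {
                    "start_offset": start,
                    "end_offset": start + len(substring),
                }
            },
            "display_name": label,
        })
    return json.dumps({"annotations": annotations, "text_snippet": {"content": text}})

# The whole sentence is the snippet; each entity is only a part of it.
line = build_example(
    "Best JJ's Pizza in San Francisco, CA",
    [("Name", "JJ's Pizza"), ("City", "San Francisco"), ("State", "CA")],
)
print(line)
```

Each training file would then contain one such line per example, with the entities marked inside realistic surrounding text rather than standing alone.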
The important part about training the model is providing training data as similar to real-life data as possible.
In the example you provided, if the data in your real-life scenario is going to be in the format <ADJECTIVE> <NAME> <CITY>, then your training data should have that same format:
Note that the point of a Natural Language ML model is to process natural language. If your inputs are going to be as uniform, simple, and short as that, then it might not be worth going the ML route; a simple regex should be enough. Without the natural language part, it is going to be hard to properly train a model. More details are in the beginners guide.
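As a rough illustration of the regex route, the sketch below handles inputs shaped like the one in the question. The pattern is an assumption based only on that single sample: one leading adjective word, the name, the literal word "in", the city, a comma, and a two-letter uppercase state. The function name extract is hypothetical, and real inputs would probably need a more forgiving pattern.

```python
import re

# Assumed input shape: "<ADJECTIVE> <NAME> in <CITY>, <STATE>"
PATTERN = re.compile(
    r"^(?P<adjective>\w+)\s+"   # single leading word, e.g. "Best"
    r"(?P<name>.+?)\s+in\s+"    # shortest text up to the word "in"
    r"(?P<city>[^,]+),\s*"      # everything up to the comma
    r"(?P<state>[A-Z]{2})$"     # two-letter state code
)

def extract(text):
    """Return a dict with Name, City and State, or None if the text does not match."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    return {
        "Name": match.group("name"),
        "City": match.group("city"),
        "State": match.group("state"),
    }

print(extract("Best JJ's Pizza in San Francisco, CA"))
```

If the inputs stop following this fixed shape, the pattern will simply return None, which is usually the point at which a trained model starts to be worth the effort.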