How to extract more than one labeled text item in a single annotation using Google NLP

Question

Asked by veilupearl on 04 May 2020 at 17:02

I created a dataset using Google NLP entity extraction and uploaded my input data (train, test, and validation JSONL files) in the required NLP format to a Google Cloud Storage bucket.

Sample Annotation:

{
    "annotations": [{
        "text_extraction": {
            "text_segment": {
                "end_offset": 10,
                "start_offset": 0
            }
        },
        "display_name": "Name"
    }],
    "text_snippet": {
        "content": "JJ's Pizza\n "
    }
}
{
    "annotations": [{
        "text_extraction": {
            "text_segment": {
                "end_offset": 13,
                "start_offset": 0
            }
        },
        "display_name": "City"
    }],
    "text_snippet": {
        "content": "San Francisco\n "
    }
}

Here is the input text for which I want to predict the labels "Name", "City" and "State":

Best JJ's Pizza in San Francisco, CA

Result: [screenshot of the prediction output]

I expect the predicted results to be as follows:

Name: JJ's Pizza
City: San Francisco
State: CA

**Jofre** · Answer 1 · 2020-08-12T15:05:55.073000

According to the sample annotation you provided, you're setting the whole text_snippet to be a name (or whatever field you want to extract).

This can mislead the model into treating the entire text as that entity.
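A quick way to check a dataset for this pattern is sketched below. The helper name `spans_whole_snippet` is hypothetical, and the sketch assumes that `end_offset` is exclusive, which is consistent with the sample annotation in the question ("JJ's Pizza" annotated with offsets 0 and 10).

```python
def spans_whole_snippet(example: dict) -> bool:
    """Return True if any annotation covers the entire (stripped) snippet text."""
    content = example["text_snippet"]["content"]
    for annotation in example["annotations"]:
        segment = annotation["text_extraction"]["text_segment"]
        start = int(segment["start_offset"])
        end = int(segment["end_offset"])  # assumed exclusive
        # Compare the annotated span with the whole snippet, ignoring
        # leading/trailing whitespace such as the final "\n ".
        if content[start:end].strip() == content.strip():
            return True
    return False


# The first sample from the question: the annotation is the whole snippet.
question_sample = {
    "annotations": [{
        "text_extraction": {"text_segment": {"start_offset": 0, "end_offset": 10}},
        "display_name": "Name",
    }],
    "text_snippet": {"content": "JJ's Pizza\n "},
}

# A snippet where the entity is only part of a longer sentence.
in_context_sample = {
    "annotations": [{
        "text_extraction": {"text_segment": {"start_offset": 6, "end_offset": 16}},
        "display_name": "Name",
    }],
    "text_snippet": {"content": "Worst JJ's Pizza in San Francisco\n "},
}

print(spans_whole_snippet(question_sample))
print(spans_whole_snippet(in_context_sample))
```

If most training examples are flagged by a check like this, the model has little context to learn from.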

It would be better to use training data similar to the examples in the documentation, where a larger chunk of text is provided and the entities to extract are annotated within it.

As an example, let's say that in these text snippets I tell the model that the italic part is an entity named a, while the bold part is an entity named b:

*JJ Pizza*
*LL Burritos*
*Kebab MM*
*Shushi NN*
**San Francisco**
**NY**
**Washington**
**Los Angeles**

Then, when the model reads "Best JJ Pizza", it treats the whole text as a single entity (we trained it under that assumption) and simply picks the label that matches best (in this case, it would likely say it's an a entity).

However, if I provide the following text samples (annotated the same way: italic is entity a and bold is entity b):

The best pizza place in **San Francisco** is *JJ Pizza*.
For a luxurious experience, do not forget to visit *LL Burritos* when you're around **NY**.
I once visited *Kebab MM*, but there are better options in **Washington**.
You can find *Shushi NN* in **Los Angeles**.

You can see how you're training the model to find the entities within a piece of text, and it will try to extract them according to the context.

The important part about training the model is providing training data as similar to real-life data as possible.

In the example you provided, if the data in your real-life scenario is going to be in the format <ADJECTIVE> <NAME> <CITY>, then your training data should have that same format:

{
    "annotations": [{
        "text_extraction": {
            "text_segment": {
                "end_offset": 16,
                "start_offset": 6
            }
        },
        "display_name": "Name"
    },
    {
        "text_extraction": {
            "text_segment": {
                "end_offset": 33,
                "start_offset": 20
            }
        },
        "display_name": "City"
    }],
    "text_snippet": {
        "content": "Worst JJ's Pizza in San Francisco\n "
    }
}
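Counting character offsets by hand is error-prone, so a training line like the one above can also be generated. The following is a minimal sketch, not an official tool: `make_example` is a hypothetical helper, it assumes `end_offset` is exclusive (as in the "JJ's Pizza" sample from the question), and it only annotates the first occurrence of each entity string.

```python
import json
from typing import Dict


def make_example(content: str, entities: Dict[str, str]) -> dict:
    """Build one annotated example, computing offsets from the entity text."""
    annotations = []
    for label, surface in entities.items():
        start = content.find(surface)  # first occurrence only
        if start < 0:
            raise ValueError(f"{surface!r} not found in content")
        annotations.append({
            "text_extraction": {
                "text_segment": {
                    "end_offset": start + len(surface),  # assumed exclusive
                    "start_offset": start,
                }
            },
            "display_name": label,
        })
    return {"annotations": annotations, "text_snippet": {"content": content}}


example = make_example(
    "Worst JJ's Pizza in San Francisco\n ",
    {"Name": "JJ's Pizza", "City": "San Francisco"},
)

# One JSON object per line, as expected in a JSONL file.
print(json.dumps(example))
```

Slicing `content[start_offset:end_offset]` for each annotation is a cheap sanity check that the span really contains the intended entity.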

Note that the point of a natural language ML model is to process natural language. If your inputs are always going to be as uniform, simple, and short as that, it might not be worth going the ML route; a simple regex should be enough. Without the natural language part, it is going to be hard to train a model properly. More details are in the beginner's guide.
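As a rough illustration of the regex route, the sketch below assumes every input has the fixed shape `<ADJECTIVE> <NAME> in <CITY>[, <STATE>]`, with a one-word adjective and a two-letter uppercase state code. The pattern and the function name `extract_fields` are illustrative assumptions, not part of any Google API.

```python
import re
from typing import Dict, Optional

# Assumed format: "<ADJECTIVE> <NAME> in <CITY>[, <STATE>]"
PATTERN = re.compile(
    r"^(?P<adjective>\w+)\s+"          # one-word adjective, e.g. "Best"
    r"(?P<name>.+?)\s+in\s+"           # name: everything up to the first " in "
    r"(?P<city>[^,]+?)"                # city: everything up to a comma or the end
    r"(?:,\s*(?P<state>[A-Z]{2}))?"    # optional two-letter state code
    r"\s*$"
)


def extract_fields(text: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the Name/City/State fields, or None if the text does not match."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    return {
        "Name": match.group("name"),
        "City": match.group("city"),
        "State": match.group("state"),  # None when no state is present
    }


print(extract_fields("Best JJ's Pizza in San Francisco, CA"))
```

A pattern like this breaks as soon as the inputs stop following the fixed template, which is exactly the situation where annotated, in-context training data becomes worthwhile.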
