StormCrawler - Metadata fields not being persisted


I have a topology with a spout that emits tuples to the status stream; these are picked up by the StatusUpdaterBolt, which in turn writes the data to an Elasticsearch index.

The spout emits a tuple with a Metadata object containing certain metadata (e.g. crawler).

This is not being written to the status index.

The config looks something like this:


spouts:
  - id: "myspout"
    className: com.mycompany.MySpout
    parallelism: 8

bolts:
  - id: "status"
    className: com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt
    parallelism: 4

streams:
  - from: "myspout"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

The Metadata object is built like this:

Metadata metadata = new Metadata();
...
metadata.setValue("crawler", "mycrawl");

and is then emitted:

collector.emit(new Values(url, metadata));
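
For reference, a minimal spout along these lines might look as follows. Everything not shown in the question (class name, example URL, the Status value) is an assumption; the ES StatusUpdaterBolt reads the url, metadata and status fields from tuples on the status stream.

import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.persistence.Status;

// Hypothetical sketch of a spout feeding the status stream.
public class MySpout extends BaseRichSpout {

    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String url = "https://www.example.com/";   // placeholder URL
        Metadata metadata = new Metadata();
        metadata.setValue("crawler", "mycrawl");
        // Emit on the "status" stream so the StatusUpdaterBolt picks it up.
        collector.emit("status", new Values(url, metadata, Status.DISCOVERED));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Field names matter: the StatusUpdaterBolt looks them up by name.
        declarer.declareStream("status", new Fields("url", "metadata", "status"));
    }
}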

Why would the custom properties not get written to the status index?

Versions:

storm: 2.4.0
stormcrawler: 2.8

Accepted answer, by ndtreviv:

As per the documentation here: https://github.com/DigitalPebble/storm-crawler/wiki/MetadataTransfer

You need to specify which metadata keys you want transferred and/or persisted to the status index. If a key isn't listed, it won't be persisted.

In your example:

metadata.persist:
  - crawler

Note: if you were using parse filters to extract outlinks, you'd also need to include:

metadata.transfer:
  - crawler

if you want the key carried over to the new documents generated from those outlinks.
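
Putting the two together, a hypothetical excerpt of the crawler configuration would be (assuming the keys sit under the top-level config: element of the conf file loaded with the topology, e.g. crawler-conf.yaml):

config:
  metadata.persist:
    - crawler
  metadata.transfer:
    - crawler

Once crawler is listed under metadata.persist, the StatusUpdaterBolt will include it in the documents it writes to the status index.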