Is there a method to control the number of spouts in apache storm?

46 Views Asked by At

On declaration of a topology in Apache Storm, is there a way to control how many instances per machine are used?

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("myspout", new MyDoSomethingOnTheHost(), 5);  

In the examples and the API documentation, the limit seems to be only a per topology count, but for my desired case, I would like to assure that the spouts do a task on each host and each host exactly.

As far as I understand it, even in the case of 5 machines in the above example, there seems to be no way to control how many spouts are launched per machine and in the worst case, all 5 spouts would be executed on one host.

2

There are 2 best solutions below

0
Vitos On BEST ANSWER

The straight answer to the topic - is "No". But here are some workarounds:

  1. The number of spouts is set when you start topology. There is no way to change it programmatically after the start. You can stop topology by "kill" and restart it again with a new config. This can be done with the cmd line storm kill topology-name [-w wait-time-secs] and controlled outside your topologies.

  2. You can start more spouts when needed at the start and use the deactivate() and activate() methods of ISpout interface. This can be controlled programmatically by topology itself. The Storm does not use deactivated spouts and doesn't send nextTuple() requests to them, spouts remain in memory and can be activated instantly. This works great for the cold start of a large topology with a hundred workers.

To my best experience, spouts aren't bottlenecks, it's always bolts, especially in large and indirect topologies. Someone suddenly died without handling the exception, someone is working slowly because the source is not responding, and you see a performance graph in the form of a hairbrush with an interval between teeth equal to topology.message.timeout.secs.

0
moosehead42 On

To my best knowledge, Apache Storm uses a round-robin placement scheme natively. So if you have only these 5 spouts and 5 hosts it should work as desired. However, as you will probably have more bolts, things become more complicated as it is unclear how Storm internally builds its list for distributing bolts and spouts. A naive first thing you can do is try to see how it distributes the operators.

Moreover, if you really want to ensure your desired mapping, Storm offers (in contrast to Flink) to implement its own scheduler by implementing the IScheduler interface. Here you can build your custom logics on how to distribute operators.