I have a node fully dedicated to my StormCrawler-based crawler. At my disposal are 20 dual-core CPUs, 130 GB of RAM, and a 10 Gb/s Ethernet connection.
I reduced my topology to: CollapsingSpout -> URLPartitionerBolt -> FetcherBolt. The spout reads from an Elasticsearch index with ~50 million records. Elasticsearch is configured with 30 GB of RAM and 2 shards.
I use a single worker with roughly 50 GB of RAM dedicated to the JVM. Playing with the various settings (mainly the total number of fetch threads, the number of threads per queue, max spout pending, and Elasticsearch-related ones such as the number of buckets and the bucket size), I can reach an overall fetching speed of 100 MB/s. However, according to the Ganglia reports, that corresponds to only about 10% of my available bandwidth. CPU usage sits at about 20%, and RAM is not an issue.
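For reference, these are the kinds of settings I have been adjusting. The snippet below is an illustrative sketch rather than my exact configuration; the parameter names follow the StormCrawler/Elasticsearch module conventions, but the values are placeholders:

```yaml
# Storm topology settings
topology.workers: 1
topology.max.spout.pending: 100        # placeholder value

# FetcherBolt settings
fetcher.threads.number: 200            # total fetch threads
fetcher.threads.per.queue: 1           # fetch threads per host queue

# Elasticsearch spout settings (CollapsingSpout)
es.status.max.buckets: 50              # buckets returned per query
es.status.max.urls.per.bucket: 10      # URLs taken from each bucket
```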
I’m seeking hints on what my bottleneck could be, and advice on how to tune/adjust the crawler to make full use of the resources available to me.
Thanks in advance.