Preface

Apache Storm has been a core part of our real-time data processing infrastructure, powering critical streaming applications that require low-latency and high-throughput processing. As our workloads grew in complexity and volume, maintaining the stability of our Storm cluster became increasingly challenging. We started encountering frequent Nimbus failures, topology performance issues, and inconsistencies in worker performance, which directly impacted our ability to meet operational demands. In this blog, I’ll walk through the issues we faced, the investigations we conducted, and the staged optimizations that ultimately stabilized our cluster.

Problem Statement

Over the past few months, we observed a significant decline in the stability of our Apache Storm cluster. The following issues became recurring challenges:

  • Sporadic Nimbus failures: Critical for cluster coordination, these failures disrupted our ability to manage topologies effectively.
  • Unresponsive supervisors: Supervisor nodes occasionally failed to respond, creating bottlenecks and affecting the entire cluster.
  • Frequent blacklisting of supervisors: This limited our capacity to distribute load evenly across the cluster.
  • Worker assignment issues post-topology restart: After restarting, certain workers could not be reassigned, affecting topology availability.
  • Uneven load distribution across supervisors: This imbalance led to resource constraints and inconsistent processing speeds.
  • Inconsistent topology performance: Variability in topology performance added unpredictability to data processing timelines.

Initial Symptoms and Problems

Timeline

We manage three Apache Storm clusters across different regions: one in the U.S., where the load is highest, and two in India and the EU, which generally handle lower traffic volumes. The issues described earlier predominantly occur in the U.S. cluster, while the India and EU clusters experience these problems much less frequently. This observation led us to suspect that traffic volume plays a significant role in triggering these issues, as the configuration across all clusters is identical.

Typically, after a cluster restart, services run smoothly for about a week before the problems begin to reappear, indicating a potential link between load accumulation over time and performance degradation.

Symptoms

The primary observations we have witnessed are:

  • Topologies starts getting stuck - when this happens, the spout stops dequeueing the records from the REDIS. Here the source is REDIS from which the vents are picked for processing.
  • Worker nodes are getting blacklisted. When this happens, the load shifts from one worker node to another worker node there by increasing the load on a specific node where CPU goes up to 99%.
  • The primary nimbus fails and the failover to the secondary nimbus didn’t happen.

Logs & Errors

Here are some sample logs for added clarity.

When worker nodes are blacklisted, they appear in the Storm UI:

See attached image (1).png

Simultaneously, the Nimbus logs capture the blacklisting event for a worker node, as shown below:

o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [1,2,6] are blacklisted 

null

Impact

When a topology gets stuck, requests are not processed from Redis, resulting in delays in handling Journey events. Delays in assigning workers have a similar effect, leaving requests undequeued in Redis and slowing down event processing.

When a worker is blacklisted, all topologies running on that node are shifted to other nodes. This shift creates an uneven load distribution across the cluster, increasing the risk of cluster instability and potential failure.

If the primary Nimbus fails and there is no failover, the entire Storm cluster stops, leading to a complete outage of the Journey processing engine.

Phase 1: Early Investigation

Hypotheses

At the start of our investigation, we identified several possible causes for the cluster instability:

  • We suspected that intermittent failures could stem from network connectivity issues between Nimbus and the Supervisor nodes.
  • The symptoms of unresponsive workers and slow task execution pointed to potential resource contention on Supervisor nodes, such as memory or CPU exhaustion.
  • Misconfigurations in Zookeeper, like session timeouts or delays in leader election, also seemed likely contributors to the overall instability.

These early hypotheses guided our in-depth examination of the system’s logs and metrics.

Tools Used

To effectively diagnose the root causes of our Storm cluster’s instability, we relied on a suite of monitoring and log analysis tools.

The Storm UI gave us insights into task performance and the overall health of each topology. Zookeeper logs were invaluable for identifying issues related to leader election and session timeouts. We tracked real-time metrics—such as event dequeuing speed, slow processing times, and specific time windows when no processing occurred—through Grafana, our APM dashboard. AWS Console EC2 metrics allowed us to monitor CPU usage and worker uptime across nodes.

In addition, custom alerts on Slack and Email notified us of Redis queue build-ups, topology stalls, and Nimbus responsiveness issues. Together, these tools helped us locate the sources of failure and guided our troubleshooting process.

Actions Taken

Our initial actions targeted quick fixes to mitigate the most pressing symptoms. We began by restarting Nimbus and Supervisor nodes, which temporarily restored functionality but didn’t resolve the root issues. To gain further insights, we reviewed logs from our custom monitoring scripts, which provided a more detailed view of errors and alarm messages.

We also issued a Standard Operating Procedure (SOP) for our SRE team to perform weekly maintenance restarts of the Apache Storm cluster in the U.S. IDC. While these measures helped maintain some stability, the recurring failures highlighted the need for a deeper, more systematic investigation.

Results

Our initial actions yielded only partial and temporary relief. While restarting Nimbus and Supervisors restored the cluster’s functionality briefly, the same failures reappeared within days. This outcome made it clear that, although we had addressed surface-level symptoms, a deeper analysis of system resources and configurations would be essential to uncover a lasting solution.

Phase 2: Root Cause Analysis

Deep Dive into the Logs

As we analyzed logs from Nimbus, Supervisors, and Zookeeper, several patterns began to stand out.

Zookeeper logs showed frequent session expirations, though no signs of leader election failures were detected. This pattern hinted at potential communication issues between Storm cluster nodes and Zookeeper, though conclusive evidence was lacking.

Nimbus logs, however, frequently displayed NimbusLeaderNotFoundException errors, indicating possible delays in leader election as a root cause. Additionally, we observed significant lag in supervisor-heartbeat messages, suggesting that resource exhaustion on Supervisor nodes could be contributing to these delays.

This deeper log analysis provided a clearer picture of the interconnected issues, helping us narrow our focus to configuration and resource-related challenges.

Configuration Review

Blacklisting of worker nodes

In our configuration review, we closely examined critical settings within the storm.yaml and Zookeeper configuration files to identify potential misconfigurations that could be causing cluster instability. Our primary focus was on parameters like Nimbus and Supervisor timeouts, heartbeat intervals, and retry mechanisms.

To address the persistent issue of worker nodes being “blacklisted,” we specifically adjusted the blacklist.scheduler settings. By fine-tuning how the system handled faulty nodes, we aimed to prevent unnecessary blacklisting and improve stability.

We discovered that the default blacklist.scheduler settings were uniformly applied across all IDCs, suggesting that further tuning might be necessary based on load and regional traffic differences.

Standard Production Environment (Balanced Approach):

   * blacklist.scheduler.assume.supervisor.bad.based.on.bad.slot: true

   * blacklist.scheduler.resume.time.secs: 1800

   * blacklist.scheduler.tolerance.count: 3

   * blacklist.scheduler.tolerance.time.secs: 300

Topologies getting stuck

Reason-1

We got the Topology Stuck alert email. We could see the error below in supervisor node.

2024-08-20 19:32:01.643 c.e.a.a.s.s.i.SimpleLogger apm-reporter [INFO] 2024-08-20 19:32:01.643 [apm-reporter] WARN co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - {
 "accepted": 0,
 "errors": [
   {
     "message": "queue is full"
   }
 ]
}

The error message "queue is full" in our logs pertains to the Elastic APM (Application Performance Monitoring) agent running within our Apache Storm environment. This error signifies that the internal queue of the APM agent, responsible for reporting metrics and traces to the Elastic APM server, has reached its maximum capacity and can no longer accept additional data.

Here's a breakdown of what likely occurred:

  • Queue Saturation: The APM agent collects performance data from the Storm application and queues it for transmission to the Elastic APM server. If the APM server is unavailable or slow to respond—due to failure or overload—the queue within the agent can become full.
  • Dropped Data: Once the queue is full, the agent begins to drop incoming performance data. This is reflected in the logs, where "accepted": 0 indicates that no new data was accepted into the queue.
  • Impact on Topologies: A full queue could have added load or caused blockage in our application, leading to scenarios where Apache Storm topologies became stuck and stopped processing traffic.

Restarting the Elastic APM service likely resolved the issue by enabling the APM agent to resume data transmission, clearing the internal queue and allowing the topologies to operate normally again.

To prevent this issue from recurring in the future, we plan to:

  • Monitor the Elastic APM server to ensure it is consistently operational and capable of handling the load.
  • Investigate the root cause of the APM service failure and implement measures to prevent future occurrences.
  • Enhance our code to determine why non-critical functions are causing topologies to become stuck.

Reason 2

We have a custom monitoring script that runs every 9th, 29th, and 49th minute to check for the first event in the Redis queue. It then waits 30 seconds before checking again to ensure the event hasn't changed. If the events are the same, the script concludes that the topology is stuck and attempts to restart it.

While this script has been effective in the past, we’ve realized that the logic behind identifying a stuck topology may be flawed. For example, in the monitoring log below, the script waited 30 seconds and decided to restart the topology cg92_B:

time="2024-08-31T10:09:03+05:30" level=info msg="lrange data for cg92_B is [\"cg9.p1c131909.lambdas/auto185__1673413553949\",131909,0,1356312,28913,\"{\\\"reqtime\\\":1725078660000,..."
time="2024-08-31T10:09:03+05:30" level=info msg="Wait for 30 seconds topologyId: cg92_B-30641-1725062453"
time="2024-08-31T10:09:33+05:30" level=info msg="Topology cg92_B seems to be stuck"

When the topology was restarted, we saw the following log in the Nimbus server:

2024-08-31 10:10:36.527 o.a.s.d.n.Nimbus timer [INFO] Killing topology: cg92_B-30641-1725062453

It seems that our monitoring script is a bit aggressive in determining when a topology is stuck. The lack of dequeueing could be due to BOLTs not acknowledging the Spout, which may have caused the Spout buffer to fill up, resulting in no new events being dequeued from Redis. Despite this, the topology was functioning properly, with all BOLTs busy as indicated in the supervisor logs.

We identified instances where there were brief pauses between dequeued messages. During these times, the monitoring script wasn't running, so the topology continued processing normally once the Spout received ACKs from the BOLTs. For instance, there was a 35,195 ms gap without any dequeued records, but dequeueing resumed afterward:

Previous Line: 2024-08-31 10:04:45.217 ... dequeued: [...]
Current Line: 2024-08-31 10:05:20.412 ... dequeued: [...]

Given these findings, we decided to evaluate the impact of increasing the time gap between consecutive checks from 30 seconds to 5 minutes. This adjustment could reduce the frequency of false positives and prevent unnecessary topology restarts, allowing for more natural processing rhythms within the application.

Zombie Worker Processes

Even though the topology monitoring script is designed to automatically kill and redeploy topologies, we often encounter delays in worker assignment. In many cases, the SRE team has to manually intervene to restart the Storm cluster for full service restoration. Several factors could contribute to these issues:

  1. Insufficient Worker Slots: If all available worker slots on the supervisors are occupied by other topologies, new assignments cannot be made.
  2. Resource Allocation Issues: The topology may request more resources than what is available, such as an excessive number of workers or memory/CPU requirements.
  3. Nimbus Leader Issues: If the Nimbus server is not functioning properly or loses connection with Zookeeper, it will be unable to assign workers to the topology.
  4. Blacklist Scheduler Settings: Recent changes to the storm.yml configuration, particularly settings like blacklist.scheduler.assume.supervisor.bad.based.on.bad.slot, may result in worker assignment failures due to certain supervisors being blacklisted.
  5. Topology Deployment Error: Problems during the deployment or startup of the topology can hinder worker assignments.
  6. Timeouts in Supervisor or Nimbus: Network issues or timeouts between Nimbus and the supervisors may also disrupt worker assignments.

To investigate unhealthy Storm supervisor slots, we closely monitored the supervisor.log and worker logs for errors like memory issues, port binding failures, or worker disconnections. It’s crucial to check system resources and running worker processes. For instance, we encountered the following error message (detailed logs are available in the Appendix section):

2024-08-30 23:54:02.236 o.a.s.t.ProcessFunction pool-11-thread-3 [ERROR] Internal error processing sendSupervisorWorkerHeartbeat

org.apache.storm.utils.WrappedNotAliveException: cgf3_A-1796-1725004893 does not appear to be alive, you should probably exit

This error prompted us to investigate the running worker processes on a supervisor node. We discovered that there were 11 java -cp storm processes consuming 11 distinct -Dworker.port settings. However, the Storm UI indicated that only 8 slots were in use, suggesting that the additional 3 processes were unhealthy or zombie processes.

Based on our observations, these extra java -cp storm processes likely correspond to old or unhealthy workers associated with topologies that either no longer exist or have been redeployed under different names. This situation arises because the system does not automatically clean up these stale processes when topologies are killed or redeployed, leading to a resource bottleneck and potential delays in worker assignments.

Explanation

Worker Processes: Each java -cp storm process represents a JVM worker allocated by Storm to a specific slot on a supervisor. The -Dworker.port argument indicates the port number associated with that worker, corresponding to a particular slot.

Mismatch in Slot Count: The Storm UI shows that 8 active slots are being used by your topologies, but when checking the command output, we see 11 worker processes running on the supervisor. This discrepancy implies that the extra 3 worker processes are leftovers from previous topologies that were either redeployed or removed.

Old Worker Processes: These additional processes are likely orphaned or zombie workers. Typically, when a topology is redeployed or removed, Storm should automatically terminate its associated workers. However, in some scenarios, this termination does not occur as expected, causing the worker processes to remain active on the supervisor.

Orphaned Processes: These lingering processes continue to consume resources despite their corresponding topologies no longer existing or being renamed in redeployments. As a result, they become "unhealthy" or stale, leading to potential resource exhaustion and delays in worker assignments.

To address this issue, we have decided to develop a sophisticated monitoring script designed to quickly identify orphaned worker processes and clean them up in a timely manner. This proactive approach will help ensure that resources are utilized efficiently and prevent delays in topology processing.

Resource Bottlenecks

In our investigation into resource bottlenecks, we examined CPU, memory, and disk I/O utilization across Supervisor and Nimbus nodes. The metrics indicated that certain Supervisor nodes frequently hit their memory and CPU limits, leading to processing delays and missed heartbeats with Nimbus. Additionally, high disk I/O on some nodes contributed to sluggish worker performance and topology lag.

These findings highlighted resource contention as a significant factor in the cluster's instability, leading us to consider optimizing resource allocation and potentially upgrading hardware to better handle the increased workload.

However, we observed in the AWS console that after each restart of the Storm cluster, the load was evenly distributed across all six Supervisor nodes. When any node was blacklisted, Nimbus automatically redistributed the load to other available nodes, resulting in increased traffic volume on those nodes while others became idle.

Given this situation, we decided to prioritize addressing the "worker blacklisting" issue before further monitoring and analyzing resource bottlenecks. This approach will help ensure a more balanced load across all nodes and improve overall cluster performance.

Phase 3: Incremental Fixes and Tuning

Configuration Tuning

Blacklisting of Worker Nodes
Given the varying load across different IDCs and the frequent blacklisting of worker nodes in the US IDC, where stability issues have been most pronounced, we decided to fine-tune the Storm configuration specifically for this region. We opted for a more lenient blacklisting approach with the following settings:

blacklist.scheduler.assume.supervisor.bad.based.on.bad.slot: false
blacklist.scheduler.resume.time.secs: 900
blacklist.scheduler.tolerance.count: 5
blacklist.scheduler.tolerance.time.secs: 600

Infrastructure Upgrades

As discussed in the Resource Bottlenecks section, we concluded that no immediate changes to infrastructure were necessary. Our analysis did not reveal any definitive evidence to warrant an upgrade in hardware resources at this time.

Monitoring Enhancements

Detect Unhealthy Worker Slots and Cleanup
Given the frequency of topology deployments and swaps between A and B, we recognized that the existing Apache Storm 2.2 version might have an internal issue that leads to unhealthy worker slots. To address this, we will implement a dedicated script that runs on a CRON schedule to detect and clean up unhealthy worker slots proactively. This enhancement will help maintain a healthy operational environment and reduce the occurrence of stuck topologies.

Final Solution and Stability Achieved

Changes Implemented

Blacklisting of Worker Nodes

To address frequent transient issues, we adopted a lenient blacklisting approach in the US IDC, where we experience the maximum load, while keeping default settings in other IDCs. The new configuration is as follows:

  • blacklist.scheduler.assume.supervisor.bad.based.on.bad.slot: false
  • blacklist.scheduler.resume.time.secs: 900
  • blacklist.scheduler.tolerance.count: 5
  • blacklist.scheduler.tolerance.time.secs: 600

Topologies Getting Stuck

To ensure Apache Storm topologies in Clojure continue processing traffic even if the Elastic APM server becomes unhealthy, we reviewed the Clojure code. We implemented a "fire-and-forget" approach by configuring the APM agent for non-blocking, asynchronous reporting. This allows the agent to send data to the APM server in the background without blocking the application's main execution flow.

Cleanup of Zombie Worker Processes

To tackle the issue of zombie worker processes, we developed a script that runs on every supervisor node. The script checks for specific error messages in the logs, and if detected, attempts to clean up the stale worker processes by killing their associated PIDs. Here’s the structure of the script:

#!/bin/bash

# Define log file
LOG_FILE="/var/log/apps/storm2x/kill_stale_storm_workers.log"

# Function to log messages
log() {
   echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> $LOG_FILE
}

# Check if script is already running
SCRIPT_NAME=$(basename "$0")
PID_FILE="/tmp/${SCRIPT_NAME}.pid"

if [ -f "$PID_FILE" ]; then
   OLD_PID=$(cat "$PID_FILE")
   if ps -p "$OLD_PID" > /dev/null 2>&1; then
       # If another instance is running, exit and log
       START_TIME=$(ps -p "$OLD_PID" -o etime=)
       log "Script is already running for $START_TIME. Exiting..."
       exit 1
   fi
fi

# Write the current script PID to the file
echo $$ > "$PID_FILE"
log "Starting storm topology health check."

# Define log file to monitor
SUPERVISOR_LOG="/var/log/apps/st

Performance Metrics

After implementing the final set of configuration changes and resource optimizations, we observed significant improvements in the performance metrics of our Storm cluster.

Elimination of Node Blacklisting

  • We have not encountered any instances of topology blacklisting since the implementation of the changes. Previously, we experienced 4-5 instances of topologies being blacklisted within an hour.
  • When blacklisting occurred, Nimbus transferred the affected topologies from the blacklisted node to other nodes, resulting in a substantial increase in load on the failover nodes—often reaching up to 99%. This led to an uneven distribution of load across the cluster. However, with the elimination of blacklisting, we have achieved a more balanced load distribution, enhancing the overall stability and performance of the Storm cluster.

null

Also the Nimbus logs doesn’t report blacklisting of any nodes. As you can see, it’s an empty array below.

[smartechro@stnb-01-us storm2x]$ grep blacklist nimbus.log
2024-10-13 20:33:23.067 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:33:33.235 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:33:43.402 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:33:53.588 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:34:03.746 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:34:13.882 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:34:24.073 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:34:34.222 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:34:44.418 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:34:55.200 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:35:05.503 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:35:15.696 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:35:25.866 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:35:36.013 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:35:46.287 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:35:56.429 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:36:06.574 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:36:16.746 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:36:26.888 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:36:37.030 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:36:47.181 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:36:57.375 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

2024-10-13 20:37:07.539 o.a.s.s.b.BlacklistScheduler timer [INFO] Supervisors [] are blacklisted.

Cleaning Up Zombie Worker Processes

The periodic cleanup of orphaned worker processes has effectively eliminated issues related to worker allocation during topology deployments. As a result, we have observed a significant reduction in alerts for "topology stuck" scenarios.

Business Impact

The stabilization of our Apache Storm cluster has had a profoundly positive impact on the business. With real-time processing restored to full reliability, we successfully met our operational SLAs without interruptions, ensuring timely delivery of critical data to downstream systems. The reduction in downtime and processing delays has improved overall system performance, facilitating faster decision-making and enhancing customer experiences.

Moreover, increased system stability has minimized the time and effort required for firefighting and manual interventions. This shift has allowed the engineering team to concentrate on new feature development and optimizations, ultimately driving further value for the business.

Lessons Learned

Key Takeaways

  • Systematic Approach: Effectively addressing performance and stability issues necessitates a systematic, phased approach. Starting with quick fixes is essential, but a thorough examination of logs, configurations, and resource bottlenecks is crucial to uncovering root causes.
  • Cross-Team Collaboration: Collaboration across teams—particularly with infrastructure and networking—proved invaluable in ruling out potential issues and fine-tuning our optimizations. Engaging multiple perspectives ensures a more comprehensive understanding of system dynamics.
  • Monitoring and Alerting: The role of monitoring and alerting tools is critical in diagnosing problems and proactively managing system health. These tools help maintain long-term stability and prevent future disruptions by providing timely insights.

Best Practices

From our experience, several best practices have emerged that can help ensure the stability and performance of an Apache Storm cluster:

  1. Regular Configuration Reviews: Periodically review and fine-tune configurations, especially those related to timeouts, retries, and resource allocation, to align with the evolving needs of the cluster. Remember, a single default configuration cannot suit all cluster setups.
  2. Comprehensive Monitoring: Implement thorough monitoring and alerting to track key metrics such as CPU usage, memory consumption, and task execution times in real-time. This enables early detection of potential issues. However, a spike in load does not always necessitate hardware resizing; often, the root cause may lie elsewhere.
  3. Periodic Log Reviews: Conduct regular log reviews to identify emerging patterns or errors that could signal larger issues. This proactive approach can prevent minor problems from escalating.
  4. Foster Cross-Team Collaboration: Encourage collaboration among various teams—application, infrastructure, networking, and business stakeholders—to ensure all aspects are aligned in maintaining optimal system performance.

Appendix

Configuration Snippets

Suggested Values Based on Use Case:

  • Standard Production Environment (Balanced Approach):
blacklist.scheduler.assume.supervisor.bad.based.on.bad.slot: true
blacklist.scheduler.resume.time.secs: 1800
blacklist.scheduler.tolerance.count: 3
blacklist.scheduler.tolerance.time.secs: 300

  • Aggressive Blacklisting (Critical Stability Required):
blacklist.scheduler.assume.supervisor.bad.based.on.bad.slot: true
blacklist.scheduler.resume.time.secs: 3600
blacklist.scheduler.tolerance.count: 2
blacklist.scheduler.tolerance.time.secs: 300

Lenient Blacklisting (Frequent Transient Issues):

blacklist.scheduler.assume.supervisor.bad.based.on.bad.slot: false
blacklist.scheduler.resume.time.secs: 900
blacklist.scheduler.tolerance.count: 5
blacklist.scheduler.tolerance.time.secs: 600

Error Logs

NimbusLeaderNotFoundException

time="2024-09-16T02:29:58+05:30" level=info msg="--DEPLOY TOPOLOGY--- [cg11_A]"

time="2024-09-16T02:29:58+05:30" level=info msg="sleeping for 60 sec before redeploy, isAdhocCall: false"

time="2024-09-16T02:30:58+05:30" level=info msg="api call url is https://example-host"

time="2024-09-16T02:30:58+05:30" level=info msg="supervisor summary response ( BEFORE ): map[error:500 Server Error errorMessage:org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [storm-nimbus.us-east-1.us.smt.internal]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?\n\tat org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:250)\n\tat org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:179)\n\tat org.apache.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:138)\n\tat org.apache.storm.daemon.ui.resources.StormApiResource.getSupervisorSummary(StormApiResource.java:226)\n\tat sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)\n\tat org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124)\n\tat org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167)\n\tat org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)\n\tat org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)\n\tat org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469)\n\tat org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:391)\n\tat org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80)\n\tat org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:253)\n\tat org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)\n\tat org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)\n\tat org.glassfish.jersey.internal.Errors.process(Errors.java:292)\n\tat org.glassfish.jersey.internal.Errors.process(Errors.java:274)\n\tat org.glassfish.jersey.internal.Errors.process(Errors.java:244)\n\tat org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)\n\tat org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:232)\n\tat org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:680)\n\tat org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:392)\n\tat org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)\n\tat org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:365)\n\tat org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:318)\n\tat org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)\n\tat org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:867)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)\n\tat org.apache.storm.daemon.ui.filters.HeaderResponseServletFilter.doFilter(HeaderResponseServletFilter.java:62)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)\n\tat org.apache.storm.daemon.drpc.webapp.ReqContextFilter.handle(ReqContextFilter.java:83)\n\tat org.apache.storm.daemon.drpc.webapp.ReqContextFilter.doFilter(ReqContextFilter.java:70)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)\n\tat org.apache.storm.logging.filters.AccessLoggingFilter.handle(AccessLoggingFilter.java:46)\n\tat org.apache.storm.logging.filters.AccessLoggingFilter.doFilter(AccessLoggingFilter.java:38)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:311)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:265)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1588)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1557)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:502)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)\n\tat org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)\n\tat java.lang.Thread.run(Thread.java:748)\n], error: %!s(<nil>), isAdhocCall: false"

time="2024-09-16T02:30:58+05:30" level=info msg="api call url is https://example-host"

Zombie worker process

2024-08-30 23:54:02.061 o.a.s.h.HealthChecker timer [INFO] The supervisor healthchecks succeeded.
2024-08-30 23:54:02.236 o.a.s.d.s.Supervisor pool-11-thread-3 [WARN] Topology config is not localized yet...
2024-08-30 23:54:02.236 o.a.s.t.ProcessFunction pool-11-thread-3 [ERROR] Internal error processing sendSupervisorWorkerHeartbeat
org.apache.storm.utils.WrappedNotAliveException: cgf3_A-1796-1725004893 does not appear to be alive, you should probably exit
       at org.apache.storm.daemon.supervisor.Supervisor$1.sendSupervisorWorkerHeartbeat(Supervisor.java:448) ~[storm-server-2.2.0.jar:2.2.0]
       at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:374) ~[storm-client-2.2.0.j
ar:2.2.0]
       at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:353) ~[storm-client-2.2.0.j
ar:2.2.0]
       at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
       at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
       at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:172) [storm-client-2.2.
0.jar:2.2.0]
       at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [storm-shaded-deps-

2.2.0.jar:2.2.0]

       at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) [storm-shaded-deps-2.2.0.jar:2.2.0]

2024-08-30 23:54:12.256 o.a.s.d.s.Slot SLOT_7738 [DEBUG] STATE empty

2024-08-30 23:54:12.257 o.a.s.d.s.Supervisor pool-11-thread-2 [WARN] Topology config is not localized yet...

2024-08-30 23:54:12.257 o.a.s.t.ProcessFunction pool-11-thread-2 [ERROR] Internal error processing sendSupervisorWorkerHeartbeat
org.apache.storm.utils.WrappedNotAliveException: cgf3_A-1796-1725004893 does not appear to be alive, you should probably exit
       at org.apache.storm.daemon.supervisor.Supervisor$1.sendSupervisorWorkerHeartbeat(Supervisor.java:448) ~[storm-server-2.2.0.jar:2.2.0]
       at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:374) ~[storm-client-2.2.0.j
ar:2.2.0]
       at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:353) ~[storm-client-2.2.0.j
ar:2.2.0]
       at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
       at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
       at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:172) [storm-client-2.2.
0.jar:2.2.0]
       at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [storm-shaded-deps-
2.2.0.jar:2.2.0]
       at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) [storm-shaded-deps-2.2.0.jar:2.2.0]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_272]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_272]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_272]
2024-08-30 23:54:22.281 o.a.s.d.s.Supervisor pool-11-thread-1 [WARN] Topology config is not localized yet...
2024-08-30 23:54:22.282 o.a.s.t.ProcessFunction pool-11-thread-1 [ERROR] Internal error processing sendSupervisorWorkerHeartbeat
org.apache.storm.utils.WrappedNotAliveException: cgf3_A-1796-1725004893 does not appear to be alive, you should probably exit
       at org.apache.storm.daemon.supervisor.Supervisor$1.sendSupervisorWorkerHeartbeat(Supervisor.java:448) ~[storm-server-2.2.0.jar:2.2.0]
       at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:374) ~[storm-client-2.2.0.j
ar:2.2.0]
       at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:353) ~[storm-client-2.2.0.j
ar:2.2.0]
       at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
       at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
       at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:172) [storm-client-2.2.
0.jar:2.2.0]
       at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [storm-shaded-deps-
2.2.0.jar:2.2.0]
       at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) [storm-shaded-deps-2.2.0.jar:2.2.0]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_272]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_272]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_272

Conclusion

Our journey to enhance the stability and performance of our Apache Storm cluster has yielded substantial benefits for both our operations and our business. By systematically addressing issues such as worker blacklisting, orphaned processes, and resource bottlenecks, we were able to restore reliable real-time processing. The implementation of proactive monitoring and configuration adjustments has not only reduced downtime but also empowered our engineering team to focus on innovation rather than firefighting. Moving forward, we remain committed to continuous improvement, ensuring that our systems can adapt to changing demands while delivering exceptional service to our stakeholders.