Application Server: Health System: Tracking And Reporting Anomalies

From Resin 4.0 Wiki

Latest revision as of 00:00, 28 January 2012

Monitoring Application Server Health Through Statistical Analysis of JMX Attributes

The Resin Application Server health system provides many useful tools to monitor, report, and alert on the health of your application server. Monitoring of typical metrics such as high CPU usage, low memory, and deadlocked threads is pre-configured for you in health.xml. We also include appropriately conservative remediation actions in health.xml, such as triggering thread dumps, heap dumps, and restarts when necessary. It's up to you to tweak these settings to make the health system more or less aggressive as you see fit.


Resin goes beyond typical metrics monitoring by looking for anomalies in JMX attributes.


Any numeric attribute of any MBean in JMX can be configured as a Meter in Resin, which then enables:

  • Persistent historical tracking
  • Visual graphing in resin-admin
  • Visual graphing in PDF reports
  • Cluster-wide reporting
  • Health monitoring
  • Anomaly analysis and logging
  • Triggering health actions (heap dump, thread dump, restart, etc.)


Creating a Meter

Resin comes pre-configured with a set of common meters in health.xml. When adding new meters or anomaly analyzers, we recommend creating a new file in conf/resin-inf, which Resin imports automatically. Keeping your additions out of the stock files makes future upgrades simpler. Alternatively, you can add meters directly to conf/health.xml.

conf/resin-inf/my-meters.xml:

<resin xmlns="http://caucho.com/ns/resin"
      xmlns:resin="urn:java:com.caucho.resin"
      xmlns:health="urn:java:com.caucho.health"
      xmlns:ee="urn:java:ee">

<health:JmxMeter>
  <name>JVM|Thread|JVM Blocked Count</name>
  <objectName>resin:type=JvmThreads</objectName>
  <attribute>BlockedCount</attribute>
</health:JmxMeter>

</resin>


In this example we've created a JmxMeter on the attribute BlockedCount of the MBean resin:type=JvmThreads. This is an important attribute to track, since it reports the number of blocked threads, and a significant increase in that value can indicate a serious issue.
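
To make concrete what "a numeric attribute of an MBean" means, here is a minimal Java sketch that reads one attribute through the standard JMX API. It is not Resin's implementation. The class name JmxAttributeReader is hypothetical, and the sketch reads ThreadCount from the standard java.lang:type=Threading MBean as a stand-in, because resin:type=JvmThreads is only registered inside a running Resin instance.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical helper: reads one numeric JMX attribute, the kind of raw value a meter samples. */
final class JmxAttributeReader {

    /** Returns the attribute as a double, or throws if it is missing or not numeric. */
    static double read(String objectName, String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            Object value = server.getAttribute(new ObjectName(objectName), attribute);
            if (!(value instanceof Number)) {
                throw new IllegalArgumentException(attribute + " is not numeric: " + value);
            }
            return ((Number) value).doubleValue();
        } catch (JMException e) {
            // Covers malformed names, unknown MBeans, and unknown attributes.
            throw new IllegalStateException("Cannot read " + objectName + " " + attribute, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in MBean that exists in every JVM (an assumption made for this sketch).
        System.out.println(read("java.lang:type=Threading", "ThreadCount"));
    }
}
```

The object name and attribute name passed to read correspond to the objectName and attribute elements of the meter configuration.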


We also provide JmxDeltaMeter, which reports the difference between the current and previous attribute values.

<health:JmxDeltaMeter>
  <name>JVM|Compilation|Compilation Time</name>
  <objectName>java.lang:type=Compilation</objectName>
  <attribute>TotalCompilationTime</attribute>
</health:JmxDeltaMeter>

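Above, a delta meter is created for compilation time, another important metric to monitor. TotalCompilationTime is a cumulative counter, so the delta meter reports the compilation time accumulated since the previous sample rather than the ever-growing total.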
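
The delta idea can be sketched in a few lines of Java. This is an illustration only, not Resin's code. The class DeltaSampler is hypothetical, and reporting 0 for the very first sample (when no previous value exists) is an assumption of this sketch.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch: turns a cumulative counter into per-sample differences. */
final class DeltaSampler {
    private final LongSupplier source; // supplies the current cumulative value
    private long previous;
    private boolean primed;            // false until the first sample has been taken

    DeltaSampler(LongSupplier source) {
        this.source = source;
    }

    /** Returns current minus previous; the first call returns 0 (an assumption). */
    long sample() {
        long current = source.getAsLong();
        long delta = primed ? current - previous : 0L;
        previous = current;
        primed = true;
        return delta;
    }

    public static void main(String[] args) {
        // Made-up cumulative readings, standing in for TotalCompilationTime in milliseconds.
        long[] readings = {100L, 130L, 130L, 175L};
        int[] index = {0};
        DeltaSampler sampler = new DeltaSampler(() -> readings[index[0]++]);
        for (int i = 0; i < readings.length; i++) {
            System.out.println(sampler.sample()); // prints 0, 30, 0, 45 on separate lines
        }
    }
}
```

A flat stretch in the cumulative counter shows up as a delta of 0, which is why a delta meter is easier to graph and analyze than the raw total.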


Please refer to the resin-doc on Health Meters (http://www.caucho.com/resin-4.0/admin/health-meters.xtp) for more information.


Analyzing a Meter

Meters alone are useful for manual inspection in resin-admin, since every meter can be graphed. However, Resin also provides an automatic analysis tool called AnomalyAnalyzer. AnomalyAnalyzer compares the current meter value against the average value and checks for deviations, so unusual changes such as a spike in blocked threads can be detected.

<health:AnomalyAnalyzer>
  <meter>JVM|Thread|JVM Blocked Count</meter>
  <health-event>caucho.thread.anomaly.jvm-blocked</health-event>
</health:AnomalyAnalyzer>

In this example we've created an AnomalyAnalyzer on the blocked thread meter we created above, and assigned it to the health event "caucho.thread.anomaly.jvm-blocked". The health-event attribute is optional. Without a health-event, an anomaly analyzer only logs the anomalies it detects to the Resin log at WARNING level. These alerts also show up in PDF reports. An example anomaly log entry (in this case from a meter named JVM|Thread|JVM Runnable Count) is shown below:

2012-01-20 16:10:00 AnomalyAnalyzer JVM|Thread|JVM Runnable Count WARNING value=3.000, deviation=9.487 sigma mean=2.011 std=0.104 n=92.0
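
The fields in this log line suggest a deviation measured in standard deviations: (3.000 - 2.011) / 0.104 is roughly 9.5, close to the logged 9.487 sigma once the rounding of the printed mean and std is taken into account. The Java sketch below shows one way to compute such a value. It is an assumption-laden illustration, not Resin's algorithm: the class SigmaDetector is hypothetical, it uses Welford's running-statistics method, and it uses the sample standard deviation (dividing by n - 1).

```java
/** Hypothetical sketch: running mean and standard deviation with a sigma score. */
final class SigmaDetector {
    private long n;       // number of samples seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared differences from the mean

    /** Adds one sample to the running statistics (Welford's method). */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    double mean() {
        return mean;
    }

    /** Sample standard deviation; 0 until at least two samples have been added. */
    double std() {
        return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
    }

    /** Distance of x from the mean in standard deviations; 0 when std is 0. */
    double sigma(double x) {
        double s = std();
        return s == 0.0 ? 0.0 : Math.abs(x - mean) / s;
    }

    public static void main(String[] args) {
        SigmaDetector detector = new SigmaDetector();
        for (double v : new double[] {1, 2, 3}) {
            detector.add(v); // history: mean 2, std 1
        }
        System.out.println("deviation=" + detector.sigma(5) + " sigma");
    }
}
```

In this sketch the new value is scored before being added to the history, so a spike does not dilute its own baseline. Whether Resin does the same is not stated in this article.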


Reacting to Anomalies

Resin's health system provides a set of remediation actions that you can configure to automatically execute in reaction to an anomaly. The <health-event> attribute we configured above allows us to tie health actions to a detected anomaly, as shown below:

<health:DumpThreads>
  <health:IfHealthEvent regexp="caucho.thread"/>
  <health:IfNotRecent time="15m"/>
</health:DumpThreads>

In this example we've created a DumpThreads action with two conditions. The first condition, IfHealthEvent, tells the action to execute only if the health event starts with "caucho.thread". The second condition, IfNotRecent, prevents the action from executing more than once every 15 minutes.
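
The way the two conditions combine can be sketched in Java as follows. This is a hypothetical model, not Resin's code: the class ActionGate and its method names are invented for illustration, and matching the pattern at the start of the event name (via Matcher.lookingAt) is an assumption based on the "starts with" description above. Note that an unescaped "." in a Java regular expression matches any character, so "caucho.thread" is slightly broader than the literal text.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.regex.Pattern;

/** Hypothetical sketch: an action guarded by an event pattern and a minimum interval. */
final class ActionGate {
    private final Pattern eventPattern;
    private final Duration minInterval;
    private Instant lastRun; // null until the action has run once

    ActionGate(String regexp, Duration minInterval) {
        this.eventPattern = Pattern.compile(regexp);
        this.minInterval = minInterval;
    }

    /** Returns true, and records the run, only when both conditions pass. */
    boolean shouldRun(String healthEvent, Instant now) {
        // Assumption: the pattern must match at the beginning of the event name.
        boolean eventMatches = eventPattern.matcher(healthEvent).lookingAt();
        boolean notRecent = lastRun == null
                || Duration.between(lastRun, now).compareTo(minInterval) >= 0;
        if (eventMatches && notRecent) {
            lastRun = now;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ActionGate gate = new ActionGate("caucho.thread", Duration.ofMinutes(15));
        Instant start = Instant.parse("2012-01-20T16:10:00Z");
        String event = "caucho.thread.anomaly.jvm-blocked";
        System.out.println(gate.shouldRun(event, start));                               // true
        System.out.println(gate.shouldRun(event, start.plus(Duration.ofMinutes(5))));   // false
        System.out.println(gate.shouldRun(event, start.plus(Duration.ofMinutes(15))));  // true
    }
}
```

Both conditions must hold at once, so a burst of anomalies within 15 minutes produces a single thread dump rather than one per anomaly.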

Resin provides many other useful conditions that can be applied to any health action.


Here is the example in full, which belongs in conf/resin-inf/my-meters.xml:

<resin xmlns="http://caucho.com/ns/resin"
           xmlns:resin="urn:java:com.caucho.resin"
           xmlns:health="urn:java:com.caucho.health"
           xmlns:ee="urn:java:ee">

  <health:JmxMeter>
    <name>JVM|Thread|JVM Blocked Count</name>
    <objectName>resin:type=JvmThreads</objectName>
    <attribute>BlockedCount</attribute>
  </health:JmxMeter>

  <health:AnomalyAnalyzer>
    <meter>JVM|Thread|JVM Blocked Count</meter>
    <health-event>caucho.thread.anomaly.jvm-blocked</health-event>
  </health:AnomalyAnalyzer>

  <health:DumpThreads>
    <health:IfHealthEvent regexp="caucho.thread"/>
    <health:IfNotRecent time="15m"/>
  </health:DumpThreads>

</resin>


Full documentation on Resin's Application Health System is available in the public resin-doc: http://www.caucho.com/resin-4.0/admin/health.xtp
