Wednesday, March 9, 2011

Collecting Oracle metrics with Reconnoiter

Oracle database makes your life, as an operator, easier by measuring and providing tons of metrics. These metrics range from wait time counters, thru I/O operations and sizes to number of failed logins. Each new version adds even more useful metrics. For example, Oracle 11gR2 collects 628 different metrics.

These metrics are generally available thru the (G)V$SYSSTAT system view. This view can be queried by standard SQL*Plus or SQL Developer.

SQL> SELECT COUNT(*) FROM V$SYSSTAT;

COUNT(*)              
----------------------
628               

However, to make sense of these numbers, we are usually not interested in absolute numbers, but rather in trends and these trends are best observed thru graphs. Oracle provides the Enterprise Manager and if you purchase the StatsPack option, you can display graphs of some of the metrics. But you might not have the StatsPack option and even if you have, the data for some metrics is not available as graphs and if it is, the metrics are not stored for very long time.

Enter Reconnoiter

If you haven't heard of Reconnoiter, it is a new and modern approach to monitoring and trending. Although still in early stages of development, it is very usable open source product. One of the greatest features of Reconnoiter is, that it can be fed data in many different and pretty simple ways. You can write high-performance modules in C or it can be easily extended with Lua modules. Or you can feed it data in Resmon format, which is what we are going to do here.

Resmon is another open-source project from OmniTI and originally is a system resource monitor written in Perl, that makes all the usual system metrics (such as CPU load, memory, disc space etc.) available via an XML format over HTTP protocol. We are going to use the Resmon XML format, because Reconnoiter understands it out-of-the box and no hacking will be required.

Resmon PHP classes

I've prepared a set of PHP classes to generate the Resmon XML data. The Resmon class is a base class and ResmonOracle extends this class to provide our Oracle metrics.

The source code lives on my GitHub in the php-resmon repo.

To get the source code, simply clone the repository like this (you must have git installed):

git clone git://github.com/whizz/php-resmon.git

You can edit the example-oracle.php file to supply your connection details and give it a try. I recommend you create a special user for this purpose and give it only connection privileges and grant select on the (G)V$SYSTAT view to be secure.

The example is pretty straightforward:

<?php

require_once('Resmon.class.php');
require_once('ResmonOracle.class.php');

$resmon = new ResmonOracle('resmon','resmon', 'MYTNSNAME', true);

$resmon->getSysStat();
$resmon->outputAsXML();

This will, when called, produce the XML file which can be fed to Reconnoiter. The output contains a link to the XSL stylesheet, so if you view it in your browser, it should be pretty-printed for you like this:

Screenshot from browser view of the output XML

Enabling metric collection

Now that we have the data source, it's time to register this source in Noit, which is the component of Reconnoiter responsible for data collection. You can add the source in the XML configuratin file noit.conf. I prefer the telnet console to noit, which handles similar to a network appliance.

Fire up the terminal and let's go:

[mtaborsky@myserver ~]# telnet localhost 32322
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
noitd: (no auth)
noit# conf t
noit(conf:/)# cd checks
noit(conf:/checks)# section databases

We have now created a sub-section in the noit checks to hold our database metric collectors (called checks). Now we are going to create a new check and cofigure it:

noit(conf:/checks/databases)# check new target 10.1.16.18
noit(conf:[8044f556-09dc-48a8-8f8f-353179a26d76])# attr module resmon
noit(conf:10.1.16.18`resmon)# attr name OracleTesting
noit(conf:10.1.16.18`OracleTesting)# conf port 80
noit(conf:10.1.16.18`OracleTesting)# conf url http://php-resmon.mtaborsky.dev/
noit(conf:10.1.16.18`OracleTesting)# attr period 60000
noit(conf:10.1.16.18`OracleTesting)# attr timeout 10000
noit(conf:10.1.16.18`OracleTesting)# status
==== 8044f556-09dc-48a8-8f8f-353179a26d76 ====
 name: OracleTesting
 module: resmon
 target: 10.1.16.18
 period: 60000
 timeout: 10000
 oncheck: [undef]
 filterset: default [inherited from @filterset]
 disable: true
 config::url: http://php-resmon.mtaborsky.dev/
 config::port: 80
 currently: disabled
 last run: never

We have configured the check, given it the URL and port of our webserver with the resmon output (in my example it is http://php-resmon.mtaborsky.dev/). The check will run every 60 seconds and timeout if we do not receive the answer within 10 seconds. The status command shows us, that the check is disabled now, because all checks are automatically disabled by default. Let's enable it then:

noit(conf:10.1.16.18`OracleTesting)# no attr disable

After a little while, the check will run for the first time and the status output will show this:

noit(conf:10.1.16.18`OracleTesting)# status
==== 8044f556-09dc-48a8-8f8f-353179a26d76 ====
 name: OracleTesting
 module: resmon
 target: 10.1.16.18
 period: 60000
 timeout: 10000
 oncheck: [undef]
 filterset: default [inherited from @filterset]
 disable: [undef]
 config::port: 80
 config::url: http://php-resmon.mtaborsky.dev/
 currently: idle
 last run: 4.040 seconds ago
 availability/state: available/good
 status: services=3
 metrics:
   Oracle::SysStat`Inst_2`cell smart IO session cache lookups[l] = 0
   Oracle::SysStat`Inst_1`consistent gets from cache[l] = 126600793925
   Oracle::SysStat`Inst_2`DML statements parallelized[l] = 0
   Oracle::SysStat`Inst_1`gc reader bypass grants[l] = 0
   Oracle::SysStat`Inst_2`spare statistic 3[l] = 0

[... about a million lines with metrics deleted ...]

   Oracle::SysStat`Inst_2`read-only violation count[l] = 0

We can see, that noit is running the check and collecting data for us. We will leave the console for now and save our configuration to disc before doing so:

noit(conf:10.1.16.18`OracleTesting)# exit
noit(conf:/checks/databases)# write mem
4413 bytes written.
noit(conf:/checks/databases)# exit
noit# exit
Connection closed by foreign host.

In order to be able to see something meaningful, we now have to wait for few hours and let noit collect enough data to make some pretty pictures.

If you prefer editing the XML config file, the check will look like this:

<check 
  uuid="8044f556-09dc-48a8-8f8f-353179a26d76" 
  module="resmon" 
  name="OracleTesting" 
  period="60000" 
  timeout="10000" 
  target="10.1.16.18">
    <config>
      <port>80</port>
      <url>http://php-resmon.mtaborsky.dev/</url>
    </config>
</check>


Making graphs
It is very easy to produce trending graphs in Reconnoiter web console. Check out this video on how to make them, if you are unfamiliar with that.

Following example shows a graph constructed from a "redo size" metric on a two-node RAC cluster. The two metrics are stacked, so we can read a total volume of redo generated per second on the monitored system.

Generated redo size per second, shown for a period of one day


One of the great features of Reconnoiter is, it never throws away any data. So even if you have several years worth of data, you can still drill down to a five-minute interval anytime in history. And while the volume of data is no doubt huge, the system still performs quite snappy, thanks to using table partitioning in PostgreSQL. As Theo Schlossnagle says in his video linked above, "Buy more disks, they are really cheap."

For some extra awesomeness, try the real-time data streaming by pressing the PLAY button above the graph. Since our PHP script generates the data on-the-fly, you will be able to see, in real-time, how much redo is being generated on the system. How great is that? Quite a lot!

You can watch the redo size in real time, the graph is updated every second


The End
So, in conclusion. I've shown you, how easy is it to generate consumable metric data from Oracle using PHP and how to pull these metrics into Reconnoiter for trending and monitoring. Check out the code on GitHub and if you have any comments or contributions, please do not hesitate to post them here.

3 comments:

  1. You might also want to check out jezebel (part of reconnoiter). The docs are sparse on it, but jezebel is a stand alone Java daemon that can connect directly to Oracle (as well as other DBs) via JDBC, so you can specify queries to run on the DB.

    http://labs.omniti.com/labs/reconnoiter/docs/ch05s09.html

    ReplyDelete
  2. Hi, Could you please tell me how to configure the Stratcon.conf file.

    I am finding difficulty to configure the stomp.

    I am seeing the error messages like,

    >> Error writing jlog header over SSL -1 != 20

    >> [2011-12-16 14:12:21.527885] bind failed[*]: Address already in use

    >> [2011-12-16 14:12:22.538255] MQ connection failed

    >> [2011-12-16 14:12:23.589642] STOMP send failed, disconnecting failed to MQ submit.

    My post link to Reconnoiter user group,which contains attached noit.conf and stratcon.conf files.

    http://labs.omniti.com/labs/reconnoiter/lists/users/2011-December/000700.html

    Thanks,
    Kiran.

    ReplyDelete
  3. Hello Kiran, I am not using the event processing engine (IEP), I have the whole section disabled, so I would be guessing here only.

    If you want to just try the trending features of Reconnoiter, change this line:
    <iep disabled="false">
    to
    <iep disabled="true">

    ReplyDelete