Saturday, November 5, 2011

JMX metrics in Reconnoiter

We started using Hadoop, written in Java and it exposes it's metrics thru JMX, which is a Java standard for monitoring and metrics publishing. Since we use Reconnoiter for all other metrics collection and graphing, I wanted to get this JMX data into it.

Reconnoiter has a Java component, called Jezebel, that functions as a bridge between Java world and Reconnoiter. What it does is it listens on port 8083 for a check request. When the request comes from noitd, it performs an on-demand check and formats the output of the check as a Resmon-style XML document. Inside noitd is a simple Lua module, that composes the check request, submits it to Jezebel and parses the result into metrics. Sounds a little complicated, but is in fact quite flexible. Unfortunately, the documentation on Jezebel check configuration is sparse (to be diplomatic), so I decided to record what I did to make it running.

Getting Jezebel to run
First, we have to make sure Jezebel is running. It does not require any configuration as such. If you installed Reconnoiter properly, it can be started by executing:

jezebel -l /var/log/jezebel.log -p /var/run/jezebel.pid

To test, if Jezebel is running correctly, we can manufacture a test request and submit it to Jezebel. Let's create an XML file with the following content, same as it will be coming from noit, name it check.xml, for example. 

<?xml version="1.0" encoding="utf8"?>
<check target="127.0.0.1" target_ip="127.0.0.1" 
  module="com.omniti.jezebel.SampleCheck" name="jezebelcheck" 
  period="10000" timeout="5000">
  <config/>
</check>

The check attributes get passed to the check class, together with the config parameters (of which there are none in this request, because it is a request to a dummy check class). Now, let's post it to Jezebel and see what happens:

wget -qO - --post-file=check.xml http://localhost:8083/dispatch/com.omniti.jezebel.SampleCheck

You should get output like this:

<!DOCTYPE ResmonResults SYSTEM "http://labs.omniti.com/resmon/trunk/resources/resmon.dtd">
<ResmonResults>
<ResmonResult module="com.omniti.jezebel.SampleCheck" service="jezebelcheck">
<last_update>1320526541</last_update>
<metric name="x" type="i">1234</metric>
<metric name="y" type="n">12.342354</metric>
<metric name="mood" type="s">happy</metric>
</ResmonResult>
</ResmonResults>

Configuring the Lua module
When Jezebel is running, next step is to configure it as noitd module, so you can use it in the checks. Jezebel in fact is composed of several modules itself, each module represented by a Java class implementing the JezebelCheck interface. 

The available classes are:
  • com.omniti.jezebel.SampleCheck - an example module, useful for checking if Jezebel is running, as we did above, it always returns the same three metrics.
  • com.omniti.jezebel.check.jmx - this is what we are interested in, JMX metrics collection.
  • com.omniti.jezebel.check.mysql - this and the other three classes are bult on top of JDBC and allow running arbitrary queries to pull metrics from various databases.
  • com.omniti.jezebel.check.oracle
  • com.omniti.jezebel.check.postgres
  • com.omniti.jezebel.check.sqlserver
We'll configure just two of these, SampleCheck and jmx, like this:

<jezebel>
  <config>
    <url>http://127.0.0.1:8083/dispatch</url>
  </config>
  <module loader="lua" name="com.omniti.jezebel.SampleCheck" 
    object="noit.module.jezebel"/>
  <module loader="lua" name="com.omniti.jezebel.check.jmx" 
    object="noit.module.jezebel"/>
</jezebel>

After restarting noitd, the modules will be available to be used in checks.

Configuring the JMX check
Finally, we are getting to the interesting part, where we add a new check and start pulling some metrics. To add a new check you can either use the interactive console, which I love, or edit the noit.conf file. The JMX module has few configuration values:
  • target - obviously, this is the IP address of the host running the JMX provider
  • port - the port the provider is exposed at
  • username - I'm sure you can guess this one
  • password - ditto
  • mbean_domains - in JMX, each metric is part of a domain and with this parameter, you can limit the domains you want. Specify all the domains you want collected as a space separated list. If this is left empty, all the domains are pulled (which can generate an awful lot of not-so-useful metrics sometimes)

The final check configuration will look like this:
<hadoop module="com.omniti.jezebel.check.jmx" period="60000" >
  <check uuid="5daa2e54-b1fb-4a78-ae53-0f16adc307cf" 
    target="10.1.16.123" name="hadoopNameNode">
    <config>
      <port>8004</port>
      <mbean_domains>hadoop</mbean_domains>
    </config>
  </check>
  <check uuid="d5f14417-63d6-4210-9794-92274b553417" 
    target="10.1.16.123" name="hadoopJobTracker">
    <config>
      <port>8008</port>
      <mbean_domains>hadoop</mbean_domains>
    </config>
  </check>
</hadoop>

This is the configuration we have to get our Hadoop NameNode and JobTracker metrics. 

Saturday, March 19, 2011

DRCP Monitoring with Reconnoiter

This is a second part of my previous article Collecting Oracle metrics with Reconnoiter.

If you are using Database Resident Connection Pooling (DRCP) for connecting your application to Oracle Database (and if you are serious about scalability you should), then you probably know about the (G)V$CPOOL_STATS view, which shows the usage statistics of the pool, such as number of succesful connections made, number of busy connection etc. 


However, these statistics are not available as a time series, because Enterprise Manager does not plot them. Here comes Reconnoiter to the rescue again. Similar to the system stats I showed you in previous article, we can collect the contents of the (G)V$CPOOL_STATS view periodically and plot the values with Reconnoiter.


I've updated the ResmonOracle class and it is available on GitHub. The method to gather the DRCP stats is called getDRCPStats() and will dump contents of the view for all Oracle instances.When the DRCP metrics are collected, they can be graphed in Reconnoiter Web console as any other metrics.



The metrics explained

Collecting numbers and drawing pretty graphs is awesome, but to make any real use we have to understand what the numbers are telling us. Let's take a look at some of the interesting metrics:

  • NUM_OPEN_SERVERS - This is the number of open connections that are in the pool. They can be either currently used or waiting for a connection from a client. New connections are opened when there are none currently available until the connection limit (specified by the MAXSIZE attribute) is reached. The unused connections are closed after a specified timeout if they are no longer used.
  • NUM_BUSY_SERVERS - Number of currenly used server connections. This is always lower than or equal to NUM_OPEN_SERVERS and tells you how many of the currently held connections are being used by clients. If the number of required connections from your application servers is quite low, DRCP will keep the connection live, same as if you were using classic persistent connections. This is fine, but if this happens all the time, it tells you that maybe you do not need DRCP at the moment and you can do fine with classic persistent connections. Even for a busy website with hundreds of requests per second, this number could be quite low, something in the range of 10-20 can be achieved, if you are opening the database connection as late as possible and closing it as soon as possible in your application.
  • NUM_REQUESTS - Shows the total count of requests made. If you make one database connection on every page request, for example, this will be equal to number of page requests. This is a counter value, always increasing, so you have to mark it as such in Reconnoiter. If you set it as a counter, Reconnoiter will plot the value as requests made per second.
  • NUM_MISSES - This metric tells you, how many times DRCP was unable to find a pooled connection to serve the connectin request. This results in a new connection to the database being established, which is costly and can take up to several hundred milliseconds. So obviously you want to avoid that. If this number is high, you are maybe timing out the idle connections too fast, check the INACTIVITY_TIMEOUT configuration attribute.
  • NUM_WAITS - This is probably the most important metric to watch. It tells you, when the request had to wait, because no new connection could be created due to limitation of the pool size. This can have two basic reasons. One is, your MAXSIZE settingis just too low for the load you are serving. However, more often this is caused by your application not closing the connections properly and thus releasing the pooled server for reuse. Make sure you are calling oci_free_statement() for every parsed statement you used and also call oci_close() for every connection. This is of course best handled by your database abstraction library and if you are not using one, you should be. Obviously, you want to keep this metric very near zero.
The beauty of using Reconnoiter for this job is you can correlate these metrics with others you are collecting. For example, if you are collecting number of requests to your HTTP server and this number has similar graph as the number of NUM_REQUESTS, you know everything is fine. But if one graphs spikes and the other does not, you can suspect there is something fishy going on an you can investigate further. Maybe you find out that half of the HTTP requests fail before even establishing the connection to the database or that your developers are making two database connection by mistake instead of one. That is up to you.


    Wednesday, March 9, 2011

    Collecting Oracle metrics with Reconnoiter

    Oracle database makes your life, as an operator, easier by measuring and providing tons of metrics. These metrics range from wait time counters, thru I/O operations and sizes to number of failed logins. Each new version adds even more useful metrics. For example, Oracle 11gR2 collects 628 different metrics.

    These metrics are generally available thru the (G)V$SYSSTAT system view. This view can be queried by standard SQL*Plus or SQL Developer.
    
    SQL> SELECT COUNT(*) FROM V$SYSSTAT;
    
    COUNT(*)              
    ----------------------
    628               
    

    However, to make sense of these numbers, we are usually not interested in absolute numbers, but rather in trends and these trends are best observed thru graphs. Oracle provides the Enterprise Manager and if you purchase the StatsPack option, you can display graphs of some of the metrics. But you might not have the StatsPack option and even if you have, the data for some metrics is not available as graphs and if it is, the metrics are not stored for very long time.

    Enter Reconnoiter

    If you haven't heard of Reconnoiter, it is a new and modern approach to monitoring and trending. Although still in early stages of development, it is very usable open source product. One of the greatest features of Reconnoiter is, that it can be fed data in many different and pretty simple ways. You can write high-performance modules in C or it can be easily extended with Lua modules. Or you can feed it data in Resmon format, which is what we are going to do here.

    Resmon is another open-source project from OmniTI and originally is a system resource monitor written in Perl, that makes all the usual system metrics (such as CPU load, memory, disc space etc.) available via an XML format over HTTP protocol. We are going to use the Resmon XML format, because Reconnoiter understands it out-of-the box and no hacking will be required.

    Resmon PHP classes

    I've prepared a set of PHP classes to generate the Resmon XML data. The Resmon class is a base class and ResmonOracle extends this class to provide our Oracle metrics.

    The source code lives on my GitHub in the php-resmon repo.

    To get the source code, simply clone the repository like this (you must have git installed):

    git clone git://github.com/whizz/php-resmon.git

    You can edit the example-oracle.php file to supply your connection details and give it a try. I recommend you create a special user for this purpose and give it only connection privileges and grant select on the (G)V$SYSTAT view to be secure.

    The example is pretty straightforward:

    <?php

    require_once('Resmon.class.php');
    require_once('ResmonOracle.class.php');

    $resmon = new ResmonOracle('resmon','resmon', 'MYTNSNAME', true);

    $resmon->getSysStat();
    $resmon->outputAsXML();

    This will, when called, produce the XML file which can be fed to Reconnoiter. The output contains a link to the XSL stylesheet, so if you view it in your browser, it should be pretty-printed for you like this:

    Screenshot from browser view of the output XML

    Enabling metric collection

    Now that we have the data source, it's time to register this source in Noit, which is the component of Reconnoiter responsible for data collection. You can add the source in the XML configuratin file noit.conf. I prefer the telnet console to noit, which handles similar to a network appliance.

    Fire up the terminal and let's go:
    
    [mtaborsky@myserver ~]# telnet localhost 32322
    Trying 127.0.0.1...
    Connected to localhost.localdomain (127.0.0.1).
    Escape character is '^]'.
    noitd: (no auth)
    noit# conf t
    noit(conf:/)# cd checks
    noit(conf:/checks)# section databases
    

    We have now created a sub-section in the noit checks to hold our database metric collectors (called checks). Now we are going to create a new check and cofigure it:
    
    noit(conf:/checks/databases)# check new target 10.1.16.18
    noit(conf:[8044f556-09dc-48a8-8f8f-353179a26d76])# attr module resmon
    noit(conf:10.1.16.18`resmon)# attr name OracleTesting
    noit(conf:10.1.16.18`OracleTesting)# conf port 80
    noit(conf:10.1.16.18`OracleTesting)# conf url http://php-resmon.mtaborsky.dev/
    noit(conf:10.1.16.18`OracleTesting)# attr period 60000
    noit(conf:10.1.16.18`OracleTesting)# attr timeout 10000
    noit(conf:10.1.16.18`OracleTesting)# status
    ==== 8044f556-09dc-48a8-8f8f-353179a26d76 ====
     name: OracleTesting
     module: resmon
     target: 10.1.16.18
     period: 60000
     timeout: 10000
     oncheck: [undef]
     filterset: default [inherited from @filterset]
     disable: true
     config::url: http://php-resmon.mtaborsky.dev/
     config::port: 80
     currently: disabled
     last run: never
    

    We have configured the check, given it the URL and port of our webserver with the resmon output (in my example it is http://php-resmon.mtaborsky.dev/). The check will run every 60 seconds and timeout if we do not receive the answer within 10 seconds. The status command shows us, that the check is disabled now, because all checks are automatically disabled by default. Let's enable it then:
    
    noit(conf:10.1.16.18`OracleTesting)# no attr disable
    

    After a little while, the check will run for the first time and the status output will show this:
    
    noit(conf:10.1.16.18`OracleTesting)# status
    ==== 8044f556-09dc-48a8-8f8f-353179a26d76 ====
     name: OracleTesting
     module: resmon
     target: 10.1.16.18
     period: 60000
     timeout: 10000
     oncheck: [undef]
     filterset: default [inherited from @filterset]
     disable: [undef]
     config::port: 80
     config::url: http://php-resmon.mtaborsky.dev/
     currently: idle
     last run: 4.040 seconds ago
     availability/state: available/good
     status: services=3
     metrics:
       Oracle::SysStat`Inst_2`cell smart IO session cache lookups[l] = 0
       Oracle::SysStat`Inst_1`consistent gets from cache[l] = 126600793925
       Oracle::SysStat`Inst_2`DML statements parallelized[l] = 0
       Oracle::SysStat`Inst_1`gc reader bypass grants[l] = 0
       Oracle::SysStat`Inst_2`spare statistic 3[l] = 0
    
    [... about a million lines with metrics deleted ...]
    
       Oracle::SysStat`Inst_2`read-only violation count[l] = 0
    

    We can see, that noit is running the check and collecting data for us. We will leave the console for now and save our configuration to disc before doing so:
    
    noit(conf:10.1.16.18`OracleTesting)# exit
    noit(conf:/checks/databases)# write mem
    4413 bytes written.
    noit(conf:/checks/databases)# exit
    noit# exit
    Connection closed by foreign host.
    

    In order to be able to see something meaningful, we now have to wait for few hours and let noit collect enough data to make some pretty pictures.

    If you prefer editing the XML config file, the check will look like this:

    <check 
      uuid="8044f556-09dc-48a8-8f8f-353179a26d76" 
      module="resmon" 
      name="OracleTesting" 
      period="60000" 
      timeout="10000" 
      target="10.1.16.18">
        <config>
          <port>80</port>
          <url>http://php-resmon.mtaborsky.dev/</url>
        </config>
    </check>


    Making graphs
    It is very easy to produce trending graphs in Reconnoiter web console. Check out this video on how to make them, if you are unfamiliar with that.

    Following example shows a graph constructed from a "redo size" metric on a two-node RAC cluster. The two metrics are stacked, so we can read a total volume of redo generated per second on the monitored system.

    Generated redo size per second, shown for a period of one day


    One of the great features of Reconnoiter is, it never throws away any data. So even if you have several years worth of data, you can still drill down to a five-minute interval anytime in history. And while the volume of data is no doubt huge, the system still performs quite snappy, thanks to using table partitioning in PostgreSQL. As Theo Schlossnagle says in his video linked above, "Buy more disks, they are really cheap."

    For some extra awesomeness, try the real-time data streaming by pressing the PLAY button above the graph. Since our PHP script generates the data on-the-fly, you will be able to see, in real-time, how much redo is being generated on the system. How great is that? Quite a lot!

    You can watch the redo size in real time, the graph is updated every second


    The End
    So, in conclusion. I've shown you, how easy is it to generate consumable metric data from Oracle using PHP and how to pull these metrics into Reconnoiter for trending and monitoring. Check out the code on GitHub and if you have any comments or contributions, please do not hesitate to post them here.

    Tuesday, March 2, 2010

    Hidden errors in PHP affect performance

    I've come across an interesting thing recently, that I didn't think mattered, but, as I'm about to demonstrate, it does.

    So we all know about different levels of error reporting in PHP, right? We all know, that we must pay close attention to E_WARNING and above, and sometimes the occasional E_NOTICE carries some wisdom. (Almost) nobody cares about those annoying E_STRICT messages. In fact, even if you set the E_ALL reporting level, these messages are not displayed or logged, so maybe you even do not know they are there. I sure ignored them myself mostly.

    Well, not anymore. Gather around kids, 'cause I'm about to tell you one thing that just might convince you to turn the E_STRICT reporting on. Even these seemingly insignificant messages negatively affect performance. What surprised me most was, that even if you have these messages suppressed, the penalty is still there.

    But don't take my word for it, let's have some examples.


    Statically calling non-static method


    PHP5 is pretty open minded about the usage of static methods and properties. You can call any method from a class, without instantiating said class. The manual rightly states, that "Calling non-static methods statically generates an E_STRICT level warning." Yeah, who cares, right? Well, you should.

    Try this code:


    $cnt=100000;

    class One {
    public function test1a() { return 1; }
    public static function test1b() { return 1; }
    }

    $t1 = microtime(true);
    for ($i=1; $i<=$cnt; $i++) One::test1a();
    $t2 = microtime(true);
    for ($i=1; $i<=$cnt; $i++) One::test1b();
    $t3 = microtime(true);

    $time1=$t2-$t1;
    $time2=$t3-$t2;

    printf("Test run 1 (bad): %1.3f ms\n", $time1*1000);
    printf("Test run 2 (good) : %1.3f ms\n", $time2*1000);
    printf("Avg. diff.: %1.3f us\n", ($time1-$time2)*1000000/$cnt);


    This simple code runs 100000 times a static call first to a plain public method and then a static call to a statically defined method of the same class. If I run this snippet on my average machine, I get this result:


    Test run 1 (bad): 244.614 ms
    Test run 2 (good) : 88.351 ms
    Avg. diff.: 1.563 us


    There you have it. Each static call of the non-static method costs extra 1.5 microsecond. Now, that might not be much, but if you stick it in a heavy loop, it adds up.

    Accessing an undefined array element

    One of the more annoying notices is the one where PHP tells you, that the array element you are trying to access is not defined. I bet, that is one of the big reasons why many people have these turned off. I mean, what is wrong with just testing like this?


    if ($myarray[1]) {
    do something...
    } else {
    do the other thing
    }


    What is wrong is, this statement will generate the notice. And--I'm sure you know where I am headed with this--it incurs a performance hit. Again, here's the proof.


    <?php

    $cnt=100000;

    function test2a() {
    $a = array();
    $a[0] = 1;
    if ($a[1]) {
    return true;
    } else {
    return false;
    }
    }

    function test2b() {
    $a = array();
    $a[0] = 1;
    if ($a[0]) {
    return true;
    } else {
    return false;
    }
    }

    function test2c() {
    $a[1] = 1;
    if (isset($a[0]) && $a[0]) {
    return true;
    } else {
    return false;
    }
    }

    $t1 = microtime(true);
    for ($i=1; $i<=$cnt; $i++) test2a();
    $t2 = microtime(true);
    for ($i=1; $i<=$cnt; $i++) test2b();
    $t3 = microtime(true);
    for ($i=1; $i<=$cnt; $i++) test2c();
    $t4 = microtime(true);
    $time1=$t2-$t1;
    $time2=$t3-$t2;
    $time3=$t4-$t3;
    printf("Test run 1 : %1.3f ms\n", $time1*1000);
    printf("Test run 2 : %1.3f ms\n", $time2*1000);
    printf("Test run 3 : %1.3f ms\n", $time3*1000);
    printf("Avg. diff. 1-2: %1.3f us\n", ($time1-$time2)*1000000/$cnt);
    printf("Avg. diff. 2-3: %1.3f us\n", ($time2-$time3)*1000000/$cnt);


    The result on the same machine reads:


    Test run 1 : 259.571 ms
    Test run 2 : 160.604 ms
    Test run 3 : 156.587 ms
    Avg. diff. 1-2: 0.990 us
    Avg. diff. 2-3: 0.040 us


    This sample is a bit longer but still simple. The test2a() function reads an element of the array that is not defined. This would generate a notice, if we had them turned on.

    Very similar function test2b() performs the same code but in this case the element is defined (this is a correct scenario and no warning or notice is generated). Hence the shorter runtime.

    The third version of the same function if very careful and checks the existence of the array element before trying to use it. This might seem like an extra work, sometime unnecessary you might even say. The test however proves, that this extra test comes at virtually no cost performance wise and I dare say it's worth it ten times over.

    Conclusion

    I have demonstrated what you might have suspected all along: sloppy coding harms performance. In this case, the ignored warnings and notices and strict messages that you think you sucessfully dealt with when you turned them off, will come back and haunt you. So, my advice is simple. Be brave! Put that error_reporting = E_ALL | E_STRICT in your php.ini and kill all the wee bugs once and for all.

    PS: Obviously, you want to do that on you development box first, but I'm sure I don't have to tell you THAT, right?

    Thursday, July 30, 2009

    Index usage monitoring in Oracle

    Indices (or indexes, if you will) are always good, right? Well, Tom Kyte would disagree with you and so would I. In fact, indices will always impede data manipulation performance, so they should be used only where they are useful in queries. But determining if an index is useful can be sometimes pretty difficult, mainly in a complex application with thousands of different queries, some even dynamically generated.

    In this case, one of the approaches is to create indices on all columns where it possibly may make sense and monitor, if they are actually ever used by the CBO planner. I will show you how to do that. This method was tested in 11g, but should work in 10g as well.

    The magic is done by an ALTER INDEX indexname MONITORING USAGE.

    Starting the monitoring

    You will need an ALTER INDEX privilege on te indices you want to monitor, so preferrably you do it as a SYSDBA. Fire up your SQL*Plus and connect to the database. First, we will create a script to enable the monitoring. Replace the MYUSER with whatever user you want to monitor the indices for.

    set heading off
    set echo off
    set feedback off
    set pages 10000
    spool startmonitor.sql
    select 'alter index '||owner||'.'||index_name
     ||' monitoring usage;'
    from dba_indexes
    where owner ='MYUSER';
    spool off


    Then we simply run the script from within the console.

    SQL> @startmonitor.sql

    You might see some error messages when the script tries to alter a LOB index, which is not possible, just ignore that.

    Now you want to keep your application running for some time. At least for it to perform all kinds of work and jobs it normally does. So for example if you have jobs planned for every hour, leave it running like that for at least an hour. If you have daily jobs, leave it on for a whole day.

    Stop the monitoring

    In the meantime you can prepare a script to turn monitoring off.

    set heading off
    set echo off
    set feedback off
    set pages 10000
    spool stopmonitor.sql
    select 'alter index '||owner||'.'||index_name
     ||' nomonitoring usage;'
    from dba_indexes
    where owner ='MYUSER';
    spool off


    After you reach your desired monitoring period, again, simply run the off-script:

    SQL> @stopmonitor.sql

    Results


    Now, for the results. The index usage overview is stored in the v$object_usage system view. The tricky part here is, that you only see the indices for a currently logged in user. So, even if you were doing the monitoring under a SYS or some other user, now you must login as the user whose indices you were monitoring, in our example MYUSER.

    When you do that, identifying those harmful indices is as simple as this:
    SELECT INDEX_NAME, TABLE_NAME
      FROM V$OBJECT_USAGE
     WHERE USED='NO';


    Some of those indices might be primary keys, so you don't want to be dropping those, but all other unused indices should go. Unless, of course, you know you will need them in a very near future.

    UPDATE:
    You can query the monitoring results for all users too, but you have to build the query yourself from system tables:
    SELECT U.NAME OWNER
    ,      IO.NAME INDEX_NAME
    ,      T.NAME TABLE_NAME
    ,      DECODE(BITAND(I.FLAGS, 65536), 0, 'NO', 'YES') MONITORING
    ,      decode(bitand(ou.flags, 1), 0, 'NO', 'YES') used
    ,      ou.start_monitoring
    ,      ou.end_monitoring
    from
    sys.user$ u
    ,   sys.obj$ io
    ,   sys.obj$ t
    ,   sys.ind$ i
    ,   sys.object_usage ou
    where
    i.obj# = ou.obj#
    and io.obj# = ou.obj#
    AND T.OBJ# = I.BO#
    and u.user# = io.owner#
    

    Wednesday, July 15, 2009

    The hunt for the UNKNOWN SQL in Oracle

    If you are looking at your AWR report or browsing thru performance views in your Oracle Enterprise Manager console, chances are, that some of those SQLs, eating at your CPU or I/O bandwidth are marked as UNKNOWN. This is frustrating, to say the least, but in cases when these mysterious queries are killing your database, it may become a matter of life and death to find out just what does the Oracle do.

    Most commonly, these queries are LOB operations. That is, they represent reading and writing the actual CLOB or BLOB data by the appropriate oci functions. The question then stands: what table? And this is where this little post comes to the rescue.

    We will start with the only thing we are able to get from the report--the SQL ID, which is a hashed fingerprint of our culprit. It looks something like "2v82uk48x55vt". Regular SQL queries can be viewed by querying the system view V$SQL (or GV$SQL in RAC environments).

    SELECT SQL_TEXT
    FROM GV$SQL
    WHERE SQL_ID='2v82uk48x55vt';

    This, however, will not work for those unknown SQLs. The query will come up empty. All hope is not lost yet, though. There are other places to look. One of them is the X$KGLOB table, which is part of the library cache.


    SELECT KGLNAOBJ
    FROM X$KGLOB
    WHERE KGLOBT03 = '2v82uk48x55vt';


    This query will yield a value, which on the first look, looks like gibberish. It will read something like "table_4_9_15f8c_0_0_". On the second look, this string containts one interesting part and that is the fourth underscore separated segment: 15f8c. This, in fact, is the hexadecimal representation of the object_id of the object involved in the operation. Now remains only to query the DBA_OBJECTS view to find out, which table is being operated on.


    SELECT OWNER, OBJECT_NAME, OBJECT_TYPE
    FROM DBA_OBJECTS
    WHERE OBJECT_ID=TO_NUMBER('15f8c','XXXXXXXX');


    And there you have it, the schema and the name of the table, which was causing that unknown SQL in your stats.

    Friday, July 10, 2009

    Performance analysis reports in Oracle

    Sometimes you wonder why your database is slow or you just want to know what those oompa-loompas inside the Oracle black box are up to. One of the tools in your investigation can be the so-called Automatic Workload Repository, which contains an overload of information about what is going on in your database.

    Some information from AWR can be seen inside the Enterprise manager console, but for serious analysis you will want to generate an AWR report in plaintext or more probabaly HTML format.

    The report can be generated by running a SQL*Plus script. Logon to your database server as usual and run a following command:


    sqlplus / as sysdba @$ORACLE_HOME/rdbms/admin/awrrpti.sql


    The script will prompt you for your choices about how you want your report.

    1. First, you must choose an output format. You probably will want a html version, but sometimes Oracle support team will ask you for a plaintext version.
    2. You must enter a database ID. You most likely have only one database and the available choices are listed, so just copy the ten digit DB ID here.
    3. If you run more than one instance, enter the number of the instance you want your report from. Unfortunately, you cannot generate a cumulative report for all your instances in RAC environment, so just pick one.
    4. The script asks you, how far back in days you want to look for AWR snapshots. If you want just the recent activity, enter 1.
    5. The script lists all available snapshots for the number of days specified in the previous step. If you entered 1, it will list just today's snapshots. Most commonly, the snapshot is done every hour. First, you must select a beginning snapshot and in the second step you select the ending snapshot. If you want the report for previous whole hour, enter the last two snapshot numbers, for example 2873 and 2874.
    6. At last, you get prompted for a file name. The default is probably OK for most people, so just hit enter and the system gets to work on your report.
    The generated report HTML file is located in the directory you ran the script from. Transfer the file to your computer or somewhere, where you can access it and open it in your favorite browser.

    Your developers will most likely be interested in the section reporting the most frequently and most costly SQL statements. Your DBA will find interesting information about I/O activity, memory advisory tables etc. etc.

    In a future article, I will try to point out some interesting sections in this report. Until then, self-study time.