Saturday, March 19, 2011

DRCP Monitoring with Reconnoiter

This is a second part of my previous article Collecting Oracle metrics with Reconnoiter.

If you are using Database Resident Connection Pooling (DRCP) for connecting your application to Oracle Database (and if you are serious about scalability you should), then you probably know about the (G)V$CPOOL_STATS view, which shows the usage statistics of the pool, such as number of succesful connections made, number of busy connection etc. 


However, these statistics are not available as a time series, because Enterprise Manager does not plot them. Here comes Reconnoiter to the rescue again. Similar to the system stats I showed you in previous article, we can collect the contents of the (G)V$CPOOL_STATS view periodically and plot the values with Reconnoiter.


I've updated the ResmonOracle class and it is available on GitHub. The method to gather the DRCP stats is called getDRCPStats() and will dump contents of the view for all Oracle instances.When the DRCP metrics are collected, they can be graphed in Reconnoiter Web console as any other metrics.



The metrics explained

Collecting numbers and drawing pretty graphs is awesome, but to make any real use we have to understand what the numbers are telling us. Let's take a look at some of the interesting metrics:

  • NUM_OPEN_SERVERS - This is the number of open connections that are in the pool. They can be either currently used or waiting for a connection from a client. New connections are opened when there are none currently available until the connection limit (specified by the MAXSIZE attribute) is reached. The unused connections are closed after a specified timeout if they are no longer used.
  • NUM_BUSY_SERVERS - Number of currenly used server connections. This is always lower than or equal to NUM_OPEN_SERVERS and tells you how many of the currently held connections are being used by clients. If the number of required connections from your application servers is quite low, DRCP will keep the connection live, same as if you were using classic persistent connections. This is fine, but if this happens all the time, it tells you that maybe you do not need DRCP at the moment and you can do fine with classic persistent connections. Even for a busy website with hundreds of requests per second, this number could be quite low, something in the range of 10-20 can be achieved, if you are opening the database connection as late as possible and closing it as soon as possible in your application.
  • NUM_REQUESTS - Shows the total count of requests made. If you make one database connection on every page request, for example, this will be equal to number of page requests. This is a counter value, always increasing, so you have to mark it as such in Reconnoiter. If you set it as a counter, Reconnoiter will plot the value as requests made per second.
  • NUM_MISSES - This metric tells you, how many times DRCP was unable to find a pooled connection to serve the connectin request. This results in a new connection to the database being established, which is costly and can take up to several hundred milliseconds. So obviously you want to avoid that. If this number is high, you are maybe timing out the idle connections too fast, check the INACTIVITY_TIMEOUT configuration attribute.
  • NUM_WAITS - This is probably the most important metric to watch. It tells you, when the request had to wait, because no new connection could be created due to limitation of the pool size. This can have two basic reasons. One is, your MAXSIZE settingis just too low for the load you are serving. However, more often this is caused by your application not closing the connections properly and thus releasing the pooled server for reuse. Make sure you are calling oci_free_statement() for every parsed statement you used and also call oci_close() for every connection. This is of course best handled by your database abstraction library and if you are not using one, you should be. Obviously, you want to keep this metric very near zero.
The beauty of using Reconnoiter for this job is you can correlate these metrics with others you are collecting. For example, if you are collecting number of requests to your HTTP server and this number has similar graph as the number of NUM_REQUESTS, you know everything is fine. But if one graphs spikes and the other does not, you can suspect there is something fishy going on an you can investigate further. Maybe you find out that half of the HTTP requests fail before even establishing the connection to the database or that your developers are making two database connection by mistake instead of one. That is up to you.


    0 komentářů:

    Post a Comment