Monday, July 30, 2012

Script to check if AWR snapshot collection has stopped

In one database, after the archive log destination got full, snapshot collection stopped.
There was no way to know this had happened. Only in the MMON trace file I got the following message, for which there is no reference in Metalink:
Unable to schedule a MMON slave at: Auto Flush Main 1
  Slave action has been temporarily suspended
    - Slave action had prior policy violations.
  Unknown return code: 101

So, I had 2 choices. Either to restart the database or to kill MMON and MMNL processes.
I chose the latter.
You will get these in your alert log, when they restart:
Restarting dead background process MMON
Restarting dead background process MMNL
Mon Jul 30 08:18:24 2012
MMON started with pid=18, OS id=10223826 
Mon Jul 30 08:18:24 2012
MMNL started with pid=19, OS id=15204612 

Now snapshot collection is restored.

If you compare the time passed from the last snapshot collected and the interval you have set for snapshot collection, you will find out if the collection is halted.
For instance, if your interval is set to 1 hour (the default) and the last snapshot was collected more than 1 hour ago, then you may assume something is wrong.
The following script will return 0, if [time from last snapshot] < [snapshot interval] and there is no problem and 1 otherwise.
SELECT (CASE WHEN NUMTODSINTERVAL ( (SYSDATE - MAX(CAST(END_INTERVAL_TIME AS DATE))), 'DAY') < SNAP_INTERVAL THEN 0 ELSE 1 END) "RESULT"
FROM DBA_HIST_SNAPSHOT, DBA_HIST_WR_CONTROL
GROUP BY SNAP_INTERVAL;

If you want to create a database user-defined metric to monitor this, you should use a function.
The CASE statement confuses OEM and rejects the query.
First, connect as SYSDBA and grant SELECT on DBA_HIST_SNAPSHOT and DBA_HIST_WR_CONTROL to SYSTEM.
The function is:
CREATE OR REPLACE FUNCTION SYSTEM.CHECKAWR RETURN NUMBER
IS
V_RESULT NUMBER;
BEGIN
SELECT (CASE WHEN NUMTODSINTERVAL ( (SYSDATE - MAX(CAST(END_INTERVAL_TIME AS DATE))), 'DAY') < SNAP_INTERVAL THEN 0 ELSE 1 END) "RESULT" INTO V_RESULT
FROM DBA_HIST_SNAPSHOT, DBA_HIST_WR_CONTROL
GROUP BY SNAP_INTERVAL;

RETURN V_RESULT;
END;
/

Now, in the "SQL Query" box, put:
SELECT SYSTEM.CHECKAWR FROM DUAL; 

2 comments:

  1. Thanks.
    I have exactly the same symptoms, also after temporarily filling archivelog destination.
    Despite killing mmon and mmnl (and seeing them auto-restart) I continue to see the error in mmon trace, even after the kill, and snapshots still do not get created automatically (no problem creating them manually).

    11.2.0.3

    One important note is that I am having to kill -KILL of both processes, no other signal seems to get rid of them.

    ReplyDelete
    Replies
    1. I'm afraid, you have to go with plan B: downtime and bounce the database.
      If you have the option to postpone it and investigate further, there is a detailed Oracle Note:
      Troubleshooting: AWR Snapshot Collection issues [ID 1301503.1]
      Section "Snapshots not completing successfully" may help you to verify if snapshots are taken or not,
      by querying WRM$_SNAPSHOT_DETAILS, not DBA_HIST_SNAPSHOT.

      Delete