New OEM 12c/13c Agent Install Won’t Keep Running – Dies After Awhile

Many agents - Copyright by Warner Bros.,The Wachowski Brothers used without permission as educational content.
Agents are often hard to kill, usually. Images Copyright by Warner Bros., The Wachowski Brothers; used without permission as educational content.

OEM agents tend to occupy memory based upon how many targets they have to keep track of in a particular host.  At another organization, we tended to spin up VM’s for each instance environment, so at maximum, a particular agent might have a few hundred targets (especially on an e-Business Suite Applications Tier.)  In those circumstances, the default Java memory settings are probably fine.

In this environment, we run our hosts to death, and on this particular proof-of-concept host, we have 43 instances running on it, with variants of 10g, 11g, and 12c databases combined.

We are doing a fresh install of OEM 12.1.0.5.0 for our POC before setting up the 13c production OMS, and after deploying the agent to this particular database host, the agent would startup fine, run for about 20 or so minutes and then abruptly die without warning.

Re-starts fine, passes the usual tests fine (before the 20 minutes or so goes by) and then dies again.

AGENT_INST=/u01/app/oracle/agent12c/agent_inst

cd $AGENT_INST/bin

./emctl status agent

Oracle Enterprise Manager Cloud Control 12c Release 5
Copyright (c) 1996, 2015 Oracle Corporation.  All rights reserved.
—————————————————————
Agent Version          : 12.1.0.5.0
OMS Version            : 12.1.0.5.0
Protocol Version       : 12.1.0.1.0
Agent Home             : /u01/app/oracle/agent12c/agent_inst
Agent Log Directory    : /u01/app/oracle/agent12c/agent_inst/sysman/log
Agent Binaries         : /u01/app/oracle/agent12c/core/12.1.0.5.0
Agent Process ID       : 10598
Parent Process ID      : 10499
Agent URL              : https://itsrv33c.mydomain:3872/emd/main/
Local Agent URL in NAT : https://itsrv33c.mydomain:3872/emd/main/
Repository URL         : https://itsrv35g.mydomain:1159/empbs/upload
Started at             : 2016-11-09 09:57:05
Started by user        : oracle
Operating System       : HP-UX version B.11.31 (IA64W)
Last Reload            : (none)
Last successful upload                       : 2016-11-09 10:19:26
Last attempted upload                        : 2016-11-09 10:19:26
Total Megabytes of XML files uploaded so far : 0.2
Number of XML files pending upload           : 0
Size of XML files pending upload(MB)         : 0
Available disk space on upload filesystem    : 16.92%
Collection Status                            : Collections enabled
Heartbeat Status                             : Ok
Last attempted heartbeat to OMS              : 2016-11-09 10:19:53
Last successful heartbeat to OMS             : 2016-11-09 10:19:53
Next scheduled heartbeat to OMS              : 2016-11-09 10:20:53

—————————————————————
Agent is Running and Ready

./emctl pingOMS

Oracle Enterprise Manager Cloud Control 12c Release 5
Copyright (c) 1996, 2015 Oracle Corporation.  All rights reserved.
—————————————————————
EMD pingOMS completed successfully

$AGENT_INST/sysman/log/gcagent.log contains

—– Wed Nov  9 09:39:43 2016::26900::Agent Launched with PID 27336 at time Wed
Nov  9 09:39:43 2016 —–
—– Wed Nov  9 09:39:43 2016::27336::Time elapsed between Launch of Watchdog p
rocess and execing EMAgent is 34 secs —–
2016-11-09 09:39:44,287 [1:main] WARN – Missing filename for log handler ‘wsm’
2016-11-09 09:39:44,302 [1:main] WARN – Missing filename for log handler ‘opss’
2016-11-09 09:39:44,305 [1:main] WARN – Missing filename for log handler ‘opsscf
g’
Agent is going down due to an OutOfMemoryError
—– Wed Nov  9 09:40:06 2016::26900::Checking status of EMAgent : 27336 —–
—– Wed Nov  9 09:40:06 2016::26900::EMAgent exited at Wed Nov  9 09:40:06 201
6 with return value 57. —–
—– Wed Nov  9 09:40:06 2016::26900::EMAgent will be restarted because of an O
ut of Memory Exception. —–
—– Wed Nov  9 09:40:06 2016::26900::writeAbnormalExitTimestampToAgntStmp: exi
tCause=OOM : restartRequired=1 —–
—– Wed Nov  9 09:40:06 2016::26900::Restarting EMAgent. —–

That means, the agent is starting, then stopping, then restarting, then stopping (aka “thrashing”)

Take a look for the running agent daemon at the OS level:

ps -ef | grep agent12c
oracle 26900     1  0 09:39:09 pts/0     0:00 /u01/app/oracle/agent12c/core/12.1.0.5.0/perl/bin/perl /u01/app/oracle/agent12c/core/12.1.0.5.0/bin/emwd.pl agent /u01/app/oracle/…
oracle 27665 26900  0 09:40:12 pts/0     1:01 /u01/app/oracle/agent12c/core/12.1.0.5.0/jdk/bin/IA64W/java -Xmx169M -XX:MaxPermSize=96M -server -Djava.security.egd=file:///de…

Oh – it’s set up for the default of 169MB of RAM.  Check My Oracle Support.

EM 12c: emctl start agent Fails ‘Fatal agent error: State Manager failed at Startup’ ‘restarted because of an Out of Memory Exception’ Reported in emagent.nohup /gcagent.log (Doc ID 1950490.1)

Verify this setting also in the $AGENT_INST/sysman/log/gcagent.log:

—– Wed Nov  9 09:40:06 2016::26900::Auto tuning the agent at time Wed Nov  9
09:40:06 2016 —–
inMemoryLoggingSize=6291456
_SchedulePersistTimer=30
MaxThreads=10
agentJavaDefines=-Xmx169M -XX:MaxPermSize=96M
SchedulerRandomSpreadMins=5
UploadMaxNumberXML=5000
UploadMaxMegaBytesXML=50.0
Auto tuning was successful

Well, it’s trying. Per the above Doc ID 1950490.1

Stop the agent.

$AGENT_INST/bin/emctl stop agent

Edit the $AGENT_INST/sysman/config/emd.properties   (this contains the runtime parameters for the agent):

old entry:
agentJavaDefines=-Xmx169M -XX:MaxPermSize=96M

new entry:
agentJavaDefines=-Xmx512M -XX:MaxPermSize=96M

(You may tune these values up or down according to your environment requirements)

Restart the agent:

$AGENT_INST/bin/emctl stop agent

Agent runs, and keeps running like the E-Bunny.

R12.2 adop phase fs_clone fails because it can’t copy a file.

Complexity leading to fragility
Complex Systems Integration (Courtesy of: https://www.mulesoft.com/resources/esb/business-process-integration)

Oracle…sometimes blinding in complexity, and a foreshadowing of what happens when a system becomes so complex that it becomes fragile.

This is what the screen session looks like:

adnodemgrctl.sh: check the logfile /u01/oradev/fs1/inst/apps/DEV2_itsrv33m/logs/appl/admin/log/adnodemgrctl.txt for more information …
[UNEXPECTED]fs_clone has failed.
[STATEMENT] Please run adopscanlog utility, using the command
“adopscanlog -latest=yes”
to get the list of the log files along with snippet of the error message corresponding to each log file.
adop exiting with status = 2 (Fail)

Here is what it looks like in the adop.log

[EVENT]     Performing CLONE steps.
[PROCEDURE] [START 2016/11/08 08:20:19] CLONE Patch File System from Run File System
[PROCEDURE] Calling: /u01/oradev/fs2/EBSapps/appl/ad/12.0.0/patch/115/bin/txkADOPPreparePhaseSynchronize.pl
[EVENT]     Log: /u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m
[UNEXPECTED]Error occurred running “perl /u01/oradev/fs2/EBSapps/appl/ad/12.0.0/patch/115/bin/txkADOPPreparePhaseSynchronize.pl -contextfile=/u01/oradev/fs2/inst/apps/DEV2_itsrv33m/appl/admin/DEV2_itsrv33m.xml -patchcontextfile=/u01/oradev/fs1/inst/apps/DEV2_itsrv33m/appl/admin/DEV2_itsrv33m.xml -promptmsg=hide -console=off -mode=create -sync_mode=copy -sessionid=25 -timestamp=20161108_081645 -outdir=/u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m”
[UNEXPECTED]occurred during CLONE Patch File System from Run File System, running command: “perl /u01/oradev/fs2/EBSapps/appl/ad/12.0.0/patch/115/bin/txkADOPPreparePhaseSynchronize.pl -contextfile=/u01/oradev/fs2/inst/apps/DEV2_itsrv33m/appl/admin/DEV2_itsrv33m.xml -patchcontextfile=/u01/oradev/fs1/inst/apps/DEV2_itsrv33m/appl/admin/DEV2_itsrv33m.xml -promptmsg=hide -console=off -mode=create -sync_mode=copy -sessionid=25 -timestamp=20161108_081645 -outdir=/u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m”.
[EVENT]     Releasing managed server ports.
[PROCEDURE] [START 2016/11/08 08:30:55] Releasing ports
[STATEMENT] Calling task: “Releasing ports on itsrv33m”; script: “/u01/oradev/fs2/inst/apps/DEV2_itsrv33m/admin/install/txkCloneAcquirePort.pl”; args: ” -option=close -server=itsrv33m.mwd.h2o -ports=5570,7015,7615,7415,7215,7216,7217,7815,6114,6214,10013 -log=/u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m/txkCloneAcquirePort.log”
[PROCEDURE] [START 2016/11/08 08:30:56] Releasing ports on itsrv33m
[PROCEDURE] Calling: /u01/oradev/fs2/inst/apps/DEV2_itsrv33m/admin/install/txkCloneAcquirePort.pl
[EVENT]     Log: /u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m/txkCloneAcquirePort.log
[PROCEDURE] [END   2016/11/08 08:30:57] Releasing ports on itsrv33m
[PROCEDURE] [END   2016/11/08 08:30:57] Releasing ports

How cryptic!  Looks complicated.  This system happens to have AD/TXK.Delta.8, so the logs related to adop are all being centralized under the session # (in this case “25”.)

Diving deep into the log subdirectories:

/u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m/TXK_SYNC_create/txkADOPPreparePhaseSynchronize.log

================================
Inside copyComnUtil()…
================================

Directory /u01/oradev/fs1/EBSapps/comn already exists.
Removing the directory: /u01/oradev/fs1/EBSapps/comn/util

Copying the directory
———————
SOURCE : /u01/oradev/fs2/EBSapps/comn/util
TARGET : /u01/oradev/fs1/EBSapps/comn/util

Execute SYSTEM command : cp -rH  /u01/oradev/fs2/EBSapps/comn/util /u01/oradev/fs1/EBSapps/comn/util
EXIT STATUS: 1
Copied /u01/oradev/fs2/EBSapps/comn/util into /u01/oradev/fs1/EBSapps/comn/utilLOG FILE: /u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m/TXK_SYNC_create/cp_rH.out

<snip>
======================================
Inside copyCloneLogsToFSNE()…
======================================

Creating the directory: /u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m/TXK_SYNC_create/fsclone_apply

Copying the directory
———————
SOURCE : /u01/oradev/fs2/inst/apps/DEV2_itsrv33m/admin/log/clone
TARGET : /u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m/TXK_SYNC_create/fsclone_apply

/u01/oradev/fs2/EBSapps/comn/adopclone_itsrv33m/bin/adclone.pl did not go through successfully.
LOG DIRECTORY: /u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m/TXK_SYNC_create/fsclone_apply.
*******FATAL ERROR*******
PROGRAM : (/u01/oradev/fs2/EBSapps/appl/ad/12.0.0/patch/115/bin/txkADOPPreparePhaseSynchronize.pl)
TIME    : Tue Nov  8 08:30:55 2016
FUNCTION: main::runFSCloneApply [ Level 1 ]
ERRORMSG: /u01/oradev/fs2/EBSapps/comn/adopclone_itsrv33m/bin/adclone.pl did not go through successfully.

/u01/oradev/fs_ne/EBSapps/log/adop/25/20161108_081645/fs_clone/itsrv33m/TXK_SYNC_create/fsclone_apply/FSCloneApplyAppsTier_11080821.log

(One innocuous, but annoying file, owned by someone other than applmgr, and happened to be 600 (user read/write only):)

Done copying /u01/oradev/fs2/EBSapps/appl/wms to /u01/oradev/fs1/EBSapps/appl/wms/
Copying /u01/oradev/fs2/EBSapps/appl/gmf to /u01/oradev/fs1/EBSapps/appl/gmf/
WARNING:
.end std out.
cp: cannot open /u01/oradev/fs2/EBSapps/appl/mwd/12.0.0/sql/MWDPOLPO.xsl: Permission denied

.end err out.
WARNING: Copying of /u01/oradev/fs2/EBSapps/appl/mwd failed
ERROR: Copying failed, so stopping all the parallel threads
Done copying /u01/oradev/fs2/EBSapps/appl/okc to /u01/oradev/fs1/EBSapps/appl/okc/
WARNING: Copying of /u01/oradev/fs2/EBSapps/appl/okc failed
Done copying /u01/oradev/fs2/EBSapps/comn/java/classes/oracle/apps/pqh to /u01/oradev/fs1/EBSapps/comn/java/classes/oracle/apps/pqh/
WARNING: Copying of /u01/oradev/fs2/EBSapps/comn/java/classes/oracle/apps/pqh failed
Done copying /u01/oradev/fs2/EBSapps/appl/gmf to /u01/oradev/fs1/EBSapps/appl/gmf/
WARNING: Copying of /u01/oradev/fs2/EBSapps/appl/gmf failed
Done copying /u01/oradev/fs2/EBSapps/comn/java/classes/oracle/apps/ies to /u01/oradev/fs1/EBSapps/comn/java/classes/oracle/apps/ies/
WARNING: Copying of /u01/oradev/fs2/EBSapps/comn/java/classes/oracle/apps/ies failed
Done copying /u01/oradev/fs2/EBSapps/appl/msc to /u01/oradev/fs1/EBSapps/appl/msc/
WARNING: Copying of /u01/oradev/fs2/EBSapps/appl/msc failed
Done copying /u01/oradev/fs2/EBSapps/appl/ams to /u01/oradev/fs1/EBSapps/appl/ams/
WARNING: Copying of /u01/oradev/fs2/EBSapps/appl/ams failed
Done copying /u01/oradev/fs2/EBSapps/appl/cs to /u01/oradev/fs1/EBSapps/appl/cs/
WARNING: Copying of /u01/oradev/fs2/EBSapps/appl/cs failed
START: Inside exitClone….
Updating status INCOMPLETE for ApplyApplTop
START: Updating status INCOMPLETE for action ApplyApplTop
END: Updated status INCOMPLETE for action ApplyApplTop

One file.  A single file.  Could have been a hidden file. Could have been anywhere in the entire AppsTier filesystem.  This one file stopped ADOP cold from performing.  This happened to be phase=fs_clone, but could have been a patching session, or a post-clone config (adcfgclone.pl) session.

Remove the file, and:

The fs_clone phase completed successfully.
adop exiting with status = 0 (Success)

Complexity leads to unintentional fragility.