Monday, 25 July 2016

Oracle CRS troubleshooting

This post is about CRS 11.1.0.7, but the concepts and fundamentals remain the same in all versions. I am writing it because this is what I faced on 11.1.0.7 after applying a PSU patch and then running root.sh.

Problem Description:

DBAs often face a problem where crs_stat -t (or crsctl stat res -t in 11gR2 or later) gives no output, or CRS doesn't come up after patching, or CRS comes up but doesn't display its registered services. I faced this issue on a 3-node cluster on Linux 5.11. The plan was to upgrade CRS from 11.1.0.7 to 11.2.0.4, and the latest PSU had to be applied on 11.1.0.7 as a prerequisite of the upgrade. The PSU (11724953) was applied successfully, but I got the following errors while running postrootpatch.sh:

./postrootpatch.sh -crshome /grid/app/oracle/product/11.1.0./crs

Checking to see if Oracle CRS stack is already up…

Checking to see if Oracle CRS stack is already starting

Startup will be queued to init within 30 seconds.

/bin/sh: /grid/app/oracle/product/11.1.0./crs/bin/crsctl: Permission denied

/bin/sh: /grid/app/oracle/product/11.1.0./crs/bin/crsctl: Permission denied

/bin/sh: /grid/app/oracle/product/11.1.0./crs/bin/crsctl: Permission denied

/bin/sh: /grid/app/oracle/product/11.1.0./crs/bin/crsctl: Permission denied

/bin/sh: /grid/app/oracle/product/11.1.0./crs/bin/crsctl: Permission denied

/bin/sh: /grid/app/oracle/product/11.1.0./crs/bin/crsctl: Permission denied

/bin/sh: /grid/app/oracle/product/11.1.0./crs/bin/crsctl: Permission denied

/bin/sh: /grid/app/oracle/product/11.1.0./crs/bin/crsctl: Permission denied

Then I did the following steps, as suggested by Oracle Support:

1. Run <CRS_HOME>/install/rootdelete.sh. It removes the init* scripts and puts back the blank inittab.
<CRS_HOME>/install/rootdelete.sh
2. Run <CRS_HOME>/install/rootdeinstall.sh. It blanks out $ORACLE_HOME/cdata/localhost/local.ocr and removes ocr.loc.
<CRS_HOME>/install/rootdeinstall.sh
3. Run <CRS_HOME>/root.sh. CRS should start automatically after this.
<CRS_HOME>/root.sh
4. Confirm that the clusterware has started successfully on the node.
crs_stat -t
5. Only if everything looks OK in step 4, repeat for the next node.
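
The per-node sequence above can be sketched as a small shell function that runs the three scripts in order and stops at the first failure, so that root.sh is never run after a failed cleanup. This is only a sketch: the name reinit_crs_node is hypothetical, the CRS home is passed in as an argument, and it assumes it is run as root on one node at a time.

```shell
#!/bin/sh
# Sketch only: reinit_crs_node is a hypothetical helper, not an Oracle tool.
# $1 is the CRS home (the <CRS_HOME> placeholder used in the steps above).
reinit_crs_node() {
  crs_home=$1
  for step in install/rootdelete.sh install/rootdeinstall.sh root.sh; do
    echo ">> running $crs_home/$step"
    # Stop at the first failing script instead of carrying on blindly.
    "$crs_home/$step" || { echo "FAILED: $step" >&2; return 1; }
  done
  # Step 4: confirm the clusterware answers before touching the next node.
  "$crs_home/bin/crs_stat" -t
}
```

Usage would look like reinit_crs_node followed by the CRS home path, on one node, moving to the next node only if the final crs_stat -t lists the resources.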
But that was no use. Oracle then provided another plan:

Check the permissions of /grid/app/oracle/product
ls -al /grid/app/oracle/product (you will probably see that the oracle user has no permissions, i.e. the mode is likely 700)

Change the permissions of the directory /grid/app/oracle/product to 777, i.e. chmod 777 /grid/app/oracle/product
Rerun the deinstall script
./rootdeinstall.sh
Check the permissions again and make sure the oracle user has access, i.e. the directory is no longer restricted to root (700)
ls -al /grid/app/oracle/product

Now run root.sh again.
But this didn't help. Instead, every time I ran root.sh the server rebooted and never came back up. To bring the server up, I had to boot it in single-user mode, comment out the CRS startup scripts in init.d, and then start the server in normal mode. This was a serious permission issue.

Then I repeated the above steps and ran root.sh in debug mode (sh -x root.sh). It printed a lot of output, but it ended with:

/grid/app/oracle/product/11.1.0./crs/bin/crsctl: Permission denied
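
Since crsctl itself is reported as "Permission denied", it helps to look at the permissions of every directory on the way to the binary, not only at the binary. A minimal sketch (the name show_path_perms is hypothetical) that prints ls -ld for each component of a path, so that a root-only 700 directory in the middle stands out:

```shell
#!/bin/sh
# Sketch only: show_path_perms is a hypothetical helper.
# Prints `ls -ld` for each directory leading to $1, outermost first,
# and finally for $1 itself. The root directory "/" is skipped.
show_path_perms() {
  target=$1
  chain=
  # Walk upwards, prepending each component so the list ends with $1.
  while [ -n "$target" ] && [ "$target" != "/" ] && [ "$target" != "." ]; do
    chain="$target
$chain"
    target=$(dirname "$target")
  done
  printf '%s' "$chain" | while IFS= read -r p; do
    [ -n "$p" ] && ls -ld "$p"
  done
}
```

For this case it would be called as show_path_perms /grid/app/oracle/product/11.1.0./crs/bin/crsctl, and every line should show an owner or mode that lets the oracle user through.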

Then I changed the permissions as follows:

Set the ownership of all directories to oracle, i.e.
chown -R oracle:dba /grid/app/oracle/product/11.1.0./crs
./rootdeinstall.sh
./root.sh
It fixed the issue and brought CRS up on 1 node: ps -ef | grep d.bin started showing the CRS daemons up and running. But ...

The crs_stat command gave the following error:
PRKH-1010 : Unable to communicate with CRS services.
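
The ps -ef | grep d.bin check can be made a little more explicit. The sketch below (crs_daemons_missing is a hypothetical name) reads ps -ef output on standard input and reports which of the three clusterware daemons, crsd.bin, ocssd.bin and evmd.bin, are absent. As the PRKH-1010 error shows, the processes being present only proves that they were started, not that clients can talk to them.

```shell
#!/bin/sh
# Sketch only: crs_daemons_missing is a hypothetical helper.
# Reads `ps -ef` output on stdin; returns 0 if all three daemons appear,
# otherwise prints the missing ones and returns 1.
crs_daemons_missing() {
  ps_out=$(cat)
  missing=
  for d in crsd.bin ocssd.bin evmd.bin; do
    # -F: match the daemon name literally (the dot is not a wildcard).
    printf '%s\n' "$ps_out" | grep -F -q "$d" || missing="$missing $d"
  done
  [ -z "$missing" ] && return 0
  echo "not running:$missing"
  return 1
}
```

It would be used as ps -ef | crs_daemons_missing.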

I checked various log files (cssd.log, evmd.log, crsd.log). There were many error messages, but none of them were clear.

Then I enabled SRVM tracing (trace level 2) and tried to start nodeapps manually on 1 node as follows:

[oracle@ldsfsxs012q ~]$ export SRVM_TRACE=true
[oracle@ldsfsxs012q ~]$ srvctl start nodeapps -n <hostname>

This gave the following output/error:

[main] [10:50:33:911] [OPSCTLDriver.setInternalDebugLevel:173]  tracing is true at level 2 to file null

[main] [10:50:33:911] [OPSCTLDriver.main:116]  SRVCTL arguments : args[0]=start args[1]=nodeapps args[2]=-n args[3]=<nodename>

[main] [10:50:33:918] [OPSCTLDriver.<init>:96]  Security manager is set

[main] [10:50:33:924] [CommandLineParser.parse:193]  parsing cmdline args

[main] [10:50:33:924] [CommandLineParser.parse2WordCommandOptions:981]  parsing 2-word cmdline

[main] [10:50:33:949] [HASContext.getInstance:199]  Module init : 16

[main] [10:50:33:949] [HASContext.getInstance:222]  Local Module init : 19

[main] [10:50:33:949] [HASContext.<init>:92]  moduleInit = 19

[main] [10:50:33:959] [Library.getInstance:106]  Created instance of Library.

[main] [10:50:33:959] [Library.load:206]  Loading libsrvmhas11.so…

[main] [10:50:33:960] [Library.load:212]  oracleHome /ora/app/oracle/product/11.1.0/db_1

[main] [10:50:33:960] [sPlatform.isHybrid:63]  osName=Linux osArch=amd64 JVM=64 rc=false

[main] [10:50:33:960] [Library.load:238]  Loading  library /ora/app/oracle/product/11.1.0/db_1/lib/libsrvmhas11.so

[main] [10:50:33:967] [Library.load:262]  Loaded library /ora/app/oracle/product/11.1.0/db_1/lib/libsrvmhas11.so from path=

/ora/app/oracle/product/11.1.0/db_1/lib

[main] [10:50:33:968] [has.HASContextNative.Native]  prsr_trace: no lsf ctx, line=Native: allocHASContext

[main] [10:50:33:968] [has.HASContextNative.Native]

allocHASContext: Came in

[main] [10:50:33:968] [has.HASContextNative.Native]  allocHASContext: module_init = 19

[main] [10:50:33:968] [has.HASContextNative.Native]

allocHASContext: META context [1]

[main] [10:50:33:969] [has.HASContextNative.Native]

allocHASContext: LSF context [1]

[main] [10:50:33:969] [has.HASContextNative.Native]  prsr_trace: Native: prsr_initCLSS

[main] [10:50:35:617] [has.HASContextNative.Native]  prsr_trace: clsc_connect: (0x2b8e60164920) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_ldsfsxs012q_))

[main] [10:50:35:618] [has.HASContextNative.Native]  prsr_trace: Native: clss error 3

[main] [10:50:35:618] [has.HASContextNative.Native]  prsr_trace: Native: prsr_freeCLSS

[main] [10:50:35:618] [has.HASContextNative.Native]  prsr_trace: prsr_throwException: oracle/ops/mgmt/has/HASContextException[Communications Error–Native: prsr_initCLSS]

oracle.ops.mgmt.cluster.ClusterException: PRKC-1056 : Failed to get the hostname for node <nodename>

PRKH-1010 : Unable to communicate with CRS services.

On the other 2 nodes, CRS was not coming up at all.
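
A trace like the one above is long, and the useful part is the handful of lines near the end: the failed connection to OCSSD and the PRK errors. A small filter, sketched here under the assumption that the trace was saved to a file or is piped in (trace_failures is a hypothetical name), pulls those lines out:

```shell
#!/bin/sh
# Sketch only: trace_failures is a hypothetical helper.
# Reads a saved SRVM trace from the given files, or from stdin if none.
trace_failures() {
  # Keep PRKx-nnnn messages, failed IPC connects, CLSS errors, exceptions.
  grep -E 'PRK[A-Z]-[0-9]+|no listener at|clss error|Exception' "$@"
}
```

With the trace shown in this post, it would keep the "no listener at ... OCSSD_LL" line, the "clss error 3" line, the exception line and the two PRK messages, which together point at CSS not being reachable.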

Solution:

The solution was to reconfigure the voting disk as follows, because the voting disk was corrupted:

./crsctl stop crs -f
./crsctl query css votedisk
./crsctl delete css votedisk /dev/raw/raw4 -force
Successful deletion of voting disk /dev/raw/raw4.

./crsctl add css votedisk /dev/raw/raw4 -force

./crsctl start crs

After this CRS comes up successfully.
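
Because the delete css votedisk step with -force is destructive, it may be worth confirming first that the device being deleted is exactly one that crsctl query css votedisk reports. A hedged sketch: votedisk_listed is a hypothetical helper that reads the query output on standard input and succeeds only if some whitespace-separated field equals the given path. It makes no other assumption about the layout of the output.

```shell
#!/bin/sh
# Sketch only: votedisk_listed is a hypothetical helper.
# Reads `crsctl query css votedisk` output on stdin and returns 0 only if
# one whitespace-separated field is exactly the device path given as $1.
votedisk_listed() {
  disk=$1
  awk -v d="$disk" '
    { for (i = 1; i <= NF; i++) if ($i == d) found = 1 }
    END { exit !found }
  '
}
```

It could be used as ./crsctl query css votedisk | votedisk_listed /dev/raw/raw4, going on to the delete only when that succeeds.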
Then manually add nodeapps for all 3 nodes:

srvctl add nodeapps -n <node1_name>  -A <public address>/<subnet_mask>/<interface_name like bond0 or eth0 etc>

srvctl add nodeapps -n <node2_name>  -A <public address>/<subnet_mask>/<interface_name like bond0 or eth0 etc>

srvctl add nodeapps -n <node3_name>  -A <public address>/<subnet_mask>/<interface_name like bond0 or eth0 etc>

The public IP address, subnet mask and interface name can be found with the “ifconfig -a” command. I am not giving any hostnames or IP addresses in this blog for security reasons.
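
The address/netmask/interface string can be assembled from ifconfig output rather than typed by hand. A minimal sketch, assuming the older net-tools layout ("inet addr:... Mask:...") printed by ifconfig on Linux 5.x. The name nodeapps_addr is hypothetical, and newer ifconfig versions print a different layout that this does not handle.

```shell
#!/bin/sh
# Sketch only: nodeapps_addr is a hypothetical helper.
# Reads `ifconfig -a` output (old net-tools layout) on stdin and prints
# address/netmask/interface for the interface named in $1.
nodeapps_addr() {
  awk -v ifc="$1" '
    /^[^ \t]/ { cur = $1 }                 # unindented line starts a new interface
    cur == ifc && /inet addr:/ {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^addr:/) a = substr($i, 6)   # text after "addr:"
        if ($i ~ /^Mask:/) m = substr($i, 6)   # text after "Mask:"
      }
      print a "/" m "/" ifc
      exit
    }
  '
}
```

For example, ifconfig -a | nodeapps_addr bond0 would print the string for bond0, which can then be checked by eye before it is passed to -A.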

Once this is done, start nodeapps as the root user:

./srvctl start nodeapps -n <node1>

./srvctl start nodeapps -n <node2>

./srvctl start nodeapps -n <node3>

Super!! These came up without any issues.

Then add ASM as follows:

./srvctl add asm -n <node1> -i +ASM1 -o /ora/app/oracle/product/11.1.0/asm

./srvctl add asm -n <node2> -i +ASM2 -o /ora/app/oracle/product/11.1.0/asm

./srvctl add asm -n <node3> -i +ASM3 -o /ora/app/oracle/product/11.1.0/asm

Then start ASM. Similarly, add the database, instances, listeners, etc.
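
The three srvctl add asm commands differ only in the node name and the instance number, so they can be generated. A small sketch (asm_add_commands is a hypothetical name) that prints the commands for review instead of running them, numbering the instances +ASM1, +ASM2 and so on in the order the nodes are given:

```shell
#!/bin/sh
# Sketch only: asm_add_commands is a hypothetical helper.
# $1 is the ASM home; the remaining arguments are node names, in order.
# The commands are only printed, never executed.
asm_add_commands() {
  asm_home=$1
  shift
  i=1
  for node in "$@"; do
    echo "./srvctl add asm -n $node -i +ASM$i -o $asm_home"
    i=$((i + 1))
  done
}
```

Called with the ASM home from this post and the three node names, it prints the same three lines as above, which can be pasted once they look right.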

Problem Solved !!!!!
