Some more about RAC in details:
1. CRSCTL: Cluster Control utility performs various administrative operation of oracle clusterware. It is located in $ORA_CRS_HOME/bin and must be executed by the “root” user.
a. To check the current state of all oracle clusterware daemon:
[root@PROD1 bin]# ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
b.You can also check the state of individual oracle clusterware daemon:
[root@PROD1 bin]# ./crsctl check cssd
CSS appears healthy
[root@PROD1 bin]# ./crsctl check crsd
CRS appears healthy
[root@PROD1 bin]# ./crsctl check evmd
EVM appears healthy
c. To start oracle clusterware
[root@PROD1 bin]# ./crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
d. To stop oracle clusterware
[root@PROD1 bin]# ./crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
e. To disable oracle clusterware:
[root@PROD1 bin]# ./crsctl disable crs
f. To enable oracle clusterware:
[root@PROD1 bin]# ./crsctl enable crs
g. To get current value of CSS parameter
[root@PROD1 bin]# ./crsctl get css
for example: to get value of misscount parameter
[root@PROD1 bin]# ./crsctl get css misscount
60
h. To set a new value of CSS parameter
[root@PROD1 bin]# ./crsctl set css
for example: to set value of misscount parameter
[root@PROD1 bin]# ./crsctl set css misscount 120
Configuration parameter misscount is now set to 120.
i. To unset CSS parameter value
[root@PROD1 bin]# ./crsctl unset css
for example: to unset value of misscount parameter
[root@PROD1 bin]# ./crsctl unset css misscount
Configuration parameter misscount is now undefined.
j. To list the module for debugging in CSS
[root@PROD2 bin]# ./crsctl lsmodules css
The following are the CSS modules ::
CSSD
COMMCRS
COMMNS
2. CRS_STAT: It reports the current state of resources configured in the OCR.
[oracle@rac1 bin]$ ./crs_stat -t
Name Type Target State Host
———————————————————————————–
ora….C1.inst application ONLINE ONLINE PROD1
ora….C2.inst application ONLINE ONLINE PROD2
ora….AC1.srv application ONLINE ONLINE PROD1
ora.RAC.abc.cs application ONLINE ONLINE PROD1
ora.RAC.db application ONLINE ONLINE PROD2
ora….AC1.srv application ONLINE ONLINE PROD1
ora….ice2.cs application ONLINE ONLINE PROD1
ora….AC1.srv application ONLINE ONLINE PROD1
3. OCRDUMP : It dumps the contents of OCR into a text file.
[root@PROD1 bin]# ./ocrdump /home/oracle/ocr.dmp
4. OCRCHECK : It verifies the integrity of the OCR.
[root@PROD2 bin]# ./ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 2
Total space (kbytes) : 5237072
Used space (kbytes) : 9360
Available space (kbytes) : 5227712
ID : 794527192
Device/File Name : /u01/oracle/oradata/ocr
Device/File integrity check succeeded
Cluster registry integrity check succeeded
7. OCRCONFIG: It perform various administrative operation on the OCR.
Interconnect setup in RAC
The interconnect is a very important part of the cluster environment it is on of the aorta’s of a cluster environment. The interconnect is used as physical layer between the cluster nodes to perform heartbeats as well as the cache fusion is using it. The interconnect must be a private connection. Cross over cable is not support.
In a day to day operation it is proven that when the interconnect is configured correctly the interconnect will not be the bottleneck in case of performance issues. In the rest of this article will be focus on the how to validate the interconnect is really used. An DBA must be able to validate the interconnect settings in case of performance problems. Out of scope is the physical attachment of the interconnect.
Although you should thread performance issues in a Cluster environment the way you would normally also do in no-cluster environments here some area’s you can focus on. Normally the average interconnect latency using gigabit must be < 5ms. Latency around 2ms are normal.
10g and 11g
set linesize 120
col name for a22
col ip_address for a15
select inst_id,name,ip_address,is_public from gv$configured_interconnects order by 1,2;
INST_ID NAME IP_ADDRESS IS_PUBLIC
---------- ---------------------- --------------- ---------
1 en11 145.72.220.10 YES
1 en12 145.72.220.83 NO
2 en11 145.72.220.20 YES
2 en12 145.72.220.84 NO
Used interfaces for the interconnect ?
In 1.1.1 the interfaces available are listed, but which interface is used for the cache fusion part ? And make sure the interconnect is using the correct interface ?
set linesize 120
col name for a22
col ip_address for a15
select inst_id,name,ip_address,is_public from gv$cluster_interconnects order by 1,2;
INST_ID NAME IP_ADDRESS IS_PUBLIC
---------- ---------------------- --------------- ---------
1 en12 145.72.220.83 NO
2 en12 145.72.220.84 NO
9i,10g and 11gOradebug
Before the above queries where available (10g and 11g) you needed to use oradebug to validate if the correct interface was used. This was the way to validate the interconnect usage in Oracle 9i, bust still possible in 10g and 11g as well.
SQL> connect / as sysdba
SQL> alter session set tracefile_identifier=oradebug_interc
SQL> oradebug setmypid
SQL> oradebug ipc
SQL> exit
Now if you open the tracefile, in the bdump location, you can find the IP address used for the interconnect. Here is the result of the above oradebug ipc command.
INSTANCE FAILURE AND RECOVERY (RAC)
Oracle uses a simple mechanism of a heartbeat to detect instance failure of a node in a RAC cluster. The cluster manager software detects if it cannot receive the heartbeat of the other instances and if it does not the recovery process begins. From what I can find, this heartbeat is provided by the udlm package on Sun and the MC/ServiceGuard on HP. Fundamentally, as all instances in a cluster can read all threads of redo another surviving member performs recovery on behalf of the failed instance. This comprises two main steps. Firstly, the Global Resource Directory must be remastered so that all resources that were mastered on the failed node are remastered to the surviving nodes. This remastering is performed by lmon. Secondly, the redo thread of the failed node is read so that changes to blocks on the failed node that were not written to disk (dirty blocks) can be applied to the datafiles. The RAC Concepts Guide outlines the following steps that are followed after node failure has occurred:
1. During the first phase of recovery, which is the GES reconfiguration, Oracle first reconfigures the GES enqueues. Then Oracle reconfigures the GCS resources. During this time, all GCS resource requests and write requests are temporarily suspended. However, processes and transactions can continue to modify data blocks as long as these processes and transactions have already acquired the necessary enqueues.
2. After the reconfiguration of enqueues that the GES controlled, a log read and the remastering of GCS resources occur in parallel. At the end of this step the block resources that need to be recovered have been identified. Only the redo of the dead instances must be read.
3. Buffer space for recovery is allocated and the resources that were identified in the previous reading of the log are claimed as recovery resources. Then, assuming that there are PIs of blocks to be recovered in other caches in the cluster database, resource buffers are requested from other instances. The resource buffers are the starting point of recovery for a particular block.
4. All resources and enqueues required for subsequent processing have been acquired and the Global Resource Directory is now unfrozen. Any data blocks that are not in recovery can now be accessed. Note that the system is already partially available.
5. The cache layer recovers and writes each block identified in step 2, releasing the recovery resources immediately after block recovery so that more blocks become available as cache recovery proceeds.
6. After all blocks have been recovered and the recovery resources have been released, the system is again fully available. Recovered blocks are available after recovery completes.
If all instances in a cluster fail then crash recovery is performed by the first instance in the cluster to start, just as in single instance.
CONCEPTS AND FUNCTIONAL OVERVIEW
RAC is the ability to have two or more instances connect to the same database. These instances reside on separate servers, thereby allowing each instance to make full use of the processing ability of each server. As these instances are accessing the same database they need to be able to communicate with each other, this is done through physical interconnects that join the servers together. All the servers in the configuration make up the cluster.
All instances in a cluster share access to common database resources, this access needs to be coordinated between the instances in order to maintain the overall integrity of the database. In order to coordinate this access RAC databases have a Global Resource Directory. This Global Resource Directory is RAC specific and is not required on single instance systems.
Global Resource Directory(GRD)
Global Resource Directory provides an extra layer of control in order to allow all instances in a cluster to share access to shared database resources. The Global Resource Directory resides in the SGA of each instance in the cluster, with each instance maintaining its own portion. The main role of the Global Resource Directory is to ensure that access and changes to common resources is controlled between the instances in order to maintain the integrity of the database.
Global Enqueue Service(GES) and the Global Cache Service(GCS) maintain the information in the Global Resource Directory. Although the Global Resource Directory is split amongst all instances in the cluster, the Global Cache Service(GCS) and Global Enqueue Service(GES) nominate one instance to manage all information about a particular database resource. This instance is called the resource master. This ownership is periodically checked and changed accordingly between instances, this is done to reduce interconnect traffic and resource acquisition time.
Global Cache Service(GCS)
Global Cache Service is responsible for cache fusion i.e. transmitting data blocks between the instances. The main features of this processing are:
-The lms processes are the Global Cache Service(GCS) background processes in the instance
-Blocks can exist in more than one instance at a time.
-If an instance is requested to transfer a dirty block (a dirty block is a block that has been modified but not yet written to disk) to another instance in exclusive mode it keeps a past image of the block. This past image is basically a copy of the block as the holding instance last knew it before it transferred it to the requesting instance. This past image is used for recovery. Once the most recent copy (the master copy) of the dirty block is written to disk by dbwr the past images can and will be discarded. Note that PI’s can be used for a consistent read of a block as this saves having to build a copy from the rollback segment. The important thing to note is that an instance will always create a PI version of a dirty block before sending it to another instance if the reqeusting instance wants it in exclusive mode. If an instance requests the master block for read (consistent or current) there is no need for the holding instance to keep a PI as the requesting instance is not going to change the block
-The most recent copy (the master copy or current block) of a block contains all changes made to it by transactions, regardless of which instance the change occurred on and whether the transaction(s) has committed or not
-A block is assigned a role and mode and a status (clean or dirty)
-The block is held in a local role if it is only held in one SGA, it is held in a global role if it is held in more than one SGA
-The block can be held in null, shared or exclusive mode. Null mode means that the instance has no access rights on the blocks, shared mode means that the instance can read the block and exclusive mode means that the instance can write to the block. Many instances can have the same block in null or shared mode, but only one can have it in exclusive mode (as exclusive mode implies the instance wants to modify the block). To view the current mode of the block in an instance view v$bh.status. The following applies to the state of the blocks:
Global Enqueue Service(GES)
Global Enqueue Service(GES) manages all non-cache fusion resource requests and tracks the status of all enqueuing mechanisms. The GES only does this for resources that are accessed by more than one instance. The primary resources that the GES controls are dictionary cache locks and library cache locks. The GES manages the interinstance communication that occurs between the instances for these resources. These resources exist in single instance, the difference being that in RAC these must be coordinated between all instances in the cluster.
Dictionary cache locks – The data dictionary must be consistent across all nodes, if a table definition is changed on one instance the Global Enqueue Service ensures that the definition is propagated to the dictionary cache on all the other instances.
Library cache locks – These locks are held by sessions whilst they parse or execute SQL or PLSQL statements. These locks prevent other sessions from modifying the definitions of objects in the data dictionary that are being referenced by the statement that is currently parsing or executing.
BACKGROUND PROCESSES IN RAC
Each instance in a RAC configuration has the same background processes as a single instance database, however there are extra background processes evident in a RAC enabled instance, these are mainly related to the Global Cache Service(GCS) and the Global Enqueue Service(GES):
LMS processes – These are the processes for the Global Cache Service. These processes are responsible for transferring blocks between instances and maintaining the Global Resource Directory to reflect the modes and roles the blocks are held in on each instance. A block can exist in more than one instance at a time, but the Global Cache Service controls who has what version of the block thereby ensuring that the most up to date block (the master copy) is always the one that is updated. All other versions of the block will be past images or read consistent versions of the block.
LMON – Global Enqueue Service Monitor
This process monitors global enqueues and resources across the cluster and performs global enqueue recovery operations. LMON also handles recovery associated with global resources and can detect instance failure of other nodes
LMD – Global Enqueue Service Daemon
This process manages global enqueue and global resource access. Within each instance, the LMD process manages incoming remote resource requests
LCK – This is the lock process and makes up part of the Global Enqueue Service. It manages non-Cache Fusion resource requests such as library and dictionary cache requests
DIAG – This is diagnostic daemon.Captures diagnostic information related to instance process failures.
To view all the background processes (RAC and non-RAC) evident in an instance:
SQL> select * from v$bgprocess where paddr <> ‘00’
STORAGE SYSTEM FOR RAC
In a RAC environment all instances need to be able to write to the same datafiles simultaneously. There are 2 ways to do this, use RAW devices or use a Cluster Filesystem.
Datafiles and tempfiles
All datafiles and tempfiles must reside on shared disks. The first instance to start will verify that it can read all datafiles identified in the controlfile. This must be done so that the first instance to start can determine whether instance or media recovery is required or not, this behaviour is no different to single instance. However, instances that join the cluster at a later date can operate even if they cannot access all the files, they will simply raise an error when an attempt is made to access the file.
Control files
The control files must be on shared disks and must be accessible by all instances at startup time as determined in the parameter file.
Redo log files and archived logs
On RAC each instance must write to its own set of redo logs. This set is called a thread of redo. All threads of redo must reside on shared disks. The instance gets its thread of redo at startup time as determined by the thread parameter. If an instance cannot get its thread of redo it will fail to open. Each redo group will still be uniquely numbered at the database level and will be multiplexed or mirrored, just as in single instance. The only difference is that in RAC each redo group belongs to a thread, and only the instance specifying that thread number at startup time will write to the redo groups in that thread.
Each instance can, however, read all threads of redo. This is to facilitate instance recovery i.e if instance a fails then instance b will read instance a’s thread of redo to recover the failure. This must happen so that the consistency and integrity of the database is maintained if one instance fails. In order to facilitate instance recovery all redo files must reside on shared disks.
Archived log files are generated by each thread of redo and are uniquely identified by the thread number that we include in the log_archive_format and the sequence number which is unique for each instance. Archived_log can be on the local filesystem or shared filesystem.
Sequences
Sequences are held on disk. Even in single instance many DBA’s cache sequence numbers to avoid contention for the sequence. We cache most sequences in RAC to avoid contention on the sequence. If you have a high cumulative wait time in v$enqueue_stat on the SQ enqueue (the sequence number enqueue) then you should consider caching enqueues. RAC does support CACHING and ORDERING of sequence numbers.
Undo management
Undo/rollback datafiles must reside on RAW devices. If you use MANUAL undo then each instance must specify unique rollback segments in the instance specific parameter file. If you use AUTOMATIC undo then each instance must specify a separate tablespace, this tablespace must be available and of type UNDO. All instances in a RAC cluster must run in the same UNDO mode i.e you can’t have one running AUTOMATIC undo and another running MANUAL undo. If you are using AUTOMATIC undo monitor v$undostat for statistics.
To see if the oracle home is RAC enabled issue the following SQL:
select * from dba_registry where comp_id = 'RAC';
To relink an oracle home with RAC disabled or enabled:
cd $ORACLE_HOME/rdbms/lib
make –f ins_rdbms.mk rac_off install OR
make –f ins_rdbms.mk rac_on install
Important RAC wait events
SQL> select event from v$system_event where event like '%global%' order by event;
EVENT
----------------------------------------------------------------
buffer busy global CR
buffer busy global cache
ges global resource directory to be frozen - no
ges global resource directory to be unfrozen - no
global cache busy
global cache cr request
global cache domain validation - no
global cache null to s
global cache null to x
global cache open s
global cache open x
global cache s to x
buffer busy global cache
This wait event falls under the umbrella of ‘global buffer busy events’. This wait event occurs when a user is waiting for a block that is currently held by another session on the same instance and the blocking session is itself waiting on a global cache transfer.
buffer busy global CR
This wait event falls under the umbrella of ‘global buffer busy events’. This wait event occurs when multiple CR requests for the same block are submitted from the same instance before the first request completes, users may queue up behind it
global cache busy
This wait event falls under the umbrella of ‘global buffer busy events’. This wait event means that a user on the local instance attempts to acquire a block globally and a pending acquisition or release is already in progress.
global cache cr request
This wait event falls under the umbrella of ‘global cache events’. This wait event determines that an instance has requested a consistent read version of a block from another instance and is waiting for the block to arrive
global cache null to s and global cache null to x
This wait event falls under the umbrella of ‘global cache events’. These events are waited for when a block was used by an instance, transferred to another instance, and then requested back again.
global cache open s and global cache open x
This wait event falls under the umbrella of ‘global cache events’. These events are used when an instance has to read a block from disk into cache as the block does not exist in any instances cache. High values on these waits may be indicative of a small buffer cache, therefore you may see a low cache hit ratio for your buffer cache at the same time as seeing these wait events.
global cache s to x
This wait event falls under the umbrella of ‘global cache events’. This event occurs when a session converts a block from shared to exclusive mode.
To find locks in RAC
SELECT inst_id,DECODE(request,0,'Holder: ','Waiter: ')||sid sess,
id1, id2, lmode, request, type
FROM GV$LOCK
WHERE (id1, id2, type) IN
(SELECT id1, id2, type FROM gV$LOCK WHERE request>0)
ORDER BY id1, request;
Some Imp points
1. Cache Fusion:Cache Fusion is a new parallel database architecture for exploiting clustered computers to achieve scalability of all types of applications. Cache Fusion is a shared cache architecture that uses high speed low latency interconnects available today on clustered systems to maintain database cache coherency. Database blocks are shipped across the interconnect to the node where access to the data is needed. This is accomplished transparently to the application and users of the system. As Cache Fusion uses at most a 3 point protocol, this means that it easily scales to clusters with a large numbers of nodes
2.The LMD and LMS processes are critical RAC processes that should not be blocked on CPU by queuing up behind other scheduled CPU events
3 v$ges_statistics view returns various statistics on the Global Enqueue Service.
4.gv$lock view will show all the locks held by all the instances
No comments:
Post a Comment