Failover and Load balancing in Oracle

saibal's picture
articles: 

Advanced features of Oracle Net include failover and load balancing. These features are actually interrelated in as much as you usually don’t have one without the other. While they are mostly used in a RAC environment, they can be set up in a single instance environment as well.

FAILOVER:
In the context of Oracle Net, failover refers to the mechanism of switching over to an alternate resource when connection to the primary resource gets terminated due to any reason. Connection failure may be broadly categorized as:

  • Those that occur while making the initial connection.
  • Those that occur after a connection has been successfully established.
The first category of connection failure can be dealt with comparatively easily. If your attempt to connect to an instance fails, you can make the connection attempt again, but this time to a backup instance. As long as you have backup instances configured, you can continue trying to connect, until a successful connection is established. This process is technically termed as Connect Time Failover.
The second category of connection failure is of the type that happens after a successful connection has already been established, and subsequently the connection gets terminated.
In such cases the application normally has to handle all the nitty gritty of reconnecting to a backup instance, reestablishing the session environment, and resubmitting any work lost, because of the break in the connection. The technical name for this type of failover mechanism is transparent application failover or TAF for short.

Ok, now let’s have a look at the mechanism of setting up these features. Let’s start with Connect Time Failover. Although Connect Time Failover is a RAC environment mechanism, you can still use it with non-RAC environments, where you have a mechanism like standby database in place. Connect Time Failover can be achieved by the simple expedient of configuring a Net Service Name through which clients may connect to the standby database, whenever they cannot access the primary database.
This can even be achieved with a single net service name provided you configure
a) Multiple listener addresses within a description
b) Multiple descriptions within a description list
The difference between the two is that while the former will use the same connect data for all the listener addresses, that latter may have separate connect data for each configured listener address. Below, I provide an example of both types of configuration:
Multiple Listener addresses within a description:

SAIBAL.GHOSH=
			(DESCRIPTION=
			(ADDRESS_LIST=
			(ADDRESS= (PROTOCOL=TCP) (HOST=TEST_DB) (PORT=1521))
			(ADDRESS= (PROTOCOL=TCP) (HOST=MY_DB) (PORT=1521))
			(FAILOVER= TRUE)
					)
			(CONNECT_DATA=
			(SERVICE_NAME= saibal)
					)
				)

An important issue to be aware of at this point is that Connect Time Failover only works for you, if you are using dynamic registration. This means that this feature won’t work if you have something like this configured in your listener.ora:
SID_LIST_MYLISTENER=
				(SID_LIST=
				(SID_DESC=
				(GLOBAL_DBNAME=sales.us.acme.com)
                                                (ORACLE_HOME=/s01/app/oracle/ora/prod)
				(SID_NAME=sghosh)
					)
				     )

Any reference to Global_dbname and you can forget about Connect Time Failover. By default, failover is enabled when you specify multiple addresses. However, you can disable failover by specifying (FAILOVER=false). When failover is disabled, Oracle Net will attempt to connect using the first address, and if that attempt fails, no further attempts will be made and Oracle Net will generate an error.

Multiple descriptions within a description list:

SAIBAL.GHOSH=
			(DESCRIPTION_LIST=
			(FAILOVER=true)
			(LOAD_BALANCE=false)
			(DESCRIPTION=
			(ADDRESS= (PROTOCOL=TCP) (HOST=TEST_DB) (PORT=1521))
			(CONNECT_DATA=
			(SERVICE_NAME=saibal))
) 
			
			(DESCRIPTION=
			(ADDRESS= (PROTOCOL=TCP) (HOST=MY_DB) (PORT=1521))
			(CONNECT_DATA=
			(SERVICE_NAME= test_saibal))
)
			      )

Notice that in the description list above, I have put (FAILOVER=true) and (LOAD_BALANCE=false). There is no real need to put (FAILOVER=true), as it is the default behavior, however, (LOAD_BALANCE=false) does not represent default behavior, and I have to set it to false to stop client load balancing which is enabled by default, if I am using multiple description lists. When client load balancing is enabled, Oracle Net will randomly choose descriptions from the description list to make a client connection.

Now, let’s look at how TAF works. Unlike connect time failover, which is invoked before the connection is made, TAF comes into play after the connection is made (and then, broken). If the connection is lost while the application is running, Oracle Net will transparently reconnect the application to another instance accessing the same database. The failover is done gracefully and uses failover aware APIs built into OCI.
TAF supports two types of failover: SESSION and SELECT. A SESSION failover connection is not over ambitious. It just fails over to a backup instance. All work in progress at that point are irrevocably lost. SELECT failover is more intricate in as much as it enables some type of read only application to failover without losing the work in progress. If a SELECT statement was in progress at the time of the termination of the connection, then as the connection is reestablished to a backup instance, Oracle Net re-executes the SELECT statement and positions the cursor in a manner that the client can seamlessly continue fetching the rows. But that’s about all that TAF has to offer. It doesn’t have any mechanism to recover DML statements that were in progress, when the failure occurred, or even for SELECT statements, you lose global temporary tables, package states and session settings.
TAF supports two failover methods: BASIC and PRECONNECT. In BASIC method, you connect to the backup instance when the primary connection fails. In the PRECONNECT method, you connect to the backup instance at the same time you connect to the primary instance. This has the obvious benefit of having a backup connection available all of the time, thus reducing the time of ‘failover’. But the downside is that you have to pay the extra ‘cost’ in terms of resources spent, of having a backup connection open all the while.
TAF is configured by adding a FAILOVER_MODE parameter to the CONNECT_DATA parameter for a net service name. Since you cannot configure TAF using the Net Manager, you have to use either OEM or manually edit tnsnames.ora. If you are going to use TAF, then you have to have a backup instance in place. That means you have to configure two net service names—one to connect to the primary instance and the other to connect to the backup instance. The example below shows a TAF configuration where connections to TEST will failover to TEST_BKUP.

TEST=
(DESCRIPTION=
            (ADDRESS= (PROTOCOL=TCP) (HOST=NEW_HOST) (PORT=1521))
				(CONNECT_DATA=
				(SERVICE_NAME=saibal.ghosh)
				(FAILOVER_MODE= (TYPE=SELECT) (METHOD=BASIC) (BACKUP=TEST_BKUP))
                                                      )
                                                  )

		TEST_BKUP =
(DESCRIPTION=
         (ADDRESS= (PROTOCOL=TCP) (HOST=BKUP_HOST) (PORT=1526))
				(CONNECT_DATA=
                                                     (SERVICE_NAME= sbkup.ghosh)
                                                                          )
                                                                 
)

The definition of TEST contains a FAILOVER_MODE entry that specifies the name of the net service name to which a connection is to be made, should the TEST connection happen to fail. In this particular example, I have ‘failed over’ to an entirely different database, not to another instance accessing the same database. I did this to show that TAF can work either way, but if you are failing over to a different database, then you should keep the two databases in sync by using a mechanism like standby database.
Normally, TAF makes only a single attempt to connect to the backup instance. However, if you specify the RETRIES and DELAY parameters, you can force TAF to make multiple connection attempts to the backup instance. The following example shows TAF configured for 10 retries, each at an interval of fifteen seconds.

PROD_BKUP=
(DESCRIPTION=
      (ADDRESS= (PROTOCOL=TCP) (HOST=ss-bkup) (PORT=1521))
			(CONNECT_DATA=
				(SERVICE_NAME=saibal6)
                                                   (FAILOVER_MODE=
 (TYPE=SELECT) (METHOD= BASIC)(BACKUP=saibal6_bkup) (RETRIES= 10)
                                                                                                                                              (DELAY= 15))
                                                                             )
                                                                  
)

RETRIES and DELAY parameters may be gainfully employed where you are using a standby database to failover to and such a database may take a few moments to be brought up.
If we want to have the entire failover mechanism configured, so that we can take advantage of both connect time failover, as well as transparent application failover, we can put something like the following in place:
PROD=
          (DESCRIPTION=
           (ADDRESS=(PROTOCOL=TCP) (HOST=ss01-main) (PORT=1521))
	   (ADDRESS=(PROTOCOL=TCP) (HOST=ss02-main) (PORT=1521))
	    (CONNECT_DATA=
              (SERVICE_NAME=saibal6)
(FAILOVER_MODE= 
(TYPE=SELECT)(METHOD=BASIC)(BACKUP=PROD_BKUP)                                                                                                                                           ) 
)
PROD_BKUP=
    (DESCRIPTION=
     (ADDRESS= (PROTOCOL=TCP) (HOST=ss02-main) (PORT=1521))
        (CONNECT_DATA=                                                
            (SERVICE_NAME=saibal6)
                                                                  )

)
Now, let’s analyze the scenario above. The failover set up is from ss01-main to ss02-main. But a problem will occur where the initial connection is made to ss-02-main because of connect time failover; then we are already connected to the backup mode. If that fails, we have had it, there’s nowhere else to go. And what’s more, if you are thinking of using a description list, with several descriptions, that won’t work either. TAF settings are picked up from the first connect_data entry encountered, so other descriptions in the description list become useless.

LOAD BALANCING
Load balancing may be defined as distributing a job or piece of work over multiple resources. RAC is an ideal environment for distributing load amongst multiple instances accessing the same physical database. Other environments may also suitably configured to invoke load balancing, and in the following few paragraphs I show how load balancing can be set up.
CLIENT LOAD BALANCING: You can configure load balancing either at the client end or at the server end. Client load balancing is configured at the net service name level, and it is done simply by providing multiple addresses in an address list, or multiple descriptions in a description list. When load balancing is enabled, Oracle Net chooses an address to connect to in a random order rather than sequential order. This has the effect of clients connecting through
addresses which are picked up at random and no one address is overloaded. But significantly, there is no guarantee that just because clients are being connected through different addresses picked at random, there is an even distribution of workload at the server end. To do that, you will need to configure load balancing at the server end which is discussed below.
CONNECTION LOAD BALANCING: This feature improves connection performance by allowing the listener distribute new connections to different dispatchers and instances. The listener is in a position to do so because due to dynamic registration, the current load of instances and dispatchers are available with the listener. This allows the listener to balance the load across dispatchers and instances while connecting client connection requests.
Connection load balancing may be done in:

  • Single instance shared server configuration
  • Multiple instance shared server configuration
  • Multiple instance dedicated server connection
Load balancing is done in the following order: In case of dedicated server configuration it is:
  • Least loaded node
  • Least loaded instance
In case of shared server, a listener selects a dispatcher in the following order:
  • Least loaded node
  • Least loaded instance
  • Least loaded dispatcher for that instance
Connection load balancing may also be combined with client load balancing to leverage load balancing activity. While the listener will load-balance connections to dispatchers/instances, client load balancing will distribute load of handling new connections over more than one listener. Add failover at the client end and you will be setting up a robust system, which will insulate the clients from potential connection failures, while at the same time keeping an eye on performance.


Comments

This is an Excellent Technical Write Up from Saibal Ghosh. I was not having any knowledge on this arena. But after going through Saibal's article, I really feel better equipped and confident.

Thanks Saibal. I will look for more such articles from you.

Regards,
Bishwajit

Thanks for the information on Fail over Load Balancing on RAC.when we talk about connection load balancing there are two main thing considered Least loaded node & Least loaded instance.So what factor need to be considered for least loaded node mean CUP,Memory ,IO ,network or else the number of process running(running,sleep,...) does oracle gather info from top,glance utility to get statistician.And when we say Least loaded node what factor are considered by oracle is that like number of sessions(active,inactive..or killed ),SGA component utilization.Can we say load on node is proportional to load on instance ..how does oracle identifies and what process does it consider for balancing.

Thanks
Arun

Is there a way with Oracle to be able to identify which connection is serviced by what listener in a single instance scenario?
For example in the following 3 listener config::

PROD1.world =
(DESCRIPTION = (FAILOVER=on) (LOAD_BALANCE=on)
(ADDRESS = (PROTOCOL = TCP)(HOST = 10.10.10.10)(PORT =1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = 10.10.10.10)(PORT =1522))
(ADDRESS = (PROTOCOL = TCP)(HOST = 10.10.10.10)(PORT =1523))
(CONNECT_DATA = (SID = PROD1)(GLOBAL_NAME = PROD1.world))
)

And also, I have seen it where FAILOVER and LOAD_BALANCE have taken no vs. false as well as on vs. true. Is this the case?
Thanks in advance.

P.S. the article is awesome.
-victor

Hello,

As you see in the following MY_DB_URL there are two hosts for one SID. This is works perfectly for oracle driver. When one of them is gone down, connection establishes with the other.

MY_DB_URL = jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=xx.xx.104.18)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=xx.xx.104.19)(PORT=1521))(FAILOVER=true)(LOAD_BALANCE=false)(CONNECT_DATA=(SERVER=DEDICATED)(SID=SIDNAME_1)))

But i have a problem. Our partner uses Golden Gate product, and they said that they must install two different oracle with two different SID names. (For Example one of them is SIDNAME_1 the other is SIDNAME_2) So i can't use this URL in my application.

Is there a way to define a DB connection URL for two different SID names in a single URL???

Best Regards,

Alper Ozdamar

Hi,

If you have a 3-node cluster and the service is only running on one node, does that mean load balancing is not use even if LOAD_BALANCE=ON is set on the client side?

How do we monitor/prove whether load balance is really working or not? Is there any log that we can check to confirm this?