Surviving A Month of Dead Connection Detection Zombie Land!

Timeline

August 11th – Switched over to new hardware using a physical standby for our ERP software in about 15 minutes. This was a manual SQLPLUS command-line method because of issues with using Data Guard to complete this task without hanging or connection-related issues (which never show up until switchover time). We are ecstatic at this point because we also implemented huge pages on 64-bit Red Hat Linux, resized the SGA, encrypted filesystems, set the DB_ULTRA_SAFE parameter, moved to a 10Gb NIC, and implemented ASMM all at the same time. Life couldn’t be better. This is Oracle 11.2.0.3… and all SQLNET traffic is encrypted using Native Network Encryption… keep all of this in mind as you follow along on this horror story!

August 15th – Network guys updated Cisco switches and routers with software/firmware upgrades during planned downtime. Life seems normal up to this point.

August 16th – As the DBA, I scan the alert logs every 15 minutes for any of the following keywords: ORA and/or WARNING.

I start getting complaints from end users that our ERP is slow, freezing, etc. At this point we suspect the switchover to new hardware. But we had made so many changes, which one could it be? I see a 10% increase in waits related to the redo logs (log file parallel write). So I moved the redo logs to an unencrypted filesystem area… it made no difference, so I moved them back. I could reset DB_ULTRA_SAFE to improve performance, but that requires a database cycle.

By the way, when mentioning natively encrypted filesystems… this is Oracle’s answer to a Support ticket asking about compatibility.

“This is a 3th party issue, we have our own solution which would be TDE tablespace encryption,
for any 3th party solution to properly work, it must be completely transparent to oracle,
the normal read / write OS calls oracle does must be redirected to the decrypt / encrypt code, it
is possible asynch_io can no longer work and you also may need to set parameter disk_asynch_io = false,
otherwise it is entirely up to the 3th party product being tested and certified to run with oracle
by the 3th party vendor.”

As the university started in full work mode for our FALL semester… performance issues worsened. So I started looking in the alert logs for other errors I wasn’t capturing… now I start seeing a LOT of the 12170 errors… so I modify my script to send those by email when they occur by adding the keyword FATAL.
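
The scan itself is nothing fancy; here is a minimal sketch of the kind of cron-driven check I mean, now including FATAL (the paths, mail recipient, and state file are made-up examples, not our actual setup):

#!/bin/bash
# Hypothetical sketch of the 15-minute alert log scan (cron entry: */15 * * * *).
ALERT_LOG=/u01/app/oracle/diag/rdbms/prod/PROD/trace/alert_PROD.log
STATE=/home/oracle/.alert_scan_offset     # byte offset reached on the previous run
MAILTO=dba@example.edu

last=$(cat "$STATE" 2>/dev/null || echo 0)
size=$(stat -c%s "$ALERT_LOG")
[ "$size" -lt "$last" ] && last=0         # alert log was recreated/rotated

# Only examine lines appended since the last run, then flag the keywords.
hits=$(tail -c +"$((last + 1))" "$ALERT_LOG" | grep -E 'ORA-|WARNING|FATAL')
echo "$size" > "$STATE"

[ -n "$hits" ] && echo "$hits" | mailx -s "Alert log errors" "$MAILTO"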

Errors started to come in groups. I had seen the occasional ORA-3136, which I knew to ignore as an error related to logging in:
WARNING: inbound connection timed out (ORA-3136)
Fatal NI connect error 12170.

August 21 – Sept 5 – This is where I dread coming to work every day. I am spending 10 hours or so trying to figure out what is wrong, and my ulcer also kicks into high gear.

The Oracle Enterprise Manager graphs are looking unusual… huge NETWORK waits, especially for jobs that connect to other databases using database links. See the following for how it appeared to me. BAD, BAD, BAD… for everyone. So now the guessing goes into full swing… and I mean GUESSING, because we don’t have much to go on except lots of spurious ORA errors. By the way, this is an Oracle Forms app… so we don’t have expensive tools to trace sessions all the way back to the database. There is some functionality for tracing, but it only indicated the hanging/freezing sessions were gone from the database after a point in time. What disconnected them?

[Image: NetworkZombie – the Enterprise Manager graph showing the huge network waits]

I surmised the freezing that people experienced with this FORMS app was due to the 30 network retries we have set in the FORMS configuration. We have been using this parameter for many years; it attempts a reconnection up to 30 times before giving up. So this is the freezing, as it takes time to retry 30 times. People end up closing their browser, and that generates the 12170 errors.
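
For context, that retry count lives in the Forms configuration (formsweb.cfg) as the networkRetries parameter; here is a rough sketch of the relevant section, with the section name and form made up for illustration:

# formsweb.cfg -- illustrative section only, not our real config
[ourERPapp]
form=main_menu.fmx
# How many times the Forms client silently retries the connection to its
# runtime process before giving up; with 30 retries a dead session just
# looks "frozen" to the user for quite a while before it finally errors out.
networkRetries=30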

I notice some patterns – these 12170 errors come all at once, in batches, through the alert logs. I finally figured out that the smallest error number in the stack is the important one: 110, which on Linux is errno ETIMEDOUT (connection timed out).

Fatal NI connect error 12170.

VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 – Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 – Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 – Production
Time: 29-AUG-2014 18:14:10
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=

See MOS Note

Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (Doc ID 1628949.1)

We are pretty sure this is network-related/OS-related, but how do we convince the network guys something is wrong? I am digging through MOS, finding everything I can on tracing and the like. I start tracing on the database server, generating huge amounts of logs (50 GB at a time), trying to find error numbers that I haven’t seen. I see a lot of the following, and if you search through MOS you can find information on those errors.
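
For anyone wanting to reproduce the server-side tracing, it is driven by sqlnet.ora on the database server; the settings were roughly along these lines (the directory is just an example, and level 16 is the very verbose support level that produced those 50 GB):

# sqlnet.ora on the database server -- illustrative tracing settings
DIAG_ADR_ENABLED = OFF            # fall back to the classic trace parameters
TRACE_LEVEL_SERVER = 16           # support level: extremely verbose
TRACE_DIRECTORY_SERVER = /u01/app/oracle/network/trace
TRACE_TIMESTAMP_SERVER = ON       # produces the [16-SEP-2014 ...] timestamps below
TRACE_FILELEN_SERVER = 512000     # cap each trace file (KB)...
TRACE_FILENO_SERVER = 10          # ...and cycle through this many of them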

[16-SEP-2014 18:11:01:606] nsfull_pkt_rcv: error exit
[16-SEP-2014 18:11:01:606] nioqer: entry
[16-SEP-2014 18:11:01:606] nioqer:  incoming err = 12151
[16-SEP-2014 18:11:01:606] nioqce: entry
[16-SEP-2014 18:11:01:606] nioqce: exit
[16-SEP-2014 18:11:01:606] nioqer:  returning err = 3113

Finally I find a document that mentions the 12151 and 3113 errors are really just spurious and not the real cause of our problems. Tracing on the database side didn’t really help… basically we determined the sessions were gone at that point, and the traces just verified this. Oracle didn’t know what happened to them, so the lack of information should have been speaking volumes at this point.

Talking with the network guys throughout all of this… they cannot find anything wrong.

We request that they turn off SQL*Net packet inspection as per this MOS doc:

Troubleshooting guide for ORA-12592 / TNS-12592: TNS: bad packet (Doc ID 373431.1)

Still the problems persist… what is wrong? We are so desperate at this point that we start second-guessing everything that was done during the switchover. And this is where we start heading down the road of too many modifications! Guess what… the following list shows what DIDN’T HELP.

1. Switched back to the 1GB NIC

2. Switched hardware load balancers

3. Upgraded the database to 11.2.0.4 because of an Oracle bug

10096945 Waits using DBLINKS & nested loops

We had to install two more patches on top of 11.2.0.4 (plus the CPU) to get rid of some issues associated with that version which produced ORA-904 errors… turning off query rewrite fixed one of the problems, and we also flushed the shared pool.

17956707 ORA-904 executing SQL over a database link 19/May/2014
17551261 ORA-904 “from$_subquery$_003”.<column_name> with query rewrite 21/Feb/2014

4. Moved to 12.1.0.2 listener

5. Installed p17956707_112042_Linux-x86-64.zip patch to fix ORA-904 as a result of 11.2.0.4
Reinstalled recreatectxsyssyncrn.sql – bug in 11.2.0.4
6. Installed/ran OSWatcher on the database server. How does one even start to understand what is produced, especially the network stats?

7. Added USE_NS_PROBES_FOR_DCD=TRUE in sqlnet.ora (this reverts to an 11g type of Dead Connection Detection)
8. Enabled a job to restart services on a different application that was losing its connections as well.
9. Removed one of the three Oracle Forms/Reports servers from the load balancer… the plan was to wipe it and reinstall a fully patched version for redeployment.

10. Ran RDA against the Forms server, to use as a point of contact on this issue. Lots of disconnects showing up in the logs.

11. Was it a recent Java desktop update? Was it the Java version on WebLogic?

12. Results of traces on the database:
Lots of 12151 and 3113 errors, which are spurious in nature
ORA-12547 in server_45148.trc
MOS Notes 1104673.1, 1591874.1, 1300824.1, 1531223.1, 461053.1

13. Recompiled all of the forms, rewrote bad application code

14. Reconfigured/rebooted and tweaked all settings on the LOAD BALANCER

15. OK, this was the BAD thing we did… we modified the Linux operating system TCP keepalive parameters on the database host. We hadn’t ever done this before… we didn’t know what we were doing. We upped it to two hours… and then upped it again to over 8 hours, assuming that was giving a session exclusive access for that time period.
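
To be concrete, the change amounted to something like the following (values reconstructed from memory; the second command is the one that came back to bite us):

# Linux TCP keepalive on the database host -- what we (wrongly) did
sysctl -w net.ipv4.tcp_keepalive_time=7200      # first change: two hours
sysctl -w net.ipv4.tcp_keepalive_time=30000     # then over 8 hours -- the real mistake
# (also written into /etc/sysctl.conf at the time so it survived reboots)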

About this time… the network guys did realize the Network Intrusion Prevention System (IPS) was inspecting the connections between the servers… not the outside traffic, but the protected/firewalled interior traffic. Oops… huge bottleneck, because it was inadequately sized. So they reconfigured it.

Life seemed OK… it was better in some respects. Some of our other applications that connected to the same database returned to normal activity, but the FORMS app was still choking.

Well, my wonderful boss finally decided to start SQLPLUS trace sessions from different vantage points to see what happened and to try to determine what kinds of sessions were affected.

1. SQLPLUS from our desktops

2. SQLDEVELOPER from our desktops

3. SQLPLUS from a server in the same subnet

4. SQLPLUS from other servers in different subnet

5. SQLPLUS from an app server in a different subnet, not on the load balancer

6. SQLPLUS from the app servers still on the load balancer

Connected to the database in question and started the wait… we realized the disconnects seemed to happen somewhere between one hour and two hours. Isn’t that the network? Not necessarily… we were seeing an ORA-3135, which is a completely different error number than anything I had seen.

Finally I get an error number that helps with searching on MOS, and I start finding a lot of better information as it applies to Dead Connection Detection. Thanks to the tracing, I knew our DCD was in place and working!

[16-SEP-2014 15:24:00:384] niotns: Enabling CTO, value=180000 (milliseconds)
[16-SEP-2014 15:24:00:384] niotns: Enabling dead connection detection (10 min)
[16-SEP-2014 15:24:00:384] niotns:  listener bequeathed shadow coming to life…
[16-SEP-2014 15:24:00:384] nsinherit: entry
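
For reference, the DCD that the trace is reporting comes from SQLNET.EXPIRE_TIME in the server-side sqlnet.ora; ours looked roughly like this (the probe-style parameter is the one from item 7 in the list above):

# sqlnet.ora on the database server -- the DCD settings in effect
SQLNET.EXPIRE_TIME = 10           # probe idle sessions every 10 minutes (matches the trace line above)
USE_NS_PROBES_FOR_DCD = TRUE      # use the older NS-packet style of DCD probe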

Ora-03135: Connection Lost Contact After Idle Time (Doc ID 1380261.1)
Troubleshooting ORA-3135/ORA-3136 Connection Timeouts Errors – Database Diagnostics (Doc ID 730066.1)
Troubleshooting ORA-3135 Connection Lost Contact (Doc ID 787354.1)

” Idle Connection Timeout

The most frequently occurring reason for this error is due to a Max Idle Time setting at the firewall.
If the client traverses the firewall to get to the server and is being terminated abruptly, this is likely the cause.

A relatively simple test to determine if this is a Firewall maximum idle time issue:

At the offending client, establish a SQL*Plus or OCI client connection to the server.

SQL*Plus username@TNS connect string
password:

SQL>  <===Allow this client connection to sit idle for an hour.

Return after the hour is up and issue a simple query:

SQL>select * from dual;

If this connection is terminated with ANY error it is likely that your firewall will not allow a connection to remain idle for a lengthy period of time.

It is possible to trick the firewall with Dead Connection Detection packets in order to keep the connection alive.”

WHAT!… I HAD DCD in place all along… it is still pointing at the firewall, but read on, as there is a gotcha or caveat to DCD.

“PLEASE NOTE:
 DCD was never designed to be used as a “virtual traffic generator” as we are wanting to use it for here. This is merely a useful side-effect of the feature.
In fact, some later firewalls and updated firewall firmware may not see DCD packets as a valid traffic possibly because the packets that DCD sends are actually empty packets. Therefore, DCD may not work as expected and the firewall / switch may still terminate TCP sockets that are idle for the period monitored, even when DCD is enabled and working.
In such cases, the firewall timeout should be increased or users should not leave the application idle for longer than the idle time out configured on the firewall.”

At this point we are completely discouraged, as we believe there is no fix… we could implement PROFILES setting an idle timeout on the database so people would have to re-login every hour; at least that would stop the freezing due to the 30 network retries.
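
That fallback would have looked something like this (the profile and user names are invented; RESOURCE_LIMIT must be TRUE for IDLE_TIME to actually be enforced):

-- Hypothetical fallback: snipe idle sessions after 60 minutes
ALTER SYSTEM SET resource_limit = TRUE;
CREATE PROFILE erp_idle_limit LIMIT IDLE_TIME 60;     -- minutes
ALTER USER erp_app_user PROFILE erp_idle_limit;
-- idle sessions then hit ORA-02396 and have to reconnect: ugly, but no more freezing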

Finally, the last clue we needed to figure out how to get rid of the ZOMBIES… as part of a MOS document:

“The firewalls inactivity timer can be disabled for the affected host.

The host OS keep alive setting (tcp_keep_alive) can be modified to be less than the firewall inactivity timeout. This will cause the OS to send a test packet to the client when the timeout is reached and the client will respond with an ACK. To all intents and purposes this is the same as turning off the firewall inactivity timer for this host.”

We had reset tcp_keep_alive to higher than the firewall’s default idle timeout of 1 hour during all of the mods we tried… so we broke it while trying to fix the original connection issue that the IPS caused. Applications that only connected for a few seconds weren’t affected, but all apps that required a sticky session for more than one hour were.
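
The actual fix was simply to put the OS keepalive back below the firewall’s one-hour idle limit, roughly like this (the exact numbers are from memory and may differ slightly from what we finally settled on):

# Linux TCP keepalive on the database host -- the fix
sysctl -w net.ipv4.tcp_keepalive_time=1800      # probe after 30 minutes idle, well inside the 1-hour firewall timeout
sysctl -w net.ipv4.tcp_keepalive_intvl=60       # retry unanswered probes every 60 seconds
sysctl -w net.ipv4.tcp_keepalive_probes=5       # declare the peer dead after 5 failed probes
# persisted in /etc/sysctl.conf so a reboot does not bring the zombies back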

The horror story is now over… but I still have an ulcer; hopefully that will heal in time. Now on to the database upgrade to 12.1.0.2… I see that DCD is handled completely differently in that release, so I expect to experience this ZOMBIE stuff again. But at least I have more knowledge than what I started with.

So the final conclusion: after the network/switch/router upgrades that happened way back when this started, the firewall no longer recognizes Oracle’s DCD packets as valid traffic (they are zero length), but it does recognize the OS keepalive packets (non-zero length).

Errors seen when the firewall is blocking connections (rule-based):
Fatal NI connect error 12514, connecting to:
Fatal NI connect error 12514, connecting to:
Fatal NI connect error 12514, connecting to:


3 Responses to Surviving A Month of Dead Connection Detection Zombie Land!

  1. anonymous says:

    hi !
    did you also try to set ENABLE=BROKEN in your connect-string in the DESCRIPTION Part in your Forms-App ?
    http://docs.oracle.com/cd/E11882_01/java.112/e16548/apxtblsh.htm#CHDBBDDA
    I thought that DCD does not work with JDBC Connections.
    regards anonymous

    • April C Sims says:

      Thanks for the additional information….we have been using DCD with Forms for over 10 years with no problems. I know that other types of applications we have which are deployed war files would fall under that category. Not a JDBC expert though!

  2. That was a LOT OF WORK! I bet you’re not the only one with the ulcer. Yes, DCD wasn’t designed to handle firewall timeouts, so it isn’t a “product bug”!
