Extract abends, reporting “Unable to lock file”, or receives no response from the Server/Collector.
Unable to lock file "/h/remis/ggs/dirdat/em000002" (error 13, Permission denied). Lock currently held by process id (PID) nnnnn). Unable to lock file "/opt/oracle/ggs/dirdat/r2000008" (error 11, Resource temporarily unavailable).
The trail files cannot be exclusively locked for writes by the Server/Collector process running on the target. As of v10.4, the Server/Collector locks the trail file to prevent multiple processes from writing to the same trail, so a new Server/Collector process cannot lock a trail file that another process still holds.
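One way to confirm which process still holds the trail open (a sketch only; it assumes the fuser and lsof utilities are available on the target, and the trail path shown is illustrative) is to query the file directly from the shell:
fuser /opt/oracle/ggs/dirdat/r2000008
lsof /opt/oracle/ggs/dirdat/r2000008
Either command reports the PID of the process that has the trail file open, which can then be compared against the PID shown in the error message.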
Network outages that last longer than the time the TCP/IP stack is configured to retransmit unacknowledged packets may result in “orphan” TCP/IP connections on the RMTHOST system. Since the local system has closed the connections and the “RST” packets were lost due to the network outage, no packets (data or “control”) will ever be sent for these connections.
Since the RST packets were not delivered to the RMTHOST, the TCP/IP stack will not present an error to the Server/Collector process. The Server/Collector process will continue to wait, passively, forever, for new data that will never arrive, because the Extract process on the other system is no longer running.
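One way to spot such orphaned collectors on the RMTHOST system (a sketch only; port 7840 is illustrative, so substitute the port your Server/Collector was started with):
ps -ef | grep server
netstat -an | grep 7840
If netstat still shows an ESTABLISHED connection for a collector whose Extract/pump no longer exists on the source system, that connection is orphaned and the collector will wait on it indefinitely.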
A second cause for this symptom is that the remote server was rebooted and the Network-Attached Storage (NAS) device where the target trails reside did not detect and was not notified of the reboot, so the locks acquired prior to the reboot are still considered to be in force.
Here are a few ways to resolve this issue:
1) Investigate why a Server/Collector process is still running when a new Server/Collector process is started to access the same trail. You can kill the orphaned Server/Collector to resolve the immediate issue.
2) You can override the trail locking by using the RMTHOST UNLOCKEDTRAILS option (see the example after this list). Use this option with CAUTION, as it can cause trail corruption. You should still investigate why the trails are locked by another Server/Collector, or kill those Server/Collector processes.
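As a sketch only (the host address and port are illustrative; confirm the exact syntax in the reference guide for your GoldenGate version), the option is added to the RMTHOST parameter in the pump's parameter file:
RMTHOST 192.168.10.1, MGRPORT 7809, UNLOCKEDTRAILS
This disables the write-lock protection introduced in v10.4, so use it only after confirming that no other process is writing to the same trail.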
NOTE that if an Extract pump is stopped normally, the Server/Collector process stops immediately. Otherwise, current versions (11.1/11.2 onwards) have a default timeout of 5 minutes; refer to the reference guide for your version's default. You can override this value using the RMTHOST TIMEOUT option. For example, to set the timeout to 40 seconds:
RMTHOST 192.168.10.1, MGRPORT 7809, PARAMS TIMEOUT 40
This tells the Server/Collector to terminate if it does not receive any checkpoint information for more than 40 seconds. DO NOT set too low a value; TCP/IP communication performance varies throughout the day.
Other notes:
Cluster failover:
When a system fails over to another node, the GoldenGate processes should be stopped, typically by using the ggsci > stop * and > stop mgr commands; however, processes such as Server/Collectors remain running. Stop the Extract pumps manually or kill the processes. Before switching GoldenGate to run on another node, check that no processes are still running from the GoldenGate directory (see the example below).
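A minimal sketch of that check (the installation path /opt/oracle/ggs is illustrative):
ps -ef | grep /opt/oracle/ggs
lsof +D /opt/oracle/ggs
Any remaining extract, replicat, mgr, or server (collector) processes reported here should be stopped or killed before GoldenGate is started on the other node.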
NAS related issue:
In the case where the NAS was unaware that the system had been rebooted, the best long-term solution is to contact the NAS vendor, who might be able to provide a utility program that can be run early in the system startup process to notify the NAS that it should release all locks owned by this system. The following procedure might offer a short-term workaround:
- Stop all REPLICAT processes that read the trail file.
- Stop the target MGR process.
- Copy trail file xx000000 to xx000000.bk
- Delete trail file xx000000.
- Rename (mv) xx000000.bk back to xx000000.
- Repeat the copy, delete, and rename steps for each trail file that cannot be locked (see the consolidated example after this procedure).
- From the shell, kill the server (collector) process that was writing to the trail. That is, check at the OS level for orphaned processes, e.g. on UNIX-style systems:
ps -ef | grep server
If any such orphaned servers exist, e.g.:
oracle 25145 1 0 11:20 ? 00:00:00 ./server -p 7840 -k -l /opt/oracle/gg/ggserr.log
kill 25145 (or, kill -9 25145)
- Start MGR.
- Start the REPLICAT processes.
- Restart the Extract that abended with this error message.
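Put together, the copy/delete/rename part of the workaround might look like the following for a single trail file (the path and trail name are illustrative only, and the target REPLICAT and MGR processes must already be stopped as described above):
cd /opt/oracle/ggs/dirdat
cp xx000000 xx000000.bk
rm xx000000
mv xx000000.bk xx000000
Deleting and recreating the file gives it a new identity (inode/file handle) on the NAS, which is presumably why any stale lock still associated with the old file no longer applies.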
Note that this may not work, depending on the NAS and the way it keeps track of advisory file locks managed with fcntl() (F_SETLK / F_GETLK).