Issue Details (XML | Word | Printable)

Key: SFOS-1137
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Steve Loughran
Reporter: Steve Loughran
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
SmartFrog

Hadoop can exit the JVM with a fatal error in the namenode

Created: 06/Mar/09 04:30 PM (GMT)   Updated: 15/May/09 11:03 AM (BST)
Component/s: _service_hadoop
Affects Version/s: 3.17.010
Fix Version/s: 3.17.010

Time Tracking:
Not Specified

Issue Links:
Depends
 

Compatibility: backwards compatible


 Description  « Hide
Some test runs end up with the SF daemon not being around to accept connections.

A likely cause is the Runtime.exit() calls in the FSEditLog lines 400 and 432, where the runtime is exited

   FSNamesystem.LOG.fatal(
          "Fatal Error : All storage directories are inaccessible.");
      Runtime.getRuntime().exit(-1);
    }

 All   Comments   Work Log   Change History      Sort Order: Ascending order - Click to sort in descending order
Steve Loughran added a comment - 06/Mar/09 04:30 PM (GMT)
Looking at the log, that FATAL error message is the last thing printed. I thought we'd stopped this with an intercept of System.exit(); need to test more

[sf-system-test-junit] Testsuite: org.smartfrog.services.hadoop.test.system.local.namenode.BasicNameNodeTest
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO namenode.NameNode : State change: NameNode is now STARTED
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO metrics.RpcMetrics : Initializing RPC Metrics with hostName=ExtNameNode, port=8020
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO namenode.NameNode : Namenode up at: localhost/127.0.0.1:8020
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO jvm.JvmMetrics : Cannot initialize JVM Metrics with processName=NameNode, sessionId=null - already initialized
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO metrics.NameNodeMetrics : Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO metrics.FSNamesystemMetrics : Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO common.Storage : Image file of size 93 saved in 0 seconds.
[sf-startdaemon-debug] 2009/03/06 15:13:46:651 GMT [INFO ][RMI TCP Connection(51)-127.0.0.1] HOST localhost:rootProcess:testLocalNameNodeCompoundTerminates - Checking host:port localhost/127.0.0.1:8021 is open
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO common.Storage : Storage directory /tmp/tempdir26042 has been successfully formatted.
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO common.Storage : Storage directory /tmp/hadoop/dfs/name has been successfully formatted.
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO common.Storage : Number of files = 1
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO common.Storage : Number of files under construction = 0
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO common.Storage : Image file of size 93 loaded in 0 seconds.
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO common.Storage : Edits file /tmp/hadoop/dfs/name/current/edits of size 4 edits # 0 loaded in 0 seconds.
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] INFO common.Storage : Image file of size 93 saved in 0 seconds.
[sf-startdaemon-debug] 2009/03/06 15:13:46:714 GMT [INFO ][TerminatorThread] HOST localhost:rootProcess:testLocalNameNodeCompoundTerminates - Initiating NameNode termination; serviceThread=Thread[NameNode,5,RMI Runtime] service=ExtNameNode instance ExtNameNode instance org.apache.hadoop.hdfs.server.namenode.ExtNameNode@1b3cc96 in state STARTED in state STARTED, IPC localhost/127.0.0.1:8020
[sf-startdaemon-debug] 2009/03/06 15:13:46:715 GMT [INFO ][TerminatorThread] HOST localhost:rootProcess:testLocalNameNodeCompoundTerminates - Terminating deployer thread
[sf-startdaemon-debug] 2009/03/06 15:13:46:715 GMT [INFO ][TerminatorThread] HOST localhost:rootProcess:testLocalNameNodeCompoundTerminates - Requesting thread termination
[sf-startdaemon-debug] 2009/03/06 15:13:46:715 GMT [INFO ][TerminatorThread] HOST localhost:rootProcess:testLocalNameNodeCompoundTerminates - waiting for thread to finish
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] ERROR common.Storage : Cannot write file /tmp/hadoop/dfs/name
[sf-startdaemon-debug] java.nio.channels.ClosedByInterruptException
[sf-startdaemon-debug] at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
[sf-startdaemon-debug] at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:271)
[sf-startdaemon-debug] at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.write(Storage.java:279)
[sf-startdaemon-debug] at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.write(Storage.java:255)
[sf-startdaemon-debug] at org.apache.hadoop.hdfs.server.namenode.FSImage.rollFSImage(FSImage.java:1374)
[sf-startdaemon-debug] at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1075)
[sf-startdaemon-debug] at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:92)
[sf-startdaemon-debug] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:297)
[sf-startdaemon-debug] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:279)
[sf-startdaemon-debug] at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:216)
[sf-startdaemon-debug] at org.apache.hadoop.hdfs.server.namenode.NameNode.innerStart(NameNode.java:311)
[sf-startdaemon-debug] at org.apache.hadoop.util.Service.start(Service.java:186)
[sf-startdaemon-debug] at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.innerDeploy(HadoopServiceImpl.java:504)
[sf-startdaemon-debug] at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.access$000(HadoopServiceImpl.java:54)
[sf-startdaemon-debug] at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl$ServiceDeployerThread.execute(HadoopServiceImpl.java:744)
[sf-startdaemon-debug] at org.smartfrog.sfcore.utils.SmartFrogThread.run(SmartFrogThread.java:279)
[sf-startdaemon-debug] at org.smartfrog.sfcore.utils.WorkflowThread.run(WorkflowThread.java:117)
[sf-startdaemon-debug] 09/03/06 15:13:46 [NameNode] FATAL namenode.FSNamesystem : Fatal Error : All storage directories are inaccessible.
[sf-system-test-junit] Tests run: 3, Failures: 0, Errors: 1, Time elapsed: 65.616 sec
[sf-system-test-junit]
[sf-system-test-junit] Testcase: testNoNameNode took 1.002 sec
[sf-system-test-junit] Testcase: testLocalNameNodeCompoundTerminates took 63.604 sec
[sf-system-test-junit] Caused an ERROR
[sf-system-test-junit] Timeout waiting for a test run to complete
[sf-system-test-junit] Event history has 3 events
[sf-system-test-junit] (unknown) -DeployedEvent at Fri Mar 06 15:13:46 GMT 2009 alive: true
[sf-system-test-junit] (unknown) -StartedEvent at Fri Mar 06 15:13:46 GMT 2009 alive: true
[sf-system-test-junit] (unknown) -TestStartedEvent at Fri Mar 06 15:13:46 GMT 2009 alive: true
[sf-system-test-junit]
[sf-system-test-junit] - time limit: 60000
[sf-system-test-junit] TestTimeoutException:: Timeout waiting for a test run to complete
[sf-system-test-junit] Event history has 3 events
[sf-system-test-junit] (unknown) -DeployedEvent at Fri Mar 06 15:13:46 GMT 2009 alive: true
[sf-system-test-junit] (unknown) -StartedEvent at Fri Mar 06 15:13:46 GMT 2009 alive: true
[sf-system-test-junit] (unknown) -TestStartedEvent at Fri Mar 06 15:13:46 GMT 2009 alive: true
[sf-system-test-junit]
[sf-system-test-junit] - time limit: 60000, SmartFrog 3.17.005dev (2009-03-06 10:36:11 GMT)
[sf-system-test-junit] at org.smartfrog.services.assertions.events.TestEventSink.runTestsToCompletion(TestEventSink.java:449)
[sf-system-test-junit] at org.smartfrog.test.DeployingTestBase.runTestDeployment(DeployingTestBase.java:259)
[sf-system-test-junit] at org.smartfrog.test.DeployingTestBase.completeTestDeployment(DeployingTestBase.java:302)
[sf-system-test-junit] at org.smartfrog.test.DeployingTestBase.runTestsToCompletion(DeployingTestBase.java:338)
[sf-system-test-junit] at org.smartfrog.test.DeployingTestBase.expectSuccessfulTestRunOrSkip(DeployingTestBase.java:439)
[sf-system-test-junit] at org.smartfrog.services.hadoop.test.system.local.namenode.BasicNameNodeTest.testLocalNameNodeCompoundTerminates(BasicNameNodeTest.java:49)
[sf-system-test-junit]
[sf-system-test-junit] Testcase: testLocalNameNodeCompoundIsOffline took 1.002 sec
[sf-system-test-junit] TEST org.smartfrog.services.hadoop.test.system.local.namenode.BasicNameNodeTest FAILED
[sf-system-test-junit] Testsuite: org.smartfrog.services.hadoop.test.system.local.namenode.ForkedNameNodeTest
[sf-system-test-junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.049 sec
[sf-system-test-junit]
[sf-system-test-junit] Testcase: testNameNodeForked took 1.021 sec
[sf-system-test-junit] Caused an ERROR
[sf-system-test-junit] Unable to connect to sfDaemon on: localhost/127.0.0.1.
[sf-system-test-junit] Reason: sfDaemon may not be running on localhost/127.0.0.1

Steve Loughran added a comment - 09/Mar/09 04:26 PM (GMT)
the exit security manager is there but doesn't always get turned on. the hadoop service classes will be tweaked to always attempt to install this if it is not running -but what happens in secure mode?

Steve Loughran added a comment - 10/Mar/09 02:06 PM (GMT)
We aren't blocking system exit calls. We can, but we don't. If we turn it on, RMI refuses to play.

Steve Loughran added a comment - 10/Mar/09 02:21 PM (GMT)
After looking at this, I'm going to stop for a while. Filing as a Hadoop improvement
https://issues.apache.org/jira/browse/HADOOP-5453

Steve Loughran added a comment - 29/Apr/09 03:22 PM (BST)
This is believed to be fixed