Issue Details

Key: SFOS-1135
Type: Improvement
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Steve Loughran
Reporter: Steve Loughran
Votes: 0
Watchers: 0
SmartFrog

Move hadoop-cluster components to use dynamically determined hostnames

Created: 03/Mar/09 05:42 PM (GMT)   Updated: 30/Mar/09 05:11 PM (BST)
Component/s: _service_hadoop
Affects Version/s: 3.17.010
Fix Version/s: 3.17.010

Time Tracking:
Original Estimate: 3 hours
Remaining Estimate: 3 hours
Time Spent: Not Specified

Issue Links:
Depends
 

Compatibility: may break builds or test process


Description
With SFOS-1133, we can work out the hostnames as we are deployed; use this to set up all hostnames and URLs.

This implies a switch from early to late binding, which could have some consequences.

Comments
Steve Loughran added a comment - 03/Mar/09 05:58 PM (GMT)
This is breaking tests, especially datanode setup, which isn't binding correctly:

 datanode.DataNode : DatanodeRegistration(127.0.1.1:8042, storageID=DS-1708430574-127.0.1.1-8042-1236102999151, infoPort=8030, ipcPort=50020):

[sf-system-test-junit] SmartFrogRuntimeException:: Failed to copy /home/slo/Projects/SmartFrog/Forge/core/extras/hadoop-cluster/build/test/work/in.txt to /tests/CopyFileInAndOut/in/in.txt on hdfs://morzine.hpl.hp.com, cause: java.net.ConnectException: java.net.ConnectException: Connection refused connecting to /127.0.1.1:8042, SmartFrog 3.17.005dev (2009-03-03 14:55:46 GMT)
[sf-system-test-junit] at org.smartfrog.services.hadoop.common.DfsUtils.copyLocalFileIn(DfsUtils.java:325)
[sf-system-test-junit] at org.smartfrog.services.hadoop.components.dfs.DfsCopyFileInImpl.performDfsOperation(DfsCopyFileInImpl.java:72)
[sf-system-test-junit] at org.smartfrog.services.hadoop.components.dfs.DfsOperationImpl.performDfsOperation(DfsOperationImpl.java:66)
[sf-system-test-junit] at org.smartfrog.services.hadoop.components.dfs.DfsOperationImpl$DfsWorkerThread.execute(DfsOperationImpl.java:115)
[sf-system-test-junit] at org.smartfrog.sfcore.utils.SmartFrogThread.run(SmartFrogThread.java:279)
[sf-system-test-junit] at org.smartfrog.sfcore.utils.WorkflowThread.run(WorkflowThread.java:117)
[sf-system-test-junit] Caused by: java.net.ConnectException: java.net.ConnectException: Connection refused connecting to /127.0.1.1:8042
[sf-system-test-junit] at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:411)
[sf-system-test-junit] at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2709)
[sf-system-test-junit] at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2665)
[sf-system-test-junit] at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1954)
[sf-system-test-junit] at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2141)
[sf-system-test-junit] Caused by: java.net.ConnectException: Connection refused
[sf-system-test-junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
[sf-system-test-junit] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
[sf-system-test-junit] at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
[sf-system-test-junit] at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
[sf-system-test-junit] ... 4 more
[sf-system-test-junit] copy a file in and out the file system
[sf-system-test-junit] succeeded:false
[sf-system-test-junit] forcedTimeout:false
[sf-system-test-junit] skipped:false
[sf-system-test-junit] )
[sf-system-test-junit] Tests run: 2, Failures: 0, Errors: 1, Time elapsed: 54.738 sec
[sf-system-test-junit] Testcase: testHDFS took 17.623 sec
[sf-system-test-junit] Testcase: testFileSystemCopyFileInAndOut took 37.111 sec
[sf-system-test-junit] Caused an ERROR
[sf-system-test-junit] Test failed
[sf-system-test-junit] (unknown) -TestCompletedEvent at Tue Mar 03 17:57:03 GMT 2009 alive: true

Steve Loughran added a comment - 04/Mar/09 05:02 PM (GMT)
The root cause is the Ubuntu /etc/hosts configuration: the datanode was coming up bound to 127.0.0.1, so of course you can't connect via 127.0.1.1. This only shows up once you start using the real hostname in URLs.

This problem will recur on any Ubuntu system with this /etc/hosts configuration.

see http://linux.derkeiler.com/Mailing-Lists/Ubuntu/2007-08/msg00681.html
http://ubuntuforums.org/showthread.php?t=432875
https://lists.ubuntu.com/archives/ubuntu-users/2008-December/168883.html

There's not much to do here, except handle the event better.
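
One way to handle the event better is to check, before any hostnames are published in URLs, whether the local hostname resolves to a loopback address. The following is only a minimal sketch under that assumption: the class `LoopbackHostnameCheck` and its methods `resolve` and `diagnose` are hypothetical names invented for illustration and are not part of SmartFrog or Hadoop. Only the `java.net.InetAddress` calls are real JDK API.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration; not part of SmartFrog or Hadoop. */
public class LoopbackHostnameCheck {

    /** Resolves a hostname or IP literal, rethrowing lookup failures unchecked. */
    public static InetAddress resolve(String host) {
        try {
            return InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Returns a warning message if the resolved address is a loopback address
     * (anything in 127.0.0.0/8, or ::1), or null if it is not.
     */
    public static String diagnose(String hostname, InetAddress resolved) {
        if (!resolved.isLoopbackAddress()) {
            return null;
        }
        return "Hostname " + hostname + " resolves to loopback address "
                + resolved.getHostAddress()
                + "; services advertised under this name may be unreachable."
                + " Check /etc/hosts.";
    }

    public static void main(String[] args) throws UnknownHostException {
        // Ask the JDK what the local hostname resolves to on this machine.
        InetAddress local = InetAddress.getLocalHost();
        String problem = diagnose(local.getHostName(), local);
        System.out.println(problem == null
                ? "OK: " + local + " is not a loopback address"
                : "WARNING: " + problem);
    }
}
```

Run on a machine whose /etc/hosts maps the hostname to 127.0.1.1, `main` would print the warning; a deployment could log it or fail early instead of surfacing later as a `ConnectException`.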

Steve Loughran added a comment - 30/Mar/09 05:11 PM (BST)
We have done more to address this.