Apache Nutch 1.4 – Form authentication [SOLVED]

While searching the internet, I found many people having problems with form-based authentication in the Apache Nutch crawler. Every post I found ended without a solution, so here is one option.

By default, Nutch uses the protocol-http plugin to retrieve pages. The alternative protocol-httpclient plugin supports several HTTP authentication schemes out of the box and (still :() uses HttpClient v3.x. To use this plugin, you need to update the plugin.includes property in conf/nutch-site.xml. Credentials for specific hosts are read from the conf/httpclient-auth.xml file.

A good option is to define an XML element for storing form credentials. I, for example, used:

<credentials username="myUsn" password="myPass">
      <formscope loginPage="httpMethodPageUrl" 
                 className="si.zitnik.pathToClassName"
                 port="portNum" />
</credentials>
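
To illustrate how such an element could be read, here is a minimal sketch using the JDK's DOM parser. The FormCredentials class and its parse method are hypothetical names, not part of Nutch, and the sketch assumes a single <credentials> element as the document root, whereas the real httpclient-auth.xml wraps its entries in a parent element. The port is kept as a string because the example above only shows a placeholder.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical holder for one form-login entry; not part of Nutch. */
final class FormCredentials {
    final String username;
    final String password;
    final String loginPage;
    final String className;
    final String port;

    FormCredentials(String username, String password,
                    String loginPage, String className, String port) {
        this.username = username;
        this.password = password;
        this.loginPage = loginPage;
        this.className = className;
        this.port = port;
    }

    /** Parses a single <credentials> element with one <formscope> child. */
    static FormCredentials parse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse form credentials", e);
        }
        Element creds = doc.getDocumentElement();
        NodeList scopes = creds.getElementsByTagName("formscope");
        if (scopes.getLength() == 0) {
            throw new IllegalArgumentException("Missing <formscope> element");
        }
        Element scope = (Element) scopes.item(0);
        return new FormCredentials(
                creds.getAttribute("username"),
                creds.getAttribute("password"),
                scope.getAttribute("loginPage"),
                scope.getAttribute("className"),
                scope.getAttribute("port"));
    }
}
```

Inside the plugin, the same attribute reads would most likely go next to the code that already handles the other credential types, rather than in a separate parser.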

Then I edited the setCredentials method inside the Http class of the protocol-httpclient plugin to read the new type of credentials. In the resolveCredentials method, I instantiate the class given by className and call its login function (build your preferred set of abstract classes/interfaces to make the procedure as generic as possible). In the plugin, HttpClient uses the BROWSER_COMPATIBILITY cookie policy, so no further changes are needed.
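
The "instantiate the class given by className" step can be sketched with plain reflection. Everything named here (FormLogin, AlwaysOkLogin, FormLoginFactory) is hypothetical and only illustrates the idea; it is not the code from the plugin. In the real plugin, the login method would probably also receive the shared HttpClient instance, so that the session cookies end up in the client the crawler uses.

```java
/** Hypothetical contract for pluggable login handlers; not part of Nutch. */
interface FormLogin {
    /** Returns true if the login succeeded. */
    boolean login(String loginPage, String username, String password);
}

/** Stand-in implementation so the sketch runs without a network. */
class AlwaysOkLogin implements FormLogin {
    @Override
    public boolean login(String loginPage, String username, String password) {
        // A real implementation would POST the form to loginPage here.
        return username != null && !username.isEmpty();
    }
}

final class FormLoginFactory {
    /** Loads the class named in the className attribute and instantiates it. */
    static FormLogin create(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(FormLogin.class)      // rejects classes that are not login handlers
                    .getDeclaredConstructor()         // requires a no-argument constructor
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot instantiate login class " + className, e);
        }
    }
}
```

Keeping the contract in an interface means each site gets its own small login class, while the code that reads the credentials stays unchanged.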

The last step is to write your own login class that accepts the previously read parameters and authenticates against the page. The easiest way is to write it directly inside the protocol-httpclient plugin. (If you want to put it somewhere else, you will need to modify the dependencies in the plugin’s build XML files.)
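
Whatever client sends it, a form login usually comes down to a POST whose body carries the URL-encoded form fields. The sketch below only builds that body with the JDK; FormBody is a hypothetical helper, and the field names "username" and "password" are assumptions that must match the input names in the actual login form. With HttpClient 3.x, the same name/value pairs would instead be added to a PostMethod, which does the encoding itself.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds an application/x-www-form-urlencoded body. */
final class FormBody {
    /** Joins the fields as name=value pairs separated by '&', in insertion order. */
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(urlEncode(field.getKey()))
                .append('=')
                .append(urlEncode(field.getValue()));
        }
        return body.toString();
    }

    private static String urlEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("username", "myUsn");   // assumed form field name
        fields.put("password", "myPass");  // assumed form field name
        System.out.println(encode(fields));
    }
}
```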

After that, enjoy crawling!

4 thoughts on “Apache Nutch 1.4 – Form authentication [SOLVED]”

  1. Denis

    Nice solution 🙂
    I’m a beginner in Java and want to try your solution for Nutch.
    Why is loginPage="httpMethodPageUrl" needed, and could you publish an example of the login class and/or some of the changes?
    Thank you 🙂

  2. ap

    Hi,

    Thank you for your solution. It’s really useful.
    I was just wondering how you sent the HttpResponse back, since that is what the getResponse() function in Http.java returns.
    I am trying to solve the form post authentication.
    I wrote a class with HttpClient 4.1, but that was not useful. So I wrote this:

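    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.apache.commons.codec.binary.Base64;

    URL url = new URL("http://somewebsite/docs/DOC-2264");
    String authStr = "usrname:password";
    // Base64-encode "user:password" for an HTTP Basic Authorization header
    String encodedAuthStr = Base64.encodeBase64String(authStr.getBytes());
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("POST");
    connection.setDoOutput(true);
    connection.setRequestProperty("Authorization", "Basic " + encodedAuthStr);

    // Open the response body for reading
    InputStream content = connection.getInputStream();
    BufferedReader in = new BufferedReader(new InputStreamReader(content));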
