Ruby::Mechanize, interacting with webpages from your source code

Did you ever have to crawl or scrape a site that required login? Or maybe one that had quite a lot of JavaScript for UI interaction?
Well, it turns out I had to do exactly that a couple of weeks ago: I needed to interact with a site that required login credentials, and the login form used quite a lot of JavaScript to authorize the user. After googling around for a while, I found the Mechanize gem, and after giving it a try I was so happy with it that I decided to write a bit about it. So here it goes…

What is Mechanize?

Simply put, Mechanize is a Ruby gem that allows you to interact with a website from your source code, just as you would with your mouse (well, not exactly like it, but very close to it).
This amazing gem lets you send requests, get resources and interact with them by clicking links, submitting forms and the like.

It has some limitations, of course. Since it doesn’t use any kind of headless browser (unlike Capybara), it can’t work with JavaScript-generated HTML, for instance. It can still cope with JavaScript-driven sites to some extent, for example by requesting the URL an Ajax call would hit and reading the response.

How?

Using the gem and interacting with sites is pretty simple. Let’s see some of the examples taken from the site:

require 'rubygems'
require 'mechanize'

a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

a.get('http://google.com/') do |page|
  search_result = page.form_with(:name => 'f') do |search|
    search.q = 'Hello world'
  end.submit

  search_result.links.each do |link|
    puts link.text
  end
end

The DSL is pretty simple to read. As you can see, in that example the script accesses the Google home page and submits the query “Hello world”; after that, it lists all the links found on the results page.

Here is another example, this time a login action:

require 'rubygems'
require 'mechanize'

a = Mechanize.new
a.get('http://rubyforge.org/') do |page|
  # Click the login link
  login_page = a.click(page.link_with(:text => /Log In/))

  # Submit the login form
  my_page = login_page.form_with(:action => '/account/login.php') do |f|
    f.form_loginname  = ARGV[0]
    f.form_pw         = ARGV[1]
  end.click_button

  my_page.links.each do |link|
    text = link.text.strip
    next unless text.length > 0
    puts text
  end
end

The best part? Mechanize saves the cookies by default. That means that if you keep using the same agent in your code, you can keep browsing the site just as if you were a logged-in user.

Also, note that to access the form fields, you just access the attributes of the f object. If for some reason the names of those fields contain characters that aren’t valid in Ruby method names, you can still access them like so:

#... code
    f['form_loginname']  = ARGV[0]
    f['form_pw']         = ARGV[1]
#... more code

If you want to log out, you’ll have to clear the cookie jar. To do that:

a.cookie_jar.clear!

And you’re done. You can now keep using the agent and the site will treat you as a new (logged-out) user.

Inspecting the code

By default, Mechanize integrates with Nokogiri, allowing us to query the HTML using simple XPath syntax or even CSS selectors.

Let’s go back to one of the previous examples, but using the search method.

require 'rubygems'
require 'mechanize'

a = Mechanize.new
a.get('http://rubyforge.org/') do |page|
  # Click the login link
  login_page = a.click(page.link_with(:text => /Log In/))

  # Submit the login form
  my_page = login_page.form_with(:action => '/account/login.php') do |f|
    f.form_loginname  = ARGV[0]
    f.form_pw         = ARGV[1]
  end.click_button

  # search returns the matching nodes; iterate over them with each
  my_page.search("//a").each do |link|
    puts link.to_s # prints the HTML code of every link on the page
  end
end

To check out the entire scope of the gem, have a look at its official page and its examples section.

Installing the gem

If you liked what you’ve read so far and want to give it a try, it’s pretty simple. As you might’ve guessed, just do the usual:

> gem install mechanize

Pretty simple, huh? Now go ahead and try it out!

Happy coding!

