Have you ever had to crawl or scrape a site that required a login? Or maybe one that relied on quite a lot of JavaScript for UI interaction?
Well, it turns out I had to do exactly that a couple of weeks ago: I needed to interact with a site that required login credentials, and the login form used quite a lot of JavaScript to authorize the user. After googling around for a while, I found the Mechanize gem, and after giving it a try I was so happy with it that I decided to write a bit about it. So here it goes…
What is Mechanize?
Simply put, Mechanize is a Ruby gem that allows you to interact with a website from your source code, just as you would with your mouse (well, not exactly like it, but very close to it).
This amazing gem lets you send requests, fetch resources and interact with them by clicking links, submitting forms and the like.
It has some limitations, of course. Since it doesn’t drive a real browser (unlike, say, Capybara with a browser-backed driver), it doesn’t execute JavaScript, so it can’t work with JavaScript-generated HTML, for instance. You can still get part of the way there, though: when a page loads its data through an Ajax request, you can request that URL directly with Mechanize and work with the response yourself.
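To sketch the Ajax case mentioned above: if your browser’s network inspector shows a page pulling its data from a JSON endpoint, you can request that endpoint yourself and parse the body. The helper name `fetch_json` and the headers it sends are my own assumptions for illustration; some endpoints won’t need them, and others will want more.

```ruby
require 'json'

# Hypothetical helper (not part of Mechanize): requests a URL the way an
# Ajax call might and parses the JSON body. `agent` is expected to be a
# Mechanize instance, but anything with a compatible #get will do.
def fetch_json(agent, url)
  headers = {
    'X-Requested-With' => 'XMLHttpRequest', # assumption: some backends check this
    'Accept'           => 'application/json'
  }
  # Mechanize#get takes (uri, parameters, referer, headers)
  response = agent.get(url, [], nil, headers)
  JSON.parse(response.body)
end
```

With a real agent this would be called as `fetch_json(Mechanize.new, 'http://example.com/items.json')`, where the URL is only a placeholder.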
How?
Using the gem and interacting with sites is pretty simple. Let’s look at some examples taken from the gem’s site:
require 'rubygems'
require 'mechanize'

a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

a.get('http://google.com/') do |page|
  search_result = page.form_with(:name => 'f') do |search|
    search.q = 'Hello world'
  end.submit

  search_result.links.each do |link|
    puts link.text
  end
end
The DSL is pretty simple to read. As you can see, in that example the script accesses the Google home page and submits the query “Hello world”. After that, it lists all the links found on the results page.
Here is another example, this time a login action:
require 'rubygems'
require 'mechanize'

a = Mechanize.new
a.get('http://rubyforge.org/') do |page|
  # Click the login link (matched by a regular expression on its text)
  login_page = a.click(page.link_with(:text => /Log In/))

  # Submit the login form
  my_page = login_page.form_with(:action => '/account/login.php') do |f|
    f.form_loginname = ARGV[0]
    f.form_pw = ARGV[1]
  end.click_button

  # Print the text of every non-empty link on the resulting page
  my_page.links.each do |link|
    text = link.text.strip
    next unless text.length > 0
    puts text
  end
end
The best part? Mechanize saves cookies by default. That means that if you keep using the same agent in your code, you can keep browsing the site just as if you were a logged-in user.
Also, note that to access the form fields, you just call accessor methods on the f object. If, for some reason, the names of those fields contain characters that aren’t valid in a Ruby method name, you can still access them like so:
#... code
f['form_loginname'] = ARGV[0]
f['form_pw'] = ARGV[1]
#... more code
If you want to log out, the simplest way is to clear the cookie jar:
a.cookie_jar.clear!
And you’re done. You can keep using the agent, and the site will treat you as a new (logged-out) user.
Inspecting the code
Mechanize uses Nokogiri to parse pages, which lets us query the HTML using XPath expressions or CSS selectors.
Let’s go back to one of the previous examples, but this time using the search method.
require 'rubygems'
require 'mechanize'

a = Mechanize.new
a.get('http://rubyforge.org/') do |page|
  # Click the login link (matched by a regular expression on its text)
  login_page = a.click(page.link_with(:text => /Log In/))

  # Submit the login form
  my_page = login_page.form_with(:action => '/account/login.php') do |f|
    f.form_loginname = ARGV[0]
    f.form_pw = ARGV[1]
  end.click_button

  # search returns a Nokogiri node set, so iterate over it with each
  my_page.search("//a").each do |link|
    puts link.to_s # prints the HTML code of every link on the page
  end
end
To see the full scope of the gem, check out its official page and its examples section.
Installing the gem
If you liked what you’ve read so far and want to give it a try, installation is as simple as you might’ve guessed. Just do the usual:
> gem install mechanize
Pretty simple, huh? Now go ahead and try it out!
Happy coding!