I'm using Ruby and Mechanize to parse a local HTML file but I can't do it. This works if I use a URL though:
agent = Mechanize.new
#THIS WORKS
#url = 'http://www.sample.com/sample.htm'
#page = agent.get(url) #this seems to work just fine but the following below doesn't
#THIS FAILS
file = File.read('/home/user/files/sample.htm') #this is a regular html file
page = Nokogiri::HTML(file)
pp page.body #errors here
page.search('/div[@class="product_name"]').each do |node|
  text = node.text
  puts "product name: " + text.to_s
end
The error is:
/home/user/code/myapp/app/models/program.rb:35:in `main': undefined method `body' for #<Nokogiri::HTML::Document:0x000000011552b0> (NoMethodError)
How do I get a page object so that I can search on it?
Mechanize uses URI strings to point to what it's supposed to parse. Normally we'd use an "http" or "https" scheme to point to a web server, and that's where Mechanize's strengths are, but other schemes are available, including "file", which can be used to load a local file.
I have a little HTML file on my Desktop called "test.html":
Running this code:
Outputs:
Which tells me Mechanize loaded the file, parsed it, then accessed the body.
However, unless you actually need to manipulate forms and/or navigate pages, Mechanize is probably NOT what you want to use. Instead, Nokogiri, which Mechanize uses under the hood, is a better choice for parsing, extracting data, or manipulating the markup, and it's agnostic as to what scheme was used or where the file is actually located:
which then output the same file after parsing it.
Back to your question, how to find the node only using Nokogiri:
Changing test.html to:
and running:
shows that Nokogiri found the node and returned the text.
This code in your sample could be better:
node.text returns a string, so text.to_s is redundant. Simply use text.