Trouvage is a small tool written in Ruby with Hpricot that parses HTML and extracts tag information from it.
I was just toying with Hpricot when I thought I’d rewrite an old script of mine that did much the same thing with the now outdated HTML-Parser. So, here it is. Please be indulgent, I’m not a Ruby coder (yet).
Latest version is available for download here: http://lab.cyprio.net/snap/trouvage.tgz
trouvage [options] <URIs...>
When run without options, trouvage will try to fetch and parse the given URIs, and extract every single link (that is, every a tag with an href attribute) it can find in the stream.
The simple options are as follows:
-a           Turn extracted links into absolute URIs.
-r           Turn extracted links into relative URIs (when possible).
-u           Don't report same link twice.
-R           Send given URI as referer in HTTP headers.
-U <STRING>  Change Trouvage User-Agent in HTTP headers.
-h           Shows usage options
-v           Verbose mode...
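To make the -a and -u options a little more concrete, here is a minimal Ruby sketch of what they are described as doing. It is not Trouvage's actual code: the helper names absolutize and unique_links are made up for illustration, and only Ruby's standard uri library is used.

```ruby
require "uri"

# Hypothetical helper (not taken from Trouvage): resolve each extracted link
# against the page it was found on, which is what -a is described as doing.
def absolutize(base, links)
  links.map { |link| URI.join(base, link).to_s }
end

# Hypothetical helper: keep only the first occurrence of each link,
# which is what -u is described as doing.
def unique_links(links)
  links.uniq
end

page  = "http://ruby-lang.org/en/"
links = ["/en/news/", "downloads/", "/en/news/", "http://rubyforge.org/"]

puts unique_links(absolutize(page, links))
```

Resolving links before removing duplicates means that two different spellings of the same target (one relative, one absolute) collapse into a single line of output.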
By default, Trouvage extracts a tags that have a non-empty href attribute. However, you can change both this tag-attribute tuple and the matching regular expression used to extract it.
-m <RX> Change matching regular expression
/blabla/
/blabla/i
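As a rough illustration of how an argument such as /blabla/i could be turned into a Ruby regular expression, here is a small sketch. The parse_regexp helper is hypothetical, and the sketch assumes the expression is used to filter extracted values, which may not be exactly how Trouvage does it.

```ruby
# Hypothetical parser for the -m argument (an assumption, not Trouvage's code).
def parse_regexp(arg)
  match = arg.match(%r{\A/(.*)/([imx]*)\z}m)
  return Regexp.new(arg) unless match # no slashes: treat as a bare pattern

  flags = 0
  flags |= Regexp::IGNORECASE if match[2].include?("i")
  flags |= Regexp::MULTILINE  if match[2].include?("m")
  flags |= Regexp::EXTENDED   if match[2].include?("x")
  Regexp.new(match[1], flags)
end

# Assumed usage: keep only the extracted values that match the expression.
values = ["/BlaBla/page.html", "/other/page.html"]
puts values.grep(parse_regexp("/blabla/i"))
```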
-t <tag1,attr1,tag2,attr2,...> Change tag-attribute tuples
Trouvage’s default behavior can be changed to extract things other than links. This option lets you specify as many tags and attributes as you want.
For example if one calls:
trouvage http://ruby-lang.org -t img,src
Trouvage will then parse the website looking for img tags with a non-empty src attribute. Now, if for some reason you’d like to extract all links and images from the site, you would call:
trouvage http://ruby-lang.org -t img,src,a,href
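The flat tag1,attr1,tag2,attr2 list reads naturally as a series of pairs. A minimal sketch of splitting it that way in Ruby follows. The parse_tuples name is made up, and rejecting an odd number of items is an assumption about sensible behavior, not something Trouvage is known to do.

```ruby
# Hypothetical parser for the -t argument (not Trouvage's actual code).
def parse_tuples(arg)
  parts = arg.split(",").map(&:strip)
  # Assumption: an empty or odd-length list is treated as a usage error.
  raise ArgumentError, "expected tag,attr pairs" if parts.empty? || parts.size.odd?

  parts.each_slice(2).to_a
end

p parse_tuples("img,src,a,href")
# prints [["img", "src"], ["a", "href"]]
```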
The “inner_html” tag attribute can be used to extract an HTML node’s… inner_html (surprise!). For instance, with a little Hpricot magic, you can fetch the ruby-lang.org “Other News” titles:
trouvage http://ruby-lang.org -t a,inner_html -e '//div[@id=news]//li//a'
Trouvage builds an Hpricot expression by itself when fed tags and attributes. Since this is restrictive, and probably far from perfect, you can specify your own Hpricot expression:
-e <EXPR> Force Hpricot XPath/CSS expression
Using this option does not free you from specifying tags to trouvage: you can’t ask trouvage to fetch href attributes from a tags while specifying an XPath expression dedicated to div tags… OK, sure you can do this, just don’t expect trouvage to output anything useful.
Example use for this could be:
trouvage www.google.com/search?q=foobar -e '//a[@class=l]'
This will run a really simple Google search for “foobar” and extract the first resulting links.
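As for the expression Trouvage builds by itself from tags and attributes, the following is only a guess at the general idea, written as a standalone Ruby sketch. The build_expressions helper and the exact expression shapes are assumptions, and the real tool may well build something different.

```ruby
# Hypothetical: build one XPath-style expression per tag-attribute pair.
def build_expressions(tuples)
  tuples.map do |tag, attr|
    # inner_html is not a real attribute, so only the tag itself is selected.
    attr == "inner_html" ? "//#{tag}" : "//#{tag}[@#{attr}]"
  end
end

p build_expressions([["img", "src"], ["a", "href"], ["a", "inner_html"]])
# prints ["//img[@src]", "//a[@href]", "//a"]
```

Expressions of this kind are what one would otherwise type by hand after -e, which is why a hand-written expression still has to agree with the tags given to -t.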