
Trouvage

Trouvage: HTML extraction

Trouvage is a small tool written in Ruby with Hpricot that parses HTML and extracts tag information from it.

I was just toying with Hpricot when I thought I’d rewrite an old script of mine that did much the same thing with the now outdated HTML-Parser. So, here it is. Please be indulgent; I’m not a Ruby coder (yet).

The latest version is available for download here: http://lab.cyprio.net/snap/trouvage.tgz

What is this thing?

Overview

  trouvage [options] <URIs...>

When run without options, trouvage will try to fetch and parse the given URIs and extract every single link (that is, every a tag with an href attribute) it can find in the stream.

Basic use

The simple options are as follows:

  -a            Turn extracted links into absolute URIs.
  -r            Turn extracted links into relative URIs (when possible).
  -u            Don't report the same link twice.
  -R            Send the given URI as the referer in HTTP headers.
  -U <STRING>   Change Trouvage's User-Agent in HTTP headers.
  -h            Show usage options.
  -v            Verbose mode.
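
To give an idea of what -a and -u amount to, here is a minimal sketch using only Ruby's standard uri library. The absolutize helper and the sample links are hypothetical and written for illustration; this is not trouvage's actual code.

```ruby
require "uri"

# Hypothetical helper, not taken from trouvage itself.
# Resolves each extracted link against the page URI (roughly what -a does)
# and optionally drops duplicates (roughly what -u does).
def absolutize(base, links, unique: false)
  resolved = links.map { |link| URI.join(base, link).to_s }
  unique ? resolved.uniq : resolved
end

# Sample links as they might appear in href attributes (made up).
links = ["/en/about/", "downloads/", "/en/about/"]
puts absolutize("http://www.ruby-lang.org/en/", links, unique: true)
```

Resolving links before removing duplicates means that two different spellings of the same target (say, a relative and an absolute form) collapse into one entry.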

Weird options

Trouvage’s default behavior is to extract a tags that have a non-empty href attribute. However, you can change both this tag-attribute tuple and the regular expression used to match it.

Change matching expression

  -m <RX>   Change matching regular expression
    /blabla/
    /blabla/i
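
As an illustration of how a /blabla/i style argument could be handled, the sketch below turns such a string into a Ruby Regexp and filters a list of values with it. The parse_rx helper and the sample URLs are assumptions made for this example; how trouvage really parses -m is not shown here.

```ruby
# Hypothetical helper: turn a "/pattern/flags" string into a Regexp.
# A string without surrounding slashes is treated as a bare pattern.
def parse_rx(text)
  m = %r{\A/(.*)/([imx]*)\z}m.match(text)
  return Regexp.new(text) unless m

  flags = 0
  flags |= Regexp::IGNORECASE if m[2].include?("i")
  flags |= Regexp::MULTILINE  if m[2].include?("m")
  flags |= Regexp::EXTENDED   if m[2].include?("x")
  Regexp.new(m[1], flags)
end

# Made-up attribute values to filter.
values = [
  "http://example.org/BlaBla.html",
  "http://example.org/blabla.png",
  "http://example.org/other"
]

puts values.grep(parse_rx("/blabla/"))   # case-sensitive match
puts values.grep(parse_rx("/blabla/i"))  # case-insensitive match
```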

Change tag-attribute tuples

    -t <tag1,attr1,tag2,attr2,...>  Change tag-attribute tuples

Trouvage’s default behavior can be changed to extract things other than links. This option lets you specify as many tag-attribute pairs as you want.

For example if one calls:

    trouvage http://ruby-lang.org -t img,src

Trouvage will then parse the website looking for img tags with a non-empty src attribute. Now, if for some reason you’d like to extract all links and images from the site, you would call:

    trouvage http://ruby-lang.org -t img,src,a,href
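
The flat tag1,attr1,tag2,attr2 list reads naturally as a sequence of pairs. A minimal sketch of that pairing, assuming a hypothetical parse_tuples helper (trouvage's own parsing may differ):

```ruby
# Hypothetical helper: split "tag1,attr1,tag2,attr2" into [tag, attr] pairs.
# Raises if the list is empty or a tag is left without an attribute.
def parse_tuples(arg)
  parts = arg.split(",").map(&:strip)
  if parts.empty? || parts.size.odd?
    raise ArgumentError, "expected tag,attr pairs, got #{arg.inspect}"
  end
  parts.each_slice(2).to_a
end

parse_tuples("img,src,a,href").each do |tag, attr|
  puts "extract #{attr} from <#{tag}>"
end
```

Rejecting an odd number of items early avoids silently pairing a tag with nothing.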

The “inner_html” tag attribute can be used to extract an HTML node’s… inner_html (surprise!). For instance – with a little Hpricot magic – you can fetch the ruby-lang.org “Other News” titles…

    trouvage http://ruby-lang.org -t a,inner_html -e '//div[@id=news]//li//a'

Change Hpricot expression

Trouvage builds an Hpricot expression by itself when fed tags and attributes. Since this is restrictive, and probably far from perfect, you can specify your own Hpricot expression:

    -e <EXPR>   Force Hpricot XPath/CSS expression
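
For reference, here is a rough guess at what building an expression from tag-attribute pairs could look like. The build_expressions helper and the exact strings it produces are assumptions for illustration only; the expression trouvage really generates is not documented here.

```ruby
# Hypothetical helper: one XPath-style expression per [tag, attr] pair.
# "inner_html" is treated as a pseudo-attribute, so no attribute filter is added.
def build_expressions(pairs)
  pairs.map do |tag, attr|
    attr == "inner_html" ? "//#{tag}" : "//#{tag}[@#{attr}]"
  end
end

puts build_expressions([["img", "src"], ["a", "href"]])
puts build_expressions([["a", "inner_html"]])
```

An expression given with -e would simply replace the generated ones, which is why it needs to select the same tags that -t names.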

Using this option does not free you from specifying tags to trouvage: you can’t ask trouvage to fetch href attributes from a tags while specifying an XPath expression dedicated to div tags… OK, sure, you can do this, just don’t expect trouvage to output anything useful.

Example use for this could be:

    trouvage www.google.com/search?q=foobar -e '//a[@class=l]'

This will do a really simple Google search for “foobar” and extract the first resulting links.

System requirements

Todo
