The Machines Are Watching

I’ve been playing around with RubyfulSoup for a few different projects. It’s a library for accessing HTML pages from within Ruby as if the format were actually sane for web pages (based on Beautiful Soup for Python). Sure, XHTML is supposed to be XML, and parsing is supposed to be predictable and possible with a normal XML parser. But no one who has worked with a set of web pages from the real world would expect that to be the case. It’s just so nice to be able to say:

@soup.find_all( 'a' ).each { |t|
    u = uri.merge( t['href'] )
    ...
}

and ignore the details. Ahhh. Now if only it also worked with WML, that would be stellar!
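
For anyone curious what the `uri.merge` step is doing, here is a minimal sketch using only Ruby's standard library. The `resolve_links` helper, the sample markup, and the example.com URLs are all made up for illustration, and the regex scan is just a crude stand-in for the `find_all( 'a' )` call, not how RubyfulSoup actually works.

```ruby
require 'uri'

# Deliberately sloppy markup: unclosed tags, uppercase names, mixed quoting.
SLOPPY_HTML = <<~HTML
  <p>Links: <A HREF="/about">About
  <a href='post.html'>Post</a>
  <a href="http://other.example.org/x">Elsewhere
HTML

# Hypothetical helper: pull every quoted href out of the markup and resolve it
# against the page's own URI. The regex is a stand-in for a forgiving parser
# and will miss unquoted attributes, among other things.
def resolve_links(html, base)
  hrefs = html.scan(/<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["']/i).flatten
  # URI#merge applies the usual relative-reference rules, so absolute-path,
  # relative, and fully qualified hrefs all come out as absolute URLs.
  hrefs.map { |href| base.merge(href).to_s }
end

base = URI.parse('http://example.com/blog/index.html')
puts resolve_links(SLOPPY_HTML, base)
```

The point of the merge is that the loop body never has to care whether a link was written as `/about`, `post.html`, or a full URL; everything comes back absolute.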


4 Responses to The Machines Are Watching

  1. Pingback: » The Machines Are Watching - Computer internet safety & security

  2. Luca Passani says:

    here is how Java does the same:

    http://home.ccil.org/~cowan/XML/tagsoup/

    quoting from the introduction:

    “This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.”

    Luca

  3. Eli Dickinson says:

    Mike, have you tried Hpricot for Ruby? (http://code.whytheluckystiff.net/hpricot/)

    Check out the showcase — it’s a real slick little library. Beats the pants off Soup, if you ask me.

    Eli

  4. Pingback: Email Management
