The Machines Are Watching
I’ve been playing around with RubyfulSoup for a few different projects. It’s a library for accessing HTML pages from within Ruby as if the format were actually sane for web pages (it is based on Beautiful Soup for Python). Sure, XHTML is supposed to be XML, and parsing is supposed to be predictable and possible with a normal XML parser. But no one who has worked with a set of web pages from the real world would expect that to be the case. It’s just so nice to be able to say:
@soup.find_all( 'a' ).each { |t|
u = uri.merge( t['href'] )
…
}
and ignore the details. Ahhh. Now if only it also worked with WML, that would be stellar!
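For anyone wondering what the uri.merge call in that loop is doing: it resolves each href against the URL of the page it came from, so relative links turn into absolute ones. Here is a rough sketch of just that step using Ruby's standard uri library. The base URL and the href values are made up for illustration, and the soup part is left out entirely.

```ruby
require 'uri'

# Hypothetical page URL; in the loop above this would be the page being parsed.
base = URI.parse('http://example.com/blog/2007/post.html')

# Hypothetical href values of the kinds found on real pages.
hrefs = [
  '../about.html',                 # relative, climbs one directory
  '/feed.xml',                     # relative to the site root
  'http://other.example.org/page', # already absolute, left as-is
  'comments.html#c1'               # relative to the same directory, with a fragment
]

# URI#merge resolves each reference against the base URL.
resolved = hrefs.map { |href| base.merge(href).to_s }

resolved.each { |u| puts u }
```

One caveat: merge can raise URI::InvalidURIError on a malformed href, and an a tag with no href at all gives you nil, so a real crawler probably wants a rescue or a nil check around that line.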

April 28th, 2007 at 1:46 pm
here is how Java does the same:
http://home.ccil.org/~cowan/XML/tagsoup/
quoting from the introduction:
“This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.”
Luca
May 3rd, 2007 at 10:45 am
Mike- Have you tried Hpricot for Ruby? (http://code.whytheluckystiff.net/hpricot/)
Check out the showcase — it’s a real slick little library. Beats the pants off Soup, if you ask me.
Eli