How to extract just the text from HTML page articles
Not only is it terrific at handling XML, it can do wonders with HTML of all flavors, even badly formed and specification-invalid HTML.
A common task I have these days is grabbing the text from an HTML page or article (e.g., when curating content for Macaronics).
The only real work is understanding each site's page structure and writing the correct XPath expression for it (the readability algorithm is essentially a collection of such rules), then monitoring the site for changes over time so that the expression can be updated accordingly.
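What follows is a minimal sketch of the per-site rule idea, not the text_grabber module itself. The SITE_RULES table, the hostnames in it, and extract_text are hypothetical names made up for illustration. To keep the sketch dependency-free, the standard library's xml.etree.ElementTree stands in for the real parser: it understands only a subset of XPath and needs well-formed markup, so it would not cope with the messy HTML described above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical rule table: one XPath expression per site.
# When a site changes its layout, only its entry here needs updating.
SITE_RULES = {
    "news.example.com": ".//div[@class='article-body']//p",
    "blog.example.org": ".//article/p",
}


def extract_text(markup, url):
    """Return the article text of `markup`, using the rule for the URL's host."""
    host = urlparse(url).netloc
    rule = SITE_RULES.get(host)
    if rule is None:
        raise KeyError("no extraction rule for %s" % host)
    # Assumption: markup is well-formed; ElementTree rejects sloppy HTML.
    root = ET.fromstring(markup)
    # itertext() flattens inline tags such as <b> into plain text.
    paragraphs = ("".join(p.itertext()).strip() for p in root.findall(rule))
    return "\n\n".join(text for text in paragraphs if text)


SAMPLE = """<html><body>
<div class="nav"><p>Home | About</p></div>
<div class="article-body"><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></div>
</body></html>"""

article = extract_text(SAMPLE, "http://news.example.com/2012/11/story.html")
print(article)
```

The navigation paragraph is skipped because the rule only selects paragraphs under the article-body div; that selectivity is what makes a hand-written rule per site worth maintaining.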
Another bonus is that it works with foreign-language sites, too, provided the parser is given the same encoding as the one declared in the target page’s Content-Type meta tag.
>>> import urllib, text_grabber
>>> import codecs
>>> # write the grabbed text t out as UTF-8
>>> with codecs.open('facta-201211043-print.txt', 'w', 'utf-8') as f:
...     f.write(t)
...
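The encoding requirement mentioned earlier can be sketched separately. This is only an illustration under stated assumptions: declared_encoding and META_CHARSET are hypothetical names, the regular expression is a rough heuristic rather than a full HTML encoding sniffer, and the sample page is built in memory instead of being downloaded. It uses only the Python 3 standard library.

```python
import codecs
import re

# Heuristic: find charset=... inside a <meta> tag, covering both
# <meta http-equiv="Content-Type" content="text/html; charset=X">
# and the shorter <meta charset="X"> form.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE
)


def declared_encoding(raw, default="utf-8"):
    """Return the encoding named in the page's meta tag, or `default`."""
    match = META_CHARSET.search(raw[:4096])  # the declaration sits near the top
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        return codecs.lookup(name).name  # normalise to Python's codec name
    except LookupError:
        return default  # unknown label: fall back instead of failing


# Sample page: Japanese text stored as Shift_JIS bytes, as a server might send it.
page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=Shift_JIS"></head>'
    "<body><p>日本語の記事</p></body></html>"
).encode("shift_jis")

encoding = declared_encoding(page)
text = page.decode(encoding)
print(encoding, "日本語の記事" in text)
```

The value returned by declared_encoding is what would be handed to the HTML parser, so that the bytes are decoded the way the page intends; the resulting text can then be written out as UTF-8 as in the session above.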