Wednesday, October 9, 2013

Extract main content from web page

When designing a web search engine, especially a vertical search engine, we usually want to extract the main content from a given web page. However, a page may contain a lot of boilerplate, and it is difficult to extract the main content because there are all kinds of tags we need to find and exclude.
    A tool I found for extracting the main content from a web page is called boilerpipe. It is a Java library that provides algorithms to "detect and remove the surplus 'clutter' (boilerplate, templates) around the main textual content of a web page", according to the introduction on the official website. There is one small problem: the tool is written in Java, while I want to extract the main content of HTML pages using Python. The good news is that there is a Python wrapper for it, called python-boilerpipe. Since it wraps Java libraries, it needs some "middleware" to act as a bridge between Java and Python, so it depends on Jpype and charade. Jpype is an open source project designed for bridging the worlds of Java and Python, and in my view it is better than Jython (previously called JPython), which offers similar functionality. charade is a universal character encoding detector.

    Now that everything I need is available, here are the steps I followed to set up the software environment for my project:
 1) I am using Python 2.7, and charade is already integrated, so in the first step I only need to set up Jpype. Download the package from the official site (it actually redirects to SourceForge), unpack it, and install it:
    python setup.py install --prefix=customized_path
Then add the library path to the PYTHONPATH environment variable. Note that this must be the directory where the package was actually installed, not just the customized prefix, for example: ~/jpype/lib/python2.7/site-packages/.
     The Jpype installation can be easily verified from the Python interactive prompt. Here is some sample code for testing:
     import jpype
     jpype.startJVM(jpype.getDefaultJVMPath())
     jpype.java.lang.System.out.println("hello world!")
     jpype.shutdownJVM()
     If charade is not already present, it can be configured and installed in the same way; just download it from the official website.
     After Jpype and charade have been set up, we can install python-boilerpipe, which is very straightforward.
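
     Before moving on, it can help to check that the environment from step 1 is really in place. The following is only a minimal sketch: the helper names site_packages_dir and missing_modules are my own and not part of any of these packages, and the directory layout assumes a Unix-style install with --prefix, as in the example path above.

```python
import importlib
import posixpath


def site_packages_dir(prefix, version=(2, 7)):
    """Build the directory to add to PYTHONPATH for a given --prefix.

    Assumes the Unix layout <prefix>/lib/pythonX.Y/site-packages.
    """
    return posixpath.join(prefix, "lib", "python%d.%d" % version, "site-packages")


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Example (hypothetical prefix): the directory that belongs in PYTHONPATH.
#   site_packages_dir("/home/me/jpype")
# Example: an empty list means all three packages can be imported.
#   missing_modules(["jpype", "charade", "boilerpipe"])
```

     If missing_modules reports a package that was just installed, the most likely cause is that PYTHONPATH points at the prefix itself rather than at its site-packages directory.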

2) Play with python-boilerpipe.
    One thing I like most about this tool is that it supports several kinds of text extractors.
    We may specify an extractor to extract text from a web page; by default, the DefaultExtractor is used.
    We can just specify the URL, and the tool will download the HTML automatically and extract the text from the page. The extractor's getText() method returns the plain main text of the page, and getHTML() returns the HTML content of the page.
    Alternatively, we can download the HTML page ourselves and analyze it with python-boilerpipe.
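
    Both ways of using the tool can be sketched as follows. This is a rough example rather than a definitive recipe: the wrapper function extract_main_text is my own helper, the URL in the comments is only a placeholder, and I am assuming the Extractor class of python-boilerpipe accepts the extractor, url and html keyword arguments as described in its README.

```python
def extract_main_text(url=None, html=None, extractor_name="DefaultExtractor"):
    """Return the main text of a page, given either its URL or its raw HTML."""
    # Exactly one source must be given: a URL to download or HTML we already have.
    if (url is None) == (html is None):
        raise ValueError("pass exactly one of url or html")

    # Imported lazily so this function can be defined without a working
    # Jpype/JVM setup; the import itself needs Jpype and the boilerpipe jars.
    from boilerpipe.extract import Extractor

    if url is not None:
        # Let python-boilerpipe download the page itself.
        extractor = Extractor(extractor=extractor_name, url=url)
    else:
        # Analyze HTML that we downloaded ourselves.
        extractor = Extractor(extractor=extractor_name, html=html)
    return extractor.getText()


# Example usage (placeholder URL, requires python-boilerpipe to be installed):
#   text = extract_main_text(url="http://example.com/some-article.html")
#   text = extract_main_text(html="<html><body><p>Some article text</p></body></html>",
#                            extractor_name="ArticleExtractor")
```

    Passing the extractor name as a string makes it easy to try different extractors on the same page and compare the results.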