Wednesday, October 9, 2013

Extract main content from web page

When we design a web search engine, especially a vertical search engine, we usually want to extract the main content from a given web page. However, a lot of boilerplate may be placed in a page, and it is difficult to extract the main content since there are all kinds of tags we need to find and exclude.
    A tool I found for extracting main content from a web page is called boilerpipe. It is a Java library that, according to the introduction on the official website, can 'detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page'. There is one small problem: the tool is written in Java, while I want to extract HTML main content using Python. The good thing is that there is a Python wrapper for it, called python-boilerpipe. Since it wraps Java libraries, it needs some "middleware" to act as a bridge between Java and Python: jpype and charade. JPype is an open source project designed for bridging the worlds of Java and Python, and it works better than Jython (previously called JPython), which offers similar functionality; charade is a universal character encoding detector.

    Now that everything I need is ready, here are the steps I used to deploy the software environment for my project:
 1) I am using Python 2.7, and charade is already integrated in my environment, so in the first step I only need to set up JPype. Just download the package from the official site (it actually redirects to SourceForge), unpack it, and install it:
    python setup.py install --prefix=customized_path
Then add the installed library path (the full installed package path, not just the customized install prefix, for example: ~/jpype/lib/python2.7/site-packages/) to the PYTHONPATH environment variable.
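For example, in the shell (a sketch assuming the install prefix above; adjust the path to your own setup):

    export PYTHONPATH=$PYTHONPATH:~/jpype/lib/python2.7/site-packages/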
     The JPype installation can easily be verified from the Python interactive mode. Here is some sample code for testing:
     import jpype
     jpype.startJVM(jpype.getDefaultJVMPath())
     jpype.java.lang.System.out.println("hello world!")
     jpype.shutdownJVM()
     charade can be configured and installed in the same way; just download it from the official website.
     After jpype and charade have been set up, we can install python-boilerpipe, which is very straightforward.
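For example, from the unpacked python-boilerpipe source tree (a sketch assuming it ships the usual setup.py, like the packages above):

    python setup.py install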

2) Play with python-boilerpipe.
    One thing I like most about this tool is that it supports several kinds of text extractors (for example, ArticleExtractor for article-style pages).
    We may specify an extractor to extract text from a web page; by default, the DefaultExtractor will be used.
    We can just specify the URL, and the tool will download the HTML automatically and extract the text from the page. We can use the extractor's getText() method to get the pure main text of the page, and getHTML() to get the HTML content of the page.
    Meanwhile, we can also download the HTML page by ourselves and analyze it with python-boilerpipe, as the sketch below shows.
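Here is a minimal sketch of both usages (the URL is a placeholder; DefaultExtractor and ArticleExtractor are extractors boilerpipe ships):

    from boilerpipe.extract import Extractor

    # Let python-boilerpipe fetch the page itself (the URL is a placeholder).
    extractor = Extractor(extractor='DefaultExtractor', url='http://example.com/some-article')
    print extractor.getText()    # pure main text of the page
    print extractor.getHTML()    # main content, with HTML markup kept

    # Or download the HTML ourselves and hand it to the extractor.
    import urllib2
    html = urllib2.urlopen('http://example.com/some-article').read()
    extractor = Extractor(extractor='ArticleExtractor', html=html)
    print extractor.getText()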



Wednesday, September 4, 2013

Big data analysis in people search: really challenging

Nowadays, more and more people, from both industry and academia, are talking about big data. Its popularity is like the fanaticism a new MJ album would bring if he were still alive. It is no exaggeration to say that we are now living in an era of "Big Data": science, engineering, and technology are producing increasingly large data streams, with petabyte and exabyte scales becoming increasingly common. Big data presents opportunities and also perils. On the optimistic side, it gives us a very good opportunity to scale existing theoretical principles and learning algorithms from modest-size data sets to massive data sets; thus, big data gives us a fantastic platform and resources to verify the success of those principles and algorithms. On the other side, big data also brings big challenges: it amplifies the errors and their effects in existing technologies, and efficiency, space, and energy are major issues we need to tackle in big data analysis.
As social network websites have been booming since the late '90s and early '00s, millions of people join websites like Facebook and Twitter each year; according to Facebook's official statistics, it had a billion active users at the end of 2012. People look for and share all kinds of information on these websites, and an obvious trend led by social networks is that people are much more willing to post and share their personal information online. As a result, anyone who wants to find his friends and get their recent updates can do so by just typing his friends' names online, for example, into Google or into some social network website. Moreover, as this kind of requirement becomes more popular and general, an advanced people search engine becomes very useful.
Recently, I have been investigating people search. I felt that neither Google nor social network websites like Facebook could exactly meet my requirements. Actually, there is a wonderful people search engine which can aggregate a person's profile from around the internet, not only from social network websites but also from all kinds of personal pages, even news pages. It is called "whova", and it provides advanced people search built upon proprietary big data analytics and mining technology. Precisely aggregating one's personal information online is very challenging, because:
1) Firstly, as mentioned above, a single person's information can be scattered across many websites; for example, I may have submitted my personal information, such as name, education, affiliation, etc., on Facebook, and also on other websites, e.g., LinkedIn. Precisely determining that different profiles parsed from different websites belong to the same person is very difficult.
2) Secondly, it may be somewhat easy to get profile information from social network websites, since the information shown on these websites is structured. However, for general pages, an intelligent parser must be carefully designed to infer a personal profile from the complex text paragraphs on a page; as we know, semantic analysis is a pretty challenging open problem.


Tuesday, August 27, 2013

Hbase Setup in Ubuntu


Apache HBase is the Hadoop database, a distributed, scalable big data store. The following are some steps for configuring and using HBase, as a new-user guide.

1) Firstly, download the latest version of HBase from the official site: http://hbase.apache.org/. A binary version is preferred.

2) Unpack the compressed tar file and start to configure it. The most important parts of configuring HBase include:
i) Change conf/hbase-site.xml to look like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///DIRECTORY/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/DIRECTORY/zookeeper</value>
  </property>
</configuration>

Here, DIRECTORY is the local directory where you want to store the data; a sample path could look like this: /home/yourname/database/hbase. You can also set up the directory for ZooKeeper, but that is not mandatory.

ii) For Ubuntu users, or even Linux users in general, you should also change your hosts file a little bit. As the tutorial on the official site says: "HBase expects the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions, for example, will default to 127.0.1.1 and this will cause problems for you". A modified hosts file may look like this:

127.0.0.1 localhost
127.0.0.1 yourname
127.0.1.1 yourname
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
"yourname" means your user name in ubuntu.  If you do not modify the hosts file in /etc, then possibly, you may encounter a problem when you start to use hbase shell: HBase will give you a error message says "ERROR: org.apache.hadoop.hbase.PleaseHoldException: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing". The loopback IP address was set up correctly, which may cause the master to stuck in the initialization process.

3) Then you can simply launch HBase by running: ./bin/start-hbase.sh. The master will be running; you can check it by looking at the running processes on your Ubuntu machine, as shown below. You can also terminate HBase by running: ./bin/stop-hbase.sh
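For example, one quick check (assuming the JDK's jps tool is on your PATH) is to list the Java processes; an HMaster process should appear:

    ./bin/start-hbase.sh
    jps | grep HMaster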

4) Moreover, you can use the HBase shell to create new tables or operate on existing tables by running: ./bin/hbase shell. But first you should launch the HBase master using the command in step 3); otherwise, the HBase shell will not work. A sample session is sketched below.
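For example, a minimal session inside the shell (the table name 'test' and column family 'cf' are just sample names):

    create 'test', 'cf'
    put 'test', 'row1', 'cf:a', 'value1'
    scan 'test'
    get 'test', 'row1'
    disable 'test'
    drop 'test'
    exit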



Friday, August 23, 2013

Just Do It

Sometimes, for some things, you just need to take the first step; then everything becomes much easier for you.

Thursday, August 22, 2013

About Me

I have been a post-doc at the University of California, San Diego since Aug. 2013. Before joining the Opera group, I graduated from the Institute of Computing Technology, Chinese Academy of Sciences with a PhD in computer science. Before that, I studied at the University of Science and Technology of China, majoring in EE, from 2003 to 2007.

My interests lie in many fields, including advanced compilers, program analysis, data mining, machine learning, and web development. Languages including C, C++, Java, Python, and Perl are used in my daily development. I am also interested in big data analysis, for instance, effectively and efficiently collecting and analyzing people-connection information from the internet.