Thursday, April 25, 2013

Some errors when using Jsoup

I wrote another Google App Engine server (in Java) and ran into two problems with the Jsoup library. Luckily I found solutions for both, so this post is for anyone who hits the same problems later.

1. Jsoup.parse(URL) doesn't work

When I wrote a plain Java program to parse a web page (phimchieurap.vn in my project), it worked. But the same code did not work when running on a Google App Engine server.
My solution is:  
Jsoup.connect(yourUrl).userAgent(yourUserAgent).get()
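
Here is a minimal sketch of that call wrapped in a small helper. The class name PageFetcher, the user-agent string, and the 10-second timeout are placeholders of mine, not values from my project. I don't know for sure why setting the user agent helps; my guess is that some sites reject requests that don't look like they come from a browser.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageFetcher {
    // Hypothetical user-agent string; replace it with your own.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; MyAppEngineApp/1.0)";

    // Builds the request without sending it, so the settings can be inspected.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(10_000); // milliseconds; an assumed value
    }

    // Sends a GET request and parses the response into a Document.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetch("http://phimchieurap.vn/");
        System.out.println(doc.title());
    }
}
```

Splitting prepare() from fetch() is only a convenience: it lets you check the request settings without touching the network.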

2. Element.html() returns an incorrect string

Instead of the correct string, it returns a weird string containing sequences like &;... (escaped HTML entities).
My solution is:  
StringEscapeUtils.unescapeHtml3(String)

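We need to include the commons-lang3 jar in order to use StringEscapeUtils (unescapeHtml3 is in org.apache.commons.lang3.StringEscapeUtils; the older commons-lang 2.x only has unescapeHtml).
Also remember to set the character encoding to UTF-8 in your servlet.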
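
A small sketch of the unescape step, assuming both Jsoup and Apache Commons Lang 3 are on the classpath. The class name UnescapeDemo and the HTML fragment are made up for illustration. The fragment is parsed from a string, so nothing is downloaded.

```java
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class UnescapeDemo {
    // Returns the inner HTML of the first element matching the selector,
    // with entities such as &amp; decoded back to plain characters.
    static String unescapedInnerHtml(String html, String selector) {
        Element element = Jsoup.parse(html).select(selector).first();
        return StringEscapeUtils.unescapeHtml3(element.html());
    }

    public static void main(String[] args) {
        String html = "<p>Tom &amp; Jerry</p>"; // made-up fragment
        Element p = Jsoup.parse(html).select("p").first();
        System.out.println(p.html());                      // entities still escaped
        System.out.println(unescapedInnerHtml(html, "p")); // entities decoded
    }
}
```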

Sunday, April 14, 2013

Parse web page with JSoup

I have worked on some projects that needed to parse content from web pages (they're all HTML, after all), and the JSoup library helped me a lot.

The first step is to download it from here. Then add it to the build path of your Java project (this also works for Android and Google App Engine projects).

You can read the full documentation of this library at this link. In this post I'll cover some tips that I think will be helpful for parsing a web page.

1. Main components of JSoup


+ org.jsoup.nodes.Document: this is the object which represents the whole web page.
Document doc = Jsoup.connect(yourWebPageAddress).get();
+ org.jsoup.nodes.Element: this is the object which represents a tag in your web page. From it you can extract data that you need.
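
To make these two classes concrete, here is a rough sketch that parses a made-up HTML string rather than a live page, so it runs without network access. Jsoup.parse(String) treats its argument as HTML content, not as an address. The class name ComponentsDemo and the sample page are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ComponentsDemo {
    // Made-up page used only for illustration.
    static final String HTML =
            "<html><head><title>Now showing</title></head>"
          + "<body><h1 id=\"top\">Movies</h1></body></html>";

    // The Document represents the whole page; title() reads its <title>.
    static String pageTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // An Element represents one tag; here, the first tag inside <body>.
    static String firstBodyTag(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.body().child(0);
        return first.tagName();
    }

    public static void main(String[] args) {
        System.out.println(pageTitle(HTML));
        System.out.println(firstBodyTag(HTML));
    }
}
```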

2. How to parse data with JSoup


I often use Chrome's Inspect Element feature to find which part of the web page holds the data I want to extract. Next I call:
Elements elements = doc.select(yourSelector);
elements holds a list of Element objects (access one with Elements.get(int index)), and from each Element you can get everything you want. yourSelector is a CSS-like selector that finds the elements matching a query. You can read the details about these selectors at http://jsoup.org/apidocs, in the section on Selector.
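
The select step could look roughly like this. The class name SelectDemo, the list markup, and the class names "movie" and "ad" are invented for the example; on a real page you would take the selector from what Inspect Element shows you.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectDemo {
    // Made-up fragment: two movie entries and one unrelated entry.
    static final String HTML =
            "<ul><li class=\"movie\">Iron Man 3</li>"
          + "<li class=\"movie\">Oblivion</li>"
          + "<li class=\"ad\">Buy popcorn</li></ul>";

    // Collects the text of every element that matches the selector.
    static List<String> textsOf(String html, String selector) {
        Document doc = Jsoup.parse(html);
        Elements elements = doc.select(selector);
        List<String> texts = new ArrayList<>();
        // Elements can be looped over directly, or read with get(index).
        for (Element element : elements) {
            texts.add(element.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(textsOf(HTML, "li.movie"));
    }
}
```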

2.1 Some main selectors:

+ .class: elements with a class name of "class".
+ #id: elements whose id attribute is "id".
+ E F: an F element descended from an E element (e.g. "a img").
+ E, F, G: all matching elements E, F, or G.
+ [attr]: elements with an attribute named "attr" (with any value).
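
A quick way to get a feel for the selectors listed above is to count what each one matches on a small fragment. This is just a sketch: the class name SelectorDemo and the markup are made up, and it assumes Jsoup is on the classpath.

```java
import org.jsoup.Jsoup;

public class SelectorDemo {
    // Made-up fragment with a class, an id, nested tags and an attribute.
    static final String HTML =
            "<div class=\"poster\" id=\"main\"><a href=\"/film/1\"><img src=\"a.jpg\"></a></div>"
          + "<p>Intro</p><span title=\"note\">x</span><img src=\"b.jpg\">";

    // Number of elements in the parsed fragment that match the selector.
    static int count(String html, String selector) {
        return Jsoup.parse(html).select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {".poster", "#main", "a img", "img", "p, span", "[title]"};
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(HTML, selector));
        }
    }
}
```

Comparing "a img" with plain "img" shows the difference the descendant selector makes: only one of the two images sits inside a link.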

2.2 Some main methods:

+ The select(yourSelector) method is available on Document, Elements, and Element.
+ To get an attribute of a tag:
Element.attr(attributeName)
+ To get an element's inner HTML:
Element.html()
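
Put together, the methods above might be used like this. It is only a sketch: the class name MethodsDemo, the markup, and the "#film" id are invented. Note that first() returns null when nothing matches, so real code should check for that.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MethodsDemo {
    // Made-up fragment: one film block containing a link.
    static final String HTML =
            "<div id=\"film\"><a href=\"/film/42\" title=\"Oblivion\">"
          + "<b>Oblivion</b> (2013)</a></div>";

    // Narrows the search in two steps: Document first, then Element.
    static Element firstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element film = doc.select("#film").first(); // select on a Document
        return film.select("a").first();            // select on an Element
    }

    // Value of the link's href attribute.
    static String linkHref(String html) {
        return firstLink(html).attr("href");
    }

    // Markup between the link's opening and closing tags.
    static String linkInnerHtml(String html) {
        return firstLink(html).html();
    }

    public static void main(String[] args) {
        System.out.println(linkHref(HTML));
        System.out.println(linkInnerHtml(HTML));
    }
}
```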