Thursday, April 25, 2013

Some errors when using Jsoup

I wrote another Google App Engine server (in Java) and ran into two problems with the Jsoup library. Luckily, I found a solution for each, so this post is for anyone who hits the same problems later.

1. Jsoup.parse(URL) doesn't work

In a plain Java program, parsing a web page with Jsoup.parse(URL) works (in my project the page is phimchieurap.vn). But the same code doesn't work when it runs on a Google App Engine server.
My solution is:  
Jsoup.connect(yourUrl).userAgent(yourUserAgent).get()
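
Here is a minimal sketch of that call in context. The class name, the User-Agent string and the 10-second timeout are placeholders I made up for illustration; use whatever values fit your app.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchWithUserAgent {

    // Hypothetical placeholder: substitute the User-Agent string you want to send.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; MyAppEngineApp/1.0)";

    /** Builds the request with an explicit User-Agent; nothing is sent yet. */
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(10000); // milliseconds; an arbitrary choice for this sketch
    }

    /** Executes a GET request and parses the response body into a Document. */
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetch("http://phimchieurap.vn/");
        System.out.println(doc.title());
    }
}
```

In a servlet you would call fetch(...) from doGet and handle the IOException there.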

2. Element.html() returns an incorrect string

Instead of the correct string, it returns a weird one containing escaped sequences of the form &...; (HTML entities).
My solution is:  
StringEscapeUtils.unescapeHtml3(String)

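StringEscapeUtils comes from Apache Commons Lang, so we need to include the commons-lang JAR in the project. Note that unescapeHtml3 belongs to Commons Lang 3 (package org.apache.commons.lang3); in Commons Lang 2.x the corresponding method is unescapeHtml.
Also remember to set the character encoding to UTF-8 in your servlet.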
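
To make the fix concrete, here is a small sketch. It assumes Commons Lang 3 and Jsoup are both on the classpath; the class name, the helper method and the HTML fragment are made up for the example.

```java
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class UnescapeInnerHtml {

    /** Returns the element's inner HTML with HTML 3 entities turned back into characters. */
    static String unescapedHtml(Element element) {
        return StringEscapeUtils.unescapeHtml3(element.html());
    }

    public static void main(String[] args) {
        // Made-up fragment for illustration.
        Element p = Jsoup.parse("<p>Tom &amp; Jerry</p>").select("p").first();
        System.out.println(p.html());         // inner HTML as Jsoup serializes it (entities kept)
        System.out.println(unescapedHtml(p)); // same string after unescaping
    }
}
```

If you only need the text and not the inner markup, Element.text() may be enough on its own, since it returns the text without entity escaping.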

1 comment:

  1. Tool using Jsoup available at http://www.srccodes.com/tools/html/js-extractor
