I've worked on some projects that needed to parse content from web pages (which are all HTML documents, after all), and the JSoup library helped me a lot.
The first step is to download it from the jsoup homepage (http://jsoup.org). Then add it to the build path of your Java project (it can also be an Android or Google App Engine project).
You can read the full documentation of this library at http://jsoup.org/apidocs. In this post I'll cover some tips that I think will be helpful for parsing a web page.
1. Main components of JSoup
+ org.jsoup.nodes.Document: this is the object which represents the whole web page.
Document doc = Jsoup.connect(yourWebPageAddress).get();
+ org.jsoup.nodes.Element: this is the object which represents a tag in your web page. From it you can extract the data that you need.
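To make these two components concrete, here is a minimal sketch. It assumes jsoup is on the classpath, and it parses a made-up HTML string with Jsoup.parse() so it runs offline; for a live page you would use Jsoup.connect(url).get() instead.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseDemo {
    public static void main(String[] args) {
        // A made-up HTML snippet for illustration; Jsoup.connect(url).get()
        // would fetch and parse a live page instead.
        String html = "<html><body><h1 id='title'>Hello JSoup</h1></body></html>";
        Document doc = Jsoup.parse(html);

        // Every tag in the page is an Element.
        Element heading = doc.getElementById("title");
        System.out.println(heading.text()); // prints "Hello JSoup"
    }
}
```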
2. How to parse data with JSoup
I often use the Inspect Element feature of Chrome to find which part of the web page can be used for extracting data. What I do next is call:
Elements elements = doc.select(yourSelector);
elements contains a list of Element objects (accessible via Elements.get(int index)), and from each Element you can get everything you want.
yourSelector is a CSS-like element selector that finds elements matching a query. You can read the details about these selectors by browsing http://jsoup.org/apidocs and finding the Selector section.
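As a quick sketch of select() and Elements in action (the list markup below is made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectDemo {
    public static void main(String[] args) {
        String html = "<ul><li>one</li><li>two</li><li>three</li></ul>";
        Document doc = Jsoup.parse(html);

        // select() returns an Elements list of every matching tag.
        Elements items = doc.select("li");
        System.out.println(items.size());   // 3

        // Access a single Element by index...
        Element first = items.get(0);
        System.out.println(first.text());   // "one"

        // ...or iterate over all of them.
        for (Element item : items) {
            System.out.println(item.text());
        }
    }
}
```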
2.1 Some main selectors:
+ .class: elements with a class name of "class".
+ #id: elements with an attribute ID of "id".
+ E F: an F element descended from an E element (e.g. "a img").
+ E, F, G: all matching elements E, F, or G.
+ [attr]: elements with an attribute named "attr" (with any value).
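Each of these selector forms can be tried against a small, made-up snippet (the markup and attribute names here are just for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    public static void main(String[] args) {
        String html = "<div class='post'><a href='/x'><img src='a.png'></a>"
                    + "<p id='intro' data-lang='en'>hi</p></div>";
        Document doc = Jsoup.parse(html);

        System.out.println(doc.select(".post").size());       // class selector    -> 1
        System.out.println(doc.select("#intro").size());      // id selector       -> 1
        System.out.println(doc.select("a img").size());       // descendant E F    -> 1
        System.out.println(doc.select("p, img").size());      // union E, F        -> 2
        System.out.println(doc.select("[data-lang]").size()); // attribute [attr]  -> 1
    }
}
```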
2.2 Some main methods:
+ We can use the method select(yourSelector) on Document, Elements, and Element alike.
+ To get an attribute of a tag: Element.attr(attributeName)
+ To get an element's inner HTML: Element.html()
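The methods above can be sketched together like this (the anchor tag here is made up for illustration; text() is included alongside html() because in practice it is what you usually want):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MethodsDemo {
    public static void main(String[] args) {
        String html = "<a href='http://jsoup.org'><b>jsoup</b> home</a>";
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();

        // attr(name) reads a single attribute value.
        System.out.println(link.attr("href")); // http://jsoup.org

        // html() returns the inner HTML; text() returns the plain text.
        System.out.println(link.html());       // <b>jsoup</b> home
        System.out.println(link.text());       // jsoup home
    }
}
```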