The first step is to download it from here. Then add it to the build path of your Java project (this also works for Android and Google App Engine projects).
You can read the full documentation of this library at this link. In this post I'll cover some tips that I think will be helpful for parsing a web page.
1. Main components of JSoup
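+ org.jsoup.nodes.Document: this is the object which represents the whole web page. To load a page from its address, use Jsoup.connect (Jsoup.parse(String) is for HTML text you already have as a string):
Document doc = Jsoup.connect(yourWebPageAddress).get();
+ org.jsoup.nodes.Element: this is the object which represents a tag in your web page. From it you can extract the data that you need.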
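As a minimal sketch of these two classes, the example below parses a small HTML string instead of a live page, so it runs without network access. The class name, the sample HTML, and the helper method names are made up for illustration; Jsoup.parse(String), Document.title(), Document.body(), and Element.tagName() are the library calls it relies on.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; not part of JSoup itself.
class DocumentElementSketch {

    // Sample markup standing in for a downloaded web page (assumed for illustration).
    static final String SAMPLE_HTML =
            "<html><head><title>Sample page</title></head>"
            + "<body><p>Hello</p></body></html>";

    // Parse HTML text into a Document, the object for the whole page.
    static Document parsePage(String html) {
        return Jsoup.parse(html);
    }

    // Return the tag name of the page's <body> Element.
    static String bodyTagName(Document doc) {
        Element body = doc.body();
        return body.tagName();
    }

    public static void main(String[] args) {
        Document doc = parsePage(SAMPLE_HTML);
        System.out.println(doc.title());
        System.out.println(bodyTagName(doc));
    }
}
```

For a live page, Jsoup.connect(yourWebPageAddress).get() returns the same kind of Document, but it needs network access and can throw an IOException.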
2. How to parse data with JSoup
I often use the Inspect Element feature of Chrome to find which part of the web page can be used for extracting data. What I do next is call:
Elements elements = doc.select(yourSelector);
elements contains a list of Element objects (accessed via Elements.get(int index)), and from these Element objects you can get everything you want. yourSelector is a CSS-like element selector that finds the elements matching a query. You can read the details about these selectors by browsing http://jsoup.org/apidocs and finding the Selector section.
2.1 Some main selectors:
+ .class: elements with a class name of "class".
+ #id: elements with attribute ID of "id".
+ E F: an F element descended from an E element. (eg: "a img")
+ E, F, G: all matching elements E, F, or G.
+ [attr]: elements with an attribute named "attr" (with any value).
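To make the selectors above concrete, here is a small sketch that runs each one against a made-up HTML snippet and counts the matches. The class name, the sample markup, and the countMatches helper are assumptions for illustration only; the JSoup calls used are Jsoup.parse(String) and select(String).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; not part of JSoup itself.
class SelectorSketch {

    // Made-up markup: one div with an id and a class, two links
    // (one wrapping an image, one with a title attribute), a paragraph, and a span.
    static final String SAMPLE_HTML =
            "<div id=\"main\" class=\"content\">"
            + "<a href=\"/one\"><img src=\"a.png\"></a>"
            + "<a href=\"/two\" title=\"Second link\">two</a>"
            + "<p class=\"note\">a note</p>"
            + "</div>"
            + "<span>outside</span>";

    // Parse the HTML and return how many elements the selector matches.
    static int countMatches(String html, String selector) {
        Document doc = Jsoup.parse(html);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        // One selector per bullet in the list: .class, #id, E F, "E, F", [attr].
        String[] selectors = {".note", "#main", "a img", "p, span", "[title]"};
        for (String selector : selectors) {
            System.out.println(selector + " -> " + countMatches(SAMPLE_HTML, selector));
        }
    }
}
```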
2.2 Some main methods:
+ We can use the method select(yourSelector) on Document, Elements, and Element alike.
+ To get an attribute of a tag:
Element.attr(attributeName)
+ To get an element's inner HTML:
Element.html()
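The methods above can be chained, as in this rough sketch: select on the Document, select again on the resulting Elements, then read an attribute and the inner HTML from a single Element. The class name, the sample markup, and the firstLink helper are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class; not part of JSoup itself.
class AttrHtmlSketch {

    // Made-up markup: a container div holding one link with nested bold text.
    static final String SAMPLE_HTML =
            "<div id=\"main\"><a href=\"/docs\">Read the <b>docs</b></a></div>";

    // Find the first <a> inside the element with id "main".
    // Throws IndexOutOfBoundsException if no such link exists.
    static Element firstLink(String html) {
        Document doc = Jsoup.parse(html);
        Elements container = doc.select("#main"); // select on a Document
        Elements links = container.select("a");   // select on Elements
        return links.get(0);
    }

    public static void main(String[] args) {
        Element link = firstLink(SAMPLE_HTML);
        System.out.println(link.attr("href")); // value of the href attribute
        System.out.println(link.html());       // markup inside the <a> tag
    }
}
```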