The other day I needed to write an application that pulls various pieces of data out of a pile of web pages. I found an excellent library, Jsoup, that parses HTML and lets you query it with jQuery-like selectors. No headache-inducing regexes required.
For example, say you need to get the title (the link text) and the URL of the anchor in the HTML code below:
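// The quotes inside the HTML must be escaped, and the div needs the class the selector looks for.
String html = "<div class=\"some-class\"><a href=\"http://blog.romanvlasenko.com/\">Click here...</a></div>";
Document doc = Jsoup.parse(html);
String title = doc.select("div.some-class a").text();      // the link text: "Click here..."
String url = doc.select("div.some-class a").attr("href");  // the value of the href attribute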
Voila! You have what you needed.
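If the page contains more than one link, the same selector gives you all of them. Here is a minimal sketch, assuming Jsoup is on the classpath; the class name LinkExtractor, the helper extractLinks, and the second sample link are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {

    // Hypothetical helper: returns "text -> href" for every anchor
    // with an href attribute inside a div with class "some-class".
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("div.some-class a[href]")) {
            links.add(a.text() + " -> " + a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup; the second link is a placeholder.
        String html = "<div class=\"some-class\">"
                + "<a href=\"http://blog.romanvlasenko.com/\">Click here...</a>"
                + "<a href=\"http://example.com/about\">About</a>"
                + "</div>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

select() returns a list of elements, so an ordinary for-each loop is enough to walk through every match.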
If you need to parse a real web page, use the Jsoup.connect() method instead of Jsoup.parse().
Jsoup has its own HttpConnection that takes care of the extra work you would otherwise have to do yourself with Java’s HttpURLConnection.
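To give an idea of that extra work, here is a rough sketch of a plain HttpURLConnection fetch using only the standard library. It is not what Jsoup does internally; the class name PlainFetch, the 5-second timeouts, and the UTF-8 assumption are my own choices for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class PlainFetch {

    // Configures a GET request with a custom User-Agent.
    // Nothing is sent over the network yet.
    static HttpURLConnection prepare(String address, String userAgent) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setConnectTimeout(5000); // assumed value, in milliseconds
        conn.setReadTimeout(5000);    // assumed value, in milliseconds
        return conn;
    }

    // Sends the request and reads the whole body, assuming it is UTF-8.
    static String fetch(String address, String userAgent) throws IOException {
        HttpURLConnection conn = prepare(address, userAgent);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: PlainFetch <url>");
            return;
        }
        System.out.println(fetch(args[0], "Mozilla"));
    }
}
```

And after all that you only have a raw string, which still has to be parsed. Jsoup folds the request and the parsing into one chained call.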
Document webPage = Jsoup.connect("http://blog.romanvlasenko.com/").userAgent("Mozilla").get();
Now you have the downloaded web page ready for processing. Jsoup supports both POST and GET requests. For more details, see the Jsoup project homepage.
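A POST request can be sketched in much the same way. This assumes Jsoup is on the classpath; the class name FormPost, the helper buildSearch, the form field "q", and the 5-second timeout are hypothetical and only there to make the example complete.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class FormPost {

    // Hypothetical helper: builds a POST request with one form field.
    // Nothing is sent until post() or execute() is called.
    static Connection buildSearch(String url, String query) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")
                .timeout(5000)       // assumed value, in milliseconds
                .data("q", query)    // "q" is a made-up field name
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: FormPost <url> <query>");
            return;
        }
        // Sends the form and parses the response into a Document.
        Document result = buildSearch(args[0], args[1]).post();
        System.out.println(result.title());
    }
}
```

The response comes back as a Document, so you can run the same select() calls on it as on any other parsed page.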