14 Feb

HTML parsing in Java is really simple with Jsoup

The other day I needed to write an application which parses different data from a pile of web pages. I’ve found an excellent library that parses HTML with JQuery-like selection methods. You do not need any regexes which cause a headache.

For example, you need to get a title and URL of anchor from the HTML code below:

String html = "<div><a href="http://blog.romanvlasenko.com/">Click here...</a></div>";
Document doc = Jsoup.parse(html);
String title = doc.select("div.some-class a").text();
String url = doc.select("div.some-class a").attr("href");

Voila! You have what you needed.

If you need to parse a real web page, use Jsoup.connect() method instead of the Jsoup.parse()
Jsoup has it’s own HttpConnection that encapsulates some extra-work which you were needed to do with Java’s HttpURLConnection.

Document webPage = Jsoup.connect("http://blog.romanvlasenko.com/").userAgent("Mozilla").get();

Now you have downloaded web page ready for processing. Jsoup provides both POST and GET methods. For more details see the Jsoup project homepage.