How to Parse HTML in Java
If your program works with HTML files, you need a way to parse them efficiently. In Java, you can do this with Jsoup, a widely used open-source library for parsing and scraping HTML.
This article discusses how to parse HTML with Jsoup. It walks through a complete example and explains each step.
Working of Jsoup in Java
Jsoup works by parsing the HTML of a web page and converting it into a Document object. You can think of this object as a programmatic representation of the DOM.
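To make the idea of a Document as a DOM-like tree more concrete, here is a minimal sketch. The class name DocumentTreeSketch, its helper methods, and the sample markup are hypothetical and chosen only for illustration; the Jsoup calls used (Jsoup.parse, title, getElementById, select) are part of the library's public API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DocumentTreeSketch {
  // Hypothetical sample markup, used only for illustration
  static final String SAMPLE_HTML =
      "<html><head><title>Sample Page</title></head>"
          + "<body><h1 id=\"top\">Hello</h1><p>First</p><p>Second</p></body></html>";

  // Returns the text of the <title> element
  static String titleOf(String html) {
    Document doc = Jsoup.parse(html);
    return doc.title();
  }

  // Counts the <p> elements anywhere in the document
  static int paragraphCount(String html) {
    Document doc = Jsoup.parse(html);
    return doc.select("p").size();
  }

  public static void main(String[] args) {
    Document doc = Jsoup.parse(SAMPLE_HTML);
    // Look up a single element by its id attribute
    Element heading = doc.getElementById("top");
    System.out.println(doc.title());                                // Sample Page
    System.out.println(heading.tagName() + ": " + heading.text());  // h1: Hello
    System.out.println(paragraphCount(SAMPLE_HTML));                // 2
  }
}
```

Once the HTML is parsed, the program works with elements and their text rather than with raw strings, which is what makes the Document useful for scraping.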
The static parse method of the Jsoup class creates the Document. Some of its overloads are listed below:
parse(File file, @Nullable String charsetName) - Parses an HTML file.
parse(InputStream in, @Nullable String charsetName, String baseUri) - Reads the InputStream and parses it.
parse(String html) - Parses an HTML string.
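The three overloads listed above can be sketched side by side as follows. This is only an illustration: the class name ParseOverloadsSketch, its helper methods, the sample markup, and the temporary file are assumptions made for the example, and the same HTML is fed to every overload so that the results can be compared.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseOverloadsSketch {
  // Hypothetical sample markup shared by all three helpers
  static final String HTML =
      "<html><head><title>Overloads</title></head><body><p>Hi</p></body></html>";

  // parse(String html): parses markup that is already in memory
  static String titleFromString() {
    return Jsoup.parse(HTML).title();
  }

  // parse(InputStream in, String charsetName, String baseUri): parses a byte stream
  static String titleFromStream() throws IOException {
    try (InputStream in = new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8))) {
      Document doc = Jsoup.parse(in, "UTF-8", "https://www.example.com/");
      return doc.title();
    }
  }

  // parse(File file, String charsetName): parses a file on disk
  static String titleFromFile() throws IOException {
    // A temporary file stands in for a real HTML file
    Path tmp = Files.createTempFile("jsoup-sketch", ".html");
    try {
      Files.write(tmp, HTML.getBytes(StandardCharsets.UTF_8));
      File file = tmp.toFile();
      return Jsoup.parse(file, "UTF-8").title();
    } finally {
      Files.deleteIfExists(tmp);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(titleFromString());
    System.out.println(titleFromStream());
    System.out.println(titleFromFile());
  }
}
```

All three calls yield an equivalent Document; which overload to pick depends only on where the HTML comes from.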
Use Jsoup to Parse HTML in Java
The example below fetches a web page and parses it with Jsoup. The Java code is as follows:
package javaparsehtml;

// Importing the necessary classes
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JavaParseHtml {
  public static void main(String[] args) {
    String address = "https://www.example.com";
    try {
      // Providing the URL of the website
      URL myUrl = new URL(address);
      // Create an HTTP connection
      HttpURLConnection myConnection = (HttpURLConnection) myUrl.openConnection();
      // Ask the server for an HTML response
      myConnection.setRequestProperty("Accept", "text/html");
      // Open the response stream; try-with-resources closes it afterwards
      try (InputStream responseStream = myConnection.getInputStream()) {
        // Parsing the website
        Document myDoc = Jsoup.parse(responseStream, "UTF-8", address);
        // Showing the output as HTML
        System.out.println(myDoc.html());
      }
    } catch (MalformedURLException e) {
      // The URL string is not valid
      e.printStackTrace();
    } catch (IOException e) {
      // The connection or the read failed
      e.printStackTrace();
    }
  }
}
The example above illustrates how to parse HTML fetched from a website, and the comments in the code state the purpose of each step.
In the example, we created an HTTP connection to the provided URL and then set a request property on it. After that, we opened an InputStream on the response and parsed it with Jsoup.parse.
Lastly, we print the parsed document as HTML. After executing the above Java program, you will get output like the following:
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8">
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
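Printing the whole page is rarely the end goal; usually you want specific pieces of it. The sketch below shows one way to pull the heading and the links out of a parsed Document, and why a base URI is passed to parse: it lets Jsoup resolve relative links. The class name ExtractLinksSketch, its helper methods, and the markup are assumptions for illustration. The markup is a trimmed-down stand-in for the page printed above, with one hypothetical relative link (/about) added.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractLinksSketch {
  // Stand-in for the fetched page; the "/about" link is hypothetical
  static final String HTML =
      "<html><head><title>Example Domain</title></head><body><div>"
          + "<h1>Example Domain</h1>"
          + "<p><a href=\"https://www.iana.org/domains/example\">More information...</a></p>"
          + "<p><a href=\"/about\">About</a></p>"
          + "</div></body></html>";

  // Returns the text of the first-level heading(s)
  static String headingOf(String html) {
    return Jsoup.parse(html).select("h1").text();
  }

  // Collects every link target, resolved against the base URI
  static List<String> absoluteLinks(String html, String baseUri) {
    Document doc = Jsoup.parse(html, baseUri);
    List<String> links = new ArrayList<>();
    for (Element anchor : doc.select("a[href]")) {
      // absUrl resolves a relative href against the document's base URI
      links.add(anchor.absUrl("href"));
    }
    return links;
  }

  public static void main(String[] args) {
    System.out.println(headingOf(HTML));
    for (String link : absoluteLinks(HTML, "https://www.example.com/")) {
      System.out.println(link);
    }
  }
}
```

With the base URI https://www.example.com/, the relative link /about is reported as https://www.example.com/about, while the already absolute link is returned unchanged.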
An important note: if you have not yet added the Jsoup jar file to your project, you first need to include it in your project's classpath or add Jsoup as a dependency through your build tool. Otherwise, the code will not compile because the org.jsoup classes cannot be found.
Aminul is an expert technical writer and full-stack developer. He has hands-on experience with numerous developer platforms and SaaS startups and is skilled in many programming languages and frameworks. He writes professional technical articles such as reviews, programming tutorials, documentation, SOPs, user manuals, and whitepapers.