How to Build a Web Scraper in Java

Hello! Today we are going to build a web scraper. Although most of you know the term, let me briefly explain what a web scraper is.


Web scraping, also known as web harvesting or web data extraction, is a type of data scraping that is used to gather information from websites. Using the Hypertext Transfer Protocol or a web browser, web scraping software can directly access the World Wide Web.

In this project, we are going to use Spring Boot for the backend service and build a simple UI for the frontend using HTML and CSS.


We will use the jsoup library to extract data from websites. jsoup is a Java library for working with real-world HTML. It uses the best of HTML5 DOM methods and CSS selectors to provide a very convenient API for fetching URLs and for extracting and manipulating data.

Create a project first. You can generate one at https://start.spring.io/

Then open the pom.xml file and add the dependency below:

		<dependency>
			<groupId>org.jsoup</groupId>
			<artifactId>jsoup</artifactId>
			<version>1.14.2</version>
		</dependency>

The complete pom.xml file will then be:

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.springframework.boot</groupId>
		<artifactId>spring-boot-starter-parent</artifactId>
		<version>2.5.4</version>
		<relativePath/> <!-- lookup parent from repository -->
	</parent>
	<groupId>com.webs</groupId>
	<artifactId>data-scraper</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>data-scraper</name>
	<description>Demo project for data-scraping</description>
	<properties>
		<java.version>8</java.version>
	</properties>
	<dependencies>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-devtools</artifactId>
			<scope>runtime</scope>
			<optional>true</optional>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
		</dependency>
		<dependency>
			<groupId>org.codehaus.jettison</groupId>
			<artifactId>jettison</artifactId>
			<version>1.3.7</version>
		</dependency>


		<dependency>
			<groupId>org.jsoup</groupId>
			<artifactId>jsoup</artifactId>
			<version>1.14.2</version>
		</dependency>

	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.springframework.boot</groupId>
				<artifactId>spring-boot-maven-plugin</artifactId>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.1</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
				</configuration>
			</plugin>
		</plugins>
	</build>

</project>

In my case, the main class is DataScraperApplication. It is a simple class that runs the application and returns a welcome message at the root URL.

DataScraperApplication.java

package com.webs;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.CrossOrigin;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class DataScraperApplication {

	public static void main(String[] args) {
		SpringApplication.run(DataScraperApplication.class, args);
	}

	@GetMapping
	@CrossOrigin
	public String displayWelcomeMessage(){
		return "Welcome Guys! Now, Scrap Anything, what do you want.........";
	}


}

OK, now create three packages: controller, exception, and utils. All the functionality lives in the controller package. The exception package is responsible for handling exceptions, and the utils package holds a few utility classes.

In the utils package, let's create a bean named Params that holds the selector type (id, class, or tag) and the name of the id, class, or tag to look up. The class will be:

Params.java

package com.webs.utils;

public class Params {

    private String selector;
    private String name;

    public String getSelector() {
        return selector;
    }

    public void setSelector(String selector) {
        this.selector = selector;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "Params{" +
                "selector='" + selector + '\'' +
                ", name='" + name + '\'' +
                '}';
    }
}

Also within the utils package, create another class named MethodUtils. It is responsible for building the response message when an exception occurs. The final file will be:

MethodUtils.java

package com.webs.utils;


import org.codehaus.jettison.json.JSONException;
import org.codehaus.jettison.json.JSONObject;
import org.springframework.http.HttpStatus;


public class MethodUtils {

    // Utility class with only static methods; not meant to be instantiated
    private MethodUtils(){

    }
    public static String prepareErrorJSON(HttpStatus status, Exception ex) {
        JSONObject errorJSON=new JSONObject();
        try {
            errorJSON.put("success",false);
            errorJSON.put("message",ex.getMessage());
            errorJSON.put("status_code",status.value());
        } catch (JSONException e) {

            e.printStackTrace();
        }

        return errorJSON.toString();

    }

}

In the exception package, we will create two classes for handling exceptions. The first is ApplicationException, which extends the RuntimeException class. The other is ApplicationExceptionHandler, which is annotated with @ControllerAdvice.

ApplicationException.java

package com.webs.exception;

public class ApplicationException extends RuntimeException{

    public ApplicationException(String msg){
        super(msg);
    }
}

ApplicationExceptionHandler.java

package com.webs.exception;


import com.webs.utils.MethodUtils;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ControllerAdvice;
import org.springframework.web.bind.annotation.ExceptionHandler;

@ControllerAdvice
public class ApplicationExceptionHandler {


    @ExceptionHandler(value = ApplicationException.class)
    public ResponseEntity<String> applicationException(ApplicationException exception){
        HttpStatus status=HttpStatus.NOT_FOUND;
        return new ResponseEntity<>(MethodUtils.prepareErrorJSON(status,exception),status);
    }

}

Finally, let's create our controller. In the controller package, create a class named DataScraperController. It is annotated with @RestController and @RequestMapping("api/v1/data-scraper"), which sets the base URL for all its endpoints.

In DataScraperController, create a method that fetches all the links on a page by selecting its a tags.

DataScraperController.java

    @GetMapping("/get/urls")
    @CrossOrigin
    public ResponseEntity<?> getDataByTagName(@RequestParam("url") String url) throws IOException {

        if (url.isEmpty()){
            throw new ApplicationException("URL Can't be Empty");
        }

        Document doc = getDocument(url);
        Elements web_data = doc.select("a");

        Map<String,String> data = new HashMap<>();

        for (Element web_item : web_data) {
            String link = web_item.attr("href");
            String text = web_item.text();
            data.put(link,text);
        }
        return new ResponseEntity<>(data,HttpStatus.OK);
    }
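One thing to keep in mind with the method above: it stores the links in a HashMap keyed by href, so a link that appears twice on the page overwrites the earlier entry, and the order of the results is not guaranteed. Here is a minimal sketch of an alternative, assuming you want to keep the first text for each link and preserve page order. LinkCollector and its collect method are hypothetical names used only for illustration; they are not part of the project.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original project.
class LinkCollector {

    // Each pair is {href, text}, in the order the links appear on the page.
    static Map<String, String> collect(List<String[]> pairs) {
        // LinkedHashMap keeps insertion order, unlike HashMap
        Map<String, String> links = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            if (pair[0].isEmpty()) {
                continue; // skip anchors that have no href
            }
            // putIfAbsent keeps the first text seen for a repeated href
            links.putIfAbsent(pair[0], pair[1]);
        }
        return links;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[]{"/home", "Home"},
                new String[]{"/about", "About"},
                new String[]{"/home", "Back to start"});
        System.out.println(collect(pairs)); // prints {/home=Home, /about=About}
    }
}
```

In the controller, the same idea would mean swapping new HashMap<>() for new LinkedHashMap<>() and data.put(...) for data.putIfAbsent(...).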

Create three more utility methods:

    private String removeQuotationFromText(String str){
        return str.replaceAll(" \" "," ");
    }

    private Document getDocument(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    private List<String> extractElementData(Elements element){
        List<String> data = new ArrayList<>();
        for (Element web_item : element) {
            String str = removeQuotationFromText(web_item.text());
            data.add(str);
        }

        return data;
    }
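The getDocument method above passes whatever URL it receives straight to jsoup, and the endpoints only check that the URL is not empty. As a rough sketch, assuming you want to reject anything that is not an http or https address before fetching, a check like the following could run first. UrlCheck and isHttpUrl are hypothetical names, and the code uses only the standard java.net.URI class.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of the original project.
class UrlCheck {

    // Returns true only for syntactically valid http/https URLs that have a host.
    static boolean isHttpUrl(String url) {
        if (url == null || url.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            boolean http = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return http && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // e.g. the text contains spaces or other illegal characters
        }
    }

    public static void main(String[] args) {
        System.out.println(isHttpUrl("https://example.com/page")); // prints true
        System.out.println(isHttpUrl("ftp://example.com/file"));   // prints false
    }
}
```

In the controller, a failed check could throw the same ApplicationException that is used for an empty URL.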

Now create another method for getting the images from a website.


    @GetMapping("/get/images")
    @CrossOrigin
    public ResponseEntity<?> getImages(@RequestParam("url") String url) throws IOException {

        if (url.isEmpty()){
            throw new ApplicationException("URL Can't be Empty");
        }

        Document doc = getDocument(url);
        Elements web_data = doc.select("img");

        Map<String,String> data = new HashMap<>();

        for (Element web_item : web_data) {
            String link = web_item.attr("src");
            String text = web_item.attr("alt");
            data.put(link,text);
        }
        return new ResponseEntity<>(data,HttpStatus.OK);

    }
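Note that the src attribute is returned exactly as it is written in the page, so it is often a relative path such as /logo.png. Below is a minimal sketch of turning such a value into an absolute URL with the standard java.net.URI class, assuming the page URL is known. ImageUrlResolver and toAbsolute are hypothetical names. jsoup also offers Element.absUrl("src") for the same purpose, which may be the simpler choice inside the controller.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the original project.
class ImageUrlResolver {

    // Resolves a possibly relative src against the URL of the page it came from.
    // URI.create throws IllegalArgumentException if either value is not a valid URI.
    static String toAbsolute(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/blog/post.html";
        System.out.println(toAbsolute(page, "/logo.png"));    // prints https://example.com/logo.png
        System.out.println(toAbsolute(page, "../img/a.png")); // prints https://example.com/img/a.png
    }
}
```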

Let's create another method for getting data from a website by id, class, or tag name.

    @GetMapping("/get")
    @CrossOrigin
    public ResponseEntity<?> getData(@RequestParam("url") String url, Params params) throws IOException {
        if (url.isEmpty()){
            throw new ApplicationException("URL Can't be Empty");
        }

        // Check for null first: calling isEmpty() on a missing parameter would throw a NullPointerException
        if(params.getSelector() == null || params.getSelector().isEmpty()){
            throw new ApplicationException("Tag Selector Name Can't be Empty");
        }

        if(params.getName() == null || params.getName().isEmpty()){
            throw new ApplicationException("Tag Name Can't be Empty");
        }

        Document document = Jsoup.connect(url).get();
        List<String> results = new ArrayList<>();

        if (params.getSelector().equals("id")){
            Element web_data = document.getElementById(params.getName());
            // getElementById returns null when no element has this id
            if (web_data != null){
                results.add(web_data.text());
            }
        }
        if (params.getSelector().equals("class")){
            Elements web_data = document.getElementsByClass(params.getName());
            results = extractElementData(web_data);
        }
        if (params.getSelector().equals("tag")){
            Elements web_data = document.getElementsByTag(params.getName());
            results = extractElementData(web_data);
        }

        return new ResponseEntity<>(results,HttpStatus.OK);
    }
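The method above handles id, class, and tag in three separate branches, and an unknown selector value quietly returns an empty list. Here is a rough sketch of another option, assuming you would rather build one CSS query and pass it to document.select(...), the same call used for the a and img tags earlier. SelectorMapper and toCssQuery are hypothetical names, and the sketch does not escape special characters in the name, so it only suits simple ids and class names.

```java
// Hypothetical helper for illustration; not part of the original project.
class SelectorMapper {

    // Maps the selector type and name from the request to a CSS query string.
    static String toCssQuery(String selector, String name) {
        if (selector == null || selector.isEmpty() || name == null || name.isEmpty()) {
            throw new IllegalArgumentException("selector and name are required");
        }
        switch (selector) {
            case "id":
                return "#" + name;   // e.g. #main
            case "class":
                return "." + name;   // e.g. .title
            case "tag":
                return name;         // e.g. h1
            default:
                throw new IllegalArgumentException("Unsupported selector: " + selector);
        }
    }

    public static void main(String[] args) {
        System.out.println(toCssQuery("id", "main"));     // prints #main
        System.out.println(toCssQuery("class", "title")); // prints .title
        System.out.println(toCssQuery("tag", "h1"));      // prints h1
    }
}
```

With this approach an unsupported selector fails loudly instead of returning an empty result, which can make mistakes in the request easier to spot.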

So, finally, the complete DataScraperController class will be:

DataScraperController.java

package com.webs.controller;


import com.webs.exception.ApplicationException;
import com.webs.utils.Params;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;


@RestController
@RequestMapping("api/v1/data-scraper")
public class DataScraperController {




    @GetMapping("/get")
    @CrossOrigin
    public ResponseEntity<?> getData(@RequestParam("url") String url, Params params) throws IOException {
        if (url.isEmpty()){
            throw new ApplicationException("URL Can't be Empty");
        }

        // Check for null first: calling isEmpty() on a missing parameter would throw a NullPointerException
        if(params.getSelector() == null || params.getSelector().isEmpty()){
            throw new ApplicationException("Tag Selector Name Can't be Empty");
        }

        if(params.getName() == null || params.getName().isEmpty()){
            throw new ApplicationException("Tag Name Can't be Empty");
        }

        Document document = Jsoup.connect(url).get();
        List<String> results = new ArrayList<>();

        if (params.getSelector().equals("id")){
            Element web_data = document.getElementById(params.getName());
            // getElementById returns null when no element has this id
            if (web_data != null){
                results.add(web_data.text());
            }
        }
        if (params.getSelector().equals("class")){
            Elements web_data = document.getElementsByClass(params.getName());
            results = extractElementData(web_data);
        }
        if (params.getSelector().equals("tag")){
            Elements web_data = document.getElementsByTag(params.getName());
            results = extractElementData(web_data);
        }

        return new ResponseEntity<>(results,HttpStatus.OK);
    }

    private List<String> extractElementData(Elements element){
        List<String> data = new ArrayList<>();
        for (Element web_item : element) {
            String str = removeQuotationFromText(web_item.text());
            data.add(str);
        }

        return data;
    }

    @GetMapping("/get/images")
    @CrossOrigin
    public ResponseEntity<?> getImages(@RequestParam("url") String url) throws IOException {

        if (url.isEmpty()){
            throw new ApplicationException("URL Can't be Empty");
        }

        Document doc = getDocument(url);
        Elements web_data = doc.select("img");

        Map<String,String> data = new HashMap<>();

        for (Element web_item : web_data) {
            String link = web_item.attr("src");
            String text = web_item.attr("alt");
            data.put(link,text);
        }
        return new ResponseEntity<>(data,HttpStatus.OK);

    }
    @GetMapping("/get/urls")
    @CrossOrigin
    public ResponseEntity<?> getDataByTagName(@RequestParam("url") String url) throws IOException {

        if (url.isEmpty()){
            throw new ApplicationException("URL Can't be Empty");
        }

        Document doc = getDocument(url);
        Elements web_data = doc.select("a");

        Map<String,String> data = new HashMap<>();

        for (Element web_item : web_data) {
            String link = web_item.attr("href");
            String text = web_item.text();
            data.put(link,text);
        }
        return new ResponseEntity<>(data,HttpStatus.OK);
    }

    private String removeQuotationFromText(String str){
        return str.replaceAll(" \" "," ");
    }

    private Document getDocument(String url) throws IOException {
        return Jsoup.connect(url).get();
    }
}

And finally, the application.properties file will be:


spring.application.name=data-scraper

server.port=9999

Now run the project and test the API. I hope it works fine; if it does not, feel free to comment.
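When you test the endpoints, remember that the target website address travels inside a query parameter, so it should be URL-encoded. Here is a small sketch of building a request URL for the API, assuming the app runs locally on port 9999 as configured above. ScraperRequestUrl and build are hypothetical names used only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the original project.
class ScraperRequestUrl {

    // Assumes the application runs locally with server.port=9999
    static final String BASE = "http://localhost:9999/api/v1/data-scraper";

    // Builds the request URL for an endpoint path such as "/get/urls" or "/get/images".
    static String build(String path, String targetUrl) {
        try {
            // Encode the target URL so characters like ':' and '/' survive as a query value
            return BASE + path + "?url=" + URLEncoder.encode(targetUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/get/urls", "https://example.com/"));
        // prints http://localhost:9999/api/v1/data-scraper/get/urls?url=https%3A%2F%2Fexample.com%2F
    }
}
```

You can paste the printed URL into a browser or any HTTP client. For the /get endpoint, the selector and name parameters would be appended in the same way.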

In the next part, we will build the frontend. Stay tuned!
