A lightweight web crawler for finding broken links in Java using BFS.

A graph is a set of vertices interconnected by edges. As simple as that sounds, graphs are among the most widely used data structures in computer science; their applications range from GPS maps to modelling relationships, and they also appear in areas such as artificial intelligence.

A graph comes in two variations: undirected and directed. As the name implies, in an undirected graph the edges are presumed to be two-way, whereas in a directed graph we have to explicitly specify the direction of each edge. It is natural to treat the internet as a directed graph, where vertices are URLs and edges represent links between them. For example, my website wizardofbigdata.wordpress.com has a reference to datascience.ibm.com, which can be visualized as follows.

 

[Figure: directed graph showing wizardofbigdata.wordpress.com linking to datascience.ibm.com]
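To make the directed-graph view concrete, here is a minimal sketch of how such a link graph could be represented in Java as an adjacency list (the class name and the entries in the map are purely illustrative):

import java.util.*;

public class LinkGraph
{
    public static void main(String[] args)
    {
        // Adjacency list: each URL (vertex) maps to the URLs it links to (outgoing edges)
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("https://wizardofbigdata.wordpress.com",
                Arrays.asList("https://datascience.ibm.com"));
        graph.put("https://datascience.ibm.com", Collections.emptyList());

        // Print each vertex together with its outgoing edges
        for (Map.Entry<String, List<String>> entry : graph.entrySet())
            System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}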

The two most popular graph traversal techniques are DFS (depth-first search) and BFS (breadth-first search), and each has its own well-known areas of application. In this blog I'll be using BFS, which in simple terms visits all the nodes at one level before moving to the next level down. BFS is used in GPS maps to find the shortest path, and it is also used in web crawlers.
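As a quick refresher, and purely as a generic sketch rather than the crawler code itself, BFS keeps a queue of nodes to visit and a set of nodes already seen, so every node is processed exactly once, level by level:

import java.util.*;

// Generic BFS over a graph given as an adjacency list (illustrative only)
static void bfs(Map<String, List<String>> graph, String root)
{
    Queue<String> queue = new LinkedList<>();
    Set<String> visited = new HashSet<>();
    queue.add(root);
    visited.add(root);

    while (!queue.isEmpty())
    {
        String node = queue.remove();          // next node of the current level
        System.out.println("Visiting: " + node);
        for (String neighbour : graph.getOrDefault(node, Collections.emptyList()))
        {
            if (!visited.contains(neighbour))  // enqueue unseen neighbours for the next level
            {
                visited.add(neighbour);
                queue.add(neighbour);
            }
        }
    }
}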

Let's write a simple web crawler program that uses BFS to find whether there are any broken links in my blog site or in the sites it references. I won't be discussing the BFS algorithm itself here, only the snippets of my code that are relevant to the use case of finding broken links.

The complete code can be downloaded from my GitHub repo.

Using my blog site as the root, I look for both http and https URLs using a Java regex pattern:

String regexpattern = "(http|https)://(\\w+\\.)*(\\w+)";
Pattern pattern = Pattern.compile(regexpattern);
Matcher matcher = pattern.matcher(htmlBody);
while (matcher.find())

As URLs are discovered, I add each one to a Set if it is not already there; this prevents the code from revisiting sites it has already seen.

while (matcher.find())
{
    String w = matcher.group();
    if (!visited.contains(w))
    {
        visited.add(w);
        urls.add(w);
    }
}

Finally, the part that marks a URL as broken is shown below: if I hit an UnknownHostException or a FileNotFoundException, I assume the URL is broken.

catch (UnknownHostException e)
{
    System.out.println("Broken Url Found:" + url);
}
catch (FileNotFoundException e)
{
    System.out.println("Broken Url Found:" + url);
}
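For context, these catch blocks sit around the code that opens and reads each URL. A condensed sketch of what that try/catch could look like is shown below; the helper name fetch and the final IOException handler are my own additions, so the actual code in the repo may differ:

import java.io.*;
import java.net.*;

// Hypothetical helper: returns the page body, or null if the URL turned out to be broken
static String fetch(String url)
{
    try
    {
        URL u = new URL(url);
        BufferedReader in = new BufferedReader(new InputStreamReader(u.openStream()));
        StringBuilder htmlBody = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null)   // read the page so its links can be extracted
            htmlBody.append(line).append('\n');
        in.close();
        return htmlBody.toString();
    }
    catch (UnknownHostException e)               // the host name does not resolve
    {
        System.out.println("Broken Url Found:" + url);
    }
    catch (FileNotFoundException e)              // the host is reachable but the page is gone (e.g. 404)
    {
        System.out.println("Broken Url Found:" + url);
    }
    catch (IOException e)                        // any other I/O failure
    {
        System.out.println("Could not read:" + url);
    }
    return null;
}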

Here is a sample output from the program; while crawling it found one broken URL, which is inaccessible from the internet.

https://wizardofbigdata.files.wordpress.com
https://developer.ibm.com
http://www.ibm.com
https://docs.oracle.com
Broken Url Found:http://bdavm040.svl.ibm.com

Further enhancements:

We can further add logic to control the depth to be traversed, to prevent the program from running indefinitely.
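One simple way to do that, sketched below rather than taken from the current repo, is to carry the depth alongside each URL in the BFS queue and stop expanding links once a maximum depth is reached (MAX_DEPTH and extractLinks are hypothetical names used only for illustration):

import java.util.*;

public class DepthLimitedCrawler
{
    static final int MAX_DEPTH = 2;   // assumed limit; tune to taste

    static void crawl(String root)
    {
        Queue<String> urls = new LinkedList<>();
        Queue<Integer> depths = new LinkedList<>();   // depth of each queued URL from the root
        Set<String> visited = new HashSet<>();

        urls.add(root);
        depths.add(0);
        visited.add(root);

        while (!urls.isEmpty())
        {
            String url = urls.remove();
            int depth = depths.remove();
            if (depth >= MAX_DEPTH)                   // do not expand links found at the maximum depth
                continue;
            for (String w : extractLinks(url))
            {
                if (!visited.contains(w))
                {
                    visited.add(w);
                    urls.add(w);
                    depths.add(depth + 1);
                }
            }
        }
    }

    // Hypothetical stand-in for the regex-based link extraction shown earlier
    static List<String> extractLinks(String url)
    {
        return Collections.emptyList();
    }
}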

 

To summarize, in this blog I have tried to highlight how, with the correct choice of algorithm and a few lines of code, a seemingly complex problem can be solved easily.
