A lightweight web crawler in Java for finding broken links using BFS

A graph is a set of vertices connected by edges. As simple as that sounds, graphs are among the most widely used data structures in computer science, with applications ranging from GPS maps to modelling relationships to areas such as artificial intelligence.

A graph comes in two variations: undirected and directed. As the name implies, in an undirected graph the edges are presumed to be two-way, whereas in a directed graph we have to explicitly specify the direction of each edge. It is also natural to view the Internet as a directed graph where the vertices are URLs and the edges represent links between them. For example, my website wizardofbigdata.wordpress.com has a reference to datascience.ibm.com, which can be visualized as follows.

 

[Figure: wizardofbigdata.wordpress.com with a directed edge pointing to datascience.ibm.com]

There are two popular graph traversal techniques: DFS (depth-first search) and BFS (breadth-first search). Both have their own well-known areas of application. In this blog I'll be using BFS, which, in simple terms, visits all the nodes at one level before moving to the next level down. BFS is used in GPS maps to find the shortest path and is also used in web crawlers.
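As a minimal illustration (not part of the crawler itself), here is a sketch of a directed graph of URLs stored as an adjacency map and a level-by-level BFS over it; the URLs and map entries are placeholders chosen to mirror the example above:

import java.util.*;

public class BfsDemo
{
    public static void main(String[] args)
    {
        // Directed graph: each URL maps to the URLs it links to (illustrative data).
        Map<String, List<String>> web = new HashMap<>();
        web.put("https://wizardofbigdata.wordpress.com",
                Arrays.asList("https://datascience.ibm.com"));
        web.put("https://datascience.ibm.com", Collections.<String>emptyList());

        // BFS: a queue holds the frontier, a set remembers visited vertices,
        // so every vertex at depth d is printed before any vertex at depth d+1.
        Queue<String> queue = new LinkedList<>();
        Set<String> visited = new HashSet<>();
        String root = "https://wizardofbigdata.wordpress.com";
        queue.add(root);
        visited.add(root);
        while (!queue.isEmpty())
        {
            String current = queue.poll();
            System.out.println(current);
            for (String next : web.getOrDefault(current, Collections.<String>emptyList()))
            {
                if (visited.add(next)) // add() returns false if already seen
                {
                    queue.add(next);
                }
            }
        }
    }
}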

Let's write a simple web crawler program that uses BFS to find out whether there are any broken links in my blog site or in the sites it references. I won't be discussing the BFS algorithm itself here, only the snippets of my code that are relevant to the use case of finding broken links.

The complete code can be downloaded from my GitHub repo.

Using my blog site as the root, I look for both http and https URLs in the page body using a Java regex pattern. Note that this pattern captures only the scheme and host, which is enough for checking whether a site is reachable:

String regexpattern = "(http|https)://(\\w+\\.)*(\\w+)";
Pattern pattern = Pattern.compile(regexpattern);
Matcher matcher = pattern.matcher(htmlBody);
while (matcher.find())

As URLs are discovered, I add each one to a Set if it is not already there; this prevents the code from revisiting sites it has already seen.

while (matcher.find())
{
    String w = matcher.group();
    if (!visited.contains(w))
    {
        visited.add(w);
        urls.add(w);
    }
}
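For context, this snippet runs inside a BFS loop that pulls the next URL off the queue, downloads its HTML, and feeds the body to the matcher. Below is my sketch of that outer loop; the helper readBody() and the variable rootUrl are my own illustrative names, not necessarily the ones used in the repository:

// Sketch of the outer BFS loop (illustrative names).
Queue<String> urls = new LinkedList<>();
Set<String> visited = new HashSet<>();
urls.add(rootUrl);
visited.add(rootUrl);
while (!urls.isEmpty())
{
    String url = urls.poll();
    String htmlBody = readBody(url);          // download the page (hypothetical helper)
    Matcher matcher = pattern.matcher(htmlBody);
    while (matcher.find())
    {
        String w = matcher.group();
        if (!visited.contains(w))
        {
            visited.add(w);
            urls.add(w);
        }
    }
}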

Finally, the part that marks a URL as broken is shown below: if fetching the page throws an UnknownHostException or a FileNotFoundException, I assume the link is broken.

catch (UnknownHostException e)
{
    System.out.println("Broken Url Found:" + url);
}
catch (FileNotFoundException e)
{
    System.out.println("Broken Url Found:" + url);
}
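These catch blocks only make sense around the code that actually opens the URL, so for completeness here is a sketch of how the fetch and the exception handling might fit together. The method name readBody and the stream handling are my assumptions, not necessarily how the repository implements it:

import java.io.*;
import java.net.*;

// Hypothetical helper: downloads a page body and reports broken links.
static String readBody(String url)
{
    StringBuilder body = new StringBuilder();
    try
    {
        URLConnection connection = new URL(url).openConnection();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null)
        {
            body.append(line).append("\n");
        }
        reader.close();
    }
    catch (UnknownHostException e)
    {
        System.out.println("Broken Url Found:" + url);
    }
    catch (FileNotFoundException e)
    {
        System.out.println("Broken Url Found:" + url);
    }
    catch (IOException e)
    {
        // Other I/O problems (timeouts, malformed URLs, etc.) could be logged here.
    }
    return body.toString();
}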

Here is a sample output from the program. While crawling, it found one broken URL, which is inaccessible from the internet:

https://wizardofbigdata.files.wordpress.com
https://developer.ibm.com
http://www.ibm.com
https://docs.oracle.com
Broken Url Found:http://bdavm040.svl.ibm.com

Further enhancements:

We can add logic to control the depth of the traversal, to prevent the program from running indefinitely.
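One way to do that, sketched below with my own illustrative names, is to track each URL's depth alongside it and stop expanding links once a maximum depth is reached:

// Sketch: depth-limited BFS (illustrative names, not from the repository).
int maxDepth = 2;
Queue<String> urls = new LinkedList<>();
Map<String, Integer> depth = new HashMap<>();
urls.add(rootUrl);
depth.put(rootUrl, 0);
while (!urls.isEmpty())
{
    String url = urls.poll();
    if (depth.get(url) >= maxDepth)
    {
        continue; // do not expand links below this level
    }
    // ... fetch the page and run the matcher as before; for each new link w:
    // depth.put(w, depth.get(url) + 1);
    // urls.add(w);
}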

 

To summarize: in this blog I have tried to show that, with the right choice of algorithm and a few lines of code, a seemingly complex problem can be solved easily.

Create a utility to convert Microsoft Excel spreadsheets to CSV format

I came across a scenario where I had to programmatically convert Excel spreadsheets, in both xls and xlsx formats, into CSV.

My search for a library to achieve this led me to Apache POI, a Java API built for handling Microsoft document formats.
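If you pull POI into a Maven project (as I did), the usual coordinates are the ones below; the version shown is simply one from around the time of writing and may need updating:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.16</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.16</version>
</dependency>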

The site has an excellent collection of examples for each document type. For my scenario, i.e. handling Excel sheets, the list of examples is available at https://poi.apache.org/spreadsheet/examples.html.

I used the ToCSV example (http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/ss/examples/ToCSV.java) and created a Maven project around it. The project can be checked out from my git repo: https://github.com/bharathdcs/exceltocsv.
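For a rough idea of what the conversion looks like with POI, here is a minimal sketch of my own (the real ToCSV example is more thorough, for instance it escapes embedded commas and quotes, which this sketch does not):

import java.io.*;
import org.apache.poi.ss.usermodel.*;

public class SimpleExcelToCsv
{
    public static void main(String[] args) throws Exception
    {
        // WorkbookFactory handles both .xls and .xlsx transparently.
        Workbook workbook = WorkbookFactory.create(new File(args[0]));
        DataFormatter formatter = new DataFormatter();
        PrintWriter out = new PrintWriter(new FileWriter(args[1]));
        Sheet sheet = workbook.getSheetAt(0);
        for (Row row : sheet)
        {
            StringBuilder line = new StringBuilder();
            for (int c = 0; c < row.getLastCellNum(); c++)
            {
                if (c > 0)
                {
                    line.append(",");
                }
                Cell cell = row.getCell(c);
                if (cell != null)
                {
                    // DataFormatter renders the cell the way Excel displays it.
                    line.append(formatter.formatCellValue(cell));
                }
            }
            out.println(line.toString());
        }
        out.close();
        workbook.close();
    }
}

Assuming such a class, it could be run as: java SimpleExcelToCsv input.xlsx output.csv.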