Lightweight web crawler for finding broken links in Java using BFS.

A graph is a set of vertices connected by edges. As simple as that sounds, graphs are among the most widely used data structures in computer science; their applications range from GPS maps to mapping relationships, and they also appear in areas such as artificial intelligence.

A graph comes in two variations: undirected and directed. As the name implies, in an undirected graph edges are presumed to be two-way, whereas in a directed graph we have to explicitly specify the direction of each edge. It is natural to model the Internet as a directed graph where the vertices are URLs and the edges represent links between pages. For example, my website wizardofbigdata.wordpress.com has a reference to datascience.ibm.com, and it can be visualized as following.

[Figure: directed edge from wizardofbigdata.wordpress.com to datascience.ibm.com]

The two most popular graph traversal techniques are DFS (depth-first search) and BFS (breadth-first search), and both have their own well-known areas of application. In this blog I'll be using BFS, which in simple terms visits all the nodes at the same level before moving to the next level down. BFS is used in GPS maps to find shortest paths and is also used in web crawlers.

Let's write a simple web crawler program that uses BFS to find broken links in my blog or in the sites it references. I won't be discussing the BFS algorithm itself here, only the snippets of my code that are relevant to the use case of finding broken links.

The complete code can be downloaded from my GitHub repo.

Using my blog site as the root, I look for both http and https URLs using a Java regex pattern:

String regexpattern = "(http|https)://(\\w+\\.)*(\\w+)";
Pattern pattern = Pattern.compile(regexpattern);
Matcher matcher = pattern.matcher(htmlBody);

As URLs are discovered, I add each one to a Set if it is not already present; this prevents the code from revisiting sites. New URLs are also added to the urls queue for the next BFS level.

while (matcher.find()) {
    String w = matcher.group();
    if (!visited.contains(w)) {
        visited.add(w);   // remember the URL so it is never crawled twice
        urls.add(w);      // enqueue it for the next BFS level
    }
}
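
For context, here is a minimal sketch of how the surrounding BFS loop might be put together. It is an illustration of the technique rather than the exact code in the repo, and fetchHtml is a helper sketched a little further below.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the surrounding BFS loop: visit every URL at the current level
// before moving on to the links it discovers.
void crawl(String rootUrl) {
    Pattern pattern = Pattern.compile("(http|https)://(\\w+\\.)*(\\w+)");
    Queue<String> urls = new ArrayDeque<>();
    Set<String> visited = new HashSet<>();
    urls.add(rootUrl);
    visited.add(rootUrl);

    while (!urls.isEmpty()) {
        String current = urls.poll();            // FIFO order gives breadth-first traversal
        String htmlBody = fetchHtml(current);    // helper sketched further below
        Matcher matcher = pattern.matcher(htmlBody);
        while (matcher.find()) {
            String w = matcher.group();
            if (!visited.contains(w)) {          // skip URLs we have already seen
                visited.add(w);
                urls.add(w);
            }
        }
    }
}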

Finally, the part that marks a URL as broken is shown below: if I hit an UnknownHostException or FileNotFoundException while opening the URL, I assume the link is broken.

catch (UnknownHostException e) {
    System.out.println("Broken Url Found:" + url);
}
catch (FileNotFoundException e) {
    System.out.println("Broken Url Found:" + url);
}
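
For completeness, here is one hedged way the fetchHtml helper assumed above could be written so that broken links surface as these exceptions: openStream() resolves and fetches the URL, and a 404 from an HTTP URL is reported by HttpURLConnection as a FileNotFoundException. The original code may structure this differently.

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

// Sketch: fetch a page so that broken links surface as exceptions.
static String fetchHtml(String url) {
    StringBuilder htmlBody = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new URL(url).openStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            htmlBody.append(line).append('\n');
        }
    } catch (UnknownHostException e) {
        System.out.println("Broken Url Found:" + url);   // host does not resolve
    } catch (FileNotFoundException e) {
        System.out.println("Broken Url Found:" + url);   // server answered with 404
    } catch (IOException e) {
        System.out.println("Broken Url Found:" + url);   // other I/O failures, treated the same way here
    }
    return htmlBody.toString();
}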

Here is sample output from the program; it found a broken URL while crawling, one that is inaccessible from the internet.

https://wizardofbigdata.files.wordpress.com
https://developer.ibm.com
http://www.ibm.com
https://docs.oracle.com
Broken Url Found:http://bdavm040.svl.ibm.com

Further enhancements:

We can add logic to control the depth to be traversed, to prevent the program from running indefinitely; a sketch of one approach follows.
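
A minimal sketch of such a depth limit, assuming the same urls queue from the BFS sketch above (maxDepth is an arbitrary illustrative value, not part of the current code):

// Drain the queue one BFS level at a time and stop after maxDepth levels.
int maxDepth = 3;
int depth = 0;
while (!urls.isEmpty() && depth < maxDepth) {
    int levelSize = urls.size();            // URLs discovered at the previous level
    for (int i = 0; i < levelSize; i++) {
        String current = urls.poll();
        // fetch, match and enqueue unvisited links exactly as before
    }
    depth++;                                // one complete level has been processed
}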


To summarize, in this blog I have tried to highlight how, with the correct choice of algorithm and a few lines of code, a seemingly complex problem can be solved easily.

Create a utility to convert Microsoft Excel spreadsheets to CSV format.

I came across a scenario where I had to convert Excel spreadsheets, in both xls and xlsx formats, into CSV programmatically.

My search for a library to achieve this landed me at Apache POI, a Java API built for handling Microsoft document formats.

The site has an amazing collection of examples for each document type. For my scenario, i.e. handling Excel sheets, a list of examples is available at the following link: https://poi.apache.org/spreadsheet/examples.html

I used the example ToCSV (http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/ss/examples/ToCSV.java) and created a Maven project. The project can be checked out from my git repo: https://github.com/bharathdcs/exceltocsv
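
For readers who only need the core idea, here is a minimal sketch of reading a workbook with POI and emitting CSV rows. It leans on the generic WorkbookFactory and DataFormatter APIs and is not the ToCSV example itself.

import java.io.File;
import java.io.PrintWriter;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class SimpleExcelToCsv {
    public static void main(String[] args) throws Exception {
        // WorkbookFactory transparently handles both .xls (HSSF) and .xlsx (XSSF).
        try (Workbook workbook = WorkbookFactory.create(new File(args[0]));
             PrintWriter out = new PrintWriter(args[1])) {
            DataFormatter formatter = new DataFormatter();   // renders cells as Excel displays them
            Sheet sheet = workbook.getSheetAt(0);            // first sheet only, for brevity
            for (Row row : sheet) {
                StringBuilder line = new StringBuilder();
                for (Cell cell : row) {
                    if (line.length() > 0) line.append(',');
                    line.append(formatter.formatCellValue(cell));
                }
                out.println(line);
            }
        }
    }
}

Note that this naive version does not quote cell values containing commas or line breaks and walks only the first sheet; the ToCSV example handles those cases properly.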


Hadoop file upload utility for secure clusters running on the cloud, using WebHDFS and the Knox gateway.

Hadoop clusters hosted on a secure cloud do not allow direct connections to HDFS ports. Instead, connections from outside applications have to be routed via a gateway like Knox.

Apache Knox Gateway provides authentication support and a single REST interface to access several big data services, namely HDFS, Ambari and Hive. Hence Knox is a natural choice for routing external traffic to big data clusters.

External applications can connect to the Knox Gateway using any HTTP client and interact with Hadoop by invoking REST API calls. More details about WebHDFS can be found in the Apache Hadoop documentation:
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html

In this article I have made an attempt to show how to build your own upload manager for uploading files to HDFS. The logic can be embedded in any desktop or mobile application, allowing users to interact with their big data cluster remotely.

File upload utility.

The utility uses the Apache HttpClient library, release 4.5.3, and supports file sizes of up to 5 MB.

The project can be downloaded from my git repo https://github.com/bharathdcs/hadoop_fileuploader

The application expects a properties file as input; the format and a sample are shown below:

knoxHostPort=bluemixcluster.ibm.com:8443
knoxUsername=guest
knoxPassword=guest-password
hdfsFileUrl=/tmp/hdfsfile.txt
dataFile=/Users/macadmin/Desktop/input.txt

Most of the parameters are self-explanatory.

The file is created with a default permission of world-writable. If a different permission is desired, pass the octal value using the following parameter in the properties file:

hdfsFilePermission=440

The steps to run the application are as follows:

mvn exec:java -Dexec.mainClass="twc.webhdfs.App" -Dexec.args="/Users/macadmin/Desktop/input.properties"

Once finished, you should see the following message confirming the file creation:

File creation successful
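
Under the hood, the upload boils down to a WebHDFS CREATE call routed through Knox. The following is a hedged sketch of that core logic with Apache HttpClient 4.5, using the sample property values from above; the Knox topology name ("default"), the two-step redirect handling and the assumption that the gateway certificate is trusted by the JVM are all assumptions about a typical setup, not necessarily how the utility in the repo is written.

import java.io.File;
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.FileEntity;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class WebHdfsUploadSketch {
    public static void main(String[] args) throws Exception {
        String knoxHostPort = "bluemixcluster.ibm.com:8443";      // sample values from the properties file
        String hdfsFileUrl  = "/tmp/hdfsfile.txt";
        String dataFile     = "/Users/macadmin/Desktop/input.txt";

        BasicCredentialsProvider creds = new BasicCredentialsProvider();
        creds.setCredentials(AuthScope.ANY,
                new UsernamePasswordCredentials("guest", "guest-password"));

        try (CloseableHttpClient client =
                     HttpClients.custom().setDefaultCredentialsProvider(creds).build()) {
            // Step 1: ask WebHDFS (through the assumed "default" Knox topology) where to
            // send the data. WebHDFS answers with a 307 redirect and a Location header.
            // A custom permission could be requested by appending e.g. "&permission=440".
            String createUrl = "https://" + knoxHostPort + "/gateway/default/webhdfs/v1"
                    + hdfsFileUrl + "?op=CREATE&overwrite=true";
            HttpPut locate = new HttpPut(createUrl);
            HttpResponse redirect = client.execute(locate);
            String dataUrl = redirect.getFirstHeader("Location").getValue();
            locate.releaseConnection();

            // Step 2: PUT the file body to the redirected location.
            HttpPut upload = new HttpPut(dataUrl);
            upload.setEntity(new FileEntity(new File(dataFile), ContentType.APPLICATION_OCTET_STREAM));
            HttpResponse result = client.execute(upload);
            System.out.println(result.getStatusLine().getStatusCode() == 201
                    ? "File creation successful" : "File creation failed");
        }
    }
}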


How to merge Kerberos Keytabs without increasing kvno.

In scenarios that require the same principal to be part of multiple keytabs, using xst or ktadd will increment the kvno, making previously generated keytabs invalid.

For example, the HTTP service principal is used for SPNEGO authentication by both the Hadoop services and the HBase REST service. Suppose the HDFS and HBase keytabs are created individually as shown below:

xst -k spnego.service.keytab HTTP/server1.example.com@EXAMPLE.COM

xst -k hbase.service.keytab HTTP/server1.example.com@EXAMPLE.COM

The keytab for Hadoop will now hold an old version of the HTTP principal's key; this can be confirmed by running klist:

[root@server1 keytabs]# klist -k hbase.service.keytab
Keytab name: FILE:hbase.service.keytab
KVNO Principal
---- --------------------------------------------------------------------------
2 HTTP/server1.example.com@EXAMPLE.COM

[root@server1 keytabs]# klist -k spnego.service.keytab
Keytab name: FILE:spnego.service.keytab
KVNO Principal
---- --------------------------------------------------------------------------
1 HTTP/server1.example.com@EXAMPLE.COM

Once the keytabs are configured for HDFS, the WebHDFS service fails to start because the keytab is no longer valid. If you try to kinit using the HDFS SPNEGO keytab, you will notice the following error, which is an indication that the kvno has changed:

kinit -kt spnego.service.keytab HTTP/server1.example.com
kinit: Password incorrect while getting initial credentials

To list the current kvno stored by the Kerberos server, run the following commands on the Kerberos server:

kadmin.local

kadmin.local:  getprinc HTTP/server1.example.com
Principal: HTTP/server1.example.com@EXAMPLE.COM
Expiration date: [never]
Last password change: Wed Mar 29 18:24:35 PDT 2017
Password expiration date: [none]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Wed Mar 29 18:24:35 PDT 2017 (nn/admin@EXAMPLE.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 4
Key: vno 2, aes256-cts-hmac-sha1-96, no salt
Key: vno 2, aes128-cts-hmac-sha1-96, no salt
Key: vno 2, des3-cbc-sha1, no salt
Key: vno 2, arcfour-hmac, no salt
MKey: vno 1
Attributes:
Policy: [none]

Best practice for creating keytabs with overlapping principals.

1. Generate a keytab containing only the overlapping principal:

xst -k spnego.service.keytab HTTP/server1.example.com@EXAMPLE.COM

2. For all other keytabs that need this principal, use the ktutil command as shown below:

ktutil
ktutil:  rkt spnego.service.keytab
ktutil:  rkt hbase.service.keytab
ktutil:  wkt hbase.service.keytab
ktutil:  exit

rkt loads all the principals in a keytab into the buffer, and wkt writes a new keytab containing all the principals currently in the buffer.

After executing the ktutil commands, hbase.service.keytab will contain the HTTP principal with the same kvno:

[root@hives1 ~]# klist -k hbase.service.keytab
Keytab name: FILE:hbase.service.keytab
KVNO Principal
---- --------------------------------------------------------------------------
1 HTTP/server1.example.com@EXAMPLE.COM

Custom Spark Streaming adapter for reading weather data from the IBM Weather Company API.

Many real-world pricing models use weather as one of their dimensions, since weather patterns can influence pricing decisions for commodities. But how do you obtain weather data in real time to use in Spark analytics, or persist it for offline BI analysis?

In this article I will discuss the IBM Weather Company API from Bluemix and how to read data from the API for real-time analytics in Spark.

IBM Bluemix offers a free subscription plan for the Weather Company API: https://console.ng.bluemix.net/catalog/


The TWC API has various REST URLs that can be used to access weather data related to a particular geo-location, post code and so on. More details are documented here.

Reading the JSON string in Spark Streaming.

Spark Streaming does not have out-of-the-box support for reading JSON from a remote REST API, so users need to create a custom adapter. A custom adapter extends the Receiver class from Spark Streaming and overrides its onStart() and onStop() methods; a sketch of such a receiver follows.
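
The following is a minimal sketch of such a receiver; fetchJson is a hypothetical helper for the HTTP call, and the real WcReceiver in the repo may be organised differently.

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// Sketch of a custom receiver that polls a REST endpoint and pushes each
// JSON response into Spark Streaming.
public class WcReceiver extends Receiver<String> {
    private final String weatherUrl;

    public WcReceiver(String weatherUrl) {
        super(StorageLevel.MEMORY_AND_DISK_2());
        this.weatherUrl = weatherUrl;
    }

    @Override
    public void onStart() {
        // Spark expects onStart() to return quickly, so receive on a separate thread.
        new Thread(this::receive).start();
    }

    @Override
    public void onStop() {
        // Nothing to clean up: the receive loop checks isStopped() and exits on its own.
    }

    private void receive() {
        try {
            while (!isStopped()) {
                String json = fetchJson(weatherUrl);   // call the TWC REST API over HTTP
                store(json);                           // hand the record to Spark Streaming
                Thread.sleep(20_000);                  // poll roughly once per batch interval
            }
        } catch (Exception e) {
            restart("Error polling weather API", e);   // let Spark restart the receiver
        }
    }

    private String fetchJson(String url) throws Exception {
        /* hypothetical HTTP GET returning the JSON body */
        return "";
    }
}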

Once the adapter is created, we use it in the JavaStreamingContext as shown in the following snippet:

JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(20));

JavaReceiverInputDStream<String> lines = ssc.receiverStream(new WcReceiver(args[0]));

Here WcReceiver is my custom HTTP adapter for reading the weather JSON data. The complete project is available in my git repo.

The JSON data is then persisted to HDFS so that external Big SQL or Hive tables can be mounted on it and used as a dimension for analytics.

Steps to run the project

  • Compile the program using the following Maven command:

           mvn compile package

  • Copy the generated jar file from the target folder to the Spark cluster:
    target/wc-0.0.1-SNAPSHOT.jar
  • To test the program, launch it using spark-submit in local mode. The program accepts two parameters, the weather data API URL and the HDFS output folder, as shown below.

spark-submit --class "spark.wc.AnalyzeWeather" --master local[4] wc-0.0.1-SNAPSHOT.jar <weather data url> /tmp/output


Realtime anomaly detection in timeseries data

How do you detect anomalies or deviations in real-time multivariate time-series data?

IBM Streams has a time-series toolkit that allows users to analyse real-time time-series data.

One important real-time use case is anomaly detection, which is applied across several domains and industries.

More technical details can be found at the following link:

http://www.ibm.com/developerworks/library/bd-streamsanomalydetection/

Monitoring memory usage for Bigdata processes using attach API and notifying users

Recently customers have asked me for a way to monitor heap memory growth and alert when it exceeds a certain threshold. Ambari provides robust alerting and monitoring capabilities, but its support for monitoring heap sizes of services like HiveServer2 is limited. Although several monitoring and notification applications are available on the market, here I make an attempt to highlight how to build your own monitoring and notification application with unlimited room for customization.

Technologies used

1. Java attach API

2. Apache Commons Email

3. Maven

4. OpenJDK 8

The project is uploaded to my git repo and can be accessed using the following link https://github.com/bharathdcs/attach_api_notification

The Java attach API allows a user program to attach to running JVMs. The JVM could be hosting a NodeManager, HiveServer2 or any other process. One advantage of the attach API is that you can retrieve the JMX URL from the JVM and fetch various statistics, one of which is memory consumption, as shown in the following code snippet:

VirtualMachine virtualMachine = VirtualMachine.attach(virtualMachineDescriptor);
String connectorAddress = virtualMachine.getAgentProperties()
        .getProperty("com.sun.management.jmxremote.localConnectorAddress");  // null until the JMX agent is started
JMXConnector c = JMXConnectorFactory.connect(new JMXServiceURL(connectorAddress));
Object heapUsage = c.getMBeanServerConnection()
        .getAttribute(new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");

Using the memory object (bean) returned, we can retrieve stats like used, committed and max memory. More details about MemoryMXBean can be found in the Java documentation.
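
To turn that attribute into numbers and drive an alert, here is a minimal sketch, assuming the attribute comes back as a JMX CompositeData; thresholdPct and sendAlertEmail are hypothetical names standing in for the properties-driven logic in the repo.

import java.lang.management.MemoryUsage;
import javax.management.openmbean.CompositeData;

// heapUsage is the attribute fetched in the snippet above.
MemoryUsage heap = MemoryUsage.from((CompositeData) heapUsage);
long usedPct = heap.getUsed() * 100 / heap.getMax();   // percentage of the max heap in use
if (usedPct > thresholdPct) {
    sendAlertEmail(usedPct);   // hypothetical helper built on Apache Commons Email
}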

Properties File for Parameters

Users specify the memory usage threshold for alerts and the SMTP server details for sending notifications via the properties file.

The following is a snippet from a sample properties file:

procId=3124
smtp_host=smtp.gmail.com
smtp_port=465
smpt_user=test@gmail.com
stmp_password=testing123
threshold=10
email=test@gmail.com

procId is the process ID of the JVM to be attached to and monitored**.

Provide your SMTP server details using the smtp_* variables.

threshold specifies the percentage used to trigger the alert: a notification is sent when memory usage crosses (threshold/100)*totalHeap.

Steps to run the Application

The application uses maven build framework.

1. Compile using "mvn clean compile assembly:single"

2. Run the application using the following syntax:

java -cp ${JAVA_HOME}/lib/tools.jar:heap-0.0.1-SNAPSHOT-jar-with-dependencies.jar monitor.heap.App sample.properties

The application monitors memory usage continuously, and when usage exceeds the configured threshold an alert email is sent to the configured inbox.

Possible Future Enhancements

1. Integrating with mobile notification APIs

2. Publishing the memory metrics as JSON, enabling JavaScript widgets to chart them.

** Users can only attach to JVMs owned by them. For example, the hive user can only attach to Hive JVMs.

Publish Ambari Alerts to Google Cloud Messaging using Ambari Alert Dispatcher

Introduction

Apache Ambari is a powerful open source cluster management tool, which simplifies deploying and managing Big Data clusters.

Ambari Server has built-in alert management, which triggers alerts based on the conditions set.

The simplest example is the alert defined for the DataNode process being unreachable; it is triggered whenever the DataNode process is down.

Alerts are good, but what use is an alert unless end users can be notified automatically? I started exploring Ambari to see whether there is a way to define a custom dispatcher for alert events.

And voila, I discovered that Ambari has an in-built "Alert Dispatcher" which has to be activated, and we can define a callback script that will be invoked by the dispatcher.

Here comes the interesting part: how I went about using the Ambari alert dispatcher to trigger Google Cloud Messaging notifications for Android.

Let's look at how to enable the alert dispatcher on the Ambari server.

  • Before creating an alert dispatcher target on the Ambari server, check whether a default alert dispatcher target is already defined, using the following curl command:

curl -u admin:password -i -H 'X-Requested-By: ambari' http://ambariserver.domain.com:8080/api/v1/alert_targets

Output should be empty JSON.

  • Define your own alert dispatcher target using the following curl statement; the settings are part of the HTTP payload.

curl -u admin:password -i -H 'X-Requested-By: ambari' -X POST -d '{"Body":{"AlertTarget": {"name": "syslogger","description": "Syslog Target","notification_type": "ALERT_SCRIPT","global": true}}}' http://ambariserver.domain.com:8080/api/v1/alert_targets

  • Now that the target is set, let us bind the dispatcher script that holds our code to the Ambari server.

In the Ambari server properties file (/etc/ambari-server/conf/ambari.properties), define the following property; it should point to a Python or bash script that has your actions defined.

notification.dispatch.alert.script=/var/lib/ambari-server/resources/scripts/gcmmodule.py

I used this property to invoke a GCM module written in Python.

** If you want to use my Python code as an example, it is available in the following git repository:

https://github.com/bharathdcs/ambari_gcm_alert_dispatcher

  • Restart Ambari server

ambari-server restart

Now it's time to put the module to the test: trigger an alert by stopping the HDFS service from Ambari.

Congratulations, we just published Ambari alerts to Android applications.

Port forwarding in a multi-homed Big data cluster.

It is a recommended practice to deploy a big data cluster in a multi-homed network. Multi-homing is the process of connecting a node to two different networks. The big data nodes are connected to a private network, and a management node interfaces with both the private and public/corporate networks, as shown in the following figure.

[Figure: multi-homed cluster with a management node bridging the private and public networks]
There are two reasons for multi-homing a big data cluster:
1. To reduce congestion on the corporate network, since big data jobs are network intensive.
2. By isolating the network, data flow is secured within the private network.
The client components are deployed on the management node so that end users can log in and run their analytics.

A service running on the management node can bind to both interfaces, like ZooKeeper listening on 0.0.0.0, or it can bind only to the private interface, like an HBase master running on 172.121.21.2, which makes it inaccessible from the public network unless you log in to the management node.

Let's take a scenario where an HBase Java client on the corporate/public network wants to connect to an HBase master bound to the private network interface: the connection fails since the client has no visibility into the private network.
To overcome this drawback, we can enable port forwarding on the management node using the iptables command.

The following command has been tested on Red Hat Linux 6.x:

iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 60000 -j DNAT --to 172.16.150.173:60000

The command redirects traffic arriving on the public Ethernet interface eth1 at port 60000 to port 60000 on the private IP 172.16.150.173.

If you are not sure which interface is the public one, use the following command to identify it:

ip route show | grep default