Hadoop file upload utility for secure clusters running on cloud using webhdfs and Knox gateway.

Hadoop clusters hosted on a secure cloud, does not allow direct connections to HDFS ports. Instead connections from outside world applications has to be routed via a  gateway like Knox.

Apache Knox Gateway provides authentication support and a single rest interface to access several Bigdata services namely HDFS ,AMBARI , HIVE . Hence Knox is a perfect choice for routing the external traffic to Bigdata clusters.

External applications, can connect to the Knox Gateway using any Http client and interact with Hadoop by invoking REST API Calls. More details about the Webhdfs is documented in Apache Hadoop documentation

In this article I have made an attempt to show users how to build their own upload manager for uploading files to HDFS. The logic can be embedded in any desktop or mobile application allowing users to interact with their Bigdata cluster remotely .

File upload utility.

The utility uses Apache HttpClient library , release 4.5.3 and supports file sizes of upto 5 MB.

The project can be downloaded from my git repo https://github.com/bharathdcs/hadoop_fileuploader

The Application expects a properties file as input , the format and a sample is shown following


Most of the parameters are self explanatory .

The file is created with default permission of world writeable if a different permissions is desired pass the octal value using the following parameter in the properties file


The steps to run the application is as follows .

mvn exec:java -Dexec.mainClass="twc.webhdfs.App" -Dexec.args="/Users/macadmin/Desktop/input.properties"

Once finished you should see the following message confirming the file creation

File creation successful


How to merge Kerberos Keytabs without increasing kvno.

In scenarios which requires having same principal as part of multiple keytabs using xst or ktadd will increment the kvno making previous keytabs irrelevant .

For example:- HTTP  service principal is used for spnego authentication by both hadoop service and hbase rest service. If hdfs and hbase keytabs are created individually as shown following

xst -k spnego.service.keytab HTTP/server1.example.com@EXAMPLE.COM

xst -k hbase.service.keytab HTTP/server1.example.com@EXAMPLE.COM

The keytab for hadoop will have an old version of HTTP principal this can be confirmed by running klist

[root@server1 keytabs]# klist -k hbase.service.keytab
Keytab name: FILE:hbase.service.keytab
KVNO Principal
—- ————————————————————————–
2 HTTP/server1.example.com@EXAMPLE.COM

[root@server1 keytabs]# klist -k spnego.service.keytab
Keytab name: FILE:spnego.service.keytab
KVNO Principal
—- ————————————————————————–
1 HTTP/server1.example.com@EXAMPLE.COM

Once the keytabs are configured for HDFS, webhdfs services fails to start because the keytab is no longer valid. If you try to do kinit using HDFS spnego keytab you will notice following error, which is an indication that kvno is modified.

kinit -kt spnego.service.keytab HTTP/server1.example.com
kinit: Password incorrect while getting initial credentials

To list the current kvno in the kerberos server run , the following commands on the kerberos server


kadmin.local:  getprinc HTTP/server1.example.com
Principal: HTTP/server1.example.com@EXAMPLE.COM
Expiration date: [never]
Last password change: Wed Mar 29 18:24:35 PDT 2017
Password expiration date: [none]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Wed Mar 29 18:24:35 PDT 2017 (nn/admin@EXAMPLE.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 4
Key: vno 2, aes256-cts-hmac-sha1-96, no salt
Key: vno 2, aes128-cts-hmac-sha1-96, no salt
Key: vno 2, des3-cbc-sha1, no salt
Key: vno 2, arcfour-hmac, no salt
MKey: vno 1
Policy: [none]

Best practice for creating keytabs with overlapping principals.

1. Generate a keytab ,containing only the overlapping principal

xst -k spnego.service.keytab HTTP/server1.example.com@EXAMPLE.COM

2. For all other keytabs which need this principal use ktutil command as shown following

ktutil:  rkt spnego.service.keytab
ktutil:  rkt hbase.service.keytab
ktutil:  wkt hbase.service.keytab
ktutil:  exit

rkt loads all the principals in the keytab to the buffer, wkt creates a new keytab with all the principals currently in the buffer.

After executing the ktutil commands ,hbase.service.keytab will contain the HTTP principal with same kvno.

[root@hives1 ~]# klist -k hbase.service.keytab
Keytab name: FILE:hbase.service.keytab
KVNO Principal
—- ————————————————————————–
1 HTTP/server1.example.com@EXAMPLE.COM

Custom Spark Streaming Adapter for reading Weather data from IBM The Weather Company API.

Most of the real world price models use weather as one of the dimensions. Weather patterns can influence the pricing decisions for commodities. But how do you obtain the weather data in realtime to use it in spark analytics or persist for offline BI analysis .

In this article I will be discussing about IBM Weather Company API from Bluemix and how to read the data from the API for realtime analytics in Spark .

IBM Bluemix offers a free subscription plan for Weather Company API https://console.ng.bluemix.net/catalog/ .

Screen Shot 2017-03-16 at 3.58.15 pm

TWC API has various REST Urls one can use to acccess weather data releated to particular geo location, post code and so on.. More details is documented  here .

Reading the JSON string in Spark Streaming.

Spark streaming does not have out of box support for reading JSON from remote REST API. Hence users should create a custom adapter . Custom adapters should extend Receiver class from Spark streaming and override start() and stop() function .

Once the adapter is created we have to use the same in SparkStreamingContext as shown in following snippet

JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(20));

JavaReceiverInputDStream<String> lines = ssc.receiverStream(new WcReceiver(args[0]));

Here WCReceiver is my custom HTTP adapter to read weather json data. The complete project is available in my git repo

The JSON data is further persisted to HDFS so that external bigsql or hive tables can be mounted on it and used as a dimension for analytics .

Steps to run the project

  • Compile the program using following maven command

           mvn compile package

  • Copy the generated jar file from target folder to spark cluster
  • To test the program launch it using spark-submit local mode.The program accepts two parameters Weather data API URL and HDFS output folder as shown following.

spark-submit –class “spark.wc.AnalyzeWeather” –master local[4] wc-0.0.1-SNAPSHOT.jar  <weather data url> /tmp/output.






Realtime anomaly detection in timeseries data

How to detect the anomalies or deviations in real-time multivariate time-series data ?

IBM Streams has a timeseries toolkit which allows users to analyse real-time timeseries data.

One important realtime use-case is anomaly detection which is applied in several domains or industries

More technical details can be found in the following link


Monitoring memory usage for Bigdata processes using attach API and notifying users

Recently I have been asked by customers for a way to monitor heap memory growth and alert if it exceeds a certain thresholds. Ambari provides robust alerting and monitoring capabilities but there is limited support in monitoring heap sizes for services like HiveServer2. Although there are several monitoring and notification applications available in the market, I have made an attempt to highlight how to build our own monitoring and notification application with unlimited customization.

Technologies used

1. Java attach API

2. Apache Commons Email

3. Maven

4. OpenJDK 8

The project is uploaded to my git repo and can be accessed using the following link https://github.com/bharathdcs/attach_api_notification

Java Attach API’s allows user program to attach to running JVM’s. The JVM could be hosting NodeManagers , HiveServer2 or any other process. One of the advantages of attach API is you can retrieve the JMX url from JVM and fetch various statistics, one of them is memory consumption as shown in following code snippet

VirtualMachine virtualMachine = VirtualMachine.attach(virtualMachineDescriptor);
String propertyValue = agentProperties.getProperty("com.sun.management.jmxremote.localConnectorAddress");
c.getMBeanServerConnection().getAttribute(new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");

Using the memory object (bean) returned we can retrieve stats like used, allocated and max memory. More details about MemoryMXBean can be found in the java documentation

Properties File for Parameters

Users specify the memory usage thresholds for alerts and smtp server details to send the notifications using the property file.

Following is a snippet from a sample properties file


The procId is the process Id of JVM to be attached and monitored**.

provide your SMTP server details using smtp_* variables.

Threshold specifies the threshold for triggering alert when memory usage crosses (threshold/100)*totalHeap.

Steps to run the Appliation

The application uses maven build framework.

1. Compile using “mvn clean compile assembly:single”

2. Run the application using following syntax

java -cp ${JAVA_HOME}/lib/tools.jar:heap-0.0.1-SNAPSHOT-jar-with-dependencies.jar monitor.heap.App sample.properties

The application monitors memory usage continuously and when it exceeds a certain threshold , following email is sent to your configured Inbox.

Possible Future Enhancements

1. Integrating with mobile notification APIs

2. Publish the memory units as JSON enabling java script widgets to chart them.

** – Users can only attach to JVMs owned by them.For example, Hive user can only attach to hive JVM’s .

Publish Ambari Alerts to Google Cloud Messaging using Ambari Alert Dispatcher


Apache Ambari is a powerful open source cluster management tool, which simplifies deploying and managing Big Data clusters.

Ambari Server has an in-built alert management which triggers alerts based on conditions set.

The simplest alert example is the one defined for data node process unreachable. This alert is triggered whenever datanode process is down. The following snapshot shows the details of this alert.

Alerts are good however what use has an alert unless it can be notified to end users automatically. I started exploring Ambari to identify if there is a way by which I can define a custom dispatcher for alert events.

And viola I discovered that Ambari has an inbuilt “Alert Dispatcher” which has to be activated and we can define a call back script which will be invoked by the dispatcher .

Here comes the interesting part on how I went about using Ambari alert dispatcher to trigger Google cloud notifications for Android .

Lets look at how to enable Alert dispatcher on Ambari server .

  • Create an alert dispatcher target on Ambari server, but before that check if there is one already defined. check if default alert dispatcher target is already defined , by using following curl command

curl –u admin:password –I –H ‘X-Requested-By:ambari’ http://ambariserver.domain.com:8080/api/v1/alert_targets

Output should be empty JSON.

  • Define your own alert dispatcher target, using following curl statement the settings are part of http payload.

curl -u admin:password -i -H ‘X-Requested-By: ambari’ -X POST -d ‘{“Body”:{“AlertTarget”: {“name”: “syslogger”,”description”: “Syslog Target”,”notification_type”: “ALERT_SCRIPT”,”global”: true}}}’ http://ambariserver.domain.com:8080/api/v1/alert_targets

  • Now the target is set ,  let us bind the dispatcher script that has our code to ambari server

On Ambari server properties file (/etc/ambari-server/conf/ambari.properties) define the following property and it should point to a python or bash script which has your actions defined.


I  used this property to invoke a GCM module written in python .

** if you want to use my python code as an example its available in following git repository


  • Restart Ambari server

ambari-server restart

Now its time to put the module to test , trigger an alert by stopping HDFS service from Ambari .

Congratulations we just published Ambari alerts to Android applications.

Port forwarding in a multi-homed Big data cluster.

It is a recommended practice to deploy big data cluster in a multi homed network. Multi homing is process of connecting a node to two different networks. The big data nodes are connected to a private network and we will have a management node interfacing with both private and public/corporate network as shown in following figure

There are two reasons for multi-homing Bigdata cluster
1. To reduce the congestion of corporate network since big data jobs are network intensive.
2. By isolating the network, data flow is secured within the private network.
The client components are deployed on management node so that end user can login and run the analytics.

Services running on management node can bind to both the interfaces like Zookeeper or it can bind to private interface like Hbase master running on making it inaccessible from public network unless you login to management node.

Lets take a scenario where hbase java client from corporate/public network wants to connect to HBase master bound to private network interface, the connection fails since client has no visibility to private network.
To overcome this drawback we can enable port forwarding technique on management node using iptables command .

The following command has been tested working on Redhat Linux 6.x

iptables -t nat -A PREROUTING -i eth1 -p tcp –dport 60000 -j DNAT –to

The command redirects traffic from public ethernet interface eth1 port 60000 onto private interface or IP

If you are not aware which is the public interface use the following command to identify,

ip route show | grep default

Making investment decisions like a data scientist

Buying an investment property is a major financial decision, given the current record low interest rates and inflated property prices there is a lot at stake when deciding the areas to invest in.The idea of this exercise is to show how we can do our own research with available data and tools in aiding this major financial decision . You dont have to be a major statistician or a computer scientist to perform this task, by using IBM Data Science Experience on cloud most of the complexities in building analytics infrastructure, managing and analyzing the data is taken out of picture. So you as the end user can focus on the problem at hand which is buying an investment property. 🙂

The following notebook was authored and exported from IBM Data Science Experience

Analyze Sydney Rental data

This notebook analyses Sydney rental price growth for the last ten years. We will compare the estimated average raise against actual rental raises experienced in the areas. We will plot the relative difference in actual and estimated rental returns.

Table of contents

  1. Data preprocessing
  2. Load data
  3. Explore historical rental growth
  4. Calculate simple moving average
  5. Calculate the variance between forecast and actual rental returns
  6. Summary

1. Data preprocessing

In this notebook, we will explore and compare the rental growth for several Sydney suburbs

The raw  data set is from NSW government rent and sales report.

I have only extracted the data related to Sydney Apartment rental prices and data dimensions has been reformatted to assist the analysis.

I have uploaded the data to IBM Bluemix Spark Object store which we will load and analyze in subsequent sections.

2. Load Data

In this section we will load the Sydney rental data set from object store and extracts it  for analysis

In [12]:
import requests, StringIO, pandas as pd, json, re

def get_file_content(credentials):
"""For given credentials, this functions returns a StringIO object containing the file content."""

url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
data = {'auth': {'identity': {'methods': ['password'],
'password': {'user': {'name': credentials['username'],'domain': {'id': credentials['domain_id']},
'password': credentials['password']}}}}}
headers1 = {'Content-Type': 'application/json'}
resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
resp1_body = resp1.json()
for e1 in resp1_body['token']['catalog']:
for e2 in e1['endpoints']:
if(e2['interface']=='public'and e2['region']==credentials['region']):
url2 = ''.join([e2['url'],'/', credentials['container'], '/', credentials['filename']])
s_subject_token = resp1.headers['x-subject-token']
headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
resp2 = requests.get(url=url2, headers=headers2)
return StringIO.StringIO(resp2.content)

#Credentials for reading from the object store.
credentials = {

content_string = get_file_content(credentials)
rental_df = pd.read_csv(content_string)
Month-Of-Year GREATER SYDNEY Inner Ring Ashfield Botany Bay Lane Cove Leichhardt Marrickville
0 Mar-90 165 190 160 160 190 170 150
1 Jun-90 170 195 165 160 190 170 150
2 Sep-90 170 200 165 160 190 190 150
3 Dec-90 170 190 160 160 190 180 150
4 Mar-91 170 195 165 160 190 175 150
Set the index for the data frame which will be useful for charting


In [13]:
rental_df = rental_df.set_index(rental_df["Month-Of-Year"])
rental_df.drop(['Month-Of-Year'], axis=1, inplace=True)
GREATER SYDNEY Inner Ring Ashfield Botany Bay Lane Cove Leichhardt Marrickville
Mar-90 165 190 160 160 190 170 150
Jun-90 170 195 165 160 190 170 150
Sep-90 170 200 165 160 190 190 150
Dec-90 170 190 160 160 190 180 150
Mar-91 170 195 165 160 190 175 150

3. Explore data

Now lets explore the historical rental growth in the areas.

Pandas data frame has a default plot function to chart the frame data.

In [14]:
%matplotlib inline
rental_df.ix[+5:].plot(figsize=(15, 6))




4. Simple moving average forecast

Lets calculate rolling average on the rental data, with window of 10

In [15]:


GREATER SYDNEY Inner Ring Ashfield Botany Bay Lane Cove Leichhardt Marrickville
Sep-92 170 195.0 162.5 160.5 188.5 178.3 150
Dec-92 170 195.0 162.0 160.5 187.5 178.3 150
Mar-93 170 194.8 161.5 160.5 186.5 177.3 150
Jun-93 170 195.8 161.0 160.5 186.0 177.8 150
Sep-93 170 196.3 160.5 160.5 186.0 178.6 150

5. Calculate the variance between forecast and actual rental returns

Lets calculate the variance between predicted rental returns v/s actual and plot the variance.

In [16]:
rental_df_var.ix[+10:].plot(figsize=(15, 6))

5. Summary
In summary, Botany Bay experiences too much variance in rental returns ,Leichhardt comes next. The rest of the areas have had a steady rental returns. This blog is an attempt to show how various tools can be used to aid with major financial decisions in life.<br />

Hadoop translation rules for kerberos principals


We know that HDFS supports Kerberos authentication , but how does HDFS map the Kerberos principals to local unix/linux usernames?

Before we being this topic, lets first understand what a Kerberos principal consists of ,

For example given following principal,we can see there are two componets

there is a realm component DOMAIN.COM and user component hdfs , sometimes user component will comprise of two sub components separated by / . hdfs/admin@DOMAIN.COM

Why convert Kerberos principal to local user ?

HDFS uses ShellBasedUnixGroupsMapping by default , which means it uses linux/unix commands to fetch the group details of a particular user. The group details are further utilized for access control check on HDFS files and folders.

How does HDFS convert principal to local user ?

HDFS translates Kerberos principals using set of regex rules defined in core-site.xml. The hadoop.security.auth_to_local property contains the regex rules.

The default rule is to strip the realm name from the principal ,

i.e hdfs@DOMAIN.COM is converted to hdfs.

The regex pattern is similar to regex in Perl.

Lets look at one of the translation rule, it has 3 parts base , filter and substitution



The base uses $0 to mean the realm, $1 to mean the first component and $2 to mean the second component in username.

For example
consider principal hdfs/admin@DOMAIN.COM ,
here DOMAIN.COM is $0, $1 is admin and $2 is admin.

Filter :
In the following example we are filtering hdfs@DOMAIN.COM

Substitution :
Finally substituting the hdfs@DOMAIN.COM with hdfs. (/hdfs/)

How to test your rules ?

Use the following command to test your regex translation rules

hadoop org.apache.hadoop.security.HadoopKerberosName hdfs@DOMAIN.COM
Name: hdfs@DOMAIN.COM to hdfs

If there are no rules defined the command fails with following error

hadoop org.apache.hadoop.security.HadoopKerberosName hdfs@TEST.COM
Exception in thread “main” org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No rules applied to hdfs@TEST.COM
at org.apache.hadoop.security.authentication.util.KerberosName.getShortName(KerberosName.java:389)
at org.apache.hadoop.security.HadoopKerberosName.main(HadoopKerberosName.java:82)