How to merge Kerberos keytabs without incrementing the KVNO.

In scenarios that require the same principal to be part of multiple keytabs, using xst or ktadd will increment the KVNO, making previously exported keytabs irrelevant.

For example, the HTTP service principal is used for SPNEGO authentication by both the Hadoop services and the HBase REST service. Suppose the HDFS and HBase keytabs are created individually as shown below:

xst -k spnego.service.keytab HTTP/

xst -k hbase.service.keytab HTTP/

The keytab created first (the one for Hadoop) will now contain an old version of the HTTP principal's key. This can be confirmed by running klist:

[root@server1 keytabs]# klist -k hbase.service.keytab
Keytab name: FILE:hbase.service.keytab
KVNO Principal
---- --------------------------------------------------------------

[root@server1 keytabs]# klist -k spnego.service.keytab
Keytab name: FILE:spnego.service.keytab
KVNO Principal
---- --------------------------------------------------------------

Once the keytabs are configured, the WebHDFS service fails to start because the HDFS keytab is no longer valid. If you try a kinit using the HDFS SPNEGO keytab, you will notice the following error, which is an indication that the KVNO has changed:

kinit -kt spnego.service.keytab HTTP/
kinit: Password incorrect while getting initial credentials

To list the current KVNO, run the following command on the Kerberos server:


kadmin.local:  getprinc HTTP/
Principal: HTTP/
Expiration date: [never]
Last password change: Wed Mar 29 18:24:35 PDT 2017
Password expiration date: [none]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Wed Mar 29 18:24:35 PDT 2017 (nn/admin@EXAMPLE.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 4
Key: vno 2, aes256-cts-hmac-sha1-96, no salt
Key: vno 2, aes128-cts-hmac-sha1-96, no salt
Key: vno 2, des3-cbc-sha1, no salt
Key: vno 2, arcfour-hmac, no salt
MKey: vno 1
Policy: [none]

Best practice for creating keytabs with overlapping principals.

1. Generate a keytab containing only the overlapping principal:

xst -k spnego.service.keytab HTTP/

2. For all other keytabs which need this principal, use the ktutil command as shown below:

ktutil:  rkt spnego.service.keytab
ktutil:  rkt hbase.service.keytab
ktutil:  wkt hbase.service.keytab
ktutil:  exit

rkt loads all the principals from a keytab into the buffer; wkt writes a new keytab containing all the principals currently in the buffer.

After executing the ktutil commands, hbase.service.keytab will contain the HTTP principal with the same KVNO:

[root@hives1 ~]# klist -k hbase.service.keytab
Keytab name: FILE:hbase.service.keytab
KVNO Principal
---- --------------------------------------------------------------

Custom Spark Streaming adapter for reading weather data from the IBM Weather Company API.

Most real-world price models use weather as one of their dimensions, and weather patterns can influence pricing decisions for commodities. But how do you obtain weather data in real time to use in Spark analytics, or persist it for offline BI analysis?

In this article I will discuss the IBM Weather Company API from Bluemix and how to read data from the API for real-time analytics in Spark.

IBM Bluemix offers a free subscription plan for the Weather Company API.


The TWC API has various REST URLs one can use to access weather data related to a particular geo-location, post code, and so on. More details are documented here.

Reading the JSON string in Spark Streaming.

Spark Streaming does not have out-of-the-box support for reading JSON from a remote REST API, so users should create a custom adapter. Custom adapters should extend the Receiver class from Spark Streaming and override the onStart() and onStop() methods.

Once the adapter is created, we have to use it in the JavaStreamingContext as shown in the following snippet:

JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(20));

JavaReceiverInputDStream<String> lines = ssc.receiverStream(new WcReceiver(args[0]));

Here WcReceiver is my custom HTTP adapter to read the weather JSON data. The complete project is available in my git repo

The JSON data is further persisted to HDFS so that external Big SQL or Hive tables can be mounted on it and used as a dimension for analytics.
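The heart of such a receiver is a fetch, parse, store loop. The following sketch models that loop with an injected fetch function; the real project is Java, and the function names and sample payload here are illustrative, not from the repo:

```python
import json

def poll_weather(fetch, store, iterations=1):
    """Poll a weather REST endpoint and hand each JSON record to `store`.

    `fetch` returns the raw JSON string for one request; in the real
    receiver this would be an HTTP GET against the TWC API URL, and
    `store` would be the receiver's store() call.
    """
    for _ in range(iterations):
        raw = fetch()
        record = json.loads(raw)       # validate the payload is JSON
        store(json.dumps(record))      # forward as a JSON string line

# Stand-in for an HTTP call to the TWC API (response shape is illustrative).
sample = '{"observation": {"temp": 21, "wx_phrase": "Partly Cloudy"}}'
collected = []
poll_weather(lambda: sample, collected.append)
print(len(collected))  # 1
```

Injecting the fetch function keeps the loop testable without network access, which is also a reasonable structure for the Java receiver.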

Steps to run the project

  • Compile the program using the following maven command

           mvn compile package

  • Copy the generated jar file from target folder to spark cluster
  • To test the program, launch it using spark-submit in local mode. The program accepts two parameters, the weather data API URL and the HDFS output folder, as shown below.

spark-submit --class "spark.wc.AnalyzeWeather" --master local[4] wc-0.0.1-SNAPSHOT.jar <weather data url> /tmp/output






Realtime anomaly detection in timeseries data

How do you detect anomalies or deviations in real-time multivariate time-series data?

IBM Streams has a timeseries toolkit which allows users to analyse real-time timeseries data.

One important real-time use case is anomaly detection, which is applied across several domains and industries.

More technical details can be found in the following link
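As a minimal illustration of the idea (this is not the Streams toolkit itself, and all names here are mine), a rolling z-score detector flags points that deviate sharply from the recent window:

```python
import math

def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Flag indexes whose z-score vs. the trailing window exceeds the threshold."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mean = sum(hist) / window
        var = sum((x - mean) ** 2 for x in hist) / window
        std = math.sqrt(var)
        if std > 0 and abs(series[i] - mean) / std > threshold:
            anomalies.append(i)
    return anomalies

readings = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]  # spike at index 7
print(rolling_zscore_anomalies(readings))  # [7]
```

Real toolkits add model-based forecasting per variable, but the shape of the computation, a trailing window plus a deviation test, is the same.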

Monitoring memory usage for Bigdata processes using attach API and notifying users

Recently I have been asked by customers for a way to monitor heap memory growth and alert if it exceeds a certain threshold. Ambari provides robust alerting and monitoring capabilities, but there is limited support for monitoring heap sizes of services like HiveServer2. Although there are several monitoring and notification applications available in the market, I have made an attempt to highlight how to build our own monitoring and notification application with unlimited customization.

Technologies used

1. Java attach API

2. Apache Commons Email

3. Maven

4. OpenJDK 8

The project is uploaded to my git repo and can be accessed using the following link

The Java Attach API allows a user program to attach to running JVMs. The JVM could be hosting NodeManagers, HiveServer2, or any other process. One advantage of the Attach API is that you can retrieve the JMX URL from the JVM and fetch various statistics, including memory consumption, as shown in the following code snippet:

VirtualMachine virtualMachine = VirtualMachine.attach(virtualMachineDescriptor);
Properties agentProperties = virtualMachine.getAgentProperties();
String propertyValue = agentProperties.getProperty("");
// c is a JMXConnector opened against the JVM's local connector address
c.getMBeanServerConnection().getAttribute(new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");

Using the memory object (bean) returned we can retrieve stats like used, allocated and max memory. More details about MemoryMXBean can be found in the java documentation

Properties File for Parameters

Users specify the memory-usage thresholds for alerts, and the SMTP server details used to send the notifications, in a properties file.

Following is a snippet from a sample properties file


procId is the process ID of the JVM to be attached and monitored**.

Provide your SMTP server details using the smtp_* variables.

threshold specifies the cutoff for triggering an alert when memory usage crosses (threshold/100) * totalHeap.
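The alert condition itself is simple arithmetic. A sketch of the check (function and parameter names are mine, not from the project):

```python
def heap_alert(used_bytes, max_bytes, threshold_pct):
    """Return True when used heap crosses (threshold/100) * total heap."""
    return used_bytes > (threshold_pct / 100.0) * max_bytes

# 80% usage against a 75% threshold should alert; 70% should not.
print(heap_alert(800, 1000, 75))  # True
print(heap_alert(700, 1000, 75))  # False
```

In the Java application, used_bytes and max_bytes would come from the HeapMemoryUsage attribute retrieved over JMX.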

Steps to run the Application

The application uses maven build framework.

1. Compile using “mvn clean compile assembly:single”

2. Run the application using following syntax

java -cp ${JAVA_HOME}/lib/tools.jar:heap-0.0.1-SNAPSHOT-jar-with-dependencies.jar monitor.heap.App

The application monitors memory usage continuously, and when usage exceeds the configured threshold, an alert email is sent to your configured inbox.

Possible Future Enhancements

1. Integrating with mobile notification APIs

2. Publish the memory units as JSON enabling java script widgets to chart them.

** Users can only attach to JVMs owned by them. For example, the hive user can only attach to Hive JVMs.

Publish Ambari Alerts to Google Cloud Messaging using Ambari Alert Dispatcher


Apache Ambari is a powerful open source cluster management tool, which simplifies deploying and managing Big Data clusters.

Ambari Server has an in-built alert management which triggers alerts based on conditions set.

The simplest alert example is the one defined for data node process unreachable. This alert is triggered whenever datanode process is down. The following snapshot shows the details of this alert.

Alerts are good, but of little use unless they can be delivered to end users automatically. I started exploring Ambari to identify whether there is a way to define a custom dispatcher for alert events.

And voila, I discovered that Ambari has an in-built "Alert Dispatcher" which has to be activated, and we can define a callback script which will be invoked by the dispatcher.

Here comes the interesting part: how I went about using the Ambari alert dispatcher to trigger Google Cloud notifications for Android.

Let's look at how to enable the alert dispatcher on the Ambari server.

  • Before creating an alert dispatcher target on the Ambari server, check whether a default alert dispatcher target is already defined, using the following curl command

curl -u admin:password -i -H 'X-Requested-By: ambari'

Output should be empty JSON.

  • Define your own alert dispatcher target using the following curl statement; the settings are part of the HTTP payload.

curl -u admin:password -i -H 'X-Requested-By: ambari' -X POST -d '{"Body":{"AlertTarget": {"name": "syslogger","description": "Syslog Target","notification_type": "ALERT_SCRIPT","global": true}}}'

  • Now that the target is set, let us bind the dispatcher script that has our code to the Ambari server

In the Ambari server properties file (/etc/ambari-server/conf/ define the following property; it should point to a Python or bash script which has your actions defined.


I used this property to invoke a GCM module written in Python.

** If you want to use my Python code as an example, it is available in the following git repository
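To give a feel for what such a dispatcher script can look like, here is a hedged sketch. The positional arguments Ambari passes to an ALERT_SCRIPT vary by version, so treat the argument order and payload shape below as assumptions, not as my actual module:

```python
import json
import sys

def build_gcm_payload(definition_name, label, service, state, text, device_token):
    """Build a downstream push message for one Ambari alert (shape is illustrative)."""
    return json.dumps({
        "to": device_token,
        "notification": {"title": "%s [%s]" % (label, state), "body": text},
        "data": {"definition": definition_name, "service": service},
    })

if __name__ == "__main__" and len(sys.argv) >= 6:
    # Argument order assumed here; check how your Ambari version invokes ALERT_SCRIPT.
    name, label, service, state, text = sys.argv[1:6]
    # A real dispatcher would POST this payload to the push endpoint with an API key.
    print(build_gcm_payload(name, label, service, state, text, "DEVICE_TOKEN"))
```

Keeping the payload construction in a separate function makes the script easy to test without a running Ambari server.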

  • Restart Ambari server

ambari-server restart

Now it's time to put the module to the test: trigger an alert by stopping the HDFS service from Ambari.

Congratulations, we just published Ambari alerts to Android applications.

Port forwarding in a multi-homed Big data cluster.

It is recommended practice to deploy a big data cluster on a multi-homed network. Multi-homing is the process of connecting a node to two different networks. The big data nodes are connected to a private network, and a management node interfaces with both the private and the public/corporate network, as shown in the following figure

There are two reasons for multi-homing a big data cluster:
1. To reduce congestion on the corporate network, since big data jobs are network intensive.
2. By isolating the network, data flow is secured within the private network.
The client components are deployed on the management node so that end users can log in and run analytics.

Services running on the management node can bind to both interfaces, like ZooKeeper, or bind only to the private interface, like an HBase master running on the private address, making it inaccessible from the public network unless you log in to the management node.

Consider a scenario where an HBase Java client on the corporate/public network wants to connect to an HBase master bound to the private network interface; the connection fails since the client has no visibility into the private network.
To overcome this drawback, we can enable port forwarding on the management node using the iptables command.

The following command has been tested working on Redhat Linux 6.x

iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 60000 -j DNAT --to

The command redirects traffic arriving on the public Ethernet interface eth1, port 60000, to the private interface or IP

If you are not sure which interface is the public one, use the following command to identify it:

ip route show | grep default
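The iptables rule does this translation inside the kernel. Purely as an illustration of what port forwarding means, here is a minimal user-space TCP relay sketch; it is not a substitute for the iptables approach, and all names are mine:

```python
import socket
import threading

def pump(src, dst):
    """Copy bytes from src to dst until src closes, then close dst."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    dst.close()

def start_forwarder(target_host, target_port):
    """Listen on an ephemeral port and relay every connection to the target.
    Returns the port the forwarder is listening on."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("", 0))
    server.listen(5)

    def accept_loop():
        while True:
            client, _ = server.accept()
            upstream = socket.create_connection((target_host, target_port))
            threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
            threading.Thread(target=pump, args=(upstream, client), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server.getsockname()[1]

# Demo: an echo server standing in for a service on the private side.
echo_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
echo_srv.bind(("", 0))
echo_srv.listen(1)

def echo_once():
    conn, _ = echo_srv.accept()
    conn.sendall(conn.recv(1024))
    conn.close()

threading.Thread(target=echo_once, daemon=True).start()
port = start_forwarder("", echo_srv.getsockname()[1])
client = socket.create_connection(("", port))
client.sendall(b"ping")
print(client.recv(1024))  # b'ping'
```

The DNAT rule is strictly better in production (no extra copies, survives reboots via iptables persistence), but the relay shows the data path: accept on the public side, connect on the private side, and shuttle bytes both ways.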

Making investment decisions like a data scientist

Buying an investment property is a major financial decision; given the current record-low interest rates and inflated property prices, there is a lot at stake when deciding which areas to invest in. The idea of this exercise is to show how we can do our own research, with available data and tools, to aid this major financial decision. You don't have to be a statistician or a computer scientist to perform this task; by using IBM Data Science Experience on cloud, most of the complexity in building analytics infrastructure and in managing and analyzing the data is taken out of the picture. So you as the end user can focus on the problem at hand, which is buying an investment property. 🙂

The following notebook was authored and exported from IBM Data Science Experience

Analyze Sydney Rental data

This notebook analyses Sydney rental price growth for the last ten years. We will compare the estimated average rise against the actual rent rises experienced in each area, and plot the relative difference between actual and estimated rental returns.

Table of contents

  1. Data preprocessing
  2. Load data
  3. Explore historical rental growth
  4. Calculate simple moving average
  5. Calculate the variance between forecast and actual rental returns
  6. Summary

1. Data preprocessing

In this notebook, we will explore and compare the rental growth for several Sydney suburbs.

The raw  data set is from NSW government rent and sales report.

I have extracted only the data related to Sydney apartment rental prices, and the data dimensions have been reformatted to assist the analysis.

I have uploaded the data to IBM Bluemix Spark Object store which we will load and analyze in subsequent sections.

2. Load Data

In this section we will load the Sydney rental data set from the object store and extract it for analysis.

In [12]:
import requests, StringIO, pandas as pd, json, re

def get_file_content(credentials):
    """For given credentials, this function returns a StringIO object containing the file content."""
    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 =, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        for e2 in e1['endpoints']:
            if (e2['interface'] == 'public' and e2['region'] == credentials['region']):
                url2 = ''.join([e2['url'], '/', credentials['container'], '/', credentials['filename']])
                s_subject_token = resp1.headers['x-subject-token']
                headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
                resp2 = requests.get(url=url2, headers=headers2)
                return StringIO.StringIO(resp2.content)

#Credentials for reading from the object store.
credentials = {

content_string = get_file_content(credentials)
rental_df = pd.read_csv(content_string)
Month-Of-Year GREATER SYDNEY Inner Ring Ashfield Botany Bay Lane Cove Leichhardt Marrickville
0 Mar-90 165 190 160 160 190 170 150
1 Jun-90 170 195 165 160 190 170 150
2 Sep-90 170 200 165 160 190 190 150
3 Dec-90 170 190 160 160 190 180 150
4 Mar-91 170 195 165 160 190 175 150
Set the index for the data frame which will be useful for charting


In [13]:
rental_df = rental_df.set_index(rental_df["Month-Of-Year"])
rental_df.drop(['Month-Of-Year'], axis=1, inplace=True)
GREATER SYDNEY Inner Ring Ashfield Botany Bay Lane Cove Leichhardt Marrickville
Mar-90 165 190 160 160 190 170 150
Jun-90 170 195 165 160 190 170 150
Sep-90 170 200 165 160 190 190 150
Dec-90 170 190 160 160 190 180 150
Mar-91 170 195 165 160 190 175 150

3. Explore historical rental growth

Now let's explore the historical rental growth in these areas.

The pandas data frame has a default plot function to chart the frame data.

In [14]:
%matplotlib inline
rental_df.ix[+5:].plot(figsize=(15, 6))




4. Simple moving average forecast

Let's calculate a rolling average over the rental data, with a window of 10.

In [15]:


GREATER SYDNEY Inner Ring Ashfield Botany Bay Lane Cove Leichhardt Marrickville
Sep-92 170 195.0 162.5 160.5 188.5 178.3 150
Dec-92 170 195.0 162.0 160.5 187.5 178.3 150
Mar-93 170 194.8 161.5 160.5 186.5 177.3 150
Jun-93 170 195.8 161.0 160.5 186.0 177.8 150
Sep-93 170 196.3 160.5 160.5 186.0 178.6 150
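For reference, the moving average in the table can be reproduced without pandas. A plain-Python sketch, using illustrative numbers rather than the actual data set:

```python
def simple_moving_average(values, window=10):
    """Trailing simple moving average; defined from index window-1 onward."""
    return [sum(values[i - window + 1:i + 1]) / float(window)
            for i in range(window - 1, len(values))]

rents = [165, 170, 170, 170, 170, 175, 180, 185, 190, 195, 200]
print(simple_moving_average(rents, window=10))  # [177.0, 180.5]
```

This is the same computation the notebook performs with a rolling window of 10: each output point averages the current quarter and the nine preceding it.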

5. Calculate the variance between forecast and actual rental returns

Let's calculate the variance between predicted and actual rental returns, and plot it.

In [16]:
rental_df_var.ix[+10:].plot(figsize=(15, 6))

6. Summary
In summary, Botany Bay experiences the most variance in rental returns, with Leichhardt next. The rest of the areas have had steady rental returns. This blog is an attempt to show how various tools can be used to aid major financial decisions in life.

Hadoop translation rules for kerberos principals


We know that HDFS supports Kerberos authentication , but how does HDFS map the Kerberos principals to local unix/linux usernames?

Before we begin this topic, let's first understand what a Kerberos principal consists of.

For example, given the following principal, we can see there are two components:

there is a realm component DOMAIN.COM and a user component hdfs. Sometimes the user component comprises two sub-components separated by /, as in hdfs/admin@DOMAIN.COM

Why convert Kerberos principal to local user ?

HDFS uses ShellBasedUnixGroupsMapping by default, which means it uses Linux/Unix commands to fetch the group details of a particular user. The group details are then used for access control checks on HDFS files and folders.

How does HDFS convert principal to local user ?

HDFS translates Kerberos principals using a set of regex rules defined in a property in core-site.xml.

The default rule is to strip the realm name from the principal ,

i.e., hdfs@DOMAIN.COM is converted to hdfs.

The regex pattern is similar to regex in Perl.

Let's look at one of the translation rules; it has three parts: base, filter, and substitution.



The base uses $0 to mean the realm, $1 to mean the first component, and $2 to mean the second component of the username.

For example,
consider the principal hdfs/admin@DOMAIN.COM:
here DOMAIN.COM is $0, hdfs is $1, and admin is $2.

Filter :
In the following example we are filtering hdfs@DOMAIN.COM

Substitution:
Finally, we substitute hdfs@DOMAIN.COM with hdfs (/hdfs/).
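To make the strip-realm behavior concrete, here is a simplified stand-in written in Python. This is not Hadoop's actual rule parser; the matching logic is deliberately reduced to the default rule only:

```python
import re

def translate(principal, realm="DOMAIN.COM"):
    """Simplified model of the default auth_to_local behavior.

    A single-component principal in the local realm is mapped to that
    component; anything else is left untouched (Hadoop would instead
    fail or try further rules).
    """
    m = re.match(r'^([^/@]+)@(.+)$', principal)
    if m and m.group(2) == realm:
        return m.group(1)
    return principal

print(translate("hdfs@DOMAIN.COM"))        # hdfs
print(translate("hdfs/admin@DOMAIN.COM"))  # hdfs/admin@DOMAIN.COM (no rule matched)
```

Two-component principals need an explicit rule with a base of the form $1/$2, which is exactly what the base/filter/substitution parts above express.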

How to test your rules ?

Use the following command to test your regex translation rules

hadoop hdfs@DOMAIN.COM
Name: hdfs@DOMAIN.COM to hdfs

If there are no rules defined, the command fails with the following error:

hadoop hdfs@TEST.COM
Exception in thread "main"$NoMatchingRule: No rules applied to hdfs@TEST.COM

Kerberos Series – Part II (Renewable Tickets)

The following instructions will help in configuring the Kerberos tickets as renewable.

On the KDC server node, modify the /etc/krb5.conf file and add the property max_renewable_life as shown below:

admin_server =
kdc =
max_renewable_life = 7d

Restart KDC server daemon

service krb5kdc restart

On the Kerberos client nodes, modify the /etc/krb5.conf file and change the default value for renew_lifetime. If the property is missing, add it.

renew_lifetime = 7d
forwardable = true
default_realm = DOMAIN.COM
dns_lookup_realm = false
dns_lookup_kdc = false

Log in to kadmin or kadmin.local and modify the max renewable life for the required principals and the krbtgt service principal:

kadmin -p root/admin

kadmin: modprinc -maxrenewlife "20days" krbtgt/DOMAIN.COM
Principal "krbtgt/DOMAIN.COM@DOMAIN.COM" modified.
kadmin: modprinc -maxrenewlife "20days" hdfs/DOMAIN.COM

Congratulations, you have successfully made your Kerberos tickets renewable.

How to renew the tickets.

Tickets can be renewed within their renewable lifetime using the following command:

kinit -R hdfs/DOMAIN.COM

Notice we are not passing any keytab, nor entering any password, while renewing the ticket; this is the advantage of renewable tickets.

Additional Information

Renewing a ticket extends only the ticket lifetime, not its renewable lifetime.

If you want the renewable lifetime of a ticket to differ from the default renewable lifetime set in /etc/krb5.conf, pass the -r parameter to the kinit command:

kinit -r “5d” -kt /etc/security/keytabs/hdfs.service.keytab hdfs@DOMAIN.COM

This sets the renewable lifetime to 5 days instead of the configured default of 7.

Please note: once tickets are made renewable, the renewable lifetime cannot be set lower than the ticket expiration lifetime. By default the renewable lifetime will be set the same as the ticket expiration time.
For example,
in the following kinit I specified a renewable lifetime of 1 day and a ticket lifetime of 2 days; Kerberos ignores my renew time and instead sets it equal to the expiration time.

kinit -l”2d” -r”1d” -kt /etc/security/keytabs/hdfs.service.keytab hdfs/DOMAIN.COM
Ticket cache: FILE:/tmp/krb5cc_2824
Default principal: hdfs/DOMAIN.COM

Valid starting Expires Service principal
05/22/16 20:36:19 05/24/16 20:36:19 krbtgt/DOMAIN.COM@DOMAIN.COM
renew until 05/24/16 20:36:19
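The clamping behavior shown in the output above can be modeled in one line. A toy sketch (hours as units; the function name is mine):

```python
def effective_renew_life(requested_renew_h, ticket_life_h):
    """The renewable life cannot be shorter than the ticket lifetime;
    the KDC silently raises it to match (simplified model)."""
    return max(requested_renew_h, ticket_life_h)

# Requesting -r 1d with -l 2d yields a 2-day renewable life,
# matching the kinit output above.
print(effective_renew_life(24, 48))  # 48
```

A requested renewable life longer than the ticket life is honored as-is (up to the KDC's configured maximums).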