Making investment decisions like a data scientist

Buying an investment property is a major financial decision. Given the current record-low interest rates and inflated property prices, there is a lot at stake when deciding which areas to invest in. The idea of this exercise is to show how we can do our own research with readily available data and tools to aid this major financial decision. You don't have to be a statistician or a computer scientist to perform this task: by using IBM Data Science Experience on the cloud, most of the complexity of building analytics infrastructure and of managing and analyzing the data is taken out of the picture, so you as the end user can focus on the problem at hand, which is buying an investment property. 🙂

The following notebook was authored in and exported from IBM Data Science Experience.

Analyze Sydney Rental data

This notebook analyses Sydney rental price growth for the last ten years. We will compare the estimated average rise against the actual rental rises experienced in each area, and plot the relative difference between actual and estimated rental returns.

Table of contents

  1. Data preprocessing
  2. Load data
  3. Explore historical rental growth
  4. Calculate simple moving average
  5. Calculate the variance between forecast and actual rental returns
  6. Summary

1. Data preprocessing

In this notebook, we will explore and compare the rental growth for several Sydney suburbs.

The raw data set is from the NSW government rent and sales report.

I have extracted only the data related to Sydney apartment rental prices, and the data dimensions have been reformatted to assist the analysis.

I have uploaded the data to the IBM Bluemix Spark object store, from which we will load and analyze it in the subsequent sections.

2. Load Data

In this section we will load the Sydney rental data set from the object store and extract it for analysis.

In [12]:
import requests, StringIO, pandas as pd, json, re

def get_file_content(credentials):
    """For the given credentials, this function returns a StringIO object containing the file content."""
    # Request a Keystone auth token for the object store.
    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
                                  'password': {'user': {'name': credentials['username'],
                                                        'domain': {'id': credentials['domain_id']},
                                                        'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    # Find the public object-store endpoint in the service catalog and fetch the file.
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', credentials['filename']])
                    s_subject_token = resp1.headers['x-subject-token']
                    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
                    resp2 = requests.get(url=url2, headers=headers2)
                    return StringIO.StringIO(resp2.content)

#Credentials for reading from the object store.
credentials = {
'auth_url':'https://identity.open.softlayer.com',
'project':'----',
'project_id':'--',
'region':'dallas',
'user_id':'--',
'domain_id':'--',
'domain_name':'1123361',
'username':'--',
'password':"""--""",
'filename':'Sydney_unit_rent.csv',
'container':'notebooks',
'tenantId':'s5ec-15c742976f006e-2278ae9ffa81'
}

content_string = get_file_content(credentials)
rental_df = pd.read_csv(content_string)
rental_df.head()
Out[12]:
Month-Of-Year GREATER SYDNEY Inner Ring Ashfield Botany Bay Lane Cove Leichhardt Marrickville
0 Mar-90 165 190 160 160 190 170 150
1 Jun-90 170 195 165 160 190 170 150
2 Sep-90 170 200 165 160 190 190 150
3 Dec-90 170 190 160 160 190 180 150
4 Mar-91 170 195 165 160 190 175 150
Set the index of the data frame, which will be useful for charting.

In [13]:
rental_df = rental_df.set_index(rental_df["Month-Of-Year"])
rental_df.drop(['Month-Of-Year'], axis=1, inplace=True)
rental_df.head()
Out[13]:
GREATER SYDNEY Inner Ring Ashfield Botany Bay Lane Cove Leichhardt Marrickville
Month-Of-Year
Mar-90 165 190 160 160 190 170 150
Jun-90 170 195 165 160 190 170 150
Sep-90 170 200 165 160 190 190 150
Dec-90 170 190 160 160 190 180 150
Mar-91 170 195 165 160 190 175 150

3. Explore historical rental growth

Now let's explore the historical rental growth in these areas.

A pandas data frame has a built-in plot function to chart the frame's data.

In [14]:
%matplotlib inline
rental_df.ix[+5:].plot(figsize=(15, 6))
Out[14]:
[Line chart: historical rental prices over time for each suburb]

4. Calculate simple moving average

Let's calculate a rolling average on the rental data, with a window of 10.

In [15]:
rental_df_mean=pd.rolling_mean(rental_df,10)
rental_df_mean.ix[+10:].head()
Out[15]:
GREATER SYDNEY Inner Ring Ashfield Botany Bay Lane Cove Leichhardt Marrickville
Month-Of-Year
Sep-92 170 195.0 162.5 160.5 188.5 178.3 150
Dec-92 170 195.0 162.0 160.5 187.5 178.3 150
Mar-93 170 194.8 161.5 160.5 186.5 177.3 150
Jun-93 170 195.8 161.0 160.5 186.0 177.8 150
Sep-93 170 196.3 160.5 160.5 186.0 178.6 150
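
A quick note if you re-run this on a newer pandas release: pd.rolling_mean has since been removed, and the equivalent call uses the DataFrame rolling API:

rental_df_mean = rental_df.rolling(window=10).mean()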

5. Calculate the variance between forecast and actual rental returns

Let's calculate the variance between the predicted and actual rental returns, and plot it.

In [16]:
rental_df_var=rental_df[+10:]-rental_df_mean[+10:]
rental_df_var.ix[+10:].plot(figsize=(15, 6))
Out[16]:
[Line chart: variance between actual rents and the moving-average forecast for each suburb]

6. Summary

In summary, Botany Bay experiences the most variance in rental returns, with Leichhardt next. The rest of the areas have had steady rental returns. This blog is an attempt to show how various tools can be used to aid major financial decisions in life.

Hadoop translation rules for Kerberos principals

Introduction

We know that HDFS supports Kerberos authentication, but how does HDFS map Kerberos principals to local Unix/Linux usernames?

Before we begin this topic, let's first understand what a Kerberos principal consists of.

For example, given the following principal, we can see there are two components:
hdfs@DOMAIN.COM

There is a realm component (DOMAIN.COM) and a user component (hdfs). Sometimes the user component comprises two sub-components separated by a /, e.g. hdfs/admin@DOMAIN.COM.

Why convert a Kerberos principal to a local user?

HDFS uses ShellBasedUnixGroupsMapping by default, which means it uses Linux/Unix commands to fetch the group details of a particular user (i.e. id username). The group details are then used for access-control checks on HDFS files and folders.

How does HDFS convert a principal to a local user?

HDFS translates Kerberos principals using a set of regex rules defined in core-site.xml; the hadoop.security.auth_to_local property contains these rules.
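
For illustration, the property is set in core-site.xml along these lines (the rule shown is the example discussed below; DEFAULT applies the built-in realm-stripping behaviour):

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](hdfs@DOMAIN.COM)s/.*/hdfs/
    DEFAULT
  </value>
</property>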

The default rule is to strip the realm name from the principal,

i.e. hdfs@DOMAIN.COM is converted to hdfs.

The regex pattern is similar to regex in Perl.

Let's look at one of the translation rules; it has three parts: base, filter and substitution.

RULE:[1:$1@$0](hdfs@DOMAIN.COM)s/.*/hdfs/

Base:
RULE:[1:$1@$0]

In the base, the leading number (1 here) is the number of components expected in the principal's user part; $0 represents the realm, $1 the first component and $2 the second component of the username.

For example, consider the principal hdfs/admin@DOMAIN.COM:
here $0 is DOMAIN.COM, $1 is hdfs and $2 is admin.

Filter:
(hdfs@DOMAIN.COM)

The filter is a regular expression that must match the string built by the base; in this example we are filtering for hdfs@DOMAIN.COM.

Substitution:
s/.*/hdfs/

Finally, the sed-style substitution replaces the matched string (here hdfs@DOMAIN.COM) with hdfs.
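
To make the mechanics concrete, here is a minimal Python sketch of how a single rule is evaluated (my own simplified illustration, not Hadoop's actual implementation, which lives in the KerberosName class seen in the stack trace below):

import re

def apply_rule(principal, num_components, base_format, filter_regex, pattern, replacement):
    # Split off the realm ($0); the user part supplies $1, $2, ...
    user, realm = principal.split('@')
    components = user.split('/')
    if len(components) != num_components:
        return None  # the rule only applies if the component count matches
    # Build the base string that the filter is matched against.
    base = base_format.replace('$0', realm)
    for i, comp in enumerate(components, start=1):
        base = base.replace('$%d' % i, comp)
    if not re.match(filter_regex + '$', base):
        return None  # filter did not match, rule does not apply
    # Apply the sed-style substitution.
    return re.sub(pattern, replacement, base, count=1)

# RULE:[1:$1@$0](hdfs@DOMAIN.COM)s/.*/hdfs/
print(apply_rule('hdfs@DOMAIN.COM', 1, '$1@$0', 'hdfs@DOMAIN.COM', '.*', 'hdfs'))  # prints: hdfs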

How to test your rules?

Use the following command to test your regex translation rules:

hadoop org.apache.hadoop.security.HadoopKerberosName hdfs@DOMAIN.COM
Name: hdfs@DOMAIN.COM to hdfs

If there are no matching rules defined, the command fails with the following error:

hadoop org.apache.hadoop.security.HadoopKerberosName hdfs@TEST.COM
Exception in thread "main" org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No rules applied to hdfs@TEST.COM
at org.apache.hadoop.security.authentication.util.KerberosName.getShortName(KerberosName.java:389)
at org.apache.hadoop.security.HadoopKerberosName.main(HadoopKerberosName.java:82)

Kerberos Series – Part II (Renewable Tickets)

The following instructions will help in configuring Kerberos tickets to be renewable.

On the KDC server node, modify the /etc/krb5.conf file and add the max_renewable_life property as shown below (on many installations this realm setting is placed in the KDC's kdc.conf instead):

[realms]
DOMAIN.COM = {
admin_server = kdcserver.domain.com
kdc = kdcserver.domain.com
max_renewable_life = 7d
}

Restart the KDC server daemon:

service krb5kdc restart

On the Kerberos client nodes, modify the /etc/krb5.conf file and change the default value of renew_lifetime. If the property is missing, add it:

[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = DOMAIN.COM
dns_lookup_realm = false
dns_lookup_kdc = false

Log in to kadmin or kadmin.local and modify the maximum renewable life for the required principals and for the krbtgt service principal:

kadmin -p root/admin

kadmin: modprinc -maxrenewlife "20days" krbtgt/DOMAIN.COM
Principal "krbtgt/DOMAIN.COM@DOMAIN.COM" modified.
kadmin: modprinc -maxrenewlife "20days" hdfs/DOMAIN.COM
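
To verify that the change took effect, the getprinc subcommand can be used; its output includes a Maximum renewable life field:

kadmin: getprinc krbtgt/DOMAIN.COM@DOMAIN.COM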

Congratulations, you have successfully made your Kerberos tickets renewable.

How to renew the tickets?

Tickets can be renewed within their renewable lifetime using the following command:

kinit -R hdfs/DOMAIN.COM

Notice that we are not passing any keytab, nor entering any password, while renewing the tickets; this is the advantage of renewable tickets.
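
After renewing, running klist should show a later expiry time while the renew until timestamp stays unchanged:

klist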

Additional Information

Renewing a ticket does not extend its renewable lifetime, only the ticket lifetime.

If you want to set the renewable lifetime of a ticket to something different from the default renewable lifetime set in /etc/krb5.conf, pass the -r parameter to the kinit command:

kinit -r "5d" -kt /etc/security/keytabs/hdfs.service.keytab hdfs@DOMAIN.COM

This sets the renew lifetime to 5 days instead of the configured default of 7.

Please note: once tickets are made renewable, the renew lifetime cannot be set lower than the ticket expiration lifetime; in that case the renew lifetime will be set the same as the ticket expiration time.
For example, in the following kinit I have specified a renew lifetime of 1 day and a ticket lifetime of 2 days; Kerberos ignores my renew time and instead sets it to the same as the expiration time.

kinit -l "2d" -r "1d" -kt /etc/security/keytabs/hdfs.service.keytab hdfs/DOMAIN.COM
klist
Ticket cache: FILE:/tmp/krb5cc_2824
Default principal: hdfs/DOMAIN.COM

Valid starting Expires Service principal
05/22/16 20:36:19 05/24/16 20:36:19 krbtgt/DOMAIN.COM@DOMAIN.COM
renew until 05/24/16 20:36:19

Kerberos Series – Part I (Ticket Lifetime)

Kerberos is an integral part of big data cluster security infrastructure. Setting up and configuring a Kerberos cluster can be overwhelming for beginners; through this blog I am trying to simplify the different administration tasks involved with Kerberos.

Kerberos infrastructure consists of client and server components. The server side consists of two daemons: the KDC server, whose job is to issue and validate tickets, and the kadmin server, which is used for administering the KDC.

Modify the default lifetime (24 hrs) of Kerberos tickets

If for some reason you want to modify the default lifetime of Kerberos tickets, the following steps will help.

1. Edit the /etc/krb5.conf file on the KDC server node and add the max_life property for the realms whose ticket lifetime you intend to modify:

[realms]
DOMAIN.COM = {
admin_server = kdcserver.domain.com
kdc = kdcserver.domain.com
max_life = 180d
}
2. Restart the KDC server using the following command:
service krb5kdc restart
3. On the Kerberos client machines, modify the default ticket lifetime by editing the /etc/krb5.conf file as follows:
[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = DOMAIN.COM
ticket_lifetime = 20d
dns_lookup_realm = false
dns_lookup_kdc = false
default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

This is an optional step; if it is not specified, the default lifetime will be 24 hours and it has to be overridden from the kinit command to modify the lifetime.

4. Log in to kadmin or kadmin.local and modify the maximum life for the required principals and for the krbtgt service principal:

kadmin -p root/admin

kadmin: modprinc -maxlife "20days" krbtgt/DOMAIN.COM
Principal "krbtgt/DOMAIN.COM@DOMAIN.COM" modified.

kadmin: modprinc -maxlife "20days" hdfs/DOMAIN.COM

Congratulations, you have successfully extended the lifetime of your Kerberos tickets.

Additional information: if you decide not to set the default ticket lifetime on a client machine, ignore step 3 and instead pass the ticket lifetime as a parameter to the kinit command as shown below:

kinit -l "10d" -kt /etc/security/keytabs/hdfs.keytabs hdfs/DOMAIN.COM
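
To confirm the extended lifetime, run klist afterwards; the Expires column should now show roughly ten days after the Valid starting timestamp:

klist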