Disclaimer: The analysis done in this project touches a sensitive issue in India, so I am not asking anybody to simply trust my model.

A real human society is so complex that “all the things may be interconnected in a different way than in the model.”

Imagine you are presented with a dataset of hate crimes in India and asked how to minimize these crimes by analyzing other factors. This is the problem I am taking on, to solve and analyze with a minimum of resources. Some may say that education and government jobs for India's youth could solve this problem, and yes, you are right; you will see that relationship soon. You can also make your best guess by visualizing the many other factors I will present here.

In the next post, I'll create the machine learning model and show you a simple product that will take your thinking another leap forward.

Of course, neither we nor our policymakers can experiment based on this analysis; it is purely a machine learning exercise for me and my audience. However, we could run such experiments on virtual people if we had a massive dataset, and that is exactly what I'll try to show in the modeling phase of the next article, after analyzing this small three-year dataset. In some cases, my modeling techniques predict more accurately than simple linear regression. One analysis suggests that the more religious the party in power in India, the more such hate crimes occur, and that people can be secularized through a few factors we will discuss alongside the model.

If our leaders can use AI/ML to predict which policy will produce the best outcome, maybe we’ll end up with a healthier and happier world. But it’s also a dangerous idea: What’s “best” is in the eye of the beholder, after all.

I wish these incidents would pause, but that will not happen soon. So one hypothesis concerns how these events might stop immediately: my guess is that some ecological disaster could trigger a complete halt, though I have no model to show for it yet.

Another hypothesis is that mutually escalating violence is likeliest to occur when there is a small disparity in size between the majority and minority groups in India. I suspect we cannot predict that ratio until we have sufficient data.

A regression model is a great tool and can drive some very good recommendations. I was amazed to see that when I started working on this project, my regression model on this data predicted the clashes that happened in India in April 2018. The events matched the model beautifully; it was really exciting.

So let us start and here is the approach I am using to execute this project.

  1. PLAN
  2. BUILD
  3. EXECUTE
  4. EXPLORE

PLAN: The goal of this project is to show you the tools and techniques that I have used for a machine learning project. All of the graphs, data, and models are transparent, and the code is always online.

Planning is the phase where we discuss everything from getting the data and cleaning it to understanding the requirements. We have two datasets: "Hate Crime" and "District-wise education data".

"Hate Crime" dataset: This data is taken from the Amnesty International 'Halt the Hate' campaign website. It records hate crimes against Dalits, Muslims, Adivasis, Transgender people, Christians, and 'Other' vulnerable groups. The term 'hate crime' is generally applied to criminal acts against people based on their real or perceived membership of a particular group, such as caste, religion, or ethnicity, among others.

To get more features in our dataset, I join it with education data, i.e. the "District-wise education data" dataset. It provides complete information about district-wise primary and secondary education for the academic year 2015-16, and many inferences can be drawn from it. It contains 680 observations and 819 variables. You can download this data from this link.

Both datasets are in CSV format. Let us look at some of the variables present in them.

Categorical variables:

Nominal variables: the columns "District", "State", and "victim_name".

Dichotomous variables: a special case of nominal variables with exactly two categories. identity_dalit, identity_muslim, identity_adivasi, identity_transgender, identity_christian, and identity_others are some examples of dichotomous variables in our dataset.

Ordinal variables: just like nominal variables, ordinal variables have two or more categories, with the added condition that the categories are ordered. For example, "victims age" is an ordinal variable between 10 and 70, where ages below 17 are minors, ages above 50 are senior citizens, and ages between 18 and 38 are considered adults (a small derivation sketch follows this list).

Continuous variables are subdivided into interval and ratio:

Interval: the date and month of the incident are interval continuous variables.

Ratio: SEXRATIO, TOTPOPULAT, and GROWTHRATE in the "District wise education" data are ratio continuous variables.
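
To make the ordinal example concrete, here is a minimal sketch of deriving an ordered age-group factor once the data frame dt is loaded (see the code further below); the label for the 39-50 gap is my own assumption, since the post does not name that group.

# Hypothetical sketch: bin victim_ages into the ordered groups described above.
# The "age 39-50" label is an assumption; the post leaves this range unnamed.
dt$age_group <- cut(dt$victim_ages,
                    breaks = c(0, 17, 38, 50, Inf),
                    labels = c("minor", "adult", "age 39-50", "senior citizen"),
                    ordered_result = TRUE)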

We form hypotheses and build our understanding of the data by asking the questions below.

  1. What age group are the victims of such crimes?
  2. At what time of the month or year do these crimes happen more often?
  3. Does the district-level growth rate have any impact on such events?
  4. Does the political party in power have any relationship with the incidents or the age of the victims?
  5. Does a minority group such as Muslims show the same pattern as the Dalit minority group?
  6. Do the dates have any relation to these incidents? (In India most small industries pay salaries between the 1st and the 10th, and yes, a relation is found.)
  7. Does the age of the victim have any relation to the growth rate or the population of the city?

In this phase we carry out initial data analysis to assess the data structure, including checking naming conventions, identifying duplicates, merging data, and further cleaning the data if required. Initial data analysis helps identify any additional data requirements, which is why you see a small feedback loop built into the process flow.

library(dplyr)
library(ggplot2)
options(scipen = 999)
setwd("C:/Users/victor/data")

aa_2015 <- read.csv('AA_2015.csv', stringsAsFactors = F)
aa_2016 <- read.csv('AA_2016.csv', stringsAsFactors = F)
aa_2017 <- read.csv('AA_2017.csv', stringsAsFactors = F)

# Stack the three years of hate-crime data
dt <- rbind(aa_2015, aa_2016, aa_2017)

edu = read.csv('2015_16_Statewise_Secondary.csv', stringsAsFactors = F)

# Join the hate-crime data with the education data on state name
hate_edu = merge(dt, edu, by.x = 'state', by.y = 'statname')

# Remove a few outliers to keep the distribution close to normal:
# age == 0 marks victims whose age was not identified;
# age > 75 drops the single age-80 row.
# Note: we filter the merged data so the education columns
# (GROWTHRATE, TOTPOPULAT, ...) used later are available in dt2.
dt2 <- filter(hate_edu, victim_ages > 0)
dt2 <- filter(dt2, victim_ages < 75)
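
Before exploring, here is a quick illustrative pass at the structural checks mentioned above (naming conventions, duplicates, missing values); these lines are my addition, not part of the original post.

# Illustrative structural checks
str(dt)              # column names and types
sum(duplicated(dt))  # count of exact duplicate rows
colSums(is.na(dt))   # missing values per column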

 

EXPLORE:

In this part, a more rigorous analysis is done by creating hypotheses, sampling data using various techniques, checking the statistical properties of the sample, and performing statistical tests to reject or accept the hypotheses.

Here are the summary and histogram of victims' ages:

summary(dt2$victim_ages)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   17.00   23.00   26.86   35.00   70.00

hist(dt2$victim_ages, breaks = 100, main = "Histogram on Victims Age")

Sample Distribution Test

Kolmogorov-Smirnov test (a sample distribution test): as there are three yearly samples available, we are interested in whether they come from the same distribution. To verify this, we run the KS test using the code below.
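
The yearly samples used below (sin_2015, sin_2016, sin_2017) are not constructed anywhere in the post; a minimal sketch, assuming they are simple random subsets of each year's data frame (the sample size of 200 is illustrative only):

# Assumption: sin_* are random samples drawn from each year's data
set.seed(42)
sin_2015 <- aa_2015[sample(nrow(aa_2015), min(200, nrow(aa_2015))), ]
sin_2016 <- aa_2016[sample(nrow(aa_2016), min(200, nrow(aa_2016))), ]
sin_2017 <- aa_2017[sample(nrow(aa_2017), min(200, nrow(aa_2017))), ]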

ks.test(sin_2016$victim_ages, sin_2017$victim_ages, alternative = "two.sided")

Two-sample Kolmogorov-Smirnov test

data:  sin_2016$victim_ages and sin_2017$victim_ages
D = 0.19219, p-value = 0.0007637
alternative hypothesis: two-sided

The result shows a low p-value, which suggests the age distributions differ and does not confirm that the datasets come from the same distribution. Since we do not know the true mean, we cannot be sure whether they come from the same distribution.

Let's now visualize the samples by generating histograms.

par(mfrow = c(1, 3))
hist(sin_2015$victim_ages, breaks = 40, col = "red", xlab = "Victims Age",
     main = "Histogram for Sample Data 2015")
hist(sin_2016$victim_ages, breaks = 40, col = "green", xlab = "Victims Age",
     main = "Histogram for Sample Data 2016")
hist(sin_2017$victim_ages, breaks = 40, col = "blue", xlab = "Victims Age",
     main = "Histogram for Sample Data 2017")


The plots above give us some hint that the samples belong to the same distribution.

Now we will run a formal test comparing the mean victim age between our two samples.

t.test(sin_2016$victim_ages,sin_2017$victim_ages)

Below is the result:

Welch Two Sample t-test
data:  sin_2016$victim_ages and sin_2017$victim_ages
t = 1.557, df = 396.36, p-value = 0.1203
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7075485  6.0956983
sample estimates:
mean of x mean of y
17.83621  15.14213

For the mean we get a p-value above 5%, so we fail to reject the null hypothesis: the data may well come from the same distribution. The 95 percent confidence interval, which contains zero, is also a good indicator here.

Observe Skewness

As variance is a measure of spread, skewness measures asymmetry about the mean of the probability distribution of a random variable.

• Negative skew: The left tail is longer; the mass of the distribution is concentrated on the right. The distribution is said to be left-skewed, left-tailed, or skewed to the left.

• Positive skew: The right tail is longer; the mass of the distribution is concentrated on the left. The distribution is said to be right-skewed, right-tailed, or skewed to the right.

Now let us observe the skewness in the data with a histogram, plotted using the code below.

hist(dt2$victim_ages, breaks = 100, main = "Normal Distribution")

It shows that the victims' ages are not symmetric; with the mean (26.86) above the median (23.00), the distribution is right-skewed (positively skewed).
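
As a numerical cross-check (not in the original post), the skewness can also be computed, for example with the e1071 package:

# Skewness > 0 indicates a right-skewed (positively skewed) distribution
library(e1071)
skewness(dt2$victim_ages)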

Kurtosis

Kurtosis is a measure of the peakedness and tailedness of the probability distribution of a random variable. Like skewness, kurtosis is used to describe the shape of the probability distribution function.

Now let us look at the density plot to check whether our data is normally distributed and to inspect the density of our ordinal variable.

The plot shows signs of two peaks, meaning there are two age groups among the victims, and a kurtosis value > 3: a higher peak and longer tails than a normal distribution.
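
The code for this plot is not shown in the post; a minimal sketch of a density plot plus a numerical kurtosis check:

# Density plot of victims' age and a kurtosis value via e1071.
# Note: e1071::kurtosis returns excess kurtosis (normal = 0);
# add 3 to compare with the Pearson value (> 3) quoted above.
library(e1071)
plot(density(dt2$victim_ages), main = "Density of Victims' Age")
kurtosis(dt2$victim_ages)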

Let us turn our attention to when these crimes occurred and see whether there is any relationship with salary dates; in India, many private-sector employers pay salaries between the 1st and the 15th.

summary(dt2$Date)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    7.00   16.00   15.94   24.00   31.00

hist(dt2$Date, breaks = 100, main = "Histogram on Time of incidence")
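
To quantify the visual impression, here is a quick illustrative check (not in the original post) of the share of incidents falling in the first half of the month:

# Proportion of incidents on days 1-15 versus days 16-31
prop.table(table(first_half = dt2$Date <= 15))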

Below we plot a histogram of the growth rate of states against the number of times these incidents occurred in those states.

hist(dt2$GROWTHRATE, breaks = 100, main = "Histogram on Growth Rate")

Let us draw the normal probability density function (PDF), which returns the height of the probability distribution at each point.

test_x <- as.vector(dt2$victim_ages)

y <- dnorm(test_x, mean = mean(dt2$victim_ages), sd = sd(dt2$victim_ages))

plot(test_x, y, main = "Returns the height of the probability distribution at each point.")

test_x <- as.vector(dt2$Date)

y <- dnorm(test_x, mean = mean(dt2$Date), sd = sd(dt2$Date))

plot(test_x, y, main = "Returns the height of the probability distribution at each point.")

Cumulative Distribution Function (CDF)

test_x <- as.vector(dt2$victim_ages)  # reset to ages (test_x currently holds Date)
y <- pnorm(test_x, mean = mean(dt2$victim_ages), sd = sd(dt2$victim_ages), lower.tail = FALSE)

plot(test_x, y, main = "Cumulative Distribution Function.")

test_x <- as.vector(dt2$Date)
y <- pnorm(test_x, mean = mean(dt2$Date), sd = sd(dt2$Date), lower.tail = FALSE)

plot(test_x, y, main = "Cumulative Distribution Function on Date.")

Now let's generate random variables from a normal distribution fitted to our data.

test_x <- as.vector(dt2$victim_ages)

# Draw one random normal value per observation, using the sample mean and sd
y <- rnorm(length(test_x), mean = mean(dt2$victim_ages), sd = sd(dt2$victim_ages))

plot(test_x, y, main = "Generate Random variables for Normal Distribution")

Let's assume the true mean is 24. Now let us run a one-sample t-test of our sample against this true mean.

t.test(dt2$victim_ages, mu=24)

One Sample t-test

data:  dt2$victim_ages
t = 3.0448, df = 236, p-value = 0.002592
alternative hypothesis: true mean is not equal to 24
95 percent confidence interval:
25.00829 28.70479
sample estimates:
mean of x
26.85654

The low p-value (0.0026) provides evidence against the null hypothesis that the true mean is 24; our sample mean of 26.86 appears to differ from it.

A boxplot is a wonderful representation of the degree of dispersion (spread), skewness, and outliers in a single plot. It is a compact way of representing the five-number summary, namely the median, the first and third quartiles (25th and 75th percentiles), and the min and max.

boxplot(I(TOTPOPULAT + GROWTHRATE) ~ identity_dalit, data = dt2)
title("identity_dalit varies with growth rate")

The party in power seems to have some effect on incidents: the BJP states show many outliers and an effect even in states where the growth rate is good. From the data, we understand that the higher the growth rate, the lower the chance of seeing hate crimes. A social economist might intuitively generate insights just by glancing at this plot; however, a naive analyst like me might produce erroneous conclusions without attention to the details.

boxplot(I(TOTPOPULAT + GROWTHRATE) ~ party_in_power, data = dt2)
title("party in power varies with growth rate")

Here is another boxplot, showing the relationship between state and victims' age.

ggplot(dt2, aes(x = factor(state), y = victim_ages)) + geom_boxplot() + xlab("state") + ylab("Victims Age")

An alternative to the box plot is the violin plot:

p <- ggplot(dt2, aes(factor(state), GROWTHRATE))

p + geom_violin()

Let's project five-dimensional data onto a two-dimensional surface.

ggplot(dt2, aes(state, victim_ages)) + geom_point(aes(color = factor(identity_dalit), size = GROWTHRATE), alpha = 0.3) + xlab("state") + ylab("Victims Age") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Let us try to make it more visible

dt2$GROWTHRATE = dt2$GROWTHRATE^2

ggplot(dt2, aes(state, victim_ages)) + geom_point(aes(color = factor(identity_dalit), size = GROWTHRATE, shape = factor(caste_related_violence)), alpha = 0.3) + xlab("state") + ylab("Victims Age") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Note: squaring again compounds the earlier transformation, exaggerating size differences further
dt2$GROWTHRATE = dt2$GROWTHRATE^2

ggplot(dt2, aes(victim_ages, TOTPOPULAT)) + geom_point(aes(color = factor(identity_dalit), size = GROWTHRATE, shape = factor(caste_related_violence)), alpha = 0.3) + xlab("Victims Age") + ylab("Total Population") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now let's draw a scatterplot matrix.

options(repr.plot.width = 3, repr.plot.height = 3)

library(car)  # provides scatterplotMatrix
scatterplotMatrix(~victim_ages + TOTPOPULAT + GROWTHRATE, data = dt2)

Regression: Linear or Non-Linear Problem

If you want to know more about residual plots, see my other blog post at this link. Let's fit the regressions and draw a residual plot to see whether there is a relationship between the dependent and independent variables, i.e. whether we have a regression or non-regression use case.

For this, we take growth rate vs. victims' age and growth rate vs. population.

summary(lm(GROWTHRATE ~ victim_ages, data = dt2))

Call:
lm(formula = GROWTHRATE ~ victim_ages, data = dt2)

Residuals:
    Min      1Q  Median      3Q     Max
-43.062  -4.868  -0.828   4.562  56.609

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)
(Intercept)  18.25824    1.47817  12.352 <0.0000000000000002 ***
victim_ages  -0.03905    0.04898  -0.797               0.426
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.67 on 230 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.002756, Adjusted R-squared:  -0.00158
F-statistic: 0.6356 on 1 and 230 DF,  p-value: 0.4261

summary(lm(GROWTHRATE ~ TOTPOPULAT, data = dt2))

Call:
lm(formula = GROWTHRATE ~ TOTPOPULAT, data = dt2)

Residuals:
    Min      1Q  Median      3Q     Max
-38.705  -3.827  -1.052   3.725  58.495

Coefficients:
                 Estimate    Std. Error t value             Pr(>|t|)
(Intercept) 13.1537877687  1.3267443077   9.914 < 0.0000000000000002 ***
TOTPOPULAT   0.0000015070  0.0000004215   3.575             0.000427 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.4 on 230 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.05264,  Adjusted R-squared:  0.04852
F-statistic: 12.78 on 1 and 230 DF,  p-value: 0.000427
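
Only the second model shows a significant slope (p = 0.000427 versus 0.4261). The residual plot itself is not drawn in the post; a minimal sketch for the second model:

# Residuals vs. fitted values for GROWTHRATE ~ TOTPOPULAT
fit <- lm(GROWTHRATE ~ TOTPOPULAT, data = dt2)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)  # a pattern-free band around 0 suggests a linear fit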

Let us draw a histogram of the number of incidents by the party in power in each state.
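
The code for this plot is not shown in the post; an illustrative version, counting incident rows per party:

# Number of incidents by party in power (one row in dt2 = one incident)
ggplot(dt2, aes(x = party_in_power)) +
  geom_bar() +
  xlab("Party in power") + ylab("Number of incidents") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))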

Next, we plot a map of India showing geographically the number of victims in each area.

Let's draw a family of histograms to compare two factors in our dataset. Say we want to see the variation in the date of the incident by the party in power.

options(repos = c(CRAN = "http://cran.rstudio.com"))

install.packages('gridExtra')

hist.plot = function(df, col, bw, max, min){
  ggplot(df, aes_string(col)) +
    geom_histogram(binwidth = bw) +
    xlim(min, max)
}


hist.family = function(df1, df2, col1, col2, num.bin = 30){

  require(ggplot2)
  require(gridExtra)

  ## Compute a t-test between the two columns
  paired = FALSE
  t <- t.test(df1[, col1], df2[, col2], paired = paired)

  ## Compute the bin width over the combined range
  ## (the original used df1 for both columns, a likely typo)
  max = max(c(df1[, col1], df2[, col2]))
  min = min(c(df1[, col1], df2[, col2]))
  bin.width = (max - min)/num.bin

  ## Create the first histogram, with a red line at the mean
  p1 = hist.plot(df1, col1, bin.width, max, min)
  p1 = p1 + geom_vline(xintercept = mean(df1[, col1]), color = 'red', size = 1)

  ## Create the second histogram
  p2 = hist.plot(df2, col2, bin.width, max, min)
  p2 = p2 + geom_vline(xintercept = mean(df2[, col2]), color = 'red', size = 1)

  ## Stack the plots
  grid.arrange(p1, p2, nrow = 2, ncol = 1)

  print(t)
}


datF1 = dt2[dt2$party_in_power == 'BJP', ]

datF2 = dt2[dt2$identity_dalit != 'No', ]

hist.family(datF1, datF2, 'Date', 'Date')

Some opinions and recommendations from simple plots:

1. Relationship between a state's literacy rate and the number of incidents in that state.

After the visualizations above, let's try to draw simple recommendations from the analyzed features. The plots below show a clear relationship between the literacy rate and the number of incidents: the lower a state's literacy rate, the higher its number of hate crimes, and vice versa.

hate_edu <- read.csv('hate_edu_merged.csv', stringsAsFactors = F)

ggplot(hate_edu, aes(x = mean_literacy_rate, y = sum_incidents)) + geom_point() + geom_smooth(method = "lm")

Here we can see some points outside the range; we can filter them out by setting xlim() and ylim().

g <- ggplot(hate_edu, aes(x = mean_literacy_rate, y = sum_incidents)) + geom_point() + geom_smooth(method = "lm")  # set se=FALSE to turn off confidence bands

g1 <- g + coord_cartesian(xlim = c(75, 95), ylim = c(0, 70))  # zooms in

plot(g1)

Now let us bring another factor, mean population, into our plot.
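
The code for this plot is not shown; an illustrative version, assuming hate_edu also carries a mean_population column (that column name is my assumption):

# Map mean population to point size on the literacy-rate plot
ggplot(hate_edu, aes(x = mean_literacy_rate, y = sum_incidents)) +
  geom_point(aes(size = mean_population), alpha = 0.5) +
  geom_smooth(method = "lm")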

Now let's draw the distribution of the growth rate of a state versus the frequency of crimes in that state.

# hate_dist_edu and dist_dt are district-level merges of the hate-crime
# and education data (their construction is not shown here)
rbind(hate_dist_edu$DISTRICTS, hate_dist_edu$District)

hate_dist_edu[is.na(hate_dist_edu$DISTRICTS), 7:8]

summary(dist_dt$GROWTHRATE)

sd(dist_dt$GROWTHRATE)

hist(hate_dist_edu$GROWTHRATE, breaks = 100, main = "Normal Distribution")

2. Incidents' relation to the month of the year.

by_month = dt %>% group_by(Month.of.Incident) %>% summarize(mean_victims = mean(number_of_victims, na.rm = TRUE))

by_month$Month.of.Incident = factor(by_month$Month.of.Incident, levels = c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'))

ggplot(by_month, aes(x = Month.of.Incident, y = mean_victims, fill = Month.of.Incident)) + geom_bar(stat = "identity", color = "black") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

 

BUILD

Most analytics projects die out in the first or second phase; the one that reaches this phase, however, has great potential to be converted into a data product. This phase requires a careful study of whether a machine learning model is required or whether the simple descriptive analysis done in the first two phases is sufficient. In industry, unless you show a return on the effort, time, and money required to build an ML model, approval from management is hard to come by. And since many ML algorithms are something of a black box whose output is at times difficult to interpret, the business may reject them outright at the very beginning.

Here I'll show you the steps involved once we have decided which model suits our requirement. You can also refer to my article on feature selection, and to this article of mine on interpreting regression analysis.

Step 1: Join/Select the datasets with Features and Keys.

Step 2: Select features.

Step 3: Split the data 70:30, where 70% goes to training the model and 30% is used for scoring.

Step 4: Choose the right model for your data. Here, for testing and to show the components involved, I am using a binary classifier, the "Two-Class Boosted Decision Tree" algorithm.

Step 5: Train the model. To train my model I have used the "identity_dalit" column as the label.

Step 6: Score the model. Now pass the test data to score the model. To understand the ROC curve and other terminology, refer to this article of mine.

Step 7: Evaluate the model.
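
A minimal R sketch of steps 3-7, with the gbm package standing in for the "Two-Class Boosted Decision Tree"; the feature list and the Yes/No coding of identity_dalit are my assumptions, not part of the original post.

library(gbm)

# Assumption: identity_dalit is coded "Yes"/"No"; recode to 0/1 for gbm
dt2$dalit_bin <- as.integer(dt2$identity_dalit == "Yes")

set.seed(123)
idx   <- sample(nrow(dt2), 0.7 * nrow(dt2))   # Step 3: 70/30 split
train <- dt2[idx, ]
test  <- dt2[-idx, ]

# Steps 4-5: train a boosted decision tree on the identity_dalit label
fit <- gbm(dalit_bin ~ victim_ages + GROWTHRATE + TOTPOPULAT,
           data = train, distribution = "bernoulli", n.trees = 100)

# Step 6: score the held-out 30%
pred <- predict(fit, test, n.trees = 100, type = "response")

# Step 7: evaluate with a confusion matrix at a 0.5 threshold
table(actual = test$dalit_bin, predicted = as.integer(pred > 0.5))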

 

So, suppose I pass these few criteria and still decide to build the ML model; then comes the time to understand the technicalities of each algorithm and how it works on this particular dataset. Once the model is built, it's always good to ask whether the model satisfies our findings from the initial data analysis. If not, it's advisable to take a small feedback loop. I want to build a data product in the process flow before the evaluation phase (not a full-fledged product, but a small Excel sheet presenting all the analysis done up to this point). It could even be a descriptive model that articulates the way I approached the problem. In the next article we will discuss the relevant model for our use case and evaluate its performance.

Happy Machine Learning…
