

Normality Test for Data using R

Hello Data Experts,

Let me continue from my last blog http://outstandingoutlier.blogspot.in/2017/08/exploratory-data-analysis-using-r.html “Exploratory Data Analysis using R”, where I covered the four moments of statistics. Let me help recap all those 4 moments.

There are 4 moments of statistics:
· The first moment covers Mean, Median and Mode; it is a measure of central tendency.
· The second moment covers Variance, Standard Deviation and Range; it is a measure of dispersion.
· The third moment covers Skewness; it is a measure of asymmetry.
· The fourth moment covers Kurtosis; it is a measure of peakedness.

The four moments cover only the basic aspects of Exploratory Data Analysis; there are a few other techniques that are equally necessary to understand. In this blog we will focus on graphical visualization techniques and the standard normal distribution.

Let me continue with the example from my last blog, the CarsMileage dataset.

CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14, 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

A bar graph is the most basic plot drawn on a dataset to compare individual data points. The bars can be oriented horizontally or vertically.

barplot(CarsMileage)
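
As a quick illustration of the horizontal orientation mentioned above, barplot accepts a horiz argument (a minimal sketch):

barplot(CarsMileage, horiz = TRUE) # same data, bars drawn horizontally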

A histogram represents the data as bars, where each bar and its height have an interpretation: the height represents the count or frequency, while each bin logically groups events falling within a specific range. The distribution of these bins/bars reflects the shape of the data. Taking a retail example, a low-frequency bin can be an area of opportunity for a business to focus on and grab a niche market, whereas a high-frequency bin is already encashed.

hist(CarsMileage)
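
The number of bins can be influenced with the breaks argument (a small sketch; the count passed is only a suggestion that R may adjust):

hist(CarsMileage, breaks = 5) # ask for roughly 5 bins instead of the default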

A boxplot reflects the data distribution in quartiles. This graph plays a vital role in understanding the data, especially outliers. It has six key measures:

· Outliers
· Lower extreme
· 1st Quartile
· Median
· 3rd Quartile
· Upper extreme
 
boxplot(CarsMileage)

Drawing the boxplot horizontally can also help us understand the data distribution, as shown below.
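
In R, boxplot accepts a horizontal argument for exactly this (a minimal sketch):

boxplot(CarsMileage, horizontal = TRUE) # same quartiles, laid out left to right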

The graphs described above help us understand how the data is distributed (skewness, spread, outliers). Next, we will touch upon the important concepts of probability and probability distributions.

Probability, in simple terms, is defined as the count of occurrences of the expected event divided by the total number of events.
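
As a simple empirical illustration on our dataset (a sketch; "mileage of at least 14" is just an arbitrarily chosen event):

sum(CarsMileage >= 14) / length(CarsMileage) # favourable events / total events: 7 of 20, i.e. 0.35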

Probability distribution: if probability values are drawn on the Y axis and random variable values on the X axis, the resulting curve represents a probability distribution. For a continuous random variable it has certain characteristics:
· The random variable can range from -Infinity to +Infinity.
· The probability associated with any single value is always zero: the chance of one exact value out of infinitely many possible values is 1/Infinity, which is zero.
· The total area under the probability distribution curve is 1 (we verify this below).
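
We can sanity-check that last characteristic in R for the standard normal curve using numerical integration (a minimal sketch):

integrate(dnorm, -Inf, Inf) # area under the standard normal density: 1, with a tiny numerical error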

These concepts are important as they lay down the foundation for statistical analysis.

A normal distribution is characterized by Mean = Median = Mode for the probability distribution, and it forms a perfect bell curve. The curve is symmetrical about the mean (central tendency), with 50% of the area under the curve on each side of the mean. Standard deviations under the bell curve represent sigma levels, from 1 Sigma to 6 Sigma, where Six Sigma represents 3.4 DPMO (Defects Per Million Opportunities). Let us understand how 1 Sigma differs from 6 Sigma with a photoshoot example, where a photographer takes photos of models. If the photographer is professional and proficient, the number of retakes will be low. A photographer skilled at the 6 Sigma level would need only 3.4 retakes per million shots, whereas a 1 Sigma photographer would need about 32 retakes in every 100 shots. In photography it is fine to have retakes, but in industries like airlines and healthcare, where human lives are at stake, 6 Sigma adherence is the norm. Let me share the sigma values (reproduced in R below):
· 1 Sigma reflects 68.27% adherence.
· 2 Sigma reflects 95.45% adherence.
· 3 Sigma reflects 99.73% adherence.
· 4 Sigma reflects 99.9937% adherence.
· 5 Sigma reflects 99.99994% adherence.
· 6 Sigma reflects 99.9999998% adherence.
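
These adherence figures are simply the two-sided coverage of a normal distribution, which R reproduces in one line (a minimal sketch; note that the 3.4 DPMO quoted for Six Sigma follows the Six Sigma methodology's 1.5-sigma shift convention rather than this pure statistical coverage):

sapply(1:6, function(k) pnorm(k) - pnorm(-k)) # P(within k sigma of the mean) for k = 1..6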

Let us take a hypothetical example from a BPO, where the number of incidents could be a result of:
· the number of new rollouts to production,
· the number of test scenarios,
· the number of lines of code.

In this example, there are various parameters of different units that can influence the number of defects. To avoid any single parameter dominating, we should convert each value to a unit-less value. This approach of converting values of different units into unit-less values is called standardization, and plotting the standardized data gives the standardized normal distribution. The key characteristic of standardized data is that its mean is zero and its standard deviation is always 1. So if we have an output that depends on multiple inputs of varied units and scales, it is recommended to standardize them to make them unit-less. Once the data is unit-less, the output will not be unduly influenced by any specific parameter.

Statistically, the standardized value can be derived using the formula (X - Mean)/Standard Deviation; in Excel, the built-in STANDARDIZE statistical formula achieves the same. R makes life simple and keeps this complexity as a black box so the data scientist can focus on the core: the scale function derives the standardized data in a single shot (we will use pnorm later to get probabilities from the normal distribution). Execute the following command in R.

scale(CarsMileage) # this will result in standardized data.

The output will be as below:

            [,1]
 [1,] -0.5336455
 [2,]  0.8004682
 [3,] -0.2001171
 [4,]  0.4669398
 [5,]  1.4675251
 [6,] -1.8677592
 [7,] -1.2007024
 [8,] -0.5336455
 [9,] -0.5336455
[10,]  0.8004682
[11,] -0.5336455
[12,] -0.8671739
[13,] -0.2001171
[14,]  0.4669398
[15,]  1.4675251
[16,] -1.5342308
[17,]  1.4675251
[18,] -0.5336455
[19,]  0.8004682
[20,]  0.8004682
attr(,"scaled:center")
[1] 12.8
attr(,"scaled:scale")
[1] 1.499123
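
As a quick check of the key characteristic claimed above, standardized data has mean zero and standard deviation one (a minimal sketch; the variable name z is mine):

z <- as.vector(scale(CarsMileage)) # flatten the one-column matrix returned by scale
mean(z) # effectively 0, up to floating point noise
sd(z) # exactly 1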
 
Once the data is standardized, plotting these points follows the standardized normal distribution, or Z distribution. The probability of any single value is zero, but we can calculate the probability of a value being greater than or less than a given point using the standard normal distribution. To derive this probability manually, we would look up the Z table from statistics, finding the row and column matching our Z value, which is very time consuming. Using R it is very simple, and the probability can be derived with a single command.

# pnorm(value, Mean, Standard Deviation) returns the probability of a value less than or equal to the given value
pnorm(14, mean(CarsMileage), sd(CarsMileage))
 
The probability of the mileage being less than or equal to 14 comes out to approximately 79%.

What is the probability of a value falling within a range, i.e., greater than x and less than y? Using R it can be derived as shown below:
pnorm(y, Mean, SD) - pnorm(x, Mean, SD)
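
For instance, the probability of the mileage falling between 12 and 14 in our dataset (a small worked sketch):

pnorm(14, mean(CarsMileage), sd(CarsMileage)) - pnorm(12, mean(CarsMileage), sd(CarsMileage)) # roughly 0.49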


All of this is relevant only if the data follows the normal distribution, so it is important for us to check whether the data is normal using R. If the points produced by the commands below fall close to a straight line, the data can be treated as normal and is good for statistical analysis.

qqnorm(CarsMileage) # plots sample quantiles against theoretical normal quantiles
qqline(CarsMileage) # adds the reference line the points should follow if the data is normal
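
Beyond this visual check, base R also ships a formal normality test, the Shapiro-Wilk test (an addition beyond the commands above, offered as a sketch): a p-value above 0.05 means we cannot reject normality.

shapiro.test(CarsMileage) # Shapiro-Wilk normality test from the stats package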

Now that we have talked about the key concepts, remember that data must follow the normal distribution for statistical analysis to be meaningful. If the data in its original form does not follow the normal distribution, we should try transforming the dataset and measuring normality again. Data can be transformed by applying square root, cube root or exponential equations, and there are multiple other ways to transform data and check its normality; a minimal sketch follows below.
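
Here the square root transform is just one of the options listed above, and the variable name TransformedMileage is hypothetical:

TransformedMileage <- sqrt(CarsMileage) # one candidate transformation
qqnorm(TransformedMileage) # re-check normality on the transformed data
qqline(TransformedMileage)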

Once normality of the data is established, the right statistical analysis can be executed, helping us infer the right outcome and make the right decisions.

I hope this blog helped you gain insight into the normal probability distribution and normality checks. In the next blog, I will cover which statistical analysis techniques can be executed on normal and non-normal data. This completes Exploratory Data Analysis: we covered the four moments of statistics and the normality check of data. This is the key foundation for the rest of the statistical world, since 60%-70% of a statistician's time goes into executing EDA to make sure the data captured is complete and unbiased, which helps one reach the right decision and the right outcome. Getting the right data, be it a population or a sample, is very important for making the right choice. Now that the EDA basics are in place, next we will cover “Advanced statistical concepts like confidence level using R Studio”.

Thank you for going through this blog; I hope it helped you build a sound foundation of statistics using R. Kindly share your valuable opinion, and please do not forget to suggest what you would like to understand and hear from me in my future blogs.

Thank you...
Outstanding Outliers:: "AG".  

 

 
