
Is today's world all about creativity and ideation?

Are they the seeds to be nurtured to bring in automation, innovation and transformation? There is a saying: necessity is the mother of invention. I would say innovation is an amalgamation of creativity and necessity. We need to understand the ecosystem in order to apply creativity and identify the ideas that bring in change. We need to keep pace with the changing ecosystem and think beyond the possible. What is the biggest challenge in doing this? "Unlearning and Learning": we tend to think the current ecosystem is the best. Be it the health, finserve, agriculture or mechanical domain, we need to empathize with the stakeholders to come up with a strategy to drive change. A very evident example is that the quality of life is changing every millisecond. A few decades back a phone connection was limited to a few, but today all millennials have a mobile phone. The phone is no longer just a medium to talk; it is such a powerful device that innovative solutions can be developed on it.

Exploratory Data Analysis using R

Hello Data Experts,

Let me continue from my last blog, http://outstandingoutlier.blogspot.in/2017/08/object-types-and-visualization-using-r.html “Data Types and Visualization using R”, where we discussed Object/Data Types like Vector, List, Factor, Data Frame, Array and Matrix. I also covered different graphical representations like Line Graph, Scatter Graph, Pie chart, Bar Graph, Histogram and Boxplot Graph.

Let us now move forward with the core fundamentals behind any statistical problem. We will learn about Exploratory Data Analysis (EDA): why do we need it, and how do we perform it?

Let me first help you understand why it matters. EDA is important for a Data Scientist because it helps us:
  • Understand key attributes of the data like Mean, Median etc.
  • Visualize whether there are any anomalies like outliers.
  • Detect patterns, such as a direct or inverse relationship between two sets of data points.
  • Identify data errors and irregularities (like Skewness).
  • Validate the assumption that the data (sample/population) is appropriate for statistical modeling.
  • Identify the right statistical model and avoid bias.
  • Assess the strength and direction of the relationship between input and output variables.
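As a quick illustration of the first two points, here is a minimal sketch using base R only. The vector `mileage` and its values are made up for this example, and the 1.5 x IQR rule is just one common convention for flagging outliers, not the only one.

```r
# Hypothetical sample; the value 25 is planted as an obvious outlier
mileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 25)

# Minimum, quartiles, median, mean and maximum in one call
mileage_summary <- summary(mileage)
print(mileage_summary)

# Flag values lying more than 1.5 * IQR outside the quartiles
q <- unname(quantile(mileage, c(0.25, 0.75)))
iqr <- q[2] - q[1]
outliers <- mileage[mileage < q[1] - 1.5 * iqr | mileage > q[2] + 1.5 * iqr]
print(outliers)
```

Here only the planted value 25 is flagged. A boxplot of the same vector would draw that value as a separate point beyond the whisker.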
There are 4 moments of statistics that together complete EDA.
  • First moment covers Mean, Median and Mode, which are measures of central tendency.
  • Second moment covers Variance, which is a measure of dispersion.
  • Third moment covers Skewness, which is a measure of asymmetry.
  • Fourth moment covers Kurtosis, which is a measure of tailedness (often loosely described as peakedness).
First Moment:
Mean is the average value of the dataset. It is a measure of central tendency, but it can be strongly influenced by outliers.
Median is the middle value of the sorted dataset. It is only slightly influenced by outliers, so it is often a better representation of central tendency than the mean. If there is an odd number of values in a dataset, the ((N+1)/2)th value is the median, whereas if there is an even number of values, the average of the (N/2)th and (N/2 + 1)th values is the median.
Mode is the value that occurs most often in the dataset, i.e., the most frequent value.
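To make the position rule for the median concrete: with an odd count the single middle value is taken, and with an even count the two middle values (positions N/2 and N/2 + 1 of the sorted data) are averaged. The sketch below is only illustrative; the helper name `median_by_position` and both samples are made up, and in practice the built-in median() does the same job.

```r
# Median computed by position in the sorted data
median_by_position <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd count: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even count: average of the two middle values
  }
}

odd_sample  <- c(7, 3, 9, 1, 5)      # hypothetical values
even_sample <- c(7, 3, 9, 1, 5, 11)  # hypothetical values

median_by_position(odd_sample)    # same result as median(odd_sample)
median_by_position(even_sample)   # same result as median(even_sample)
```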

Second Moment:
Variance measures how spread out the values are from the mean. The deviations of the values from the mean always sum to zero, which is why variance is calculated from the squared deviations.
Standard Deviation (SD) is calculated as the square root of the variance. Statistically, it reflects how closely the values cluster around the mean. SD also helps us determine the margin of error, confidence level and significance level.
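The relationship between deviations, variance and standard deviation can be checked by hand. This is a minimal sketch with made-up values; it assumes the sample form of variance (dividing by n - 1), which is what R's var() and sd() use.

```r
x <- c(4, 6, 8, 9, 10)          # hypothetical values

deviations <- x - mean(x)       # distance of each value from the mean
sum(deviations)                 # zero, up to floating-point rounding

# Sample variance: sum of squared deviations divided by (n - 1)
manual_var <- sum(deviations^2) / (length(x) - 1)
manual_sd  <- sqrt(manual_var)

all.equal(manual_var, var(x))   # TRUE
all.equal(manual_sd, sd(x))     # TRUE
```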

Third Moment:
The third moment measures Skewness: it shows on which side of the mean the distribution tapers off into a longer tail. Skewness can be Positive or Negative.
Negative Skewness reflects a long left tail, with the data concentrated toward the right, for example 4, 6, 8, 9, 10
Positive Skewness reflects a long right tail, with the data concentrated toward the left, for example 3, 4, 5, 6, 8, 9
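The sign of the skewness can be verified numerically. The sketch below assumes the moment-based definition, the third central moment divided by the second central moment raised to the power 1.5. The helper name `moment_skewness` and both samples are made up for illustration.

```r
# Moment-based skewness: m3 / m2^(3/2)
moment_skewness <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m3 / m2^1.5
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 10)   # one large value stretches the right tail
left_tailed  <- c(1, 8, 8, 8, 9, 9, 10)   # one small value stretches the left tail

moment_skewness(right_tailed)   # positive
moment_skewness(left_tailed)    # negative
```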

Fourth Moment:
The fourth moment, Kurtosis, is a measure of tailedness relative to a normal distribution.
Positive Kurtosis describes a thin, sharp peak with heavier tails. To explain it in the retail domain, the thin peak covers the few items, like bread and milk, that sell the most.
Negative Kurtosis describes a wider, flatter peak with lighter tails.

Let us now take a vector holding 20 values:

CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14, 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

Let us derive all 4 moments now:

First moment:

Mean:
# Get Mean value
mean(CarsMileage)

Output will be 12.8

Median:
# Get Median value
median(CarsMileage)

Output will be 12.5

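Mode:
# mode() in R returns the storage type of an object, not the statistical mode,
# so count how often each value occurs and pick the most frequent one
frequency <- table(CarsMileage)
as.numeric(names(frequency)[which.max(frequency)])

Output will be 12 (calling mode(CarsMileage) would only return "numeric")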

Second Moment:

Variance:
# Get Variance
var(CarsMileage)

Output will be 2.247368

Standard Deviation:
# Get Standard Deviation
sd(CarsMileage)
 
Output will be 1.499123

Third moment and Fourth moment:
For Skewness and Kurtosis there is no direct function in base R, but both can be derived using R. I will cover how to derive the third and fourth moments in my future blogs.
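Until then, here is a minimal sketch of one way to derive both from the same data using base R. It assumes the population (divide by n) form of the central moments and reports excess kurtosis, i.e. kurtosis minus 3, the value for a normal distribution. The helper name `central_moment` is made up for this example; other definitions with small-sample corrections will give slightly different numbers.

```r
CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14,
                 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

# Central moment of order k (population form: divides by n)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Third moment: skewness
skewness_value <- central_moment(CarsMileage, 3) / central_moment(CarsMileage, 2)^1.5

# Fourth moment: kurtosis, and excess kurtosis relative to a normal distribution
kurtosis_value  <- central_moment(CarsMileage, 4) / central_moment(CarsMileage, 2)^2
excess_kurtosis <- kurtosis_value - 3

print(skewness_value)
print(excess_kurtosis)
```

For this data the skewness is slightly below zero, so the distribution is close to symmetric, and the excess kurtosis is negative, which points to a flatter peak than a normal distribution.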

I hope you have got the essence and importance of the four moments in statistical analysis. They are the foundation for the rest of the statistical world. A statistician spends 50-60% of the time on EDA to make sure the data captured is complete and unbiased, which helps in taking the right decision and reaching the right outcome. Getting the right data, be it a population or a sample, is very important for making the right choice. Now that the basics of EDA are clear, I will move on to visualization to explore advanced statistical problems. In my next blog, I will cover “Advance statistical concepts using R Studio”.

Thank you for sparing the time to go through this blog. I hope it helped you build a sound foundation of statistics using R. Kindly share your valuable opinion, and please do not forget to suggest what you would like to hear from me in my future blogs.

Thank you...
Outstanding Outliers:: "AG".  
