Looking at Data: Distributions
Statistics is the science of conducting studies to collect, organize,
summarize, analyze, and draw conclusions from data.
Uses and Misuse of Statistics
1. Statistical techniques can be used to describe data, compare two or more
data sets, determine if a relationship exists between variables, test
hypotheses, and make estimates about population measures.
2. Statistical techniques can also be misused, for example to sell products
that do not work properly, or to get our attention by using statistics to
evoke fear, shock, and outrage.
There is a long-standing saying that illustrates this point: “There are three
types of lies: lies, damn lies, and statistics.”
Individuals: Objects described by a set of data. Individuals
may be people, but they may also be animals or things.
Population: a class of individuals.
Sample: a subset of a given population. The idea of sampling is to study a
part in order to gain information about the whole.
Variable: a characteristic of an individual. A variable can take different
values for different individuals.
Example: Let us consider a Statistics class (population)
consisting of 12 students (individuals).
Name      ID #   Height (unit: cm)   Number of Math courses taken   Favorite color
Queenie   005    155                 3                              Blue
Peter     002    176                 3                              Blue
Yukie     007    172                 0                              Orange
Ying      123    165                 0                              Green
Andy      098    168                 6                              Black
Eric      006    172                 6                              Green
Sylvia    346    158                 3                              Brown
JP        990    178                 5                              White
Daniel    600    170                 4                              Yellow
Susanna   569    165                 2                              Brown
John      111    196                 9                              Black
Bart      919    181                 19                             Black
Variables: ID # (denoted by I), Height (denoted by H), Number of Math courses
taken (denoted by N), and Favorite color (denoted by C).
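To make the vocabulary concrete, the class table can be written as a small data structure in which each record is one individual and each field is one variable. This is only an illustrative sketch: the type name `Student` and its field names are hypothetical, chosen here for readability.

```python
from typing import NamedTuple


class Student(NamedTuple):
    """One individual; each field is one variable (names are illustrative)."""
    name: str
    id_number: str       # I, stored as text so leading zeros ("005") survive
    height_cm: int       # H
    math_courses: int    # N
    favorite_color: str  # C


# The population: the whole class of 12 students.
students = [
    Student("Queenie", "005", 155, 3, "Blue"),
    Student("Peter", "002", 176, 3, "Blue"),
    Student("Yukie", "007", 172, 0, "Orange"),
    Student("Ying", "123", 165, 0, "Green"),
    Student("Andy", "098", 168, 6, "Black"),
    Student("Eric", "006", 172, 6, "Green"),
    Student("Sylvia", "346", 158, 3, "Brown"),
    Student("JP", "990", 178, 5, "White"),
    Student("Daniel", "600", 170, 4, "Yellow"),
    Student("Susanna", "569", 165, 2, "Brown"),
    Student("John", "111", 196, 9, "Black"),
    Student("Bart", "919", 181, 19, "Black"),
]

# A variable takes different values for different individuals.
heights = [s.height_cm for s in students]
courses = [s.math_courses for s in students]

# A sample is any subset of the population, e.g. the first four students.
sample = students[:4]

print(len(students), "individuals;", len(Student._fields) - 1, "variables besides the name")
```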
Quantitative Variables:
· Discrete Variables
· Continuous Variables
Qualitative (Categorical) Variables:
Describing Distributions with Graphs
Categorical variables: Pie charts or Bar graphs
Quantitative variables: Stemplots or Histograms
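As a rough illustration of the two kinds of display, the sketch below draws a text bar graph for Favorite color (categorical) and a stemplot for Height (quantitative), using the values from the class table. The helper names `bar_graph` and `stemplot` are made up for this example, and using the tens as stems is just one reasonable choice.

```python
from collections import Counter, defaultdict

# Values copied from the class table, in table order.
colors = ["Blue", "Blue", "Orange", "Green", "Black", "Green",
          "Brown", "White", "Yellow", "Brown", "Black", "Black"]
heights = [155, 176, 172, 165, 168, 172, 158, 178, 170, 165, 196, 181]


def bar_graph(values):
    """Return one text bar per category, most frequent categories first."""
    counts = Counter(values)
    return [f"{category:<8}{'#' * count}"
            for category, count in counts.most_common()]


def stemplot(values):
    """Return a stemplot with the tens as stems and the units as leaves."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):  # keep empty stems too
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


print("\n".join(bar_graph(colors)))
print()
print("\n".join(stemplot(heights)))
```

The stemplot keeps every individual height visible, which is why it suits small quantitative data sets; the bar graph only needs the count in each category.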
Another type of classification for variables:
1. The nominal level of measurement classifies data into mutually exclusive (nonoverlapping), exhaustive categories in which no order or ranking can be imposed on the data.
Example: Zip Code, Gender (male, female), Eye color (blue, brown, green, hazel)
2. The ordinal level of measurement classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
Example: Grade (A, B, C, D, F), Rating scale (poor, good, excellent)
3. The interval level of measurement ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.
Example: SAT score, IQ, Temperature
4. The ratio level of measurement possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist when the same variable is measured on two different members of the population.
Example: Height, Weight, Time, Salary
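One way to see the difference between the interval and ratio levels is to change units and check whether a ratio survives. The sketch below assumes temperature is recorded in degrees Celsius or Fahrenheit (neither scale has a true zero) and compares it with two heights taken from the class table; the function names are illustrative.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def cm_to_inches(cm):
    """Convert a length from centimetres to inches."""
    return cm / 2.54


# Interval level: the ratio depends on the unit, so "twice as hot" is not meaningful.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio level: height has a true zero, so the ratio is the same in any unit.
ratio_cm = 196 / 155
ratio_inches = cm_to_inches(196) / cm_to_inches(155)

print(ratio_celsius, round(ratio_fahrenheit, 2))
print(round(ratio_cm, 4), round(ratio_inches, 4))
```

Differences, by contrast, are meaningful on both levels: a 10-degree Celsius gap always corresponds to an 18-degree Fahrenheit gap.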
Describing Distributions with Numbers
Given a list of observations as follows:
1 1 1 1 2 2 3 3 3 5 7 19
# of observations: n = 12
Max = 19
Min = 1
Range = Max – Min = 18
The Range of a given list measures the spread of the distribution.
Mode = 1
Mean = 4
Median (M): at least 50% of observations <= M, and at least 50% of
observations >= M.
Since n = 12 is even, M is the average of the 6th and 7th ordered
observations: M = (2+3) / 2 = 2.5
Q1 (the first quartile): at least 25% of observations <= Q1, and at least
75% of observations >= Q1.
Q3 (the third quartile): at least 75% of observations <= Q3, and at least
25% of observations >= Q3.
IQR (InterQuartile Range) = Q3 – Q1 = 3
The IQR of a given list measures the spread of the distribution.
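The numbers above can be checked with a short script. The notes do not say which quartile rule is used; the sketch assumes the "median of each half" rule (Q1 is the median of the lower half of the ordered list, Q3 the median of the upper half), which reproduces IQR = 3 for this list. The helper name `quartiles_by_halves` is hypothetical.

```python
from statistics import mean, median, mode

data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]


def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed rule).

    For an odd number of observations the overall median is left out
    of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)


q1, q3 = quartiles_by_halves(data)

print("n      =", len(data))
print("Range  =", max(data) - min(data))
print("Mode   =", mode(data))
print("Mean   =", mean(data))
print("Median =", median(data))
print("Q1, Q3 =", q1, q3)
print("IQR    =", q3 - q1)
```

Other quartile conventions exist and can give slightly different values for Q1, Q3, and the IQR on the same list.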
SD: Standard Deviation
The SD of a given list measures the spread about the mean.
Formulas for SD:
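A common definition, and the one consistent with the SD values reported in the example that follows, is the sample standard deviation with divisor n - 1. Treat this as an assumption about the intended formula; here x_1, ..., x_n are the observations and \bar{x} is their mean:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

For the variable N in the class table, the mean is 5 and the squared deviations sum to 286, so s = \sqrt{286/11} = \sqrt{26}, which is approximately 5.09902.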
Five-number summary: Min, Q1, M, Q3, Max
Example:
          Height            N
Max       196               19
Min       155               0
Range     41                19
Mode      172 (also 165)    3
Mean      171.3333          5
Median    171               3.5
Q1        165               2.75
Q3        176.5             6
IQR       11.5              3.25
SD        10.89899          5.09902
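The values in this example can be reproduced with Python's standard `statistics` module. Two assumptions are made in this sketch: the quartiles use the interpolating "inclusive" method (this matches Q1 = 2.75 and Q3 = 176.5, so it is presumably what produced the table), and the SD is the sample standard deviation with divisor n - 1. The helper name `summarize` is hypothetical. Note that Height has two modes, since 165 and 172 each occur twice.

```python
from statistics import mean, median, multimode, quantiles, stdev

# H and N from the class table, in table order.
heights = [155, 176, 172, 165, 168, 172, 158, 178, 170, 165, 196, 181]
courses = [3, 3, 0, 0, 6, 6, 3, 5, 4, 2, 9, 19]


def summarize(values):
    """Numerical summary using interpolated ("inclusive") quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "modes": multimode(values),  # every value with the highest count
        "mean": mean(values),
        "median": median(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "sd": stdev(values),         # divisor n - 1
    }


for label, values in (("Height", heights), ("N", courses)):
    print(label)
    for key, value in summarize(values).items():
        print(f"  {key:<7}{value}")
```

The "inclusive" quartile method interpolates between neighbouring ordered observations, so it can differ from the "median of each half" rule on the same data.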
Boxplot:
Comparison of mean and median:
Common: Both numbers measure the ‘centre’ of a given distribution.
Difference: The median is resistant, that is, it is less sensitive to changes
in extreme values; the mean is not resistant.
Comparison of IQR and SD:
Common: Both numbers measure the spread of the distribution.
Difference: The IQR is resistant, that is, it is less sensitive to extreme
values; the SD is not resistant.
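Resistance can be checked numerically by changing one extreme value and recomputing. In this sketch the largest observation of the earlier list, 19, is replaced by a made-up value of 190; the quartiles again assume the "median of each half" rule, and `iqr_by_halves` is a hypothetical helper name.

```python
from statistics import mean, median, stdev

original = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]
inflated = original[:-1] + [190]  # hypothetical: 19 replaced by 190


def iqr_by_halves(values):
    """IQR with Q1 and Q3 taken as medians of the two halves (assumed rule)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(upper) - median(lower)


for label, values in (("original", original), ("inflated", inflated)):
    print(label,
          "mean =", mean(values),
          "median =", median(values),
          "IQR =", iqr_by_halves(values),
          "SD =", round(stdev(values), 3))
```

The median and IQR are identical for both lists, while the mean and SD move substantially, which is what "not resistant" means in practice.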