Looking at Data-Distributions

 

Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.

 

Uses and Misuse of Statistics

1.      Statistical techniques can be used to describe data, compare two or more data sets, determine if a relationship exists between variables, test hypotheses, and make estimates about population measures.

2.      Statistical techniques can also be misused, for example to sell products that do not work properly, or to grab our attention by using statistics to evoke fear, shock, and outrage.

A long-standing saying illustrates this point: “There are three types of lies – lies, damn lies, and statistics.”

 

 

Individuals: Objects described by a set of data. Individuals may be people, but they may also be animals or things.

 

Population: the entire class of individuals about which information is wanted.

 

Sample: a subset of a given population. The idea of sampling is to study a part in order to gain information about the whole.

 

A variable: a characteristic of an individual. A variable can take different values for different individuals.

 

 

Example: Let us consider a Statistics class (population) consisting of 12 students (individuals).

 

Name      ID #   Height (unit: cm)   Number of Math courses taken   Favorite color
Queenie   005    155                 3                              Blue
Peter     002    176                 3                              Blue
Yukie     007    172                 0                              Orange
Ying      123    165                 0                              Green
Andy      098    168                 6                              Black
Eric      006    172                 6                              Green
Sylvia    346    158                 3                              Brown
JP        990    178                 5                              White
Daniel    600    170                 4                              Yellow
Susanna   569    165                 2                              Brown
John      111    196                 9                              Black
Bart      919    181                 19                             Black

 

 

 

Variables: ID# (denoted by I), Height (denoted by H), Number of Math courses taken (denoted by N), and Favorite color (denoted by C).
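As a minimal sketch, the class table can be stored in Python as a list of records, one per individual, with one field per variable. The names Student, students, heights, courses, and colors are illustrative assumptions and do not come from the notes; IDs are kept as text so that leading zeros such as 005 are preserved.

```python
from typing import NamedTuple


class Student(NamedTuple):
    name: str
    id_number: str       # I: stored as text so "005" keeps its leading zeros
    height_cm: int       # H
    math_courses: int    # N
    favorite_color: str  # C


# The population: 12 individuals, copied from the class table.
students = [
    Student("Queenie", "005", 155, 3, "Blue"),
    Student("Peter", "002", 176, 3, "Blue"),
    Student("Yukie", "007", 172, 0, "Orange"),
    Student("Ying", "123", 165, 0, "Green"),
    Student("Andy", "098", 168, 6, "Black"),
    Student("Eric", "006", 172, 6, "Green"),
    Student("Sylvia", "346", 158, 3, "Brown"),
    Student("JP", "990", 178, 5, "White"),
    Student("Daniel", "600", 170, 4, "Yellow"),
    Student("Susanna", "569", 165, 2, "Brown"),
    Student("John", "111", 196, 9, "Black"),
    Student("Bart", "919", 181, 19, "Black"),
]

# One list of values per variable: a variable takes one value per individual.
heights = [s.height_cm for s in students]
courses = [s.math_courses for s in students]
colors = [s.favorite_color for s in students]

print(len(students), "individuals")
print("H:", heights)
print("N:", courses)
print("C:", colors)
```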

 

 

 

 

 

 

 

-Quantitative Variables:

 

·         Discrete Variables

 

 

 

 

·         Continuous Variables

 

 

 

 

 

-Qualitative (Categorical) Variables:

 

 

 

 

 

 

 

 

 

Describing Distributions with Graphs

 

Categorical variables:

    

Pie charts or Bar graphs
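Both a pie chart and a bar graph start from the count of individuals in each category. The sketch below, which assumes the Favorite color column of the class example as input, tallies those counts and prints a rough text bar graph; the share column is what a pie chart would display. The variable names are illustrative.

```python
from collections import Counter

# Favorite color (C) for the 12 students in the class example.
colors = ["Blue", "Blue", "Orange", "Green", "Black", "Green",
          "Brown", "White", "Yellow", "Brown", "Black", "Black"]

counts = Counter(colors)
total = sum(counts.values())

# One bar per category; bar length is the count, the percentage is the pie slice.
for color, count in counts.most_common():
    share = 100 * count / total
    print(f"{color:<8}{'#' * count:<4}{count:>2}  ({share:.1f}%)")
```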

 

 

 

 

 

 

Quantitative variables:

 

Stem-plots or Histograms
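As an illustration, a stem-plot of the Height values from the class example can be built by splitting each value into a stem (the tens) and a leaf (the final digit). This is a minimal sketch; splitting at the last digit is an assumption that suits three-digit heights in cm.

```python
from collections import defaultdict

# Height (H) in cm for the 12 students in the class example.
heights = [155, 176, 172, 165, 168, 172, 158, 178, 170, 165, 196, 181]

# Group the final digit (leaf) of each value under its leading digits (stem).
stems = defaultdict(list)
for h in sorted(heights):
    stems[h // 10].append(h % 10)

# Print every stem between the smallest and largest, including empty ones.
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

The number of leaves on each stem equals the bar height of a histogram with classes 150-159, 160-169, and so on.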

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Another type of classification for variables:

 

1. The nominal level of measurement classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data.


Example: Zip Code, Gender (male, female), Eye color (blue, brown, green, hazel)

 

2. The ordinal level of measurement classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.


Example: Grade (A, B, C, D, F), Rating scale (poor, good, excellent)

 

3. The interval level of measurement ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.


Example: SAT score, IQ, Temperature (in °C or °F)

 

4. The ratio level of measurement possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist when the same variable is measured on two different members of the population.


Example: Height, Weight, Time, Salary

 

 

 

 

 

Describing Distributions with Numbers

 

Given a list of observations as follows:

1  1  1  1  2  2  3  3  3  5  7  19

 

# of observations:  n = 12

 

Max = 19

 

Min = 1

 

Range = Max – Min = 18

The Range of a given list measures the spread of the distribution.

 

Mode = 1

 

Mean = 4

 

Median (M):  50% of observations <= median, and

                        50% of observations   >= median

 

M = (2+3) / 2 = 2.5
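The values above can be checked with Python's standard statistics module. This is a small sketch; the variable names are illustrative.

```python
import statistics

observations = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]

n = len(observations)
largest = max(observations)
smallest = min(observations)
value_range = largest - smallest          # Range = Max - Min
mode = statistics.mode(observations)      # most frequent value
mean = statistics.mean(observations)      # sum of observations divided by n
median = statistics.median(observations)  # n is even: average of the two middle values

print("n =", n)
print("Max =", largest, " Min =", smallest, " Range =", value_range)
print("Mode =", mode, " Mean =", mean, " Median =", median)
```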

 

Q1 (the first quartile): 25% of observations <= Q1, and

                                       75% of observations  >= Q1.

 

Q3 (the third quartile): 75% of observations <= Q3, and

                                       25% of observations   >= Q3.

 

IQR (Inter-Quartile Range) = Q3 – Q1 = 3

The IQR of a given list measures the spread of the distribution.
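One way to arrive at the value 3 is to take Q1 and Q3 as the medians of the lower and upper halves of the ordered list. The notes do not state which quartile convention they use, so this is an assumption; methods that interpolate between observations can give slightly different quartiles.

```latex
\begin{align*}
\text{Lower half: } 1,\ 1,\ 1,\ 1,\ 2,\ 2 \quad &\Rightarrow \quad Q_1 = \frac{1 + 1}{2} = 1 \\
\text{Upper half: } 3,\ 3,\ 3,\ 5,\ 7,\ 19 \quad &\Rightarrow \quad Q_3 = \frac{3 + 5}{2} = 4 \\
\mathrm{IQR} &= Q_3 - Q_1 = 4 - 1 = 3
\end{align*}
```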

 

SD: Standard Deviation

The SD of a given list measures the spread about the mean.

 

Formulas for SD:
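The formulas themselves are missing from these notes. A standard pair, given here as an assumption, is the sample standard deviation with divisor n - 1, in its defining form and in an equivalent computational form. The symbols x_i (the i-th observation), \bar{x} (the mean), and s (the SD) are notation introduced for this sketch. The SD values in the Height and N example further on (10.89899 and 5.09902) are consistent with the n - 1 divisor.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
\]
\[
s = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2\right)}
\]
```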

 

 

 

 

 

 

 

 

5-number summary:
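The five-number summary of a list is usually taken to consist of Min, Q1, M, Q3, and Max, in that order. The sketch below computes it for the list of 12 observations above, assuming Q1 and Q3 are the medians of the lower and upper halves of the sorted data. The function name five_number_summary is hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max).

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data; the middle value is left out when n is odd.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return (data[0],
            statistics.median(lower),
            statistics.median(data),
            statistics.median(upper),
            data[-1])


observations = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]
summary = five_number_summary(observations)
print("Min, Q1, M, Q3, Max =", summary)
```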

 

 

 

 

 

 

 

 

Example:

 

Statistic   Height (H)   Number of Math courses (N)
Max         196          19
Min         155          0
Range       41           19
Mode        172          3
Mean        171.3333     5
Median      171          3.5
Q1          165          2.75
Q3          176.5        6
IQR         11.5         3.25
SD          10.89899     5.09902

Note: for Height, the values 165 and 172 each occur twice, so both are modes; the table lists only 172.
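The figures in this example can be reproduced with Python's statistics module. The sketch assumes that the quartiles were obtained by linear interpolation between observations (the "inclusive" method of statistics.quantiles) and that SD is the sample standard deviation with divisor n - 1; both assumptions match the tabulated values. The function name summarize is hypothetical.

```python
import statistics

# Height (H) and Number of Math courses taken (N) from the class example.
heights = [155, 176, 172, 165, 168, 172, 158, 178, 170, 165, 196, 181]
courses = [3, 3, 0, 0, 6, 6, 3, 5, 4, 2, 9, 19]


def summarize(values):
    """Collect the summary statistics used in the example into a dict."""
    # Quartiles by linear interpolation, treating the data as the full range.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "modes": statistics.multimode(values),  # every most-frequent value
        "mean": statistics.mean(values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "sd": statistics.stdev(values),  # sample SD, divisor n - 1
    }


for label, values in [("Height", heights), ("N", courses)]:
    print(label)
    for key, value in summarize(values).items():
        print(f"  {key}: {value}")
```

Here multimode reports both 165 and 172 for Height, because each occurs twice. Note also that this interpolating quartile method differs from the median-of-halves method, so the two can give different IQR values for the same list.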

 

 

 

Box-plot:

 

 

 

 

 

 

 

The comparison of mean and median.

 

Common:

Both numbers measure the ‘centre’ of a given distribution.

Difference:

The median is resistant: it is less sensitive to changes in extreme values. The mean is not resistant.
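A small experiment makes the difference concrete. In the sketch below, the largest value of the earlier list of 12 observations is changed from 19 to 190, a hypothetical data-entry error invented for illustration, and the mean and median are recomputed.

```python
import statistics

observations = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]
# Hypothetical data-entry error: the largest value 19 is recorded as 190.
with_outlier = observations[:-1] + [190]

mean_before = statistics.mean(observations)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(observations)
median_after = statistics.median(with_outlier)

print("Mean:  ", mean_before, "->", mean_after)      # pulled toward the extreme value
print("Median:", median_before, "->", median_after)  # unchanged
```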

 

 

Similarly, we make a comparison between IQR and SD.

Common:

Both numbers measure the spread of the distribution.

Difference:

The IQR is resistant (less sensitive to extreme values); the SD is not.
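The same kind of experiment can be run for the two measures of spread. This sketch assumes the IQR is computed from the medians of the lower and upper halves of the sorted list, and that SD is the sample standard deviation; the helper name iqr and the altered value 190 are invented for illustration.

```python
import statistics


def iqr(values):
    """IQR = Q3 - Q1, with Q1 and Q3 taken as the medians of the lower and
    upper halves of the sorted data (an assumed convention)."""
    data = sorted(values)
    half = len(data) // 2
    return statistics.median(data[-half:]) - statistics.median(data[:half])


observations = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]
# Hypothetical extreme value: the largest observation 19 becomes 190.
with_outlier = observations[:-1] + [190]

iqr_before, iqr_after = iqr(observations), iqr(with_outlier)
sd_before = statistics.stdev(observations)
sd_after = statistics.stdev(with_outlier)

print("IQR:", iqr_before, "->", iqr_after)                  # unchanged
print("SD: ", round(sd_before, 2), "->", round(sd_after, 2))  # grows sharply
```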