Looking at Data-Distributions

Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.

Uses and Misuse of Statistics

1. Statistical techniques can be used to describe data, compare two or more data sets, determine if a relationship exists between variables, test hypotheses, and make estimates about population measures.

2. Statistical techniques can also be misused, for example to sell products that do not work properly, or to grab attention by using statistics to evoke fear, shock, and outrage.

A long-standing saying illustrates this point: “There are three types of lies – lies, damn lies, and statistics.”

Individuals: Objects described by a set of data. Individuals may be people, but they may also be animals or things.

Population: the entire class of individuals about which information is wanted.

Sample: a subset of a given population. The idea of sampling is to study a part in order to gain information about the whole.

A variable: a characteristic of an individual. A variable can take different values for different individuals.

Example: Let us consider a Statistics class (population) consisting of 12 students (individuals).

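| Name | ID # | Height (unit: cm) | Number of Math courses taken | Favorite color |
|---|---|---|---|---|
| Queenie | 005 | 155 | 3 | Blue |
| Peter | 002 | 176 | 3 | Blue |
| Yukie | 007 | 172 | 0 | Orange |
| Ying | 123 | 165 | 0 | Green |
| Andy | 098 | 168 | 6 | Black |
| Eric | 006 | 172 | 6 | Green |
| Sylvia | 346 | 158 | 3 | Brown |
| JP | 990 | 178 | 5 | White |
| Daniel | 600 | 170 | 4 | Yellow |
| Susanna | 569 | 165 | 2 | Brown |
| John | 111 | 196 | 9 | Black |
| Bart | 919 | 181 | 19 | Black |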

Variables: ID# (denoted by I), Height (denoted by H), Number of Math courses taken (denoted by N), and Favorite color (denoted by C).

-Quantitative Variables:

·         Discrete Variables

·         Continuous Variables

-Qualitative (Categorical) Variables:
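As a rough illustration, the four variables of the class table can be sorted into these types. The labels below are our reading of the definitions, not something stated in the original notes. In particular, ID # is treated as categorical because its digits only label a student.

```python
# Classification of the variables in the class table.
# These labels are an interpretation of the definitions, not part of the notes.
variable_types = {
    # Digits used as labels: averaging ID numbers would be meaningless.
    "ID #": "categorical",
    # Measured quantity: any value within an interval is possible.
    "Height": "quantitative, continuous",
    # Counted quantity: only 0, 1, 2, ... are possible.
    "Number of Math courses taken": "quantitative, discrete",
    "Favorite color": "categorical",
}

for name, kind in variable_types.items():
    print(f"{name}: {kind}")
```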

Describing distribution with Graphs

Categorical variables:

Pie charts or Bar graphs

Quantitative variables:

Stem-plots or Histograms
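The sketch below draws a text-only bar graph of Favorite color and a stem-plot of Height for the class of 12 students. It is a minimal illustration using only the Python standard library. The choice of the last digit as the leaf is an assumption that suits three-digit heights.

```python
from collections import Counter

# Data copied from the class table, in table order.
colors = ["Blue", "Blue", "Orange", "Green", "Black", "Green",
          "Brown", "White", "Yellow", "Brown", "Black", "Black"]
heights = [155, 176, 172, 165, 168, 172, 158, 178, 170, 165, 196, 181]

# Bar graph for a categorical variable: one bar per category, length = count.
color_counts = Counter(colors)
for color, count in color_counts.most_common():
    print(f"{color:<7}{'#' * count}")

# Stem-plot for a quantitative variable: stem = all digits but the last,
# leaf = last digit, leaves listed in increasing order.
stems = {}
for h in sorted(heights):
    stems.setdefault(h // 10, []).append(h % 10)
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

The stem-plot makes the shape of the distribution visible: most heights fall in the 160s and 170s, with 196 standing apart.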

Another type of classification for variables:

1. The nominal level of measurement classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data.

Example: Zip Code, Gender (male, female), Eye color (blue, brown, green, hazel)

2. The ordinal level of measurement classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.

Example: Grade (A, B, C, D, F), Rating scale (poor, good, excellent)

3. The interval level of measurement ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.

Example: SAT score, IQ, Temperature (in °C or °F)

4. The ratio level of measurement possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist when the same variable is measured on two different members of the population.

Example: Height, Weight, Time, Salary
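One way to see the difference between the interval and ratio levels is to change units and check whether a ratio survives. The sketch below uses two illustrative temperatures (20 °C and 10 °C, chosen here as an example) and two heights from the class table.

```python
import math

def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Interval level: the zero of the Celsius scale is arbitrary,
# so "twice as hot" does not survive a change of units.
a_c, b_c = 20.0, 10.0
ratio_c = a_c / b_c                                                # 2.0
ratio_f = celsius_to_fahrenheit(a_c) / celsius_to_fahrenheit(b_c)  # 68 / 50

# Ratio level: height has a true zero, so the ratio is the same in any unit.
h1_cm, h2_cm = 196.0, 155.0
ratio_cm = h1_cm / h2_cm
ratio_in = (h1_cm / 2.54) / (h2_cm / 2.54)  # 1 inch = 2.54 cm

print(ratio_c, ratio_f)                  # the two ratios differ
print(math.isclose(ratio_cm, ratio_in))  # the two ratios agree
```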

Describing distribution with numbers

Given a list of observations as follows:

1  1  1  1  2  2  3  3  3  5  7  19

Number of observations:  n = 12

Max = 19

Min = 1

Range = Max – Min = 18

The Range of a given list measures the spread of the distribution.

Mode = 1

Mean = 4

Median (M):  50% of observations <= median, and

50% of observations   >= median

M = (2+3) / 2 = 2.5
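The values above can be checked with Python's standard `statistics` module. This is only a verification sketch for the list of 12 observations. The mean is 48 / 12 = 4, and because n is even the median is the average of the 6th and 7th ordered values.

```python
import statistics

data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]

n = len(data)
data_range = max(data) - min(data)
mode = statistics.mode(data)      # most frequent value
mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # middle of the ordered list

print(f"n={n}, range={data_range}, mode={mode}, mean={mean}, median={median}")
```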

Q1 (the first quartile): 25% of observations <= Q1, and

75% of observations  >= Q1.

Q3 (the third quartile): 75% of observations <= Q3, and

25% of observations   >= Q3.

IQR (Inter-Quartile Range) = Q3 – Q1 = 3

The IQR of a given list measures the spread of the distribution.
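Several conventions exist for computing quartiles, and the notes do not say which one is used. The sketch below assumes the "median of each half" rule: Q1 is the median of the observations below the overall median's position, and Q3 is the median of those above it. This rule reproduces IQR = 3 for the list above. The helper name `quartiles_by_halves` is ours.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd number of observations the middle value is left out of
    both halves. At least two observations are required.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return statistics.median(lower), statistics.median(upper)

data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q3, iqr)
```

Here the lower half is 1 1 1 1 2 2 (median 1) and the upper half is 3 3 3 5 7 19 (median 4), so IQR = 4 – 1 = 3.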

SD: Standard Deviation

The SD of a given list measures the spread about the mean.

Formulas for SD:

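• Population: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$, where $\mu$ is the population mean and $N$ is the population size.

• Sample: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, where $\bar{x}$ is the sample mean and $n$ is the sample size.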
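As a worked check, the sketch below computes both versions of the SD for the list of 12 observations. The population SD divides the sum of squared deviations by the number of observations, and the sample SD divides it by one less. Which version applies depends on whether the list is treated as a whole population or as a sample, which the notes leave open.

```python
import math
import statistics

data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]

mean = sum(data) / len(data)                       # 4.0
squared_deviations = [(x - mean) ** 2 for x in data]
ss = sum(squared_deviations)                       # sum of squared deviations

population_sd = math.sqrt(ss / len(data))          # divide by N
sample_sd = math.sqrt(ss / (len(data) - 1))        # divide by n - 1

# The standard library gives the same results.
print(population_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```

The sample SD is always a little larger than the population SD computed from the same numbers, because its divisor is smaller.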

5-number summary: Min, Q1, Median (M), Q3, Max.

Example:

| Statistic | Height | N |
|---|---|---|
| Max | 196 | 19 |
| Min | 155 | 0 |
| Range | 41 | 19 |
| Mode | 172 | 3 |
| Mean | 171.3333 | 5 |
| Median | 171 | 3.5 |
| Q1 | 165 | 2.75 |
| Q3 | 176.5 | 6 |
| IQR | 11.5 | 3.25 |
| SD | 10.89899 | 5.09902 |
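The figures in this example can be reproduced from the class table. They appear to use interpolated quartiles (the "inclusive" method of Python's `statistics.quantiles`) and the sample SD. That is an inference from the numbers, not something the notes state. Note also that 165 and 172 each occur twice among the heights; `statistics.mode` (Python 3.8 or later) reports the first one encountered, 172.

```python
import statistics

# Data copied from the class table, in table order.
heights = [155, 176, 172, 165, 168, 172, 158, 178, 170, 165, 196, 181]
courses = [3, 3, 0, 0, 6, 6, 3, 5, 4, 2, 9, 19]

def summarize(values):
    """Summary statistics under the assumed conventions described above."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Max": max(values),
        "Min": min(values),
        "Range": max(values) - min(values),
        "Mode": statistics.mode(values),
        "Mean": statistics.mean(values),
        "Median": median,
        "Q1": q1,
        "Q3": q3,
        "IQR": q3 - q1,
        "SD": statistics.stdev(values),  # sample SD (divisor n - 1)
    }

height_summary = summarize(heights)
courses_summary = summarize(courses)
for name in height_summary:
    print(name, height_summary[name], courses_summary[name])
```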

Box-plot:

The comparison of mean and median.

## Common:

Both numbers measure the ‘centre’ of a given distribution.

## Difference:

The median is resistant: it is less sensitive to changes in extreme values. The mean is not resistant.
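A small experiment makes this concrete. Suppose, hypothetically, that the largest observation in the earlier list (19) were recorded as 190. The sketch below compares the mean and the median before and after that change.

```python
import statistics

data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]
# Hypothetical recording error: the largest value 19 becomes 190.
altered = data[:-1] + [190]

mean_before, mean_after = statistics.mean(data), statistics.mean(altered)
median_before, median_after = statistics.median(data), statistics.median(altered)

print(mean_before, mean_after)      # the mean is pulled toward the extreme value
print(median_before, median_after)  # the median does not move
```

The mean rises from 4 to 18.25, while the median stays at 2.5.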

Similarly, we make a comparison between IQR and SD.

## Common:

Both numbers measure the spread of the distribution.

## Difference:

The SD is not resistant, whereas the IQR is resistant: it is less sensitive to extreme values.
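The same kind of experiment shows the difference in spread measures. The sketch below assumes the "median of each half" rule for quartiles and the sample SD. Both choices and the helper name `iqr_by_halves` are ours, as is the hypothetical change of the largest value from 19 to 190.

```python
import statistics

def iqr_by_halves(values):
    """IQR with Q1 and Q3 taken as the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[-half:]) - statistics.median(ordered[:half])

data = [1, 1, 1, 1, 2, 2, 3, 3, 3, 5, 7, 19]
# Hypothetical recording error: the largest value 19 becomes 190.
altered = data[:-1] + [190]

iqr_before, iqr_after = iqr_by_halves(data), iqr_by_halves(altered)
sd_before, sd_after = statistics.stdev(data), statistics.stdev(altered)

print(iqr_before, iqr_after)  # the IQR is unchanged
print(sd_before, sd_after)    # the SD grows by roughly a factor of ten
```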