Probability Distributions

Calculations of the mean and standard deviation from a frequency distribution

Two types of variables tend to arise in statistical experiments: discrete and continuous.

Discrete Variables

When there are a countable number of values that result from observations, we say the variable producing the results is discrete. The nominal and ordinal levels of measurement are often discrete.  Integer value results are also usually discrete.

For example, questionnaires often generate discrete results:

Response scale: 0 = Never, 1 = Once a month, 2 = Once a week, 3 = Two to three times a week, 4 = Every day
Question 0 1 2 3 4
How often do you chew betelnut? o o o o o
How often do you drink sakau en Pohnpei? o o o o o
How often do you drink beer? o o o o o
How often do you drink wine? o o o o o
How often do you drink hard liquor (whisky, rum, vodka, etc.)? o o o o o
How often do you smoke marijuana? o o o o o
How often do you use controlled substances other than marijuana (methamphetamines, cocaine, crack, ice, shabu, etc.)? o o o o o

The results of such a questionnaire are numeric values from 0 to 4.  For an example of a real student alcohol questionnaire, see: http://www.indiana.edu/~engs/saq.html

Homework: Anonymously complete the above questionnaire and hand it in.  Do not put your name on the questionnaire.

Continuous Variables

When there is an infinite (or uncountable) number of values that may result from observations, we say that the variable is continuous.  Physical measurements such as height, weight, speed, and mass are considered continuous measurements. Bear in mind that our measurement device might be accurate to only a certain number of decimal places. The variable is continuous because better measuring devices should produce more accurate results.

Probability Distributions

Both discrete and continuous variables can have a probability distribution. Intervals (or bins or classes) can be constructed, relative frequencies (or probabilities) can be calculated and a relative frequency histogram can be drawn. This relative frequency histogram depicts the distribution of the data.

The following data consists of 39 body fat measurements for female students at the College of Micronesia-FSM during Summer 2001 and Fall 2001. Following the table is a relative frequency histogram, which shows the probability distribution for this data.

BFI fem CUL (x)   Frequency (f)   Relative Frequency (f/n or P(x))
20.1 2 0.05
24.6 12 0.31
29.2 13 0.33
33.7 5 0.13
38.1 7 0.18
Sum (n): 39 1.00
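
The relative frequency column can be reproduced with a few lines of code. The following is a minimal sketch, assuming the class upper limits and frequencies are typed in by hand from the table above; the variable names are illustrative only.

```python
# Class upper limits (x) and frequencies (f) copied from the table above.
class_upper_limits = [20.1, 24.6, 29.2, 33.7, 38.1]
frequencies = [2, 12, 13, 5, 7]

n = sum(frequencies)  # sample size, the sum of the frequencies

# Relative frequency P(x) = f / n for each class.
relative_frequencies = [f / n for f in frequencies]

for x, f, p in zip(class_upper_limits, frequencies, relative_frequencies):
    print(f"{x:5.1f} {f:3d} {p:.2f}")
print(f"Sum (n): {n} {sum(relative_frequencies):.2f}")
```

Because each relative frequency is a frequency divided by the same n, the relative frequencies always sum to one.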

Relative Frequency Histogram for the BFI for 39 female students

The area under the bars is equal to one, the sum of the relative frequencies. The above diagram consists of five discrete classes.  Later we will look at continuous probability distributions using lines to depict the probability distribution. Imagine a line connecting the tops of the columns:

[Figure: the relative frequency histogram with a line connecting the tops of the columns]

If the columns are removed and the class upper limits are shifted to where the right side of each column used to be:

[Figure: the line alone, with the columns removed and an orange vertical line drawn at the mean]

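The orange vertical line has been drawn at the value of the mean. This line splits the area under the "curve" roughly in half: about half of the females have a body fat measurement less than this value, and about half have a measurement greater than this value. (Strictly speaking, the value that splits the area exactly in half is the median; the mean does so only when the distribution is symmetric.)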

We could also draw a vertical line that splits the area under the curve such that ten percent of the area is to the left of the line and ninety percent is to the right.  This line would be at the value below which only ten percent of the measurements occur.

Although this author is playing a bit fast and loose with converting a discrete distribution to a continuous distribution, the example is still useful for this introductory course.  For now consider this last chart to be a probability distribution for the body fat measurements for the 39 females at the College who were measured.

Calculations of the mean and the standard deviation

In some situations we have only the intervals and the frequencies but we do not have the original data. In these situations it would be useful to still be able to calculate a mean and a standard deviation for our data.

If we only have the intervals and frequencies, then we can calculate both the mean and the standard deviation from the class upper limits and the relative frequencies.  Here are the mean and standard deviation for the sample of 39 female students:

BFI fem CUL (x)   Frequency (f)   Relative Frequency (f/n or P(x))   Mean: Σ(x*P(x))   stdev: √(Σ((x-x̄)²*P(x)))
20.1 2 0.05 1.03 4.52
24.6 12 0.31 7.58 7.29
29.2 13 0.33 9.72 0.04
33.7 5 0.13 4.32 2.23
38.1 7 0.18 6.86 13.56
Sum: 39 1.00 x̄ = 29.51 Σ = 27.64
σx = √27.64 = 5.26
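
The two calculated columns can be sketched in code as follows. This is a minimal illustration, not the spreadsheet's own method: the function name grouped_mean_and_stdev is hypothetical, and the inputs are the rounded class upper limits exactly as printed in the table.

```python
import math


def grouped_mean_and_stdev(values, frequencies):
    """Return (mean, standard deviation) computed from grouped data.

    values:      one representative value per class (here the class upper limits)
    frequencies: the count of observations in each class
    """
    n = sum(frequencies)
    probabilities = [f / n for f in frequencies]  # P(x) = f / n
    # Mean: sum of x * P(x)
    mean = sum(x * p for x, p in zip(values, probabilities))
    # Variance: sum of (x - mean)^2 * P(x); the standard deviation is its square root
    variance = sum((x - mean) ** 2 * p for x, p in zip(values, probabilities))
    return mean, math.sqrt(variance)


mean, stdev = grouped_mean_and_stdev([20.1, 24.6, 29.2, 33.7, 38.1],
                                     [2, 12, 13, 5, 7])
print(f"mean = {mean:.2f}, stdev = {stdev:.2f}")
```

Run on the rounded values printed in the table, this sketch lands within a few hundredths of the tabulated 29.51 and 5.26. The small difference presumably arises because the spreadsheet carries more decimal places than the table displays.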

A spreadsheet with the above data is available at:
http://www.comfsm.fm/~dleeling/statistics/statistics_fall2001.xls

Note that the results are not exactly the same as those obtained by analyzing the data directly. Where we can, we will analyze the original data. This is not always possible. The following table was taken from the 1994 FSM census. Here the data has already been tallied into intervals, and we do not have access to the original data.  Even if we did, it would be 102,724 rows, too many for some of the computers on campus.

Age x Total f Relative frequency f/n or P(x) x*P(x) (x-mean)²*P(x)
4 14662 0.14 0.57 57.78
9 15090 0.15 1.32 33.58
14 14944 0.15 2.04 14.90
19 12425 0.12 2.30 3.17
24 9192 0.09 2.15 0.00
29 7042 0.07 1.99 1.63
34 6800 0.07 2.25 6.46
39 5998 0.06 2.28 12.93
44 3131 0.03 1.34 12.05
49 3601 0.04 1.72 21.70
54 2271 0.02 1.19 19.74
59 2089 0.02 1.20 24.74
64 1978 0.02 1.23 30.62
69 1308 0.01 0.88 25.65
74 1169 0.01 0.84 28.31
79 544 0.01 0.42 15.95
84 313 0.00 0.26 10.93
89 99 0.00 0.09 4.06
94 56 0.00 0.05 2.66
98 12 0.00 0.01 0.64
Sums: 102724 1 24.12 327.50
sqrt: 18.10

The mean x̄ = 24.12
The population standard deviation σx = 18.10

A spreadsheet with the above data is available from: http://www.comfsm.fm/~dleeling/statistics/statistics.xls

The result is an average age of 24.12 years for a resident of the FSM in 1994, with a standard deviation of 18.10 years.  The mean by itself does not tell us what fraction of the population lies below it, but this distribution is skewed toward the young, so more than half the nation is younger than the mean.  In fact, fully 56% of the nation (57,121 of 102,724 people) is age 19 or under.  Bear in mind that this 56% is in school.  That means we will need new jobs for that 56% as they mature and enter the workplace, on the order of 57,121 new jobs.
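
The 56% figure is a cumulative relative frequency: the counts for every class up to and including the class with upper limit 19, divided by the total. A minimal sketch, assuming the ages and counts are copied by hand from the 1994 census table above; the helper name share_at_or_below is illustrative only.

```python
# Class upper limits (ages) and counts copied from the 1994 census table above.
ages = [4, 9, 14, 19, 24, 29, 34, 39, 44, 49,
        54, 59, 64, 69, 74, 79, 84, 89, 94, 98]
counts = [14662, 15090, 14944, 12425, 9192, 7042, 6800, 5998, 3131, 3601,
          2271, 2089, 1978, 1308, 1169, 544, 313, 99, 56, 12]

total = sum(counts)  # total population in the table


def share_at_or_below(age_limit):
    """Fraction of the population in classes whose upper limit is <= age_limit."""
    below = sum(f for x, f in zip(ages, counts) if x <= age_limit)
    return below / total


# Number of people in the classes with upper limits 4, 9, 14, and 19.
count_19 = sum(f for x, f in zip(ages, counts) if x <= 19)
print(count_19, total, f"{share_at_or_below(19):.0%}")
```

The count for the first four classes is the 57,121 quoted above as the number of new jobs needed.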

How old are you?  Below, at, or above the mean (average)? Do you have a job?

Note we used the class upper limits to calculate the average age. Potentially this inflates the national average by as much as half a class width or 2.5 years. Taking this into account would yield an average age of 21.62 years old.
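
The 2.5 year adjustment can be checked without redoing the table. As a short derivation, assume every class is represented by a value 2.5 years below its class upper limit x (the symbol \bar{x}_{\text{mid}} is introduced here only for illustration). Because the relative frequencies sum to one, subtracting a constant from every x subtracts the same constant from the mean:

```latex
\bar{x}_{\text{mid}}
  = \sum (x - 2.5)\,P(x)
  = \sum x\,P(x) - 2.5 \sum P(x)
  = 24.12 - 2.5 \cdot 1
  = 21.62
```

The standard deviation is unaffected by such a shift, since every deviation from the mean stays the same.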

There is one more small complication to consider. Since the population of the FSM is growing, the number of people at each age in years is different across the five year span of the class. The age groups at the bottom of the class (near the class lower limit) are going to be bigger than the age groups at the top of the class (near the class upper limit). This would act to further reduce the average age.

Homework: Use the 2000 Census data to calculate the mean age in the FSM in 2000.

Age 2000
4 14782
9 14168
14 14213
19 13230
24 9527
29 7620
34 6480
39 6016
44 5560
49 4650
54 3205
59 1903
64 1733
69 1487
74 993
79 1441
  1. Did the mean age change?
  2. Are you still (below|at|above) the mean age?

Alternate Homework:

Use the following data to calculate the overall grade point average and the standard deviation of the grade point data for the Pohnpeian students at the national campus during the terms Fall 2000 and Spring 2001.

Grade Point Value (x)   Frequency (f)   Relative Frequency (f/n or P(x))   Mean: Σ(x*P(x))   stdev: √(Σ((x-x̄)²*P(x)))
4 851 ______ ______ ______
3 1120 ______ ______ ______
2 1023 ______ ______ ______
1 459 ______ ______ ______
0 690 ______ ______ ______

Sums: ______ ______ ______ ______
Sqrt: ______
