For Problems 1 to 5, you can use the web page "Statistical Calculations".
Problem 1. Baseball Stats.
a) Hank Aaron was an outfielder for the Braves from 1954 to 1974. Here are the numbers of home runs hit each year by Aaron:
13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32, 44, 39, 29, 44, 38, 47, 34, 40, 20.
Mark McGwire played in the major leagues from 1986 to 2001 as a first baseman for the Oakland A's and the St. Louis Cardinals. The numbers of home runs hit by McGwire in each year are:
Version-1
3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 34, 24, 58, 70, 65, 32, 29.
Version-2
3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29.
Please give the mean, median, standard deviation and quartiles for each (you can use the applet on the web page for these computations). Remember that the definition of quartiles was given in class; it is also in the lecture notes.
Answer:
Aaron:
Version-1 McGwire:
Version-2 McGwire:
b) What conclusion would you draw from a comparison of these data? Judging only from the information you calculated above, who would you say was the more consistent player? Explain your answer. Is this borne out by the actual numbers?
Answer:
(For both versions) Although the two means are comparable, Aaron's median is higher, which
means that McGwire must have had more high outliers than Aaron. This is
confirmed by the fact that the numbers above the third quartile for
McGwire, 52, 58, 65, 70, are higher than those for Aaron (which
range from 44 to 47). Aaron also has a much smaller standard deviation,
which means that his distribution is more concentrated - his number of
home runs is more consistent than McGwire's.
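The statistics asked for in part a) can be reproduced with a short Python sketch. It assumes that the class definition of quartiles is the median of the lower and upper halves of the sorted data, with the middle value counted in both halves when the number of values is odd; this assumption agrees with the quartiles quoted in Problems 2 and 4. It also assumes the population standard deviation, which matches the values quoted in Problem 3 e). The function names `quartiles` and `summary` are illustrative only.

```python
from statistics import mean, median, pstdev

def quartiles(data):
    """Return (first quartile, third quartile) as medians of the two halves.

    Assumption: for an odd count, the middle value belongs to both halves.
    """
    xs = sorted(data)
    half = (len(xs) + 1) // 2              # number of values in each half
    return median(xs[:half]), median(xs[-half:])

def summary(data):
    """Collect the summary statistics requested in part a)."""
    q1, q3 = quartiles(data)
    return {
        "mean": mean(data),
        "median": median(data),
        "std": pstdev(data),               # population standard deviation (assumed)
        "q1": q1,
        "q3": q3,
    }

aaron = [13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24,
         32, 44, 39, 29, 44, 38, 47, 34, 40, 20]
mcgwire_v1 = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52,
              34, 24, 58, 70, 65, 32, 29]
mcgwire_v2 = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52,
              58, 70, 65, 32, 29]

for name, data in [("Aaron", aaron),
                   ("McGwire, Version-1", mcgwire_v1),
                   ("McGwire, Version-2", mcgwire_v2)]:
    print(name, summary(data))
```

Comparing the printed standard deviations and quartiles gives the numbers on which the answer to part b) relies.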
Problem 2. Landslide victories in presidential elections.
Here are the percentages of the popular vote won by the successful presidential candidate in each of the presidential elections from 1948 to 2004:
| year | % |
| 1948 | 49.6 |
| 1952 | 55.1 |
| 1956 | 57.4 |
| 1960 | 49.7 |
| 1964 | 61.1 |
| 1968 | 43.4 |
| 1972 | 60.7 |
| 1976 | 50.1 |
| 1980 | 50.7 |
| 1984 | 58.8 |
| 1988 | 53.9 |
| 1992 | 43.3 |
| 1996 | 50.0 |
| 2000 | 47.9 |
| 2004 | 50.9 |
a) Compute the:
b) What are the first and third quartiles?
Answer:
first quartile: 49.65
third quartile: 56.25
Definition: An election is called a landslide if the winner's percentage of the popular vote is at or above the third quartile.
c) Which elections were landslides?
Answer:
1956, 1964, 1972, 1984
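The quartiles and the list of landslides can be checked with a short sketch. It assumes the quartiles are the medians of the lower and upper halves of the sorted data, with the middle value counted in both halves because the number of elections (15) is odd; the variable names are illustrative.

```python
from statistics import median

# Winner's percentage of the popular vote, from the table above.
results = {
    1948: 49.6, 1952: 55.1, 1956: 57.4, 1960: 49.7, 1964: 61.1,
    1968: 43.4, 1972: 60.7, 1976: 50.1, 1980: 50.7, 1984: 58.8,
    1988: 53.9, 1992: 43.3, 1996: 50.0, 2000: 47.9, 2004: 50.9,
}

def quartiles(values):
    """Medians of the lower and upper halves (middle value in both if odd)."""
    xs = sorted(values)
    half = (len(xs) + 1) // 2
    return median(xs[:half]), median(xs[-half:])

q1, q3 = quartiles(results.values())

# A landslide is an election at or above the third quartile.
landslides = sorted(year for year, pct in results.items() if pct >= q3)

print("first quartile:", round(q1, 2))
print("third quartile:", round(q3, 2))
print("landslides:", landslides)
```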
Problem 3. Statistics on grades
The following list gives the final grades (out of 100) of 30 students in a calculus course.
Grades (out of 100): 7, 10, 14, 15, 15, 17, 18, 21, 21, 24, 25, 26, 26, 29, 32, 34, 37, 63, 65, 66, 69, 69, 72, 75, 77, 77, 78, 80, 81, 93.
a) Compute the:
b) Do you think the {mean & standard deviation} gives a good summary of the information conveyed in the above data? Explain your answer.
Answer:
The {mean & standard deviation} doesn't give a good summary of the
information conveyed in the data. The large standard deviation illustrates
that there is a large spread in the data, but the mean is not informative -
nobody achieves the mean grade or any grade especially close to it.
c) How about the {median, quartiles and extremes}? Again, explain your answer.
Answer:
The {median, quartiles and extremes} gives a more informative summary of
the data, although still not the full picture. The extremes show specifically
the large distance between the lowest and highest grades. The grades are in
two clusters and the quartiles mark approximate centers of the clusters,
while the median is close to the point where the gap between the two
clusters occurs. However, being given the values of the median and quartiles
does not in itself illustrate that the grades are in two clusters. In
addition quite a few of the grades lie outside the range between the first
and third quartile. The {median, quartiles and extremes} does, though, capture
that the data is quite spread out.
d) Do you think the information conveyed by the mean alone and/or by the median alone is useful?
Answer:
On their own the mean and median are not particularly informative. The
mean approximately marks the center of the gap between the two clusters of
marks, while the median is close to the top of the lower cluster of marks.
Simply being given one or both of the mean and median does not, however,
illustrate the clustering of the grades, nor frequently occurring grades.
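e) Calculate the mean and standard deviation of the lower group of grades (the 17 grades from 7 to 37) and of the upper group of grades (the 13 grades from 63 to 93).
Answer:
Lower group:
Mean: 21.82
Standard Deviation: 8.18
Upper group:
Mean: 74.23
Standard Deviation: 7.85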
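The two pairs of values in this answer correspond to the 17 grades below the gap in the data and the 13 grades above it. A minimal sketch, assuming the split is made at the largest gap between consecutive sorted grades and that the population standard deviation is intended:

```python
from statistics import mean, pstdev

grades = [7, 10, 14, 15, 15, 17, 18, 21, 21, 24, 25, 26, 26, 29, 32,
          34, 37, 63, 65, 66, 69, 69, 72, 75, 77, 77, 78, 80, 81, 93]

xs = sorted(grades)

# Index of the first grade after the largest gap between neighbours.
split = max(range(1, len(xs)), key=lambda i: xs[i] - xs[i - 1])
lower, upper = xs[:split], xs[split:]

for label, group in [("lower", lower), ("upper", upper)]:
    print(label, len(group), "grades:",
          "mean", round(mean(group), 2),
          "standard deviation", round(pstdev(group), 2))
```

Using the sample standard deviation (`statistics.stdev`) instead would give slightly larger values than those quoted.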
f) Do you think the two means you calculated in e) give a useful description of the data? Explain why.
Answer:
These two means give the average grade in each of the two clusters of marks. In both cases (the lower half of the data or the upper half of the data), the standard deviation is small (around 8), so the {mean & standard deviation} description is informative. If we had also been given the information that the data was in two clusters, these two means would be very informative.
g) In your opinion, which combination of the statistical quantities calculated from a) to f) gives the best summary of the information conveyed by the full list of class grades?
Answer:
Since the data is quite spread out and in two clusters, none of the descriptions of the overall data gives a very informative summary of the data. If you were given the information that the data were in two clusters, the means and standard deviations of the two halves of the data would be useful.
In the case of the overall data, the {median, quartiles and extremes} illustrates that the data is quite spread out and gives specifically the distance from the lowest to the highest grade. If also given the information that the data was in two clusters, the quartiles would mark the approximate centers of each of the clusters.
The large standard deviation illustrates that the data is quite spread out,
but the mean is not very informative.
Problem 4. Non-normal distribution.
Make a list of 10 numbers for which the mean lies above the third quartile:
Answer:
For instance: 1,2,3,4,5,6,7,8,9,40. This has mean 8.5, median 5.5, first quartile 3 and third quartile 8.
Another example is 1,2,1,2,1,2,1,2,1,1000.
This has mean 101.3, median 1.5, first quartile 1 and
third quartile 2.
Problem 5. A study of jury awards.
A study of the size of jury awards in civil cases (such as injury, liability and medical malpractice) showed that the median award in Cook County, Illinois, was about $8,000. But the mean award was about $89,000. Can you explain how this is possible? Create an example with actual numbers as part of your explanation.
Answer:
This is possible if most awards are small, with some huge
outliers. For instance, if most awards are in the $8,000
range, but one in 10 awards is around $820,000, then the
mean would lie around $89,200.
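The example in this answer can be checked directly. The sketch below takes ten hypothetical awards, nine of $8,000 and one of $820,000, as one concrete instance of "one in 10 awards is around $820,000":

```python
from statistics import mean, median

# Hypothetical awards: nine small ones and a single very large one.
awards = [8_000] * 9 + [820_000]

print("median:", median(awards))   # the typical award stays small
print("mean:  ", mean(awards))     # the single outlier pulls the mean up
```

The median ignores how large the biggest award is, while the mean is pulled up by it.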
Suppose the probability of an event A (e.g. A={people who like cats}) is p% (e.g. 59%),
and suppose you find that in a sample of size N (e.g. N=1,000 persons), p(N)% (e.g. 58%) agree with A (e.g. 580 of the N=1,000 persons asked like cats).
Then you calculate the following:
Case 1
If you know the exact value p%, the 95% confidence interval for p(N)% is p% +/- 2*S%, where S=sqrt(p*(100-p)/N) (sqrt is the square root).
This means that there is a 95% chance that p(N)% is in the interval [p% - 2*S%, p% + 2*S%]. S is the standard deviation.
Case 2
If you don't know the exact value p%, the 95% confidence interval for p% is p(N)% +/- 2*S%, where S=sqrt(p(N)*(100-p(N))/N) (sqrt is the square root).
This means that there is a 95% chance that p% is in the interval [p(N)% - 2*S%, p(N)% + 2*S%]. S is the standard deviation.
In this case, you can use the Confidence Interval web page to compute the confidence interval if p(N)% is an integer.
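Both cases use the same arithmetic, so a single helper covers them. The sketch below implements the formula S=sqrt(p*(100-p)/N) with p in percent; the function names are illustrative, and the values produced by the web page may differ slightly from the formula.

```python
from math import sqrt

def standard_deviation(p, n):
    """S = sqrt(p*(100-p)/N), with p given in percent."""
    return sqrt(p * (100 - p) / n)

def confidence_interval_95(p, n):
    """Return (p - 2*S, p + 2*S) in percent.

    Case 1: p is the known true percentage; the interval is for p(N).
    Case 2: p is the observed p(N); the interval is for the unknown p.
    """
    s = standard_deviation(p, n)
    return p - 2 * s, p + 2 * s

# Case 2 with the cat example: 58% observed in a sample of 1,000 persons.
low, high = confidence_interval_95(58, 1000)
print(f"95% confidence interval: {low:.2f}% to {high:.2f}%")
```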
Problem 6. Unlisted Numbers.
In a particular area code, about 35% of telephones have unlisted numbers. Imagine that you call 300 numbers, each chosen randomly. That is, you pick the different numbers randomly from the whole collection of possible numbers (not all seven-digit numbers are possible phone numbers - for instance, no phone number starts with 0). The number of unlisted numbers you reach this way is more or less normally distributed.
a) What is the mean of this distribution? (That is, what is the average number of unlisted numbers reached in 300 attempts, if a large group of people would try 300 attempts each, independently, and compare their results at the end?)
Answer:
The sample size here is 300. Since 35% of all numbers
are unlisted, we would expect that, on average, 35%
of the 300 attempts would reach an unlisted number,
corresponding to a mean of 105.
b) And what would the standard deviation be? (You can use the confidence interval applet to calculate this, or the formula recalled above.)
Answer:
Using the formula for the standard deviation,
S = sqrt(p*(100-p)/N),
where p is the proportion in percentage points (35 in this case) and N is the sample size (300 in this case), gives
standard deviation = sqrt(35*65/300) = 2.75%
You can also deduce it from the confidence interval that you can compute with the software, since the limits of the 95% confidence interval lie two standard deviations below and above the mean - so, in this case, 5.40%/2 = 2.70%
If we express the standard deviation in calls/300 rather
than percent, it is 8.25 (8.1 according to the applet).
Problem 7. Visiting Yosemite National Park.
The Forest Service of Yosemite National Park is considering additional restrictions on the number of vehicles allowed to enter the Park. To assess public reaction, the Service asks a random sample of 200 visitors if they favor the proposal. Of these, 126 say "yes".
a) Give a 95% confidence interval for the proportion of all visitors to Yosemite who favor the restrictions.
Answer:
126/200 = .63 = 63%, so the sample proportion in this case
is 63%; this is our estimate of the true proportion among all
visitors, which we don't know. With a sample size of 200, this gives a 95%
confidence interval of 56.17% to 69.83% using the formulas or
56.31% to 69.69% using the web page.
b) Are you 95% confident that more than half are in favor? Explain your answer!
Answer:
Yes, we are 95% confident that more than half are in favor, since
our 95% confidence interval lies completely above 50%.
Problem 8. Poll results.
A news report says that a national opinion poll of 1200 randomly selected adults found that 38% thought that they would be worse off during the next year. The news report went on to say that the margin of error in the poll is + or - 3 percentage points with 95% confidence. The poll was carried out by calling random telephone numbers.
a) Using the formula above or the web page, compute for yourself the 95% interval. It is of the form p% +/- 2*S%. What is the interval you find, and what is the one they give?
Answer:
The formula gives 2*S=2.80 i.e. the confidence interval is 38% + or -
2.80%. The web page gives: 38% + or - 2.75%
b) Could you propose explanations for the difference between what you find and what they describe?
Answer:
They reported a variation of + or - 3% while the actual one is 2.8%. One explanation could be that although they asked 1200 people, a sizeable fraction did not have an opinion, and were not counted in the sample size; for a sample size of 1000, for instance, you already get about + or - 3.1 percentage points, which rounds to 3.
It is also possible that they just rounded off.
The 38% figure quoted in the news report could also have
been rounded off - if it had been 38.2%, for instance, then
the 95% confidence interval would have been between
35.4% and 41.0%, and the upper bound 41.0% is 3 percentage
points away from the 38% figure that they quoted.
c) If we wanted a 90% confidence interval, and not a 95% confidence interval, would the width of the confidence interval be greater or smaller than the + or - 3 percentage points? Why?
Answer:
The width would be smaller. If we are happy to be less confident in our
interval (90% instead of 95%), then we can make the interval narrower,
that is, use a smaller margin of error. Another way of
saying this, turning the explanation around, is that if we
want to be more certain that the true value lies in the
interval around the mean that we specify, then we should take
the limits wider - the wider the interval, the higher the
chance that it does indeed contain the correct value - but
if the interval is very wide, then this does not give much
information. (After all, there is a 100% chance that the
true percentage value lies between 0 and 100, regardless
of what we are talking about!)
Problem 9. Elections.
Suppose that you want to call the result of an election with 95% confidence. The proportion of people who prefer the leading candidate hovers around 70%. Be careful: for this problem you can't use the web page!
a) How many people should you have in your sample to be 95% confident that more than half the population will vote for this candidate?
Answer:
In this case, even a small sample size suffices. As soon as the sample size exceeds 21 (that is, with 22 people or more), the 95% confidence interval lies completely above 50%.
If we wanted our confidence interval to lie completely above 60%, then having a sample size exceeding 84 would suffice.
Note: it may seem strange that you would still go around
polling if you already know the ballpark figure for the
percentage of people who will vote for the leading candidate.
We are assuming here that the person who goes around doesn't
know; however, the interviewer would (if the sample is
not biased) quickly find out that the lead is this large,
even after interviewing a fairly small sample.
b) How many people should you question if the percentage of the population who prefer this leading candidate is around 60% rather than around 70%?
Answer:
Here the sample size needs to exceed 96 for the 95%
confidence interval to lie completely above the 50% mark.
c) And if it were 51%?
Answer:
For a sample size of 9,996 the 95% confidence interval
is 51% + or - 1%, so the sample size should
exceed this number (so, say, 10,000 or more) in order
to call the winner with 95% confidence.
Close elections are much harder to poll than other elections!
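The sample sizes in Problem 9 follow from requiring p - 2*S to lie above the threshold, where S=sqrt(p*(100-p)/N). Rearranging gives N > 4*p*(100-p)/(p - threshold)^2. The sketch below returns the smallest N for which the interval lies strictly above the threshold; the function name is illustrative, and it assumes whole-number percentages so that the bound can be computed exactly with integers.

```python
def min_sample_size(p, threshold=50):
    """Smallest N with p - 2*sqrt(p*(100-p)/N) strictly above `threshold`.

    Rearranged condition: N > 4*p*(100-p) / (p - threshold)**2.
    Assumes p and threshold are whole percentages with p > threshold.
    """
    numerator = 4 * p * (100 - p)
    denominator = (p - threshold) ** 2
    # Floor division plus one gives the smallest integer strictly above the bound.
    return numerator // denominator + 1

for p, threshold in [(70, 50), (70, 60), (60, 50), (51, 50)]:
    print(f"support {p}%, threshold {threshold}%: "
          f"N >= {min_sample_size(p, threshold)}")
```

The required sample size grows like 1/(p - threshold)^2, which is why a 51% race needs roughly 10,000 respondents while a 70% race needs only a couple of dozen.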
Simpson's Paradox.
In class, we saw an instance of Simpson's paradox.
In that case, we had two groups, A and B, and group A was claiming that they were unfairly treated in the graduate admissions process. Indeed, out of a pool of 1100 applicants in group A, only 190 were admitted, while out of a pool of 1100 applicants from group B a total of 910 were admitted.
However, a closer inspection of these (made-up) data showed that there were two different programs to which the candidates could apply. Program 1 had an admission rate of 90%, but program 2 had an admission rate of 10%. Then the numbers were explained by the fact that of the 1100 applicants from group A, 100 had applied to program 1, and 1000 to program 2; while the reverse happened for the applicants from group B. Both programs treated the two groups entirely fairly, but nevertheless the total end result looked skewed.
Here we shall see some more instances of Simpson's paradox.
Problem 10. A tale of two hospitals.
A community has two hospitals. Hospital A is a large medical center, while Hospital B is a more fashionable and much more expensive hospital where most patients are wealthy. An article in the local paper claims that a higher percentage of surgery patients die at Hospital A than at Hospital B, and deplores the fact that people who are less well off are disadvantaged. It also recommends to the people in the community that if they can afford it, they should choose to have their surgery in Hospital B rather than A.
A more detailed look at the number of surgery patients in the last few months at both Hospitals, taking into account also whether the incoming patients were in good or poor health, shows the following:
| | Hospital A: Good Health | Hospital A: Poor Health | Hospital B: Good Health | Hospital B: Poor Health |
| Died | 4 | 57 | 5 | 8 |
| Survived | 559 | 1422 | 585 | 196 |
a) What are the percentages of patients admitted for surgery who are in bad health prior to the operation?
b) What are the total percentages of patients who died?
c) What are the percentages of patients in previously good health who died?
d) What are the percentages of patients in previously poor health who died?
e) Try describing this paradox in your own words - Imagine that you have to write a short, one-or-two paragraph article about it in the local newspaper, or that you respond to the article that appeared in the local paper with a letter to the Editor. Don't just repeat the numbers: give a clear explanation of what is happening in these statistics.
Answer:
Letter to the Editor: A TALE OF TWO HOSPITALS.
Statistics are a way of describing, hopefully with few numbers, important effects or trends in complex data. But occasionally too much simplification can be misleading. This happened in your recent article comparing the mortality rate for surgery patients in Ourtown Hospital with that in Waterfront Overlook Medical Center. In any hospital the mortality rate for surgery patients whose health is precarious, but for whom surgery cannot be put off, will be higher than that for patients whose health prior to the surgery is generally good. Ourtown Hospital and Overlook Medical Center are no exceptions. If one hospital happens to have among its patients a higher proportion of seriously ill patients, who have a higher mortality rate, than the other hospital, then this will skew the global mortality rate. This effect, called Simpson's paradox, was at work in your comparison of Ourtown Hospital and Overlook Medical Center yesterday.
Indeed, a larger proportion of the patients admitted for surgery are in bad health prior to the operation at Ourtown Hospital (1479/2042 = 72.4%) than at Overlook Medical Center (204/794 = 25.7%). A more careful analysis distinguishes between these two categories of patients. One then finds that the 61 deaths out of 2042 surgery patients at Ourtown Hospital correspond to 4 deaths out of 563 previously healthy patients and 57 out of 1479 previously ill patients, that is, mortality rates of 0.71% and 3.85% respectively. A similar breakdown for Overlook Medical Center shows mortality rates of 0.85% (5/590) for patients previously in good health, and 3.92% (8/204) for patients previously in bad health. Both these mortality rates are HIGHER than at Ourtown Hospital.
Yours sincerely,
Dr. I.N.Dignant
Head of Surgery Division
Ourtown Hospital
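The percentages asked for in parts a) to d) can be computed from the table with a short sketch. Here Hospital A is taken to be Ourtown Hospital and Hospital B to be Overlook Medical Center, as in the letter; the data layout and function names are illustrative.

```python
# (died, survived) counts from the table, by hospital and prior health.
counts = {
    "A": {"good": (4, 559), "poor": (57, 1422)},
    "B": {"good": (5, 585), "poor": (8, 196)},
}

def death_rate(hospital, health=None):
    """Percentage of patients who died, overall or within one health group."""
    groups = [health] if health else ["good", "poor"]
    died = sum(counts[hospital][g][0] for g in groups)
    total = sum(sum(counts[hospital][g]) for g in groups)
    return 100 * died / total

def poor_health_share(hospital):
    """Percentage of surgery patients who were in poor health beforehand."""
    poor = sum(counts[hospital]["poor"])
    total = poor + sum(counts[hospital]["good"])
    return 100 * poor / total

for h in ("A", "B"):
    print(f"Hospital {h}: poor health {poor_health_share(h):.1f}%, "
          f"died overall {death_rate(h):.2f}%, "
          f"good health {death_rate(h, 'good'):.2f}%, "
          f"poor health {death_rate(h, 'poor'):.2f}%")
```

Hospital A has the higher overall death rate but the lower death rate within each health group, which is the reversal that Simpson's paradox describes.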
Problem 11. Cancer Statistics.
Mr. Johnson read about a study that showed that:
He promptly took up smoking again and omitted cucumbers and orange juice from his diet, with a secure feeling that his chances of developing cancer were low.
Do you agree with his reasoning? Explain why. To do this, explain which probabilities or conditional probabilities the information above gave to Mr. Johnson and also explain what probabilities or conditional probabilities he needs to end up with his conclusion.
Answer:
Mr. Johnson has insufficient information from which to draw the conclusions he does. For example, his conclusion doesn't take into account the percentages of people who do not have cancer and who eat cucumbers, drink orange juice and smoke.
Also, if we use the notation S = "smoke" and C = "have cancer", then in terms of conditional probabilities, Mr. Johnson is drawing conclusions about, for example, P[C|S], whereas he only has information about P[S|C].
Consider, for example, the following possible scenario (where we use the notation OJ = "drink orange juice"). Suppose P[OJ] = .95, and P[S] = .25. Using the definition of conditional probability, with P[OJ|C] = .8 and P[S|C] = .3 taken from the study, we know that
P[C|OJ] = P[OJ|C]P[C]/P[OJ] = P[C](.8)/(.95) = (.84)P[C], and
P[C|S] = P[S|C]P[C]/P[S] = P[C](.3)/(.25) = (1.2)P[C].
So P[C|OJ] < P[C|S], i.e. the probability that you have cancer given that you drink orange juice is less than the probability that you have cancer given that you smoke. This disagrees with the conclusion drawn by Mr. Johnson.
The probabilities P[OJ] and P[S] given above seem reasonably likely. If the
actual probabilities are different, however, it is possible that Mr. Johnson's
conclusions are more realistic. The main lesson to be learnt, though, is that
Mr. Johnson's conclusions do not follow from the information that he is given
by the study - he would require more information to come to the conclusions he
does.
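The scenario in this answer can be worked through numerically. The sketch below computes the factor P[X|C]/P[X] by which P[C] is multiplied to give P[C|X]. The values .8 and .3 are the ones used in the computation above, and .95 and .25 are the answer's hypothetical values for P[OJ] and P[S]; the function name is illustrative.

```python
def posterior_factor(p_x_given_c, p_x):
    """Return P[C|X] / P[C], which equals P[X|C] / P[X]."""
    return p_x_given_c / p_x

oj_factor = posterior_factor(0.8, 0.95)      # P[C|OJ] as a multiple of P[C]
smoke_factor = posterior_factor(0.3, 0.25)   # P[C|S] as a multiple of P[C]

print(f"P[C|OJ] = {oj_factor:.2f} * P[C]")
print(f"P[C|S]  = {smoke_factor:.2f} * P[C]")
print("orange juice looks safer than smoking:", oj_factor < smoke_factor)
```

Changing the assumed values of P[OJ] and P[S] changes both factors, which is exactly the information missing from what Mr. Johnson read.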
Last modified: March 31, 2003