Wednesday, May 16, 2012

05-16-2012

7.2 Review
28. Pollen count distribution for Los Angeles in September is not normally distributed, with μ = 8.0, σ = 1.0, n = 64. Find P(x-bar > 9.0).

Is n ≥ 30, or is the population normally distributed? The population is not normally distributed, but n = 64 > 30, so we can proceed with the problem and use the CLT.

The Central Limit Theorem says:
1) Shape: Approximately normal
2) Center: μ(subscripted x-bar) = μ, meaning the center for both the population and the sample is the same.
Thus μ = 8.0.
3) Spread: σ(subscripted x-bar) = σ/√ n (standard deviation divided by the square root of the sample size)
σ/√ n = 1.0 / √64 = 1/8 = 0.125

Now we draw a picture of a normal distribution centered at 8.0.
Then add on what we're interested in: P(x-bar > 9.0).
So we move to the right and mark 9 then shade the area above 9.

Then use the z score formula to find the z score so we can find the probability.
z = (9 - 8) / 0.125 = 8.0

A z-score of 8 is so large it's not listed on our table, meaning it's a very unusual observation. So we use the most extreme value the table has, z = 3.49, which has an area of 0.9998 to its left.
But since we want the area to the right, it's 1 - 0.9998 = 0.0002.
The probability of observing a sample mean pollen count greater than 9.0 is less than 0.02%.
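
The table only gives a bound here, so it can help to check the tail area directly. This is a minimal sketch, assuming Python 3.8+ (for statistics.NormalDist); the variable names are my own, and the numbers are the ones given in problem 28.

```python
from statistics import NormalDist

mu, sigma, n = 8.0, 1.0, 64           # values given in problem 28
se = sigma / n ** 0.5                 # spread of x-bar: sigma / sqrt(n)
z = (9.0 - mu) / se                   # z-score of the cutoff 9.0
p_right = 1 - NormalDist().cdf(z)     # area to the right of z

print(se, z, p_right)
```

The computed tail area is far smaller than the 0.0002 the table allows us to report, which is why "less than 0.02%" is the safe way to state the answer.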



31. Boned trout prices are normally distributed, with μ = 3.10, σ = 0.30, n = 16. Find the sample mean price that is smaller than 90% of all sample means.

Is n ≥ 30, or is the population normally distributed? n = 16 < 30, but the population is normally distributed, so we can proceed with the problem and use the CLT.

The Central Limit Theorem says:
1) Shape: Normal
2) Center: μ(subscripted x-bar) = μ, meaning the center for both the population and the sample is the same.
Thus μ = 3.10.
3) Spread: σ(subscripted x-bar) = σ/√ n (standard deviation divided by the square root of the sample size)
σ/√ n = .30 / √16 = 0.30/4 = 0.075

Now we draw a picture of a normal distribution centered at 3.10.
Then add on what we're interested in: "price that is smaller than 90% of all sample means." We are interested in the price where 90% of all observations are greater than it. Another way of saying this: we're interested in the 10th percentile.

P(x-bar < x) = 0.1000, where x is the price we are solving for.
So we move to the left of center and approximate the 10% mark and shade below it.

Now we're ready to use the z-score formula. Unfortunately we're missing an observation, but we have the percentage we're interested in! So we go to the Z-table and look for the Z value that corresponds to a probability of 0.1000. We're looking at a Z-score of -1.28.
So plugging into the z score formula:
(x - 3.10) / 0.075 = -1.28
x - 3.10 = -0.096
x = 3.004
In conclusion, the 10th percentile for sample mean trout prices (n = 16) is $3.004 / lb.
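
Going from a percentage back to a value is the inverse of a table lookup. As a rough check of the steps above, here is a sketch that assumes Python 3.8+ and uses statistics.NormalDist.inv_cdf in place of the Z-table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, n = 3.10, 0.30, 16            # values given in problem 31
se = sigma / n ** 0.5                    # 0.30 / 4 = 0.075

z10 = NormalDist().inv_cdf(0.10)         # z-score with 10% of the area to its left
x10 = NormalDist(mu, se).inv_cdf(0.10)   # same percentile on the x-bar scale

print(round(z10, 2), round(x10, 3))
```

The z value agrees with the table's -1.28 to two decimals, and x10 agrees with the hand calculation once rounded to three decimals.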



51. College Board reports that the mean increase in SAT math scores of students who attend review courses is 18 points. Assume that the standard deviation is 12 points and that the changes in score are not normally distributed. Suppose we take a sample of 40, and we are interested in P(x-bar < 0), the probability that the sample mean score increase is negative, indicating a loss of points after coaching.
μ = 18
σ = 12

The Central Limit Theorem says:
1) Shape: Approximately normal (40>30)
2) Center: μ(subscripted x-bar) = μ, meaning the center for both the population and the sample is the same.
Thus μ = 18
3) Spread: σ(subscripted x-bar) = σ/√ n (standard deviation divided by the square root of the sample size)
σ/√ n = 12 / √40 = 12/(2√10) = 1.897 ≈ 1.90



Now we're ready to use the z-score formula, so plugging into the z score formula:
(0-18) / 1.897 = -9.49

However, the most extreme value on the z-table is -3.49. Finding this probability on the z-table: 0.0002
In conclusion, the probability that a sample of 40 coached students has a negative mean score change is less than 0.02%.



7.3 #44)
Shaq led the league with a 58.4% goal percentage.
a) Find the minimum sample size that produces a sampling distribution of p-hat that is approximately normal
When dealing with proportions we have to make sure np ≥ 5 AND n(1-p) ≥ 5. This question is asking us to solve for n algebraically.
a1) n(.584) ≥ 5 (dividing both sides by .584)
n ≥ 5/.584 (simplifying)
n ≥ 8.56 (rounding up because you cannot take 8.56 shots)
n ≥ 9

a2) n(1-.584) ≥ 5
n(.416) ≥ 5 (dividing both sides by .416)
n ≥ 5/.416 (simplifying)
n ≥ 12.02 (rounding up because you cannot take 12.02 shots)
n ≥ 13

In conclusion, the minimum sample size required is 13 because that is the smallest n that makes both inequalities true.
13(.584) = 7.592 ≥ 5 and 13(.416) = 5.408 ≥ 5
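
The algebra in part a) can be wrapped in a small helper. This is only a sketch: the function name min_n_for_normal_approx and the threshold parameter are my own, and it assumes the np ≥ 5 and n(1-p) ≥ 5 rule used in these notes.

```python
import math

def min_n_for_normal_approx(p, threshold=5):
    """Smallest whole n with n*p >= threshold and n*(1-p) >= threshold."""
    # The smaller of p and 1-p is the binding condition.
    n = math.ceil(threshold / min(p, 1 - p))
    # Guard against floating-point rounding in the division above.
    while n * p < threshold or n * (1 - p) < threshold:
        n += 1
    return n

print(min_n_for_normal_approx(0.584))
```

Only the smaller of p and 1-p matters, which is why a2) decides the answer in this problem.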

b) Find μ(subscripted p-hat) and σ(subscripted p-hat) when n = 50.

The Central Limit Theorem says:
1) Shape: Approximately normal [np>5 and n(1-p)>5]
2) Center: μ(subscripted p-hat) = p, meaning the sampling distribution of p-hat is centered at the population proportion. Thus μ(subscripted p-hat) = 0.584
3) Spread: σ(subscripted p-hat) = √ [p(1-p)/n] (square root of the proportion multiplied by 1 minus the proportion divided by sample size)
√ [.584(1-0.584)/50] = √ [0.2429/50] = √ .004859 = 0.0697

c) Find the probability that in a sample of 200 shots, Shaq would score more than 120 baskets.
120/200 = .6
Given this proportion we can use the z-score formula
(.6-.584) / .0697 Stop! You can't use that standard deviation. It only works for a sample size of 50, so we need to recalculate it for n = 200!
√ [.584(1-0.584)/200] = .0349

Returning to our z-score formula...
(.6-.584) / .0349 = .4591 ≈ .46
Taking .46 to the Z table we find a probability of .6772
However, we are interested in the probability he would score MORE than 120, so we want the area to the right. You know what that means! 1-.6772 = 0.3228
Note: this answer varies from the book's because they forgot to recalculate the standard deviation to account for the increased sample size.
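
To see how much the spread changes between parts b) and c), here is a short sketch assuming Python 3.8+; se_phat is a helper name I made up for σ(subscripted p-hat).

```python
from statistics import NormalDist

p = 0.584                                # Shaq's goal percentage from the problem

def se_phat(p, n):
    """Spread of p-hat: sqrt(p(1-p)/n)."""
    return (p * (1 - p) / n) ** 0.5

se_50 = se_phat(p, 50)                   # part b), n = 50
se_200 = se_phat(p, 200)                 # part c), n = 200
z = (120 / 200 - p) / se_200             # z-score for a sample proportion of 0.6
p_more = 1 - NormalDist().cdf(z)         # area to the right

print(round(se_50, 4), round(se_200, 4), round(z, 2), round(p_more, 4))
```

The result differs slightly from 0.3228 in the fourth decimal because the code does not round z to .46 before looking up the area.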


8.3 #27)
Find the margin of error E for a 95% confidence interval
a) 5 successes in 10 trials.
5/10 = .5
so p-hat = .5
The margin of error formula is: E = Z α/2 (√ [p-hat(1-(p-hat))/n])
Plugging in what we have
E = Z α/2 (√ [0.5(1-0.5)/10])
E = Z α/2 (0.1581)
For this Z α/2 value we need to look at our confidence interval table provided in table 8.1 on pg. 391, similar to table 8.7 on page 422. (might be a good idea to put those on your note card, just saying...)

So Z α/2 for 95% is 1.96
E = 1.96 (.1581)
E = .3099
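
The same margin of error can be computed without the table. A minimal sketch, assuming Python 3.8+; the function name margin_of_error and its parameters are illustrative, not from the textbook.

```python
from statistics import NormalDist

def margin_of_error(successes, n, confidence=0.95):
    """E = Z(alpha/2) * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = successes / n
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # plays the role of Z alpha/2
    return z * (p_hat * (1 - p_hat) / n) ** 0.5

E = margin_of_error(5, 10)
print(round(E, 4))
```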



8.2 Confidence Intervals for Means
Introducing the T-Distribution: Similar to the Z-Distribution, but has more variability (meaning the data is more spread out)
Notice: Due to the extra variability there is more area in the tails (the extremes).

Remember: Degrees of Freedom = n-1

Confidence Interval for means = x-bar  ±  T α/2 (s / √n)
Notice that the standard deviation is similar to the one used for CLT means, but here we have s rather than σ. What does this mean? s is the standard deviation of our sample.



8.2 #6)
We are taking a random sample from a normal population with σ unknown. Find T α/2
a) Confidence level 95%, sample size 10
Degrees of freedom = n-1
Plugging in: 10 -1 = 9
Now we go to the t table and look at the 95% confidence level for a df of 9: 2.262

b) Confidence level 95%, sample size 15

Degrees of freedom = n-1
Plugging in: 15 -1 = 14
Now we go to the t table and look at the 95% confidence level for a df of 14: 2.145

c) Confidence level 95%, sample size 20

Degrees of freedom = n-1
Plugging in: 20 -1 = 19
Now we go to the t table and look at the 95% confidence level for a df of 19: 2.093



13. Confidence level 95%, sample size 25, sample mean 10, sample standard deviation 5.
CI: 95%, n = 25, x-bar = 10, s = 5.
a) T α/2
df = n-1, 25-1 = 24. Go to the T table and find the 95% confidence level for df = 24: 2.064


b) Margin of Error (E) = T α/2(s/√n)
Plugging in what we know: 2.064 (5/√25) = 2.064 (1) = 2.064

c) Confidence interval for μ given the indicated confidence level
Lower bound = x-bar - T α/2(s/√n)
10 - 2.064 = 7.936

Upper bound = x-bar + T α/2(s/√n)
10 + 2.064 = 12.064
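
Here is problem 13 as a short sketch. Python's standard library has no t-distribution, so the critical value 2.064 is typed in from the t table (df = 24, 95%) rather than computed; the variable names are my own.

```python
import math

x_bar, s, n = 10, 5, 25          # sample mean, sample standard deviation, sample size
t_crit = 2.064                   # T alpha/2 read from the t table for df = 24, 95%

E = t_crit * s / math.sqrt(n)    # margin of error
lower = x_bar - E                # lower bound of the confidence interval
upper = x_bar + E                # upper bound of the confidence interval

print(E, lower, upper)
```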

Monday, May 14, 2012

05-14-2012

Central Limit Theorem for Proportions (p-hat)


The Central Limit Theorem maintains:
1) Shape: Is normal or approximately normal
2) Center: μ(subscripted p-hat) = P, meaning the center for both the population and the sample is the same.
3) Spread: σ(subscripted p-hat) = √[(p(1-p))/n]


Note: We can only use the CLT for proportion when np ≥ 5 AND n(1-p) ≥ 5

Confidence Intervals - How certain do you want to be that you caught the population value?
Table derived from table 8.1 on pg. 391, similar to table 8.7 on page 422.



Confidence Level (1 - α)100% | α | α/2 | Z α/2
100(1-0.15)% = 85% | 0.15 | 0.075 | 1.44
100(1-0.10)% = 90% | 0.10 | 0.05 | 1.645
100(1-0.05)% = 95% | 0.05 | 0.025 | 1.96
100(1-0.01)% = 99% | 0.01 | 0.005 | 2.576
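
The Z α/2 column comes from cutting α/2 off each tail of the standard normal curve. A small sketch, assuming Python 3.8+; z_critical is a hypothetical helper name.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Z alpha/2: the z-score with alpha/2 of the area in the right tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.85, 0.90, 0.95, 0.99):
    print(f"{level:.0%}  Z alpha/2 = {z_critical(level):.3f}")
```

The printed values match the table up to rounding.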

We observed a bag with a proportion of .48 orange candies, but we don't know that the factory desires a 0.40 proportion for orange candies. Suppose we want to be 95% confident that our observation of 0.48 was within the factory specification.

To find our confidence interval we'll use the following formula (see p.391), written here for a proportion:
Lower bound = p-hat - Z α/2(σ(subscripted p-hat))
Upper bound = p-hat + Z α/2(σ(subscripted p-hat))
Here σ(subscripted p-hat) = √[(0.4(1-0.4))/25] = 0.0980.

So plugging in to find our bounds...
Lower bound = 0.48 - 1.96(0.0980) = 0.2879
Upper bound = 0.48 + 1.96(0.0980) = 0.6721

In conclusion, we're 95% confident that the true proportion of orange candies in a given bag of Reese's Pieces will fall between 0.2879 and 0.6721.

What if we wanted to be 99% confident?
Simply changing the Z α/2 to reflect our desired confidence level:
Lower bound = 0.48 - 2.576(0.0980) = 0.2276
Upper bound = 0.48 + 2.576(0.0980) = 0.7324
In conclusion, we're 99% confident that the true proportion of orange candies in a given bag of Reese's Pieces will fall between 0.2276 and 0.7324.



In reality we'll rarely know the true population proportion, so how do we adjust for this? We continue using the CLT for proportions but replace every instance of P (population proportion) with p-hat (our sample proportion).

Given
n = 25
p-hat = 0.48
standard deviation = √ [(P(1-P))/n]
But wait, we don't have P, what ever will we do? Use p-hat!
so plugging in: √[(0.48(1-0.48))/25] = 0.0999

Predicting a 95% confidence interval for this data...
Lower bound  = 0.48 - 1.96 (.0999) = 0.2842
Upper bound = 0.48 + 1.96(.0999) = 0.6758
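
The p-hat version of the interval can be sketched as one function. This assumes Python 3.8+; proportion_ci is a name I made up, and it uses p-hat in the spread formula exactly as described above.

```python
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Return (lower, upper) for p-hat +/- Z(alpha/2) * sqrt(p-hat(1-p-hat)/n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = (p_hat * (1 - p_hat) / n) ** 0.5    # spread estimated from the sample
    return p_hat - z * se, p_hat + z * se

lower, upper = proportion_ci(0.48, 25)
print(round(lower, 4), round(upper, 4))
```

Calling proportion_ci(0.48, 25, confidence=0.99) widens the interval, the same effect as swapping 1.96 for 2.576 by hand.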




Margin of error = Z α/2 (√ [p-hat(1-(p-hat))/n])
Please note that the margin of error is exactly the same as the portion of the confidence interval formula that lies to the right of the operation symbol (±).




Hypothesis Testing - Think of this as "looking for how much evidence we have against the null hypothesis" or "Using Statistics to answer a question"

Suppose that a recent poll reported that 49% of the United States is pro-choice, however you think it's higher.

You need to do a few things:
1) Form a null hypothesis (H0) which assumes the original value is true (status quo).
2) Offer an alternate hypothesis (HA) in which you make your claim. HA: P ≠ P0, P > P0, or P < P0. In this example we are assuming P > P0.
3) Get sample and find a test statistic (Use the Z Score formula)
4) Go to the chart and find the P-Value for your Z Score of interest.

  • P(Z > z) for P > P0
  • P(Z < z) for P < P0
  • 2P(Z < -|z|) for P ≠ P0

5) Conclusion
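
The five steps can be sketched in code. The notes give only the 49% figure, so the sample below (520 out of 1000) is purely hypothetical, made up to show the mechanics; the function name and its parameters are also my own. Assumes Python 3.8+.

```python
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alternative="greater"):
    """Return (z, p_value) for H0: P = p0 against the chosen alternative."""
    p_hat = successes / n
    se = (p0 * (1 - p0) / n) ** 0.5      # spread of p-hat assuming H0 is true
    z = (p_hat - p0) / se                # test statistic
    std = NormalDist()
    if alternative == "greater":         # HA: P > P0
        p_value = 1 - std.cdf(z)
    elif alternative == "less":          # HA: P < P0
        p_value = std.cdf(z)
    else:                                # HA: P != P0
        p_value = 2 * std.cdf(-abs(z))
    return z, p_value

# Hypothetical sample: 520 of 1000 respondents, tested against P0 = 0.49
z, p_value = one_proportion_z_test(520, 1000, p0=0.49)
print(round(z, 2), round(p_value, 4))
```

A small p-value means the sample would be unusual if the null hypothesis were true, which is the "evidence against the null" described above.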

Wednesday, May 9, 2012

05-09-2012


Central Limit Theorem (CLT) for Means (x-bar)  - Allows us to figure out how samples of varying sizes behave in the long run.

The Central Limit Theorem states that:
1) Shape: Is normal or approximately normal
2) Center: μ(subscripted x-bar) = μ, meaning the center for both the population and the sample is the same.
3) Spread: σ(subscripted x-bar) = σ/√ n (standard deviation divided by the square root of the sample size)


Note: We can only use the CLT when n ≥ 30 OR the population is normally distributed

Example 7.12 (p. 357)
μ = 12,485
σ = 21,973
n = 10
Notice that the standard deviation is enormous. Looking at the provided histogram, we observe that the distribution is right skewed, so the population is NOT normally distributed. With n = 10 < 30, neither CLT condition is met, and if we used the CLT anyway we would get inaccurate results. Therefore we do not continue.

Now let's suppose our sample size was 36 as opposed to 10.
Now that n ≥ 30, we can use the CLT.

So using what we know about the CLT we can conclude three things.
1) Shape. Shape will be approximately normal
2) Center. Centered at μ (12,485)
3) Spread. σ/√ n. Plugging in 21973/√ 36 = 21973/6 = 3662.17

Suppose we are interested in the probability that the mean is greater than 17,000.
p(x-bar > 17,000)
In order to solve this problem, we are going to use the Z-score formula
z = (observed - expected) / standard deviation
The only difference in using the z score formula in conjunction with the CLT is that the standard deviation we plug in is the one we have found using σ/√ n.

So for this problem:
(17000-12485) / 3662.17 = 1.23
Going to the Z-table with this value of 1.23 we find a value of 0.8907.
However, the question posed was the probability of finding a sample mean greater than 17,000, so we need to do 1-0.8907 to find the area to the right. Subtracting, we find the probability of a sample mean higher than 17,000 to be 0.1093. Interpretation: there is a 10.93% chance that a random sample of 36 cities will give a sample mean greater than 17,000 small businesses.
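
The whole example, including the check that stopped us when n was 10, can be sketched as one function. This assumes Python 3.8+; prob_mean_greater and the population_normal flag are names I introduced for illustration.

```python
from statistics import NormalDist

def prob_mean_greater(cutoff, mu, sigma, n, population_normal=False):
    """P(x-bar > cutoff) using the CLT, refusing to run if its conditions fail."""
    if n < 30 and not population_normal:
        raise ValueError("CLT condition not met: need n >= 30 or a normal population")
    se = sigma / n ** 0.5                     # spread of x-bar
    return 1 - NormalDist(mu, se).cdf(cutoff)

print(round(prob_mean_greater(17000, 12485, 21973, 36), 4))
```

The result is close to 0.1093 but not identical, because the code does not round z to 1.23 before finding the area. Calling it with n = 10 raises an error, mirroring the decision not to continue above.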



Central Limit Theorem for Proportions (p-hat)


The Central Limit Theorem maintains:
1) Shape: Is normal or approximately normal
2) Center: μ(subscripted p-hat) = P, meaning the center for both the population and the sample is the same.
3) Spread: σ(subscripted p-hat) = √[(p(1-p))/n]


Note: We can only use the CLT for proportion when np ≥ 5 AND n(1-p) ≥ 5



In class example with Reese's Pieces applet.

We are interested in the proportion of candies that are orange.
Setting π = 0.40 (makes the simulation machine produce 40% orange candies)
n = 25
Then draw a sample, Brandon got 0.36.


Applying what the Central Limit Theorem tells us about sampling variability.
1) Shape: Is normal or approximately normal
2) Center: μ(subscripted p-hat) = 0.4
3) Spread: σ(subscripted p-hat) = √[(0.4(1-0.4))/25] = 0.09798 ≈ 0.0980

Checking to see if it meets the prerequisite criteria: 25(0.4) = 10. 10 > 5, so we're good there.
25(1-0.4) = 15, 15>5. Both criteria have been met.

What's the probability of observing a bag of Reese's Pieces with 24% or fewer orange candies?
Z score time!
( 0.24 - 0.4 ) / 0.098 = -1.63
Finding -1.63 on the z table, we find that the probability is .0516.
Interpretation: 5.16% chance of finding a package with 24% or fewer orange candies.

What's the probability of observing a bag of Reese's Pieces with 60% or greater orange candies?
( 0.6 - 0.4 ) / 0.098 = 2.04
Finding 2.04 on the Z table, we find the probability to be 0.9793. However, that .9793 refers to the area to the left. We're interested in the area to the right, so we do 1-0.9793 and find the probability to be .0207.
Interpretation: 2.07% chance of finding 60% or greater orange candies in any given package of Reese's Pieces.
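
Both Reese's Pieces probabilities can be checked in a few lines. A minimal sketch, assuming Python 3.8+; the variable names are my own, and p and n are the applet settings from above.

```python
from statistics import NormalDist

p, n = 0.40, 25                          # applet settings used in class
se = (p * (1 - p) / n) ** 0.5            # spread of p-hat, about 0.098
sampling_dist = NormalDist(p, se)

p_low = sampling_dist.cdf(0.24)          # area to the left of 0.24
p_high = 1 - sampling_dist.cdf(0.60)     # area to the right of 0.60

print(round(p_low, 4), round(p_high, 4))
```

The values differ from the table answers only in the fourth decimal, since the z-scores are not rounded to two places here.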



If the class continues to take increasingly larger samples we tighten up the variability and come closer to our intended proportion of 0.40. Observe the trend in the table below:


Sample Size (n) | Class Low | Class High
25 | .24 | .60
50 | .22 | .50
75 | .29 | .53
100 | .30 | .52
500 | .36 | .43


The larger the sample size the closer the values are to our intended 0.40.
In order to cut the standard deviation in half you need to quadruple the sample size.
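
The "quadruple the sample size" rule follows from the √n in the spread formula. A short sketch using the sample sizes from the table above; se_phat is a helper name I made up.

```python
p = 0.40                                  # intended proportion of orange candies

def se_phat(p, n):
    """Spread of p-hat: sqrt(p(1-p)/n)."""
    return (p * (1 - p) / n) ** 0.5

sizes = [25, 50, 75, 100, 500]            # sample sizes tried in class
spreads = {n: se_phat(p, n) for n in sizes}
for n, se in spreads.items():
    print(n, round(se, 4))

# Going from n = 25 to n = 100 quadruples the sample size
ratio = se_phat(p, 100) / se_phat(p, 25)
print(ratio)
```

The ratio comes out to one half, because √(4n) = 2√n.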

Monday, May 7, 2012

05-07-2012

Sampling Distribution - Pattern of Variability for Samples
Quantitative - Involves numbers, summarized by x-bar and used to estimate μ (the population mean)
Qualitative (categorical) - Involves categories rather than numbers (e.g. hair color or gender), summarized by p-hat and used to estimate P (the population proportion).

In Class Skittles Exercise
Each student was given a fun size package of skittles and asked to record:
1) Total number of skittles
2) Number of purple and red skittles (combined)
3) Proportion of skittles per package that are red and purple (combined)

The total is quantitative (number)
The proportion is categorical because it's a color reported.
Sample size = 36 students in the class

Normally distributed - Average number of skittles per package 16
X-bar (mean): 15.125
Standard deviation: 1.328

What if you are asked to report your mean? Just report the number of candies present in your bag.
Better estimates, though, come from taking larger sample sizes or taking more samples.


Central Limit Theorem (CLT) for Means (x-bar)  - Allows us to figure out how samples of varying sizes behave in the long run.

The Central Limit Theorem states that:
1) Shape: Is normal or approximately normal
2) Center: μ(subscripted x-bar) = μ, meaning the center for both the population and the sample is the same.
3) Spread: σ(subscripted x-bar) = σ/√ n (standard deviation divided by the square root of the sample size)

Note: We can only use the CLT when n ≥ 30 or the population is normally distributed

Saturday, May 5, 2012

05-03-2012

Walk through on how to solve the types of Problems encountered in 6.5 [video]


7.1 Sampling Error

Sampling error - Absolute value of the difference between statistic and parameter.
Mean: | x-bar -  μ |
Standard deviation: | s - σ |
Proportion: | p-hat - P |

μ - Population (Parameter) mean

x-bar - Sample (Statistic) mean or "Point estimate"

Monday, April 23, 2012

04-23-2012

Remember me?
The Standard Normal Distribution (Z Distribution) [video]:

  • The area under the curve is always equal to one (1.0), because the total probability for any event is 1.0 (As you may recall the probability of 1 means that the event always happens or has a 100% probability - which is our maximum)
  • The distribution is always centered around the mean (µ), which has a Z score of 0. 50% of the data lies above the mean and 50% of the data lies below the mean.
  • To find Z scores: z = (x - µ) / σ


Z Table (pdf) - Gives you area under the curve for the specific standard deviation you are interested in.
Normal distribution (java applet) - Helps you visualize what you are trying to find


What is the area to the left of Z-score value 0.57?
First we find the 0.5 row under the Z column on the table, then move across that row to find 0.07. This will give us the area under the curve at 0.57 which is: 0.7157
p(Z<0.57) = 0.7157


Suppose we want to know the area to the right of the Z-score 0.57, how would we do that given that the Z-Table only provides us with area to the LEFT of the Z score? Well we use our knowledge that the entire curve accounts for a total probability of 1. If we remove the section to the left (which we can find easily from the Z-table) we will be left with the area to the right!
p(Z>0.57) = 1 - 0.7157
p(Z>0.57) = 0.2843

Alternatively, the area to the right of a positive value is the same as the area to the left of the matching negative value, due to the symmetric nature of the distribution.
p(Z>0.57) = p(Z<-0.57) = 0.2843


Remember the Empirical Rule? You know, one standard deviation being 68%, two standard deviations being 95% and three standard deviations being 99.7%? Well let's see how accurate that rule of thumb is.

So what is the probability that Z is greater than negative one standard deviation and less than one standard deviation? p(-1 < Z < 1) ?
To find the area between two points it's best to approach it as two separate problems, so let's find the highest value first.

Let's draw a picture of what we're trying to find so we can visualize the problem, this java applet should help you on your way.

Looking at the Z table, what's the probability that Z is less than positive one? p (Z < 1.00) = 0.8413
Looking at the Z table, what's the probability that Z is less than negative one? p(Z < -1.00) = 0.1587

Since we are interested in the area between -1 and 1, we don't really want the area to the left of -1 as we have found, so we can just subtract it from the area to the left of positive 1 and we will have our answer.
0.8413 - 0.1587 = .6826
p(-1 < Z < 1) = .6826, or 68.26% of the data lies within 1 standard deviation of the mean
Pretty close to the Empirical Rule's estimate of 68% of the data falling within 1 standard deviation.

p(-2 < Z < 2) ?
p(Z<2.00) = 0.9772
p(Z<-2.00) = 0.0228

p(-2 < Z < 2)  = 0.9772 - 0.0228
p(-2 < Z < 2)  = 0.9544

Again, pretty close to the Empirical Rule's estimate of 95%

p(-3 < Z < 3) ?

p(Z<3.00) = 0.9987
p(Z<-3.00) = 0.0013

p(-3 < Z < 3)  = 0.9987 - 0.0013
p(-3 < Z < 3)  = 0.9974
Also close to the Empirical Rule's estimate of 99.7%
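
The three table checks above can be repeated in a loop. A minimal sketch, assuming Python 3.8+; area_between is a hypothetical helper that subtracts the two left-hand areas, just as we did by hand.

```python
from statistics import NormalDist

Z = NormalDist()                          # standard normal: mean 0, standard deviation 1

def area_between(a, b):
    """P(a < Z < b) = area left of b minus area left of a."""
    return Z.cdf(b) - Z.cdf(a)

for k in (1, 2, 3):
    print(k, round(area_between(-k, k), 4))
```

The last digit can differ from the table answers by one, because the table rounds each area to four decimals before subtracting.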

But what if we wanted to know what Z scores correspond with exactly 68%?
Well if we know the entire area under the curve is 100% or 1.0 and we want to capture the middle 68%
1-0.68 = 0.32 so there will be 32% of the data unaccounted for, but this is not at one end, it's distributed evenly at both tails because the distribution is symmetric. So 0.32/2 = 0.16
1-0.16 = 0.8400

Now we go to the Z-Table and look for the value closest to 0.8400. After some searching we find the values 0.8365 (located at 0.98), 0.8389 (located at 0.99), and 0.8413 (located at 1.00). Of these, 0.8389 is the closest to 0.8400 (off by 0.0011, versus 0.0013 for 0.8413), so we go with 0.99. So if we're interested in the middle 68% we're looking at p( -0.99 < Z < 0.99)
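
Searching the table backwards is what an inverse lookup does. A small sketch, assuming Python 3.8+ and using statistics.NormalDist.inv_cdf in place of the table search; the variable names are my own.

```python
from statistics import NormalDist

Z = NormalDist()
middle = 0.68                             # the middle area we want to capture
tail = (1 - middle) / 2                   # 0.16 left over in each tail
z = Z.inv_cdf(1 - tail)                   # z-score with 0.84 of the area to its left
check = Z.cdf(z) - Z.cdf(-z)              # area between -z and z

print(round(z, 2), round(check, 2))
```

Rounded to two decimals, z agrees with the 0.99 found in the table.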