Monday, April 23, 2012

04-23-2012

Remember me?
The Standard Normal Distribution (Z Distribution) [video]:

  • The area under the curve is always equal to one (1.0), because the total probability of all possible outcomes is 1.0. (As you may recall, a probability of 1 means that the event always happens, or has a 100% probability, which is our maximum.)
  • The distribution is always centered on the mean (µ), which sits at Z = 0 (zero standard deviations away); the standard normal distribution itself has a standard deviation of 1. 50% of the data lies above the mean and 50% of the data lies below the mean.
  • To find Z scores: Z = (x - µ) / σ, where x is the observed value, µ is the mean, and σ is the standard deviation.


Z Table (pdf) - Gives you area under the curve for the specific standard deviation you are interested in.
Normal distribution (java applet) - Helps you visualize what you are trying to find


What is the area to the left of Z-score value 0.57?
First we find the 0.5 row under the Z column on the table, then move across that row to the 0.07 column. This gives us the area under the curve to the left of 0.57, which is 0.7157
p(Z<0.57) = 0.7157


Suppose we want to know the area to the right of the Z-score 0.57, how would we do that given that the Z-Table only provides us with area to the LEFT of the Z score? Well we use our knowledge that the entire curve accounts for a total probability of 1. If we remove the section to the left (which we can find easily from the Z-table) we will be left with the area to the right!
p(Z>0.57) = 1 - 0.7157
p(Z>0.57) = 0.2843

Alternatively, the area to the right of a positive value is the same as the area to the left of the matching negative value, because the distribution is symmetric.
p(Z>0.57) = p(Z<-0.57) = 0.2843
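
If you'd rather check the table lookup with a few lines of code, here is a minimal sketch using Python's standard library (`statistics.NormalDist`, available in Python 3.8+). The variable names are just illustrative; this is not part of the class materials.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

left = z.cdf(0.57)     # area to the left of Z = 0.57 (what the Z table gives)
right = 1 - left       # area to the right, using total area = 1
mirror = z.cdf(-0.57)  # same right-tail area, found by symmetry

print(round(left, 4), round(right, 4), round(mirror, 4))
```

Both routes to the right-tail area (subtracting from 1, or looking up the negative Z score) should agree.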


Remember the Empirical Rule? You know, one standard deviation being 68%, two standard deviations being 95% and three standard deviations being 99.7%? Well let's see how accurate that rule of thumb is.

So what is the probability that Z is greater than negative one standard deviation and less than one standard deviation? p(-1 < Z < 1) ?
To find the area between two points it's best to approach it as two separate problems, so let's find the highest value first.

Let's draw a picture of what we're trying to find so we can visualize the problem. This java applet should help you on your way.

Looking at the Z table, what's the probability that Z is less than positive one? p (Z < 1.00) = 0.8413
Looking at the Z table, what's the probability that Z is less than negative one? p(Z < -1.00) = 0.1587

Since we are interested in the area between -1 and 1, we don't really want the area to the left of -1 as we have found, so we can just subtract it from the area to the left of positive 1 and we will have our answer.
0.8413 - 0.1587 = .6826
p(-1 < Z < 1) = .6826, or 68.26% of the data lies within 1 standard deviation of the mean
Pretty close to the Empirical Rule's estimate of 68% of the data falling within 1 standard deviation.

p(-2 < Z < 2) ?
p(Z<2.00) = 0.9772
p(Z<-2.00) = 0.0228

p(-2 < Z < 2)  = 0.9772 - 0.0228
p(-2 < Z < 2)  = 0.9544

Again, pretty close to the Empirical Rule's estimate of 95%

p(-3 < Z < 3) ?

p(Z<3.00) = 0.9987
p(Z<-3.00) = 0.0013

p(-3 < Z < 3)  = 0.9987 - 0.0013
p(-3 < Z < 3)  = 0.9974
Also close to the Empirical Rule's estimate of 99.7%
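
The same "area to the left of the upper value minus area to the left of the lower value" approach can be sketched in Python. The helper name `middle_area` is made up for this example. Because the printed Z table rounds each entry to four decimals, the last digit from the code can differ slightly from the table-based answers.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution


def middle_area(k):
    """Area under the standard normal curve between -k and +k."""
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(middle_area(k), 4))
```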

But what if we wanted to know what Z scores correspond with exactly 68%?
Well, we know the entire area under the curve is 100% or 1.0, and we want to capture the middle 68%.
1 - 0.68 = 0.32, so 32% of the data is unaccounted for. That 32% is not all at one end; it is split evenly between the two tails because the distribution is symmetric. So 0.32 / 2 = 0.16 in each tail.
1 - 0.16 = 0.8400, which is the area to the left of the upper cutoff.

Now we go to the Z-Table and look for the value closest to 0.8400. After some searching we find the values 0.8365 (located at 0.98), 0.8389 (located at 0.99), and 0.8413 (located at 1.00). Neither neighbor matches exactly, but 0.8389 is closer to 0.8400 (off by 0.0011) than 0.8413 is (off by 0.0013), so we go with 0.8389. So if we're interested in the middle 68%, we're looking at approximately p(-0.99 < Z < 0.99)
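
Going from an area back to a Z score is the reverse of a table lookup. As a rough sketch (assuming Python 3.8+, where `NormalDist.inv_cdf` exists), the cutoff for the middle 68% can be computed directly instead of hunting through the table:

```python
from statistics import NormalDist

middle = 0.68
tail = (1 - middle) / 2   # 0.16 in each tail
area_left = 1 - tail      # 0.84 to the left of the upper cutoff

z_cut = NormalDist().inv_cdf(area_left)  # Z score with that area to its left
print(round(z_cut, 4))
```

The result lands between the 0.99 and 1.00 rows of the table, which is why the table search above has to settle for the nearest entry.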

Wednesday, April 18, 2012

04-16-2012

Binomial Distribution
1) 2 possible outcomes
2) Fixed number of trials
3) Independent events
4) Probability is constant

The Binomial Formula: p(x) = nCx (p)^x (1-p)^(n-x)
p = Probability of success on a single trial
n = number of trials
x = number of successes

"At most" means "less than or equal to".  Ex: If I wanted at most 5, I am interested in the probabilities between 0 and 5. [p(0)+p(1)+p(2)+p(3)+p(4)+p(5)]


"At least" means "greater than or equal to". Ex: If I wanted at least 5 in a sample of ten, I am interested in the probabilities between 5 and 10. [p(5)+p(6)+p(7)+p(8)+p(9)+p(10)]




6.2 # 13-18, the experiment is to toss a fair coin three times. Use the binomial formula to find the indicated probabilities.

13. No heads were observed.
p = 0.50 , n = 3, x = 0.
p(0) = 3C0 (0.5)^0 (1-0.5)^(3-0)
p(0) = (1) (1) (0.125)
p(0) = 0.125

14. One head was observed.
p = 0.50 , n = 3, x = 1.
p(1) = 3C1 (0.5)^1 (1-0.5)^(3-1)
p(1) = (3) (0.5) (0.25)
p(1) = 0.375

15. Two heads were observed
p = 0.50 , n = 3, x = 2.
p(2) = 3C2 (0.5)^2 (1-0.5)^(3-2)
p(2) = (3) (0.25) (0.5)
p(2) = 0.375

16. Three heads were observed.
p = 0.50 , n = 3, x = 3.
p(3) = 3C3 (0.5)^3 (1-0.5)^(3-3)
p(3) = (1) (0.125) (1)
p(3) = 0.125

17. At most two heads were observed.
"At most" refers to a combined probability, in this case it's "less than or equal to" two heads. So we are interested in the cumulative probability from 0 to 2, which we already found in the previous problems so...
p(0) + p(1) + p(2) = ?
0.125 + 0.375 + 0.375 = 0.875

18. More than two heads were observed
"More than" refers to a combined probability, in this case it's "greater than" two heads. So we are interested in the probabilities for more than 2 heads; however, with only three tosses there is only one outcome that matches this...
p(3) = 0.125
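
Problems 13 through 18 can be double-checked with a short Python sketch of the binomial formula. The function name `binom_pmf` is just a label chosen for this example.

```python
from math import comb


def binom_pmf(x, n, p):
    """Probability of exactly x successes in n trials, each with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Fair coin tossed three times: p(0), p(1), p(2), p(3)
probs = [binom_pmf(x, 3, 0.5) for x in range(4)]

at_most_two = sum(probs[:3])    # p(0) + p(1) + p(2)
more_than_two = sum(probs[3:])  # only p(3) qualifies

print(probs, at_most_two, more_than_two)
```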

22. Probability that at least 3 of the next 4 dentists surveyed will recommend sugarless gum. Assume the recommendation is given 95% of the time.
"At least" refers to a combined probability, in this case we're interested in a minimum of 3 successes. So we need to find probability of observing a 3 and the probability of observing a 4.
p(3) = 4C3 (0.95)^3 (1-0.95)^(4-3)
p(3) = (4) (0.857375) (.05)
p(3) = 0.1715

p(4) = 4C4 (0.95)^4 (1-0.95)^(4-4)
p(4) = (1) (0.81450625) (1)
p(4) = 0.8145

p(3) + p(4) = 0.1715 + 0.8145
p(3) + p(4) = 0.9860
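
An "at least" probability is just a sum of binomial terms from the minimum count up to n. Here is a hedged sketch for the dentist problem; `binom_pmf` and `at_least` are hypothetical helper names, not anything from the textbook.

```python
from math import comb


def binom_pmf(x, n, p):
    """Probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def at_least(k, n, p):
    """Probability of k or more successes in n trials."""
    return sum(binom_pmf(x, n, p) for x in range(k, n + 1))


# At least 3 of the next 4 dentists recommend sugarless gum (p = 0.95)
p_at_least_3 = at_least(3, 4, 0.95)
print(round(p_at_least_3, 4))
```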

If you are unclear on any of this, please watch this video.

Wednesday, April 11, 2012

04-11-2012

Section 5.3, page 234
Sampling with replacement - Each trial is an independent event. (e.g. pulling names from a hat and returning the names to the hat after they've been pulled)

Sampling without replacement - Each trial is a dependent event, so the odds of an event happening change with each trial (e.g. if you pull names from a hat and don't return them, and you keep pulling, your name will eventually come up; thus your odds increase with each trial).



ELISA test of HIV example
The ELISA test reports a positive result 99.6% of the time if the blood has HIV; therefore it reports a false negative (meaning the test says you don't have HIV, but you do) 0.4% of the time (1-.996 = .004).

If blood has no HIV the test reports a negative result 98% of the time, conversely the false positive (meaning the test says you have HIV, but you don't) rate is 2% (1-.98 = .02).

Suppose the prevalence of HIV is 0.5% and we collect blood samples from 100,000 randomly selected people.

How many people will have HIV?
Well if 0.5% of the population have HIV and we have 100,000 people, we simply multiply the population by the percentage: (0.005)*(100000) = 500.
500 people will have HIV


How many people do not have HIV?
100000 - 500 = 99500.
99500 will not have HIV



Population
HIV Positive    500
HIV Negative    99500




How many of the 500 HIV positive people will the test detect?
The test accurately reports positive 99.6% of the time if the blood has HIV, so (0.996)* (500) = 498
498 HIV positive people will be detected by the test

How many of the 500 HIV positive people will the test miss?
The test erroneously reports negative (false negative) 0.4% of the time if the blood has HIV, so (0.004) * (500) = 2
Alternatively, you could recognize that the false negative group is the complement of the correctly detected positive group, so 500-498 = 2.
2 HIV positive people will be erroneously reported as HIV negative (missed by the test).

How many of the 99500 HIV negative people will the test detect?
The test accurately reports a negative result 98% of the time, so (0.98) * (99500) = 97510
97510 HIV negative people will be detected by the test

How many of the 99500 HIV negative people will the test miss?
The test erroneously reports positive (false positive) 2% of the time if the blood lacks HIV, so (0.02)*(99500) = 1990
Alternatively, you could recognize that the false positive group is the complement of the correctly detected negative group, so 99500-97510 = 1990
1990 HIV negative people will be erroneously reported as HIV positive (missed by the test).

Using this data to create a cross-tabulation table...


                     Actually Positive (+)   Actually Negative (-)   Total
Test Positive (+)    498                     1990                    2488
Test Negative (-)    2                       97510                   97512
Total                500                     99500                   100000

Proportion of people that are actually HIV positive given ELISA reported negative?
p(HIV+ | ELISA-) = 2 / 97512
p(HIV+ | ELISA-) = 0.00002051
Only about 0.002051% of the samples that ELISA reports as negative are actually HIV positive

Proportion of people that are HIV positive given ELISA reported positive? 
p(HIV+ | ELISA+) = 498 / 2488
p(HIV+ | ELISA+) = 0.200160772
About 20% of the samples that ELISA reports as positive are actually HIV positive

Proportion of people that are HIV negative given ELISA reported positive?
p(HIV- | ELISA+) = 1990 / 2488
p(HIV- | ELISA+) = 0.799839228
Nearly 80% of the samples that ELISA reports as positive are actually HIV negative
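
The whole ELISA walkthrough can be reproduced in a few lines of Python. This is only a sketch using the numbers from the example; the names `sensitivity` and `specificity` are the usual terms for the 99.6% and 98% rates but are not used in the notes above.

```python
population = 100_000
prevalence = 0.005   # 0.5% of people have HIV
sensitivity = 0.996  # positive result when blood has HIV
specificity = 0.98   # negative result when blood has no HIV

hiv_pos = round(population * prevalence)  # actually positive
hiv_neg = population - hiv_pos            # actually negative

true_pos = round(hiv_pos * sensitivity)   # HIV positive, test positive
false_neg = hiv_pos - true_pos            # HIV positive, test negative
true_neg = round(hiv_neg * specificity)   # HIV negative, test negative
false_pos = hiv_neg - true_neg            # HIV negative, test positive

# Conditional proportions read off the cross-tabulation
p_pos_given_test_neg = false_neg / (false_neg + true_neg)
p_pos_given_test_pos = true_pos / (true_pos + false_pos)
p_neg_given_test_pos = false_pos / (true_pos + false_pos)

print(true_pos, false_neg, true_neg, false_pos)
print(p_pos_given_test_neg, p_pos_given_test_pos, p_neg_given_test_pos)
```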


Why would ELISA be designed so that false positives (reporting HIV positive when the sample is actually HIV negative) occur more frequently than false negatives (reporting HIV negative when the sample truly is HIV positive)?
A person who receives a false negative from ELISA could potentially spread the infection further given the epidemic nature of the illness. Thus, ELISA was designed in order to keep the false negative rate as low as possible.

6.1)

Discrete Random Variable - A countable number of possible outcomes. If you chose one student from the class, there are 40 possible outcomes.

Continuous Random Variable - An infinite number of possible outcomes within a range. If the class ran a mile, the finishing times could take infinitely many values.

If you would like to augment your knowledge of this subject, please watch this video.

6.2)
Binomial Distribution or Binomial "pattern of variability"
How do you determine if it's a Binomial Distribution?

  1. Two possible outcomes (event happens or it doesn't)
  2. Fixed number of trials/attempts (I will play $1 on the slot machine, as opposed to playing until I win or run out of money)
  3. Each outcome is independent (the outcome is not contingent upon the previous outcome, for example if you were to flip a coin- the coin is not going to "remember" to land on heads the second flip because it landed on heads the first flip)
  4. Probability remains constant (success/failure rate remains the same from one trial to the next; odds do not change as you continue)
If you are unclear on any of this, please watch this video


Is flipping a coin three times for a heads a binomial distribution?
1) 2 possible outcomes? Yes (heads or not heads)
2) Fixed number of trials? Yes (3)
3) Outcomes independent? Yes
4) Probability remains constant? Yes
All 4 conditions are met, this qualifies as a binomial distribution.

34% of burglars enter through the front door. Is a study of 36 burglaries a binomial distribution?

1) 2 possible outcomes? Yes (front door entry or not)
2) Fixed number of trials? Yes (36)
3) Outcomes independent? Yes (36 different burglaries, they have nothing to do with one another)
4) Probability remains constant? Yes (34%)
All 4 conditions are met, this qualifies as a binomial distribution.

Formula for Binomial distribution by hand p. 277


Binomial Distribution in Minitab

In this example we will use a binomial distribution to randomly generate the number of females among your four hypothetical children.

Calc> Random Data> Binomial Distribution

1000 rows (one simulated family per row), storing in C1, 4 trials (children per family), probability of event (.51)

Select OK> Number of girls in each family is now in each cell.

Stat> Tables> Tally Individual Variables>

Select C1, include "Counts" and "Percents"

Select OK>

Interpreting results:
p(0 females) = 5.2%
p(1 female) = 24.0%
p(2 females) = 38.1%
p(3 females) = 23.7%
p(4 females) = 9.0%

p(3 of 1 gender) = p(1G & 3B) + p(3G & 1B) = 24.0 + 23.7 = 47.7%
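
If you don't have Minitab handy, a similar simulation can be sketched in Python. This is not the Minitab procedure itself: the seed value is arbitrary (chosen only so the run is repeatable), so the simulated percentages will not match the Minitab output above exactly. The exact binomial probabilities are printed alongside for comparison.

```python
import random
from collections import Counter
from math import comb

n_children, p_girl, n_families = 4, 0.51, 1000

rng = random.Random(2012)  # arbitrary seed, an assumption for repeatability

# Each family: count how many of the 4 children come up "girl"
girls = [
    sum(rng.random() < p_girl for _ in range(n_children))
    for _ in range(n_families)
]
tally = Counter(girls)

# Exact probabilities from the binomial formula
exact = [
    comb(n_children, k) * p_girl**k * (1 - p_girl) ** (n_children - k)
    for k in range(n_children + 1)
]

for k in range(n_children + 1):
    print(k, f"simulated {tally[k] / n_families:.1%}", f"exact {exact[k]:.1%}")
```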


Suppose you wanted to grow your happy hypothetical family from 4 children to 10 children. What happens to the probability of having a female?

Calc> Random Data> Binomial Distribution> 1000 rows (trials), storing in C2, 10 trials, probability of event (.51)

Resulting data is stored in C2>

Stat> Tables> Tally Individual Variables> Select C2, include "Counts" and "Percents"> Select OK>


Interpreting results:
p(0 females) = 0%
p(1 female) = 0.7%
p(2 females) = 3.6%
p(3 females) = 10.2%
p(4 females) = 17.6%
p(5 females) = 26.9%
p(6 females) = 22.3%
p(7 females) = 12.8%
p(8 females) = 5.3%
p(9 females) = 0.6%
p(10 females) = 0%


What have we learned? The more trials (children) you have, the more the probability is spread across the possible outcomes, so the probability of any one specific count decreases. If you wanted the best odds of an even split between the genders, you're best off stopping at 2 children; continuing to have kids will not increase those odds!

Monday, April 9, 2012

04-09-2012

Lotto example continued

5 random numbers: 56, 55, 54, 53, 52 and a mega number: 46

So, 56 * 55 * 54 * 53 * 52 = 458377920
458377920 * 46 = 2.108538432 * 10^10

But the actual number of possible outcomes is: 175711536
How do we get this number? We eliminate the duplicate counts.
How do we eliminate the duplicates? We first need to determine if it's a permutation or a combination.

Combination (nCr): n! / [r! (n-r)!]
  • r items chosen from n distinct items
  • No repetition allowed
  • Order is not important

Permutation (nPr): n! / (n-r)!
  • r items permuted from n distinct items 
  • No repetition allowed 
  • Order is important
Video lecture on Permutations and Combinations

Factorial:
0! = 1
1! = 1
2! = 2 * 1
5! = 5 * 4 * 3 * 2 * 1
9! = 9 * 8 * 7 * 6 * 5 *4 * 3 * 2 * 1

Video lecture on Factorials

As you may have determined, the lotto is a combination because the order of the numbers is not important.
Applying the combination formula: nCr -> 56C5, meaning 5 items chosen from 56 distinct items.
Plugging into the combination formula: 56! / [5! (56-5)!]
56! / [5! (51)!]
(56 * 55 * 54 * 53 * 52 * 51!) / [5! (51!)]
simplifying: (56 * 55 * 54 *53 * 52) / 5!
rewriting: (56 * 55 * 54 *53 * 52) / (5 * 4 * 3 * 2 * 1)
multiplying: 458377920 / 120
simplifying: 3819816
Interpretation: 3819816 distinct number combinations can be created by choosing 5 random numbers between 1 and 56

Now for the mega number: 46C1
46! / [1! (46-1)!]
46! / 45!
(46 * 45!) / 45!
46

46 * 3819816 = 175711536
Probability of winning the lotto with 1 entry? 1 / 175711536 = 5.69 x 10^-9 or 0.0000000056911. Making that a percentage: 0.0000000056911 * 100 = 0.00000056911% chance of winning.
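
The lotto arithmetic is easy to verify with Python's built-in `math.comb`, which computes nCr directly. A minimal sketch (variable names are illustrative only):

```python
from math import comb

ordered_draws = 56 * 55 * 54 * 53 * 52  # counts every ordering separately
main_numbers = comb(56, 5)              # 5 numbers from 56, order ignored
mega_number = comb(46, 1)               # 1 mega number from 46

tickets = main_numbers * mega_number    # distinct possible tickets
p_win = 1 / tickets                     # chance of winning with one entry

print(ordered_draws, main_numbers, tickets, p_win)
```

Dividing the ordered count by 5! gives the same value as `comb(56, 5)`, which is exactly the "eliminate the duplicates" step.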


Brandon's Snack Example p.247
5 carrot sticks, 4 celery sticks, 2 cherry tomatoes. How many different ways can Brandon arrange his snack?
Is order important? Yes, therefore it's a permutation.
n! / (n1! n2! n3!), where n is the total number of items and n1, n2, n3 are the number of items in each category

11! / (5! 4! 2!)
(11 * 10 * 9 * 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1) / [(5 * 4 * 3 * 2 * 1)(4 * 3 * 2 * 1) (2 * 1)]
Simplifying:  (11 * 10 * 9 * 8 * 7 * 6) / [(4 * 3 * 2 * 1) (2 * 1)]
Multiplying: 332640 / 48
Simplifying: 6930
6930 different ways exist for Brandon to arrange his snack. At one arrangement per day, that's enough to keep him busy for nearly 19 years.
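
Here is a small Python sketch of the same "permutation with repeated items" calculation. The function name `arrangements` is made up for this example.

```python
from math import factorial


def arrangements(*counts):
    """Distinct orderings of items where each category's items are identical."""
    total = factorial(sum(counts))  # n! for all items together
    for c in counts:
        total //= factorial(c)      # divide out each category's duplicates
    return total


snack = arrangements(5, 4, 2)  # carrots, celery, cherry tomatoes
print(snack, round(snack / 365, 1))
```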


Student Project example
35 students in the class being paired in groups of two. How many pairs exist?

First, does order matter? No, therefore it's a combination.

35C2
35! / (2! (35-2)!)
(35 * 34 * 33!) / (2! (33!))
35 * 34 / 2!
35 * 34 / 2
1190/2 = 595
595 combinations of students.
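
To see why order doesn't matter here, you can have Python actually list every pair. In this sketch the students are simply labeled 1 through 35 (an assumption for illustration); `itertools.combinations` never produces both (1, 2) and (2, 1).

```python
from itertools import combinations
from math import comb

students = range(1, 36)  # 35 students, labeled 1..35 for illustration
pairs = list(combinations(students, 2))

print(len(pairs), comb(35, 2))
```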

5.2) Combining Events

Union (∪) - "or" -  The set of elements that is in the first set “or” the second set
Intersection (∩) - "and" - The set of elements that are in the first set “and” the second set.
Video lecture on Unions and Intersections
Video lecture on Unions and Intersections with Venn Diagrams (helps visualize the concept)

Addition rule - For "or" situations

p(A∪B) = p(A) + p(B) - p(A∩B)
Note: the subtraction of p(A∩B) removes the intersection so elements are not counted twice.

Cards example
Probability of pulling an ace [p(A)]? 4/52 (0.0769)
Probability of pulling a heart [p(H)]? 13/52 (0.25)
Probability of pulling an ace AND a heart [p(A∩H)]? 1/52 (0.0192)
Probability of pulling an ace OR a heart [p(A∪H)]? (4/52) + (13/52) - (1/52) = 16/52 (0.308)
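
The addition rule can be checked by brute force: build a 52-card deck and count. This is just a sketch; the rank and suit labels are arbitrary strings chosen for the example.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

p_ace = Fraction(sum(r == "A" for r, s in deck), len(deck))
p_heart = Fraction(sum(s == "hearts" for r, s in deck), len(deck))
p_both = Fraction(sum(r == "A" and s == "hearts" for r, s in deck), len(deck))

# Addition rule versus counting the union directly
p_union_rule = p_ace + p_heart - p_both
p_union_count = Fraction(sum(r == "A" or s == "hearts" for r, s in deck), len(deck))

print(p_union_rule, p_union_count, float(p_union_rule))
```

Without subtracting `p_both`, the ace of hearts would be counted twice and the total would come out to 17/52 instead of 16/52.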

Gender and self-reported physical appearance (example 5.13 on p. 220)
Probability of female [p(f)] = 28865 / 52877 or 0.5459
Probability of self report "attractive" [p("attractive")] = 28635 / 52877 or 0.542
Probability of female AND self reporting "attractive" [p(f∩"attractive" )] = 16181 / 52877 or 0.306
Probability of female OR self reporting "attractive" [p(f∪"attractive" )] = (28865 / 52877) + (28635 / 52877) - (16181 / 52877) = 41319 / 52877 = 0.7814


Conditional Probabilities - What if you already know something? Suppose you passed the prerequisite for this class with an A, does that mean you will pass this class with an A?

The probability of B given that A has already occurred can be written: p(B|A).
How do you find p(B|A)? p(B|A) = [p (A∩B)] / p(A)

Probability of responding to a direct mail marketing campaign example 5.16
p(responded) = 48 / 288 = .167
p(responded | credit card on file) = [p(responded ∩ credit card on file)] / p(credit card on file) = 31/110 = 0.282
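
The conditional probability formula can be sketched with exact fractions. This assumes, reading the 31/110 above, that 110 of the 288 customers had a credit card on file and 31 of those responded; the variable names are illustrative.

```python
from fractions import Fraction

total = 288               # customers in the campaign
responded = 48            # customers who responded
card_on_file = 110        # assumed: customers with a credit card on file
responded_and_card = 31   # assumed: responders who had a card on file

p_responded = Fraction(responded, total)
p_card = Fraction(card_on_file, total)
p_both = Fraction(responded_and_card, total)

# p(B|A) = p(A and B) / p(A)
p_responded_given_card = p_both / p_card

print(float(p_responded), float(p_responded_given_card))
```

The total of 288 cancels out in the division, which is why the answer reduces to 31/110.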