Lecture 05 - The Distribution of Random Variables

Key Topics

Binomial distribution, Gaussian distribution, Hypothesis testing, ggplot2, moments, Normality testing, nortest, Poisson distribution, R, Random variables, Statistical significance, stats

Resources

Lecture, Equations, Functions, Lab 04, Lab 04 Replication, Problem Set 02

Lecture Slides

Grappling with p-values

Here are a couple of links for digging further into what p-values are and how we explain them:

American Statistical Association’s Statement on p-values
“Mission Improbable: A Concise and Precise Definition of P-Value” - Science Magazine interview with Victor De Gruttola
“Science Isn’t Broken” - FiveThirtyEight article on p-values
“Not Even Scientists Can Easily Explain P-values” - FiveThirtyEight article on p-values, includes the video I mentioned in class
Andrew Gelman’s views on p-values
“Abandon Statistical Significance” - Andrew Gelman and others
Gelman and John Carlin suggest some solutions for the “p-value communication problem”
“It will be much harder to call new findings ‘significant’ if this team gets its way” - Science Magazine article on changing culture around p-values

Histograms with the Normal Distribution Overlaid in `ggplot2`

I introduced the idea of a histogram with a normal distribution overlaid during this week’s lecture, but purposely did not include the syntax on our quick reference sheet for R functions. If you are interested in playing with the syntax, it is included below. We make one key change to the initial ggplot() call: the aesthetic mapping is included in the initial ggplot() function rather than in a specific geom. We also convert the hwy variable’s representation from frequency to density, and layer our normal distribution on top of this using the stat_function() function. Make sure to update the arguments for stat_function() if you want to adjust this example, and note that inside args we have to specify the data frame and the variable name separated by a $ (for example, mpg$hwy).

library(ggplot2)

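# map hwy once in ggplot() so every layer shares it
ggplot(data = mpg, mapping = aes(x = hwy)) +
  # plot density rather than counts (after_stat() requires ggplot2 3.3.0 or later)
  geom_histogram(mapping = aes(y = after_stat(density))) +
  # overlay a normal curve with the sample mean and standard deviation of hwy
  stat_function(fun = dnorm, color = "red",
    args = list(mean = mean(mpg$hwy), sd = sd(mpg$hwy)))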
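
To adapt the example to a different variable, the variable name has to change in both the aes() mapping and the args list. Here is a minimal sketch using cty (city mileage, another column of the mpg data). It assumes ggplot2 3.3.0 or later, where after_stat(density) is the current spelling of ..density..; the bins value and the object name p are illustrative choices, not part of the lecture syntax:

```r
library(ggplot2)

# store the plot in an object so it can be printed or saved later
p <- ggplot(data = mpg, mapping = aes(x = cty)) +
  # density on the y axis keeps the bars on the same scale as the curve
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 20) +
  # the normal curve uses the mean and standard deviation of cty itself
  stat_function(fun = dnorm, color = "red",
    args = list(mean = mean(mpg$cty), sd = sd(mpg$cty)))

# draw the plot
print(p)
```

If the mean and sd arguments are left pointing at hwy, the curve will not line up with the cty histogram, which is an easy mistake to make when copying the syntax.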

Performing Calculations

During class this week, I briefly described how to perform calculations and save results in R. You can use the standard mathematical operators +, -, /, and * to add, subtract, divide, and multiply (respectively). These are reviewed in the week-02-lecture-03-rQuickref.pdf file in the Week-02 repo on GitHub. For example, we can calculate a value like lambda within R, save it as an object, and reference the object in later calculations:

> lambda <- 100000 * .00004
> dpois(6, lambda = lambda)
[1] 0.1041956
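
The same result can be checked by hand with arithmetic operators. This sketch assumes the usual Poisson probability formula, P(X = k) = e^(-lambda) * lambda^k / k!, and uses the ^ operator for exponents along with the base R functions exp() and factorial(); the object names k and by_hand are illustrative:

```r
# lambda from the example above: 100000 trials with probability 0.00004 each
lambda <- 100000 * 0.00004

# number of events we want the probability of
k <- 6

# Poisson probability written out with arithmetic operators
by_hand <- exp(-lambda) * lambda^k / factorial(k)

# compare against R's built-in Poisson density function
by_hand
dpois(k, lambda = lambda)
all.equal(by_hand, dpois(k, lambda = lambda))
```

Both by_hand and dpois() should print the same value shown above, and all.equal() should return TRUE.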

Reading in Arbitrary Vectors to `R`

Problem Set 04 asks you to find the variance of an arbitrary set of values in a vector (list) of numbers. These can easily be read into R so that you do not have to calculate the variance by hand. You can use the same technique to check your work as you calculate skew and kurtosis by hand.

To begin: Let x = 2, 4, 6, 8, 10

x <- c(2, 4, 6, 8, 10)

Once you add these values into R’s memory, you will be able to see them in your environment tab and will be able to include them in calculations:

> mean(x)
[1] 6
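
Here is a sketch of how the hand calculations can be checked for this vector. var() and sd() are base R functions that use the sample (n - 1) denominator. Base R has no built-in skew or kurtosis function, so the last lines compute them from central moments; that moment-based definition is an assumption here, and the formulas used in class may apply a different sample-size adjustment. The object names n, dev, m2, m3, m4, skew, and kurt are illustrative:

```r
x <- c(2, 4, 6, 8, 10)
n <- length(x)

# sample variance and standard deviation (n - 1 in the denominator)
var(x)
sd(x)

# the same variance written out step by step
dev <- x - mean(x)        # deviations from the mean
sum(dev^2) / (n - 1)

# central moments (n in the denominator)
m2 <- sum(dev^2) / n
m3 <- sum(dev^3) / n
m4 <- sum(dev^4) / n

# moment-based skew and kurtosis
skew <- m3 / m2^(3/2)
kurt <- m4 / m2^2
skew
kurt
```

For this vector, var(x) is 10, the skew is 0 because the values are symmetric around the mean, and the moment-based kurtosis is 1.7.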

Extra Information

This week, I mentioned a number of important statisticians. If you want more information, you can check out these Wikipedia pages:

Additionally, here is the link to the quincunx simulation we talked about in class!