Course Preview
Meta
Key Topics
Analysis development, GitHub, R, RMarkdown, Reproducibility, Version Control, Workflow
Overview
To set the stage for this semester, please watch the two videos below. These will take approximately 45 minutes to view. Once you have finished these videos, follow the link at the bottom of the page to answer a few short questions about what you saw. Before you begin, please make sure you have completed the course onboarding process and Lecture Prep 01! You won’t need any of the course software for this Course Preview, so there is no need to install anything at this stage.
Plain Text Science
To begin, read the introductory “chapter” (it is quite short!) of Duke University sociologist Kieran Healy’s primer The Plain Person’s Guide to Plain Text Social Science. This gives a great taste of some of the tools we’ll be using this semester (like RMarkdown) as well as some of the ideas we’ll be thinking about (like reproducibility).
Once you’ve read Kieran’s intro, please watch the first few sections of a recent talk by Hadley Wickham. He is the Chief Scientist for RStudio and developer of the tidyverse, which is a family of packages we’ll be using this semester. In fact, the original name of this family of packages was the hadleyverse. Hadley is also the author of R for Data Science, one of the books we’ll be reading this semester. In this video, Hadley speaks about why using a programming language to express yourself for data science and statistical work can be so valuable. Please watch up to 18:30, when the “Why Program in R?” slide comes up.
For the uninitiated, here is a description of what a GUI (graphical user interface) is.
Here are links to the two GitHub repositories Hadley referenced during the talk:
Many (but not all) of you will have experienced some parts of these processes before. Perhaps you’ve used Microsoft Excel to organize some information or used SPSS to analyze some quantitative data. We won’t be using those tools. Instead, this course will emphasize tools that support reproducible, accurate, and collaborative data analysis. Throughout the semester, we’ll discuss why these tools are important and the advantages they have over the alternatives.
The tools we use are also incredibly flexible and therefore powerful. R and Markdown, for example, were used exclusively to create this website and the Syllabus. Both websites are hosted using GitHub. Learning these tools therefore opens up doors not only for managing data and data analyses, but also for communicating your findings.
Analysis Development
The workflow that Hadley introduced in the first video, and the idea that you cannot do data science in a GUI, is opinionated: a strong premise underlies the workflow about the ways in which spatial data (and data more generally) should be obtained, stored, modified, and mapped. Hilary Parker is a data scientist at Stitch Fix and also co-hosts a data science podcast called Not So Standard Deviations. She has been speaking recently about an idea called opinionated analysis development. The video linked to below is a 25-minute talk she gave on this idea last year, and she now has a draft paper out on the topic as well. Our workflow for this semester is closely linked to the ideas she discusses in this talk.
Inspired by Hilary’s idea of opinionated analysis development, our goal each week will be to focus on the processes that can be used to increase the reproducibility and accuracy of our statistical work.
Lecture Prep 02
The lecture prep for the first week asks four follow-up questions about the reading and videos. Please answer these questions and submit them before class on August 27th. Answers must be submitted through Google Forms, and each response should be three to four sentences in length. The questions are provided here for reference:
- Based on the typology presented by Kieran Healy, are you an “office type” or an “engineering type”? Why?
- Hadley Wickham introduces the idea of a workflow. In your own words, describe what you think a workflow is.
- What are some of the advantages of using plain text for a data analysis that Hadley and Kieran cite?
- In your own words, what are the key aspects of opinionated analysis development as described by Hilary Parker?