Data Day Project 3
TL;DR
Pick a dataset, any dataset…
…and do something with it. That is your Data Day 3 assignment in a nutshell. More details below.
The long version!
For Data Day 3 you should carry out analysis on a dataset (or multiple datasets) of your own choosing. The goal is for you to demonstrate proficiency in the techniques we have covered in class by applying them to a new dataset in a meaningful way.
Specifically, your project should show that you can ask meaningful questions and answer them using data analysis techniques, that you can use R proficiently, and that you can interpret and present the results. You do not have to apply every procedure we’ve covered (and you can use techniques we haven’t officially covered in class).
The project is very open-ended. There is no limit on what tools or packages you may use, but I expect that you will make heavy use of the packages we covered in class, particularly the tidyverse packages.
Exploratory data analysis components
As part of your work, you should:
Summarize your dataset using R functions, and explain the summary in your own words.
Explore the variables in your dataset that you will use in your project. Generate relevant summary statistics and visualizations.
Create compelling visualization(s) of your data in R. A single high-quality visualization may be acceptable (with approval), but it is more likely that you will generate several small visualizations. Pay attention, as well, to your narrative presentation. Neatness, coherency, and clarity will count.
- Regardless of what you submit as your final visualizations, you should expect to create many visualizations in your Quarto document as you work to understand the dataset and to communicate what you learned about its content.
Create at least two linear regression models to explore the relationships between variables in your dataset.
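As a rough illustration of the components listed above, the sketch below walks through a summary, one visualization, and two linear regression models. It uses the built-in `mtcars` data purely as a stand-in (an assumption for illustration; your project needs its own dataset), and the variable choices and object names are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Stand-in data: mtcars ships with R and is used here only for illustration.
cars <- as_tibble(mtcars, rownames = "model") %>%
  mutate(cyl = factor(cyl))  # treat cylinder count as categorical

# 1. Summarize the dataset: structure first, then grouped summary statistics.
glimpse(cars)
cars_summary <- cars %>%
  group_by(cyl) %>%
  summarize(
    n        = n(),
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg)
  )
cars_summary

# 2. Visualize a relationship of interest.
mpg_plot <- ggplot(cars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")
mpg_plot

# 3. Fit two linear regression models and inspect their summaries.
fit_simple <- lm(mpg ~ wt, data = cars)
fit_multi  <- lm(mpg ~ wt + cyl, data = cars)
summary(fit_simple)
summary(fit_multi)
```

In a Quarto report, each numbered step would typically sit in its own code chunk, followed by your interpretation in your own words.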
All analysis must be done in RStudio, using R, and presented in a report in a Quarto document (the report file is on the Brightspace assignment page for the Data Day 3 report). You will also give a presentation in class that can be done using Quarto slides (in which case the html will be displayed) or Google slides or PowerPoint (in which case your slides must be converted to PDF).
Data
Choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored.
Your dataset should have at least 50 observations and at least 10 variables.
The dataset should include categorical variables, discrete numerical variables, and continuous numerical variables.
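One way to check a candidate dataset against these size and variable-type requirements is sketched below. The `starwars` data bundled with dplyr is only a stand-in, and the object names are hypothetical. Note that storage type is just a hint: an integer column is not always a meaningful discrete variable, and dates are stored as doubles.

```r
library(dplyr)

# Stand-in data: replace `starwars` (bundled with dplyr) with your own dataset.
my_data <- starwars

# At least 50 observations and at least 10 variables?
dim(my_data)
meets_size <- nrow(my_data) >= 50 && ncol(my_data) >= 10

# Count columns by storage type as a first hint at the variable types.
n_categorical <- sum(sapply(my_data, function(col) is.character(col) || is.factor(col)))
n_integer     <- sum(sapply(my_data, is.integer))  # candidate discrete variables
n_double      <- sum(sapply(my_data, is.double))   # candidate continuous variables

c(categorical = n_categorical, integer = n_integer, double = n_double)
```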
It is possible that you will need a few datasets that you combine for elements of your analysis (consider the nycflights13 data, which involves several datasets).
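A minimal sketch of combining two tables on a shared key follows. The toy tables are invented for illustration; their layout loosely mirrors how nycflights13 links `flights` to `airlines` through a `carrier` column.

```r
library(dplyr)

# Toy tables, made up for this example.
trips <- tibble(
  trip_id = 1:4,
  carrier = c("AA", "UA", "AA", "ZZ"),
  delay   = c(5, 12, -3, 20)
)
carriers <- tibble(
  carrier = c("AA", "UA", "DL"),
  name    = c("American", "United", "Delta")
)

# left_join() keeps every row of `trips` and adds the matching carrier name;
# rows without a match get NA in `name`.
trips_named <- left_join(trips, carriers, by = "carrier")
trips_named

# anti_join() returns the rows of `trips` with no match in `carriers`,
# which is worth inspecting before any analysis.
unmatched <- anti_join(trips, carriers, by = "carrier")
unmatched
```

Checking the unmatched rows first helps you decide whether missing values after a join reflect a real gap in the data or a mismatch in how the key is coded.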
If you are using a dataset that comes in a format that we haven’t encountered in class, make sure that you are able to load it into R as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.
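As one hedged example of the kind of format issue that can come up, the sketch below reads a semicolon-delimited file that uses commas as decimal marks. The file contents are made up and written to a temporary path so the example is self-contained; in practice `path` would point at your downloaded file, and your file may need a different reader entirely.

```r
# Write a tiny semicolon-delimited file (invented data) to a temporary path.
path <- tempfile(fileext = ".txt")
writeLines(c("id;score;group", "1;3,5;a", "2;4,25;b", "3;NA;a"), path)

# read.csv2() assumes ';' as the separator and ',' as the decimal mark.
raw <- read.csv2(path, stringsAsFactors = FALSE)

# Always confirm that the columns came in with the types you expect.
str(raw)
stopifnot(is.numeric(raw$score), nrow(raw) == 3)
```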
Do not reuse datasets that we have already used in examples, homework assignments, exams, or labs.
Below is a list of data repositories that might be interesting to browse. You are not limited to these resources, and you’re encouraged to venture beyond them, particularly if none of the datasets below align with your area of interest.
- FiveThirtyEight
- TidyTuesday
- NHS Scotland Open Data
- Edinburgh Open Data
- Open access to Scotland’s official statistics
- Bikeshare data portal
- UK Gov Data
- Kaggle datasets
- OpenIntro datasets
- Awesome public datasets
- Youth Risk Behavior Surveillance System (YRBSS)
- PRISM Data Archive Project
- Harvard Dataverse
- If you know of others, let me know, and I’ll add them here…
Deliverables
- Report – Your final RStudio Project file with your Quarto document - due May 23rd, 11:59 PM – submit a .zip file containing:
- the .qmd file,
- rendered .html (troubleshoot until the file renders)
- along with your data
- Presentation - slides due May 12th, 1:00 PM; presentations will be May 12th and 13th (as needed) – submit either
- Option 1: a PDF of PowerPoint slides or Google slides, or
- Option 2: a .zip file containing the slides.qmd file, slides.html file, slides_files folder, libs folder, data folder, etc. A folder with a Quarto slides template can be found here (link).
Project Document
The document available on the Brightspace assignment page, DataDay3-Report-SP25.qmd, gives you a sketch of what should be in your project document, including introduction, data information, and data analysis.
Introduce your purpose and research question(s).
Describe your data. Include a summary of its characteristics.
Present your data analysis, including code, interpretations, and a summary of your findings.
Presentations
Each person will give a presentation of no more than 10 minutes. You should have a slide deck built using PowerPoint, Google Slides, or Quarto. In the presentation, you will lay out the question(s) you wanted to answer and your initial hypothesis about the answer(s); describe the dataset(s) you chose and why, including any steps needed to clean and combine them; explain your analysis approach and why you chose those analysis steps; and discuss what you learned, including the answers to the questions you started out with. All slide files must be turned in on the night before Data Day 3, and everyone should be ready to present during class time. We will do as many presentations as possible that day and the remainder on the following day. The presentation order will be randomly determined.