My favorite source for free, interdisciplinary data

During my time in data science bootcamp, one of the struggles I faced was sourcing interesting datasets. There are lots of places to find free data: Kaggle, the Awesome Datasets GitHub repository, UCI’s Machine Learning Repository. A month into my program I also signed up for Jeremy Singer-Vine’s Data is Plural newsletter, which deposits 4–5 unique datasets into my email inbox weekly. So finding data for my projects should’ve been a cinch, yet I still found myself underwhelmed by the types of data I was finding.

While brainstorming ideas for projects, I would often end up on one of the various municipal or state public data portals. These portals offer tons of data that you would expect to see: zoning data, crime statistics, financial records. To an urban planning enthusiast like myself, these datasets are pretty interesting in and of themselves. However, combing through these public data portals often yielded some unexpected and surprisingly interdisciplinary results. Here are a handful of the datasets that I discovered while sifting through public city data.

  • Philadelphia’s public dataset portal contains an aerial photography dataset perfect for image classification and recognition tasks. This particular dataset has images of modern Philadelphia, with yearly entries starting in the early 2000s, but also contains historical images dating back to the 1800s.
  • The Seattle public data portal has a handful of interesting datasets, but my personal favorite is the public library’s Checkouts by Title. This dataset tabulates how many times each title was checked out each month, starting in 2017. There are millions of rows representing all types of media (including ebooks). This robust dataset offers interesting potential for time-series analysis of interest in library materials.
  • If you’re interested in environmental studies, the Florida state public data portal is built around GIS data and has an ecology slant. It offers a wealth of data on natural water sources, native species, and weather patterns. I’d be thrilled to see an analysis (from someone more biologically savvy than me) of the dataset tracking the health of Florida reef ecosystems, with data going all the way back to 1996. If you’re looking for a project to flex your map visualization skills, there will certainly be something for you here.
  • By far the most meaningful dataset I’ve discovered so far is Nashville’s Enslaved and Free People of Color Database, which contains genealogical data collected from pre-emancipation primary source material. Many descendants of enslaved people rely on such primary source material to trace their ancestry, and it is rare to find these records collected and preserved in a database.

These are just my favorites from the time I’ve spent crawling these portals for datasets suitable for my bootcamp projects. Almost every major city has an open data portal, so I would encourage you to do a search for an area that you’re familiar with to see what kind of data they offer. If you, like me, are interested in public data or urban planning issues overall, here are some of the major city data portals to get you started.

  • New York, NY
  • Los Angeles, CA
  • Chicago, IL
  • State of Texas
  • Phoenix, AZ
  • Philadelphia, PA
  • Seattle, WA
  • Nashville, TN
  • State of Florida
  • Detroit, MI

A word of caution: one of the tricky issues with using public data is that it can often be outdated or incomplete. If you’re interested in using data from one of these public portals, it’s best to check how recent and how complete it is before building a project on it. Another limitation of this collection is that it covers only US cities and states. I’m sure there are many more interesting public datasets for cities outside the US.
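As a rough illustration of that kind of double-check, here is a minimal Python sketch that reports the share of empty cells in each column and the most recent date in a date column. Everything in it is an assumption for demonstration purposes: the `summarize_quality` helper, the column names, and the sample rows are made up and do not come from any of the portals above. It also assumes dates are stored in ISO format (YYYY-MM-DD), which real exports may not follow.

```python
import csv
import io
from datetime import date


def summarize_quality(csv_text, date_column):
    """Return (share of empty cells per column, latest date in date_column)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}, None

    # Share of rows where each column is blank (a rough completeness check).
    missing_share = {}
    for column in rows[0].keys():
        empty = sum(1 for row in rows if not (row[column] or "").strip())
        missing_share[column] = empty / len(rows)

    # Most recent non-blank date (a rough freshness check).
    dates = []
    for row in rows:
        value = (row[date_column] or "").strip()
        if value:
            dates.append(date.fromisoformat(value))
    latest = max(dates) if dates else None

    return missing_share, latest


# Invented sample standing in for a CSV export from a data portal.
SAMPLE = """permit_id,issued_date,district
1001,2019-03-14,North
1002,2019-04-02,
1003,,South
1004,2018-11-30,North
"""

missing_share, latest = summarize_quality(SAMPLE, "issued_date")
print(missing_share)
print(latest)
```

If the latest date is years old, or a column you care about is mostly blank, that is worth knowing before you commit to a project idea.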

I hope that these datasets are interesting and useful to look through! One of the many reasons that I enjoy data science is the diversity of interest and knowledge that my colleagues bring to the table, and I hope that this post can connect data science folks to data that interests them most.

Carly Tsuda
