There are a ton of places online to find business-related data. Here are some examples:
Kaggle: Kaggle hosts machine learning competitions. A byproduct of these competitions is a host of fascinating datasets that are generally free and open to the public. The datasets are sometimes accompanied by suggestions for how to analyze them. Here are a few examples:
gather() or pivot_longer(), since there are individual columns of deposits per year.
Quandl: Nasdaq Data Link (formerly Quandl) provides some datasets related to finance and economics for free (though many require a paid subscription). The datasets are listed in this data catalog.
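For the wide-format case mentioned above (one column of deposits per year), a reshaping step might look like the sketch below. The table, bank names, column names, and values are all made up for illustration; only pivot_longer() and its arguments come from tidyr.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per bank, one deposits column per year
deposits_wide <- tibble(
  bank = c("Bank A", "Bank B"),
  deposits_2019 = c(120, 80),
  deposits_2020 = c(135, 90),
  deposits_2021 = c(150, 85)
)

# Stack the yearly columns into one row per bank-year
deposits_long <- pivot_longer(
  deposits_wide,
  cols = starts_with("deposits_"),
  names_to = "year",
  names_prefix = "deposits_",                # drop the prefix, keep the year
  names_transform = list(year = as.integer), # store year as an integer
  values_to = "deposits"
)

deposits_long
```

The long format has one row per bank and year, which is usually what ggplot2 and dplyr summaries expect.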
FRED: Economic Data from the St. Louis Fed.
tidyquant: Use tq_get() to retrieve quantitative data in a tibble format. This package can also be used to get data from FRED, Quandl, and other sources.
Twitter: You could collect data on tweets that reference a specific company via the Twitter API or by using a package like twitteR to access the API. You could then use this on its own or in conjunction with other data (e.g., stock prices/returns) to analyze consumer sentiment, for example.
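If you want the stock-price or FRED side of an analysis like this, the tq_get() calls could be wrapped roughly as follows. This is a minimal sketch, assuming tidyquant is installed and you have an internet connection; the helper names get_stock_prices() and get_fred_series() are hypothetical, and the ticker, series ID, and dates in the comments are only illustrations.

```r
# Requires the tidyquant package; the calls below fetch data over the internet.

# Daily stock prices for one ticker between two dates
get_stock_prices <- function(symbol, from, to) {
  tidyquant::tq_get(symbol, get = "stock.prices", from = from, to = to)
}

# A FRED series, identified by its FRED series ID
get_fred_series <- function(series_id, from) {
  tidyquant::tq_get(series_id, get = "economic.data", from = from)
}

# Example calls (illustrative ticker, series ID, and dates):
# prices <- get_stock_prices("AAPL", from = "2022-01-01", to = "2022-12-31")
# unrate <- get_fred_series("UNRATE", from = "2015-01-01")
```

Both helpers return a tibble, so the results can be joined by date with other data you collect.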
Inside Airbnb: Inside Airbnb is a mission-driven activist project with the objective to provide data that quantifies the impact of short-term rentals on housing and residential communities, as well as create a platform to support advocacy for policies to protect our cities from the impacts of short-term rentals. They post data for many different cities. Please note these community guidelines:
- Only take the data you need
- Do not scrape data from the site
- Only download the data once. Do not write scripts that download the data every time they are executed.
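One way to follow the "download once" guideline is to check for a local copy before downloading. The helper below is a sketch in base R; download_once() is a hypothetical name, and the URL in the comment is a placeholder, not a real Inside Airbnb link.

```r
# Download a file only if a local copy does not already exist.
# Re-running a script that calls this will reuse the saved file.
download_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # Create the target folder if needed, then fetch the file
    dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Example call (placeholder URL; copy the real link from the site's data page):
# listings_path <- download_once("https://example.com/listings.csv.gz",
#                                "data/listings.csv.gz")
```

Because the function returns the local path, the rest of your script can read from that path without caring whether the file was just downloaded or already there.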
Yelp Open Dataset: A subset of Yelp businesses, reviews, and user data for use in personal, educational, and academic purposes. The data is in JSON format (which you can read into R with jsonlite). Feel free to come to office hours for help.
Bank Marketing Data: Data on direct marketing campaigns and whether they succeeded in convincing customers to make a deposit.
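For the Yelp files, reading JSON with jsonlite might look like the sketch below. It assumes the files are newline-delimited JSON (one record per line), which is worth checking against the dataset's documentation; the two records and their field names here are invented so the example runs without the real data.

```r
library(jsonlite)

# Two made-up records in newline-delimited JSON (one object per line)
tmp <- tempfile(fileext = ".json")
writeLines(c(
  '{"business_id": "a1", "name": "Cafe One", "stars": 4.5}',
  '{"business_id": "b2", "name": "Diner Two", "stars": 3.0}'
), tmp)

# stream_in() reads one JSON object per line and returns a data frame
businesses <- stream_in(file(tmp), verbose = FALSE)

businesses
```

If a file turns out to be a single JSON array rather than one object per line, fromJSON() is the function to try instead.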
Data from Intro to Statistical Learning: The ISLR package contains several interesting (if dated and/or simulated) datasets:
- ISLR::Caravan contains 5,822 real records on whether customers purchased a caravan insurance policy, along with product ownership information and sociodemographic data based on their zip codes.
- ISLR::Default is a simulated dataset of 10,000 credit card customers that can be used to predict which customers will default on their credit card debt.
- ISLR::OJ contains 1,070 purchases of orange juice for two brands, along with customer and product characteristics (like prices) that could be used to study pricing strategies.
Data is Plural Newsletter: Jeremy Singer-Vine sends a weekly newsletter of the most interesting public datasets he’s found. He also has an archive of all the datasets he’s highlighted.
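To get started with the ISLR datasets mentioned above, something like the following could work. It is a minimal sketch, assuming the ISLR package is installed; the logistic regression is just one possible first model for the Default data, not a recommended analysis.

```r
library(ISLR)

# Default has one row per simulated credit card customer
str(Default)

# Logistic regression of default status on balance and income
fit <- glm(default ~ balance + income, data = Default, family = binomial)
summary(fit)
```

The same pattern (load the package, inspect with str(), fit a simple model) applies to Caravan and OJ.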
Google Dataset Search: Google indexes thousands of public datasets; search for them here.