As data scientists, we care about both the quality and the quantity of data, because both affect model results. Drawing on different data sources helps us build a sound understanding of the business. Whether you are working on a personal project, learning the concepts, or working with datasets for your company, the primary focus is data acquisition and data understanding. Your data should carry as much relevant information as possible so that the analysis is meaningful.
In this article, we will look at 40+ different places to find free datasets for data science projects. We will discuss the different types of datasets in data science, covering disciplines like data visualization, data processing, machine learning, data cleaning, exploratory data analysis, natural language processing, and computer vision. If you are looking to explore the field of data science and learn how to turn your data into insights, check out one of the Best Data Science certifications online that will help you sharpen your skills with data science training from more than 650 expert trainers.
What is a Data Science Dataset?
A dataset is a collection of organized or unorganized data points consisting of related information. The most common format is tabular data, represented as rows and columns and most often stored as CSV files. For computer vision applications, the dataset is a collection of images.
A dataset is a sample that is assumed to be drawn from a population. It is also assumed to be a good, unbiased representation of that population. For example, the sales figures of a company are a dataset. Your image collection in Google Photos, the reviews found on e-commerce platforms, etc., are also examples of datasets. These datasets can be used in the field of data science to analyze and gain information.
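To make the rows-and-columns idea concrete, here is a minimal sketch that reads a small CSV into rows keyed by column name. The sample data and the `load_rows` helper are hypothetical and only stand in for a file you would download from one of the sources below; it uses Python's standard `csv` module.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """name,age,city
Asha,34,Pune
Ben,29,Leeds
Chen,41,Austin
"""

def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))

rows = load_rows(SAMPLE_CSV)
columns = list(rows[0].keys())

print(columns)    # column names taken from the header line
print(len(rows))  # number of data rows (header excluded)
```

Note that every value comes back as a string, so numeric columns such as `age` still need converting before analysis.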
Types of Datasets
Datasets can be public or private, depending on their source. The use of public datasets contributes significantly to research and development.
In addition, datasets can be classified by the type of information they contain:
- Multivariate: Data containing multiple variables.
- Categorical: These datasets describe qualities or characteristics that fall into distinct groups, such as gender or marital status.
- Numerical: These datasets measure data in terms of numbers, such as age, height, etc.
- Correlation: Here, data points are interrelated.
- File Based: In this case, datasets are stored in files.
- Bivariate: A dataset with two variables and a relationship between them.
- Web Dataset: A collection of data from one or more similar internet portals.
- Database: Such datasets store data in tables, columns, and rows.
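A few of the types listed above can be told apart programmatically. The sketch below is only an illustration under simple assumptions: `column_kind` is a hypothetical helper that calls a column numerical when every value parses as a number, and `pearson` measures the relationship between the two variables of a bivariate dataset. The height and weight figures are made up.

```python
import math

def column_kind(values):
    """Label a column 'numerical' if every value parses as a float, else 'categorical'."""
    try:
        for v in values:
            float(v)
        return "numerical"
    except ValueError:
        return "categorical"

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical bivariate data: height in cm and weight in kg.
heights = [150.0, 160.0, 170.0, 180.0]
weights = [50.0, 58.0, 66.0, 74.0]

print(column_kind(["34", "29", "41"]))
print(column_kind(["Pune", "Leeds"]))
print(round(pearson(heights, weights), 3))
```

A coefficient near 1 or -1 suggests a strong linear relationship between the two variables, while a value near 0 suggests little or none.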
Datasets for Data Science Project: Public Data Sources
Public data sources come in various forms. International bodies publish worldwide datasets: the International Monetary Fund (IMF) and The World Bank provide finance-related data, and the World Health Organization (WHO) provides health-related data. Governments make datasets available for their respective countries, and other institutions or individuals publish datasets they have collected. We can use these datasets for learning and for building solutions for non-commercial purposes. Below are some of the public datasets for data science.
1. US Government Dataset
Vivek Kundra, the Federal Chief Information Officer of the United States, unveiled the Data.gov website at the end of May 2009. The goal of Data.gov is to improve public access to valuable, machine-readable datasets produced by the federal government's executive branch. It is an open data site of nearly 300,000 datasets managed by the US Technology Transformation Service and General Services Administration.
Link to Dataset
2. Open Government Data (OGD) Platform India
The National Informatics Centre (NIC), a leading ICT entity of the Government of India operating under the direction of the Ministry of Electronics & Information Technology, created and hosts the Open Government Data (OGD) Portal. The goal of the Open Government Data Platform India is to make it easier for people to access shareable government data and information, encouraging wider use of that data and maximizing its potential for national development. It contains datasets from multiple sectors like Health, Sports, Judiciary, Biotechnology, Travel, etc.
Link to Dataset
3. The World Bank Open Data
The World Bank Group, an international alliance that addresses poverty worldwide with sustainable solutions, has 189 member nations. The World Bank acts as an international financial institution: governments of low- and middle-income countries can apply to it for loans and grants to pursue capital projects. The World Bank Open Data provides free and direct access to global development data such as governance indicators, food price inflation estimates by country, child mortality, women in education, access to electricity, climate change, extreme poverty, etc.
Link to Dataset
4. Data.world
Data.world is built on the idea that bringing together people with different perspectives and skills to collaborate on datasets from around the world can make data of all kinds far more widely available. Data.world has built a social network for data people. We can easily find the data we need or add new data and share it with the community using the platform. Their community can help clean the data, add annotations, scripts, and visualizations or just upvote and discuss it.
Link to Dataset
Datasets for Data Science Project: Data Visualization
Data visualization is the graphical representation of information and data. By using visual components like charts, graphs, and maps, data visualization tools offer a simple way to spot and understand trends, outliers, and patterns in data. In this section, we will cover some of the places where you can find datasets for creating data visualization projects, whether you are a beginner or working at an intermediate or advanced level.
1. BFI - Industry Data and Insights
The British Film Institute (BFI) is a nonprofit organization dedicated to the advancement and preservation of British film and television. The BFI National Archive houses one of the most significant film and television collections in the world. The BFI offers free research and market intelligence on the UK film industry and other screen industries. It produces official statistics and specialized research all year long, in addition to publishing the BFI Statistical Yearbook every year. You can use data such as weekend box office figures and statistics on the UK film economy to create insightful dashboards or visualizations.
Link to Dataset
2. The Humanitarian Data Exchange (HDX)
To coordinate the worldwide emergency response to save lives and protect people in humanitarian crises, the General Assembly of the United Nations established the United Nations Office for the Coordination of Humanitarian Affairs (OCHA) in December 1991. The organization provides a Humanitarian Data Exchange (HDX) portal to find, share and use humanitarian data. As of this writing, it hosts 19,848 datasets covering 254 locations from 1,818 sources. This vast store is a good source of free datasets for data science projects involving data visualization.
Link to Dataset
3. Data at World Health Organization (WHO)
The United Nations has a dedicated agency for worldwide public health called the World Health Organization. A collection of datasets for global health data is provided by the WHO's World Health Data Hub. It offers complete solutions to gather, store, analyze, and exchange fast, accurate, and useful data. Data related to the pandemic, mortality data, global health estimates, immunization data, clinical trials, etc. can be found as part of their database.
Link to Dataset
4. FBI’s Crime Data Explorer
The FBI's Crime Data Explorer (CDE) intends to increase awareness of the sharing of criminal and noncriminal law enforcement data, increase transparency around it, improve law enforcement's accountability, and lay the groundwork for public policy that will make the country safer. Using the CDE, you can either view these datasets through visualizations or download them for creating your custom visualizations.
Link to Dataset
Datasets for Data Science Project: Data Processing
Data processing is the manipulation of data by carrying out operations such as retrieval of information or transformation of data points. A machine is not capable of understanding the data in raw formats or even text representations. It becomes important to transform this data into a machine-readable format so that the AI or ML models can be trained on the transformed data. Below are some sources where one can find data science sample datasets to perform data processing operations.
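One common way to turn raw text labels into a machine-readable form is one-hot encoding. The sketch below is an illustration rather than a prescribed method: `one_hot_encode` is a hypothetical helper, and the genre labels are made-up sample data.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1  # mark the slot that belongs to this value
        encoded.append(vec)
    return categories, encoded

# Hypothetical raw column of genre labels.
genres = ["drama", "comedy", "drama", "horror"]
categories, vectors = one_hot_encode(genres)

print(categories)
for genre, vec in zip(genres, vectors):
    print(genre, vec)
```

Each row becomes a numeric vector, which is a form most machine learning models can consume directly.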
1. AWS Open Data Registry
To make it easier for users to find and share datasets made available through AWS resources, the AWS Open Data registry was created. It consists of plenty of datasets collected from various sources, along with some usage examples for the datasets. It also encourages users to add a dataset or example of how to use a dataset in the registry.
Link to Dataset
2. FiveThirtyEight
FiveThirtyEight is an American website that specializes in opinion poll analysis, politics, economics, and sports blogging. It was founded by statistician Nate Silver. Using algorithms and statistical models, Silver and other analysts make forecasts about politics, sports, the economy, and other topics. The site hosts historical datasets related to all these events, which anyone can use to work on their data processing projects.
Link to Dataset
3. IMDb Datasets
Users can access subsets of the IMDb data for non-commercial purposes. Every day, the data is updated by IMDb. It contains datasets providing information for titles, episodes, ratings, cast, etc.
Link to Dataset
Datasets for Data Science Project: Machine Learning
Machine Learning is a subset of Data Science that deals with building predictive models for supervised or unsupervised tasks covering regression, classification, and clustering problems. Most of us look for open datasets for data science to work on machine learning projects. There are tons of resources out there where you can find these datasets to work with. In this section, we have mentioned the five best places where you can find datasets for your machine-learning applications.
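As a small illustration of a supervised regression task, the sketch below fits a straight line to paired data using ordinary least squares. The function names and the training data are hypothetical; in practice you would typically use a library, but the closed-form version shows what "building a predictive model" means in the simplest case.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
print(model)
print(predict(model, 5.0))
```

Real datasets are noisy, so the fitted line will only approximate the data, and you would hold out part of the dataset to check how well the model predicts unseen values.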
1. Kaggle
- Every data science beginner or practitioner must be aware of Kaggle. It is one of those platforms which is popular for finding datasets for a variety of data science applications. One can also upload a dataset on the platform.
- It has a vast community that uses these datasets for data processing, cleaning, and model-building purposes and saves its work publicly on the platform in the form of Jupyter Notebooks. Others can view their work and get inspiration.
- A configurable, no-setup Jupyter Notebooks environment is available on Kaggle. You can access GPUs at no cost, together with a sizable collection of publicly available community data and code. It also has a provision to initiate discussions among community members.
- Kaggle regularly hosts competitions and rewards the community members who submit the best solutions. All the code and data you need to complete your data science work are available inside Kaggle.
- To quickly complete any analysis, you can use over 50,000 accessible datasets and 400,000 public notebooks. With all these features, Kaggle is the first choice for machine learning enthusiasts to find datasets for their research, learning, or implementation.
Link to Dataset
2. UCI Machine Learning Repository
The machine learning community uses the UCI Machine Learning Repository as a collection of databases, domain theories, and data generators for the empirical study of machine learning algorithms. David Aha and fellow graduate students at the University of California, Irvine started it in 1987 as an FTP archive. It has been widely used as a key source of machine learning datasets by students, instructors, and researchers across the world. As a measure of the archive's influence, it has been cited over 1,000 times, making it one of the top 100 most cited "papers" in all of computer science.
Link to Dataset
3. Google Dataset Search
Google Dataset Search is a search engine from Google that helps users find openly accessible datasets. The company introduced the service on September 5, 2018, saying that it was intended for data scientists and data journalists. Users can find datasets housed in thousands of repositories on the internet by performing a quick keyword search.
Link to Dataset
4. Nasdaq Data Link
The Nasdaq Stock Market, an American stock exchange with headquarters in New York City, is the first electronic exchange in the world and an online global marketplace for buying and trading stocks. When you are working on time-series modelling, stock data can be a good place to start. Nasdaq Data Link offers data repositories for core financial data from across regions, including global economic indicators. You can try to forecast inflation or spot a potential multi-bagger stock using the enormous amount of financial data on the platform.
Link to Dataset
5. Recommender Systems and Personalization Datasets
A recommender system is a type of information filtering system that suggests products or content based on what the user is likely to find most useful. Nowadays, recommender systems are in use everywhere. E-commerce websites use them to show relevant suggestions and increase sales, social media platforms use them to suggest relevant content, and advertising firms use them to target the right ads to the right users. You can use this collection to find datasets for learning the sub-field of data science that deals with recommendation systems.
Link to Dataset
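To illustrate the idea behind a recommender system, here is a minimal sketch based on item co-occurrence: items that often appear alongside what a user already has are suggested first. This is one simple technique among many, not the method used by any particular platform. The `recommend` function and the basket data are hypothetical.

```python
from collections import Counter

def recommend(baskets, user_items, top_n=2):
    """Suggest items that most often co-occur with the user's items."""
    owned = set(user_items)
    scores = Counter()
    for basket in baskets:
        items = set(basket)
        overlap = len(items & owned)
        if overlap == 0:
            continue  # this basket shares nothing with the user
        for item in items - owned:
            scores[item] += overlap
    # Sort by score (highest first), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical purchase histories.
baskets = [
    ["laptop", "mouse", "bag"],
    ["laptop", "mouse"],
    ["phone", "case"],
    ["laptop", "bag", "dock"],
]

suggestions = recommend(baskets, ["laptop"])
print(suggestions)
```

Public recommender datasets usually provide exactly this kind of user-item interaction data, only at a much larger scale.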
To learn more about how to use these datasets to create predictive models for your application, you can explore Data Science Online Bootcamp, which helps you wrangle massive data sets. With more than 100 guided hands-on exercises and more than 10 intensive case studies, you will build analytical and programming skills to become a confident data scientist with expert guidance.
Datasets for Data Science Project: Data Cleaning
The practice of correcting or deleting inaccurate, damaged, improperly formatted, duplicate, or incomplete data from a dataset is known as data cleaning. Almost every data science dataset needs to go through a data cleaning stage, so it is important to practice data cleaning before doing any analysis. To learn to clean data efficiently, you need to work on datasets that have plenty of formatting and quality issues. The following are some of the places where you can try your hand at cleaning datasets.
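The cleaning steps described above can be sketched in a few lines. The example below is a minimal illustration, assuming hypothetical records with `name` and `email` fields: it fixes inconsistent formatting, drops incomplete rows, and removes duplicates.

```python
def clean_records(records):
    """Trim whitespace, normalise case, drop incomplete rows and duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Treat missing (None) values as empty strings before normalising.
        name = (rec.get("name") or "").strip().title()
        email = (rec.get("email") or "").strip().lower()
        if not name or not email:
            continue  # incomplete row
        key = (name, email)
        if key in seen:
            continue  # duplicate once formatting is normalised
        seen.add(key)
        cleaned.append({"name": name, "email": email})
    return cleaned

# Hypothetical messy input.
raw = [
    {"name": "  asha rao ", "email": "ASHA@example.com"},
    {"name": "Asha Rao", "email": "asha@example.com "},
    {"name": "Ben", "email": ""},
    {"name": None, "email": "x@example.com"},
    {"name": "chen li", "email": "chen@example.com"},
]

cleaned = clean_records(raw)
for rec in cleaned:
    print(rec)
```

Normalising before comparing matters here: the first two records only turn out to be duplicates once whitespace and letter case are made consistent.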
1. Reddit - Datasets
The datasets community of Reddit consists of more than 164,000 subscribers. It is a place to share, find, and discuss datasets. You can find all kinds of different datasets on this platform. You can even request a dataset type in the community forum or even ask questions to the other members of the community.
Link to Dataset
2. Open Data Network by Socrata
Thousands of datasets from hundreds of open data catalogs can be searched using the Open Data Network, a global search engine from Socrata. The software analyzes datasets and consistently classifies them across catalogs using machine learning.
Link to Dataset
3. Making Noise and Hearing Things by Rachael Tatman
The dataset collection provided by Rachael Tatman in Making Noise and Hearing Things is a curated list of data science practice data sets for data cleaning. Each of these datasets needs to be cleaned and processed before it can be used for further analysis.
Link to Dataset
Datasets for Data Science Project: Exploratory Data Analysis
Exploratory data analysis (EDA) is the process of examining datasets through preliminary analyses to find patterns, identify anomalies, and test hypotheses. If you are looking for datasets for learning data science, then the following resources will prove useful.
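A first pass at exploratory data analysis often means computing summary statistics and flagging values that look unusual. The sketch below uses Python's standard `statistics` module. The helper names, the threshold of two standard deviations, and the temperature readings are all illustrative assumptions.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

# Hypothetical daily temperatures in degrees Celsius, with one suspicious reading.
temps = [21.0, 22.0, 20.0, 23.0, 21.0, 22.0, 45.0, 20.0, 21.0, 22.0]

summary = summarize(temps)
anomalies = find_anomalies(temps)
print(summary)
print(anomalies)
```

The gap between the mean and the median in the summary is itself a hint that an outlier is pulling the average upward.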
1. Climate Data Online by NOAA
In addition to station history data, Climate Data Online (CDO) offers free access to the NCDC's database of historical worldwide weather and climate data. These data include radar data, 30-year Climate Normals, and quality-controlled daily, monthly, seasonal, and yearly measures of temperature, precipitation, wind, and degree days. The climate data online is a repository of global marine data, local climatological data, weather, precipitation, regional snowfall information, etc. You can use these datasets to investigate the data points and summarize the results using exploratory data analysis.
Link to Dataset
2. Azure Open Datasets
All the major cloud service providers have established open data repositories for the data science community. Like Google and AWS, Azure also has an open data repository where publicly available datasets can be used to perform data cleaning, exploratory data analysis, and machine learning. Since all machine learning tasks go through the EDA process, this repository might be a good place to start with.
Link to Dataset
3. IEEE Data Port
IEEE is a non-profit technical professional organization with approximately 426,000 members across more than 160 nations. As the biggest technical professional association in the world, it works to advance technology for the benefit of all. The portal contains datasets from various categories, namely, artificial intelligence, astronomy, biomedical, cloud computing, finance, image processing, machine learning, security, signal processing, and many more. You can use these datasets for data science applications involving a good amount of exploratory data analysis.
Link to Dataset
Datasets for Data Science Project: Natural Language Processing
Natural language processing (NLP) is a branch of linguistics, computer science, and artificial intelligence that studies how computers and human language interact, with a focus on how to design computers to process and analyze massive amounts of natural language data. Today, people use voice assistants like Siri, Google Voice, and Alexa to make their day-to-day tasks easier and save time. Achieving human-level interaction with these voice assistants is difficult, but we can see how far they have come over the years. In this section, we will discuss example datasets that can help you analyze large amounts of text data.
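A typical first step in analyzing text is to split it into tokens and count word frequencies. The sketch below shows that step only. The tokenizer is a deliberately simple regular expression, and the stopword list and reviews are small, made-up examples.

```python
import re
from collections import Counter

# Small illustrative stopword list; real projects use much longer ones.
STOPWORDS = {"the", "a", "is", "and", "was", "but"}

def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(texts, n=3):
    """Return the n most frequent non-stopword tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in STOPWORDS)
    # Sort by count (highest first), then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

# Hypothetical review texts.
reviews = [
    "The food was great and the service was great.",
    "Great coffee, but the service is slow.",
]

print(top_words(reviews))
```

Word counts like these are the starting point for more involved techniques such as sentiment analysis or topic modelling on review datasets.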
1. Wikipedia: Database
When we are talking about text data, we cannot ignore the enormous text data present on Wikipedia. The best part is that Wikipedia offers free copies of all the available data content to interested users. The complete data dump can be found in the link below.
Link to Dataset
2. BuzzFeed News
The American news website BuzzFeed News is run by BuzzFeed. It is a leading provider of digital media and has released several high-profile scoops. You can find a list of all the datasets made available by BuzzFeed News on the GitHub link mentioned below.
Link to Dataset
3. Academic Torrents
Academic Torrents is a distributed system for sharing sizable datasets. You can also publish your public data globally for free to ensure it is available forever. The platform is used by some of the top academic research institutes around the world. You can make use of publicly available research papers to analyze the text contents in them.
Link to Dataset
4. Yelp Open Dataset
The Yelp dataset is a subset of businesses, reviews, and user information for use in personal, academic, and educational contexts. The primary aim of the dataset is to help students learn about databases and NLP. With over 150,000 businesses and around 7,000,000 reviews, this might be the perfect dataset to start your NLP journey or even enhance your existing knowledge in the domain.
Link to Dataset
5. The NLP Index by Quantum Stat
Following the community resources NLP Model Forge and The Big Bad NLP Database, the NLP Index brings NLP research papers, code, and discovery together in a single repository. You can use it to understand the various NLP models built by different authors. The NLP Index is not just about the data but also about pre-trained NLP models.
Link to Dataset
Data Science Datasets for Computer Vision
Computer vision studies how computers can extract high-level understanding from digital images or videos. From an engineering standpoint, it aims to understand and automate tasks that the human visual system can perform. This section covers some of the places where you can find image and video content for your computer vision applications.
1. Computer Vision Online
Computer Vision Online is an online platform that hosts several computer vision datasets that can be used publicly. It enables data science enthusiasts to access the datasets, exchange ideas and information within the community, and build computer vision systems from scratch or enhance a pre-trained model.
Link to Dataset
2. Visual Data Discovery
The website describes itself as the best place to find and share computer vision datasets. The website is home to computer vision datasets that include images, video, and 3D datasets.
Link to Dataset
3. Roboflow Public Datasets
Roboflow is a popular platform to look for computer vision datasets. It hosts these datasets in many popular formats. Their public datasets database contains more than 66 million images, 90 thousand datasets and even 7000 pre-trained models. Searching for your computer vision dataset requirements on Roboflow is an easy decision.
Link to Dataset
4. Datasets – Computer Vision Group, TUM
The Computer Vision Group of the Technical University of Munich has a team of researchers that work on topics in Computer Vision and Image Processing. They have made their research datasets available to others. This list of datasets can be found on their official website.
Link to Dataset
More Datasets for Data Science Projects
Here is a list of curated datasets for data science projects on various topics:
How are Data Science Datasets Created?
Nowadays, institutions and organizations are striving to collect more and more data. The data collection is done in several different ways, either directly or indirectly. The datasets are created using surveys, polls, votes, forms, observation, social media monitoring, online tracking, etc. and then made either publicly available or used for in-house purposes. In this article, we will cover the places where these datasets are publicly available and free to use.
Conclusion
In this article, we have seen 31 different places to find the datasets for data science projects based on which data science task we are working on. You can head over to the respective links based on your area of work or choose platforms such as Kaggle, AWS Open Data Registry, Google Dataset Search, etc., which host datasets for a variety of data science tasks. The purpose of making these datasets publicly available is to enhance the knowledge of individuals by making the resources readily available to them. The use of these datasets is intended to be for either learning purposes or non-commercial usage. To explore more possibilities with these datasets, you can check out KnowledgeHut’s Best Data Science Certification Online, which helps you master the tools, technologies, and trends driving the Data Science revolution. You will acquire the latest data analysis and visualization skills by working on real-world datasets under the guidance of trained experts.