Ashish is a technology consultant with 13+ years of experience who specializes in Data Science, the Python ecosystem and Django, DevOps, and automation. He focuses on the design and delivery of key, impactful programs.
Using GitHub for Data Science [Comprehensive Guide + Repos]
Most industry projects involve many people, such as AI researchers, data scientists, software developers, and testers, working together on one code base. Git is a command-line version control system that tracks changes over time, letting developers record their changes and merge them into a single repository. In this article, we will discuss the essential Git commands for data science work on GitHub, data science resources and Python projects on GitHub, and a learning path for beginners in data science.
Imagine you are working on a Data Science project and made some changes to your code last night. The next morning, your friend makes some improvements on top of your code. To merge these changes without confusion, you need a version control system. This article also discusses how to run a data science project on GitHub.
Data scientists need GitHub for source code management. GitHub hosts repositories managed with Git, an open-source version control system that tracks the changes made to a project. Using GitHub, users can clone the code from the central repository to their local machine, make changes, commit the modifications, and merge them back into the central repository. Many organizations follow agile development methods, and Git makes it easier to track changes and revisit earlier versions. As a data scientist, it's important to understand the concepts and use of Git and GitHub for Data Science projects. To learn GitHub for data science from scratch, start with the core Git tools and commands. A strong GitHub profile or portfolio for data science also depends on using these Git commands consistently in your projects. Consider enrolling in the Data Science Bootcamp with Job Placement course to get better insights into GitHub for Data Science.
Wikipedia describes Git as “a distributed version control system for tracking changes in source code during software development.” Git is an open-source project known for its performance, functionality, security, and flexibility. Most industry projects involve several developers, and Git supports the coding and collaboration that make the workflow easier across teams. GitHub is a web-based hosting platform for Git repositories. It also gives you a remote copy of your work in case the local repository on your system is lost or corrupted.
Data Scientists can use GitHub commands to:
Git uses a three-tier architecture and can hold a different state of the same code in each tier: the working directory, the staging area, and the local repository.
In a data science project, you need to track the changes and versions of your project and be able to return to the last version in which the code worked correctly. Git keeps the project systematic by recording every modification at each commit, and it makes collaboration easier when multiple people are making changes.
The working directory (also called the working tree) holds our local files, and new files there start out ‘untracked’. Changes that are never committed are not recorded in Git’s history and cannot be recovered through Git. You need to move files from the working tree to the staging area with git add, explicitly telling Git to notice your edits, and then commit so that the changes are recorded in the .git directory. It is best to make changes step by step and to combine changes on the same topic in a single commit. The .git directory holds the local repository, which consists of checkpoints called commits.
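As a minimal sketch of these three areas, the commands below move one file from the working tree to the staging area and then into the local repository, checking git status at each step. The scratch directory, the file name notes.txt, and the Demo User identity are placeholders chosen for illustration.

```shell
# Work in a throwaway directory so no real project is touched.
cd "$(mktemp -d)"
git init --quiet
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

echo "This is a new project" > notes.txt    # hypothetical file

before_add=$(git status --porcelain)        # file exists only in the working tree
git add notes.txt
after_add=$(git status --porcelain)         # file is now in the staging area
git commit --quiet -m "adding first file"
after_commit=$(git status --porcelain)      # nothing left to stage or commit

echo "untracked: $before_add"
echo "staged:    $after_add"
echo "committed: [$after_commit]"
```

The --porcelain option prints output similar to git status --short in a format meant for scripts: ?? marks an untracked file and A marks a file that has been added to the staging area.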
1. Git clone
This command downloads existing code from a remote repository. It makes a copy of the project, including its history, and saves it to the local computer with the latest version checked out.
git clone <https://name-of-the-repository-link>
2. Git branch
With the help of branches, many developers can work on a single project in parallel. The git branch command is used to create, list, and delete branches.
git branch <branch-name> (create a local branch)
git push -u <remote> <branch-name> (push the branch to the remote repository)
git branch -d <branch-name> (delete a branch)
3. Git checkout
This command switches between branches so that you can work in the one you need. With the -b option, it creates a new local branch and checks it out immediately after it has been created.
git checkout -b <name-of-your-branch>
4. Git status
This command gives us information about the current branch: whether it is up to date, whether there is anything to commit, push, or pull, and which files are staged, unstaged, or untracked.
git status
5. Git add
This command stages changes, such as created, modified, and deleted files, so that they are included in the next commit. The changes are not saved to the repository until we use git commit.
git add <file> (Add a single file)
git add -A (Add everything at once)
6. Git commit
It helps us set a checkpoint in our development and saves the changes locally. A developer can go back and see the modifications made in a particular file.
git commit -m "commit message"
7. Git push
It is used after committing to send changes to the remote server. It uploads commits to the remote repository, so only changes that have been committed are uploaded.
git push <remote> <branch-name>
8. Git pull
It is used to get updates from the remote repository and apply them to the local repository.
git pull <remote>
9. Git revert
It is used to undo the changes made by a commit. It creates a new commit that reverses the old one without deleting it, so the existing commit history stays intact. Each commit is identified by a hash, which can be viewed using git log.
git revert <commit-hash>
10. Git merge
This merges the named branch into the branch that is currently checked out, typically a local feature branch into the parent branch (master). It brings all the commits from that branch into the master branch.
git merge <branch-name>
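The branching, merging, and reverting commands above can be tried together in a throwaway repository. This is only a sketch: the file model.txt, the branch name feature, and the Demo User identity are made up for the example, and the name of the default branch is read from Git rather than assumed to be master.

```shell
# Scratch repository with a placeholder identity.
cd "$(mktemp -d)"
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"

# First commit on the default branch (master or main, depending on the setup).
echo "v1" > model.txt
git add model.txt
git commit --quiet -m "add model"
base=$(git rev-parse --abbrev-ref HEAD)

# Create a branch, switch to it, and commit a change there.
git checkout --quiet -b feature
echo "v2" > model.txt
git commit --quiet -am "tune model"

# Switch back and merge the feature branch into the default branch.
git checkout --quiet "$base"
git merge --quiet feature

# Undo the merged change with a new commit; the old commits stay in the history.
git revert --no-edit HEAD
git log --oneline
```

After the revert, model.txt holds its original content again, while git log still lists all three commits.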
Next, we will discuss how to install Git on Windows and create a repository to commit changes. For a more in-depth course, consider enrolling in Data Science Training.
Go to the Git website, then download and install the required version. When the installer finishes, select Launch Git Bash, then click Finish. Git Bash is now launched. Use the git --version command to check the installed version.
To create a new directory, use the mkdir command, and enter the folder using cd. Here, newproject is my local directory name.
To initialize the directory as a Git repository, use the git init command. Just to test, go to the folder where “newproject” was created and create a text file containing the line This is a new project. Save and close the file.
Return to Git Bash and use git status to check the status of the folder.
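Put together, the steps above look roughly like this in Git Bash. The sketch runs in a temporary directory so that it leaves existing folders alone, and the file name first.txt is an assumption, since any text file will do.

```shell
cd "$(mktemp -d)"                          # scratch location for the demo
mkdir newproject                           # create the project directory
cd newproject                              # enter it
git init                                   # turn it into a Git repository
echo "This is a new project" > first.txt   # hypothetical test file
git status                                 # lists first.txt as untracked
```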
The git config command lets users set configuration values that control how Git looks and operates, including any non-default behavior they want. With git config we can set global values such as the user's name and email and verify them using git config --list.
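A short sketch of git config follows. It sets the values for one scratch repository only, so that machine-wide settings are left untouched; adding --global to the two commands that set values would apply them to every repository for the current user. The name and email are placeholders.

```shell
cd "$(mktemp -d)"
git init --quiet

# Repository-level settings; use "git config --global ..." for all repositories.
git config user.name "Demo User"
git config user.email "demo@example.com"

# Verify what was stored for this repository.
git config --list --local
configured_name=$(git config --get user.name)
echo "Commits here will be authored by: $configured_name"
```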
Initially, our file is untracked. The git add command copies a file from the working directory to the staging area. Commits keep track of the changes we make. The git commit command performs a commit, and the -m “message” option attaches a message to it. Git then takes a snapshot of the staging area and identifies that snapshot by the commit's hash.
Logs let us see the commit history and the changes in a project when different developers have worked on the same repository. Running git log shows the commit message we entered earlier, “adding first file”.
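The staging, commit, and log steps can be sketched as follows. The commit message “adding first file” comes from the walkthrough, while the scratch directory, the file name first.txt, and the Demo User identity are assumptions for the example.

```shell
cd "$(mktemp -d)"
git init --quiet
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "This is a new project" > first.txt  # hypothetical file
git add first.txt                         # working directory -> staging area
git commit --quiet -m "adding first file" # staging area -> local repository

commit_hash=$(git rev-parse HEAD)         # hash that identifies the new commit
last_message=$(git log -1 --format=%s)    # subject line of the latest commit
echo "$commit_hash $last_message"

git log --oneline                         # short view of the commit history
```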
Make a new repository on GitHub and give it a name and a readme description.
Add a file to the project folder, then stage it, commit it, and push it to the new repository using the commands described below. After the push, the file appears in the GitHub repository.
The git remote command manages connections to remote repositories, which are used to share code; any project can also be downloaded from a remote server to the local computer. By convention, the connection to the main remote repository is named “origin”.
We use the command git remote add origin <GitHub repo link>
The Git push command is used to upload local repository content and commits to a remote repository. After all the final modifications have been made by the developers, a push operation is performed so that changes can be successfully shared with remote team members.
The command is git push origin master
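To try the remote and push commands without a GitHub account, the sketch below uses a bare repository on the local disk as a stand-in for the GitHub repository. With a real project, the path passed to git remote add would be the GitHub repo link. The directory names, file name, and identity are assumptions, and the branch name is read from Git because newer setups may default to main instead of master.

```shell
work=$(mktemp -d)
cd "$work"
git init --quiet --bare remote.git        # local stand-in for the GitHub repository

git init --quiet newproject
cd newproject
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo "This is a new project" > first.txt  # hypothetical file
git add first.txt
git commit --quiet -m "adding first file"

git remote add origin "$work/remote.git"  # a GitHub repo link would go here
branch=$(git rev-parse --abbrev-ref HEAD) # master or main
git push --quiet -u origin "$branch"      # upload the local commits
git remote -v                             # show the configured remote
```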
To resolve merge conflicts, add or remove files, or make commits, you first need to clone the repository, which copies it from GitHub to your local machine. The clone is a copy of the existing repository and includes the versions of every file and folder in the project.
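Cloning can be sketched with a local repository standing in for one hosted on GitHub; with a real project, the first argument to git clone would be the repository link. The names source-repo, my-copy, and data.csv are hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# Build a small repository to clone from (stand-in for a GitHub repository).
git init --quiet source-repo
(
  cd source-repo
  git config user.name "Demo User"        # placeholder identity
  git config user.email "demo@example.com"
  echo "id,value" > data.csv              # hypothetical project file
  git add data.csv
  git commit --quiet -m "add data"
)

# Clone it: the copy contains the files and the full commit history.
git clone --quiet source-repo my-copy
cd my-copy
git log --oneline                         # history came along with the files
git remote get-url origin                 # origin points back at the source
```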
Branching allows developers to take the code from production and fix a bug or add a feature without modifying the existing version. Work on a branch happens on a copy of the code: changes are made, built, and tested there, and are then merged into the main branch.
To create a new branch, use git branch <name-of-branch>
Pull requests tell collaborators about the changes made on a branch of a repository. Once a pull request is opened, you can discuss and review the proposed changes with collaborators and then add further commits as the review requires.
Forking is a way to use or contribute to someone else’s project. A fork is a remote copy of the original repository created under your own GitHub account. You can change the copy to improve the project and then open a pull request so that your changes can be merged into the original project.
A commit takes a snapshot of your changes, and a push uploads those commits to the remote repository.
This is how you contribute changes to open-source projects and public repositories.
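The Git side of the fork workflow can be sketched locally. Two bare repositories stand in for the original project and for your fork on GitHub. Creating the fork and opening the pull request happen on the GitHub website, so they appear here only as comments. All repository, branch, and file names are hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# Stand-in for the original project on GitHub.
git init --quiet seed
(
  cd seed
  git config user.name "Project Owner"    # placeholder identity
  git config user.email "owner@example.com"
  echo "# project" > README.md
  git add README.md
  git commit --quiet -m "initial commit"
)
git clone --quiet --bare seed upstream.git

# Stand-in for your fork (on GitHub this is created with the Fork button).
git clone --quiet --bare upstream.git fork.git

# Clone the fork, keep a link to the original, and work on a branch.
git clone --quiet fork.git my-fork
cd my-fork
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
git remote add upstream "$work/upstream.git"
git checkout --quiet -b fix-readme
echo "More details about the project." >> README.md
git commit --quiet -am "improve readme"
git push --quiet -u origin fix-readme     # the branch now exists on the fork
# Next step on GitHub: open a pull request from fix-readme to the original project.
```

Keeping a second remote named upstream is a common convention that makes it easy to pull later changes from the original project into the fork.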
Git and GitHub can be used to track changes in a data science project. As a data scientist, you use GitHub to collect data from different sources and to make or implement changes in the existing project files. Other developers and managers can review your changes and see the modifications already made. Some best practices include:
To learn more about best practices and tips on getting started in the Data Science field, consider enrolling in KnowledgeHut’s Data Science Bootcamp with Job Placement.
The following are the best GitHub repositories for data science. The best way to excel is to build your own Data Science projects on GitHub.
As an aspiring data scientist, you should learn version control tools like Git and GitHub to maintain and review the research and changes in a project. They let you work on the same code and data together with multiple data scientists, and they can be used in production, including on high-risk projects. The key takeaways from this article are the basic Git commands and their step-by-step usage. Finally, we discussed how Git is used in the data science domain.
Start by learning basic Git commands, which have been mentioned in the article. Try collecting all your data, models, and notebooks in one repository for one data science project. Include readme files and make branches in case of other modifications.
We listed the top repositories for learning and hosting data science projects on GitHub. In order to participate, you could enroll in contests from websites like Kaggle or Analytics Vidhya.
Bitgrit consists of numerous AI challenges and problems, including real-world datasets to find solutions for, whereas GitHub is a version control platform that consists of a repository of data science projects.
You can use GitHub to list all your projects under a single repository or even make GitHub Pages showcasing and explaining your individual Data Science project and your approach to solving them.
Git is a version control system that helps to keep track of our modifications in a project, whereas GitHub is a cloud-based website or hosting platform that consists of repositories consisting of files describing a particular project.