Using GitHub for Data Science [Comprehensive Guide + Repos]

Published: 26th Apr, 2024

    Any industry project involves many people (AI researchers, data scientists, software developers, and testers) working together to refine a code base. Git is a command-line version control system that tracks changes over time, so developers can keep a history of their work and merge it into a single repository. In this article, we will cover the essential Git and GitHub commands for data science, useful data science resources and repositories on GitHub, and a learning path for beginners in data science.

    What is GitHub?

    Imagine you are working on a data science project and made some changes to your code last night. The next morning, your friend makes some improvements on top of your code. To merge these changes without confusion, you need a version control system. GitHub is a web-based platform that hosts Git repositories, giving both of you a shared place to store, review, and combine that work. This article also discusses GitHub data science projects.

    Why Do Data Scientists Need to Use GitHub?

    Data scientists need GitHub for source code management. It hosts repositories managed with Git, an open-source version control system that tracks the changes made to a project. Using GitHub, users can clone the code from the central repository to their local machine, make changes, commit the modifications, and merge them back into the central repository. Many organizations follow agile development methods, and Git makes it easier to track changes and revisit earlier versions. As a data scientist, it's important to understand the concepts and use of Git and GitHub for data science projects. To learn GitHub for data science from scratch, start with the core Git commands and build on them. A strong GitHub profile or portfolio for data science also depends on using those commands consistently in your own projects. Consider enrolling in the Data Science Bootcamp with Job Placement course for deeper insight into GitHub for data science.

    What is Git and How Does It Work?

    "Git is a distributed version control system for tracking changes in source code during software development" (Wikipedia). Git is an open-source project known for its performance, functionality, security, and flexibility. Most industry projects involve several developers, and Git supports their collaboration by making the workflow easier across teams. GitHub is a web platform that hosts Git repositories, which also gives you a copy of your work in case the local repository on your system is lost or corrupted.

    Data scientists can use Git and GitHub to:

    • Review research and ongoing repositories 
    • Understand the previous and current state of the project and the functions used 
    • Track the changes and the user who made the changes. 

    Git uses a three-tier architecture and can hold a different state of the same code in each tier: the working directory, the staging area, and the local repository.

    • Working Directory: The folder on your local machine where the project files are stored and edited.
    • Staging Area: The area that holds the files you want to include in the next commit.
    • Git Repository: After you run a commit, the staged files are recorded as a snapshot in the Git repository.
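
    The three areas can be observed from the command line. The following is a minimal sketch, assuming Git is installed; the temporary directory, the file name analysis.py, and the demo identity are made up for illustration.

```shell
# Sketch: follow one file through Git's three areas.
# The directory, file name, and identity below are hypothetical demo values.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # stored only in this demo repository
git config user.email "demo@example.com"

echo "print('hello')" > analysis.py

# 1. Working directory: the file exists on disk but Git does not track it yet.
state_untracked=$(git status --short)     # "?? analysis.py"

# 2. Staging area: git add marks the file for the next commit.
git add analysis.py
state_staged=$(git status --short)        # "A  analysis.py"

# 3. Repository: git commit records the staged snapshot.
git commit -q -m "Add analysis script"
state_committed=$(git status --short)     # empty, nothing left to commit

echo "untracked: $state_untracked"
echo "staged:    $state_staged"
echo "committed: [$state_committed]"
```

    The short status codes make the transitions visible: ?? marks an untracked file, A marks a newly staged file, and an empty status means the working directory matches the last commit.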

    Is Git Important to Learn for Data Science?

    While working on a data science project, you need to track the changes and versions of your project, including the last version in which the code worked correctly. Git keeps your project systematic by recording the modifications made in every commit. It also makes collaboration easier when multiple people are making changes.

    Git Terminologies and Basic Commands

    The ‘untracked’ area of Git is the current working directory where our local files are present. However, if we don’t save the changes made to our files in the local system, they will be lost. You need to move files from the working tree to the staging area and explicitly tell Git to notice the edits so that the changes are reflected in the Git directory. It is important to make changes step by step and to combine changes on the same topic into a single commit. The Git directory contains the local repository, which is made up of checkpoints, or commits.

    Terminologies

    1. Repository or Repo: A repository is a directory that serves as storage space for your code and projects. It can be a local folder on your computer or a storage space on GitHub. For each project, you can create a repo and keep code files, README files, image files, and everything else related to the project in it.
    2. Terminal: The terminal is a command-line interface where we can input commands. You type text commands at the prompt shown on the screen.
    3. Cloning: Cloning pulls a copy of all the repository data, including folders and files along with their versions. When we clone a repository, we copy it from GitHub.com to our local machine, which makes it easier to fix merge conflicts, and we can later push changes back to the remote repository.
    4. Commit: A commit is a ‘snapshot’ of your repository that marks a checkpoint, which lets you review or restore the project to any previous version. Committing makes it easier to split a feature into small commits and to keep related changes grouped together. The git commit command is used for it.
    5. Push: The git push command is used to upload content from a local repository to a remote repository. Pushing transfers the commits on a local branch to the corresponding remote branch. Special care should be taken while pushing, because a forced push can overwrite existing changes on the remote.
    6. Pull: The git pull command is used to fetch and download content from a remote repository into the local repository. First, git fetch runs and downloads content from the specified remote repository. Then git merge is executed to merge the remote content into the local branch, creating a new merge commit when one is needed. A pull request is something different: it proposes that changes pushed to a branch be merged into a repository. Once a pull request is opened, collaborators can discuss and review the potential changes before they are merged into the base branch.

    Basic commands

    1. Git clone

    This command downloads existing code from a remote repository. It makes an identical copy of the latest version of a project, together with its history, and saves it to the local computer.

    git clone <https://name-of-the-repository-link>

    2. Git branch

    With the help of branches, many developers can work on a single project in parallel. The git branch command is used to create, list, and delete branches.

    git branch <branch-name> (Creating a local branch)
    git push -u <remote> <branch-name> (Pushing the branch to the remote repository)
    git branch -d <branch-name> (Deleting a branch)

    3. Git checkout

    This command switches between branches; you need to check out a branch before working in it. With the -b option, it creates a new branch locally and switches to it right after it has been created.

    git checkout -b <name-of-your-branch>

    4. Git status

    This command gives us information about the current branch: whether it is up to date with its remote counterpart, whether there is anything to commit, push, or pull, and which files are staged, unstaged, or untracked.

    git status

    5. Git add

    This command stages changes (newly created, modified, or deleted files) so that they are included in the next commit. The changes are not saved in the repository until we use git commit.

    git add <file> (Add a single file)

    git add -A (Add everything at once)

    6. Git commit

    It sets a checkpoint in our development and saves the staged changes locally. A developer can later go back and see the modifications made to a particular file.

    git commit -m "commit message"

    7. Git push

    It is used after committing and sends the changes to the remote server. It uploads only the changes that have been committed.

    git push <remote> <branch-name>

    8. Git pull

    It is used to get updates from the remote repository and apply them to the local copy.

    git pull <remote>

    9. Git revert

    It is used to undo the changes made by a commit. It creates a new commit that reverses the earlier one without deleting it, so the existing commit history stays intact. Each commit has a hash, which can be viewed using git log.

    git revert <commit-hash>
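
    To make the effect of a revert concrete, here is a small sketch, assuming Git is installed. The file params.txt, its contents, and the commit messages are invented for the example.

```shell
# Sketch: undo a commit with git revert in a throwaway repository.
# File name, contents, and identity are hypothetical demo values.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "learning_rate = 0.01" > params.txt
git add params.txt
git commit -q -m "Add baseline parameters"

echo "learning_rate = 10" > params.txt
git commit -q -am "Try a larger learning rate"

# git rev-parse HEAD prints the hash of the latest commit;
# git log lists the same hashes for the whole history.
bad_commit=$(git rev-parse HEAD)

# Add a new commit that reverses the bad one; --no-edit keeps the default message.
git revert --no-edit "$bad_commit" >/dev/null

commit_count=$(git rev-list --count HEAD)
restored=$(cat params.txt)

git log --oneline
echo "params.txt now contains: $restored"
```

    The history ends up with three commits rather than one: the original, the mistake, and the revert. That is the point of git revert, since nothing already shared with collaborators is rewritten.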

    10. Git merge

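    This merges another branch into the branch you currently have checked out. To bring a feature branch into the parent branch (master), check out master first; the merge then adds the feature branch's commits to master.

    git merge <branch-name>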

    How to Create and Clone a Repository? [Step-by-Step]

    We will discuss how to install Git on Windows and create a repository to commit changes. For an in-depth course, consider enrolling in Data Science Training.

    Step 1: Create Account & Git Installations

    Go to the Git website and install the required version. Once it is installed, select Launch Git Bash, then click Finish. Git Bash is now launched. Use the git --version command to check the installed version.

    Step 2: Initializing a Repository

    To create a new directory, use the mkdir command, and enter the folder using cd. Here, newproject is my local directory name.

    To initialize the directory as a repository, use the git init command. As a test, go to the folder where “newproject” was created and create a text file containing the line This is a new project. Save and close the file.

    Return to Git Bash and use git status to check the status of the folder.

    Step 3: Configuring Git

    The git config command lets users set configuration values that control how Git looks and operates, including any non-default behavior they may want. With git config we can set global values such as the user's name and email, and verify them using git config --list.
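
    As an illustration, the sketch below sets a name and email and reads them back. It assumes Git is installed, and the identity is a placeholder. To avoid changing your real settings, the demo writes to a throwaway repository instead of using --global.

```shell
# Sketch: set and verify Git configuration values.
# "Ada Example" and the email address are placeholder values.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# With --global these values would apply to every repository of the current
# user; the demo omits it so they stay inside this temporary repository.
git config user.name "Ada Example"
git config user.email "ada@example.com"

# Read single values back.
configured_name=$(git config user.name)
configured_email=$(git config user.email)
echo "name:  $configured_name"
echo "email: $configured_email"

# List every repository-level setting, including the two values above.
git config --local --list
```

    On your own machine you would typically run the two git config commands once with --global, so that every commit you make carries the same author information.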

    Step 4: Learn How to Commit Files in Git

    Initially, our file is untracked. The git add command copies a file from the working directory to the staging area. Making commits keeps track of the changes we perform. The git commit command performs a commit, and the -m "message" option attaches a message to it. Git then takes a snapshot of the staging area and assigns a hash to the resulting commit.

    Step 5: Viewing Logs

    Logs let us see the commit history and the changes in a project when different developers have worked on the same repository. Using git log, we can see the commit message we entered earlier: “adding first file”.
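
    A short sketch of committing and then reading the history, assuming Git is installed. The file name notes.txt and the second commit message are made up; the first message matches the one used in this walkthrough.

```shell
# Sketch: make two commits and inspect them with git log.
# notes.txt and the demo identity are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "This is a new project" > notes.txt
git add notes.txt
git commit -q -m "adding first file"

echo "A second line" >> notes.txt
git commit -q -am "extending the notes"   # -a stages tracked files that changed

# One line per commit, newest first: abbreviated hash plus message.
git log --oneline

latest_message=$(git log -1 --format=%s)
first_message=$(git log --format=%s | tail -n 1)
echo "latest: $latest_message"
echo "first:  $first_message"
```

    Plain git log prints the full hash, author, date, and message for each commit; --oneline is a compact view that is easier to scan in a long history.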

    Step 6: Uploading To Remote Repo on Git 

    Make a new repository on GitHub and give it a name and a README description.

    Add a file to a folder and run the commands below in sequence.

    • cd "folder where files are kept"
    • git init
    • git remote add origin "your GitHub repository path"
    • git remote -v
    • git add .
    • git commit -m "your message"
    • git push origin master (use main instead if that is your default branch name)

    After the push completes, the file appears in the GitHub repository.
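
    The same sequence can be tried without a GitHub account. The sketch below is an approximation that assumes Git is installed: a local bare repository stands in for the GitHub repository path, and all directory and file names are invented.

```shell
# Sketch: the upload sequence, with a local bare repository standing in
# for GitHub. Paths, file names, and identity are hypothetical.
set -e
work=$(mktemp -d)
remote="$work/remote.git"      # stand-in for "your GitHub repository path"
project="$work/newproject"

git init -q --bare "$remote"   # a bare repository only stores history

mkdir "$project"
cd "$project"
git init -q
# Name the initial branch master so the demo matches the commands above,
# whatever default branch name this Git installation uses.
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "This is a new project" > notes.txt
git remote add origin "$remote"
git remote -v                  # shows the fetch and push targets for origin
git add .
git commit -q -m "adding first file"
git push -q origin master

# After the push, the remote branch points at the same commit as the local one.
local_head=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$remote" rev-parse refs/heads/master)
echo "local:  $local_head"
echo "remote: $remote_head"
```

    With a real GitHub repository, only the value passed to git remote add origin changes: it becomes the repository URL, and Git may ask for credentials on the first push.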

    Step 7: Adding Git Remote to Your Repository

    The git remote command manages the connections between your local repository and remote repositories, which is what lets you share code and download a project from a remote server to your local computer. By convention, the main remote connection is named “origin”; a cloned repository already has it, pointing back to the repository it was cloned from.

    We use the command git remote add origin <GitHub repo link>

    Step 8: Push using Git

    The git push command uploads local repository content and commits to a remote repository. Once the developers have committed their modifications, a push shares those changes with remote team members.

    The command is git push origin master

    Step 9: Cloning a GitHub Repository

    To resolve merge conflicts, add or remove files, or make commits, you first need to clone the repository, which copies it from GitHub to your local machine. The clone is a copy of the existing repository, including the versions of every file and folder in the project.
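
    Cloning can be sketched locally as well. The example assumes Git is installed and uses a small local repository as a stand-in for one hosted on GitHub; the paths and the README content are made up.

```shell
# Sketch: clone a repository and confirm the copy carries the history.
# source_repo stands in for a GitHub URL; all names are hypothetical.
set -e
work=$(mktemp -d)
source_repo="$work/source"

git init -q "$source_repo"
git -C "$source_repo" config user.name "Demo User"
git -C "$source_repo" config user.email "demo@example.com"
echo "# Demo project" > "$source_repo/README.md"
git -C "$source_repo" add README.md
git -C "$source_repo" commit -q -m "Add README"

# With GitHub, the first argument would be the repository URL instead.
git clone -q "$source_repo" "$work/my-copy"

# The clone contains the files, the commit history, and a remote named origin.
cloned_message=$(git -C "$work/my-copy" log -1 --format=%s)
origin_url=$(git -C "$work/my-copy" remote get-url origin)
echo "last commit in clone: $cloned_message"
echo "origin points to:     $origin_url"
```

    Because the clone records where it came from as origin, later git pull and git push commands know which repository to talk to without further setup.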

    Step 10: Branching and Merging

    Branching lets developers take the code from production and work on a bug fix or a new feature without modifying the existing version. On a branch, they work with a copy of the code, make and build changes, and test them; the changes are then merged into the main branch.

    To create a new branch, use git branch <branch-name>

    • Step 1: Create branch -> git branch <branch-name>
    • Step 2: Checkout branch -> git checkout <branch-name>
    • Step 3: Merge new branch in master branch -> git checkout master, then git merge <branch-name>
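
    The three steps can be run end to end in a scratch repository. This is a minimal sketch, assuming Git is installed; the branch name feature-tuning, the file model.txt, and its contents are invented.

```shell
# Sketch: create a branch, commit on it, and merge it back into master.
# Branch, file, and identity names are hypothetical demo values.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # use master as the initial branch name
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "model = 'baseline'" > model.txt
git add model.txt
git commit -q -m "Add baseline model"

# Step 1: create the branch. Step 2: switch to it.
git branch feature-tuning
git checkout -q feature-tuning

echo "model = 'tuned'" > model.txt
git commit -q -am "Tune the model"

# Step 3: return to master, then merge the feature branch into it.
git checkout -q master
git merge -q feature-tuning

merged_content=$(cat model.txt)
current_branch=$(git rev-parse --abbrev-ref HEAD)
echo "on branch $current_branch, model.txt contains: $merged_content"
```

    git merge always merges into the branch that is currently checked out, which is why the sketch switches back to master before merging. Here master had no new commits of its own, so Git simply moves it forward to the tip of the feature branch.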

    Step 11: Pull using Git

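    A pull request announces the changes made on a branch of a repository and proposes merging them. Once a pull request is opened, you can discuss and review the proposed changes with collaborators, push follow-up commits in response to the review, and then merge. Note that this is different from the git pull command, which downloads updates from a remote repository into your local one.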

    Step 12: Forking and Contributing to the world

    Forking is the way to contribute to, or build on, someone else’s project. A fork is a remote copy of the original repository created under your own account. You make changes in your copy and then propose them to the original project through a pull request, which the maintainers can merge. This is how you make open-source contributions to someone else’s project.

    • Open any public repository and click on the Fork button to fork it.
    • You can keep the same name as the repository you are forking, then click on Create Fork.
    • Once you fork, you will see a copy of the original repository in your account.
    • Once you have made changes to the code, you need to commit them and push them back to your fork.

    The commit takes a snapshot of the changes, and the push uploads that snapshot to your fork on GitHub.

    From there, you open a pull request against the original project, which is how you contribute to a public open-source repository.

    Best Practices for Structuring a Data Science Project Using Git

    Git and GitHub can be used to track changes in a data science project. As a data scientist, you can use GitHub to gather code and data from different sources and to modify or extend the existing project files. Other developers and managers can review the changes and see the existing modifications. Some best practices include:

    • Keep all your data, project files, and models in one single place.
    • Keep track of changes and versions of the project locally using Git.
    • Store machine learning models alongside the analysis, whether it is done in code or in tools such as Tableau or Power BI. This makes it possible to deploy projects through a CI/CD methodology.

    To learn more about best practices and tips on getting started in the data science field, consider enrolling in KnowledgeHut’s Data Science Bootcamp with Job Placement.

    5 Tools for Getting Started with Data Science on GitHub

    1. VS Code extensions  
    2. GitHub.dev  
    3. Codespaces  
    4. Model and Data Templates  
    5. GitHub Actions  

    Top GitHub Repositories for Data Science

    The following are among the best GitHub repositories for data science. The best way to improve is to build your own data science projects on GitHub.

    1. freeCodeCamp
    2. TensorFlow
    3. The Algorithms
    4. Awesome Machine Learning
    5. Data Science IPython Notebooks
    6. Homemade Machine Learning
    7. Awesome Data Science
    8. Deep Learning Drizzle

    Conclusion

    As an aspiring data scientist, you should learn version control tools like Git and GitHub to maintain and review the research and changes in a project. They let multiple data scientists work on the same project at the same time and can be relied on in production and in high-risk projects. The key takeaways from this article are the basic Git commands and their step-by-step usage. Finally, we discussed how Git is used in the data science domain.

    Frequently Asked Questions (FAQs)

    1. How do I get started with GitHub as a Data Scientist?

    Start by learning the basic Git commands mentioned in this article. Try collecting all the data, models, and notebooks for one data science project in a single repository. Include README files, and create branches for further modifications.

    2. What are good data science projects on GitHub for a beginner to participate in?

    We listed the top repositories for learning and hosting data science projects on GitHub. To participate, you could also enter contests on websites like Kaggle or Analytics Vidhya.

    3. What are the differences between bitgrit and GitHub for data science projects?

    bitgrit offers numerous AI challenges and problems, including real-world datasets to find solutions for, whereas GitHub is a version control platform that hosts repositories of data science projects.

    4. How do I create a data science portfolio on GitHub?

    You can use GitHub to list all your projects under a single repository, or build GitHub Pages that showcase and explain your individual data science projects and your approach to solving them.

    5. What is the difference between Git and GitHub?

    Git is a version control system that keeps track of the modifications in a project, whereas GitHub is a cloud-based hosting platform for repositories, each containing the files that make up a particular project.

    Profile

    Ashish Gulati

    Data Science Expert

    Ashish is a technology consultant with 13+ years of experience and specializes in Data Science, the Python ecosystem and Django, DevOps and automation. He specializes in the design and delivery of key, impactful programs.
