In our data science and development journeys, most of us have used Python as our primary language: its ecosystem of loaded libraries makes development much easier. But Python isn't always fast or convenient enough, and because many popular Python libraries are actually built in other languages such as Java, C, C++, and even JavaScript, it can also inherit their security flaws. Julia for data science is better than Python in terms of memory allocation, and it gives us more freedom to manually control garbage collection. In Python, we are constantly freeing memory and collecting information about memory usage, which can be daunting in some cases.
Compared with Python, Julia takes the lead in multiple areas. In this comprehensive article, we will discuss the advantages of the Julia language, look at Julia libraries, and see how it makes developers' lives easier.
First, let's define the Julia language. Julia is a high-level, general-purpose language that can be used to write code that is fast to execute and easy to implement for scientific calculations. Julia for data science is designed around the needs of scientific researchers and data scientists, to optimize experimentation and design implementation. Check out the Data Science Online Course to learn more about the fundamentals and skills of data science.
Julia Overview
The Julia programming language attracts non-programmer scientists and data science enthusiasts by providing simple syntax for math operations that mirrors notation from outside the computing world. It was designed to be a general-purpose programming language that is also excellent for the special needs of technical computing (i.e., math and science), meaning it must have great support for advanced mathematics, n-dimensional arrays, and superior performance. Julia delivers on all of these, though it usually isn't quite as fast as C.
The way it achieves this speed, even though it is dynamically typed, is that the language was designed with type inference in mind. When a method is first called with a given combination of argument types, the compiler figures out the concrete type of everything involved and generates highly specialized code for those machine types. The compiler emits LLVM IR at runtime, and LLVM performs an additional round of optimizations on it.
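As a minimal sketch of that specialization, consider a one-line generic function: Julia compiles a separate, type-specialized native method for each combination of argument types it is actually called with (the inspection macros in the comments are optional and only print the generated code):

```julia
# One generic definition; the compiler specializes it per argument type.
double(x) = x + x

a = double(21)    # triggers an Int64-specialized compilation -> 42
b = double(1.5)   # triggers a Float64-specialized compilation -> 3.0

# Optional: inspect the code generated for each specialization.
# @code_llvm double(21)
# @code_native double(1.5)

println((a, b))   # (42, 3.0)
```

The same source line produced two different compiled methods, one operating on machine integers and one on machine floats, which is how dynamic-looking Julia code can run at near-static-language speed.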
We can also see that it's a general-purpose language from the kinds of literal data types it has. Yes, there is dedicated matrix syntax, but there are also literals for regular expressions and system commands. These aren't needed for scientific programming, but they are particularly useful for all kinds of scripting tasks. Julia can also be embedded in other programs through its embedding API: it can be called from Python programs using PyJulia, and from R programs using JuliaCall.
What is Julia Data Science?
When it comes to data science, developers need a flexible and versatile language that is simple to code in but still able to handle complex mathematical processes.
Let’s understand the advantages of using Julia for Data Science.
- Julia language for data science is a very well-designed universal language (not one only for numerical computing).
- It is a mixture of Python, MATLAB, and Lisp.
- Julia uses 1-based indexing, like MATLAB and R.
- It has native support for matrices and datasets.
- Multiple dispatch fits data science much better than classical OOP.
- Julia is fast, almost as fast as C.
- Julia has exceptionally good C, Python, R, etc. interoperability.
- Julia has good parallelism and multithreading.
- Execution speed is much better than Python.
- There is no Global Interpreter Lock, so Julia can use all the cores on your CPU.
- Data science with Julia solves the two-language problem: we can prototype and put into production the same source code. It is common practice to prototype in Python but re-implement production code in languages like Java, Haskell, or C/C++.
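To make the multiple-dispatch point above concrete, here is a small sketch: the method that runs is selected on the types of all arguments, which maps naturally onto mathematical operations. The `Distance` types below are illustrative, not from any library.

```julia
# Multiple dispatch: the method is selected on the types of *all* arguments.
abstract type Distance end
struct Euclidean <: Distance end
struct Manhattan <: Distance end

dist(::Euclidean, x, y) = sqrt(sum((x .- y) .^ 2))
dist(::Manhattan, x, y) = sum(abs.(x .- y))

x, y = [0.0, 0.0], [3.0, 4.0]
println(dist(Euclidean(), x, y))  # 5.0
println(dist(Manhattan(), x, y))  # 7.0

# And, as noted above, Julia arrays are 1-indexed:
v = [10, 20, 30]
println(v[1])                     # 10
```

Adding a new distance measure is just one more method definition; no class hierarchy needs to be reworked, which is why this style suits vector/matrix-heavy data science code so well.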
Julia may slowly overtake Python in data science, given that the GPU support of the two is roughly equivalent. The main reason is that Julia can essentially get more computation done per hour (which is important for Big Data), and the object-oriented aspect of Python is of less value in data science, since most methods are based on vector/matrix computations.
Further, Julia in data science may make better use of processor hardware, since vector operations are closer to hardware operations. See, for example, the article "C Is Not a Low-Level Language" for an interesting viewpoint on how close C really is to the hardware. To get started with Julia data science projects, you should check out the Data Science Bootcamp training, as it offers the best online course.
Exploring Heart Disease Dataset
Let's set up your Julia REPL: either use JuliaPro or set up VS Code for Julia. If you are using a cloud notebook, it is suggested that you add the code below to your Dockerfile and build it.
Code:
FROM gcr.io/deepnote-200602/templates/deepnote
RUN wget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz && \
    tar -xvzf julia-1.6.2-linux-x86_64.tar.gz && \
    sudo mv julia-1.6.2 /usr/lib/ && \
    sudo ln -s /usr/lib/julia-1.6.2/bin/julia /usr/bin/julia && \
    rm julia-1.6.2-linux-x86_64.tar.gz && \
    julia -e 'using Pkg; pkg"add IJulia LinearAlgebra SparseArrays Images MAT"'
ENV DEFAULT_KERNEL_NAME "julia-1.6.2"
Installing Julia Data Science Packages
The method below will help you download and install multiple libraries at once.
import Pkg
Pkg.add(["CSV", "CategoricalArrays", "Chain", "DataFrames", "GLM", "Plots",
         "Random", "StatsPlots", "Statistics", "Interact", "Blink"])
Importing Packages
Code:
using CSV
using CategoricalArrays
using Chain
using DataFrames
using GLM
using Plots
using Random
using StatsPlots
using Statistics
ENV["LINES"] = 20 # to limit the number of rows.
ENV["COLUMNS"] = 20 # to limit the number of columns
Density Plot
A density plot is a variation of the histogram that uses kernel smoothing to plot values, which gives much smoother distributions by smoothing out the noise. The density plot visualizes the distribution of data over a continuous interval or time period, and its peaks show where values are concentrated over that interval.
An advantage density plots have over histograms is that they are better at showing the shape of the distribution, because they are not affected by the number of bins used (each bar in a typical histogram). A histogram with only 4 bins wouldn't produce a distribution shape as distinguishable as a 20-bin histogram would. With density plots, this is never an issue.
Code:
using StatsPlots, KernelDensity
a, b = randn(10000), randn(10000)
dens = kde((a,b))
plot(dens)
Group Histogram
In statistics, a histogram is a representation of the distribution of numerical data, where the data are binned and the count for each bin is shown. In PlotlyJS.jl, a histogram is more generally an aggregated bar chart, which can use many possible aggregation functions (e.g. average, count).
Code:
using PlotlyJS
plot(
[
histogram(
x=randn(500),
histnorm="percent",
name="control",
xbins_start=0.2,
xbins_end=0.8,
xbins_size=0.1,
marker_color="#eb98b5",
opacity=0.75
),
histogram(
x=randn(500) .+ 1,
histnorm="percent",
name="experimental",
xbins_start=0.4,
xbins_end=0.8,
xbins_size=0.1,
marker_color="#330C73",
opacity=0.75
)
],
Layout(title="Sampled Results", xaxis_title="Value", yaxis_title="Count")
)
Output:
Multiple Plots
With Plots, there are two possibilities to show multiple series in one plot:
First, you can use a matrix where each column constitutes a separate series:
a, b, c = randn(100), randn(100), randn(100)
histogram([a b c])
hcat is used to concatenate the vectors (note the spaces instead of commas).
This is equivalent to
histogram(randn(100,3))
You can apply options to the individual series using a row matrix:
histogram([a b c], label = ["a" "b" "c"])
Second, you can use plot! and its variants to update a previous plot:
histogram(a) # creates a new plot
histogram!(b) # updates the previous plot
histogram!(c) # updates the previous plot
Alternatively, you can specify which plot to update:
p = histogram(a) # creates a new plot p
histogram(b) # creates an independent new plot
histogram!(p, c) # updates plot p
This is useful if you have several subplots.
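For example, to arrange several independently built histograms side by side in one figure, a sketch using the Plots.jl `layout` keyword:

```julia
using Plots

a, b, c = randn(100), randn(100), randn(100)
p1 = histogram(a, title = "a")
p2 = histogram(b, title = "b")
p3 = histogram(c, title = "c")

# Arrange the three plots in a single row of subplots.
plot(p1, p2, p3, layout = (1, 3), legend = false)
```

Because each subplot is an ordinary plot object, you can keep updating any of them with `histogram!(p1, ...)` before composing the final figure.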
Predictive Model
Predicting NYC Lot Prices with Lathe in Julia.
Code:
using Lathe: models, Validate

# Our y is going to be Price,
# our x is going to be Doors.
# Fitting our baseline model:
model = models.meanBaseline(traindf.Price)
# Fill the Baseline column with the baseline predictions:
testdf.Baseline = models.predict(model, testdf.Price)
accuracy = Validate.mae(testdf.Price, testdf.Baseline)
println("Baseline accuracy: ", accuracy)
Output:
Baseline accuracy: -41321.739130434784
Our mean absolute error was about 41,000, which is bad. But of course, this was to be expected, and it makes the validation of the actual model even more interesting.
# Fitting our model:
linreg = models.SimpleLinearRegression(traindf.Doors,traindf.Price)
# Put our x and y into the predict method:
testdf.Prediction = models.predict(linreg,testdf.Doors)
And you are probably wondering how that fared for our mean absolute error.
linaccuracy = Validate.mae(testdf.Price, testdf.Prediction)
println("Linear Regression Accuracy: ", linaccuracy)
Output:
Linear Regression Accuracy: 0.0
That's right: 0.0. If you're a data science extraordinaire, you likely understand why it is 0: the model recovered the exact slope of our linear equation, so every prediction matches its target. Of course, this is great!
The slope of the lines is pretty much identical, which is to be expected with a mean absolute error of zero.
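Since Lathe's API has shifted across versions, here is an equivalent sketch of the same experiment using GLM.jl (one of the packages imported earlier in this article) on synthetic data; the `Doors` and `Price` columns are stand-ins for the NYC lot dataset:

```julia
using DataFrames, GLM

# Synthetic stand-in: Price is an exact linear function of Doors,
# so a simple linear regression should fit it perfectly.
df = DataFrame(Doors = [1, 2, 3, 4, 5],
               Price = [100.0, 200.0, 300.0, 400.0, 500.0])

ols  = lm(@formula(Price ~ Doors), df)
pred = predict(ols, df)

# Mean absolute error; ~0 because the relationship is exactly linear.
mae = sum(abs.(df.Price .- pred)) / nrow(df)
println("Linear Regression MAE: ", mae)
```

On real, noisy data the MAE would of course be positive; a near-zero value here simply confirms, as with the Lathe example, that the fitted line matches the underlying linear relationship.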
Conclusion
In this comprehensive article, we have learned how simple Julia for data science is and how powerful it is when it comes to scientific calculations. With a few lines of code, we have seen that this language has the potential to overtake Python, as it has similar syntax but higher performance. It's still new to data science, but I am sure it is the future of machine learning and artificial intelligence. You can gain more knowledge by taking KnowledgeHut's Data Science Course Online.