In our data science and development journeys, most of us have used Python as our primary language: its ecosystem of loaded libraries makes development much easier. But Python isn't always fast or convenient enough, and because many popular Python libraries are actually built in other languages such as Java, C, C++, and even JavaScript, it can also inherit their security flaws. Julia for data science is better than Python in terms of memory allocation, and it gives us more freedom to manually control garbage collection. In Python, we are constantly freeing memory and collecting information about memory usage, which can be daunting in some cases.
Compared with Python, Julia takes the lead in multiple areas. In this comprehensive article, we will discuss the advantages of the Julia language, look at Julia libraries, and see how it makes developers' lives easier.
First, let's define the Julia language. Julia is a high-level, general-purpose language that can be used to write code that is fast to execute and easy to implement for scientific calculations. Julia for data science is designed around the needs of scientific researchers and data scientists, to optimize experimentation and design implementation. Check out the Data Science Online Course to learn more about the fundamentals and skills of data science.
Julia Overview
The Julia programming language attracts non-programmer scientists and data science enthusiasts by providing simple syntax for math operations that mirrors notation from outside the computing world. It was designed to be a general-purpose programming language that is also excellent for the special needs of technical computing (i.e., math and science), meaning it must have great support for advanced mathematics, n-dimensional arrays, and superior performance. Julia delivers on all of these, though it usually isn't quite as fast as C.
The way it achieves this speed, even though it is dynamically typed, is that the language was designed with type inference in mind. When a method is first called with a given combination of argument types, the compiler figures out the concrete type of everything involved and generates highly specialized code for those machine types. The compiler emits LLVM IR at runtime, and LLVM performs an additional round of optimizations on it.
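As a minimal sketch of that specialization, consider a one-line generic function: Julia compiles a separate, type-specialized native method for each combination of argument types it is actually called with (the inspection macros in the comments are optional and only print the generated code):

```julia
# One generic definition; the compiler specializes it per argument type.
double(x) = x + x

a = double(21)    # triggers an Int64-specialized compilation -> 42
b = double(1.5)   # triggers a Float64-specialized compilation -> 3.0

# Optional: inspect the code generated for each specialization.
# @code_llvm double(21)
# @code_native double(1.5)

println((a, b))   # (42, 3.0)
```

The same source line produced two different compiled methods, one operating on machine integers and one on machine floats, which is how dynamic-looking Julia code can run at near-static-language speed.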
We can also see that it's a general-purpose language from the kinds of literal data types it has. Yes, there is dedicated matrix syntax, but there are also literals for regular expressions and system commands. These aren't needed for scientific programming, but they are particularly useful for all kinds of scripting tasks. Julia can also be embedded in other programs through its embedding API: it can be called from Python programs using PyJulia, and from R programs using JuliaCall.
What is Julia Data Science?
When it comes to data science, developers need a flexible and versatile language that is simple to code in but still able to handle complex mathematical processes.
Let’s understand the advantages of using Julia for Data Science.
- Julia language for data science is a very well-designed universal language (not one only for numerical computing).
- It is a mixture of Python, MATLAB, and Lisp.
- Julia uses 1-based indexing, like MATLAB and R.
- It has native support for matrices and datasets.
- Multiple dispatch fits data science much better than classical OOP.
- Julia is fast, almost as fast as C.
- Julia has exceptionally good C, Python, R, etc. interoperability.
- Julia has good parallelism and multithreading.
- Execution speed is much better than Python.
- There is no Global Interpreter Lock, so Julia can use all the cores on your CPU.
- Data science with Julia solves the two-language problem: we can prototype and put into production the same source code. It is common practice to prototype in Python but re-implement production code in languages like Java, Haskell, or C/C++.
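To make the multiple-dispatch point above concrete, here is a small sketch: the method that runs is selected on the types of all arguments, which maps naturally onto mathematical operations. The `Distance` types below are illustrative, not from any library.

```julia
# Multiple dispatch: the method is selected on the types of *all* arguments.
abstract type Distance end
struct Euclidean <: Distance end
struct Manhattan <: Distance end

dist(::Euclidean, x, y) = sqrt(sum((x .- y) .^ 2))
dist(::Manhattan, x, y) = sum(abs.(x .- y))

x, y = [0.0, 0.0], [3.0, 4.0]
println(dist(Euclidean(), x, y))  # 5.0
println(dist(Manhattan(), x, y))  # 7.0

# And, as noted above, Julia arrays are 1-indexed:
v = [10, 20, 30]
println(v[1])                     # 10
```

Adding a new distance measure is just one more method definition; no class hierarchy needs to be reworked, which is why this style suits vector/matrix-heavy data science code so well.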
Julia may slowly overtake Python in data science, given that the GPU support of the two is roughly equivalent. The main reason is that Julia can essentially get more computation done per hour (which is important for Big Data), and the object-oriented aspect of Python is of less value in data science, since most methods are based on vector/matrix computations.
Further, Julia in data science may make better use of processor hardware, since vector operations are closer to hardware operations. See, for example, the article "C Is Not a Low-Level Language" for an interesting viewpoint on how close C really is to the hardware. To get started with Julia data science projects, you should check out the Data Science Bootcamp training, as it offers the best online course.
Exploring Heart Disease Dataset
Let's set up your Julia REPL: either use JuliaPro or set up VS Code for Julia. If you are using a cloud notebook, it is suggested that you add the code below to your Dockerfile and build it.
Code:
FROM gcr.io/deepnote-200602/templates/deepnote
RUN wget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz && \
    tar -xvzf julia-1.6.2-linux-x86_64.tar.gz && \
    sudo mv julia-1.6.2 /usr/lib/ && \
    sudo ln -s /usr/lib/julia-1.6.2/bin/julia /usr/bin/julia && \
    rm julia-1.6.2-linux-x86_64.tar.gz && \
    julia -e 'using Pkg; pkg"add IJulia LinearAlgebra SparseArrays Images MAT"'
ENV DEFAULT_KERNEL_NAME "julia-1.6.2"
Installing Julia Data Science Packages
The method below will help you download and install multiple libraries at once.
import Pkg
Pkg.add(["CSV", "CategoricalArrays", "Chain", "DataFrames", "GLM", "Plots",
         "Random", "StatsPlots", "Statistics", "Interact", "Blink"])
Importing Packages
Code:
using CSV
using CategoricalArrays
using Chain
using DataFrames
using GLM
using Plots
using Random
using StatsPlots
using Statistics
ENV["LINES"] = 20 # to limit the number of rows.
ENV["COLUMNS"] = 20 # to limit the number of columns
Density Plot
A density plot is a variation of the histogram that uses kernel smoothing to plot values, which gives much smoother distributions by smoothing out the noise. The density plot visualizes the distribution of data over a continuous interval or time period, and its peaks show where values are concentrated over that interval.
An advantage density plots have over histograms is that they are better at showing the shape of the distribution, because they are not affected by the number of bins used (each bar in a typical histogram). A histogram with only 4 bins wouldn't produce a distribution shape as distinguishable as a 20-bin histogram would. With density plots, this is never an issue.
Code:
using StatsPlots, KernelDensity
a, b = randn(10000), randn(10000)
dens = kde((a,b))
plot(dens)
Group Histogram
In statistics, a histogram is a representation of the distribution of numerical data, where the data are binned and the count for each bin is shown. In PlotlyJS.jl, a histogram is more generally an aggregated bar chart, which can use many possible aggregation functions (e.g. average, count).
Code:
using PlotlyJS
plot(
[
histogram(
x=randn(500),
histnorm="percent",
name="control",
xbins_start=0.2,
xbins_end=0.8,
xbins_size=0.1,
marker_color="#eb98b5",
opacity=0.75
),
histogram(
x=randn(500) .+ 1,
histnorm="percent",
name="experimental",
xbins_start=0.4,
xbins_end=0.8,
xbins_size=0.1,
marker_color="#330C73",
opacity=0.75
)
],
Layout(title="Sampled Results", xaxis_title="Value", yaxis_title="Count")
)
Output:
Multiple Plots
With Plots, there are two possibilities to show multiple series in one plot:
First, you can use a matrix where each column constitutes a separate series:
a, b, c = randn(100), randn(100), randn(100)
histogram([a b c])
hcat is used to concatenate the vectors (note the spaces instead of commas).
This is equivalent to
histogram(randn(100,3))
You can apply options to the individual series using a row matrix:
histogram([a b c], label = ["a" "b" "c"])
Second, you can use plot! and its variants to update a previous plot:
histogram(a) # creates a new plot
histogram!(b) # updates the previous plot
histogram!(c) # updates the previous plot
Alternatively, you can specify which plot to update:
p = histogram(a) # creates a new plot p
histogram(b) # creates an independent new plot
histogram!(p, c) # updates plot p
This is useful if you have several subplots.
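For example, to arrange several independently built histograms side by side in one figure, a sketch using the Plots.jl `layout` keyword:

```julia
using Plots

a, b, c = randn(100), randn(100), randn(100)
p1 = histogram(a, title = "a")
p2 = histogram(b, title = "b")
p3 = histogram(c, title = "c")

# Arrange the three plots in a single row of subplots.
plot(p1, p2, p3, layout = (1, 3), legend = false)
```

Because each subplot is an ordinary plot object, you can keep updating any of them with `histogram!(p1, ...)` before composing the final figure.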
Predictive Model
Predicting NYC Lot Prices with Lathe in Julia.
Code:
using Lathe: models, Validate

# Our y is going to be Price,
# our x is going to be Doors.
# Fitting our baseline model:
model = models.meanBaseline(traindf.Price)
# Fill the Baseline column with the baseline predictions:
testdf.Baseline = models.predict(model, testdf.Price)
accuracy = Validate.mae(testdf.Price, testdf.Baseline)
println("Baseline accuracy: ", accuracy)
Output:
Baseline accuracy: -41321.739130434784
Our mean absolute error was about 41,000, which is bad. But of course, this was to be expected, and it makes the validation of the actual model even more interesting.
# Fitting our model:
linreg = models.SimpleLinearRegression(traindf.Doors,traindf.Price)
# Put our x and y into the predict method:
testdf.Prediction = models.predict(linreg,testdf.Doors)
And you are probably wondering how that fared for our mean absolute error.
linaccuracy = Validate.mae(testdf.Price, testdf.Prediction)
println("Linear Regression Accuracy: ", linaccuracy)
Output:
Linear Regression Accuracy: 0.0
That's right: 0.0. If you're a data science extraordinaire, you likely understand why it is 0: the model recovered the exact slope of our linear equation, so every prediction matches its target. Of course, this is great!
The slope of the lines is pretty much identical, which is to be expected with a mean absolute error of zero.
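Since Lathe's API has shifted across versions, here is an equivalent sketch of the same experiment using GLM.jl (one of the packages imported earlier in this article) on synthetic data; the `Doors` and `Price` columns are stand-ins for the NYC lot dataset:

```julia
using DataFrames, GLM

# Synthetic stand-in: Price is an exact linear function of Doors,
# so a simple linear regression should fit it perfectly.
df = DataFrame(Doors = [1, 2, 3, 4, 5],
               Price = [100.0, 200.0, 300.0, 400.0, 500.0])

ols  = lm(@formula(Price ~ Doors), df)
pred = predict(ols, df)

# Mean absolute error; ~0 because the relationship is exactly linear.
mae = sum(abs.(df.Price .- pred)) / nrow(df)
println("Linear Regression MAE: ", mae)
```

On real, noisy data the MAE would of course be positive; a near-zero value here simply confirms, as with the Lathe example, that the fitted line matches the underlying linear relationship.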
Conclusion
In this comprehensive article, we have learned how simple Julia for data science is and how powerful it is when it comes to scientific calculations. With a few lines of code, we have seen that this language has the potential to overtake Python, as it has similar syntax but higher performance. It's still new to data science, but I am sure it is the future of machine learning and artificial intelligence. You can gain more knowledge by taking KnowledgeHut's Data Science Course Online.