Julia for Data Science [A Beginner’s Guide]

Published
13th Sep, 2023

    In our data science or development journeys, most of us start with Python as the primary language: its ecosystem of libraries makes development much easier. But Python is not always fast or convenient enough, and it can introduce security concerns, since many of its most-used libraries are actually implemented in other languages such as Java, C, C++, and even JavaScript. Julia for data science improves on Python in terms of memory allocation, and it gives us more freedom to control garbage collection manually. In Python, the runtime is constantly freeing memory and collecting information about memory usage, which can be costly in some cases.

    Compared with Python, Julia takes the lead in several areas. In this article, we will discuss the advantages of the Julia language and its libraries, and see how it makes developers' lives easier.

    First, let’s define the Julia language. Julia is a high-level, general-purpose language designed for writing code that is fast to execute and easy to implement for scientific calculations. Julia for data science is designed around the needs of scientific researchers and data scientists, optimizing for experimentation and design implementation. Check out the Data Science Online Course to learn more about the fundamentals and skills of data science.

    Julia Overview

    The Julia programming language attracts non-programmer scientists and data science enthusiasts by providing a simple syntax for math operations that resembles notation from the non-computing world. It was designed as a general-purpose programming language that is also excellent for the special needs of technical computing (i.e., math and science), meaning it needed great support for advanced mathematics, n-dimensional arrays, and high performance. Julia delivers all of those things, though it usually isn’t quite as fast as C.
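As a small illustration of that math-friendly syntax, here is a hedged sketch (the function and values are invented for the example) showing numeric literal coefficients like 2x, matrix literals, and the backslash operator for solving linear systems:

```julia
# Numeric literal coefficients: 2x^2 means 2 * x^2.
f(x) = 2x^2 + 3x + 1

# Matrix literals and built-in linear algebra.
A = [1 2; 3 4]   # a 2x2 matrix
b = [5, 6]       # a length-2 vector
x = A \ b        # solve the linear system A * x = b

println(f(2))    # 15
println(x)
```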

    The way Julia achieves this speed, even though it is dynamically typed, is that the language was designed with type inference in mind. When a function is called, the just-in-time compiler infers the concrete types of its arguments and generates highly specialized code for those machine types. The compiler emits LLVM IR at runtime, and LLVM performs an additional round of optimizations on it.
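A minimal sketch of that specialization, using the @code_llvm macro from Julia's standard InteractiveUtils library to inspect the code generated for one concrete argument type:

```julia
using InteractiveUtils  # provides @code_llvm

# One generic definition...
square(x) = x * x

# ...gets compiled separately for each concrete argument type.
println(square(3))    # calls the Int64 specialization
println(square(3.0))  # calls the Float64 specialization

# Print the LLVM IR that was generated for the integer version:
@code_llvm square(3)
```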

    We can also see that it is a general-purpose language from the kinds of literal data types it has. Yes, there is fancy matrix syntax, but there are also literals for regular expressions and system commands. These aren’t needed for scientific programming, but they are particularly useful for all kinds of scripting tasks. Julia can also be embedded in other programs through its embedding API: Python programs can call Julia using PyJulia, and R programs can do the same with JuliaCall.

    What is Julia Data Science?

    When it comes to data science, developers need a flexible and versatile language that is simple to code in but can still handle complex mathematical processes.

    Let’s understand the advantages of using Julia for Data Science. 

    • Julia for data science is a very well-designed general-purpose language (not one only for numerical computing).
    • It feels like a mixture of Python, MATLAB, and Lisp.
    • Julia is a 1-indexed programming language.
    • It has native support for matrices and datasets.
    • Multiple dispatch fits data science much better than classical OOP.
    • Julia is fast, almost as fast as C.
    • Julia has exceptionally good interoperability with C, Python, R, etc.
    • Julia has good parallelism and multithreading.
    • Its execution speed is much better than Python’s.
    • There is no Global Interpreter Lock, so Julia can use all the cores on your CPU.
    • Data science with Julia solves the two-language problem: we can prototype and put into production the same source code. It is common practice to prototype in Python but re-implement production code in languages like Java, Haskell, or C/C++.
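To make the multiple-dispatch point concrete, here is a minimal sketch (the function combine and its methods are invented for illustration): one generic function whose method is chosen by the types of all its arguments, not just the receiver as in classical OOP:

```julia
# A single generic function, with the method selected by the
# types of *all* arguments, not just the first.
combine(a::Number, b::Number) = a + b
combine(a::AbstractVector, b::AbstractVector) = a .+ b
combine(a::AbstractVector, b::Number) = a .+ b

println(combine(1, 2))            # scalar + scalar
println(combine([1, 2], [3, 4]))  # elementwise vector sum
println(combine([1, 2], 10))      # broadcast a scalar over a vector
```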

    Julia may slowly overtake Python in data science, given that the GPU support of both is now comparable. The main reason is that Julia can get more computation done per hour (which matters for big data), while the object-oriented aspect of Python is of less value in data science, since most methods are based on vector/matrix computations.

    Further, Julia in data science may have the chance of using processor hardware better, since vector operations are closer to hardware operations. See, for example, the essay "C Is Not a Low-Level Language" for an interesting viewpoint on how close C actually is to the hardware. To get started with Julia data science projects, you should check out the Data Science Bootcamp training.

    Exploring Heart Disease Dataset

    Let’s set up your Julia REPL: either use JuliaPro or set up VS Code for Julia. If you are using a cloud notebook, it is suggested that you add the code below to your Dockerfile and build it.

    Code

    FROM gcr.io/deepnote-200602/templates/deepnote
    RUN wget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz && \
        tar -xvzf julia-1.6.2-linux-x86_64.tar.gz && \
        sudo mv julia-1.6.2 /usr/lib/ && \
        sudo ln -s /usr/lib/julia-1.6.2/bin/julia /usr/bin/julia && \
        rm julia-1.6.2-linux-x86_64.tar.gz && \
        julia -e 'using Pkg; pkg"add IJulia LinearAlgebra SparseArrays Images MAT"'
    ENV DEFAULT_KERNEL_NAME "julia-1.6.2"

    Installing Julia Data Science Packages

    The command below downloads and installs multiple packages at once.

    import Pkg; Pkg.add(["CSV","CategoricalArrays", 
    "Chain", "DataFrames", "GLM", "Plots", "Random", "StatsPlots", 
    "Statistics","Interact", "Blink"]) 

    Importing Packages

    Code

    using CSV 
    using CategoricalArrays 
    using Chain 
    using DataFrames 
    using GLM 
    using Plots 
    using Random 
    using StatsPlots 
    using Statistics 
    ENV["LINES"] = 20 # to limit the number of rows. 
    ENV["COLUMNS"] = 20 # to limit the number of columns 
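With the packages imported, the heart disease data can be loaded into a DataFrame. The sketch below writes a tiny stand-in CSV so it is self-contained; in practice you would point CSV.read at the downloaded heart-disease file (the column names here are invented for the example):

```julia
using CSV, DataFrames

# Write a tiny stand-in file; replace with the real dataset path.
write("heart.csv", "age,chol,target\n63,233,1\n37,250,1\n41,204,0\n")

df = CSV.read("heart.csv", DataFrame)

println(first(df, 2))  # peek at the first rows
println(describe(df))  # per-column summary statistics
```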

    Density Plot

    A density plot is a variant of the histogram that uses kernel smoothing to plot the values, which gives a much smoother distribution by smoothing out the noise. A density plot visualizes the distribution of data over a continuous interval. The peaks in a density plot show where values are concentrated over that interval.

    An advantage density plots have over histograms is that they are better at showing the shape of a distribution because they are not affected by the number of bins used (each bar in a typical histogram). A histogram with only 4 bins would not produce as distinguishable a distribution shape as a 20-bin histogram would. With density plots, this is never an issue.

    Code: 

    using StatsPlots, KernelDensity  # KernelDensity is a separate package: Pkg.add("KernelDensity")
    a, b = randn(10000), randn(10000) 
    dens = kde((a,b)) 
    plot(dens)

    Group Histogram

    In statistics, a histogram is a representation of the distribution of numerical data, where the data are binned and the count for each bin is shown. In PlotlyJS, a histogram is an aggregated bar chart that supports several aggregation functions (e.g., average, count).

    Code: 

    using PlotlyJS
    plot( 
        [ 
            histogram( 
                x=randn(500), 
                histnorm="percent", 
                name="control", 
                xbins_start=0.2, 
                xbins_end=0.8, 
                xbins_size=0.1, 
                marker_color="#eb98b5", 
                opacity=0.75 
            ), 
            histogram( 
                x=randn(500) .+ 1, 
                histnorm="percent", 
                name="experimental", 
                xbins_start=0.4, 
                xbins_end=0.8, 
                xbins_size=0.1, 
                marker_color="#330C73", 
                opacity=0.75 
            ) 
        ], 
        Layout(title="Sampled Results", xaxis_title="Value", yaxis_title="Count") 
    ) 

    Multiple Plots

    With Plots, there are two ways to show multiple series in one plot.

    First, you can use a matrix where each column constitutes a separate series:

    a, b, c = randn(100), randn(100), randn(100)
    histogram([a b c])

    The bracket syntax horizontally concatenates the vectors, like hcat (note the spaces instead of commas). This is equivalent to:

    histogram(randn(100, 3))

    You can apply options to the individual series using a row matrix:

    histogram([a b c], label = ["a" "b" "c"])

    Second, you can use plot! and its variants to update a previous plot:

    histogram(a)  # creates a new plot
    histogram!(b) # updates the previous plot
    histogram!(c) # updates the previous plot

    Alternatively, you can specify which plot to update:

    p = histogram(a)  # creates a new plot p
    histogram(b)      # creates an independent new plot
    histogram!(p, c)  # updates plot p

    This is useful if you have several subplots.

    Predictive Model

    Predicting NYC Lot Prices with Lathe in Julia. 

    Code: 

    using Lathe: models
    using Lathe: validate

    # Our y is going to be Price,
    # our x is going to be Doors.
    # Fit a mean baseline model:
    model = models.meanBaseline(traindf.Price)
    # Generate baseline predictions for the test set:
    testdf.Baseline = models.predict(model, testdf.Doors)
    accuracy = validate.mae(testdf.Price, testdf.Baseline)
    println("Baseline accuracy: ", accuracy)

    Output:

    Baseline accuracy: -41321.739130434784

    Our mean absolute error was about 41,000, which is bad. But of course, this was to be expected, and it makes the validation of the actual model all the more interesting.

    # Fit a simple linear regression:
    linreg = models.SimpleLinearRegression(traindf.Doors, traindf.Price)
    # Put our x into the predict method:
    testdf.Prediction = models.predict(linreg, testdf.Doors)

    And you are probably wondering how that fared for our mean absolute error. 

    linaccuracy = validate.mae(testdf.Price, testdf.Prediction)
    println("Linear Regression Accuracy: ", linaccuracy)

    Output:

    Linear Regression Accuracy: 0.0

    That’s right, 0.0. If you’re a data science veteran, you likely understand why it is 0. Regardless, what this example shows is that our model was able to recover the slope of our linear equation exactly. Of course, this is great!

    The slope of the lines is pretty much identical, which is to be expected with a mean absolute error of zero. 

    Conclusion

    In this article, we have seen how simple Julia for data science is and how powerful it is when it comes to scientific calculations. With a few lines of code, we have discovered that this language has the potential to overtake Python, as it has similar syntax but higher performance. It is still new to data science, but I am sure it is the future of machine learning and artificial intelligence. You can gain more knowledge by taking KnowledgeHut’s Data Science Course Online.

    Frequently Asked Questions (FAQs)

    1. Is Julia the future of data science?

    Julia for data science is still new, but it may slowly overtake Python in the field of data science.

    2. Will Julia replace Python in data science?

    Julia for data science might be the better language, but Python is often the better tool for the job. Python is still far more popular, and, being older than Julia, it has the advantage of a massive and extremely active community built over time. The main driving force and conclusive advantage Python has over Julia is its vast collection of packages: one can easily access most functionality from libraries without having to code it at a low level. Equivalents don’t always exist yet in Julia.

    3. Is Julia good for data science?

    Julia for data science is designed around the needs of scientific researchers and data scientists, optimizing for experimentation and design implementation.

    Profile

    Eshaan Pandey

    Author

    Eshaan is a Full Stack web developer skilled in MERN stack. He is a quick learner and has the ability to adapt quickly with respect to projects and technologies assigned to him. He has also worked previously on UI/UX web projects and delivered successfully. Eshaan has worked as an SDE Intern at Frazor for a span of 2 months. He has also worked as a Technical Blog Writer at KnowledgeHut upGrad writing articles on various technical topics.
