
Add Voice to Your Blog With the SpeechSynthesis API

Published 24th Apr, 2024 · 10 min read

    Voice-based technologies have become popular in recent years thanks to their many real-world applications. Products like Google Assistant, Amazon Alexa, and Siri have drawn many organizations into voice technology. Voice-based applications can completely change the way users interact with software and hardware and improve the user experience. On the web, users can listen to a page instead of reading it, leaving them free to focus on other tasks.

    Voice technologies have played a very important role in making websites and blogs more accessible to people with disabilities and will continue to do so in the coming years. Let’s explore how we can add voice to a blog or a webpage.

    How can you make a web page talk or add voice to your blog? The Web Speech API can do that for you. It has two parts - the SpeechRecognition API, which converts spoken audio into text, and the SpeechSynthesis API, which reads text aloud.

    The SpeechSynthesis API is a subset of the Web Speech API and a very popular way to add voice to a webpage or blog. It enables developers to generate natural-sounding human speech as playable audio: arbitrary strings, words, and sentences are converted into the sound of a person speaking them. Let’s learn a little more about the SpeechSynthesis API and then walk through how to wire it into your web page with an easy-to-follow tutorial.

    What is the Web Speech API?

    The Web Speech API was first introduced in 2012 and is used by web developers to add audio input and output features to web apps. It respects user privacy: the microphone is never accessed until the user grants permission, and no third-party software is needed. If the website runs over HTTPS, permission has to be granted only once; otherwise the browser asks again in each new session. The Web Speech API also allows you to specify grammar objects and rules, as sketched below.
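    For illustration only - grammars belong to the recognition side of the API, browser support for them is limited, and Chrome exposes the interfaces under a webkit prefix - attaching a JSGF grammar might look roughly like this:

    /* A minimal sketch of attaching a JSGF grammar to speech recognition. */
    /* Grammar support varies by browser; Chrome uses webkit-prefixed names. */
    var Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    var GrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;

    var grammar = '#JSGF V1.0; grammar colors; public <color> = red | green | blue;';
    var recognition = new Recognition();
    var grammarList = new GrammarList();
    grammarList.addFromString(grammar, 1); /* weight between 0 and 1 */
    recognition.grammars = grammarList;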

    The Web Speech API has two major interfaces (a feature-detection sketch follows the list):

    1. SpeechSynthesis: used in applications that need text-to-speech. It lets a web page read its on-page content aloud in a human-like voice, using the device's speech synthesizer.
    2. SpeechRecognition: used in applications that need voice recognition and transcription. It lets an app treat voice as an input and control method, just like the keyboard or touch, via the device microphone, so users can easily interact with websites and apps through voice commands.
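    Both features hang off the global window object. As a quick sketch (note that some browsers expose the recognition constructor under a webkit prefix), you can feature-detect them like this:

    /* Feature-detecting the two halves of the Web Speech API */
    if ('speechSynthesis' in window) {
        console.log('Text-to-speech is available via window.speechSynthesis');
    }
    var Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    if (Recognition) {
        console.log('Speech recognition is available');
    }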

    Since this tutorial focuses on adding voice to your blog, we will explore the SpeechSynthesis API in more detail.

    Text to Speech with SpeechSynthesis

    Speech synthesis is the transformation of text into voice by a computer. In its simplest form, the computer reads typed text aloud in a human-like voice.

    Now the question is: how does speech synthesis work? The process happens in three stages -

    • Text to words:

    This pre-processing or normalization stage is focused on reducing the ambiguity in the text. It goes through the complete text and cleans it up so that the computer can read it without any mistakes.

    Reading a written paragraph may sound easy to you, but for a computer it is as unfamiliar as it is for a young child. Written text can carry multiple meanings depending on context; we differentiate them using inflections in speech, but how would a computer? For example, "2001" can be a year, a lock combination, or a measure of something. In human speech, the meaning becomes clear from the way we enunciate. The same applies to dates, times, special characters, abbreviations, acronyms, and so on.

    Also, some words sound the same but mean different things, like "sell" and "cell". This stage of speech synthesis also handles homographs - words that are spelled the same but pronounced differently. A toy sketch of this kind of normalization follows.
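    The browser's synthesizer does this normalization for you; purely as a toy illustration of the idea (the expandAbbreviations helper and its tiny lookup table are invented for this example), a pre-processing pass might look like this:

    /* Toy text normalization: expand abbreviations before they are spoken. */
    var abbreviations = { 'Dr.': 'Doctor', 'St.': 'Street', 'etc.': 'et cetera' };
    function expandAbbreviations(text) {
        return text.replace(/Dr\.|St\.|etc\./g, function(match) {
            return abbreviations[match];
        });
    }
    console.log(expandAbbreviations('Dr. Smith lives on Main St.'));
    /* "Doctor Smith lives on Main Street" */
    /* Note the ambiguity a real engine must resolve: "St." could also mean "Saint". */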

    • Words to Phonemes:

    Our API has figured out which words need to be said, and now it is time to generate the speech sounds that form those words. To pronounce a word, a computer needs to know which sounds make it up and how each is pronounced. Phonemes are the basic units of sound in speech, so in essence the computer needs a list of phonemes for every word it reads. Again, this looks easy, but it is not trivial for computers: a single sentence can be read out in several ways depending on the context and meaning of the text. For example, "read" can be pronounced like "reed" or like "red", depending on the situation.

    To read out the words correctly, they are first broken down into their graphemes - the written component units, generally formed from the syllables that make up a word - and then the corresponding phonemes are generated. It is like sounding out a word by relating it to words you have encountered in the past. A toy pronunciation lookup is sketched below.
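    Again purely as a toy illustration (the phonemesFor helper and its ARPAbet-style entries are invented for this example; real engines use large pronunciation dictionaries plus letter-to-sound rules), such a lookup might be sketched like this:

    /* Toy grapheme-to-phoneme lookup keyed by word and grammatical context */
    var pronunciations = {
        'read|present': 'R IY D', /* sounds like "reed" */
        'read|past': 'R EH D'     /* sounds like "red"  */
    };
    function phonemesFor(word, tense) {
        return pronunciations[word + '|' + tense];
    }
    console.log(phonemesFor('read', 'past')); /* "R EH D" */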

    • Phonemes to sound:

    By now the computer has cleaned the text and generated a list of phonemes for each word; what remains is turning those phonemes into actual sound. There are three approaches for that -

    1. Concatenative: This approach builds speech from recordings of a human voice. Many samples of a person saying different things are recorded, broken into words, and the words are converted into phonemes. The method is limited to a single voice and requires a great deal of human effort.
    2. Formant: Speech can be treated as a pattern of sound that varies in volume, pitch, and frequency, much like the output of a musical instrument, so it is possible to build a device that generates these sound patterns directly. This is the formant approach. Unlike concatenative synthesis, formant synthesis makes it easy to generate multiple voices, but concatenative output sounds more natural.
    3. Articulatory: The most realistic and most complex approach to generating phonemes is articulatory synthesis: making computers speak by modeling the human vocal apparatus. The simplest illustration would be a talking robot with a mouth that produces sound the way humans do, built from a combination of computer, electrical, and mechanical components.

    Methods and properties of Speech Synthesis API

    The SpeechSynthesis interface inherits properties and methods from its parent interface, EventTarget. These properties, methods, and event handlers help developers perform common tasks and manage speech synthesis while implementing the API. Some of them are listed below, with a short usage sketch after the list:

    • SpeechSynthesisUtterance.onstart: This event handler fires when the utterance begins to be spoken.
    • SpeechSynthesisUtterance.onend: This handler runs when the utterance has finished being spoken.
    • SpeechSynthesisUtterance.onerror: If an error prevents the utterance from being spoken, this handler fires. (The onresult handler often listed alongside these belongs to the SpeechRecognition interface, where it fires when recognition returns a result.)
    • SpeechSynthesis.cancel(): This method removes all utterances from the utterance queue; speaking stops immediately.
    • SpeechSynthesis.getVoices(): This method returns the list of all voices available on the current device.
    • SpeechSynthesis.pause(): It puts the SpeechSynthesis object into a paused state, pausing the narration.
    • SpeechSynthesis.resume(): If the SpeechSynthesis object is in a paused state, it resumes the narration.
    • SpeechSynthesis.speak(): This method adds an utterance to the utterance queue to be spoken.
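    Here is a minimal sketch tying these members together (the utterance text is arbitrary; calling pause, resume, and cancel back-to-back would make little sense at runtime, so they are shown as comments to be wired to UI controls, as the tutorial below does):

    /* The SpeechSynthesis controls in one place */
    var synth = window.speechSynthesis;
    var utt = new SpeechSynthesisUtterance('Hello from the Web Speech API');
    utt.onstart = function() { console.log('narration started'); };
    utt.onend = function() { console.log('narration finished'); };
    utt.onerror = function(event) { console.log('narration failed: ' + event.error); };
    synth.speak(utt); /* queue the utterance and start speaking */
    /* later, from UI handlers:
       synth.pause();  - pause the narration
       synth.resume(); - resume a paused narration
       synth.cancel(); - empty the queue and stop immediately */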

    What are the benefits of using Speech Synthesizers?

    There are various useful applications of Speech Synthesizers and some of them are:

    • It can help visually impaired people by reading books and web pages aloud.
    • In education, speech synthesizers can be used to teach spelling and pronunciation.
    • In multimedia and telecommunications, it can reduce manual effort by reading emails and messages aloud over phone lines.
    • It lets users engage with your application and device through a voice user interface.
    • It personalizes communication with users based on their preferences.
    • It improves customer interaction with intelligent, natural-sounding responses.

    Adding Text-to-Speech to your website: Live Demo of Speech Synthesis API

    Now we will see how you can implement the SpeechSynthesis API in your web pages to make them talk.

    Prerequisites

    • A basic understanding of JavaScript and HTML.
    • A Code Editor and browser, supporting Web Speech API, to view webpages.

    Browser Compatibility

    The Web Speech API has been around for a while, but it is not supported by every browser. You need Chrome version 33+ or an updated version of Safari.

    Let’s give our blog a voice: a SpeechSynthesis API implementation tutorial

    Let’s start coding. Implementing the SpeechSynthesis API in a blog is quite simple; in its most basic form it takes a single line -

    speechSynthesis.speak(new SpeechSynthesisUtterance('Hey'))

    And we can also control the voice pitch, rate, and volume like this:

    const utterance = new SpeechSynthesisUtterance('Hey')
    utterance.pitch = 1.5  // 0 to 2, default 1
    utterance.volume = 0.5 // 0 to 1, default 1
    utterance.rate = 8     // 0.1 to 10, default 1; 8 is very fast
    speechSynthesis.speak(utterance)

    In this tutorial, we will start from the basics, so it is easy for you to follow. First, we need to create a simple web page with some content on it and three buttons - Play, Pause, and Stop. We will also initialize a flag with a default value of false.

    <div>
        <button id="play">Play</button>
        <button id="pause">Pause</button>
        <button id="stop">Stop</button>
    </div>
    <article>
        <h1>The Hare & the Tortoise</h1>
        <p>A Hare was making fun of the Tortoise one day for being so slow.</p>
        <p>"Do you ever get anywhere?" he asked with a mocking laugh. "Yes," replied the Tortoise.</p>
        <p>They both agreed on a race.</p>
        <p>The Hare was soon far out of sight and, to make the Tortoise feel how easy the race was for him, he lay down under a tree to take a nap.</p>
        <p>The Tortoise kept going slowly and steadily and passed the place where the Hare was sleeping to win the race.</p>
        <p>The Hare woke up and started running but it was already too late.</p>

        <!-- More text... -->
        <blockquote>Moral of the story...Slow and Steady wins the race.</blockquote>
    </article>

    You can also create a CSS file to style the page and align the buttons.

    Now we need to check whether the browser supports the SpeechSynthesis API. If it is supported, we will create a reference to it and assign it to a synth variable. To check for API support, we will use the following code:

    onload = function() {
        if ('speechSynthesis' in window) {
            /* speech synthesis supported */
        } else {
            /* speech synthesis not supported: show a warning banner */
            var msg = document.createElement('h5');
            msg.textContent = "Detected no support for Speech Synthesis";
            msg.style.textAlign = 'center';
            msg.style.backgroundColor = 'red';
            msg.style.color = 'white';
            msg.style.marginTop = msg.style.marginBottom = 0;
            document.body.insertBefore(msg, document.querySelector('div'));
        }
    }

    Next, for our three buttons, we will create corresponding JavaScript functions - onClickPlay(), onClickPause(), and onClickStop(). When the user clicks one of the buttons, the matching function is called.

    if ('speechSynthesis' in window) {
        var synth = speechSynthesis;
        var flag = false;

        /* references to the buttons */
        var playEle = document.querySelector('#play');
        var pauseEle = document.querySelector('#pause');
        var stopEle = document.querySelector('#stop');

        /* click event handlers for the buttons */
        playEle.addEventListener('click', onClickPlay);
        pauseEle.addEventListener('click', onClickPause);
        stopEle.addEventListener('click', onClickStop);

        function onClickPlay() {
        }
        function onClickPause() {
        }
        function onClickStop() {
        }
    }

    1. Play: When the Play button is clicked, we first check the flag; if it is false, we set it to true. The code inside that first if block will not execute again until the flag is reset to false.

    We will create a new instance of the SpeechSynthesisUtterance interface to hold information about the narration language, voice, pitch, volume, and speed.

    The SpeechSynthesis.getVoices() method is used to assign a voice to speak. Your device's software may provide several voices, and this method returns an array of all of them. The first voice is selected with utterance.voice = synth.getVoices()[0]; a sketch of picking a specific voice follows.
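    Rather than always taking the first entry, you can pick a voice by language or name. One caveat: in some browsers, notably Chrome, getVoices() can return an empty array until the voiceschanged event has fired. A small sketch (the pickVoice helper is invented for this example):

    /* Pick a voice whose language matches, falling back to the first voice */
    function pickVoice(lang) {
        var voices = synth.getVoices();
        return voices.find(function(v) { return v.lang.indexOf(lang) === 0; }) || voices[0];
    }
    /* Voices may load asynchronously; wait for them before choosing */
    synth.addEventListener('voiceschanged', function() {
        console.log('voices loaded: ' + synth.getVoices().length);
    });
    /* usage: utterance.voice = pickVoice('en'); */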

    Next, SpeechSynthesis.speak() is called to start the narration. The article text is passed as the parameter of the SpeechSynthesisUtterance constructor and assigned to the utterance variable. SpeechSynthesis.paused is used to check whether the narration is paused; if it is, the SpeechSynthesis.resume() method resumes it.

    The onend handler executes when the speech finishes and sets the flag back to false, so the narration can be started again the next time the button is clicked.

    function onClickPlay() {
        if (!flag) {
            flag = true;
            utterance = new SpeechSynthesisUtterance(document.querySelector('article').textContent);
            utterance.voice = synth.getVoices()[0];
            utterance.onend = function() {
                flag = false;
                playEle.className = pauseEle.className = '';
                stopEle.className = 'stopped';
            };
            playEle.className = 'played';
            stopEle.className = '';
            synth.speak(utterance);
        }
        if (synth.paused) { /* unpause/resume narration */
            playEle.className = 'played';
            pauseEle.className = '';
            synth.resume();
        }
    }

    2. Pause: We will create the onClickPause() function to check whether the narration is running or paused. If SpeechSynthesis.speaking is true and SpeechSynthesis.paused is false, onClickPause() pauses the speech with SpeechSynthesis.pause().

    function onClickPause() {
        if (synth.speaking && !synth.paused) { /* pause narration */
            pauseEle.className = 'paused';
            playEle.className = '';
            synth.pause();
        }
    }

    3. Stop: onClickStop() is used to stop the narration if it is in progress. It calls SpeechSynthesis.cancel() to remove all utterances from the queue and stop the narration.

    function onClickStop() {
        if (synth.speaking) { /* stop narration */
            /* for safari */
            stopEle.className = 'stopped';
            playEle.className = pauseEle.className = '';
            flag = false;
            synth.cancel();
        }
    }

    The Scope of Web Speech API

    The Web Speech API is useful in many fields, including education, data entry, voice-controlled user interfaces, and making web applications more accessible to people with disabilities, generally enhancing the user experience. In this tutorial, we explored how to add voice to a blog and let it talk to the user with the SpeechSynthesis API.

    As of now, the SpeechSynthesis API is supported by only a few browsers. We hope that it will soon be supported everywhere and gain many more features. Also note that for some voices the browser sends the text to a remote speech service for synthesis, so working offline needs extra consideration. It will be good to see applications being controlled just by voice.

    What is your experience of working with the Web Speech API? Where do you see it being used? Share your experience and questions about using it with us.

