Python HTTP CLI | Speech Recognition and Synthesis via HTTP services

Overview


The CLI is currently intended for use on *nix systems.

The Python Command Line Interface (CLI) gives developers an easy-to-use, interactive way to explore the NDEV ASR and TTS HTTP services.

The CLI also includes utilities for capturing and resampling audio.

The following utilities are available:

Service          Command
Capturing Audio  record_wav.py some_output_file.wav
Resample Audio   resample.sh some_sample_file.wav
ASR              asr.py some_sample_file.{wav,spx,ogg} --lang=en_US
Streaming ASR    asr_stream.py --lang=tr_TR
TTS              tts.py some_output_file.{wav,spx,ogg,mp3} --lang=zh_HK

Getting Started


Before proceeding, make sure you have created an app profile on ndevmobile.com and have specified that the app uses the HTTP services.

Mac Specific Setup

These installation instructions assume you use Homebrew to manage your packages.

brew install libsamplerate
brew install portaudio

Linux Specific Setup

Replace apt-get with the appropriate package manager for your system.

apt-get install libsamplerate

Install portaudio via these instructions

Common Setup

Once you have installed portaudio, install the required Python modules. Before installing the modules in requirements.txt, make sure that numpy is installed. If it is not, install it:

pip install numpy 

With numpy installed, you can optionally create a virtualenv for the project. Either way, install the required packages:

pip install -r requirements.txt 

Usage

Execute the following command within the root directory of the project.

export PYTHONPATH=$PYTHONPATH:`pwd`

Download


To download these utilities, you can clone the project from GitHub:

   git clone git@github.com:NuanceDev/ndev-python-http-cli.git

OR

download a zip file

Project Structure


The project’s structure is defined below.

ndev-python-http-cli/
    ├── README.md
    ├── bin
    │   ├── asr.py
    │   ├── asr_stream.py
    │   ├── asr_then_tts.py
    │   ├── play_wav.py
    │   ├── record_and_recognize.sh
    │   ├── record_wav.py
    │   ├── resample.sh
    │   └── tts.py
    ├── credentials.json
    ├── ndev
    │   ├── __init__.py
    │   ├── asr.py
    │   ├── core.py
    │   └── tts.py
    ├── requirements.txt
    └── setup.py

The ndev directory houses all the necessary interfaces for the NDEV HTTP API.

Peruse this code to learn about various aspects of the APIs for both ASR and TTS, like the languages available or the sampling rates available for a given codec.

The bin directory houses all the scripts that you are able to run. When invoking a script, you will want to do so in the root directory of the CLI, as the scripts currently rely on credentials.json being present there.

Credentials


Overview

In order to work with the CLI, you will need to have an NDEV developer account.

Once you have an account, create an application, and be sure to specify that you’re using the HTTP service. You will receive an email with credentials and will be able to access the dev portal to see those credentials at any time.

These are the properties that you will need for the Python CLI:

  • appId
  • appKey
  • asrHost
  • asrEndpoint
  • ttsHost
  • ttsEndpoint

Supplying Credentials

The credentials for your app are needed in order to make requests through the CLI. Define them in the credentials.json file.

Here is an example of what the file looks like:

{
  "appId": "HTTP_NMDP_MyApp_20130506033030",
  "appKey": "[128-char-key]",
  "asrUrl": "[protocol://pathname:port]",
  "asrEndpoint": "/dictation",
  "ttsUrl": "[protocol://pathname:port]",
  "ttsEndpoint": "/tts"
}
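
The sketch below shows one way a script could read credentials.json and fail early when a value is missing. It is an illustration, not the code in ndev/core.py: the function name load_credentials is hypothetical, and the key names are taken from the JSON example above (which uses asrUrl and ttsUrl rather than the asrHost and ttsHost named in the property list).

```python
import json

# Key names copied from the example credentials.json in this document.
REQUIRED_KEYS = ("appId", "appKey", "asrUrl", "asrEndpoint", "ttsUrl", "ttsEndpoint")


def load_credentials(path="credentials.json"):
    """Read the credentials file and raise if any key is missing or empty.

    Hypothetical helper for illustration; the CLI's own loader may differ.
    """
    with open(path, encoding="utf-8") as handle:
        creds = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise ValueError("credentials file is missing: " + ", ".join(missing))
    return creds
```

Because the default path is relative, a call such as load_credentials() only works from the directory that contains credentials.json, which matches the advice to invoke scripts from the root directory of the CLI.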

Capturing Audio


Note: This utility records audio at the sampling rate of the audio device you choose to capture from. This will most likely be 44.1kHz or 48kHz. You will need to downsample the recorded audio in order to use it with the NDEV HTTP services.

Overview

The CLI provides an interface for capturing audio using the portaudio library. To use the record_wav.py script, provide the name of the wave file to write, including the wav extension, like so:

  python bin/record_wav.py test.wav

How it works

  1. Pick an Audio Device to capture from
  2. Press a key (Enter) to begin capturing audio
  3. Interrupt the program (Ctrl+C) to stop capturing audio if the recording is shorter than 10 seconds

Here is an example of the output after having recorded a wav file.

Recording to: test.wav

Here are the available audio devices:
[0]  Built-in Microph	Default Sample Rate: 44100
[1]  Built-in Input	Default Sample Rate: 44100
[2]  Built-in Output	Default Sample Rate: 44100

Which device would you like to record audio from: 0

[enter] to begin recording, [ctrl-c] to cancel

o	 recording	(ctrl+c to stop)
^Cx	 done recording

Dependencies

The utility relies on the pyaudio Python module, which is installed when you run pip install -r requirements.txt during setup.

Resampling Audio


Overview

If you record audio using the record_wav.py script, you will notice that the wav file is stored at the sample rate it was captured at, possibly 44.1kHz or 48kHz. The NDEV HTTP service requires 8kHz or 16kHz audio, so the file needs to be resampled.

The CLI offers a resample.sh script that wraps the SoX utility.

Specify the wav file that you want to resample and optionally pass in the target sample rate, e.g. 8k or 16000.

Here is an example of how to use the script:

  ./bin/resample.sh test.wav 8k

If you do not define the sample rate, a rate of 16000 will be used as the default. The utility will create a new wav file after resampling, with a naming pattern like [name]_[samplerate].wav.
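
The sketch below shows, in Python, how the rate argument and output name described above could be turned into a SoX command line. It is not a port of resample.sh: the function names are hypothetical, and the exact suffix is an assumption (this document's ASR sample output uses a file called test_16k.wav, so the sketch abbreviates whole thousands as "k"). The command is only built, not executed.

```python
import os


def parse_rate(spec="16000"):
    """Turn a rate such as '8k' or '16000' into an integer number of Hz."""
    text = spec.strip().lower()
    if text.endswith("k"):
        return int(round(float(text[:-1]) * 1000))
    return int(text)


def resampled_name(wav_path, rate_hz):
    """Build a name like [name]_[samplerate].wav (suffix style is assumed)."""
    root, ext = os.path.splitext(wav_path)
    if rate_hz % 1000 == 0:
        suffix = "{}k".format(rate_hz // 1000)
    else:
        suffix = str(rate_hz)
    return "{}_{}{}".format(root, suffix, ext)


def sox_command(wav_path, spec="16000"):
    """Return the argument list for a SoX call that resamples wav_path."""
    rate_hz = parse_rate(spec)
    # In SoX, options placed before the output file apply to the output.
    return ["sox", wav_path, "-r", str(rate_hz), resampled_name(wav_path, rate_hz)]
```

The resulting list can be passed to subprocess.run once SoX is installed.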

Dependencies

To take advantage of the resampling utility, install SoX.

Speech Recognition


All ASR requests are sent using chunked transfer encoding.
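
For readers unfamiliar with chunked transfer encoding, the sketch below frames a sequence of byte buffers the way HTTP/1.1 defines it: each chunk is prefixed with its size in hexadecimal, and a zero-length chunk ends the body. It is only an illustration of the wire format; the function names are hypothetical, and HTTP client libraries normally do this framing for you.

```python
def read_in_chunks(handle, chunk_size=4096):
    """Yield successive blocks of bytes from a binary file-like object."""
    while True:
        data = handle.read(chunk_size)
        if not data:
            break
        yield data


def encode_chunked(chunks):
    """Frame byte buffers as an HTTP/1.1 chunked message body."""
    for chunk in chunks:
        if not chunk:
            continue  # an empty chunk would be read as the end of the body
        size_line = "{:x}\r\n".format(len(chunk)).encode("ascii")
        yield size_line + chunk + b"\r\n"
    yield b"0\r\n\r\n"  # terminating zero-length chunk
```

Because both functions are generators, audio can be sent as it is read or captured, without knowing the total length in advance.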

To perform speech recognition on an audio file using the NDEV HTTP services, use the asr.py script.

This utility will do the following:

  • Determine the appropriate request headers based on the audio file

  • Ask the user for a language to use if one is not defined

  • Build the request using data available and credentials.json

  • Issue the request to the HTTP service

  • Display the top result of the recognition OR display the error message from the server

Usage

Usage: asr.py {source_file.wav} [options]

Options:
  -h, --help            show this help message and exit
  -l LANGUAGE, --lang=LANGUAGE
                        desired language via language code

Sample Output

The asr.py utility prints the details of the ASR request. For example, using a wav file sampled at 16kHz and the language en_US results in the following output:

* analyzing audio stream...

  Request URL      protocol://server:port/endpoint

  Request Params
  ---------------
  appId            --
  appKey           --
  id               --

  Request Headers 
  --------------- 
  Content-Type        audio/x-wav;bit=16;codec=pcm;rate=16000
  Transfer-Encoding   chunked
  Accept              text/plain
  Accept-Topic        Dictation
  Accept-Language     en_US

  Audio Information 
  ----------------- 
  Sample Width        2
  Sample Rate         16000
  Num Channels        1
  Bit Rate            16

  Audio File          test_16k.wav
  Bytes Sent          94366/94366       100% 

* analyzed stream.
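
The sketch below shows how the request headers and URL in the sample output could be derived from a wav file and the credentials. The header names and values and the appId, appKey and id parameter names are copied from that output; everything else is an assumption, including the function names and the way the URL is joined. The document does not describe the id parameter, so it is passed through unchanged.

```python
import wave
from urllib.parse import urlencode


def asr_headers(wav_path, language="en_US"):
    """Build ASR request headers from a wav file's own metadata."""
    with wave.open(wav_path, "rb") as wav_file:
        bits = wav_file.getsampwidth() * 8  # sample width is in bytes
        rate = wav_file.getframerate()
    return {
        "Content-Type": "audio/x-wav;bit={};codec=pcm;rate={}".format(bits, rate),
        "Transfer-Encoding": "chunked",
        "Accept": "text/plain",
        "Accept-Topic": "Dictation",
        "Accept-Language": language,
    }


def asr_url(base_url, endpoint, app_id, app_key, id_value):
    """Join the host, endpoint and query parameters (assumed layout)."""
    query = urlencode({"appId": app_id, "appKey": app_key, "id": id_value})
    return "{}{}?{}".format(base_url.rstrip("/"), endpoint, query)
```

Reading the rate from the file rather than from a command-line option keeps the Content-Type header consistent with the audio that is actually sent.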

Streaming Speech Recognition


Sending audio data in real time while it is being captured greatly improves the user experience when integrating speech into your applications.

The asr_stream.py utility captures audio and streams it in real time for speech recognition.

Usage

Usage: asr_stream.py [options]

Options:
  -h, --help            show this help message and exit
  -l LANGUAGE, --lang=LANGUAGE
                        desired language via language code
  -s SAMPLERATE, --samplerate=SAMPLERATE
                        specify the desired samplerate for audio transfer
  -v, --verbose         see the raw HTTP 

To view the raw bytes being transferred during the request, use the -v (--verbose) flag.

The language is optional. If it is not specified, you will be prompted to choose from the available languages.
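
The fallback from a command-line option to an interactive prompt can be sketched as follows. This is not the code from asr_stream.py: the function name resolve_language is hypothetical, and the three languages listed are a small sample borrowed from the synthesis menu shown elsewhere in this document, so the languages available for recognition may differ.

```python
# Small illustrative sample; not the full list of supported languages.
LANGUAGES = [("Turkish", "tr_TR"), ("UK English", "en_UK"), ("US English", "en_US")]


def resolve_language(code, languages=LANGUAGES, ask=input):
    """Return the language code to use, prompting only when none was given.

    code -- value of the --lang option, or None if it was not supplied
    ask  -- function used to read the user's choice (input by default)
    """
    codes = [lang_code for _, lang_code in languages]
    if code is not None:
        if code not in codes:
            raise ValueError("unsupported language code: " + code)
        return code
    for index, (name, lang_code) in enumerate(languages):
        print(" [{}]\t{:<26}{}".format(index, name, lang_code))
    choice = ask("Which language? ")
    return codes[int(choice)]
```

Passing the prompt function as a parameter keeps the helper easy to test without a terminal.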

Text To Speech


Speech synthesis from text is a compelling feature that can be added to enhance an application.

The CLI TTS utilities encourage experimentation and allow you to store an audio file that is returned from the server based on text and the given language.

Please note that these utilities should not be used to gather samples that can then be used later. This is stated in the Terms of Use.

Usage

Usage: tts.py {destination_file_name.format} {text_to_synthesize} [options]

Options:
  -h, --help            show this help message and exit
  -l LANGUAGE, --lang=LANGUAGE
                        desired language via language code, i.e. en_US
  -r SAMPLERATE, --rate=SAMPLERATE
                        the sample rate to use for the create audio file if
                        relevant, i.e. 16000

Sample

Here is an example of making a TTS request having defined the destination path of a wav file and some text:

./bin/tts.py test.wav "this is a test"


NDEV HTTP Python CLI from Nuance Communications
    for more info see: http://nuancedev.github.io

Select Synthesis Language

 [0]	Arabic                    ar_WW
 [1]	Australian English        en_AU
 [2]	Bahasa (Indonesia)        id_ID
 [3]	Basque                    eu_ES
 [4]	Belgian Dutch             nl_BE
 [5]	Canadian French           fr_CA
 [6]	Cantonese                 zh_HK
 [7]	Catalan                   ca_ES
 [8]	Czech                     cs_CZ
 [9]	Danish                    da_DK
 [10]	Dutch                     nl_NL
 [11]	Finnish                   fi_FI
 [12]	French                    fr_FR
 [13]	German                    de_DE
 [14]	Greek                     el_GR
 [15]	Hindi                     hi_IN
 [16]	Hungarian                 hu_HU
 [17]	Indian English            en_IN
 [18]	Irish English             en_IE
 [19]	Italian                   it_IT
 [20]	Japanese                  jp_JP
 [21]	Korean                    ko_KR
 [22]	Mandarin                  zh_CN
 [23]	Norwegian                 no_NO
 [24]	Polish                    pl_PL
 [25]	Portuguese                pt_PT
 [26]	Portuguese Braz.          pt_BR
 [27]	Romanian                  ro_RO
 [28]	Russian                   ru_RU
 [29]	Scottish English          en_SC
 [30]	Slovak                    sk_SK
 [31]	South African English     en_ZA
 [32]	Spanish Castilian         es_ES
 [33]	Spanish Mexican           es_MX
 [34]	Swedish                   sv_SE
 [35]	Taiwanese Mandarin        zh_TW
 [36]	Thai                      th_TH
 [37]	Turkish                   tr_TR
 [38]	UK English                en_UK
 [39]	US English                en_US

Which language (default: US English)? 39

The following voices are available in en_US..

 [0]  Allison (F)
 [1]  Carol (F)
 [2]  Samantha (F)
 [3]  Tom (M)

Which voice would you like to use? 0

Using Language: US English (en_US)	Voice: Allison

The following sample rates are available for the 'wav' format..

 [0]  8000Hz
 [1]  16000Hz
 [2]  22000Hz

What sample rate would you like to use? 1

Using Sample Rate: 16000

* synthesizing text...

 Request URL
 --------------- 
 [url here]

 Request Headers 
 --------------- 
 Content-Type:	text/plain; charset=utf-8
 Accept:	audio/x-wav;bit=16;codec=pcm;rate=16000

 Making request: 1.038076 seconds, 43648 bytes

* synthesize request complete

✓ TTS

Text synthesized to file -> test.wav

File Formats

The TTS service can create synthesized samples in the following formats:

  • wav (Sample Rates: 8k, 16k, 22k)
  • spx or ogg (Sample Rates: 8k, 16k)
  • mp3
  • amr

The format is determined by the extension of the output file that you specify.

For example, if you specify test.wav, the resulting file will be in wave format (16-bit PCM, as in the Accept header of the sample request above).

Alternatively, if you specify test.mp3, the resulting file will be in mp3 format with a bit rate of 128kbps.
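
A minimal sketch of choosing the format from the file extension is shown below. The formats and sample rates come from the list above; the function name output_format and the error handling are hypothetical, and formats with no rates listed (mp3, amr) simply skip the rate check.

```python
import os

# Sample rates per format, as listed in this document.
# An empty tuple means no rates are listed for that format.
SUPPORTED_RATES = {
    "wav": (8000, 16000, 22000),
    "spx": (8000, 16000),
    "ogg": (8000, 16000),
    "mp3": (),
    "amr": (),
}


def output_format(path, rate=None):
    """Return the output format implied by the file extension of path."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_RATES:
        raise ValueError("unsupported output format: " + (ext or "(none)"))
    rates = SUPPORTED_RATES[ext]
    if rate is not None and rates and rate not in rates:
        raise ValueError("{} does not support a rate of {}".format(ext, rate))
    return ext
```

For instance, output_format("test.wav", 16000) returns "wav", while a request for a 22000 rate with an spx file raises an error.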

If you deal with spx or ogg, you may want to use the speex library to decode it into wave.