In aviation I sometimes hear about how dangerous it is to have communications done over the radio compared to text messages. I admit that actively listening to the radio while flying a plane is a lot of multitasking. I sometimes mishear things because my mind is focused on other tasks, and sometimes I find myself forgetting what ATC said. That's why a lot of student pilots start out learning to just fly the plane at a basic level while the instructor handles the radios for them. Once the student is more comfortable, the instructor slowly hands off the radios to the student. I liken it to learning to play a guitar and then layering in singing.
I wanted to see how accurate Watson's Speech to Text (STT) service is on this kind of audio, so I snipped out two short ATC audio clips, described below.
Audio Clip 1 is at an uncontrolled airfield (Lebanon Municipal) where pilots are making position announcements.
Audio Clip 2 is of ATC giving an instruction to an aircraft. For this one, listen and repeat what ATC says once the transmission is done (the FAA requires a verbal readback of this particular transmission).
Watson transcription:
Audio Clip 1: democratic sixty five Charlie clear one nine eleven Lebanon traffic system five three four six zero six not probably one nine moving the patterns of lettuce seven a traffic finance a number one zero five seven whiskey a party for one four forty four aircraft taking off only one model for the fourth quarter
Reference: "lebanon traffic cessna two six two five charlie clear one nine lebanon traffic lebanon traffic cessna five three four six zero taking off runway one nine will remain in the pattern student pilot lebanon lebanon traffic bonanza number one zero five seven whiskey departing one five seven whiskey abort abort abort aircraft taking off runway one nine abort abort abort"
Audio Clip 2: forty three three one two seven zero three seven one less susceptible for
Reference: "southwest forty three sixty two wind two seven zero at six runway one left cleared for takeoff no delay"
To score STT there's a metric called Word Error Rate (WER), which is based on Levenshtein distance. Levenshtein distance counts edits between characters, and WER is a slight modification that counts edits between words. It measures the error rate of a transcription by comparing the hypothesis (what the STT produced) to the reference (what was actually said). If you're interested in the underlying math and code, check out the source at the end. The formula is WER = (S + D + I) / N, where:
S: Substitutions
D: Deletions
I: Insertions
N: Number of words in the reference
For example:
Reference: I have a red dog.
Hypothesis: I a blue dog.
In the example there is one deletion ("have" is dropped), one substitution ("red" becomes "blue"), no insertions, and a total of 5 words in the reference. This results in a WER of (1 + 1 + 0) / 5 = 40%.
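To make the red dog example concrete, here is a minimal sketch of a word-level Levenshtein calculation in Python. This is not the function I actually used (that one is from the source linked at the end); the name `wer` and the whitespace-only tokenization are my own assumptions for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.split()  # assumption: words are separated by whitespace only
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


print(wer("I have a red dog.", "I a blue dog."))  # 0.4
```

Note that punctuation and capitalization count as part of the word here, so "dog." and "dog" would be scored as a substitution. Lowercasing and stripping punctuation from both strings first avoids penalizing the transcript for that.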
Audio Clip 1 WER: 67%
Audio Clip 2 WER: 92%
Both WERs are pretty high. A WER of 25% is average for most STT services, so these clips score far worse than typical audio. I assume the STT would perform better if the models were trained purely on ATC audio. Regardless, it was a fun thing to try.
Technical Review: In Python I downloaded the IBM Watson package and used my authenticator info to send the audio clips to IBM's servers using an API. Watson sends me back a list of items broken down by confidence level. I take the list of items and convert it to one string. The WER calculation is from a defined function that I found online.
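The "convert it to one string" step can be sketched roughly like this. I'm assuming the response is a dictionary with a `results` list, where each item carries an `alternatives` list of `transcript` and `confidence` entries; the function name `join_transcripts` and the sample data are made up for illustration and are not my actual Watson output.

```python
def join_transcripts(response: dict) -> str:
    """Flatten an STT response into a single transcript string."""
    pieces = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            # Assumption: the first alternative is the one to keep.
            pieces.append(alternatives[0]["transcript"].strip())
    return " ".join(pieces)


# Hypothetical response in the assumed shape (not real Watson output).
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "lebanon traffic ", "confidence": 0.52}]},
        {"alternatives": [{"transcript": "clear one nine ", "confidence": 0.71}]},
    ]
}

print(join_transcripts(sample_response))  # lebanon traffic clear one nine
```

The resulting string is the hypothesis that gets compared against the hand-written reference in the WER calculation.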
WER source: https://martin-thoma.com/word-error-rate-calculation/