Spoken Language Processing

Text-To-Speech System: The task of transforming text into spoken words may be decomposed into several subtasks:

  1. Decide how to read the text, or how to expand abbreviations, numeric expressions, etc. into words. Examples: Dr./doctor; 12/twelve; ASAP/as soon as possible.
  2. Decide how to pronounce each word, or how to convert letters (graphemes) into sounds (phonemes). Examples: house/h aw s; beef/b iy f; speech/s p iy ch. (Steps 1 and 2 are sketched in code after this list.)
  3. Decide what prosody to use for each phrase, or how to assign specific stress, rhythm and intonation patterns to the target sequence of phonemes.
  4. Synthesize the target sequence of phonemes. There are several ways of achieving this. The most common, called concatenative synthesis, consists of recording a large amount of speech from a single speaker and later copy-pasting individual phoneme-like units to generate the intended message (see the second sketch below).
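
To make steps 1 and 2 concrete, here is a minimal sketch in Python. The abbreviation table, number table and pronunciation lexicon below are toy stand-ins invented for illustration; a real front end relies on large curated resources and on trained letter-to-sound models for words missing from the lexicon.

```python
import re

# Toy resources invented for illustration only.
ABBREVIATIONS = {"Dr.": "doctor", "ASAP": "as soon as possible"}
NUMBERS = {"12": "twelve"}
LEXICON = {
    "house": ["h", "aw", "s"],
    "beef": ["b", "iy", "f"],
    "speech": ["s", "p", "iy", "ch"],
}

def normalize(text):
    """Step 1: expand abbreviations and numeric expressions into plain words."""
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.extend(ABBREVIATIONS[token].split())
        elif token in NUMBERS:
            words.append(NUMBERS[token])
        else:
            words.append(re.sub(r"[^\w']", "", token).lower())
    return words

def to_phonemes(words):
    """Step 2: look each word up in the pronunciation lexicon; as a crude
    fallback, spell out-of-vocabulary words letter by letter."""
    phonemes = []
    for word in words:
        phonemes.extend(LEXICON.get(word, list(word)))
    return phonemes

print(to_phonemes(normalize("Dr. Smith gave a speech")))
```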

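A matching sketch of step 4, under the simplifying assumption of a unit database holding exactly one recorded waveform per phoneme. Real concatenative systems store many context-dependent units, choose among them by target and join costs, and smooth the joins with dedicated signal processing; here we only overlap-add a short linear crossfade at each join.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz; an assumed value for this sketch

def synthesize(phonemes, unit_db, crossfade_ms=10):
    """Concatenative synthesis: copy-paste prerecorded unit waveforms,
    overlap-adding a short linear crossfade at each join to soften
    audible discontinuities."""
    fade = int(SAMPLE_RATE * crossfade_ms / 1000)
    ramp_out = np.linspace(1.0, 0.0, fade)
    ramp_in = 1.0 - ramp_out
    out = np.zeros(0)
    for ph in phonemes:
        unit = unit_db[ph].astype(float)  # one recorded exemplar per phoneme
        if out.size >= fade and unit.size >= fade:
            joined = out[-fade:] * ramp_out + unit[:fade] * ramp_in
            out = np.concatenate([out[:-fade], joined, unit[fade:]])
        else:
            out = np.concatenate([out, unit])
    return out

# Toy usage: stand-in "recordings" are noise bursts of varying length.
rng = np.random.default_rng(0)
unit_db = {ph: rng.standard_normal(int(rng.integers(800, 2400)))
           for ph in ["s", "p", "iy", "ch"]}
waveform = synthesize(["s", "p", "iy", "ch"], unit_db)
```
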
We are currently building an Argentinian Spanish TTS system, aiming in the medium term at developing assistive technologies such as screen readers, potentially useful to people who are visually impaired or illiterate.

Prosody Modeling: Often without noticing it, we constantly alter the way we talk for countless reasons. We manipulate the intonation, rhythm and stress (the ‘prosody’) of our speech to ask questions, to answer them, to structure our discourse, to coordinate our conversations, to express our feelings, to lie, to tell jokes, to confide secrets, etc.

With few exceptions, state-of-the-art speech processing systems are designed to process ‘what’ is said, but not ‘how’ it is said. As a consequence, utterances such as “John saw a man on the hill with a telescope”, which can only be disambiguated prosodically in speech, are usually mishandled by current systems.
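
As a toy illustration, the sketch below encodes the two readings of that sentence with ToBI-style break indices, a standard annotation of boundary strength after each word. The particular index values are our own illustrative choices, not the output of any real system: the instrument reading places a strong break before “with a telescope”, while the reading in which the man holds the telescope keeps the whole noun phrase in a single prosodic unit.

```python
# ToBI-style break indices (0-4) mark the strength of the prosodic
# boundary after each word; the values below are toy annotations.

instrument_reading = [      # John used the telescope to see the man
    ("John", 1), ("saw", 1), ("a", 0), ("man", 1),
    ("on", 0), ("the", 0), ("hill", 3),
    ("with", 0), ("a", 0), ("telescope", 4),
]

attribute_reading = [       # the man on the hill has the telescope
    ("John", 1), ("saw", 1), ("a", 0), ("man", 1),
    ("on", 0), ("the", 0), ("hill", 1),
    ("with", 0), ("a", 0), ("telescope", 4),
]

def render(words_with_breaks):
    """Mark strong boundaries (break index >= 3) with '|', roughly where
    a listener would hear a pause."""
    return " ".join(w + (" |" if b >= 3 else "") for w, b in words_with_breaks)

print(render(instrument_reading))  # ... on the hill | with a telescope |
print(render(attribute_reading))  # ... on the hill with a telescope |
```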

This project consists of building computational models that describe different dimensions of prosodic variation in spoken language. We focus on two study languages: English and Spanish.
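
As a flavor of the acoustic measurements such models build on, here is a minimal sketch that summarizes the three classic prosodic correlates of a recording: pitch, energy and timing. It assumes the third-party librosa library and a mono audio file; the summary statistics are purely illustrative, since actual modeling works with full contours aligned to the text.

```python
import numpy as np
import librosa  # assumed available; any pitch tracker would serve

def prosodic_features(path):
    """Summarize intonation (F0), loudness (RMS energy) and timing
    (duration) for the recording at `path`."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    energy = librosa.feature.rms(y=y)[0]
    return {
        "duration_s": len(y) / sr,
        "mean_f0_hz": float(np.nanmean(f0)),                   # average pitch
        "f0_range_hz": float(np.nanmax(f0) - np.nanmin(f0)),   # pitch excursion
        "mean_energy": float(energy.mean()),
    }
```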