
Friday, August 15, 2025

ASR overview 2025

 

What Is Audio Speech Recognition?

Audio speech recognition, also called automatic speech recognition (ASR), is a technology that enables computers and devices to understand and process human speech by converting spoken language into text or actionable commands. Fundamentally, ASR captures audio input (via a microphone), digitizes the sound waves, and then processes them through algorithms to recognize phonemes (basic units of sound), assemble them into words, and produce a transcript or trigger specific tasks [twilio +1]. A minimal end-to-end sketch follows the component list below.

Core components and steps include:

  • Audio capture and preprocessing: Microphones convert voice vibrations into electrical and then digital signals; preprocessing enhances the speech and reduces noise.

  • Acoustic modeling: Maps the digitized signal to phonemes.

  • Language modeling: Predicts word sequences using statistical information and context.

  • Decoding: Combines the acoustic and language model outputs into coherent, context-accurate text [kardome +1].
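
To make the pipeline above concrete, here is a minimal sketch using the open-source openai-whisper package (my choice for illustration; the post does not prescribe a tool), in which a single neural model covers the acoustic modeling, language modeling, and decoding steps:

```python
# Minimal end-to-end ASR sketch with the open-source `openai-whisper` package.
# "meeting.wav" is a placeholder file name; any speech clip will do.
import whisper

model = whisper.load_model("base")         # acoustic + language modeling in one network
result = model.transcribe("meeting.wav")   # capture/preprocess -> decode -> text
print(result["text"])
```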

Main Challenges in Speech Recognition

1. Background Noise: Ambient sounds such as traffic, appliances, or other voices can blur the spoken signal, significantly impacting ASR accuracy. Noise suppression helps, but it is not perfect, especially in complex real-world environments [milvus +2]. (A simple denoising sketch appears after this list.)

2. Accents, Dialects, and Pronunciation Variability: Regional accents, dialects, slang, and non-native pronunciation introduce significant variability, making recognition more difficult if the system isn’t trained on diverse data. Homophones and contextual ambiguities also increase complexity [atltranslate +1].

3. Speech Speed and Volume Fluctuations: Variations in how quickly or slowly people speak, as well as changes in loudness, challenge systems optimized for 'average' speech patterns [waywithwords].

4. Contextual Understanding: Disambiguating homophones or similar-sounding words requires context-aware models, which add computational and design complexity [milvus +1].

5. Computational Efficiency and Real-Time Processing: Processing long audio streams or interactive tasks with minimal delay demands significant computing resources and forces a trade-off between accuracy and responsiveness, particularly on mobile or 'edge' devices [milvus].

6. Speaker Identification in Multi-Speaker Scenarios: Recognizing who is speaking and tracking speakers accurately is difficult, making transcription and command targeting less reliable in group settings [atltranslate].
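
As a small illustration of the noise-suppression step mentioned in challenge 1, here is a hedged sketch using the open-source noisereduce package (an assumption for illustration, not something the post specifies), applied before handing audio to a recognizer:

```python
# Hedged sketch: spectral-gating noise suppression before ASR.
# `noisereduce`, `soundfile`, and the file names are illustrative assumptions.
import noisereduce as nr
import soundfile as sf

audio, rate = sf.read("noisy_clip.wav")       # load a mono PCM clip
cleaned = nr.reduce_noise(y=audio, sr=rate)   # estimate the noise profile and gate it out
sf.write("cleaned_clip.wav", cleaned, rate)   # feed this file to the recognizer
```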

State of the Art (2025)

a. Neural Network Architectures: Modern ASR models are built on advanced machine learning, particularly neural networks such as transformers, recurrent neural networks, and state-space models. These models excel at mapping speech to text even in challenging acoustic environments [arxiv +1].

b. Samba-ASR: The new Samba-ASR model uses a novel state-space architecture, replacing traditional transformers for improved computational efficiency and accuracy. It sets new benchmarks with remarkably low word error rates (WER counts substituted, deleted, and inserted words as a fraction of the reference words, so its 1.17% on LibriSpeech Clean is roughly one error per 85 words), outperforming previous state-of-the-art models. It is both faster and more adaptable across various languages, domains, and speaking styles [arxiv].

c. OpenAI's gpt-4o-transcribe: Recent models like gpt-4o-transcribe improve on earlier Whisper-based solutions in accuracy and reliability, especially for diverse accents, noisy environments, and fast or variable speech. These models use reinforcement learning and large-scale, diverse datasets to achieve high performance [openai]. (A short API sketch follows this list.)

d. Multilingual and Accent Robustness: New benchmarks such as ML-SUPERB push models to handle over 150 languages and hundreds of accents, reflecting major progress toward more inclusive, accessible ASR. Models are evaluated on global linguistic diversity and robustness to different speech patterns and background conditions [interspeech2025].
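
For item (c), here is a hedged sketch of calling gpt-4o-transcribe through the OpenAI Python SDK; the file name is a placeholder and an OPENAI_API_KEY in the environment is assumed:

```python
# Hedged sketch: transcribe one clip with OpenAI's gpt-4o-transcribe model.
# Assumes the `openai` Python package and OPENAI_API_KEY are set up;
# "sample_clip.wav" is a placeholder path.
from openai import OpenAI

client = OpenAI()
with open("sample_clip.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
print(transcript.text)
```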

In summary, audio speech recognition has evolved into a highly capable, AI-driven field but still wrestles with real-world variability, noise, and linguistic diversity. Today’s best models, such as Samba-ASR and gpt-4o-transcribe, achieve impressively low error rates and operate efficiently, but ongoing research emphasizes even broader language coverage, context awareness, and noise robustness [ibm +2].

  1. https://www.twilio.com/en-us/blog/insights/ai/what-is-speech-recognition
  2. https://opencv.org/blog/applications-of-speech-recognition/
  3. https://www.kardome.com/blog-posts/difference-speech-and-voice-recognition
  4. https://milvus.io/ai-quick-reference/what-are-common-issues-faced-by-speech-recognition-systems
  5. https://waywithwords.net/resource/challenges-in-speech-data-processing/
  6. https://www.atltranslate.com/ai/blog/automatic-speech-recognition-challenges
  7. https://arxiv.org/html/2501.02832v1
  8. https://openai.com/index/introducing-our-next-generation-audio-models/
  9. https://www.interspeech2025.org/challenges
  10. https://www.ibm.com/think/topics/speech-recognition
  11. https://en.wikipedia.org/wiki/Speech_recognition
  12. https://developer.nvidia.com/blog/essential-guide-to-automatic-speech-recognition-technology/

Sunday, August 3, 2025

speech-to-text corpora, accent

Here are several publicly available speech-to-text corpora, one of which is Meta FAIR’s fairness-oriented dataset (a quick loading sketch follows the list):

  • LibriSpeech ASR Corpus
    A corpus of roughly 1,000 hours of 16 kHz read English speech, derived from LibriVox audiobooks, carefully segmented and aligned. Released under a CC BY 4.0 license. (openslr.org)

  • Multilingual LibriSpeech (MLS)
    A large-scale ASR dataset by Facebook AI Research (Meta), comprising ∼50,000 hours of public-domain audiobooks across eight languages (English, German, Dutch, French, Spanish, Italian, Portuguese, Polish). (Meta AI, voxforge.org)

  • Mozilla Common Voice
    A crowdsourced, multilingual speech corpus with millions of volunteer-recorded, validated sentences and transcriptions, released under CC0 (public domain). (Wikipedia)

  • TED-LIUM v3
    An English ASR corpus of 452 hours of TED talk recordings with aligned transcripts, freely available for research. (openslr.org)

  • VoxForge
    A community-collected GPL-licensed speech corpus in multiple languages, built to support open-source ASR engines (e.g., CMU Sphinx, Julius). (voxforge.org)

  • Fair-Speech Dataset (Meta FAIR)
    A fairness-oriented evaluation set containing 26,471 utterances from 593 U.S. speakers, designed to benchmark bias and robustness in speech recognition. (Meta AI)

  • GigaSpeech
    A multi-domain English ASR corpus featuring 10,000 hours of high-quality transcribed audio (plus 40,000 hours of additional audio for semi-/unsupervised research).

  • VoxPopuli
    Contains roughly 400K hours of unlabeled multilingual speech and 1.8K hours of transcribed speeches in 16 languages (with aligned interpretation pairs), for representation learning and semi-supervised ASR. (arxiv.org)
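
As a quick way to poke at one of these corpora, here is a hedged sketch that streams a few LibriSpeech examples with the Hugging Face datasets library; the dataset id, config, and split names are assumptions about the hub mirror rather than something stated above:

```python
# Hedged sketch: stream a handful of LibriSpeech utterances without
# downloading the full ~1,000-hour corpus. Dataset id/config are assumptions.
from datasets import load_dataset

ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
for example in ds.take(3):
    print(example["text"])                    # reference transcript
    print(example["audio"]["sampling_rate"])  # 16 kHz read speech
```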


Here are several publicly available speech-to-text corpora that include regional and non-native accents—many of which you can filter or mine for Southern Chinese (e.g., Cantonese-influenced) accent patterns (such as /s/ vs /ʃ/ or –ing vs –in):

  • Speech Accent Archive

    A growing, global collection of ~2,500 English recordings of the same elicitation paragraph, each with narrow phonetic transcription and speaker metadata (including L1 and region). You can browse by “Chinese” and then drill down to Cantonese vs. other dialect regions. (ResearchGate, accent.gmu.edu)

  • L2-ARCTIC

    A corpus of non-native English speech from ten speakers (Mandarin, Hindi, Korean, Spanish, and Arabic L1s) reading CMU ARCTIC prompts. It includes orthographic transcripts, forced-aligned phonetic annotations, and expert mispronunciation tags. (psi.engr.tamu.edu)

  • CSLU Foreign-Accented English (Release 1.2)

    ~4,925 telephone-quality utterances by speakers of various L1s (including Chinese), with transcript, speaker background, and perceptual accent ratings. (borealisdata.ca)

  • speechocean762

    5,000 English utterances from 250 non-native speakers (half children), each annotated at the sentence, word, and phoneme level. Designed for pronunciation assessment, freely downloadable via OpenSLR. (arXiv)

  • ShefCE: Cantonese-English Bilingual Corpus

    Audio & transcripts from 31 Hong Kong L2 English learners reading parallel Cantonese and English texts—ideal for studying Cantonese-influenced English phonetics. (orda.shef.ac.uk)

  • Sell-Corpus: Multi-Accented Chinese English Speech

    First open-source English speech corpus covering seven major Chinese dialect regions (including Southern dialects), with recordings & transcripts for accent variation research. (sigport.org)

  • Mozilla Common Voice

    Crowdsourced, multilingual speech data (CC0) with per-speaker accent tags—you can filter English recordings by “Chinese (Hong Kong)” or “Chinese (Mainland)” to get regional accent samples. (Wikipedia)

  • ICNALE Spoken Monologues

    4,400 60-second monologues (~73 h) by 1,100 Asian learners (incl. Mainland China, Hong Kong, Taiwan), with transcripts—useful for comparing Southern vs. Northern Chinese L1 influence on English pronunciation. (language.sakura.ne.jp, language.sakura.ne.jp)

  • International Dialects of English Archive (IDEA)

    Free archive of scripted & unscripted English dialect samples worldwide. Browse the “Asia → China” section to find Cantonese- and Mandarin-accented speakers, all with transcripts. (Wikipedia)

Each of these datasets provides aligned audio and text (and often phonetic detail) that you can mine to analyze pronunciation patterns—like the s/ʃ or –ing/–in contrasts—among Southern Chinese speakers learning or using English.
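
As one hedged starting point for that kind of mining, the sketch below runs a phoneme-level wav2vec2 checkpoint over a clip and does a surface check for the s/ʃ contrast; the checkpoint name and file path are illustrative assumptions, and a real analysis would align the predicted phones against the expected pronunciation of each word.

```python
# Hedged sketch: rough check for the s vs. sh contrast in one recording.
# The phoneme-output checkpoint and the WAV path are assumptions for illustration.
from transformers import pipeline

phone_asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-lv-60-espeak-cv-ft",  # emits eSpeak-style phone strings
)
phones = phone_asr("cantonese_l1_english.wav")["text"]
print(phones)
print("has ʃ:", "ʃ" in phones, "| has s:", "s" in phones)
```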

Sunday, July 13, 2025

goodnight moon audio data and labels?


Children’s Speech Recognition Challenge

https://kidsasr.drivendata.org/

https://github.com/hongqin/goodnight-moon


https://www.drivendata.org/competitions/298/literacy-screening/


Monday, January 20, 2025

whisperx ubuntu workstation

 Tried to install WhisperX to run on the GPU, but ran into trouble with the CUDA libraries, so I defaulted to the CPU instead.
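
For reference, a hedged sketch of the CPU fallback (WhisperX's load_model takes a device and compute type; the model size, batch size, and file name here are just example choices):

```python
# Hedged sketch: run WhisperX on CPU when the CUDA libraries are not set up.
# Model size, batch size, and the audio path are illustrative choices.
import whisperx

model = whisperx.load_model("small", device="cpu", compute_type="int8")
audio = whisperx.load_audio("recording.wav")
result = model.transcribe(audio, batch_size=4)
print(result["segments"])
```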

Tuesday, July 16, 2024

SLP software

 

ASR exploratory task:

Compare some models’ performance on children's speech. Compute the
word error rate (WER) and identify some of the phrases that each model
is not good at; a minimal WER sketch follows the model list below. Some models to think about (and test on):

Wav2vec2-conformer

Wav2vec2

Whisper

NeMo ASR - STT (from NVIDIA)

Paraformer
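
A minimal sketch of the comparison, assuming a small set of (audio, reference) pairs that you supply and using jiwer for WER; the checkpoint names are public Hugging Face models chosen for illustration, not a prescribed list:

```python
# Hedged sketch: score a couple of ASR checkpoints on reference transcripts
# and flag phrases with high WER. Clip paths and transcripts are placeholders.
from transformers import pipeline
import jiwer

checkpoints = {
    "wav2vec2": "facebook/wav2vec2-base-960h",
    "whisper-small": "openai/whisper-small",
}

clips = [
    ("clips/child_001.wav", "goodnight moon goodnight stars"),
    ("clips/child_002.wav", "the cow jumped over the moon"),
]

for name, ckpt in checkpoints.items():
    asr = pipeline("automatic-speech-recognition", model=ckpt)
    refs = [ref.lower() for _, ref in clips]
    hyps = [asr(path)["text"].lower() for path, _ in clips]
    print(f"{name}: overall WER = {jiwer.wer(refs, hyps):.3f}")
    for ref, hyp in zip(refs, hyps):
        if jiwer.wer(ref, hyp) > 0.3:   # flag phrases this model struggles with
            print(f"  hard phrase: '{ref}' -> '{hyp}'")
```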

You might also want to read (for a better understanding of how today’s
deep learning models deal with audio):

https://huggingface.co/learn/audio-course/en/chapter3/ctc

https://huggingface.co/learn/audio-course/en/chapter3/seq2seq

Tuesday, May 28, 2024

Children's speech data

 

https://childes.talkbank.org/access/


This page provides an index to CHILDES corpora, organized by language group and data type. In accordance with TalkBank rules, any use of data from these corpora must cite at least one corpus reference (see citation info on corpus page) and acknowledge CHILDES grant support -- NICHD HD082736.