Test run of an old Slurm job worked.
Git cloned the repo to Wahab.
[hqin@wahab-01 fairASR25]$ pwd
This site is to serve as my note-book and to effectively communicate with my students and collaborators. Every now and then, a blog may be of interest to other researchers or teachers. Views in this blog are my own. All rights of research results and findings on this blog are reserved. See also http://youtube.com/c/hongqin @hongqin
Here are several publicly available speech-to-text corpora, one of which is Meta FAIR's fairness-oriented dataset:
LibriSpeech ASR Corpus
A corpus of roughly 1,000 hours of 16 kHz read English speech, derived from LibriVox audiobooks, carefully segmented and aligned. Released under a CC BY 4.0 license. (openslr.org)
Multilingual LibriSpeech (MLS)
A large-scale ASR dataset by Facebook AI Research (Meta), comprising ∼50,000 hours of public-domain audiobooks across eight languages (English, German, Dutch, French, Spanish, Italian, Portuguese, Polish). (Meta AI, voxforge.org)
Mozilla Common Voice
A crowdsourced, multilingual speech corpus with millions of volunteer-recorded, validated sentences and transcriptions, released under CC0 (public domain). (Wikipedia)
TED-LIUM v3
An English ASR corpus of 452 hours of TED talk recordings with aligned transcripts, freely available for research. (openslr.org)
VoxForge
A community-collected GPL-licensed speech corpus in multiple languages, built to support open-source ASR engines (e.g., CMU Sphinx, Julius). (voxforge.org)
Fair-Speech Dataset (Meta FAIR)
A fairness-oriented evaluation set containing 26,471 utterances from 593 U.S. speakers, designed to benchmark bias and robustness in speech recognition. (Meta AI)
GigaSpeech
A multi-domain English ASR corpus featuring 10,000 hours of high-quality transcribed audio (plus 40,000 hours of additional audio for semi-/unsupervised research).
VoxPopuli
Contains roughly 400,000 hours of unlabeled multilingual speech and 1,800 hours of transcribed speeches in 16 languages (with aligned interpretation pairs), for representation learning and semi-supervised ASR. (arxiv.org)
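Most of these corpora ship plain-text transcripts next to the audio. LibriSpeech, for instance, distributes `*.trans.txt` files in which each line is an utterance ID followed by the transcript. A minimal parsing sketch (the sample line below is illustrative, not loaded from the corpus):

```python
# Parse LibriSpeech-style *.trans.txt lines of the form
# "<utterance-id> <transcript>" into a dict keyed by utterance ID.
def parse_trans(lines):
    out = {}
    for line in lines:
        utt_id, _, text = line.strip().partition(" ")
        if utt_id:
            out[utt_id] = text
    return out

sample = ["1089-134686-0000 HE HOPED THERE WOULD BE STEW FOR DINNER"]
transcripts = parse_trans(sample)
print(transcripts["1089-134686-0000"])
```

The same pattern works for any of the OpenSLR-hosted corpora that use one-utterance-per-line transcript files.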
Here are several publicly available speech-to-text corpora that include regional and non-native accents—many of which you can filter or mine for Southern Chinese (e.g., Cantonese-influenced) accent patterns (such as /s/ vs /ʃ/ or –ing vs –in):
Speech Accent Archive
A growing, global collection of ~2,500 English recordings of the same elicitation paragraph, each with a narrow phonetic transcription and speaker metadata (including L1 and region). You can browse by “Chinese” and then drill down to Cantonese vs. other dialect regions. (ResearchGate, accent.gmu.edu)
L2-ARCTIC
A corpus of non-native English speech from ten Mandarin (plus Hindi, Korean, Spanish, Arabic) speakers reading CMU ARCTIC prompts. It includes orthographic transcripts, forced-aligned phonetic annotations, and expert mispronunciation tags. (psi.engr.tamu.edu)
CSLU Foreign-Accented English (Release 1.2)
~4,925 telephone-quality utterances by speakers of various L1s (including Chinese), with transcript, speaker background, and perceptual accent ratings. (borealisdata.ca)
speechocean762
5,000 English utterances from 250 non-native speakers (half children), each annotated at the sentence, word, and phoneme level. Designed for pronunciation assessment, freely downloadable via OpenSLR. (arXiv)
ShefCE: Cantonese-English Bilingual Corpus
Audio & transcripts from 31 Hong Kong L2 English learners reading parallel Cantonese and English texts—ideal for studying Cantonese-influenced English phonetics. (orda.shef.ac.uk)
Sell-Corpus: Multi-Accented Chinese English Speech
First open-source English speech corpus covering seven major Chinese dialect regions (including Southern dialects), with recordings & transcripts for accent variation research. (sigport.org)
Mozilla Common Voice
Crowdsourced, multilingual speech data (CC0) with per-speaker accent tags—you can filter English recordings by “Chinese (Hong Kong)” or “Chinese (Mainland)” to get regional accent samples. (Wikipedia)
ICNALE Spoken Monologues
4,400 60-second monologues (~73 h) by 1,100 Asian learners (incl. Mainland China, Hong Kong, Taiwan), with transcripts—useful for comparing Southern vs. Northern Chinese L1 influence on English pronunciation. (language.sakura.ne.jp)
International Dialects of English Archive (IDEA)
Free archive of scripted & unscripted English dialect samples worldwide. Browse the “Asia → China” section to find Cantonese- and Mandarin-accented speakers, all with transcripts. (Wikipedia)
Each of these datasets provides aligned audio and text (and often phonetic detail) that you can mine to analyze pronunciation patterns—like the s/ʃ or –ing/–in contrasts—among Southern Chinese speakers learning or using English.
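Given phone-level annotations in the style of L2-ARCTIC's mispronunciation tags, the s/ʃ and –ing/–in contrasts can be quantified as substitution rates. A toy sketch (the `(canonical, produced)` pair format is an assumption, not the corpus's actual file layout):

```python
from collections import Counter

def substitution_rates(pairs):
    """pairs: list of (canonical, produced) phone labels for one speaker."""
    counts = Counter(pairs)
    total_s = sum(n for (c, _), n in counts.items() if c == "s")
    s_to_sh = counts.get(("s", "ʃ"), 0)
    total_ng = sum(n for (c, _), n in counts.items() if c == "ŋ")
    ng_to_n = counts.get(("ŋ", "n"), 0)  # "-ing" realized as "-in"
    return {
        "s->sh": s_to_sh / total_s if total_s else 0.0,
        "ng->n": ng_to_n / total_ng if total_ng else 0.0,
    }

# Toy annotated pairs for one speaker.
pairs = [("s", "s"), ("s", "ʃ"), ("ŋ", "n"), ("ŋ", "ŋ"), ("s", "s")]
rates = substitution_rates(pairs)
print(rates)
```

Aggregating these rates by speaker L1 or region (using the metadata each corpus provides) would give the Southern vs. Northern comparison directly.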
Real-time Out-of-distribution Detection in Learning-Enabled Cyber-Physical Systems
Here are the details:
https://arxiv.org/html/2501.10900v1
for AI work
CS Candidacy Exam.
The catalog description of the exam is found at
He’s doing Option 1, which is a summary of papers relevant to his dissertation research topic.
The guidelines for the length of the document are just guidelines; it can be longer. The same goes for the length of the presentation: it can be as long as the committee would like.
https://kidsasr.drivendata.org/
https://github.com/hongqin/goodnight-moon
https://www.drivendata.org/competitions/298/literacy-screening/
ChatGPT repeatedly makes mistakes with the SHAP summary plot.
To select a class label, the class index goes in the third position: shap_vals[:, :, idx].
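A minimal sketch of the indexing. Newer shap explainers return multiclass values shaped (n_samples, n_features, n_classes); the array here is simulated rather than real shap output:

```python
import numpy as np

# Simulate multiclass SHAP values: 100 samples, 5 features, 3 classes.
rng = np.random.default_rng(0)
shap_vals = rng.normal(size=(100, 5, 3))

idx = 2  # class label of interest
per_class = shap_vals[:, :, idx]  # correct: class index in the 3rd position
print(per_class.shape)  # (100, 5)
# shap.summary_plot(per_class, X) would then plot that single class.
```

Indexing the first or second axis instead would silently slice over samples or features, which is the mistake to watch for.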
See the symposium announcement below, or share it with interested colleagues.
National Survey of Family Growth (NSFG), pregnancy data
https://www.cdc.gov/nchs/nsfg/index.htm
TODO: request access to the restricted-use variables.
Aug 23 - Dec 12, 2025, Thursdays 6:00 pm - 8:40 pm.
| Type | Time | Days | Where | Date Range | Schedule Type | Instructors |
|---|---|---|---|---|---|---|
| Scheduled In-Class Meetings | 6:00 pm - 8:40 pm | R | ENGINEERING & COMP SCI BLDG 2120 | Aug 23, 2025 - Dec 12, 2025 | LECTURE | |
https://depts.washington.edu/uwruca/ruca-approx.php
All of Us survey, data codebooks
https://docs.google.com/spreadsheets/d/1pODkE2bFN-kmVtYp89rtrJg7oXck4Fsex58237x47mA/edit?usp=sharing
The paper "A model for the assembly map of bordism-invariant functors" by Levin, Nocera, and Saunier (2025) develops advanced categorical frameworks for algebraic topology, particularly through oplax colimits of stable/hermitian/Poincaré categories and bordism-invariant functors. While not directly addressing machine learning (ML) or large language models (LLMs), its contributions could indirectly influence these fields through three key pathways:
The paper's formalization of oplax colimits and Poincaré-Verdier localizing invariants provides new mathematical tools for structuring compositional systems. This could advance:
Model Architecture Design: abstracting relationships between components (e.g., neural network layers) as bordism-invariant functors could enable more rigorous analysis of model behavior under transformations.
Geometric Deep Learning: topological invariants and assembly maps could refine methods for learning on non-Euclidean data (e.g., graphs, manifolds) by encoding persistence of features under deformations.
The bordism-invariance concept, where structures remain unchanged under continuous deformations, offers a mathematical foundation for invariance principles in ML:
Data Augmentation: formalizing "bordism equivalence" could guide the design of augmentation strategies that preserve semantic content (e.g., image rotations as "topological bordisms").
Robust Feature Extraction: kernels of Verdier projections might model noise subspaces to exclude during feature learning, improving adversarial robustness.
The paper’s explicit decomposition of complex functors (e.g., Shaneson splittings with twists) parallels challenges in LLM-based reasoning:
Program Invariant Prediction: LLMs that infer program invariants could adopt categorical decompositions to handle twisted or hierarchical constraints (e.g., loop invariants in code).
Categorical Data Embeddings: LLM-generated numerical representations of categorical data might leverage bordism-invariance to ensure embeddings respect equivalence classes (e.g., "color" as a deformation-invariant attribute).
The work is highly theoretical, with no direct ML/LLM applications in the paper. Bridging this gap requires:
Translating topological bordisms into data-augmentation pipelines.
Implementing Poincaré-Verdier invariants as regularization terms in loss functions.
Extending LLM-based invariant predictors to handle categorical assembly maps.
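As a toy illustration of the first bridging step (my own assumed example, not from the paper): treat an augmentation as a deformation under which a chosen invariant feature must be preserved, and check that property directly:

```python
import numpy as np

# Treat 90-degree rotations as label-preserving "deformations" and
# verify that a simple invariant feature (total intensity) is
# unchanged, the property a bordism-style augmentation should keep.
def augment_rotations(img):
    return [np.rot90(img, k) for k in range(4)]

img = np.arange(16.0).reshape(4, 4)
views = augment_rotations(img)
invariants = [float(v.sum()) for v in views]
print(invariants)  # all four values equal: the sum is rotation-invariant
```

A real pipeline would replace the pixel sum with a learned or hand-chosen semantic invariant, but the verification pattern is the same.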
While speculative, these connections highlight how advanced category theory could enrich ML’s theoretical foundations and LLMs’ reasoning capabilities.