Advancing Electrolaryngeal Speech Enhancement Through Speech–Text Representation Learning
Authors: Ding Ma, Jinyi Mi, Fengji Li, Lester Phillip Violeta, Jiajun He, Wenchin Huang, Kazuhiro Kobayashi, and Tomoki Toda
Comments: Published in IEEE Transactions on Biomedical Engineering (TBME).
Abstract:
Objective: laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech for verbal communication. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, all of which degrade naturalness and intelligibility. Although EL-speech-to-normal-speech conversion (EL2SP) based on sequence-to-sequence (seq2seq) voice conversion (VC) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework that integrates speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model.
Methods: our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed from pretrained modules to learn integrated speech–text representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model so that it inherits these representations without increasing model complexity. Additional optimization designs are applied across these stages. We introduce three fusion strategies (middle-, input-, and hybrid-level fusion) that progressively enhance learning. Moreover, in addition to the standard seq2seq VC objectives, a reconstruction loss on the integrated representation is introduced to refine representation transfer.
Results: experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentation, outperform baselines that rely solely on speech representations in both conversion quality and intelligibility. Furthermore, the progressive improvements observed as the system design deepens validate the effectiveness of our methods.
Significance: the proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.
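The abstract mentions a reconstruction loss on the integrated representation that is added to the standard seq2seq VC objectives. The following is a minimal sketch of how such a combined objective could be formed, not the authors' implementation. The L1 distance, the weight `rep_weight`, and the names `vc_loss`, `model_repr`, and `integrated_repr` are all assumptions made for illustration; the page does not specify them.

```python
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("representations must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def total_loss(
    vc_loss: float,
    model_repr: Sequence[float],
    integrated_repr: Sequence[float],
    rep_weight: float = 1.0,
) -> float:
    """Combine a seq2seq VC loss with a representation reconstruction term.

    vc_loss         : value of the standard seq2seq VC objective (assumed given)
    model_repr      : representation produced by the EL2SP model (hypothetical name)
    integrated_repr : speech-text integrated representation used as the target
    rep_weight      : weight of the extra term (assumption, not from the paper)
    """
    return vc_loss + rep_weight * l1_distance(model_repr, integrated_repr)
```

For example, `total_loss(2.0, [1.0, 2.0], [1.0, 4.0], rep_weight=0.5)` adds half of the mean absolute difference between the two representations to the VC loss.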
Main concept
Dataset
The Patient-1 and Patient-2 datasets were used to evaluate the proposed methods for EL2SP.
Systems
EL speech, Normal speech: The original source EL speech and the target normal speech, provided as references.
Baseline 1: A Transformer-based model that is first pretrained and then directly fine-tuned on the original EL2SP datasets.
Baseline 2: A Transformer-based model that is first pretrained and then fine-tuned in two stages with the synthetic data and the original EL2SP datasets.
P-MF-2: The proposed speech–text representation learning method using the middle-level fusion strategy, trained with the synthetic data, the original EL2SP dataset, and text data.
P-IF-2: The proposed speech–text representation learning method using the input-level fusion strategy, trained with the synthetic data, the original EL2SP dataset, and text data.
P-HF-2: The proposed speech–text representation learning method using the hybrid-level fusion strategy, trained with the synthetic data, the original EL2SP dataset, and text data.
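The trained systems listed above differ only in their fusion strategy and in the data they are trained on, which can be summarized as a small lookup table. This is an illustrative sketch: the `SystemConfig` class, its field names, and the short data labels (`"synthetic"`, `"original"`, `"text"`) are hypothetical conveniences rather than anything defined by the authors, and the pretraining stage is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SystemConfig:
    name: str                       # system label as used on this page
    fusion: Optional[str]           # fusion level; None for the baselines
    training_data: Tuple[str, ...]  # data sources used for training


SYSTEMS = (
    SystemConfig("Baseline 1", None, ("original",)),
    SystemConfig("Baseline 2", None, ("synthetic", "original")),
    SystemConfig("P-MF-2", "middle", ("synthetic", "original", "text")),
    SystemConfig("P-IF-2", "input", ("synthetic", "original", "text")),
    SystemConfig("P-HF-2", "hybrid", ("synthetic", "original", "text")),
)


def systems_using(source: str) -> List[str]:
    """Return the names of all systems trained with the given data source."""
    return [s.name for s in SYSTEMS if source in s.training_data]
```

Calling `systems_using("text")` returns only the three proposed systems, which makes the contrast with the speech-only baselines explicit.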
Conversion Samples on Patient-1
Transcription: 違うバスに乗ってしまったようです。 (chi ga u ba su ni notte shi matta you de su。) English: "It seems I got on the wrong bus."
Reference: EL speech, Normal speech
Conversion: Baseline 1, Baseline 2, P-MF-2, P-IF-2, P-HF-2
Transcription: 卵と牛乳がダメです。 (ta ma go to gyuunyuu ga da me de su。) English: "I cannot have eggs or milk."
Reference: EL speech, Normal speech
Conversion: Baseline 1, Baseline 2, P-MF-2, P-IF-2, P-HF-2
Transcription: ご苦労様でした。 (go ku rou sa ma de shi ta。) English: "Thank you for your hard work."
Reference: EL speech, Normal speech
Conversion: Baseline 1, Baseline 2, P-MF-2, P-IF-2, P-HF-2
Transcription: 申し訳ありません。 (mou shi wa ke a ri ma sen。) English: "I am very sorry."
Reference: EL speech, Normal speech
Conversion: Baseline 1, Baseline 2, P-MF-2, P-IF-2, P-HF-2
Transcription: 着陸にも若干の影響が出そうです。 (cha ku ri ku ni mo jakkan no eikyou ga de sou de su。) English: "It looks like the landing will also be slightly affected."
Reference: EL speech, Normal speech
Conversion: Baseline 1, Baseline 2, P-MF-2, P-IF-2, P-HF-2
Conversion Samples on Patient-2
Transcription: 心配なら、ロープウェーもありますよ。 (shinpai na ra, roopuwee mo a ri ma su yo。) English: "If you are worried, there is also a ropeway."
Reference: EL speech, Normal speech
Conversion: Baseline 1, Baseline 2, P-MF-2, P-IF-2, P-HF-2
Transcription: ンゴニは、素朴な弦楽器である。 (ngo ni wa, soboku na gengakki de a ru。) English: "The ngoni is a simple string instrument."
Reference: EL speech, Normal speech
Conversion: Baseline 1, Baseline 2, P-MF-2, P-IF-2, P-HF-2
Transcription: ロープウェーで上りおりできます。 (roopuwee de noboriori de ki ma su。) English: "You can go up and down by ropeway."
Reference: EL speech, Normal speech
Conversion: Baseline 1, Baseline 2, P-MF-2, P-IF-2, P-HF-2
Transcription: カープールレーンは渋滞しません。 (kaapuuru reen wa juutai shi ma sen。) English: "The carpool lane does not get congested."
Reference: EL speech, Normal speech
Conversion: Baseline 1, Baseline 2, P-MF-2, P-IF-2, P-HF-2
Transcription: プロパンガスとトイレットペーパーが含まれます。 (puropangasu to toiretto peepaa ga fuku ma re ma su。) English: "Propane gas and toilet paper are included."