Advancing Electrolaryngeal Speech Enhancement Through Speech–Text Representation Learning

Authors: Ding Ma, Jinyi Mi, Fengji Li, Lester Phillip Violeta, Jiajun He, Wenchin Huang, Kazuhiro Kobayashi, and Tomoki Toda
Comments: Published in IEEE Transactions on Biomedical Engineering (TBME).

Abstract: Objective: Laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech for verbal communication. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, all of which degrade naturalness and intelligibility. Although EL-speech-to-normal-speech conversion (EL2SP) based on sequence-to-sequence (seq2seq) voice conversion (VC) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework that integrates speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: Our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. First, a network capable of incorporating auxiliary text information is constructed from pretrained modules to learn integrated speech–text representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model so that it inherits these representations without increasing model complexity. Additional optimization designs are applied across these stages. We introduce three fusion strategies (middle-, input-, and hybrid-level fusion) that progressively enhance learning. Moreover, in addition to the standard seq2seq VC objectives, a reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: Experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentation, outperform baselines relying solely on speech representations in both conversion quality and intelligibility. Furthermore, progressive improvements with increasing system design depth validate the effectiveness of our methods.
Significance: The proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.
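
The abstract mentions that the standard seq2seq VC objectives are complemented by a reconstruction loss on the integrated representation. The following is only a minimal sketch of how such a combined objective could be formed, not the implementation used in the paper. The L1 distance, the weight `repr_weight`, and all function and argument names are assumptions made for illustration; features are plain nested lists (frames x dimensions).

```python
def mean_abs_error(pred, target):
    """Mean absolute error between two equally shaped 2-D sequences."""
    if len(pred) != len(target):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        if len(p_row) != len(t_row):
            raise ValueError("feature dimensions differ")
        for p, t in zip(p_row, t_row):
            total += abs(p - t)
            count += 1
    return total / count


def combined_loss(pred_mel, target_mel, model_repr, integrated_repr,
                  repr_weight=1.0):
    """Hypothetical total loss: VC term plus weighted representation term.

    pred_mel / target_mel: converted and reference acoustic features
        (stand-in for the standard seq2seq VC objective).
    model_repr / integrated_repr: representation produced by the EL2SP
        model and the speech-text integrated representation it should
        reproduce (assumed to be time-aligned here).
    """
    vc_loss = mean_abs_error(pred_mel, target_mel)
    repr_loss = mean_abs_error(model_repr, integrated_repr)
    return vc_loss + repr_weight * repr_loss
```

In this sketch, setting `repr_weight` to 0 recovers the plain VC objective, which makes it easy to compare training with and without the representation term.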

Main concept

Dataset

The Patient-1 and Patient-2 datasets were used to evaluate the proposed methods for EL2SP.

Systems


Conversion Samples on Patient-1

Transcription: 違うバスに乗ってしまったようです。 (chi ga u ba su ni notte shi matta you de su。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2

Transcription: 卵と牛乳がダメです。 (ta ma go to gyuunyuu ga da me de su。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2

Transcription: ご苦労様でした。 (go ku rou sa ma de shi ta。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2

Transcription: 申し訳ありません。 (mou shi wa ke a ri ma sen。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2

Transcription: 着陸にも若干の影響が出そうです。 (cha ku ri ku ni mo jakkan no eikyou ga de sou de su。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2

Conversion Samples on Patient-2

Transcription: 心配なら、ロープウェーもありますよ。 (shinpai na ra, roopuwee mo a ri ma su yo。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2

Transcription: ンゴニは、素朴な弦楽器である。 (ngo ni wa, soboku na gengakki de a ru。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2

Transcription: ロープウェーで上りおりできます。 (roopuwee de noboriori de ki ma su。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2

Transcription: カープールレーンは渋滞しません。 (kaapuuru reen wa juutai shi ma sen。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2

Transcription: プロパンガスとトイレットペーパーが含まれます。 (puropangasu to toiretto peepaa ga fuku ma re ma su。)


Reference: EL speech | Normal speech

Conversion: Baseline 1 | Baseline 2 | P-MF-2 | P-IF-2 | P-HF-2