Xi-Vector Embedding for Speaker Recognition

1. Introduction

🧠 Assumption [composition of the speech signal]

⚠️ Problem

🌳 Motivation (Why?)

2. Neural Speaker Embeddings

3. Uncertainty Modeling in the Latent Space

4. Xi-Vector Embedding with Uncertainty

Frame Uncertainty

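The conventional x-vector model outputs a single value, a "point estimate", for each frame.
The xi-vector model additionally outputs a "frame uncertainty" for each frame.
This is an uncertainty measure that expresses how confident the model is in each frame's estimate.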
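
As a rough illustration of the two per-frame outputs, the sketch below maps frame-level features to a point estimate and a log-precision through two linear heads. The head layout, the log-precision parametrization, and all names (`frame_heads`, `w_z`, `w_l`, the dimensions) are assumptions made for this example, not details taken from the notes above.

```python
import numpy as np

def frame_heads(h, w_z, b_z, w_l, b_l):
    """Map frame features h (T, F) to point estimates and log-precisions, both (T, D)."""
    z = h @ w_z + b_z          # point estimate for each frame
    log_prec = h @ w_l + b_l   # log-precision for each frame (any real value)
    return z, log_prec

rng = np.random.default_rng(0)
T, F, D = 5, 8, 4                    # frames, feature dim, latent dim (illustrative)
h = rng.standard_normal((T, F))      # stand-in for frame-level encoder features
w_z = rng.standard_normal((F, D))    # hypothetical weights of the point-estimate head
w_l = rng.standard_normal((F, D))    # hypothetical weights of the uncertainty head
b_z, b_l = np.zeros(D), np.zeros(D)

z, log_prec = frame_heads(h, w_z, b_z, w_l, b_l)
variance = np.exp(-log_prec)         # larger variance means a less confident frame
```

Predicting the log of the precision keeps the implied variance strictly positive without any explicit constraint on the head's output.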

Gaussian Posterior Inference

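When aggregating frame-level outputs over time, this model uses Gaussian posterior inference.
This method uses each frame's point estimate together with its frame uncertainty to compute a more accurate "posterior mean vector", in which less certain frames carry less weight.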
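
The aggregation can be sketched with the standard conjugate Gaussian update, assuming diagonal precisions and a Gaussian prior: the posterior precision is the prior precision plus the sum of frame precisions, and the posterior mean is the precision-weighted average of the prior mean and the frame estimates. The function name, the log-precision inputs, and the toy values are assumptions of this example.

```python
import numpy as np

def gaussian_posterior_pool(z, log_prec, prior_mean, prior_log_prec):
    """Aggregate frames z (T, D) with log-precisions (T, D) into a Gaussian posterior."""
    prec = np.exp(log_prec)                  # per-frame precision, (T, D)
    prior_prec = np.exp(prior_log_prec)      # prior precision, (D,)
    post_prec = prior_prec + prec.sum(axis=0)
    post_mean = (prior_prec * prior_mean + (prec * z).sum(axis=0)) / post_prec
    return post_mean, post_prec

# Toy input: two confident frames and one very uncertain outlier frame.
z = np.array([[1.0], [3.0], [100.0]])
log_prec = np.log(np.array([[1.0], [1.0], [1e-6]]))
prior_mean = np.zeros(1)
prior_log_prec = np.log(np.array([1e-6]))    # deliberately weak prior

post_mean, post_prec = gaussian_posterior_pool(z, log_prec, prior_mean, prior_log_prec)
plain_mean = z.mean(axis=0)                  # unweighted average, for comparison
```

In this toy case the posterior mean stays close to 2 because the outlier frame has almost no precision, whereas the plain average is pulled far toward the outlier.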

Decoder Input

Instead of the first- and second-order statistical moments (mean and standard deviation) used by the x-vector, the xi-vector feeds the "posterior mean vector" computed above to the decoder.
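
To make the difference in decoder input concrete, the sketch below contrasts x-vector style statistics pooling with a precision-weighted mean. For brevity it omits the prior, which corresponds to a flat-prior limit of the posterior mean; this simplification, the function names, and the dimensions are all assumptions of this example.

```python
import numpy as np

def statistics_pooling(z):
    """x-vector style: concatenate per-dimension mean and standard deviation."""
    return np.concatenate([z.mean(axis=0), z.std(axis=0)])

def precision_weighted_mean(z, log_prec):
    """Posterior mean with the prior omitted (a simplification made in this sketch)."""
    prec = np.exp(log_prec)
    return (prec * z).sum(axis=0) / prec.sum(axis=0)

rng = np.random.default_rng(0)
T, D = 50, 4                             # frames and latent dim (illustrative)
z = rng.standard_normal((T, D))          # stand-in frame-level point estimates
log_prec = rng.standard_normal((T, D))   # stand-in frame-level log-precisions

x_input = statistics_pooling(z)                  # decoder input of size 2 * D
xi_input = precision_weighted_mean(z, log_prec)  # decoder input of size D
```

Under these assumptions the decoder receives a vector of size D rather than 2 * D, and each of its entries is a convex combination of the frame estimates in that dimension.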