T5 and mT5
What is mT5?: mT5 is a multilingual variant of T5, pre-trained on a dataset covering 101 languages.
What is T5?: T5 proposed a unified "text-to-text" format for all text-based NLP tasks. For example, classification, QA, and translation tasks used to require different model architectures and output formats. T5 unifies them into a single "Input Text -> Output Text" format in which only pure text is involved.
- T5 uses a basic encoder-decoder Transformer architecture and is pre-trained on a masked language modeling “span-corruption” objective.
Span Corruption: randomly erase some segments of the input text and replace them with special sentinel marks. The model must learn to recover the original input text.
- mT5 mitigates bias and over-/under-fitting across languages by up-sampling lower-resource languages during pre-training.
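A minimal sketch of the span-corruption objective above, on a toy whitespace tokenizer. The `<extra_id_n>` sentinels are T5's actual sentinel tokens; the span positions are hand-picked here rather than randomly sampled:

```python
def span_corrupt(tokens, spans):
    """Toy T5-style span corruption: replace each chosen (start, length)
    span with a sentinel; the target lists the spans after their sentinels."""
    src, tgt = [], []
    i, sid = 0, 0
    for start, length in sorted(spans):
        src += tokens[i:start] + [f"<extra_id_{sid}>"]
        tgt += [f"<extra_id_{sid}>"] + tokens[start:start + length]
        i = start + length
        sid += 1
    src += tokens[i:]
    tgt.append(f"<extra_id_{sid}>")  # final sentinel terminates the target
    return src, tgt

tokens = "Thank you for inviting me to your party last week".split()
src, tgt = span_corrupt(tokens, [(2, 2), (8, 1)])
# src: Thank you <extra_id_0> me to your party <extra_id_1> week
# tgt: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```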
G2P-DDM
G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model
Sign Language Production (SLP): automatically translating spoken language into sign sequences.
G2P-DDM mainly focuses on the transformation of sign gloss sequences into their corresponding sign pose sequences (G2P).
Novel method: convert the continuous pose-space generation problem into a discrete sequence generation problem.
Three steps approach:
- Pose-VQVAE: quantizes a 3D skeleton sequence into tokens. Separates the skeleton into body, left hand, and right hand, each with an independent codebook.
- G2P-DDM + CodeUnet: conducts mask-and-replace discrete diffusion on the discrete latents, reconstructing with a U-Net + Transformer based on temporal and spatial features.
- Sequential-KNN: non-autoregressive generation needs a target length. Uses the semantic similarity of tokens in a kNN-based density-peaks clustering algorithm to obtain the frame length of each gloss, then predicts the total length at inference time.
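The paper's density-peaks clustering is not reproduced here; as a stand-in for the idea of "count how many frames belong to each gloss", the sketch below uses a simple monotonic-alignment DP over toy frame/gloss feature vectors (the similarity measure and all names are assumptions, not the paper's algorithm):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def segment_lengths(frame_feats, gloss_feats):
    """Monotonic alignment DP: assign each frame to one gloss, left to
    right, maximizing total frame-gloss similarity. Returns the number
    of frames aligned to each gloss. Requires len(frames) >= len(glosses)."""
    N, G = len(frame_feats), len(gloss_feats)
    sim = [[dot(f, g) for g in gloss_feats] for f in frame_feats]
    NEG = float("-inf")
    # best[i][j]: best score aligning first i frames to first j glosses,
    # every gloss covering at least one frame
    best = [[NEG] * (G + 1) for _ in range(N + 1)]
    back = [[0] * (G + 1) for _ in range(N + 1)]  # 1 = frame started a new gloss
    best[0][0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, min(i, G) + 1):
            stay, advance = best[i - 1][j], best[i - 1][j - 1]
            back[i][j] = 1 if advance >= stay else 0
            best[i][j] = max(stay, advance) + sim[i - 1][j - 1]
    # backtrack to count frames per gloss
    lengths = [0] * G
    i, j = N, G
    while i > 0:
        lengths[j - 1] += 1
        j -= back[i][j]
        i -= 1
    return lengths
```

Summing the returned lengths gives the total target length for the non-autoregressive decoder.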
Pose VQ-VAE
Discretize Pose Latent Space
Encoder and Separating Skeleton
Input: a sequence of N frames with J joints and K-dimensional features. Separate the skeleton into three subsequences: body pose, left hand, and right hand. Each passes through its own linear layers, a concatenation layer, and a spatial-temporal Transformer to learn long-term dependencies and obtain feature codes.
Multi-codebook Quantization
Quantize each frame's feature codes by nearest-neighbour lookup into three independent codebooks (pose, left hand, and right hand). Each codebook has size V, yielding discrete token embeddings (quantized features).
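A toy sketch of the three-codebook nearest-neighbour lookup. The codebook contents and 2-D features are made up for illustration; real features come from the encoder:

```python
def quantize(feats, codebook):
    """Map each frame feature to the index of the closest codebook
    entry under squared Euclidean distance (nearest-neighbour lookup)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda v: sqdist(f, codebook[v]))
            for f in feats]

# three independent codebooks, one per body part (toy 2-D entries)
codebooks = {
    "pose":       [[0.0, 0.0], [1.0, 1.0]],
    "left_hand":  [[0.0, 1.0], [1.0, 0.0]],
    "right_hand": [[0.5, 0.5], [2.0, 2.0]],
}
frame_feats = {
    "pose":       [[0.9, 1.1]],
    "left_hand":  [[0.1, 0.8]],
    "right_hand": [[1.8, 2.1]],
}
tokens = {part: quantize(frame_feats[part], codebooks[part]) for part in codebooks}
# tokens: {"pose": [1], "left_hand": [0], "right_hand": [1]}
```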
Decoder and SPL (Structured Prediction Layer)
The decoder also adopts the spatial-temporal Transformer architecture. Its output features are likewise separated into the three subsequences, and the key-point sequence is then reconstructed via the SPL, which models the dependencies among human-body key points.
End-to-End Training
The Pose-VQVAE loss combines the reconstruction losses of the three patches with the standard codebook loss using stop-gradient.
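This loss can be written in the standard VQ-VAE form, summed over the three patches (a sketch; the commitment coefficient β and the exact weighting are assumptions, not taken from the paper):

```latex
\mathcal{L} = \sum_{p \in \{\text{body},\,\text{lhand},\,\text{rhand}\}}
\Big(
  \underbrace{\lVert x_p - \hat{x}_p \rVert_2^2}_{\text{reconstruction}}
  + \underbrace{\lVert \operatorname{sg}[z_e(x_p)] - e_p \rVert_2^2}_{\text{codebook}}
  + \beta \underbrace{\lVert z_e(x_p) - \operatorname{sg}[e_p] \rVert_2^2}_{\text{commitment}}
\Big)
```

where sg[·] is the stop-gradient operator, z_e(x_p) the encoder output for patch p, and e_p the selected codebook entry.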
Discrete Diffusion
Modeling in discrete latent space.
Forward Diffusion Markov Chain
Diffuse the discrete codebook indices into pure noise.
Length-varied Mask-and-Replace Expansion
Standard discrete diffusion is typically length-fixed. To address this, G2P-DDM introduces:
- Length Predictor: predicts the length of each gloss and aggregates to get the total length.
- [MASK]/[PAD] marks in the discrete diffusion: the diffusion process becomes recovering the sequence from [MASK]/[PAD] marks.
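One forward mask-and-replace step can be sketched as below. The transition probabilities gamma (mask) and beta (uniform replace) are assumed constants here; the paper's actual per-step transition-matrix schedule is not reproduced:

```python
import random

MASK, PAD = "[MASK]", "[PAD]"

def forward_step(tokens, vocab, gamma=0.3, beta=0.1, rng=random):
    """One mask-and-replace diffusion step: each ordinary token becomes
    [MASK] with probability gamma, a uniformly random vocabulary token
    with probability beta, and is kept otherwise. [MASK]/[PAD] positions
    are absorbing states and stay unchanged."""
    out = []
    for tok in tokens:
        if tok in (MASK, PAD):
            out.append(tok)
            continue
        u = rng.random()
        if u < gamma:
            out.append(MASK)
        elif u < gamma + beta:
            out.append(rng.choice(vocab))
        else:
            out.append(tok)
    return out
```

Iterating this step drives any token sequence toward pure [MASK] noise; the model is trained to invert it.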
CodeUnet
CodeUnet mixes Transformers with the downsample/upsample structure of a U-Net.
- Middle modules: a Transformer encoder and decoder. The encoder (6 layers) reads the conditional features obtained from the gloss; the decoder (2 layers) performs self-attention over the sequence features and cross-attention to the gloss, and uses AdaLN to mix in the time step.
- Contract/expand route (U-shaped): 2 Transformer encoder layers at each level.
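The AdaLN time-step injection mentioned above can be sketched as follows: normalize the features, then scale and shift them with parameters predicted from the time-step embedding. The toy linear maps here are assumptions; in CodeUnet these are learned inside each Transformer block:

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over a single feature vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def ada_ln(x, t_emb, w_scale, w_shift):
    """Adaptive LayerNorm: per-channel scale/shift are linear functions
    of the time-step embedding t_emb (one weight row per channel)."""
    scale = [1.0 + dot(row, t_emb) for row in w_scale]  # init near identity
    shift = [dot(row, t_emb) for row in w_shift]
    h = layer_norm(x)
    return [s * v + b for v, s, b in zip(h, scale, shift)]
```

With zero weights, AdaLN reduces to plain LayerNorm, which is why the "1.0 +" initialization keeps early training stable.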
Pipeline Summary
- Input/Output
- Input: gloss sequence (discrete token sequence)
- Output: skeleton sequence
- Pose Discretization
- Separate Skeleton: separate the skeleton of each frame into pose/left hand/right hand.
- Three patches -> linear layer -> spatial-temporal Transformer -> frame latent vector
- Three patches lookup each codebook nearest neighbour -> get token
- token -> codebook vector -> spatial-temporal Transformer + SPL -> skeleton reconstruction
- Generative Model
- Length Prediction
- use Sequential-KNN + a classifier to predict the frame count of each gloss.
- Get target token sequence total length N.
- Initialize Diffusion
- Construct a [MASK]/[PAD] sequence of length N.
- Discrete Diffusion
- Forward: randomly substitute tokens with [MASK]/[PAD] during training.
- Backward: remove the noise using CodeUnet.
- Condition: gloss tokens (Transformer encoder) -> injected into CodeUnet via cross-attention.
- Time: diffusion step t is injected via AdaLN
- Skeleton Reconstruction
- Input the generated token sequence into Pose-VQVAE decoder.
- Get sequential 3D skeleton sequence.
Summary:
gloss -> length prediction -> discrete diffusion (obtain codebook token sequence) -> Pose-VQVAE decoder -> skeleton sequence.
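The summary line above can be sketched end to end with stubs standing in for the trained models; every function passed in here is a toy stand-in, not the paper's implementation:

```python
def g2p_pipeline(glosses, predict_lengths, denoise, decode_tokens, steps=10):
    """End-to-end G2P-DDM flow:
    gloss -> length prediction -> reverse discrete diffusion -> decode."""
    lengths = predict_lengths(glosses)       # frames per gloss
    n = sum(lengths)                         # total target length
    tokens = ["[MASK]"] * n                  # fully-masked start state
    for t in reversed(range(steps)):         # reverse diffusion
        tokens = denoise(tokens, glosses, t)
    return decode_tokens(tokens)             # Pose-VQVAE decoder

# toy stand-ins so the sketch runs: 2 frames per gloss, one-shot denoiser,
# and a decoder emitting a dummy 3-D key point per token
poses = g2p_pipeline(
    ["HELLO", "WORLD"],
    predict_lengths=lambda gs: [2] * len(gs),
    denoise=lambda toks, gs, t: [0 if x == "[MASK]" else x for x in toks],
    decode_tokens=lambda toks: [[0.0, 0.0, 0.0] for _ in toks],
)
```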
