Using the tutorial by Patrick von Platen (https://huggingface.co/blog/fine-tune-xlsr-wav2vec2), I managed to fine-tune Wav2Vec2 on annotated audio datasets in a supervised manner.
I now have a custom dataset, dsb-untranscribed, which consists only of unannotated audio data in one specific language. I want to "fine-tune" (i.e., continue pretraining) wav2vec2-xls-r-300m on this audio data before I later do the actual fine-tuning on an annotated dataset. This means I need to train in an unsupervised manner. How do I do this correctly? Here is what I am doing so far:
I have started by loading my dataset as follows:
from datasets import load_dataset
dsb_untranscribed = load_dataset("TiMauzi/dsb-untranscribed")
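To check what I am working with, I printed the dataset; I am assuming throughout this post that it has a single "train" split with an "audio" column:

print(dsb_untranscribed)
print(dsb_untranscribed["train"][0]["audio"])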
Then I did some preprocessing:
from datasets import Audio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Processor

feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)
# `tokenizer` is the Wav2Vec2CTCTokenizer from the supervised tutorial;
# I am not sure a tokenizer is needed at all for unsupervised training
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
# Resample the audio column to the 16 kHz the model expects
dsb_untranscribed = dsb_untranscribed.cast_column("audio", Audio(sampling_rate=16_000))
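To turn the raw waveforms into model inputs, I then map them to input_values with the feature extractor. This is my own adaptation of the supervised tutorial's preparation step (prepare_dataset is just my helper name), so I am not sure it is entirely right for the unsupervised case:

def prepare_dataset(batch):
    audio = batch["audio"]
    # Extract (normalized) input_values from the raw waveform
    batch["input_values"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    return batch

dsb_untranscribed = dsb_untranscribed.map(
    prepare_dataset, remove_columns=dsb_untranscribed["train"].column_names
)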
Then I loaded the pretrained model (I am not sure whether a tokenizer and vocab size are needed here):
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    attention_dropout=0.0,
    hidden_dropout=0.0,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.0,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
# Note: freeze_feature_extractor() is deprecated in newer transformers releases
# in favor of freeze_feature_encoder()
model.freeze_feature_extractor()
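From the documentation I suspect that Wav2Vec2ForCTC is actually the wrong class here, since CTC needs transcriptions. If I understand correctly, the class for the self-supervised (contrastive) objective is Wav2Vec2ForPreTraining, which would not need a tokenizer or vocab size at all. A minimal sketch of how I would load it instead (mask_time_prob=0.05 is just carried over from above):

from transformers import Wav2Vec2ForPreTraining

model = Wav2Vec2ForPreTraining.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    mask_time_prob=0.05,
)
model.freeze_feature_encoder()

Is that the right replacement for the Wav2Vec2ForCTC setup above?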
Now I would need to define my TrainingArguments and Trainer before I can call trainer.train() to start the training process. How do I do this correctly?
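From skimming the official run_wav2vec2_pretraining_no_trainer.py example, my guess is that the Trainer needs a custom data collator that samples the masked time steps and the negative indices for the contrastive loss. Below is a sketch of what I have pieced together; it assumes the Wav2Vec2ForPreTraining model and the input_values from above, it ignores padding/attention masks when sampling the masks (the official example handles those), and every value in TrainingArguments is a placeholder I made up:

from dataclasses import dataclass

import torch
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForPreTraining,
)
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

@dataclass
class DataCollatorForWav2Vec2Pretraining:
    model: Wav2Vec2ForPreTraining
    feature_extractor: Wav2Vec2FeatureExtractor

    def __call__(self, features):
        # Pad the raw waveforms in the batch to the same length
        batch = self.feature_extractor.pad(
            [{"input_values": f["input_values"]} for f in features],
            padding=True,
            return_tensors="pt",
        )
        batch_size, raw_length = batch["input_values"].shape
        # Sequence length after the convolutional feature encoder
        seq_length = int(self.model._get_feat_extract_output_lengths(raw_length))

        # Sample which encoder time steps are masked for the contrastive task ...
        mask_time_indices = _compute_mask_indices(
            (batch_size, seq_length),
            mask_prob=self.model.config.mask_time_prob,
            mask_length=self.model.config.mask_time_length,
        )
        # ... and sample negative (distractor) indices for each masked step
        sampled_negative_indices = _sample_negative_indices(
            (batch_size, seq_length),
            num_negatives=self.model.config.num_negatives,
            mask_time_indices=mask_time_indices,
        )
        batch["mask_time_indices"] = torch.tensor(mask_time_indices, dtype=torch.long)
        batch["sampled_negative_indices"] = torch.tensor(sampled_negative_indices, dtype=torch.long)
        return batch

data_collator = DataCollatorForWav2Vec2Pretraining(
    model=model, feature_extractor=feature_extractor
)

# All values below are placeholders, not recommendations
training_args = TrainingArguments(
    output_dir="./wav2vec2-xls-r-300m-dsb-pretrained",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    warmup_steps=500,
    max_steps=10_000,
    save_steps=1_000,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dsb_untranscribed["train"],
    data_collator=data_collator,
)
trainer.train()

Is this roughly the right direction, or is the Trainer the wrong tool here, so that I should follow the no_trainer/Accelerate example instead?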