How can I do speaker identification (diarization) with Microsoft Speech to Text without previous voice enrollment?


In my application, I need to record a conversation between people and there's no room in the physical workflow to take a 20 second sample of each person's voice for the purpose of training the recognizer, nor to ask each person to read a canned passphrase for training. But without doing that, as far as I can tell, there's no way to get speaker identification.

Is there any way to just record, say, 5 people speaking and have the recognizer automatically classify returned text as belonging to one of the 5 distinct people, without previous training?

(For what it's worth, IBM Watson can do this, although it doesn't do it very accurately, in my testing.)


There are 3 best solutions below

Ali Heikal

If I understand your question correctly, Conversation Transcription should be a solution for your scenario: if you don't generate user profiles beforehand, it labels the speakers as Speaker[x] and increments x for each new speaker it detects.

User voice samples are optional. Without this input, the transcription will still distinguish speakers, but they are shown as "Speaker1", "Speaker2", etc. instead of being recognized as pre-enrolled, named speakers.

You can get started with the real-time conversation transcription quickstart.
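
For reference, here is a minimal sketch of what that quickstart looks like with the Python SDK (azure-cognitiveservices-speech); the key, region, and filename are placeholders you would fill in:

import time
import azure.cognitiveservices.speech as speechsdk

# Placeholders: your Speech resource key/region and a WAV file of the conversation
speech_config = speechsdk.SpeechConfig(subscription="<speech_key>", region="<speech_region>")
speech_config.speech_recognition_language = "en-US"
audio_config = speechsdk.audio.AudioConfig(filename="<conversation.wav>")

transcriber = speechsdk.transcription.ConversationTranscriber(
    speech_config=speech_config, audio_config=audio_config)

done = False

def on_transcribed(evt):
    # With no enrolled profiles, speaker_id is a generic label
    # ("Speaker1"/"Guest-1", depending on SDK version)
    print(f"{evt.result.speaker_id}: {evt.result.text}")

def on_stopped(evt):
    global done
    done = True

transcriber.transcribed.connect(on_transcribed)
transcriber.session_stopped.connect(on_stopped)
transcriber.canceled.connect(on_stopped)

# Transcribe the whole file, then stop
transcriber.start_transcribing_async().get()
while not done:
    time.sleep(0.5)
transcriber.stop_transcribing_async().get()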

Heidi Z.

Microsoft Conversation Transcription, which is in Preview, currently targets microphone-array devices, so the input should be recorded with a microphone array. If your recordings come from a common microphone, it may not work without special configuration. You can also try batch diarization, which supports offline transcription but currently diarizes only 2 speakers; support for more than 2 speakers is expected very soon, probably this month.

jayant k

If you are using the REST API, this will be helpful. You can use batch transcription with the "diarizationEnabled" property set to true. You will need to use only 1 channel, and you will also need to give the minimum and maximum number of speakers (it can identify up to 36 speakers). Also, use version 3.1 of the REST API instead of 3.0.

Here is how I used it:

import requests
import time

# Fill in your own Speech resource endpoint and key
speech_resource_endpoint = '<speech_resource_endpoint>'
headers = {
    'Ocp-Apim-Subscription-Key': '<speech_resource_key>',
    'Content-Type': 'application/json'
}

# Use v3.1 of the REST API, as noted above
url = f'{speech_resource_endpoint}/speechtotext/v3.1/transcriptions'

data = {
    'displayName': 'name that you would like for transcription',
    'description': 'Description of your task',
    'locale': 'en-US',
    'contentUrls': ['<audio_url>'],
    'properties': {
        'diarizationEnabled': True,
        'wordLevelTimestampsEnabled': True,
        'displayFormWordLevelTimestampsEnabled': True,
        'channels': [0],
        'diarization': {
            'speakers': {
                'minCount': 1,
                'maxCount': 20
            }
        },
        # in v3.1, languageIdentification sits inside 'properties'
        'languageIdentification': {
            'candidateLocales': ['en-US', 'en-CA']
        }
    },
    'customProperties': {}
}

# Initiate the transcription job
response = requests.post(url, headers=headers, json=data)

# Get the URL of the transcription status and output files  
self_url = response.json()['self']  
print(self_url)
files_url = response.json()['links']['files']

status = ""
while status not in ("Succeeded", "Failed"):
    # Poll the transcription status until it finishes
    response = requests.get(self_url, headers=headers)
    status = response.json()['status']
    print(f"Transcription status: {status}")
    time.sleep(2)

# Request the list of output files
response = requests.get(files_url, headers=headers)

response_json = response.json()

file_url = response_json["values"][0]["links"]["contentUrl"]
print(f"file_url = {file_url}")

Open the printed 'file_url'; in the field called 'recognizedPhrases' you will see each phrase with its speaker identified.

Sample output:

{
  "source": "-----",
  "timestamp": "2023-10-16T18:13:54Z",
  "durationInTicks": 4697600000,
  "duration": "PT7M49.76S",
  "combinedRecognizedPhrases": [
    {
      "channel": 0,
      "lexical": "....whole transcription....",
      "itn": ".....whole transcription....",
      "display": ".....whole transcription...."
    }
  ],
  "recognizedPhrases": [
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 1,
      "offset": "PT0.72S",
      "duration": "PT19.28S",
      "offsetInTicks": 7200000.0,
      "durationInTicks": 192800000.0,
      "nBest": [
        {
          "confidence": 0.27163595,
          "lexical": "----phrase----",
          "itn": "----phrase----",
          "maskedITN": "----phrase----",
          "display": "----phrase----"
        }
      ]
    },
    ...
  ]
}
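
If it helps, once the job has succeeded you can pull out the speaker-labeled text programmatically. This is a sketch continuing the requests-based code above, using the field names shown in the sample output (the result contentUrl is a SAS link, so no auth headers are needed):

# Download the result JSON and print each phrase with its speaker label
result = requests.get(file_url).json()
for phrase in result['recognizedPhrases']:
    speaker = phrase.get('speaker')
    text = phrase['nBest'][0]['display']
    print(f"Speaker {speaker}: {text}")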