Use the Microsoft Azure Speech service to extract text from an audio file


Verified environment

Microsoft Azure Speech service
  • July 2022 version
Visual Studio
  • Visual Studio 2022
.NET
  • 6.0

Prerequisites

To try out these Tips, you must first complete the following:

About the sample materials

The audio file is borrowed from the following site. It is not included in the sample code.

External links

Introduction

Suppose, for example, that you record the conversation at a meeting and later listen to the recording in order to transcribe it into minutes. It is easy to imagine how tedious it is to write the contents of a conversation down as text by hand.

This article shows how to use the Microsoft Azure Speech service to automatically convert speech data into text and output it. The example above was a meeting recording, but since these Tips extract text from an audio file, they can be applied to anything that contains spoken conversation.

This article only covers transcribing speech to text, but the Microsoft Azure Speech service can also translate speech while it is being recorded.

About fees

This walkthrough uses the Free tier, so simply following along will not incur any charges. However, the Free tier has time and character limits, so if you need more conversions after trying it out, switch to a paid tier.

See the official page below for pricing. Because this is a cloud service, fees may change over time.

Access the Microsoft Azure portal

Access the following URL in a web browser:

If you are already signed in, the portal is displayed and you are done. If you are not signed in, the following screen appears, so sign in.

When you sign in, the following screen is displayed. The layout may differ depending on the design you have configured.

Create a Speech service in Microsoft Azure

If you type "speech" into the search box at the top of the portal, "Speech service" appears in the results, so select it.

When the "Cognitive Services | Speech service" screen is displayed, select "Create".

The "Create Speech Services" screen opens, so fill in the required fields.

Basics

  • Subscription : Select the subscription to bill. Even the free tier must be tied to some subscription.
  • Resource group : Specifies which resource group will contain the Speech service you create. If you have not created one yet, create one from "Create new" below. A resource group can be thought of as a grouping of related services.
  • Region : Choose a region close to where you expect to use the service most often. Note, however, that prices can vary by region. (Example: Japan East)
  • Name : Any name for this Speech service. If you create more than one Speech service, choose a descriptive name. This name must be globally unique, so you cannot use a name that is already in use elsewhere. (Example: SpeechToText-Test)
  • Pricing tier : Choose "Free F0" to use the service for free, or another tier to use it for a fee. Naturally, the free tier has restrictions. (Example: Free F0)

Once entered, select "Next: Network >" at the bottom.

Network

  • Type : Specifies the networks from which this Speech service can be accessed. To allow access from anywhere without detailed configuration, select "Including the internet...". "Accessible from anywhere" also means "anyone can reach it", but in practice only clients holding the key obtained later can use the service, so this is not a major security problem. (Example: Including the internet...)

Once entered, select "Next: Identity >" at the bottom.

Identity

  • System assigned managed identity : This article uses the Speech service on its own, so this can be left off. (Example: Off)
  • User assigned managed identity : No user identity is created this time, so nothing needs to be added. (Example: none)

Once entered, select "Next: Tags >" at the bottom.

Tags

Tags are not used this time, so leave them unset.

Once entered, select "Next: Review and create >" at the bottom.

Review and create

If there are no problems with what you entered, "Validation succeeded" is displayed. If there is a problem, an error message is shown, so go back and correct the settings.

If everything looks good, click the "Create" button. Deployment will then begin.

After a while, deployment completes and the following screen is displayed. The deployment name looks long, but it is only temporary, so you do not need to worry about it.

Click the "Go to resource" button to confirm that the Speech service has been created.

Generating and retrieving keys

An authentication key is required to access this Speech service from a client. Only programs that hold this key can access the service, so you must take care that the key is not stolen by a third party.

"Client" here refers to any program that uses Microsoft Azure, such as a desktop app, smartphone app, or web app.

To get the key, open the Speech service you created. You can reach it from the dashboard, the resource group, and so on.

When the Speech service page opens, select "Keys and Endpoint" from the menu on the left.

The "Keys and Endpoint" page opens, displaying the items "Key 1", "Key 2", "Location/Region", and "Endpoint".

Either key can be used, but "Key 2" is a spare and in most cases is not used.

Make a note of each value. As the description on the page says, do not carelessly share the key with anyone outside the development team.

If the key is ever leaked, click "Regenerate Key 1" above to issue a new key. In that case, of course, the previous key can no longer be used.

Use the Speech service from a program to extract text from audio

From here on, how the Speech service is used depends on the program. This article accesses it from a .NET desktop application; if you are using another framework, search the web for instructions. The official site also explains how to use it from several languages.

This time we create a WPF desktop app in Visual Studio 2022. With Visual Studio Code it is a little more work because there is no designer, but if you extract just the program portion, the code can also be used in a console app or a web app.

Create Project

Start Visual Studio 2022.

Select Create New Project.

Select WPF Application.

The project name and location can be anything. Once entered, select "Next".

For the framework, select ".NET 6.0". Once set, click the "Create" button.

NuGet settings

You could write the access to the Microsoft Azure API from scratch, but an official client library already exists, so it is easier to use that.

Right-click Dependencies for the solution and select Manage NuGet Packages.

Select the "Browse" tab, enter "Microsoft.CognitiveServices.Speech" in the search input field, and a list will be displayed, so select "Microsoft.CognitiveServices.Speech" and click the Install button.

Click the OK button.

Select I agree.

When the installation is complete, the package is added to the project.
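If you prefer the command line, the same package can be added with the .NET CLI instead of the NuGet UI. This is only a sketch of the equivalent step; it assumes the command is run in the directory that contains the project file.

```shell
# Adds the official Speech SDK package to the current project
dotnet add package Microsoft.CognitiveServices.Speech
```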

Creating the UI

This time we specify a WAV file containing audio, extract the conversation into text with the Speech API, and display the result. For now, environment-dependent values are entered through text boxes, so that the code can be reused by copy and paste.

The screen looks like this. Only the bare minimum is included, so if you want to add a file-browse button, for example, implement it yourself.

MainWindow.xaml is as follows:

<Window x:Class="MicrosoftAzureSpeechToText.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:local="clr-namespace:MicrosoftAzureSpeechToText"
        mc:Ignorable="d"
        Title="Text Extraction Using the Microsoft Azure Speech Service" Height="450" Width="800">
  <Grid>
    <Grid.ColumnDefinitions>
      <ColumnDefinition Width="20*"/>
      <ColumnDefinition Width="80*"/>
    </Grid.ColumnDefinitions>
    <Grid.RowDefinitions>
      <RowDefinition Height="auto"/>
      <RowDefinition Height="auto"/>
      <RowDefinition Height="auto"/>
      <RowDefinition Height="auto"/>
      <RowDefinition Height="auto"/>
      <RowDefinition Height="auto"/>
      <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <Label Content="Key" Margin="4"/>
    <TextBox x:Name="KeyTextBox" Grid.Column="1" HorizontalAlignment="Stretch" Margin="4" Text="" />
    <Label Content="Location/Region" Margin="4" Grid.Row="1"/>
    <TextBox x:Name="RegionTextBox" Grid.Column="1" HorizontalAlignment="Stretch" Margin="4" Text="japaneast" Grid.Row="1"/>
    <Label Content="Language" Margin="4" Grid.Row="2"/>
    <TextBox x:Name="LanguageTextBox" Grid.Column="1" HorizontalAlignment="Stretch" Margin="4" Text="ja-JP" Grid.Row="2"/>
    <Label Content="WAV file path" Margin="4" Grid.Row="3"/>
    <TextBox x:Name="WavFilePathTextBox" Grid.Column="1" HorizontalAlignment="Stretch" Margin="4" Text="" Grid.Row="3"/>
    <Button x:Name="ExecuteButton" Content="Execute" Margin="4" Grid.Row="4" Grid.ColumnSpan="2" FontSize="24" Click="ExecuteButton_Click"/>
    <Label Content="Result" Margin="4,2,4,2" Grid.Row="5"/>
    <TextBox x:Name="ResultTextBox" Margin="8" TextWrapping="Wrap" Text="" Grid.Row="6" Grid.ColumnSpan="2" VerticalScrollBarVisibility="Visible" />
  </Grid>
</Window>

Creating the process

The processing is all contained in the ExecuteButton_Click method. If you want to use it in another framework, use this code as your base.

MainWindow.xaml.cs

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using System;
using System.Media;
using System.Threading.Tasks;
using System.Windows;

namespace MicrosoftAzureSpeechToText
{
  /// <summary>
  /// Interaction logic for MainWindow.xaml
  /// </summary>
  public partial class MainWindow : Window
  {
    public MainWindow()
    {
      InitializeComponent();
    }

    private void AppendLineLog(string log)
    {
      // Use Dispatcher.Invoke because this is written to from asynchronous processing
      Dispatcher.Invoke(()=>ResultTextBox.AppendText(log + Environment.NewLine));
    }

    private async void ExecuteButton_Click(object sender, RoutedEventArgs e)
    {
      // Get the input values from the text boxes
      var key = KeyTextBox.Text;
      var region = RegionTextBox.Text;
      var lang = LanguageTextBox.Text;
      var wavFilePath = WavFilePathTextBox.Text;

      try
      {
        // Play the audio file to confirm that one has been specified
        var wavPlayer = new SoundPlayer(wavFilePath);
        wavPlayer.Play();

        var stopRecognition = new TaskCompletionSource<int>();

        // Configure the Speech service
        var speechConfig = SpeechConfig.FromSubscription(key, region);
        AppendLineLog($"Ready to use the Speech service in {speechConfig.Region}.");

        // Specify the speech recognition language
        // List of available values: https://docs.microsoft.com/ja-jp/azure/cognitive-services/speech-service/language-support?tabs=speechtotext#speech-to-text
        speechConfig.SpeechRecognitionLanguage = lang;

        // Set the input to the WAV file
        using var audioConfig = AudioConfig.FromWavFileInput(wavFilePath);
        using var speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);

        // Notifies that a recognition result has been received.
        // This event fires each time a segment of extraction completes.
        speechRecognizer.Recognized += (s, e) =>
        {
          if (e.Result.Reason == ResultReason.RecognizedSpeech)
          {
            // Indicates that the result contains recognized text.
            var time = TimeSpan.FromSeconds(e.Result.OffsetInTicks / 10000000).ToString(@"hh\:mm\:ss");
            var text = $"{time} {e.Result.Text}";
            AppendLineLog(text);
          }
          else if (e.Result.Reason == ResultReason.NoMatch)
          {
            // Indicates that the speech could not be recognized.
            AppendLineLog("The speech could not be recognized.");
          }
        };

        // Notifies that speech recognition has been canceled.
        speechRecognizer.Canceled += (s, e) =>
        {
          AppendLineLog($"Processing finished. (Reason={e.Reason})");

          if (e.Reason == CancellationReason.Error)
          {
            AppendLineLog($"ErrorCode={e.ErrorCode}\r\n");
            AppendLineLog($"ErrorDetails={e.ErrorDetails}\r\n");
          }

          stopRecognition.TrySetResult(0);
        };

        // Starts continuous recognition. Use StopContinuousRecognitionAsync to stop processing.
        await speechRecognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

        // Waits for completion. Use Task.WaitAny to keep the task rooted.
        Task.WaitAny(new[] { stopRecognition.Task });

        // Stops processing.
        await speechRecognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
      }
      catch (Exception ex)
      {
        // If any exception occurs, output the error message
        AppendLineLog(ex.Message);
      }

      MessageBox.Show("Processing finished.");
    }
  }
}

Most of the explanation is in the code comments, so I will not explain everything in detail, but the important points are as follows:

  • SpeechConfig : configures the Speech service (key and region)
  • AudioConfig : configures the audio input
  • SpeechRecognizer : the class that performs the recognition
  • The audio data is analyzed piece by piece, and each finished text segment is delivered through the SpeechRecognizer.Recognized event as it completes.
  • When processing ends for any reason, the SpeechRecognizer.Canceled event is raised.
  • Processing is started by calling the SpeechRecognizer.StartContinuousRecognitionAsync method and stopped by calling the SpeechRecognizer.StopContinuousRecognitionAsync method.
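For reference, the same flow with the WPF UI stripped away can be sketched as a minimal console program. This is only a sketch: the key, region ("japaneast"), language, and file name ("sample.wav") below are placeholder values you must replace, while the SDK calls themselves (SpeechConfig.FromSubscription, AudioConfig.FromWavFileInput, SpeechRecognizer, and the Recognized/Canceled events) are the same ones used above.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class Program
{
    static async Task Main()
    {
        // Placeholder values -- replace with your own key, region, and WAV file
        var speechConfig = SpeechConfig.FromSubscription("<your-key>", "japaneast");
        speechConfig.SpeechRecognitionLanguage = "ja-JP";

        using var audioConfig = AudioConfig.FromWavFileInput("sample.wav");
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        var done = new TaskCompletionSource<int>();

        // Each finished text segment arrives through the Recognized event
        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
                Console.WriteLine(e.Result.Text);
        };

        // Canceled fires at end of file or on error
        recognizer.Canceled += (s, e) => done.TrySetResult(0);

        await recognizer.StartContinuousRecognitionAsync();
        await done.Task;
        await recognizer.StopContinuousRecognitionAsync();
    }
}
```

Because a console app has no UI thread, the result can simply be written with Console.WriteLine instead of going through Dispatcher.Invoke as in the WPF version.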

Operation check

After building the program, run it, enter the required values, and press the Execute button. If you specified a valid audio file, the audio plays and, behind the scenes, the Microsoft Azure Speech service extracts the text segment by segment.

The extraction depends on the Microsoft Azure Speech service, so it is not perfect, but if the recording is quiet and the speakers talk clearly, the text should be extracted with considerable accuracy. Since it is a cloud service, even if the accuracy is not quite there today, it may improve before you know it, without any work on your side.