Speech client v2, my, working OK, example #3578

Open
sorokinvj opened this issue Nov 28, 2023 · 11 comments
Labels
priority: p3 · samples · triage me · type: feature request

Comments

@sorokinvj

sorokinvj commented Nov 28, 2023

Hey guys, I am not sure where to put this, but I just want to share my implementation of the speech client v2 and some thoughts about migrating from v1 to v2.
The official docs unfortunately provide only Python code, and there is not much info apart from this repo and the example I used to migrate: transcribeStreaming.v2.js

My SDK version is:

"@google-cloud/speech": "^6.0.1",

In my code I first initialize the service as:

const service = createGoogleService({ language, send })

and then use service.transcribeAudio(data) whenever new audio comes in from the frontend, which uses

const mediaRecorder = new MediaRecorder(audioStream, { mimeType: 'audio/webm;codecs=opus' }) // it's the default param
mediaRecorder.ondataavailable = (event: BlobEvent) => {
... send the event.data to the backend
}

thus an audio chunk is just a browser Blob object.
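
For completeness, here is a minimal sketch of the backend side, assuming the Blob arrives as a binary WebSocket message via the ws package; the port, language, and the way send is stubbed are illustrative, not my exact production setup:

import { WebSocketServer } from 'ws';
// illustrative import path for the factory shown below
import { createGoogleService } from './transcription/google.service';

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', async ws => {
  // In my setup `send` forwards events to an xstate machine; here it simply relays them to the client.
  const service = await createGoogleService({
    language: 'en-US',
    send: event => ws.send(JSON.stringify(event)),
  });

  ws.on('message', data => {
    // Binary frames (the MediaRecorder Blobs) arrive here as a Buffer, which is what transcribeAudio expects.
    service.transcribeAudio(data as Buffer);
  });

  ws.on('close', () => service.stop());
});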

My service:

import { logger } from '../../logger';
import { getText, transformGoogleResponse } from './utils';
import { v2 as speech } from '@google-cloud/speech';
import { StreamingRecognizeResponse } from './google.types';
import { TranscriptionService } from '../transcription.types';
import { MachineEvent } from '../../websocket/websocket.types';
import { Sender } from 'xstate';
import { parseErrorMessage } from '../../../utils';
import { findRecognizerByLanguageCode } from './recognizers';

export const createGoogleService = ({
  language,
  send,
}: {
  language: string;
  send: Sender<MachineEvent>;
}): Promise<TranscriptionService> => {
  return new Promise((resolve, reject) => {
    try {
      const client = new speech.SpeechClient({
        keyFilename: 'assistant-demo.json',
      });

      const recognizer = findRecognizerByLanguageCode(language).name;

      const configRequest = {
        recognizer,
        streamingConfig: {
          config: {
            autoDecodingConfig: {},
          },
          streamingFeatures: {
            enableVoiceActivityEvents: true,
            interimResults: false,
          },
        },
      };

      logger.info('Creating Google service with recogniser:', recognizer);

      const recognizeStream = client
        ._streamingRecognize()
        .on('error', error => {
          logger.error('Error on "error" in recognizeStream', error);
          send({ type: 'ERROR', data: parseErrorMessage(error) });
        })
        .on('data', (data: StreamingRecognizeResponse) => {
          if (data.speechEventType === 'SPEECH_ACTIVITY_END') {
            send({ type: 'SPEECH_END', data: 'SPEECH_END' });
          }
          if (data.results.length > 0) {
            const transcription = transformGoogleResponse(data);
            if (transcription) {
              const transcriptionText = getText(transcription);
              if (!transcriptionText?.length) {
                // if the transcription is empty, do nothing
                return;
              }
              send({ type: 'NEW_TRANSCRIPTION', data: transcriptionText });
            }
          }
        })
        .on('end', () => {
          logger.warn('Google recognizeStream ended');
        });

      let configSent = false;

      const transcribeAudio = (audioData: Buffer) => {
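        // The v2 streaming API expects exactly one config request, sent before any audio on the stream.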
        if (!configSent) {
          recognizeStream.write(configRequest);
          configSent = true;
        }
        recognizeStream.write({ audio: audioData });
      };

      const stop = () => {
        if (recognizeStream) {
          recognizeStream.end();
        }
      };

      resolve({ stop, transcribeAudio });
    } catch (error) {
      logger.error('Error creating Google service:', error);
      reject(error);
    }
  });
};

Migration considerations

  1. To use v2 you need to create a recognizer; I did it with this function:
/**
 * Creates a new recognizer.
 *
 * @param {string} projectId - The ID of the Google Cloud project.
 * @param {string} location - The location for the recognizer.
 * @param {string} recognizerId - The ID for the new recognizer.
 * @param {string} languageCode - The language code for the recognizer.
 * @returns {Promise<object>} The created recognizer.
 * @throws Will throw an error if the recognizer creation fails.
 */
export const createRecognizer = async (
  projectId: string,
  location: string,
  recognizerId: string,
  languageCode: string
) => {
  const client = new v2.SpeechClient({
    keyFilename: 'assistant-demo.json',
  });

  const request = {
    parent: `projects/${projectId}/locations/${location}`,
    recognizer: {
      languageCodes: [languageCode],
      model: 'latest_long',
      // Add any additional configuration here
    },
    recognizerId,
  };

  try {
    console.log('Creating recognizer...', request);
    const [operation] = await client.createRecognizer(request);
    const [recognizer] = await operation.promise();
    return recognizer;
  } catch (error) {
    console.error('Failed to create recognizer:', error);
    throw error;
  }
};
  2. The config object should now be sent as the first piece of data written to the stream, immediately before any audio. So if you did recognizingClient.write(audioData) before, you should now do recognizingClient.write(newConfigWithRecognizer) (but only once!) and then recognizingClient.write({ audio: audioData }) <<< note the object notation (see the sketch after this list).
  3. The config object itself has changed to:
public streamingConfig?: (google.cloud.speech.v2.IStreamingRecognitionConfig|null);

/** Properties of a StreamingRecognitionConfig. */
interface IStreamingRecognitionConfig {

  /** StreamingRecognitionConfig config */
  config?: (google.cloud.speech.v2.IRecognitionConfig|null);

  /** StreamingRecognitionConfig configMask */
  configMask?: (google.protobuf.IFieldMask|null);

  /** StreamingRecognitionConfig streamingFeatures */
  streamingFeatures?: (google.cloud.speech.v2.IStreamingRecognitionFeatures|null);
}
  4. When instantiating the streaming client, use _streamingRecognize() (this will probably change in a future release).
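
To make the ordering in point 2 concrete, here is a minimal sketch of the v1 vs v2 write sequence (recognizeStream and audioChunk are placeholders):

// v1: every write was just raw audio
recognizeStream.write(audioChunk);

// v2: the very first write must be the config request (including the recognizer), exactly once...
recognizeStream.write({
  recognizer, // projects/<project>/locations/<location>/recognizers/<id>
  streamingConfig: {
    config: { autoDecodingConfig: {} },
    streamingFeatures: { interimResults: false },
  },
});

// ...and every subsequent write wraps the audio chunk in an object
recognizeStream.write({ audio: audioChunk });
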
@sorokinvj added the priority: p3, triage me, and type: feature request labels on Nov 28, 2023
@product-auto-label bot added the samples label on Nov 28, 2023
@sorokinvj changed the title from "Speech client v2 example" to "Speech client v2, my working OK example" on Nov 28, 2023
@sorokinvj changed the title from "Speech client v2, my working OK example" to "Speech client v2, my, working OK, example" on Nov 28, 2023
@adambeer

adambeer commented Jan 15, 2024

Does this still work for you?

I created my recognizer in the Cloud Console and am just specifying it as a string. However, I'm getting the error "Error: 3 INVALID_ARGUMENT: Malordered Data Received. Expected audio but none was set. Send exactly one config, followed by audio data.".

I really can't figure out what I'm doing wrong...

Edit: @sorokinvj I fixed it so the audio data sends, but I get no response back from STT. The stream ends up timing out. I have interimResults set to true.

@sorokinvj
Author

Yeah, all of this is because you are somehow sending the wrong audio data; if you share your code I might get more insight. @adambeer sorry, I saw it just now. My code runs on a production server, no problems so far.

@MilanHofmann

MilanHofmann commented Mar 29, 2024

I get

code: 7,
details: 'The caller does not have permission',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }

when I execute recognizeStream.write(streamingRecognizeRequest);

This is my code:

const config: IRecognitionConfig = {
  languageCodes: ['de-DE'],
  model: 'chirp',
  autoDecodingConfig: {},
};

const streamingRecognitionConfig: google.cloud.speech.v2.IStreamingRecognitionConfig = {
  config: config,
  streamingFeatures: {
    interimResults: true,
  },
};

const streamingRecognizeRequest: google.cloud.speech.v2.IStreamingRecognizeRequest = {
  recognizer: `projects/${process.env.FIREBASE_PROJECT_ID}/locations/asia-southeast1/recognizers/_`,
  streamingConfig: streamingRecognitionConfig,
};

const recognizeStream = client
  ._streamingRecognize()
  .on('error', (err) => {
    console.error(err);
  })
  .on('data', async (data) => {
    console.log('Data:', data);
  });

recognizeStream.write(streamingRecognizeRequest);
for (let i = 0; i < buffer.length; i += 1024) {
  // to byte array
  const data = Uint8Array.from(buffer.slice(i, i + 1024));
  recognizeStream.write({ audioContent: data });
}
recognizeStream.end();

While a non-streaming request works well:

const recognitionOutputConfig: IRecognitionOutputConfig = {
  gcsOutputConfig: {
    uri: `${STAGING_BUCKET_URL}/transcriptions/`,
  },
};

const request: RecognizeBatchRequest = {
  processingStrategy: 'DYNAMIC_BATCHING',
  toJSON(): { [p: string]: any } {
    return {};
  },
  recognitionOutputConfig,
  recognizer: `projects/${process.env.FIREBASE_PROJECT_ID}/locations/asia-southeast1/recognizers/_`,
  config: config,
  // content: buffer,
  files: [
    {
      uri: `${STAGING_BUCKET_URL}/audio_recordings/${audioRecordingId}`,
    },
  ],
};

Any idea why? The service account I use has Cloud Speech admin permissions.
Is it maybe because of my location asia-southeast1?
When I change the model to "short" or "long" I get 'The language "auto" is not supported by the model "short" in the location named "asia-southeast1".'
This happens with any other language as well.

PS: sorry for the non-code formatting.

@carstarai

carstarai commented Apr 30, 2024

I am getting Error: 3 INVALID_ARGUMENT: Invalid resource field value in the request.
I can't track it down. Can you please help? Here are the relevant parts of my code:

import { v2 as speech } from '@google-cloud/speech';

const speechClient = new speech.SpeechClient();

const request = {
  recognizer: 'projects/redacted/locations/us-central1/recognizers/redacted',
  streamingConfig: {
    config: {
      languageCode: 'en-US',
    },
    streamingFeatures: {
      enableVoiceActivityEvents: true,
      interimResults: false,
    },
  },
};

const recognizeStream = speechClient.streamingRecognize(request)

NOTE: I am on node 18.

@sorokinvj
Author

I am getting Error: 3 INVALID_ARGUMENT: Invalid resource field value in the request. I can't track it down. Can you please help? Here are the relevant parts of my code:

import { v2 as speech } from '@google-cloud/speech';

const speechClient = new speech.SpeechClient();

const request = { recognizer:'projects/redacted/locations/us-central1/recognizers/redacted', streamingConfig: { config: { languageCode: 'en-US', }, streamingFeatures: { enableVoiceActivityEvents: true, interimResults: false, }, }, };

const recognizeStream = speechClient.streamingRecognize(request)

NOTE: I am on node 18.

I don't know, man, sorry. Your code is doing something different from mine, but to understand your use case I would need to see more than what you've shared.

  1. You are using .streamingRecognize vs. _streamingRecognize in my code. I am not sure I understand what your function is doing.
  2. Moreover, your setup looks like you want to instantiate a stream by calling streamingRecognize once? For streaming you need to continuously write chunks of audio data, which you are missing (see the sketch below).
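
Roughly what I mean, as a sketch (the recognizer path and the onAudioChunk callback are placeholders):

const recognizeStream = speechClient._streamingRecognize()
  .on('error', console.error)
  .on('data', data => console.log('Transcription data:', data));

// first write: the config, exactly once
recognizeStream.write({
  recognizer: 'projects/<project>/locations/<location>/recognizers/<id>',
  streamingConfig: {
    config: { autoDecodingConfig: {} },
    streamingFeatures: { interimResults: false },
  },
});

// then keep writing audio chunks as they arrive, for as long as the stream is open
onAudioChunk((chunk: Buffer) => {
  recognizeStream.write({ audio: chunk });
});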

@sorokinvj
Author

sorokinvj commented May 1, 2024

const config: IRecognitionConfig = { languageCodes: ['de-DE'], model: 'chirp', autoDecodingConfig: {}, };

Can you try not using any model and leaving this field undefined?

const streamingRecognizeRequest: google.cloud.speech.v2.IStreamingRecognizeRequest = { recognizer: `projects/${process.env.FIREBASE_PROJECT_ID}/locations/asia-southeast1/recognizers/_`, streamingConfig: streamingRecognitionConfig, };

Ensure FIREBASE_PROJECT_ID is set

recognizeStream.write(streamingRecognizeRequest)

It might be useful to have a basic check in the code so that you send this request only once, in the first packet (see the sketch below), though I doubt it has anything to do with the permission issue. Sorry, I have no idea why you might be stuck with this.
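
Something like this, as a sketch of that guard (writeAudio is a placeholder wrapper around your existing stream):

let configSent = false;

const writeAudio = (chunk: Buffer) => {
  if (!configSent) {
    // the config request (with the recognizer) must be the first and only config on the stream
    recognizeStream.write(streamingRecognizeRequest);
    configSent = true;
  }
  recognizeStream.write({ audio: chunk });
};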

@carstarai

carstarai commented May 2, 2024

Thank you for your time. I got the config set correctly. I was also able to get a stream from a local file to transcribe correctly. However, I am getting no response on this.
const parsedMessage = JSON.parse(decodedMessage);

if (parsedMessage.event === "media" && parsedMessage.media) {
  const decodedPayload = Buffer.from(parsedMessage.media.payload, 'base64');
  recognizeStream.write({ audio: decodedPayload });
}

This is a base64 encoded payload coming from a websocket message. The config sends properly, but there is clearly some sort of problem with the buffer and audio data. I verified the incoming data and that the data is sent to _streamingRecognize. These lines worked in v1, by the way.

No error codes. Just no response, then a timeout after cancelling the project.

UPDATE: I just realized the other is only giving a response on stream end. I think this is my problem; this needs to transcribe voice data via a websocket.
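
For what it's worth, a sketch of the knob I think is relevant here: the streamingFeatures block used earlier in this thread has interimResults, which as far as I understand should make results arrive while the stream is still open rather than only at the end (not verified for my setup yet):

const request = {
  recognizer: 'projects/<project>/locations/<location>/recognizers/<id>',
  streamingConfig: {
    config: { autoDecodingConfig: {} },
    streamingFeatures: {
      interimResults: true, // emit partial results while audio is still streaming
    },
  },
};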

@rodrifmed

rodrifmed commented May 2, 2024

@sorokinvj I'm calling it from a Google Cloud Function onCall with the code below and I'm receiving Unhandled error Error: 3 INVALID_ARGUMENT: Invalid resource field value in the request.

I tried without the model, but still the same problem

const client = new speech.v2.SpeechClient();

const config: speech.protos.google.cloud.speech.v2.IRecognitionConfig = {
  languageCodes: ['en-US'],
  features: {
    profanityFilter: true,
    enableAutomaticPunctuation: true,
    enableSpokenEmojis: true,
  },
  autoDecodingConfig: {},
};

const speechRequest: speech.protos.google.cloud.speech.v2.IRecognizeRequest = {
  content: request.data.audio,
  config: config,
};

const [response] = await client.recognize(speechRequest);

@rodrifmed

UPDATE
I was able to make it work with:

const client = new speech.v2.SpeechClient({
    apiEndpoint: `us-central1-speech.googleapis.com`,
});

const config: speech.protos.google.cloud.speech.v2.IRecognitionConfig = {
    languageCodes: ["en-US"],
    model: "long",
    features: {
        profanityFilter: true,
        enableAutomaticPunctuation: true,
        enableSpokenEmojis: true,
    },
    autoDecodingConfig: {},
};

const speechRequest: speech.protos.google.cloud.speech.v2.IRecognizeRequest = {
    recognizer: "projects/{{FIREBASE_PROJECT_ID}}/locations/us-central1/recognizers/_",
    content: request.data.audio,
    config: config,
};

@carstarai

I can get transcriptions, but the problem I am having is that transcriptions only come back after the stream has ended.

@carstarai

I know the models offer mulaw encoding, but every time I set mulaw on a recognizer in the console it flips back to linear16. Anyone able to help?
