Speech client v2, my, working OK, example #3578

Open
sorokinvj opened this issue Nov 28, 2023 · 11 comments
Labels
priority: p3 · samples · triage me · type: feature request

Comments

@sorokinvj

sorokinvj commented Nov 28, 2023

Hey guys, I am not sure where to put this, but I just want to share my implementation of the speech client v2 and some thoughts about migrating from v1 to v2.
The official docs unfortunately provide only Python code, and there is not much info apart from this repo and the example I used to migrate: transcribeStreaming.v2.js

My SDK version is:

"@google-cloud/speech": "^6.0.1",

In my code I first initialize the service as:

const service = createGoogleService({ language, send })

and then use service.transcribeAudio(data) whenever new audio comes in from the frontend, which uses

const mediaRecorder = new MediaRecorder(audioStream, { mimeType: 'audio/webm;codecs=opus' }) // it's the default param
mediaRecorder.ondataavailable = (event: BlobEvent) => {
... send the event.data to the backend
}

thus an audio chunk is just a browser Blob object.
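
For completeness, here is a minimal sketch of the backend side, assuming the Blob arrives as a binary WebSocket message via the ws package; the port, language, and the way send is stubbed are illustrative, not my exact production setup:

import { WebSocketServer } from 'ws';
// illustrative import path for the factory shown below
import { createGoogleService } from './transcription/google.service';

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', async ws => {
  // In my setup `send` forwards events to an xstate machine; here it simply relays them to the client.
  const service = await createGoogleService({
    language: 'en-US',
    send: event => ws.send(JSON.stringify(event)),
  });

  ws.on('message', data => {
    // Binary frames (the MediaRecorder Blobs) arrive here as a Buffer, which is what transcribeAudio expects.
    service.transcribeAudio(data as Buffer);
  });

  ws.on('close', () => service.stop());
});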

My service:

import { logger } from '../../logger';
import { getText, transformGoogleResponse } from './utils';
import { v2 as speech } from '@google-cloud/speech';
import { StreamingRecognizeResponse } from './google.types';
import { TranscriptionService } from '../transcription.types';
import { MachineEvent } from '../../websocket/websocket.types';
import { Sender } from 'xstate';
import { parseErrorMessage } from '../../../utils';
import { findRecognizerByLanguageCode } from './recognizers';

export const createGoogleService = ({
  language,
  send,
}: {
  language: string;
  send: Sender<MachineEvent>;
}): Promise<TranscriptionService> => {
  return new Promise((resolve, reject) => {
    try {
      const client = new speech.SpeechClient({
        keyFilename: 'assistant-demo.json',
      });

      const recognizer = findRecognizerByLanguageCode(language).name;

      const configRequest = {
        recognizer,
        streamingConfig: {
          config: {
            autoDecodingConfig: {},
          },
          streamingFeatures: {
            enableVoiceActivityEvents: true,
            interimResults: false,
          },
        },
      };

      logger.info('Creating Google service with recogniser:', recognizer);

      const recognizeStream = client
        ._streamingRecognize()
        .on('error', error => {
          logger.error('Error on "error" in recognizeStream', error);
          send({ type: 'ERROR', data: parseErrorMessage(error) });
        })
        .on('data', (data: StreamingRecognizeResponse) => {
          if (data.speechEventType === 'SPEECH_ACTIVITY_END') {
            send({ type: 'SPEECH_END', data: 'SPEECH_END' });
          }
          if (data.results.length > 0) {
            const transcription = transformGoogleResponse(data);
            if (transcription) {
              const transcriptionText = getText(transcription);
              if (!transcriptionText?.length) {
                // if the transcription is empty, do nothing
                return;
              }
              send({ type: 'NEW_TRANSCRIPTION', data: transcriptionText });
            }
          }
        })
        .on('end', () => {
          logger.warn('Google recognizeStream ended');
        });

      let configSent = false;

      const transcribeAudio = (audioData: Buffer) => {
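        // The v2 streaming API expects exactly one config request, sent before any audio on the stream.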
        if (!configSent) {
          recognizeStream.write(configRequest);
          configSent = true;
        }
        recognizeStream.write({ audio: audioData });
      };

      const stop = () => {
        if (recognizeStream) {
          recognizeStream.end();
        }
      };

      resolve({ stop, transcribeAudio });
    } catch (error) {
      logger.error('Error creating Google service:', error);
      reject(error);
    }
  });
};

Migration considerations

  1. To use v2 you need to create a recognizer; I did it with this function:
/**
 * Creates a new recognizer.
 *
 * @param {string} projectId - The ID of the Google Cloud project.
 * @param {string} location - The location for the recognizer.
 * @param {string} recognizerId - The ID for the new recognizer.
 * @param {string} languageCode - The language code for the recognizer.
 * @returns {Promise<object>} The created recognizer.
 * @throws Will throw an error if the recognizer creation fails.
 */
export const createRecognizer = async (
  projectId: string,
  location: string,
  recognizerId: string,
  languageCode: string
) => {
  const client = new v2.SpeechClient({
    keyFilename: 'assistant-demo.json',
  });

  const request = {
    parent: `projects/${projectId}/locations/${location}`,
    recognizer: {
      languageCodes: [languageCode],
      model: 'latest_long',
      // Add any additional configuration here
    },
    recognizerId,
  };

  try {
    console.log('Creating recognizer...', request);
    const [operation] = await client.createRecognizer(request);
    const [recognizer] = await operation.promise();
    return recognizer;
  } catch (error) {
    console.error('Failed to create recognizer:', error);
    throw error;
  }
};
  2. The config object should now be sent as the first piece of data written to the stream, immediately before any audio. So if you did recognizingClient.write(audioData) before, you should now do recognizingClient.write(newConfigWithRecognizer) (but only once!) and then recognizingClient.write({ audio: audioData }) <<< note the object notation (see the sketch after this list).
  3. The config object itself has changed to:
public streamingConfig?: (google.cloud.speech.v2.IStreamingRecognitionConfig|null);

/** Properties of a StreamingRecognitionConfig. */
interface IStreamingRecognitionConfig {

  /** StreamingRecognitionConfig config */
  config?: (google.cloud.speech.v2.IRecognitionConfig|null);

  /** StreamingRecognitionConfig configMask */
  configMask?: (google.protobuf.IFieldMask|null);

  /** StreamingRecognitionConfig streamingFeatures */
  streamingFeatures?: (google.cloud.speech.v2.IStreamingRecognitionFeatures|null);
}
  4. When instantiating the streaming client, use _streamingRecognize() (this will probably change in a future release).
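
To make the ordering in point 2 concrete, here is a minimal sketch of the v1 vs v2 write sequence (recognizeStream and audioChunk are placeholders):

// v1: every write was just raw audio
recognizeStream.write(audioChunk);

// v2: the very first write must be the config request (including the recognizer), exactly once...
recognizeStream.write({
  recognizer, // projects/<project>/locations/<location>/recognizers/<id>
  streamingConfig: {
    config: { autoDecodingConfig: {} },
    streamingFeatures: { interimResults: false },
  },
});

// ...and every subsequent write wraps the audio chunk in an object
recognizeStream.write({ audio: audioChunk });
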
@sorokinvj added the priority: p3, triage me, and type: feature request labels on Nov 28, 2023
@product-auto-label bot added the samples label on Nov 28, 2023
@sorokinvj changed the title from "Speech client v2 example" to "Speech client v2, my working OK example" on Nov 28, 2023
@sorokinvj changed the title from "Speech client v2, my working OK example" to "Speech client v2, my, working OK, example" on Nov 28, 2023
@adambeer

adambeer commented Jan 15, 2024

Does this still work for you?

I created my recognizer in the Cloud Console and am just specifying it as a string. However, I'm getting the error "Error: 3 INVALID_ARGUMENT: Malordered Data Received. Expected audio but none was set. Send exactly one config, followed by audio data.".

I really can't figure out what I'm doing wrong...

Edit: @sorokinvj I fixed it so the audio data sends, but I get no response back from STT. The stream ends up timing out. I have interimResults set to true.

@sorokinvj
Author

Yeah, all of this is because you are somehow sending the wrong audio data; if you share your code I might get more insight. @adambeer sorry, I saw it just now. My code runs on a production server, no problems so far.

@MilanHofmann

MilanHofmann commented Mar 29, 2024

I get

code: 7,
details: 'The caller does not have permission',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }

when I execute recognizeStream.write(streamingRecognizeRequest);

This is my code:

const config: IRecognitionConfig = {
  languageCodes: ['de-DE'],
  model: 'chirp',
  autoDecodingConfig: {},
};

const streamingRecognitionConfig: google.cloud.speech.v2.IStreamingRecognitionConfig = {
  config: config,
  streamingFeatures: {
    interimResults: true,
  },
};

const streamingRecognizeRequest: google.cloud.speech.v2.IStreamingRecognizeRequest = {
  recognizer: `projects/${process.env.FIREBASE_PROJECT_ID}/locations/asia-southeast1/recognizers/_`,
  streamingConfig: streamingRecognitionConfig,
};

const recognizeStream = client
  ._streamingRecognize()
  .on('error', (err) => {
    console.error(err);
  })
  .on('data', async (data) => {
    console.log('Data:', data);
  });

recognizeStream.write(streamingRecognizeRequest);
for (let i = 0; i < buffer.length; i += 1024) {
  // to byte array
  const data = Uint8Array.from(buffer.slice(i, i + 1024));
  recognizeStream.write({ audioContent: data });
}
recognizeStream.end();

While a non-streaming request works well:

const recognitionOutputConfig: IRecognitionOutputConfig = {
  gcsOutputConfig: {
    uri: `${STAGING_BUCKET_URL}/transcriptions/`,
  },
};

const request: RecognizeBatchRequest = {
  processingStrategy: 'DYNAMIC_BATCHING',
  toJSON(): { [p: string]: any } {
    return {};
  },
  recognitionOutputConfig,
  recognizer: `projects/${process.env.FIREBASE_PROJECT_ID}/locations/asia-southeast1/recognizers/_`,
  config: config,
  // content: buffer,
  files: [
    {
      uri: `${STAGING_BUCKET_URL}/audio_recordings/${audioRecordingId}`,
    },
  ],
};

Any idea why? The service account I use has Cloud Speech admin permissions.
Is it maybe because of my location asia-southeast1?
When I change the model to "short" or "long" I get 'The language "auto" is not supported by the model "short" in the location named "asia-southeast1".'
This happens with any other language as well.

PS: sorry for the non-code formatting.

@carstarai

carstarai commented Apr 30, 2024

I am getting Error: 3 INVALID_ARGUMENT: Invalid resource field value in the request.
I can't track it down. Can you please help? Here are the relevant parts of my code:

import { v2 as speech } from '@google-cloud/speech';

const speechClient = new speech.SpeechClient();

const request = {
  recognizer: 'projects/redacted/locations/us-central1/recognizers/redacted',
  streamingConfig: {
    config: {
      languageCode: 'en-US',
    },
    streamingFeatures: {
      enableVoiceActivityEvents: true,
      interimResults: false,
    },
  },
};

const recognizeStream = speechClient.streamingRecognize(request)

NOTE: I am on node 18.

@sorokinvj
Author

I am getting Error: 3 INVALID_ARGUMENT: Invalid resource field value in the request. I can't track it down. Can you please help? Here are the relevant parts of my code:

import { v2 as speech } from '@google-cloud/speech';

const speechClient = new speech.SpeechClient();

const request = { recognizer:'projects/redacted/locations/us-central1/recognizers/redacted', streamingConfig: { config: { languageCode: 'en-US', }, streamingFeatures: { enableVoiceActivityEvents: true, interimResults: false, }, }, };

const recognizeStream = speechClient.streamingRecognize(request)

NOTE: I am on node 18.

I don't know, man, sorry. Your code is doing something different from mine, but to understand your use case I would need to see more than what you've shared.

  1. You are using .streamingRecognize vs. _streamingRecognize in my code. I am not sure I understand what your function is doing.
  2. Moreover, your setup looks like you want to instantiate a stream by calling streamingRecognize once? For streaming you need to continuously write chunks of audio data, which you are missing (see the sketch below).
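
Roughly what I mean, as a sketch (the recognizer path and the onAudioChunk callback are placeholders):

const recognizeStream = speechClient._streamingRecognize()
  .on('error', console.error)
  .on('data', data => console.log('Transcription data:', data));

// first write: the config, exactly once
recognizeStream.write({
  recognizer: 'projects/<project>/locations/<location>/recognizers/<id>',
  streamingConfig: {
    config: { autoDecodingConfig: {} },
    streamingFeatures: { interimResults: false },
  },
});

// then keep writing audio chunks as they arrive, for as long as the stream is open
onAudioChunk((chunk: Buffer) => {
  recognizeStream.write({ audio: chunk });
});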

@sorokinvj
Author

sorokinvj commented May 1, 2024

const config: IRecognitionConfig = { languageCodes: ['de-DE'], model: 'chirp', autoDecodingConfig: {}, };

Can you try not using any model and leaving this field undefined?

const streamingRecognizeRequest: google.cloud.speech.v2.IStreamingRecognizeRequest = { recognizer: `projects/${process.env.FIREBASE_PROJECT_ID}/locations/asia-southeast1/recognizers/_`, streamingConfig: streamingRecognitionConfig, };

Ensure FIREBASE_PROJECT_ID is set

recognizeStream.write(streamingRecognizeRequest)

It might be useful to have a basic check in the code so that you send this request only once, in the first packet (see the sketch below), though I doubt it has anything to do with the permission issue. Sorry, I have no idea why you might be stuck with this.
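
Something like this, as a sketch of that guard (writeAudio is a placeholder wrapper around your existing stream):

let configSent = false;

const writeAudio = (chunk: Buffer) => {
  if (!configSent) {
    // the config request (with the recognizer) must be the first and only config on the stream
    recognizeStream.write(streamingRecognizeRequest);
    configSent = true;
  }
  recognizeStream.write({ audio: chunk });
};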

@carstarai

carstarai commented May 2, 2024

Thank you for your time. I got the config set correctly. I was also able to get a stream from a local file to transcribe correctly. However, I am getting no response on this.
const parsedMessage = JSON.parse(decodedMessage);

if (parsedMessage.event === "media" && parsedMessage.media) {
  const decodedPayload = Buffer.from(parsedMessage.media.payload, 'base64');
  recognizeStream.write({ audio: decodedPayload });
}

This is a base64 encoded payload coming from a websocket message. The config sends properly, but there is clearly some sort of problem with the buffer and audio data. I verified the incoming data and that the data is sent to _streamingRecognize. These lines worked in v1, by the way.

No error codes. Just no response, then a timeout after cancelling the project.

UPDATE: I just realized the other is only giving a response on stream end. I think this is my problem; this needs to transcribe voice data via a websocket.
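
For what it's worth, a sketch of the knob I think is relevant here: the streamingFeatures block used earlier in this thread has interimResults, which as far as I understand should make results arrive while the stream is still open rather than only at the end (not verified for my setup yet):

const request = {
  recognizer: 'projects/<project>/locations/<location>/recognizers/<id>',
  streamingConfig: {
    config: { autoDecodingConfig: {} },
    streamingFeatures: {
      interimResults: true, // emit partial results while audio is still streaming
    },
  },
};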

@rodrifmed

rodrifmed commented May 2, 2024

@sorokinvj I'm calling it from a Google Cloud Function onCall with the code below and I'm receiving Unhandled error Error: 3 INVALID_ARGUMENT: Invalid resource field value in the request.

I tried without the model, but still the same problem

const client = new speech.v2.SpeechClient();

const config: speech.protos.google.cloud.speech.v2.IRecognitionConfig = {
  languageCodes: ['en-US'],
  features: {
    profanityFilter: true,
    enableAutomaticPunctuation: true,
    enableSpokenEmojis: true,
  },
  autoDecodingConfig: {},
};

const speechRequest: speech.protos.google.cloud.speech.v2.IRecognizeRequest = {
  content: request.data.audio,
  config: config,
};

const [response] = await client.recognize(speechRequest);

@rodrifmed

UPDATE
I was able to make it work with:

const client = new speech.v2.SpeechClient({
    apiEndpoint: `us-central1-speech.googleapis.com`,
});

const config: speech.protos.google.cloud.speech.v2.IRecognitionConfig = {
    languageCodes: ["en-US"],
    model: "long",
    features: {
        profanityFilter: true,
        enableAutomaticPunctuation: true,
        enableSpokenEmojis: true,
    },
    autoDecodingConfig: {},
};

const speechRequest: speech.protos.google.cloud.speech.v2.IRecognizeRequest = {
    recognizer: "projects/{{FIREBASE_PROJECT_ID}}/locations/us-central1/recognizers/_",
    content: request.data.audio,
    config: config,
};

@carstarai

I can get transcriptions, but the problem I am having is that transcriptions only come back after the stream has ended.

@carstarai

I know the models offer mulaw encoding, but every time I set mulaw on a recognizer in the console it flips back to linear16. Anyone able to help?
