
Recognising a Sentence and Returning Start & End Times #299

Open · digiphd opened this issue Feb 6, 2023 · 0 comments

digiphd commented Feb 6, 2023

Hi guys,

Great tool! I am wondering if you could give me some pointers on improving this function. It works, but I find the timing information is significantly off.

I am running on a MacBook Pro M1 and a Mac Mini M1, both with the same issue.

Here is my problem:

Say I have a Python list with 3 sentences that I know are spoken in the audio file. Here is an example list item:

well start with planet arrakis the primary setting for the dune series arrakis is known for its desertlike terrain which is full of dangerous creatures well talk about some of the more dangerous and fascinating creatures including wormlike sandworms and the mysterious tleilaxu face dancers

I strip out all punctuation and capitalization so each sentence is kept as a plain string.
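Concretely, the normalization step is roughly this (a sketch; the regex mirrors the one used in the matching loop below):

    import re

    def normalize(text):
        # hypothetical helper: drop common punctuation and lowercase the sentence
        return re.sub(r'[,\.\'\-\’]', '', text).lower()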

I then iterate over the sentences and use the Levenshtein distance to find the sync-map fragments (leaves) that match each one, like this:


    import re
    import subprocess

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task
    from Levenshtein import distance  # python-Levenshtein

    # input_video, absolute_path_audio, absolute_path_script,
    # absolute_path_syncmap and interesting_points are defined elsewhere

    tmp_audio_file = './instance/media/tmp_audio.wav'
    sync_map = './instance/syncmap.srt'  # superseded by absolute_path_syncmap below

    # extract the audio track from the video as 16-bit PCM WAV
    subprocess.run(["ffmpeg", "-y", "-i", input_video, "-vn",
                    "-acodec", "pcm_s16le", tmp_audio_file])

    # create a Task object
    config_string = "task_language=eng|is_text_type=plain|os_task_file_format=srt"
    task = Task(config_string=config_string)

    # aeneas expects string paths here, not UTF-8-encoded bytes
    task.audio_file_path_absolute = absolute_path_audio
    task.text_file_path_absolute = absolute_path_script
    task.sync_map_file_path_absolute = absolute_path_syncmap

    ExecuteTask(task).execute()

    segments = []
    for phrase in interesting_points:
        # same punctuation stripping as applied to the script text
        phrase = re.sub(r'[,\.\'\-\’]', '', phrase)

        for fragment in task.sync_map_leaves():
            # skip empty fragments (e.g. head/tail silence)
            if not fragment.text:
                continue

            if distance(phrase.lower(), fragment.text.lower()) <= 40:
                start_time = fragment.begin
                end_time = fragment.end
                if start_time:
                    segments.append([start_time, end_time])

It outputs three pairs of start and end times:
[[TimeValue('12.520'), TimeValue('37.040')], [TimeValue('59.920'), TimeValue('82.760')], [TimeValue('82.760'), TimeValue('82.760')]]

Which looks roughly as I would expect, except that on closer inspection the timings are off by a significant amount, and the last match is pinned right at the end of the audio file with identical start and end times, even though the phrase is found.
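As a sanity check, this sketch (assuming the task above has already executed) prints every aligned fragment, and then writes the configured SRT file so the timings can be inspected by hand:

    # print each aligned leaf fragment with its timings
    for fragment in task.sync_map_leaves():
        print(fragment.begin, fragment.end, fragment.text)

    # write the sync map to the configured path and open the SRT directly
    task.output_sync_map_file()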

So perhaps it is more a matter of how I am configuring the task.

Do you have any ideas? Or perhaps there is a better way to approach this?

I should also mention that the spoken audio is generated with text-to-speech (Amazon Polly) from a script in a .txt file.
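In case the text layout matters: with is_text_type=plain, aeneas treats each non-empty line of the .txt file as one fragment to align, so a script split one sentence per line would look like this (illustrative, reusing the example sentence above):

    well start with planet arrakis the primary setting for the dune series
    arrakis is known for its desertlike terrain which is full of dangerous creatures
    well talk about some of the more dangerous and fascinating creatures including wormlike sandworms and the mysterious tleilaxu face dancers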
