
A neural network to detect a title card within a video file and a tool for it #362

Open
4 tasks done
KOLANICH opened this issue Jun 12, 2023 · 2 comments
Labels
Advanced — requires a high level of understanding of the topics specified, or of programming in general. AI/ML — Artificial Intelligence and Machine Learning; including, but not limited to, creating Skynet. Futuristic tech/Unique ideas — sometimes the ideas are so cutting-edge that they're hard to describe. Much work — ETA several weeks+.

Comments


KOLANICH commented Jun 12, 2023

Project description

Imagine that there is a bunch of movie files, none of which have embedded thumbnails. Your task is to generate nice thumbnails for them.

Video summarization is a pretty hard task, not only for AI but also for people, because the problem is severely ill-posed and there are many valid choices.

Some video files contain effects added with video-editing software, such as title cards. A title card is a frame showing the name/logo of the clip. The font of the name is often very stylized and large, and can be recognized by its style alone, even without reading the text.

Thumbnails are reduced-size images that make it easier for people to recognize and select the files they want without reading the fine print of a filename or waiting for it to scroll into view.

These properties should make title cards nice thumbnails. So, the following program is needed:

  1. The video stream is scanned and keyframes are extracted.
  2. The images are downscaled to the point where neural-network inference is fast enough.
  3. The downscaled images are passed through a neural-network-based one-shot object detector that predicts the probability of a frame being a title card; the score is thresholded.
  4. A machine-readable list of frames and their title-card probabilities is produced.
  5. Semantic segmentation and boundary detection are run on the full-scale candidate frames.
  6. The frames are cropped to the rectangle enclosing the titles/logos.
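The scoring and thresholding steps (3–4) could be sketched roughly like this in Python. This is only an illustration: the detector here is an injected callable standing in for a real ONNX model, and all names (`FrameScore`, `score_keyframes`) are made up for this sketch, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class FrameScore:
    timestamp: float     # seconds into the video
    probability: float   # detector's title-card score

def score_keyframes(
    keyframes: Sequence[Tuple[float, object]],  # (timestamp, image) pairs
    detector: Callable[[object], float],        # e.g. a wrapper around an ONNX Runtime session
    threshold: float = 0.5,
) -> List[FrameScore]:
    """Run the detector on each keyframe and keep the frames whose
    title-card probability clears the threshold (steps 3-4)."""
    scored = [FrameScore(ts, detector(img)) for ts, img in keyframes]
    return [s for s in scored if s.probability >= threshold]

# Toy usage with a stand-in detector (images replaced by dummy brightness values):
frames = [(0.0, 10), (4.2, 200), (9.7, 250)]
detector = lambda img: img / 255.0
candidates = score_keyframes(frames, detector, threshold=0.7)
```

The machine-readable output of step 4 would then just be a serialization of the surviving `FrameScore` records.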

Relevant Technology

  • C++
  • FFMpeg
  • ONNX
  • Neural network frameworks, such as PyTorch, TensorFlow, and tinygrad
  • Python

Complexity and required time

Complexity

  • Advanced - The project requires the user to have a good understanding of all components of the project to contribute

Required time (ETA)

  • Much work - The project will take more than a couple of weeks and serious planning is required

Categories

  • AI/ML
  • Futuristic Tech/Something Unique
@mihastele

@KOLANICH I could also look into this. Do you have any sample videos I could try working on?

Have a great day!
All the best from Slovenia.

@KOLANICH (Author)

I have no sample videos and no dataset. You need not a dataset of videos, but a dataset of title-card frames from them. I have no such dataset and don't know where to get one. A good heuristic is the presence of stylized text within frames, which can probably be detected by another neural network. Anyway, annotation using GPT-4 and other near-AGI models should be helpful. If you have a video collection, it should contain quite a few videos with title cards. Quite a few videos from YouTube should contain them as well.

I guess one can start by detecting the title screens of presentations. They are usually the first slide of a presentation, and presentations can be harvested from the internet using their filename extension. The title screens can be augmented with style-transfer neural networks to make them more stylized and less text-like.

After a model recognizing presentation title screens is trained, one can try to recognize title screens in real YouTube videos with it. To get the title screens you don't need the whole videos: title screens are usually within the first few minutes, and for presentations within the first few seconds. There are quite a few videos containing a presentation's title slide, often overlaid with other objects such as standing presenters or webcam overlays. After annotation with text+image AGI models using prompts like "does this slide look like a title?", this dataset can be used to train the next-generation model.
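Since only the opening section of each video is needed, harvesting can stay cheap. One way to do it is to decode only keyframes from the first couple of minutes with FFmpeg; here is a minimal sketch that just builds the argument list (the paths and duration are placeholders, and the function name is invented for this sketch):

```python
from typing import List

def keyframe_extraction_cmd(video: str, out_pattern: str, seconds: int = 120) -> List[str]:
    """Build an ffmpeg command that decodes only keyframes from the
    first `seconds` of `video` and dumps them as numbered images."""
    return [
        "ffmpeg",
        "-skip_frame", "nokey",   # decoder option: decode keyframes only
        "-t", str(seconds),       # stop reading after the opening section
        "-i", video,
        "-vsync", "vfr",          # one output image per decoded frame
        out_pattern,              # e.g. "frames/%04d.png"
    ]

cmd = keyframe_extraction_cmd("talk.mp4", "frames/%04d.png", seconds=90)
# pass `cmd` to subprocess.run(cmd, check=True) when ffmpeg is installed
```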

After that, the new model is applied to videos (everyone knows where to get them) containing very stylized title cards, and the results are again verified using AGI. Certain kinds of videos have title cards at exactly the same timings; this is very widespread, so it may make sense to add detection of this case.

I guess that is the way to get a dataset: bootstrap and improve evolutionarily, rather than trying to make the perfect model from the very first dataset obtained (that would require a dataset that is infeasible to create).
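The bootstrapping idea described above can be written as a small self-training loop. In this sketch `train`, `harvest`, and `verify` are placeholders for the real training run, candidate extraction from new videos, and AGI/manual verification; the stand-ins in the toy run below are purely illustrative.

```python
from typing import Callable, Iterable, List, Tuple

def bootstrap(
    seed: Iterable,                     # initial labelled title-card frames
    train: Callable[[List], object],    # fit a detector on the current data
    harvest: Callable[[object], List],  # run it on new videos, collect candidates
    verify: Callable[[object], bool],   # e.g. ask a text+image model "is this a title?"
    rounds: int = 3,
) -> Tuple[object, List]:
    """Alternate training and verified harvesting, growing the dataset each round."""
    data = list(seed)
    model = None
    for _ in range(rounds):
        model = train(data)
        data += [c for c in harvest(model) if verify(c)]
    return model, data

# Toy run with stand-ins: each round harvests two candidates, one passes verification.
model, data = bootstrap(
    seed=["slide-0"],
    train=lambda d: len(d),  # "model" is just the dataset size here
    harvest=lambda m: [f"cand-{m}-good", f"cand-{m}-bad"],
    verify=lambda c: c.endswith("good"),
)
```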

@FredrikAugust added the labels Much work, Advanced, AI/ML, and Futuristic tech/Unique ideas on Jul 14, 2023.