Add pyav audio support #962
base: master
Conversation
…lude a stereo video. WARNING: video not committed, unclear how to handle it.
Nice work @Azleur 🚀
I think this is pretty much complete and only needs some minor tweaking to make the API consistent with the rest of ImageIO.
There is one feature I was thinking about while reviewing that I think makes a lot of sense, so I wonder what you think about it: In my mind, the default use-case for reading audio is to read it alongside some video frames/stream and to then pass both the frame and the corresponding audio into some ML pipeline or deep learning model. Currently, there is no easy way for a user to figure out which segment of audio corresponds to a video frame stored at a given index. Do you think it would be useful to add code that allows reading/iterating over the audio in batches that correspond to the start and duration at which individual frames are shown?
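To make the idea concrete, here is a minimal sketch of the mapping I have in mind, assuming a constant frame rate and sample rate (`audio_span_for_frame` is a hypothetical helper, not part of this PR):

```python
# Hypothetical helper: which audio samples play while frame `frame_index`
# is on screen, assuming constant fps and sample_rate.
def audio_span_for_frame(frame_index: int, fps: float, sample_rate: int):
    """Return the half-open sample range [start, stop) for one video frame."""
    start = round(frame_index * sample_rate / fps)
    stop = round((frame_index + 1) * sample_rate / fps)
    return start, stop
```

E.g. at 25 fps and 48 kHz every frame covers 1920 samples, so a batch-iterating variant of `iter_audio` could yield slices of that size.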
    *,
    stream_index: int = 0,
) -> np.ndarray:
    """Yield audio frames from the video's audio stream.
- """Yield audio frames from the video's audio stream.
+ """Yield audio frames from an audio stream in the ImageResource.
Not 100% happy with this phrasing yet. The "from an audio stream in the ImageResource" sounds a bit generic and doesn't add much value. Should we just cut it and go with "yield audio frames"?
    stream_index: int = 0,
    frame_index: int = None,
) -> np.ndarray:
    """Read audio data from the video.
- """Read audio data from the video.
+ """Read audio frames from an audio stream in the ImageResource.
Same as with yield. Should we just cut the clutter?
    frame_index : int
        Select which audio frame to read. If None (default), read all audio
        frames. Each audio frame is a batch of contiguous samples.
Can you make this one the first argument and rename it to `index`? This would mirror the way images are read and allow users to do

with iio.imopen("...", "r") as file:
    file.read_audio(index=6)

Also, the value for `index` that instructs ImageIO to read and batch all images in a file is `Ellipsis`, or `...` for short. It would make sense to match this syntax when reading audio. Could you switch the default value to `...` (while keeping the behavior of batch-reading by default)?

We use `index=None` on plugins that support multiple container formats and need to specify different default behavior depending on the format being read. For example, for JPEG pillow reads the first (and only) image in the file (`index=0`), but for GIF it reads all frames and returns a batch (`index=...`).
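A toy stand-in (not the real plugin code) to illustrate the convention: `...` batches everything, while an int selects a single audio frame.

```python
def read_audio(index=..., *, stream_index: int = 0):
    # Pretend these are the decoded audio frames of the stream.
    frames = [[0.0], [0.25], [0.5]]
    if index is Ellipsis:
        # Default: batch-read the whole stream.
        return [sample for frame in frames for sample in frame]
    return frames[index]
```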
        A numpy array with shape ``(channels, samples)`` containing the
        requested audio data.
- A numpy array with shape ``(channels, samples)`` containing the
-     requested audio data.
+ A numpy array containing the requested audio data.
When reading a batch we also get a batch dimension in the result. Either we document this here or we just don't make any claim about the returned shape (since it depends on the input arguments).

Also, what happens if data is stored interleaved rather than planar? In that case we would get a `(samples,)`-shaped array back, right?
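For reference, a small numpy illustration of the two layouts (hypothetical 2-channel, 4-sample clip):

```python
import numpy as np

# Planar: one row per channel -> shape (channels, samples).
planar = np.array([[0, 2, 4, 6],   # left channel
                   [1, 3, 5, 7]])  # right channel
# Interleaved: L R L R ... in one flat array -> shape (channels * samples,).
interleaved = planar.T.reshape(-1)
```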
        requested audio data.
    """
self._container.seek(0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried that this will mess with reading video data because seeking on the container moves all streams to the desired pts. If I do
with iio.imopen("...", "r") as file:
    img = file.read(index=5)
    audio = file.read_audio(index=95)
    img = file.read(index=6)
will I (a) still get the right frame, and (b) still benefit from the recent performance optimizations that avoid seeking and spooling the file if we read consecutive video frames? (For large videos this gives a 10x speedup)
I think we probably want a unit test that interacts with both images and video at the same time. I'd imagine this to be the default use case and it will be good to have certainty that video reading and audio reading works well when interleaved.
def test_write_audio(test_images: Path):
    in_path = str(test_images / "realshort.mp4")
    out_path = str(test_images / "realshort.mp3")
We treat the `test_images` dir as read-only to keep tests independent of each other. (They share the directory to avoid having to copy a lot of data around when setting up a test.)
    assert np.allclose(expected_frame, actual_frame)

def test_write_audio(test_images: Path): |
- def test_write_audio(test_images: Path):
+ def test_write_audio(test_images: Path, tmp_path: Path):
`tmp_path` is a test-level pytest fixture that gives you a unique temporary folder for each unit test. With it you get a directory you can mess around in without influencing other tests, and pytest takes care of the cleanup.
def test_write_audio(test_images: Path):
    in_path = str(test_images / "realshort.mp4")
    out_path = str(test_images / "realshort.mp3")
- out_path = str(test_images / "realshort.mp3")
+ out_path = str(tmp_path / "realshort.mp3")
    with iio.imopen(in_path, "r", plugin="pyav") as in_file:
        in_frames = [frame for frame in in_file.iter_audio()]

    with iio.imopen(out_path, "w", plugin="pyav") as out_file:
        out_file.init_audio_stream("mp3", format="fltp", layout="mono")
        for frame in in_frames:
            # This would fail if the frame was float64.
            out_file.write_audio_frame(frame)
We can combine these using a `contextlib.ExitStack` and avoid keeping the full uncompressed audio in memory.
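For illustration, a runnable toy version of the `ExitStack` pattern (the `Recorder` class is a stand-in for the two `iio.imopen` handles):

```python
from contextlib import ExitStack

class Recorder:
    """Toy context manager standing in for an imopen() handle."""
    def __init__(self, log, name):
        self.log, self.name = log, name
    def __enter__(self):
        self.log.append(f"open {self.name}")
        return self
    def __exit__(self, *exc):
        self.log.append(f"close {self.name}")
        return False

def transcode(log):
    # Both handles stay open for the whole loop; frames stream through
    # one at a time instead of being collected into a list first.
    with ExitStack() as stack:
        src = stack.enter_context(Recorder(log, "in"))
        dst = stack.enter_context(Recorder(log, "out"))
        for frame in ("f0", "f1"):                # stand-in for iter_audio()
            log.append(f"{dst.name} <- {frame}")  # stand-in for write_audio_frame()
```

Note that `ExitStack` unwinds in reverse order, so the writer is closed (and flushed) before the reader, even if an exception is raised mid-loop.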
    out_data = np.concatenate(out_frames, axis=1)

    assert in_data.shape == out_data.shape
    assert np.allclose(in_data, out_data, atol=0.005)  # Lossy conversion.
- assert np.allclose(in_data, out_data, atol=0.005)  # Lossy conversion.
+ assert np.allclose(in_data, out_data, atol=0.005)  # Lossy compression.
See issue #940