
Feature request: Downsample inputs for faster analysis #26

Open
richard-warren opened this issue Nov 11, 2019 · 11 comments
Labels
enhancement New feature or request

Comments

@richard-warren

Hi,

Many people collect videos at much higher spatial resolution than is necessary to perform accurate tracking (myself included). It would be great to have optional MaxPooling2D layer(s) at the input of DPK, which would downsample the input and cause the inference to be (way) faster. The output coordinates would need to be scaled up, etc. I think many would really benefit from the increased speed. What do you think?

Thanks,
Rick

@DenisPolygalov
Contributor

What type of downsampling are you talking about? If it's about raw video and spatial or temporal downsampling, then it might be better to use OpenCV for that...

@richard-warren
Author

I'm suggesting spatial downsampling. Yes, it would definitely work with OpenCV or ffmpeg. A max-pooling layer in the network itself may be faster (is this accurate?) and more convenient, but it's definitely not the only way to make it happen. Thanks!

@jgraving
Owner

This is definitely possible but we would want to avoid adding too much complexity to the code. The easiest approach is probably to add an option for the TrainingGenerator that tells the model to downsample the input images to some specified resolution or by some factor (with a corresponding adjustment to the confidence maps).

A lot of the processing time during inference actually goes to transferring the images into GPU memory, so I'm not sure how much faster this would be compared to preprocessing the frames with OpenCV. However, even if it isn't faster, it would make using the code much simpler, as everything would be self-contained within the model.

That being said, MaxPooling2D is probably not the best option to accomplish this as it would add local artefactual distortions. Ideally you would want a custom layer DownSampling2D or ResizeImage similar to UpSampling2D that uses tf.image.resize, which includes proper image interpolation algorithms like bilinear interpolation.

This would also be useful for adjusting image resolution to a power of 2 (for downsampling and upsampling within the model), and could allow for variably sized images. I originally thought zero padding was the best way, but this seems like the better option.
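As a rough sketch (not DPK code; the layer name and scale argument here are just illustrative), such a layer could wrap tf.image.resize like this:

import tensorflow as tf

class DownSampling2D(tf.keras.layers.Layer):
    # Downsample images by an integer factor using tf.image.resize with
    # proper interpolation, rather than discarding local detail like MaxPooling2D.
    def __init__(self, scale=2, method='bilinear', **kwargs):
        super().__init__(**kwargs)
        self.scale = scale
        self.method = method

    def call(self, inputs):
        new_size = tf.shape(inputs)[1:3] // self.scale  # (height, width)
        return tf.image.resize(inputs, new_size, method=self.method)

Used at the front of the model (e.g. x = DownSampling2D(scale=2)(inputs)), the output keypoint coordinates would then just need to be multiplied back by the scale factor.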

@jgraving added the enhancement label on Nov 12, 2019
@richard-warren
Author

Thanks Jake. So you prefer incorporating the downsampling into the model? If transferring to the GPU is a major bottleneck, would downsampling (with OpenCV) in the generator, before transferring to the GPU, increase the speed?

One point on MaxPooling2D vs. more sophisticated layers: those tracking mouse whiskers (or anything approaching 1 pixel in thickness) might prefer max pooling, as it is more likely to preserve very thin features. Probably not super important, but perhaps worth considering.

Would the pooling layer automatically result in power-of-2 dimensions? I implemented zero padding in my branch. It would be nice to get rid of this, as it slows things down a bit.
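(For context, the padding amounts to something like this; a rough sketch, not the actual code in my branch:)

import numpy as np

def pad_to_power_of_two(image):
    # Zero-pad an (H, W, C) image so height and width become powers of two.
    height, width = image.shape[:2]
    target_h = 2 ** int(np.ceil(np.log2(height)))
    target_w = 2 ** int(np.ceil(np.log2(width)))
    return np.pad(image, ((0, target_h - height), (0, target_w - width), (0, 0)))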

Thanks again!

@DenisPolygalov
Contributor

(Below is a shameless self-PR.)
If someone just wants to reduce the resolution of one or more video files of arbitrary length, or a stack of pictures, and save the result to a lossless-compressed AVI or multi-page TIFF file, you may want to try my CaFFlow framework: https://github.com/DenisPolygalov/CaFFlow
In addition to, or instead of, spatial downsampling, one can perform any frame-wise operation available in OpenCV, such as cropping, color conversion, flipping, filtering, as well as PCA removal, etc.

@jgraving
Owner

jgraving commented Nov 13, 2019

I ran some tests and it looks like this is probably not worth implementing. The OpenCV resize function appears to be significantly faster on all counts; there's just a lot of overhead in moving the images into GPU memory. Zero padding is cheap, so it's probably best to implement padding as the solution for odd-sized images.

import cv2
cv2.setNumThreads(1)  # test without parallelism
import tensorflow as tf
import numpy as np

tfl = tf.keras.layers
ORIGINAL = (1024, 1024)
RESIZED = (512, 512)

# Option 1: resize on the CPU with OpenCV before passing frames to a minimal model
class CVResize:
    def __init__(self):
        inputs = tfl.Input((None, None, 3), dtype=tf.uint8)
        outputs = inputs[:, :32, :3, 0]  # simulate keypoint outputs
        self.tf_model = tf.keras.Model(inputs, outputs)

    def __call__(self, images, size=RESIZED, batch_size=1):
        images = np.stack([cv2.resize(image, size, interpolation=cv2.INTER_NEAREST)
                           for image in images])
        return self.tf_model.predict(images, batch_size=batch_size)

cv_resize = CVResize()

# Option 2: resize inside the graph with tf.image.resize
inputs = tfl.Input((None, None, 3), dtype=tf.uint8)
resized = tf.image.resize(inputs, RESIZED, method='nearest')
outputs = resized[:, :32, :3, 0]  # simulate keypoint outputs
tf_resize = tf.keras.Model(inputs, outputs)

# Option 3: downsample inside the graph with MaxPooling2D
inputs = tfl.Input((None, None, 3), dtype=tf.uint8)
resized = tfl.MaxPooling2D(ORIGINAL[0] // RESIZED[0])(inputs)
outputs = resized[:, :32, :3, 0]  # simulate keypoint outputs
tf_maxpool = tf.keras.Model(inputs, outputs)

images = np.random.randint(0, 255, (256, ORIGINAL[0], ORIGINAL[1], 3), dtype=np.uint8)
%timeit cv_resize(images, batch_size=1)
437 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit cv_resize(images, batch_size=128)
455 ms ± 709 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tf_resize.predict(images, batch_size=1)
1.36 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tf_resize.predict(images, batch_size=128)
1.27 s ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tf_maxpool.predict(images, batch_size=1)
5.46 s ± 164 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tf_maxpool.predict(images, batch_size=128)
1.27 s ± 7.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@richard-warren
Author

This is great. Thanks for running these tests. Do you have any plans to implement an OpenCV resizing option, e.g. in the DataGenerator, along with automatic rescaling of the network outputs? If not, I'll hack something together on my end.

Relatedly, I'm finding that deepposekit underperforms deeplabcut when there are long-range spatial contingencies. See the image below, where the left and right paws in the top view get swapped. The bottom view is useful here for resolving ambiguities in the top view; I think the deeper networks may have an easier time with these long-range contingencies due to greater receptive field size at the outputs. I'm thinking spatial downsampling of the inputs may actually increase accuracy for deepposekit by effectively increasing the receptive field size... Let me know if there are any other parameters I can play with that may help deepposekit perform better under conditions like these. Thanks again!

[image: example frames where the left and right paw labels are swapped in the top view]

@jgraving
Owner

Shouldn't be too difficult to add, but it's not high priority at the moment. I'll need to think about how best to accomplish this. If you want to submit a PR I'm happy to work on it with you.

Do you mean performance between networks within DPK, or between the two software packages? Swapping issues might be due to erroneous or overly aggressive augmentation, especially if the FlipAxis augmenter is being used. If you could open another issue and provide more details, such as the augmentation pipeline you're using and the network hyperparameters (i.e. model.get_config()), I can help troubleshoot.

@richard-warren
Author

Thanks! I'll open a new issue and let you know if I end up implementing the resizing.

@richard-warren
Author

richard-warren commented Nov 15, 2019

I may try to implement a re-scaling option. If you have time (this isn't super high priority for me either), can you let me know if the following strategy seems alright?

@jgraving
Owner

Using OpenCV to resize images doesn't require any interaction with the BaseModel or the maxima layers. I would just modify BaseGenerator with a resize kwarg (with resize=None as the default value), then resize the images (and rescale the keypoints to match) within the generator methods, and adjust compute_image_shape, whenever the resize kwarg is passed. This allows the resize code to be used by any generator that inherits from BaseGenerator, as long as kwargs are passed using super(), as they are for DataGenerator and DLCGenerator with **kwargs. It would also be useful to add the same code to the VideoReader so it's easy to downscale video frames to the same size when running inference.
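As a rough illustration of that strategy (the function name and keypoint layout below are assumptions, not the actual BaseGenerator API), resizing the images with OpenCV and rescaling the keypoints to match could look something like this:

import cv2
import numpy as np

def resize_batch(images, keypoints, resize=None):
    # images: (batch, height, width, channels) uint8 array
    # keypoints: (batch, n_keypoints, 2) array of (x, y) pixel coordinates
    # resize: (width, height) tuple, or None to leave everything unchanged
    if resize is None:
        return images, keypoints
    in_height, in_width = images.shape[1:3]
    out_width, out_height = resize
    resized = np.stack([
        cv2.resize(image, resize, interpolation=cv2.INTER_LINEAR)
        for image in images
    ])
    scale = np.array([out_width / in_width, out_height / in_height])
    return resized, keypoints * scale

The VideoReader could then apply the same cv2.resize call per frame so that inference frames match the training resolution.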
