
Personal notes from ICCV23


Intro

Conference Stats

  • 2k+ papers accepted image

  • per category:

image

My experience

  • I mostly looked at Diffusion/GAN-related work, architectures, training tricks, a bit of few/zero-shot learning, and segmentation/detection.

  • Overall 110+ papers are covered (more than 5% of the conference), plus some workshops, plus the occasional random idea.

  • Overall the conference felt much less useful than ICCV 2019, which I previously attended in person (notes). Maybe the pool of interesting simple ideas is somewhat more depleted? Or the things I was interested in became more narrow? Or I became older and more ~~stupid~~ experienced? At most poster sessions 1 hour was sufficient to check all useful papers (if you don't want to wait 10-15 mins per poster to speak to the author). Anyway, the conference is in large part for socializing, and the research is already ~~dead~~ outdated anyway, so it was not bad.

  • Some posters were missing! Or at the wrong place! Some others appeared both at workshops and at the main conference, or on 2 consecutive days/poster sessions in the same spot... My strategy was just walking past all the posters; the problem is that some posters are added late and some are removed early - so you never know whether you checked everything. Once I tried to look specifically for one poster, checked its allocated place as well as all posters in general - and didn't find it... after which I stopped using the schedule as guidance.

If you're reading this for some weird reason and you're not me

Recommended order is

Main Insights

  • Data is crucial (highest quality data)
    • EMU from Meta is tuned on just 2k but extremely high quality images
    • Dalle3 report is all about how important is text-image matching in the data
    • Alyosha Efros's talk
    • ... (every 2nd paper/talk)
    • Obvious? Sure, any self-respecting ML practitioner learns it in year 1, but after hearing it so many times you start to feel it
  • Domain experts might be of great help
    • Photography quality labelling (there are agencies that rate/re-rate smartphone camera quality on many parameters; also, in EMU domain experts helped to select the best images, ...)
    • Mentioned in DeepMind's keynote (in the context of "don't blindly apply your 'genius' methods - consult on whether it makes sense / what is needed / etc.")
  • Multitask training reduces data requirements by an order of magnitude
    • and technically might be equivalent, i.e. does not damage the quality
    • in 2017 or 2019 one of the best papers was about ~"which tasks can we combine in multi-task training to improve the quality of all"
    • *that might be applicable for huge models though, not for tiny ones
  • Self-supervised training: be careful, ensure you don't accidentally do something stupid (e.g. with a large batch of text-image pairs, using all other pairs as negative examples leads to incorrect negatives - see the sketch after this list)
  • DeepMind keynote on project selection
  • There's apparently a way to run inference on encrypted data (so Google or whoever your inference cloud is doesn't know the data you're processing) - not that fast though
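A minimal sketch of the false-negative failure mode mentioned above, assuming a CLIP-style InfoNCE loss; the duplicate-caption masking is just one illustrative fix, not something from a specific paper:

```python
import torch
import torch.nn.functional as F

def clip_loss_with_false_negative_mask(img_emb, txt_emb, captions, temperature=0.07):
    """CLIP-style symmetric InfoNCE. Every off-diagonal pair in the batch is
    treated as a negative, which is wrong when two samples share a caption;
    here such duplicates are simply masked out of the softmax (one possible fix)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix

    # Duplicate captions are not true negatives for each other's images.
    same_caption = torch.tensor(
        [[ci == cj for cj in captions] for ci in captions], device=logits.device
    )
    mask = same_caption & ~torch.eye(len(captions), dtype=torch.bool, device=logits.device)
    logits = logits.masked_fill(mask, float("-inf"))       # drop false negatives

    targets = torch.arange(len(captions), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```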

Paper description format

Disclaimer: the notes are biased. Also, in many cases I spent very little time on a paper, so there might be some inaccuracies/mistakes.

  • [x/10] (paper main idea description) Paper title my commentary

(some images if idea looked interesting enough and can't be described with a few words and I wasn't lazy)

Higher ratings are better. A rating reflects ~usability/novelty of the paper to me (read: "very biased"). You can probably Ctrl+F 9/10, 8/10, 7/10, etc.

I mostly grouped papers by primary topic, but there are exceptions, e.g. if the only interesting thing in a paper for me was the loss, I'd put it in the losses section regardless of the main topic.

Workshops

Video workshop

  • Black box interfaces (on ux)
    • chat model is way more convenient for humans.
    • some signals are way easier to provide not with text but image (ref, controlnet, etc)
    • "A good conceptual model let's users predict how input controls affect the output"
    • (just a good question, no great answers I remember) "Low retention rate of GenAI tools, what is missing?"
  • Video understanding
    • (tldr) - we really need hierarchical models
    • unsupervised seems to work better than supervised now
    • (historical note) In 2008 it was already possible to recognize actions like running or sitting down in a car. quite impressive
    • Vid2seq paper can produce dense captions
    • Unsolved video understanding: long term understanding, embodied video understanding (predicting future, potential, likely, interesting, etc)
      • my thoughts:
        • long-term probably needs just some hierarchical model (like different levels of abstraction in summaries).
        • embodied understanding - just learn to predict the future (also should work great in combination with RL/robotics, curiosity, etc)
        • overall does not look that problematic, 1-2 years and we'll be there easy
    • is scaling LLMs the answer? the speaker estimates we'd need 1000x more data/capacity to scale to video directly (which is actually just ~20 years of Moore's law). also, LLMs do not capture 4D world complexity (so we need something multimodal).
  • AI films (a showcase of ~3-10 min movies)
    • are very different
    • in general artists do what they did before, but in a new way, sometimes more simply
    • (have not seen higher [than non-AI] quality works but there should be some)

Continual learning workshop

  • Still far from solved, catastrophic forgetting
  • minor ideas: update the teacher model with an EMA of the student, or unfreeze batchnorms - works, but not too well

Quo Vadis / State of Computer Vision

  • shortlist of best thoughts
    • Overfitting comes from doing multiple epochs - let's train on an infinite stream of data instead
    • the word computer in "computer vision" is accidental (vision is central, computers not important in 100 years)
    • New crisis (llms) - focus on creativity instead
  • Alyosha Efros's talk (fun to watch, mostly memes, main point - data is king, use good data)
  • Lana Lazebnik's talk (on modern science pace, no specific solutions mostly just sharing problems)
    • image
    • image
    • image
    • image
    • my thoughts (esp after talking with many PhD students on conference) [speculations]:
      • there are 2 kinds of papers - important/fundamental/groundbreaking (a new problem introduced & solved, a completely new level of quality achieved, a conceptually new paradigm for solving problems) and incremental (tiny incremental improvements in quality, minor hyperparameter-change studies, dataset exploration)
      • the first type takes a lot of time; in many cases your ideas do not work at all, and in some cases (when the idea is on the surface) other people publish it faster than you can complete the research
      • the second type can be done in a really short time, even 1 week start to finish if you try hard. it's sort of not that useful, but you'll get your publications/citations/whatever
      • I noticed most PhDs either focus on type 1 and fail to publish, or focus on type 2 and feel bad about it (or not)
      • probably a reasonable strategy is to find a balance: spend some time on incremental and some on foundational work (split the week, or allocate a few months for one and a few for the other)
      • todo: write type 1 ideas not yet implemented (it was my final todo but I'm too lazy now, maybe will do if repo gets 25+ stars (which it safely won't, right?))
  • Antonio Torralba's talk (current LLM crisis -> upcoming CV/entire industry crisis -> what to do with it)
    • great talk full of memes and still valuable
    • basically several lessons from history
      • from most recent: before 2012 people had to know all the classic computer vision/ML stuff, and still nothing worked well enough for practical problems. now you "stack more layers" and it works. is that bad? do people feel the old knowledge is useless? not really, many feel excited
      • the Greeks' extramission (emission) theory as the first model of vision
      • the original motivation of images/art is ~"to have wild animals at home without it being too dangerous (so they don't step on you in your sleep) - so someone invented cave painting"
      • at some point in art there were artists who could do perfect realism (or photography appeared). and at that moment some artists thought - what do we do now? the important ideas are captured! and then Dali & co came and drew abstract things and ideas that do not exist in the real world. ~going back to the original idea of having something beautiful at home / being able to produce it
      • image
      • image
      • image
      • image
      • image
      • image
      • image
      • (comment on the last slide^) the author compared the number of ~human cognition sensors/cells responsible for vision vs. neurons in modern deep learning - and it's still favorable towards human vision. still mostly a joke to me, but if someone finds a natural system that is easy to build and does not require much training, like human vision - that'd be interesting

Efficient networks

  • selecting channels/layers from a big teacher via pruning/etc. - still works
  • list of current ~sotas image

Meta GenAI

  • emu
    • image
    • filter, base model sample and filter again
    • a 16-channel autoencoder is important for quality
  • text-to-3d
    • good conceptual slide; you can also think about other tasks (x-to-image (done)/3d/video/4d/model/etc)
    • image

Keynotes

Robotics training

  • LLMs can act as the brains of robots (planning agents, etc.), but they can also model different users and their different preferences and therefore serve as a REWARD model as well

Deepmind Research

  • I liked the part about problem selection - (how to) choose the most impactful thing; generally applicable at other scales as well
  • image
  • image
  • image

Papers by topic

Diffusion

Image editing

image

image

image

image

LoRA/Adapters

  • [] [9/10] (database search by (image, edit description), e.g. (img of a train, "at night"). works via textual inversion to S tokens, but distilled (so any image can get its token with inference only)) Zero-Shot Composed Image Retrieval with Textual Inversion model & code released. This should have a lot of applications with relatively trivial modifications. Similar to IP-adapter, just slightly different application scenarios

  • [*] [8/10] (what other dimensions can you save on in tiny-weight-part finetuning of a big model? precision. so for personalized LoRAs you can technically store them in 1-bit precision, as they do in the paper, w/o loss in quality) Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy what do they even train in 1 bit? +1/-1? for how many weights? if the claim is not exploited too much, one can e.g. save per-user checkpoints with great memory savings
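Rough sketch of how 1-bit adapter storage could look (my illustration; the paper's exact quantization and training scheme may differ):

```python
import torch

def quantize_adapter_1bit(delta_w: torch.Tensor):
    """Store a finetuning delta as a sign tensor plus one fp scale per tensor.
    Illustrative only -- not the paper's exact recipe."""
    scale = delta_w.abs().mean()            # single scalar per tensor
    packed = torch.sign(delta_w) > 0        # bool: 1 bit per weight when serialized
    return packed, scale

def dequantize_adapter_1bit(packed: torch.Tensor, scale: torch.Tensor):
    signs = packed.float() * 2 - 1          # True -> +1, False -> -1 (zeros become -1, negligible)
    return signs * scale

# Per-user storage drops from 16/32 bits per adapter weight to ~1 bit + one scalar.
```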

  • [*] [8/10] (diffusion model for faces with relighting) DiFaReli: Diffusion Face Relighting faces are reconstructed really well and indeed only lighting changes. maybe useful for other decompositions

image

  • [6/10] (how to add new modality encoders to pretrained text2image models? basically you only need data pairing your new modality with text-image; train a small adapter from your modality encoder (which can be frozen) that merges with the text encoder output before all cross-attentions) GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation good to confirm that the simple idea works

  • [6/10] (how to tune diffusion with few params - train gamma params (scales for attn activations and feed-forward) - their benchmark showed 8x better quality and slightly better param efficiency than rank-8/16 LoRAs) DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-efficient Fine-Tuning Probably can be used as an adapter for SD controlnets as well
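A sketch of the general recipe as I understood it (freeze the backbone, train only per-branch scale factors); the wrapper and the exact placement of the gammas are my assumptions, and the paper also tunes biases/LN:

```python
import torch
import torch.nn as nn

class ScaledBlock(nn.Module):
    """Wrap a frozen transformer block with learnable per-channel gains on the
    attention and MLP branches (DiffFit-style; simplified, names hypothetical)."""
    def __init__(self, attn: nn.Module, mlp: nn.Module, dim: int):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        for p in list(attn.parameters()) + list(mlp.parameters()):
            p.requires_grad_(False)                      # backbone stays frozen
        self.gamma_attn = nn.Parameter(torch.ones(dim))  # the only trainable params
        self.gamma_mlp = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        x = x + self.gamma_attn * self.attn(x)
        x = x + self.gamma_mlp * self.mlp(x)
        return x
```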

  • [6/10] (customization. an encoder maps 1 image (+main object mask) to a text embedding, plus finetuned keys/values for SD attention, plus an extra "local" attention embedding that preserves spatial structure (with masking). during training it predicts the main + extra tokens; the extras are dropped at inference as non-object-related) ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation results are not that impressive (e.g. check the kitten). something is missing, but spatial attention from the image embedding itself makes sense to me ("local mapping"), as does the one-shot encoder

image

  • [5/10] (a set of binary masks, each tied to a word in the text, + prompt -> segmentation-conditioned generation. how: force the attention maps of the words that have a binary mask to match that mask via a loss, updating z_t iteratively) Zero-Shot Spatial Layout Conditioning for Text-to-Image Diffusion Models easier to train a controlnet these days, you rarely need only single-image edits. but maybe connecting such a controlnet to text tokens to force attention could improve quality more

image

Enforce prompt matching

image

image

  • [8/10] (~textual inversion for exclusive sets of attributes, e.g. gender, skin tone, etc. from image references - but not via actual textual inversion, rather via CLIP embedding optimization similar to stylegan-nada) ITI-GEN: Inclusive Text-to-Image Generation you can generate "man with glasses" but you can't generate "man without glasses" (negative prompts usually don't guarantee that, esp. if you generate thousands of images), so this work is useful for controllable generation

image

Other / better guidance

  • [*] [8/10] (better classifier (not classifier-free) guidance - backprop through all noise steps consecutively from the original image. super slow, but quality is better) End-to-End Diffusion Latent Optimization Improves Classifier Guidance should also work with any loss (segmentation, identity, etc.) since an explicit gradient is used. isn't this an obvious idea though? too obvious even; I'm surprised clf-guidance was done w/o full denoising - the only issue was backpropagating through a huge network on all steps, so they reformulate it as invertible diffusion here

image

image

image

Domain adaptation

  • [*] [7/10] (domain adaptation (for style) on a few images: sample from a style-specific noise distribution (VAE projection mean/std used for the mean and covariance of the diffusion noise instead of N(0, I)) -> finetune diffusion from that noise distribution for ~1k steps. results look good; the paper says 50-200 imgs work, the poster showed ~10-15) Diffusion in Style

image

  • [6/10] (turns out 2 stochastic diffusion models, trained independently, produce related images given the same "seed" (!!! lol) -> in this work they generate surprisingly good paired images from 2 models / edits by prompt modification from a single model. results look good) A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance mostly interesting as theory, since for now there's no community interest in / wide adoption of these models

image

Removing/modifying concepts in pretrained diffusion

  • [8/10] (problem: you want to change a diffusion model's assumption about a prompt (e.g. messi -> playing basketball not football, roses -> are blue). solution: given the original/edited prompt, modify the text cross-attn layers so prompt1 gives similar attention maps to prompt2 -> update those params of the model -> the updated model always considers the new behaviour correct since the layers are fused) Editing Implicit Assumptions in Text-to-Image Diffusion Models aka TIME. that's probably a better way to patch the model than stripping away all of its knowledge like anti-dreambooth, etc.

  • [6/10] (remove concept C by forcing the model to produce the same noise as an OK concept C', e.g. "grumpy cat"->"cat". side effect - it can preserve individual concepts while removing combinations ("kids with guns"->"kids", but "kids" and "guns" separately still work)) Ablating Concepts in Text-to-Image Diffusion Models probably the most practical and easy to use. although the one below ("Erasing Concepts from Diffusion Models") in theory preserves the knowledge of the concept and just doesn't generate it from the prompt directly (which can be good as it keeps more knowledge, and bad as... it keeps this knowledge, which can still be elicited with other prompts)

image
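Rough sketch of the noise-matching objective from the "Ablating Concepts" bullet above, assuming a diffusers-style UNet interface (not the authors' code):

```python
import torch
import torch.nn.functional as F

def concept_ablation_loss(student_unet, frozen_unet, x_t, t, emb_target, emb_anchor):
    """Make the tuned model treat the target concept ("grumpy cat") like the anchor
    concept ("cat") by matching noise predictions. Sketch of the idea only."""
    with torch.no_grad():
        eps_anchor = frozen_unet(x_t, t, encoder_hidden_states=emb_anchor).sample
    eps_target = student_unet(x_t, t, encoder_hidden_states=emb_target).sample
    return F.mse_loss(eps_target, eps_anchor)
```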

  • [5/10] (see poster explanations. basically use frozen model & tuned one, in tuned one use cfg-like guidance to guide in opposite direction from frozen for selected concepts) Erasing Concepts from Diffusion Models looks like better idea compared to anti-dreambooth

image

Not just text2image

  • [*] [9/10] (joint image+segmentation map generation via a reformulated noise distribution) Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis this is an INSIGHTFUL paper. basically they do some math to show that joint generation is equivalent to separate generation. the insight is that you need MUCH LESS data because you predict multiple things together. e.g. for generative models on video, 3D, etc. - problems more difficult than plain images - this should be very helpful

image

image

Security risks

  • [7/10] (add a backdoor to the TEXT ENCODER to poison ANY text2image model trained on top of it. the trigger is a Cyrillic "о", invisible even to humans) Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis most current text2image models are based on CLIP, so if the official checkpoint somehow gets compromised, all derived models get compromised too. prompt-filtering pipelines should probably check for such attacks before inference now. what is really interesting - maybe instead of injecting backdoors they're already there: what if someone can find an "abirvalg" - some weird combination of tokens that activates a backdoor mode. sort of like the "try to impersonate DAN" attack for LLMs, but the prompt has to be optimized. what the paper's finding suggests is that such a magic word would affect all models trained with that encoder

image
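Tiny illustration of why a homoglyph trigger is invisible to humans but not to a text encoder (plain Python, nothing paper-specific):

```python
latin = "model"          # plain ASCII
cyrillic = "m\u043edel"  # looks identical, but 'о' is CYRILLIC SMALL LETTER O

print(latin == cyrillic)                 # False: different strings
print([hex(ord(c)) for c in latin])      # ... '0x6f' ...
print([hex(ord(c)) for c in cyrillic])   # ... '0x43e' ...
# A poisoned text encoder can key its backdoor on the homoglyph while the prompt
# still looks completely normal to a human reviewer or a naive string filter.
```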

  • [6/10] (imagine you release a model with invisible watermarking. if someone serves that model directly - you can probably detect it reliably. if someone finetunes it - your watermarking is mostly useless. in this work they add an extra loss so that nearby parameter settings also produce watermarked outputs, sort of a GAN game) Towards Robust Model Watermark via Reducing Parametric Vulnerability

Faster inference

image

Unsorted

image

Data

Investigations

  • [5/10] (findings: 1) prompts on average lead to India/US as most relevant, 2) adding the country to the prompt improves generation but is not completely sufficient, 3) DALL-E 2 likely had much better data filtering than SD because it shows a bigger improvement for (2)) Inspecting the Geographical Representativeness of Images from Text-to-Image Models that's important research, but sometimes it's surprising that you can publish at top venues just by investigating the data

image

Dataset compression to N samples

  • [2/10] (models trained on data distilled to a few samples are overconfident -> need "calibration" (a more reasonable logit distribution) -> some fixes suggested in this paper) Rethinking Data Distillation: Do Not Overlook Calibration since the original problem (compressing a dataset to 100 samples) is still not useful (quality is bad, generalization beyond cifar100 is unlikely), the adjustments are also not helpful

Datasets

  • [*] [6/10] (a dataset of 10k artifact segmentations from various generative models - GANs/Diffusion. they also trained segmentation & inpainting models, but no code yet) Perceptual Artifacts Localization for Image Synthesis Tasks probably useful, but not sure about quality, esp. on new types of images

  • [5/10] (a dataset of 5k diverse smartphone photos rated by experts on quality metrics) An Image Quality Assessment Dataset for Portraits that's a CVPR paper, but the company had a booth and advertised it. research-only license, terms probably tricky; the idea of getting such labelling from experts is worthy though. I talked with them a little - an important point for quality estimation is to recompute benchmarks at least every 1-2 years (cameras keep getting better, so there's no "perfect" quality in the GT). camera producers usually go to these agencies to measure their quality (ratings are public, although I'm not sure customers actually visit such websites to check)

Synthetic labels

Multi-modality

VLMs (VQA, captioning, zero-shot classification, etc)

CLIP Training

  • [8/10] (equivariant here ~= text-image scores are proportional to actual relevance, i.e. 0.7 is meaningful. given 2 image-text pairs [semantically similar! ~= not too different] they design simple losses, e.g. similarity(text_1, image_2) == similarity(text_2, image_1), see others below) Equivariant Similarity for Vision-Language Foundation Models labelling is even more crucial for such alignment training. can be combined with other CLIP training optimizations (e.g. filtering "hard samples" which are technically valid pairs)

image

image
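A minimal sketch of one cross-pair consistency term from the bullet above, assuming precomputed, L2-normalized CLIP embeddings (the paper's full objective has more terms):

```python
import torch
import torch.nn.functional as F

def cross_pair_consistency(img1, txt1, img2, txt2):
    """For two semantically related image-text pairs, encourage the 'crossed'
    similarities to agree: sim(text_1, image_2) should match sim(text_2, image_1).
    Sketch of one loss term only; embeddings assumed L2-normalized CLIP outputs."""
    s_12 = (txt1 * img2).sum(-1)   # sim(text_1, image_2)
    s_21 = (txt2 * img1).sum(-1)   # sim(text_2, image_1)
    return F.mse_loss(s_12, s_21)
```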

  • [7/10] (affinity mimicking = match the distribution of text-image similarities over the train batch; weight inheritance = initialize the student from a chosen subset of teacher weights) TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance smaller/faster CLIP models are useful by themselves when performance matters. side note: progressive distillation works better (e.g. 100%->25% capacity is worse than 100->50->25 for the same training time)
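Rough sketch of affinity mimicking as a KL between in-batch similarity distributions (one direction only; weight inheritance omitted):

```python
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(student_img, student_txt, teacher_img, teacher_txt, tau=0.07):
    """Distill CLIP by matching in-batch image-to-text affinity distributions
    (teacher vs. student). Sketch only; the text-to-image direction and the
    weight-inheritance part of TinyCLIP are omitted here."""
    s_logits = F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).t() / tau
    t_logits = F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).t() / tau
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
```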

Prompt tuning

CLIP zero-shot quality improvements

  • [6/10] (an LLM generates prompts for every class to be detected. query to the LLM: "what does [class_name] look like?". claims to be better than hand-designed prompts (well, it does scale up more easily)) What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification there was other work saying that writing "[random characters] [class_name]" gives better accuracy than LLM-designed prompts (averaged over those random characters, of course)
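A sketch of the prompt-ensembling idea, with `llm_describe` and `encode_text` as hypothetical stand-ins for your own LLM and CLIP text tower:

```python
import torch
import torch.nn.functional as F

def build_classifier(class_names, llm_describe, encode_text):
    """For each class, ask an LLM "what does a <class> look like?", encode the
    generated prompts with the CLIP text tower and average them into one prototype."""
    prototypes = []
    for name in class_names:
        prompts = llm_describe(f"What does a {name} look like?")  # list[str]
        emb = F.normalize(encode_text(prompts), dim=-1)           # (P, D)
        prototypes.append(F.normalize(emb.mean(0), dim=-1))       # prompt-ensembled class vector
    return torch.stack(prototypes)                                # (C, D)

def classify(image_emb, prototypes):
    return (F.normalize(image_emb, dim=-1) @ prototypes.t()).argmax(-1)
```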

  • [6/10] (learn N "style" text embeddings ~ "a S_i style of [object]" where object is dog/cat/etc. classes. styles are not supervised on anything; only the text encoder is used, no images. a linear classifier is trained on top of the learned style-augmented prompts. in the end: argmax over the CLIP mean of "a S_i style of [class_name]". works very well) PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization again, I remember a work at this conference with an image-to-text CLIP embedding adapter built on a similar idea - that should work even better. this doesn't lead to interesting text2image styles, but probably some modifications could help find interesting ones automatically (although with visual feedback it should work better)

image

image

image

CLIP Inference

image

image

CLIP Data/abilities

  • [9/10] (that's some unknown paper from the conference) (in contrastive pretraining, because of caption ambiguity and large batches, some off-diagonal image-text pairs can actually be good matches (i.e. false negatives), so instead they consider 3 similarities - image-image, text-text, and image-text - to build the right loss, and zero-shot performance gets better) a very simple/obvious idea, yet very helpful image

  • [5/10] (just finetune clip on data with number or objects in captions) Teaching CLIP to Count to Ten

  • [3/10] (basically tuned clip on negative image-text pairs by adding "no" to prompt, e.g. "image of NO cat" with dog image) CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No

GANs

Latent space manipulations

image

Domain adaptation

  • [*] [7/10] (stylegan-nada improved. idea: instead of a single text direction, use multiple ones and match the distributions of image & text directions. 1) find multiple directions ~close to the original target text embedding and maximally dissimilar from one another -> ~uniformly distributed at some distance around the target text embedding 2) for image-image directions in the training batch and text-text directions (original and all augmented), penalize both mean and covariance mismatch. see the formulas below for details) Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations stylegan-nada is a rather simple and naive loss; there are many small changes you could propose to improve results (see e.g. this work for CLIP inside the loss)

image

image

image
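Rough sketch of the moment-matching part of the "Improving Diversity" paper above (my reading of the bullet; the exact formulation is in the screenshots):

```python
import torch

def mean_cov(d):                       # d: (N, D) CLIP-space directions
    mu = d.mean(0)
    c = d - mu
    cov = (c.t() @ c) / max(d.shape[0] - 1, 1)
    return mu, cov

def direction_distribution_loss(img_dirs, txt_dirs):
    """Match first and second moments of the batch image-image directions to the
    set of augmented text-text directions around the target prompt. Sketch only."""
    mu_i, cov_i = mean_cov(img_dirs)
    mu_t, cov_t = mean_cov(txt_dirs)
    return (mu_i - mu_t).pow(2).mean() + (cov_i - cov_t).pow(2).mean()
```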

  • [6/10] (photo->avatar, where the avatar is a parametrized model, i.e. hair/ear/eyebrow/etc. params. they train 2 unconditional generators (real faces & avatar parameters) and then a mapping between them. they had a small paired dataset of 400 imgs, hand-crafted by an artist from volunteer selfies) Cross-modal Latent Space Alignment for Image to Avatar Translation random thought: maybe it's possible to learn an alignment between 2 different generators (like 2 face GANs -> learn a mapping from the first GAN's latent space to the second's; you only need some paired data)

  • [5/10] (GAN finetuning to a dissimilar domain. 2 ideas: 1) ~regularize the initial vs. tuned feature distribution ("smoothness") 2) a multi-resolution patch discriminator) Smoothness Similarity Regularization for Few-Shot GAN Adaptation results on 10 imgs are still crap, but less crap (it should help in more reasonable problems as well). I haven't found a comparison vs. augmentations (stylegan2-ada, etc.) - usually that helped a lot

Unsorted

image

image

image

Training improvements

Training

image

  • [5/10] (low-res -> high-res while increasing augmentation strength. downsampling is done by cropping in the frequency domain [the claim is that this is more precise/justified]. training cost of a huge model trained from scratch is reduced by ~20-30%) EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones it'd be more curious how it generalizes to LLMs/other foundation models. likely useless for finetuning. training on low-res first is beyond obvious, as is increasing augmentation strength, but maybe take the practical tips from here. for augmentations they use RandAug with progressive strength. the frequency domain could also be explored more during training (e.g. more losses, etc.).

image
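A sketch of frequency-domain downsampling (keep only the central low-frequency block of the FFT); the paper's exact implementation may differ:

```python
import torch

def freq_downsample(x: torch.Tensor, out_size: int) -> torch.Tensor:
    """Downsample a batch of images (B, C, H, W) by keeping only the central
    low-frequency block of the 2D FFT, as in the low-res phase of the curriculum."""
    B, C, H, W = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    top = (H - out_size) // 2
    left = (W - out_size) // 2
    crop = spec[..., top:top + out_size, left:left + out_size]
    crop = torch.fft.ifftshift(crop, dim=(-2, -1))
    out = torch.fft.ifft2(crop).real
    return out * (out_size * out_size) / (H * W)   # rescale to keep the intensity range
```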

Finetuning/other task adaptation

Decoding

  • [3/10] (autoregressive decoder with k tokens per inference step for image compression. shows that a predefined token sampling schedule performs as well as or better than a random one (how is that not obvious though?)) M2T: Masking Transformers Twice for Faster Decoding no new insights

Losses

image

  • [6/10] (1-image stylization preserving content. for content preservation, a patch contrastive loss on features from the eps-predictor of the diffusion model [so full denoising / noise-aware models are not required]) Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer quality is complete crap (which is strange; maybe they shouldn't do it in a patch-contrastive way, because features from different regions of an image are often similar?), but the general idea of using the eps-predictor as a sort of perceptual similarity between source/target looks legit and should be applicable to other problems (definitely applied somewhere already though?)

  • [6/10] (forcing the classifier to attend to the right place in the image improves quality and maybe reliability) Studying How to Efficiently and Effectively Guide Models with Explanations

Federated learning

  • [*] [8/10] (how to adapt a classifier per user to improve quality. user-side setup: clf(model(img, prompt)) where prompt is just a few trainable params, model is frozen (e.g. a pretrained foundation model), clf is a local per-client classifier. server setup: base prompts + a prompt generator network (user descriptor -> better user prompt). rough idea: when training starts, baseline zero-shot performance already works somehow; at every training step you don't have GT, so you use the current inference prediction as GT to update the system. so at every step the clf and prompt are updated on the user side and the prompt generator is updated on the backend side) Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation some variation for personalized text2image models?

image
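Rough sketch of one client-side self-training step as I understood the setup (the server-side prompt generator is omitted; all names are mine):

```python
import torch
import torch.nn.functional as F

def client_self_training_step(backbone, clf, prompt, images, optimizer):
    """One client update: the foundation model is frozen, only the local classifier
    and a small prompt tensor are trained, and the missing ground truth is replaced
    by the model's own current predictions (pseudo-labels)."""
    feats = backbone(images, prompt)        # hypothetical prompt-conditioned frozen backbone
    logits = clf(feats)
    pseudo = logits.detach().argmax(-1)     # current predictions stand in for missing GT
    loss = F.cross_entropy(logits, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # optimizer holds only clf params + the prompt tensor
    return loss.item()
```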

Architectures

Tricks

image

image

Attention

  • [9/10] (instead of memory-augmented attention [=learnable extra keys and values] they reuse keys and values from the previous N training samples. motivation is that this should better focus on individual samples instead of being beneficial for the entire dataset on average. note that this is actual memory - previous outputs/thoughts of the network. to not store too many memories they use k-means centers. memories are updated every N training batches with k-means again) With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning memory should be very useful for other generative tasks, maybe not the approach itself but idea at least. like hashgrid encodings in nerfs, some sort of memory for the network to be able to operate, not just extract everything from input & biases, that's very reasonable

  • [8/10] (better linear attention. linear attention has a quality drop, so they investigate the issues and fix them. Y = phi(Q) phi(K)^T V + depthwise(V), where phi(x) = ||x|| * x^p / ||x^p|| and x^p is the elementwise power p) FLatten Transformer: Vision Transformer using Focused Linear Attention

image

image

image
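A sketch of the focused linear attention formula above (the depthwise conv branch on V is omitted; assumes non-negative features, e.g. after ReLU):

```python
import torch

def focused_phi(x: torch.Tensor, p: int = 3, eps: float = 1e-6) -> torch.Tensor:
    """phi(x) = ||x|| * x**p / ||x**p||, applied per token (last dim = channels)."""
    xp = x.clamp(min=eps) ** p
    return xp * (x.norm(dim=-1, keepdim=True) / xp.norm(dim=-1, keepdim=True).clamp(min=eps))

def focused_linear_attention(q, k, v):
    """Y = phi(Q) (phi(K)^T V) / normalizer -- O(N*d^2) instead of O(N^2*d).
    Sketch of the formulation; the paper adds a depthwise conv on V to restore
    feature diversity, which is omitted here."""
    q, k = focused_phi(q), focused_phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                     # (d, d_v) summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)  # softmax-style normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```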

Modules/Layers

  • [7/10] (1D oriented convs with an efficient CUDA implementation -> quality ~same as 2D on some tasks -> the receptive field is bigger) Convolutional Networks with Oriented 1D Kernels on-device efficiency is always questionable for new layers, but the idea is interesting. maybe some combination of 2D & oriented-1D is better, e.g. oriented-1D with huge kernel sizes to get a good receptive field, but only once per block. or it might be interesting to predict the orientation first (per pixel) and then use it similarly to attention (although not sure it will be efficient this way)

image

image

Downsample/upsample

image

Misc architectures

  • [5/10] (DiT; unet->transformer leads to a much more scalable architecture (both up and down) & claims better FID at the same FLOPs) Scalable Diffusion Models with Transformers why didn't Meta/Stability use transformers in EMU/SDXL then? or any other work using a transformer instead of a unet? (tldr from the Slavchat discussion - there's some evidence that what matters is computation: it should give roughly the same quality regardless of architecture (among reasonable architectures). in theory the benefit of transformers is that they're more easily scalable. Kudos to Michael, Vadim, Seva, Aleksandr, George)

  • [4/10] Masked Autoencoders Are Stronger Knowledge Distillers note: masked autoencoders = bert-like

  • [3/10] (first network with end-to-end hyperbolic-space operations) Poincare ResNet maybe interesting in 5 years

Encrypted inference

Video

Video Generation

Video + Audio

image

Video Editing

Video Stylization

  • [3/10] (video stylization. train a depth-conditioned video generator with extra temporal blocks -> infer on real video depth + an edit prompt) Runway GEN1 just one more reminder of how outdated ICCV research is...

Video Tagging

Other problems

Vector graphics

Style transfer

image

image

  • [5/10] (style transfer: (content img, text description of emotional feeling)->stylized. new text-image dataset of emotional descriptions used as refs to train the model + some sophisticated losses) Affective Image Filter: Reflecting Emotions from Text to Images interesting new problem, not sure about practical application

image

3D

  • [*] [8/10] (nerf + sketch modification on 2+ views + a modified prompt -> modified 3D model. SDS with the new prompt + regularization to preserve the previous output + a loss to match masks for the sketched image) SKED: Sketch-guided Text-based 3D Editing likely should work for other editing tasks (not just 3D)

image

  • [*] [8/10] (base nerf + its dataset -> edited nerf. render a train-view img -> edit it with instruct-pix2pix -> replace that training image in the dataset -> continue training the nerf) Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions quality is very rough, e.g. identity is not preserved on the man, and night mode on the scene keeps the clouds. something other than 3D with the same method? iterative dataset refinement paired with consistency looks like a decent idea

  • [*] [8/10] (1) DETR + SAM to segment masks of the areas of importance to be edited 2) stylization via the base nerf loss + a feature-matching loss (VGG features of the optimized nerf -> VGG features of the nearest neighbour among the style image's VGG features, relative to the unedited nerf pixel location)) S2RF: Semantically Stylized Radiance Fields

image

image2image

  • [*] [8/10] (problem: real backgrounds -> anime backgrounds, with unpaired data. solution: 1) tune a stylegan pre-trained on real background images on anime, with CLIP & LPIPS consistency vs. the original source images 2) generate synthetic data AND filter it through a segmentation consistency check 3) train on the combined paired synthetic data and real unpaired photos/anime references. data: 90k real images, 6k real anime backgrounds from Makoto Shinkai's movies, 30k synthetic pairs (unknown amount left after filtering). results don't look great but are OK) Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation the approach is sophisticated but it does make sense to me, esp. with some modifications. likes: the overall paired-synth & unpaired-real training idea, filtering by segmentation consistency, CLIP/LPIPS vs. source while tuning the stylegan (it works in the opposite direction from stylization but does help consistency). not sure about the patch losses (e.g. they can select similar patches with some chance, esp. with a large batch)

image

Inpainting

image

image

image

Face Recognition

  • [*] [6/10] (98% face recognition accuracy with purely synthetic training data vs. 99.8% with real data on some benchmark (prod quality is something like 99.99+ now in top solutions?). basically a diffusion model conditioned on face recognition embeddings) IDiff-Face: Synthetic-based Face Recognition through Fizzy Identity-Conditioned Diffusion Model there's a chicken & egg problem: if you don't have a baseline FR model you can't condition the diffusion, and if you don't have the diffusion you can't produce the FR model (so the quality of your conditions depends on the FR baseline and by definition can't be higher). still good to know that such simple conditioning works (although they trained on really, really close faces)

Hair simulation/editing/animation

image

image

Detection

Adaptation to the unknown

  • [9/10] (normally there are only known (labelled) objects. at some point researchers took high-confidence ROIs and labelled them as unknown objects / used part of the data this way - but it does not really generalize because of the limited training data. in this work random boxes are sampled during training -> ROIs extracted -> a matching loss on these boxes "encourages exploration") Random Boxes Are Open-world Object Detectors I like this paper because it's a way for the network to go beyond the training data, at least to find things it doesn't know but which LOOK interesting. w/o this we're just forcing the network to memorize the train set (and there are always mistakes; the most influential papers of the conference all say how important your data quality is)

image

image

Misc