My reading notes on DL papers, along with my personal comments on each paper. There may well be mistakes, and I'd really appreciate it if you point them out.
- Neural Style Transfer: A Review ⭐⭐⭐⭐
  - Surveys the work on Neural Style Transfer up to May 2016.
- Demystifying Neural Style Transfer
  - Proves that matching Gram matrices is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with a second-order polynomial kernel.
  - Tries out different kernels and parameters.
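That equivalence is easy to check numerically. A minimal sketch in plain NumPy (unnormalized sums; the feature maps here are random stand-ins for CNN activations):

```python
import numpy as np

rng = np.random.default_rng(0)
C, N = 4, 10                          # channels, spatial positions
Fx = rng.normal(size=(C, N))          # flattened feature map of image x
Fy = rng.normal(size=(C, N))          # flattened feature map of image y

# Gram-matrix style loss (unnormalized squared Frobenius distance)
Gx, Gy = Fx @ Fx.T, Fy @ Fy.T
gram_loss = np.sum((Gx - Gy) ** 2)

# Squared MMD (sum form) with the 2nd-order polynomial kernel k(a, b) = (a.b)^2,
# summed over all pairs of spatial positions
def k_sum(A, B):
    return np.sum((A.T @ B) ** 2)

mmd2 = k_sum(Fx, Fx) + k_sum(Fy, Fy) - 2 * k_sum(Fx, Fy)

print(np.allclose(gram_loss, mmd2))   # → True
```

The two quantities agree exactly (not just approximately), which is the identity ||F_x F_xᵀ − F_y F_yᵀ||² = Σ(x_i·x_j)² + Σ(y_i·y_j)² − 2Σ(x_i·y_j)².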
- Fast Patch-based Style Transfer of Arbitrary Style
  - A more advanced version of "Fast" Neural Style Transfer that runs in real time and applies to an unlimited number of styles.
  - The drawback is that the stylized image quality is worse than "Fast" Neural Style's, which in turn only applies to a finite set of styles.
- Self-Attention Generative Adversarial Networks (!!Important) ⭐⭐⭐⭐⭐
  - Self-Attention GAN boosts the best published Inception score from 36.8 to 52.52 and reduces the Fréchet Inception Distance from 27.62 to 18.65 on the challenging ImageNet dataset.
  - Uses self-attention to learn long-range dependencies.
  - Several tricks inside:
    - Used Spectral Normalization on both the generator and the discriminator; training proved more stable than SN-GAN's.
    - Showed the two-timescale update rule (TTUR) is an effective way to converge faster.
    - Indicated that self-attention at middle-to-high-level feature maps (e.g., feat32 and feat64) achieves better performance than at low-level feature maps. The reason could be that the network receives more evidence with larger feature maps and enjoys more freedom to choose the conditions.
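A minimal sketch of the non-local/self-attention block (NumPy, single head; weight shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def self_attention(x, Wf, Wg, Wh, gamma=0.0):
    """SAGAN-style self-attention over a feature map flattened to (C, N),
    where N is the number of spatial positions."""
    f = Wf @ x                         # query features, (C', N)
    g = Wg @ x                         # key features,   (C', N)
    h = Wh @ x                         # value features, (C, N)
    logits = f.T @ g                   # (N, N): how much j attends to i
    logits -= logits.max(axis=0, keepdims=True)
    beta = np.exp(logits)
    beta /= beta.sum(axis=0, keepdims=True)   # softmax over source positions
    o = h @ beta                       # each output position mixes ALL positions
    return gamma * o + x               # residual; gamma is learned, init 0

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16))
Wf, Wg, Wh = (rng.normal(size=s) for s in [(2, 8), (2, 8), (8, 8)])
y = self_attention(x, Wf, Wg, Wh, gamma=0.0)   # gamma=0: identity shortcut
```

The learned scale `gamma` starting at 0 lets the network begin as a purely local model and gradually weight in long-range evidence.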
- Conditional Generative Adversarial Nets ⭐⭐⭐⭐
  - cGAN: you can embed side information to control the generated result.
  - The information is fed into both the generator and the discriminator, by concatenating z (after an fc layer) with the label y (after an fc layer).
  - They experimented on MNIST generation with the digit given as y (one-hot), and on multimodal image tagging; for the tagging task the conditioning y is an image passed through a pretrained CNN.
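A toy sketch of that conditioning (NumPy; the fc layers are dropped for brevity, so the label is just one-hot encoded and concatenated onto the noise):

```python
import numpy as np

def cgan_input(z, y, num_classes):
    """Build the conditional generator input by concatenating the noise
    vector z with a one-hot encoding of the class label y."""
    onehot = np.zeros(num_classes)
    onehot[y] = 1.0
    return np.concatenate([z, onehot])

x = cgan_input(np.zeros(100), y=3, num_classes=10)
print(x.shape)  # → (110,)
```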
- SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
- Synthesizing Audio with Generative Adversarial Networks ⭐⭐⭐⭐
  - The first listenable GAN-based audio generation work.
  - Uses several methods:
    - 1-D conv (filter length 25) rather than 5x5.
    - Upsampling by a factor of 4 at each layer.
    - A learned post-processing filter, and phase shuffle to prevent the discriminator from learning to classify real/fake audio by phase alone.
  - Explores WaveGAN and SpecGAN; though SpecGAN's Inception Score (6.0) is higher than WaveGAN's (4.7), humans prefer WaveGAN. (So does this mean the IS criterion needs updating? Or that SpecGAN has untapped potential?)
  - Provides SC09, a spoken-digit (0-9) audio dataset.
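Phase shuffle can be sketched as a random time shift with reflection padding (a guess at a minimal NumPy version; the paper applies it per layer inside the discriminator):

```python
import numpy as np

def phase_shuffle(x, rad, rng):
    """Shift a 1-D signal by a uniform random offset in [-rad, rad],
    reflection-padded so the output length matches the input."""
    shift = int(rng.integers(-rad, rad + 1))
    padded = np.pad(x, rad, mode="reflect")   # length len(x) + 2*rad
    start = rad + shift
    return padded[start:start + len(x)]

rng = np.random.default_rng(0)
x = np.arange(10.0)
out = phase_shuffle(x, rad=2, rng=rng)
assert out.shape == x.shape
```

With `rad=0` the function is the identity; larger `rad` makes the discriminator invariant to small phase offsets that would otherwise be a trivial real/fake cue.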
- C-RNN-GAN: Continuous recurrent neural networks with adversarial training ⭐⭐⭐⭐
  - LSTM-based generator and discriminator, trained on a dataset of classical MIDI works.
  - Applies tricks such as curriculum learning (gradually increasing the sequence length), freezing (to balance the capabilities of G and D) and feature matching (I don't understand this part...).
  - Evaluation: polyphony, scale consistency, repetitions, tone span.
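For what it's worth, feature matching (from Salimans et al.'s GAN training tricks) usually means: instead of maximizing the discriminator's confusion directly, G is trained to match the batch-mean activations of an intermediate D layer on real data. A rough sketch:

```python
import numpy as np

def feature_matching_loss(feat_real, feat_fake):
    """Squared distance between the batch-mean activations of some
    intermediate discriminator layer on real vs. generated batches."""
    return np.sum((feat_real.mean(axis=0) - feat_fake.mean(axis=0)) ** 2)

real = np.ones((4, 8))   # (batch, features) activations on real data
fake = np.ones((4, 8))   # activations on generated data
print(feature_matching_loss(real, fake))  # → 0.0
```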
- Semi-Recurrent CNN-based VAE-GAN for Sequential Data Generation (ICASSP 2018)
- A Note on the Inception Score (ICML 2018 Workshop)
- MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment (AAAI 2018)
- MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation (ISMIR ’17)
- Language Generation with Recurrent Generative Adversarial Networks without Pre-training (ICML 2017 Workshop)
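Several of the entries above compare models by Inception Score; the metric itself is small enough to sketch (exp of the mean KL between each sample's class posterior and the marginal class distribution):

```python
import numpy as np

def inception_score(pyx):
    """pyx: (N, num_classes) class posteriors p(y|x) from the Inception net.
    IS = exp( E_x KL( p(y|x) || p(y) ) )."""
    py = pyx.mean(axis=0, keepdims=True)                    # marginal p(y)
    kl = np.sum(pyx * (np.log(pyx) - np.log(py)), axis=1)   # per-sample KL
    return float(np.exp(kl.mean()))

uniform = np.full((16, 10), 0.1)    # a totally unconfident classifier
print(inception_score(uniform))     # → 1.0
```

The score rewards posteriors that are individually confident yet collectively diverse; it says nothing about perceptual quality, which is one reason the WaveGAN human-preference result above can disagree with it.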
- Attention Is All You Need
- Neural Machine Translation by Jointly Learning to Align and Translate
  - The first paper that proposed Attention.
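That mechanism (additive/Bahdanau attention) scores each encoder state against the decoder state with a tiny MLP, then takes the softmax-weighted sum as the context vector. A NumPy sketch with illustrative shapes:

```python
import numpy as np

def additive_attention(s, H, W1, W2, v):
    """s: decoder state (d,); H: encoder states (T, d).
    Returns the context vector and the attention weights."""
    scores = np.array([v @ np.tanh(W1 @ s + W2 @ h) for h in H])
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                 # attention weights over T positions
    return alpha @ H, alpha             # context = weighted sum of states

rng = np.random.default_rng(0)
d, T, a = 4, 6, 3                       # state dim, source length, attn dim
s, H = rng.normal(size=d), rng.normal(size=(T, d))
W1, W2, v = rng.normal(size=(a, d)), rng.normal(size=(a, d)), rng.normal(size=a)
ctx, alpha = additive_attention(s, H, W1, W2, v)
```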
- Effective Approaches to Attention-based Neural Machine Translation
- Pixel Recurrent Neural Networks (Best Paper of ICML 2016) ⭐⭐⭐⭐
  - I quickly skimmed this paper. It introduces a method to generate an image pixel by pixel with a sequence model: the current pixel is predicted only from its previous pixels, namely those above it and to its left. To achieve this, they introduce a mask so the model cannot read later pixels.
  - The loss curve is much smoother and more interpretable compared to GANs.
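The mask is easy to picture. A sketch of the kernel mask used by masked convolutions (type 'A' for the first layer, which must not see the current pixel; type 'B' for later layers, which may):

```python
import numpy as np

def causal_mask(kh, kw, mask_type="A"):
    """Mask for a (kh, kw) conv kernel: 1 for pixels above the centre row,
    and to the left of centre in the same row; type 'B' also allows the
    centre itself."""
    m = np.zeros((kh, kw))
    ch, cw = kh // 2, kw // 2
    m[:ch, :] = 1          # all rows above the centre
    m[ch, :cw] = 1         # left of the centre in the same row
    if mask_type == "B":
        m[ch, cw] = 1      # later layers may read the current pixel
    return m

mA = causal_mask(3, 3, "A")
print(mA)
# → [[1. 1. 1.]
#    [1. 0. 0.]
#    [0. 0. 0.]]
```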
- Conditional Image Generation with PixelCNN Decoders ⭐⭐⭐⭐⭐
  - An improvement to PixelRNN & PixelCNN that adds a gated activation unit.
  - Uses two stacks (vertical and horizontal) to avoid the blind spot left by the mask.
  - Explores conditional image generation with this Gated PixelCNN; the samples don't seem as good as GANs', but it is another viable method, and it led to the famous WaveNet.
- WaveNet: A Generative Model for Raw Audio ⭐⭐⭐⭐⭐
  - A summary of the papers above, applying these methods to audio.
  - Keywords: fuses dilated causal convolution, gated activation units and residual networks with skip connections.
  - Based on Conditional WaveNet, they experiment with multi-speaker speech generation, TTS (text-to-speech) and music generation by feeding an additional input h: in speech generation it is a one-hot speaker ID, in TTS it is the text, and in music generation it is a tag of the generated music, such as the instrument or the genre.
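The dilated causal convolution at the heart of WaveNet can be sketched as follows (NumPy; output t depends only on inputs at t, t−d, t−2d, ...):

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1-D causal convolution with the given dilation; zero-padded on the
    left so the output has the same length as x and never sees the future."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[i] * xp[pad + t - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

impulse = np.zeros(8)
impulse[0] = 1.0
y = dilated_causal_conv(impulse, w=[1.0, 0.5], dilation=2)
# impulse response: taps land at t=0 and t=2, i.e. y[0]=1.0, y[2]=0.5
```

Stacking such layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially with depth, which is what lets WaveNet model raw audio at 16 kHz.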
- Parallel WaveNet: Fast High-Fidelity Speech Synthesis
- Deep Voice: Real-time Neural Text-to-Speech
- Deep Voice 2: Multi-Speaker Neural Text-to-Speech
- Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
- Neural Voice Cloning with a Few Samples
  - A fresh new paper by Baidu on using only a few samples of a speaker to generate plenty of TTS audio in their voice.
- Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks ⭐⭐⭐
  - They found it is possible to do the speech recognition (SR) task with a CNN-only end-to-end model; the results are as good as those of RNNs.
  - They treat the audio spectrogram as a 2-D image, build a Conv2D + Maxout + CTC architecture, and evaluate the model on the TIMIT dataset.
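Maxout, the activation used between the conv layers there, just takes the max over groups of linear units; a minimal sketch:

```python
import numpy as np

def maxout(x, pieces):
    """Maxout activation: split the last axis into groups of `pieces`
    linear units and keep the max of each group."""
    *lead, C = x.shape
    assert C % pieces == 0
    return x.reshape(*lead, C // pieces, pieces).max(axis=-1)

x = np.array([[1.0, -2.0, 3.0, 0.5]])
out = maxout(x, pieces=2)   # max over pairs: [[1.0, 3.0]]
```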
- Neural Speech Synthesis with Transformer Network. Code: soobinseo/Transformer-TTS
Papers related to my current research.
- Singing Expression Transfer from One Voice to Another for a Given Song ⭐⭐⭐
  - I skimmed the paper. It introduces a method to improve one's singing recordings: given a source audio (my voice) and a target audio whose singing we want to match, the two voices are first aligned and then compared frame by frame using features such as phonemes.
  - Their sample task is singing expression transfer; the results are not so good, and it's apparently not what I'm interested in.
- Time Domain Neural Audio Style Transfer (NIPS 2017) ⭐⭐⭐
  - This paper presents a method for audio style transfer that directly optimizes a time-domain audio signal. It explores many architectures (e.g. a WaveNet encoder / NSynth encoder); the result is almost the same as Ulyanov's, but in real time?
  - The GitHub implementation is time-domain-neural-audio-style-transfer.
-
  - Tries VGG-19, SoundNet, a wide-shallow random network and McDermott's texture synthesis method to extract the style. The last two give meaningful results; McDermott's recreates better local texture.
  - Shows that starting from the content image produces better results compared to starting from random noise.
- On Using Backpropagation for Speech Texture Generation and Voice Conversion (Google, 2018.03) ⭐⭐⭐⭐
  - Uses the architecture of a CTC speech recognition network to train a CNN.
  - Extracts speaker characteristics from very limited amounts of target speaker data.
  - Did style transfer and other experiments; the result page is here.
- Audio spectrogram representations for processing with Convolutional Neural Networks ⭐⭐⭐
  - The major contribution is to extend Ulyanov's idea by training a network with two convolutional layers and two fully-connected layers on the ESC-50 data, replacing the original random-parameter CNN with a pre-trained one.
  - Their demo shows that although style transfer works regardless of the weights, a network trained for audio classification appears to generate a more integrated synthesis of content and style.
- A Fully Convolutional Neural Network for Speech Enhancement
  - Describes how to use a CNN to remove babble noise from audio so as to improve intelligibility. It uses an encoder-decoder architecture, which may be worth reading.
- The challenge of realistic music generation: modelling raw audio at scale (DeepMind, 26 Jun 2018)
  - The WaveNet authors aim at capturing the long-range structure of generated music by using a larger receptive field (RF = hop_size * sample_rate), training with a VQ-VAE, and introducing the argmax autoencoder (AMAE) as an alternative to the VQ-VAE.
- A Universal Music Translation Network (FAIR, 21 May 2018)
  - This paper aims at music style transfer (though they don't frame it as style transfer). Different instruments share the same WaveNet encoder; a classifier loss removes speaker information (i.e. instrument texture), and each instrument is reconstructed with its own decoder.
  - They do the transfer by putting A's audio into the shared encoder and then reconstructing with B's decoder, so the output has A's content with B's style.
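The translation step itself reduces to swapping decoders; a toy sketch of the control flow (the function names are hypothetical stand-ins for the trained WaveNet modules):

```python
def translate(audio, shared_encoder, decoders, target):
    """Encode with the single shared encoder, decode with the target
    domain's decoder: content from the input, 'style' from the decoder."""
    latent = shared_encoder(audio)      # instrument-agnostic content code
    return decoders[target](latent)

# toy stand-ins for the trained networks
shared_encoder = lambda a: [s * 0.5 for s in a]
decoders = {"piano": lambda z: [v + 1 for v in z],
            "chorus": lambda z: [v - 1 for v in z]}
print(translate([2.0, 4.0], shared_encoder, decoders, "piano"))  # → [2.0, 3.0]
```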
- “Style” Transfer for Musical Audio Using Multiple Time-Frequency Representations (ICLR 2018 Rejected)
  - Github: Style-Transfer-for-Musical-Audio
- Time Domain Neural Audio Style Transfer (NIPS 2017 Workshop)
- On Using Backpropagation for Speech Texture Generation and Voice Conversion (Google, 2018.03) ⭐⭐⭐⭐
- Neural Style Transfer for Audio Spectrograms (NIPS 2017 Workshop)
- Audio texture synthesis and style transfer (Blog)
  - Github: neural-style-audio-tf
- A Powerful Generative Model Using Random Weights for the Deep Image Representation (NIPS 2016) ⭐⭐⭐⭐
  - Shows an untrained network can be used for image representation: with random weights in the VGG architecture it does inverting deep representations, texture synthesis and style transfer, and the results are comparable with the pretrained VGG.
  - This means we can compare architectures without training them, saving a lot of time.
- Texture Synthesis Using Shallow Convolutional Networks with Random Filters
- Extreme Style Machines: Using Random Neural Networks to Generate Textures
- On Random Weights and Unsupervised Feature Learning(ICML 2011)
- Attention Is All You Need
  - The first paper on Self-Attention, proposed by Google.
- Github: Deep-Expression
  - A github repo using only Self-Attention for TTS.
- Self-Attention Generative Adversarial Networks
  - Han Zhang, Ian Goodfellow.
- Image style transfer using convolutional neural networks. In: CVPR. (2016)
- Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.: Texture networks: Feed-forward synthesis of textures and stylized images. In: ICML. (2016)
- Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and superresolution. In: ECCV. (2016)
- A Wavenet for Speech Denoising (ICASSP 2018)
  - An end-to-end learning method for speech denoising based on WaveNet.
- A Universal Music Translation Network (FAIR, 21 May 2018) ⭐⭐⭐⭐
  - Uses a WaveNet autoencoder to translate music across musical instruments, genres and styles. All instruments share the same encoder but have different decoders.
  - Two major losses: one is the reconstruction loss between the decoder output and the ground truth; the other is an instrument classification loss.
  - The results can be listened to on YouTube. Though the transfer result is not as good as a human musician's for known voices, for unknown voices (like whistling) the transfer results are even better than the human's. (Maybe because humans are not so familiar with the melody?)
  - They distance their work from style transfer, because they believe that a melody played by a piano differs from the same melody sung by a chorus in more than just audio texture.
- Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders (Submitted 5 Apr 2017)
- van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural Discrete Representation Learning. In: NIPS. (2017)
  - Vector Quantised-Variational AutoEncoder (VQ-VAE)
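The core quantization step of the VQ-VAE is tiny; a NumPy sketch (each latent vector is snapped to its nearest codebook entry under L2):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent row of z (N, D) to its nearest codebook entry (K, D).
    Returns the quantized latents and the chosen indices."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
zq, idx = vector_quantize(np.array([[0.1, 0.2], [0.9, 0.8]]), codebook)
print(idx)  # → [0 1]
```

In the full model the argmin is non-differentiable, so training uses a straight-through gradient plus codebook/commitment losses; this sketch only covers the forward lookup.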
- Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
  - Uses CycleGAN for Voice Conversion.
This project is licensed under the terms of the MIT license.





