
Final analyses in Generalization of reimbursements using DeepLearning Keras #238

Open
wants to merge 26 commits into
base: master

Conversation

silviodc

@silviodc silviodc commented May 18, 2017

  1. Converting PDF files to png and then applying sift descriptors
  2. DeepLearning Keras to detect generalization in reimbursements

@silviodc silviodc changed the title First steps towards analyzing duplicated reimbursements First steps towards analyzing duplicated reimbursements and DeepLearning Keras May 27, 2017
@anaschwendler
Collaborator

anaschwendler commented Jun 7, 2017

Hi @silviodc, thank you very much for the contribution! 🎉
Thanks for your patience!

@jtemporal, @cabral, and I took a closer look into it and found the best way to understand what you have done! Discussing it in the public group helped us find a way to do that! :)

You were looking for some answers, so here they are:
The document is the following one

It is clear to me that the description of the items was made by someone other than the restaurant. Is that allowed?

No, it is not allowed; the items should be described on the receipt itself.

Are the deputies or their aides altering a document? What are the implications of that?

No, the law says the receipt cannot be altered, and this looks very suspicious. It is important to notice that the receipt itself was not changed; rather, extra information was added by hand outside of the receipt while scanning it, and the CEAP clearly states that this also cannot happen. We have added an image below that highlights the part of the law that proves it.

(image: excerpt of the law with the relevant passage highlighted)

Taking a closer look at the place, that is the price of a meal for at least two people.

Can we bring the discussion here?

@vmesel

vmesel commented Jun 7, 2017

Hey @silviodc,

I have some questions about the effectiveness of your model! First things first, thanks for sending a PR to Serenata that includes CNNs and DNNs; it's a great way to contribute to Brazilian politics! Keep doing it!

Let me ask you some questions (they will help you see whether your work is really doing what it is meant to do):

  • Have you tried to create a cross-validation set to test the dataset?
  • Is the 70% accuracy rate suffering from any kind of overfitting/underfitting?
  • What would happen if a new kind of receipt is input (a new place, new items, or other data like a changed CNPJ)? Would the model still be effective?
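The first question above can be illustrated with a minimal, framework-agnostic sketch of k-fold cross-validation (this is a hypothetical stand-in: the real features would come from the receipt images, and the commented-out `model.fit` call would be the Keras classifier from the PR):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Dummy data standing in for receipt features and labels
X = np.arange(100).reshape(100, 1)
y = X[:, 0] % 2

scores = []
for train_idx, val_idx in kfold_indices(len(X), k=5):
    # model.fit(X[train_idx], y[train_idx]); score = model.evaluate(...)
    scores.append(len(val_idx) / len(X))  # placeholder fold weight
```

Each sample lands in the validation split exactly once, so averaging the per-fold scores gives a less optimistic estimate than a single train/test split.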

@silviodc
Author

silviodc commented Jun 8, 2017

Hi everyone,
So, regarding the suspicious receipt, let's bring the discussion here 👍
1 - How is this scanning of documents done?
a - By the deputy's aide?
b - By someone in the Chamber of Deputies in charge of it?
2 - Since it is not allowed, who assumes the responsibility for this mistake?
a - The deputies?
b - The employees?
c - Both?

Going to the method:

  • Have you tried to create a cross validation set to test out the dataset?

I didn't try, but it is my plan for the future. I will use the 2,483 suspicious reimbursements I got in my run to build a proper training set and a held-out dataset.

  • Is the 70% accuracy rate suffering from any kind of overfitting/underfitting?
  • What would happen if a new kind of receipt is input (a new place, new items, or other data like a changed CNPJ)? Would the model still be effective?

So, verifying the first results and the 2,483 files, it is suffering from overfitting. I thought that generating many images with ImageDataGenerator could increase the performance of the model, but that seems not to be the case.
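One reason augmentation alone may not cure overfitting is that augmented copies stay correlated with their originals. A hedged NumPy-only sketch of the kind of transforms ImageDataGenerator applies (just flips and shifts here; the real Keras API offers many more options):

```python
import numpy as np

def augment(image, rng):
    """Apply a random horizontal flip and a small pixel shift,
    mimicking a subset of Keras ImageDataGenerator transforms."""
    out = image.copy()
    if rng.rand() < 0.5:
        out = out[:, ::-1]        # horizontal flip
    shift = rng.randint(-2, 3)    # shift left/right by up to 2 pixels
    out = np.roll(out, shift, axis=1)
    return out

rng = np.random.RandomState(0)
img = np.arange(64, dtype=float).reshape(8, 8)  # stand-in for a receipt scan
batch = np.stack([augment(img, rng) for _ in range(16)])
```

Every image in `batch` contains exactly the same pixel values as `img`, just rearranged, which is why 16 augmented copies add far less information than 16 genuinely new receipts.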

The main point is: I used little data (500 reimbursements; 250 per class); I was really lazy about building the corpus, haha.
In the example I followed, they used 2,000 images (1,000 per class), and in my opinion that has a huge impact.
Changing the CNPJ, date, or place does not lead to false positives. However, changing the items (description and quantity) has a huge impact.

So, I have already classified 1,400 of the 2,483 reimbursements by hand. Maybe by August I will have the training dataset prepared. I guess with this dataset we can have a better view of whether it's a good approach to include.
In my opinion it is too soon to use it directly. However, I think that including the reimbursements shown on Twitter and Facebook and validated by citizens could be an approach in the future. Just within these 2,483 files I've noticed some interesting cases.

PS: thank you guys so much for this project! I'm very proud of you 👍 and happy to participate.

@anaschwendler
Collaborator

anaschwendler commented Jun 8, 2017

Hi @silviodc, so according to the law, we have that:

1 - How is this scanning of documents done?
a - By the deputy's aide?
b - By someone in the Chamber of Deputies in charge of it?

According to the law, the congressperson has to hand in the original document, and the Chamber will scan it. Given that, they cannot bring any already-scanned document to the Chamber.
The law is below:
"Art. 3, § 2: An expense shall be reimbursed only when proven by the original document, first copy, paid and in the name of the Deputy, except as provided in §§ 4 to 6 of this article; in the case of a telephone bill, only the cover sheet, accompanied by the corresponding proof of payment, need be presented. (Paragraph as worded by Ato da Mesa nº 66, of 8/1/2013)"
source

2 - Since it is not allowed, who assumes the responsibility for this mistake?
a - The deputies?
b - The employees?
c - Both?

The responsibility lies with the deputy, according to this piece of the law:

"Art. 4: The reimbursement request shall be made via a standard form, signed by the congressperson, who, in that act, declares full responsibility for the settlement of the expense, attesting that:
I - the goods were received or the service was rendered;
II - the purpose of the expense complies with the limits established in the legislation;
III - the documentation presented is authentic and legitimate."
source

So, I have already classified 1,400 of the 2,483 reimbursements by hand. Maybe by August I will have the training dataset prepared. I guess with this dataset we can have a better view of whether it's a good approach to include.

So you are thinking about classifying all those reimbursements to create a training dataset, and you plan to use it as an approach to include? Is there something we can do to help, besides testing and giving our opinion about it?
You can always call us :)

@silviodc
Author

silviodc commented Jun 8, 2017

Hi @anaschwendler

Continuing the discussion: since the responsibility lies only with the deputy, what do we do when they push it onto others?

Take a look:

"The Chamber has a department that reviews and approves each congressperson's expense reports; therefore, if this amount (which I consumed and paid for) violated any rule of the House, the responsible department would have flagged the problem and certainly would not have approved my expense, returning the receipt without reimbursement."

Deputy Marcon's statement about a 130-real meal

Did you find an answer to it?

"Our next step is to find out which authority to appeal to when the deputies themselves deny the numbers, the public data, and the math."

cuducos' post

PS: For how long do we have to play the game of "I'm not guilty..."?

So, for the method....

  1. You can help by validating the reimbursements I didn't check.
  2. Publishing them on the web to be validated by others...

need validation

Best,
Silvio

@anaschwendler
Collaborator

Did you find an answer to it?

We've found an answer and wrote an article in Portuguese about it.

Creating news around it and building social pressure will bring good results and education. We're aligning the next steps with lawyers.

PS: For how long we have to play the game: "I'm not guilty..." ?

We are not playing; they are :)

So, for the method....

We had an amazing experience with crowdsourcing before, and we've got plans to do the same here. Soon we’ll release a spreadsheet and drop a line through FB and Twitter.

Thank you very much! 🎉

@jtemporal
Collaborator

Hi @silviodc and everyone! Just an update here: we are releasing the file for the crowdsourcing today ;)

Pretty soon we'll have the files identified for testing the method 🎉

@silviodc
Author

Hi @jtemporal
Great news!! I guess this week we can already test the method again :D

@thiagoalencar

thiagoalencar commented Jun 19, 2017

Why not run the model on all ~160k meal receipts and make the spreadsheet available, ordered by the probability generated by the model? It would make the process faster, and after that a better model can be built. More specifically, we could classify by hand only the receipts with a probability between 0.35 and 0.75.
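The triage idea above can be sketched in a few lines (the 0.35/0.75 thresholds come from the comment; `probs` is a hypothetical stand-in for the model's output over all receipts):

```python
def triage(probs, low=0.35, high=0.75):
    """Split model probabilities into auto-negative, needs-review,
    and auto-positive buckets, as suggested above."""
    negative = [p for p in probs if p < low]
    review = [p for p in probs if low <= p <= high]
    positive = [p for p in probs if p > high]
    return negative, review, positive

# Toy probabilities; the real input would be ~160k model outputs
probs = [0.05, 0.40, 0.74, 0.76, 0.90, 0.20]
neg, rev, pos = triage(probs)
print(len(neg), len(rev), len(pos))  # 2 2 2
```

Only the middle bucket would go to manual classification, which is exactly why the approach scales: with a confident model that bucket stays small.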

@silviodc
Author

Hi everyone!

Finally I ran the code over the new dataset! Take a look at the results; they are amazing :D

@silviodc
Author

silviodc commented Jun 24, 2017

Hi @thiagoalencar

Thanks so much for following this PR too.
So, regarding the overfitting, I guess you missed that the 91% was on an external dataset. It was also suggested by @vmesel to verify whether we could use the model on other receipts.

During training, we achieved 86% accuracy on the training set and 94% in validation:
acc: 0.8691 - val_loss: 0.2265 - val_acc: 0.9423

Therefore, the model can also generalize to other data.
I suggest you take a look at the definition of overfitting in this paper: vldb ML

For instance, overfitting occurs when the model fits the training data well but does not generalize to unseen or test data.

I think the model is now good enough to be used: we have 91% on external test data.
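The quoted numbers can be checked against the definition of overfitting given above with a trivial train/validation gap calculation (a sketch, using the accuracies reported in this thread):

```python
def overfitting_gap(train_acc, val_acc):
    """Positive gap means the model fits the training data better than
    unseen data, the usual symptom of overfitting; a negative or near-zero
    gap suggests the model generalizes."""
    return train_acc - val_acc

# Accuracies reported above for this model
gap = overfitting_gap(0.8691, 0.9423)
print(round(gap, 4))  # -0.0732: validation beats training, no overfitting signal
```

A negative gap like this can happen when regularization (e.g. dropout) is active during training but disabled at evaluation time.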

Regarding the probability value, mentioned here:

Why not run the model on all ~160k meals receipts and make the spreadsheet available and ordered by the probability generated by the model?

Well, this is the probability value according to the model. It doesn't confirm anything about the reimbursement. In the end we would have a lot of reimbursements that must be validated by hand.
That leads us to two points:

  1. Validating reimbursements by hand is a laborious task, and we want to avoid it! That is the main reason we built the ML model.
  2. Since the first model wasn't good enough, we would have tons of data with questionable probabilities.

Thanks so much for your comments and interest.

@thiagoalencar

Thanks for the response, but I think a small test set does not capture all the variability of the dataset, and applying the model to the whole dataset will not generate too many cases. With this accuracy, only a few cases will remain in doubt. There will never be a model that eliminates manual classification, so the questionable classifications can be iteratively corrected, as Google does with its CAPTCHA system. Can you post the confusion matrix, sensitivity, and specificity of the model?
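The metrics requested above can be computed from the hand-validated labels with no extra dependencies; a minimal sketch (the toy labels here are hypothetical, not results from the PR):

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts plus sensitivity (recall on the
    positive class) and specificity (recall on the negative class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "specificity": specificity}

# Toy labels; in this PR they would be the hand-validated receipts
m = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(round(m["sensitivity"], 3), round(m["specificity"], 3))  # 0.667 0.667
```

Sensitivity and specificity matter more than raw accuracy here because the suspicious/legitimate classes are likely imbalanced in the full 160k-receipt dataset.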

…-amor

# Conflicts:
#	conda_requirements.txt
#	research/Dockerfile
#	research/requirements.txt
# Including pdf > png
# png > sift descriptors
# png > keras classifier
PDF to PNG ok
PNG to SIFT (error in opencv)
Change the workflow for png references
@silviodc silviodc changed the title First steps towards analyzing duplicated reimbursements and DeepLearning Keras Final analyses in Generalization of reimbursements using DeepLearning Keras Aug 28, 2017
# Using dhash to detect near duplications.
# Inclusion of Fourier transformation to detect rotation, zoom, and filters.
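The dhash (difference hash) technique mentioned in the commits can be sketched with NumPy alone on a grayscale array (real implementations, such as the `imagehash` library, resize to 9x8 with PIL first; the nearest-neighbour downsample here is a simplification):

```python
import numpy as np

def dhash(gray, hash_size=8):
    """Difference hash: downsample to hash_size x (hash_size + 1), then
    encode whether each pixel is brighter than its left neighbour."""
    h, w = gray.shape
    # crude nearest-neighbour downsample
    rows = np.arange(hash_size) * h // hash_size
    cols = np.arange(hash_size + 1) * w // (hash_size + 1)
    small = gray[np.ix_(rows, cols)]
    return (small[:, 1:] > small[:, :-1]).ravel()

def hamming(a, b):
    """Number of differing hash bits; small values mean near-duplicates."""
    return int(np.count_nonzero(a != b))

img = np.tile(np.arange(64, dtype=float), (64, 1))  # stand-in receipt scan
near_dup = img + 1.0                                # global brightness shift
print(hamming(dhash(img), dhash(near_dup)))  # 0: near-duplicate detected
```

Because the hash encodes only relative brightness gradients, it is stable under brightness changes and mild rescanning noise, which is what makes it useful for finding near-duplicate receipts.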

5 participants