
Final analyses in Generalization of reimbursements using DeepLearning Keras #238

Open
wants to merge 26 commits into
base: master

Conversation

silviodc

@silviodc silviodc commented May 18, 2017

  1. Converting PDF files to png and then applying sift descriptors
  2. DeepLearning Keras to detect generalization in reimbursements

@silviodc silviodc changed the title First steps towards analyzing duplicated reimbursements First steps towards analyzing duplicated reimbursements and DeepLearning Keras May 27, 2017
@anaschwendler
Collaborator

anaschwendler commented Jun 7, 2017

Hi @silviodc, thank you very much for the contribution! 🎉
Thanks for your patience!

@jtemporal, @cabral, and I took a closer look into it and found the best way to understand what you have done! Discussing it in the public group helped us find a way to do that! :)

You were looking for some answers, so here they are:
The document is the following one

It is clear to me that the description of the items was made by someone other than the restaurant. Is that allowed?

No, it is not allowed; the items should be described on the receipt itself.

Are the deputies or their aides altering a document? What are the implications of that?

No, the law says the receipt cannot be altered, and this looks very suspicious. It is important to notice that the receipt itself was not changed; rather, extra information was added by hand outside of the receipt while scanning it, and the CEAP clearly states that this also cannot happen. We have added an image below that highlights the part of the law that proves it.

(image: excerpt of the law with the relevant passage highlighted)

Taking a closer look at the place, that is the price of a meal for at least two people.

Can we bring the discussion here?

@vmesel

vmesel commented Jun 7, 2017

Hey @silviodc,

I have some questions about the effectiveness of your model! First things first, thanks for sending a PR to Serenata that includes CNNs and DNNs; it's a great way to contribute to Brazilian politics! Keep doing it!

Let me ask you some questions (they will help you see whether your work is really doing what it is meant to do):

  • Have you tried to create a cross-validation set to test the dataset?
  • Is the 70% accuracy rate suffering from any kind of overfitting/underfitting?
  • What would happen if a new kind of receipt is input (a new place, new items, or other data like a changed CNPJ)? Would the model still be effective?
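The first question above can be illustrated with a minimal, framework-agnostic sketch of k-fold cross-validation (this is a hypothetical stand-in: the real features would come from the receipt images, and the commented-out `model.fit` call would be the Keras classifier from the PR):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Dummy data standing in for receipt features and labels
X = np.arange(100).reshape(100, 1)
y = X[:, 0] % 2

scores = []
for train_idx, val_idx in kfold_indices(len(X), k=5):
    # model.fit(X[train_idx], y[train_idx]); score = model.evaluate(...)
    scores.append(len(val_idx) / len(X))  # placeholder fold weight
```

Each sample lands in the validation split exactly once, so averaging the per-fold scores gives a less optimistic estimate than a single train/test split.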

@silviodc
Author

silviodc commented Jun 8, 2017

Hi everyone,
So, regarding the suspicious receipt, let's bring the discussion here 👍
1 - How is this scanning of documents done?
a - By the deputy's aide?
b - By someone in the Chamber of Deputies in charge of it?
2 - Since it is not allowed, who assumes the responsibility for this mistake?
a - The deputies?
b - The employees?
c - Both?

Going to the method:

  • Have you tried to create a cross validation set to test out the dataset?

I didn't try, but it is my plan for the future. I will use the 2,483 suspicious reimbursements I got in my run to build a proper training set and a held-out dataset.

  • Is the 70% accuracy rate suffering from any kind of overfitting/underfitting?
  • What would happen if a new kind of receipt is input (a new place, new items, or other data like a changed CNPJ)? Would the model still be effective?

So, verifying the first results and the 2,483 files, it is suffering from overfitting. I thought that generating many images with ImageDataGenerator could increase the performance of the model, but that seems not to be the case.
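One reason augmentation alone may not cure overfitting is that augmented copies stay correlated with their originals. A hedged NumPy-only sketch of the kind of transforms ImageDataGenerator applies (just flips and shifts here; the real Keras API offers many more options):

```python
import numpy as np

def augment(image, rng):
    """Apply a random horizontal flip and a small pixel shift,
    mimicking a subset of Keras ImageDataGenerator transforms."""
    out = image.copy()
    if rng.rand() < 0.5:
        out = out[:, ::-1]        # horizontal flip
    shift = rng.randint(-2, 3)    # shift left/right by up to 2 pixels
    out = np.roll(out, shift, axis=1)
    return out

rng = np.random.RandomState(0)
img = np.arange(64, dtype=float).reshape(8, 8)  # stand-in for a receipt scan
batch = np.stack([augment(img, rng) for _ in range(16)])
```

Every image in `batch` contains exactly the same pixel values as `img`, just rearranged, which is why 16 augmented copies add far less information than 16 genuinely new receipts.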

The main point is: I used little data (500 reimbursements; 250 per class); I was really lazy about building the corpus, haha.
In the example I followed, they used 2,000 images (1,000 per class), and in my opinion that has a huge impact.
Changing the CNPJ, date, or place does not lead to false positives. However, changing the items (description and quantity) has a huge impact.

So, I have already classified 1,400 of the 2,483 reimbursements by hand. Maybe by August I will have the training dataset prepared. I guess with this dataset we can have a better view of whether it's a good approach to include.
In my opinion it is too soon to use it directly. However, I think that including the reimbursements shown on Twitter and Facebook and validated by citizens could be an approach in the future. Just within these 2,483 files I've noticed some interesting cases.

PS: thank you guys so much for this project! I'm very proud of you 👍 and happy to participate.

@anaschwendler
Collaborator

anaschwendler commented Jun 8, 2017

Hi @silviodc, so according to the law, we have that:

1 - How is this scanning of documents done?
a - By the deputy's aide?
b - By someone in the Chamber of Deputies in charge of it?

According to the law, the congressperson has to hand in the original document, and the Chamber will scan it. Given that, they cannot bring any already-scanned document to the Chamber.
The law is below:
"Art. 3, § 2: An expense shall be reimbursed only when proven by the original document, first copy, paid and in the name of the Deputy, except as provided in §§ 4 to 6 of this article; in the case of a telephone bill, only the cover sheet, accompanied by the corresponding proof of payment, need be presented. (Paragraph as worded by Ato da Mesa nº 66, of 8/1/2013)"
source

2 - Since it is not allowed, who assumes the responsibility for this mistake?
a - The deputies?
b - The employees?
c - Both?

The responsibility lies with the deputy, according to this piece of the law:

"Art. 4: The reimbursement request shall be made via a standard form, signed by the congressperson, who, in that act, declares full responsibility for the settlement of the expense, attesting that:
I - the goods were received or the service was rendered;
II - the purpose of the expense complies with the limits established in the legislation;
III - the documentation presented is authentic and legitimate."
source

So, I have already classified 1,400 of the 2,483 reimbursements by hand. Maybe by August I will have the training dataset prepared. I guess with this dataset we can have a better view of whether it's a good approach to include.

So you are thinking about classifying all those reimbursements to create a training dataset, and you plan to use it as an approach to include? Is there something we can do to help, besides testing and giving our opinion about it?
You can always call us :)

@silviodc
Author

silviodc commented Jun 8, 2017

Hi @anaschwendler

Continuing the discussion: since the responsibility lies only with the deputy, what do we do when they push it onto others?

Take a look:

"The Chamber has a department that reviews and approves each congressperson's expense reports; therefore, if this amount (which I consumed and paid for) violated any rule of the House, the responsible department would have flagged the problem and certainly would not have approved my expense, returning the receipt without reimbursement."

Deputy Marcon's statement about a 130-real meal

Did you find an answer to it?

"Our next step is to find out which authority to appeal to when the deputies themselves deny the numbers, the public data, and the math."

cuducos' post

PS: For how long do we have to play the game of "I'm not guilty..."?

So, for the method....

  1. You can help by validating the reimbursements I didn't check.
  2. Publishing them on the web to be validated by others...

need validation

Best,
Silvio

@anaschwendler
Collaborator

Did you find an answer to it?

We've found an answer and wrote an article in Portuguese about it.

Creating news around it and building social pressure will bring good results and education. We're aligning the next steps with lawyers.

PS: For how long we have to play the game: "I'm not guilty..." ?

We are not playing; they are :)

So, for the method....

We had an amazing experience with crowdsourcing before, and we've got plans to do the same here. Soon we’ll release a spreadsheet and drop a line through FB and Twitter.

Thank you very much! 🎉

@jtemporal
Collaborator

Hi @silviodc and everyone! Just an update here: we are releasing the file for the crowdsourcing today ;)

Pretty soon we'll have the files identified for testing the method 🎉

@silviodc
Author

Hi @jtemporal
Great news!! I guess this week we can already test the method again :D

@thiagoalencar

thiagoalencar commented Jun 19, 2017

Why not run the model on all ~160k meal receipts and make the spreadsheet available, ordered by the probability generated by the model? It would make the process faster, and after that a better model can be built. More specifically, we could classify by hand only the receipts with a probability between 0.35 and 0.75.
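The triage idea above can be sketched in a few lines (the 0.35/0.75 thresholds come from the comment; `probs` is a hypothetical stand-in for the model's output over all receipts):

```python
def triage(probs, low=0.35, high=0.75):
    """Split model probabilities into auto-negative, needs-review,
    and auto-positive buckets, as suggested above."""
    negative = [p for p in probs if p < low]
    review = [p for p in probs if low <= p <= high]
    positive = [p for p in probs if p > high]
    return negative, review, positive

# Toy probabilities; the real input would be ~160k model outputs
probs = [0.05, 0.40, 0.74, 0.76, 0.90, 0.20]
neg, rev, pos = triage(probs)
print(len(neg), len(rev), len(pos))  # 2 2 2
```

Only the middle bucket would go to manual classification, which is exactly why the approach scales: with a confident model that bucket stays small.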

@silviodc
Author

Hi everyone!

Finally I ran the code over the new dataset! Take a look at the results; they are amazing :D

@silviodc
Author

silviodc commented Jun 24, 2017

Hi @thiagoalencar

Thanks so much for following this PR too.
So, regarding the overfitting, I guess you missed that the 91% was on an external dataset. It was also suggested by @vmesel to verify whether we could use the model on other receipts.

During training, we achieved 86% accuracy on the training set and 94% in validation:
acc: 0.8691 - val_loss: 0.2265 - val_acc: 0.9423

Therefore, the model can also generalize to other data.
I suggest you take a look at the definition of overfitting in this paper: vldb ML

For instance, overfitting occurs when the model fits the training data well but does not generalize to unseen or test data.

I think the model is now good enough to be used: we have 91% on external test data.
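The quoted numbers can be checked against the definition of overfitting given above with a trivial train/validation gap calculation (a sketch, using the accuracies reported in this thread):

```python
def overfitting_gap(train_acc, val_acc):
    """Positive gap means the model fits the training data better than
    unseen data, the usual symptom of overfitting; a negative or near-zero
    gap suggests the model generalizes."""
    return train_acc - val_acc

# Accuracies reported above for this model
gap = overfitting_gap(0.8691, 0.9423)
print(round(gap, 4))  # -0.0732: validation beats training, no overfitting signal
```

A negative gap like this can happen when regularization (e.g. dropout) is active during training but disabled at evaluation time.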

Regarding the probability value, mentioned here:

Why not run the model on all ~160k meals receipts and make the spreadsheet available and ordered by the probability generated by the model?

Well, this is the probability value according to the model. It doesn't confirm anything about the reimbursement. In the end we would have a lot of reimbursements that must be validated by hand.
That leads us to two points:

  1. Validating reimbursements by hand is a laborious task, and we want to avoid it! That is the main reason we built the ML model.
  2. Since the first model wasn't good enough, we would have tons of data with questionable probabilities.

Thanks so much for your comments and interest.

@thiagoalencar

Thanks for the response, but I think a small test set does not capture all the variability of the dataset, and applying the model to the whole dataset will not generate too many cases. With this accuracy, only a few cases will remain in doubt. There will never be a model that eliminates manual classification, so the questionable classifications can be iteratively corrected, as Google does with its CAPTCHA system. Can you post the confusion matrix, sensitivity, and specificity of the model?
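The metrics requested above can be computed from the hand-validated labels with no extra dependencies; a minimal sketch (the toy labels here are hypothetical, not results from the PR):

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts plus sensitivity (recall on the
    positive class) and specificity (recall on the negative class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "specificity": specificity}

# Toy labels; in this PR they would be the hand-validated receipts
m = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(round(m["sensitivity"], 3), round(m["specificity"], 3))  # 0.667 0.667
```

Sensitivity and specificity matter more than raw accuracy here because the suspicious/legitimate classes are likely imbalanced in the full 160k-receipt dataset.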

…-amor

# Conflicts:
#	conda_requirements.txt
#	research/Dockerfile
#	research/requirements.txt
# Including pdf > png
# png > sift descriptors
# png > keras classifier
PDF to PNG ok
PNG to SIFT (error in opencv)
Change the workflow for png references
@silviodc silviodc changed the title First steps towards analyzing duplicated reimbursements and DeepLearning Keras Final analyses in Generalization of reimbursements using DeepLearning Keras Aug 28, 2017
# Using dhash to detect near duplications.
# Inclusion of Fourier transformation to detect rotation, zoom, and filters.
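The dhash (difference hash) technique mentioned in the commits can be sketched with NumPy alone on a grayscale array (real implementations, such as the `imagehash` library, resize to 9x8 with PIL first; the nearest-neighbour downsample here is a simplification):

```python
import numpy as np

def dhash(gray, hash_size=8):
    """Difference hash: downsample to hash_size x (hash_size + 1), then
    encode whether each pixel is brighter than its left neighbour."""
    h, w = gray.shape
    # crude nearest-neighbour downsample
    rows = np.arange(hash_size) * h // hash_size
    cols = np.arange(hash_size + 1) * w // (hash_size + 1)
    small = gray[np.ix_(rows, cols)]
    return (small[:, 1:] > small[:, :-1]).ravel()

def hamming(a, b):
    """Number of differing hash bits; small values mean near-duplicates."""
    return int(np.count_nonzero(a != b))

img = np.tile(np.arange(64, dtype=float), (64, 1))  # stand-in receipt scan
near_dup = img + 1.0                                # global brightness shift
print(hamming(dhash(img), dhash(near_dup)))  # 0: near-duplicate detected
```

Because the hash encodes only relative brightness gradients, it is stable under brightness changes and mild rescanning noise, which is what makes it useful for finding near-duplicate receipts.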

5 participants