Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filesize zero error while scanning the git repo with talisman #374

Open
venkatn087 opened this issue Jul 7, 2022 · 7 comments
Open

Filesize zero error while scanning the git repo with talisman #374

venkatn087 opened this issue Jul 7, 2022 · 7 comments
Labels
needs further info Need more information from the reporter

Comments

@venkatn087
Copy link

Hi Team,

I am getting an Error message while scanning my github repo as mentioned below. Could you please let me know why as i getting an error?
ERRO[0011] error reading filesize: EOF
ERRO[0011] error reading filesize: EOF
ERRO[0011] error reading filesize: EOF
ERRO[0011] error reading filesize: EOF
ERRO[0011] error reading filesize: EOF
ERRO[0011] error reading filesize: EOF
ERRO[0011] error reading filesize: EOF
ERRO[0011] error reading filesize: EOF
ERRO[0011] error reading filesize: EOF
ERRO[0011] error reading filesize: EOF

thanks
venkat

@jmatias
Copy link
Collaborator

jmatias commented Jul 11, 2022

Can you share the steps to reproduce the error?

@jmatias jmatias added the needs further info Need more information from the reporter label Jul 11, 2022
@svishwanath-tw
Copy link
Collaborator

@venkatn087 :
What version of talisman are you using ? What OS and ARCH ?

Are there size 0 files in the repository ?

Can you set TALISMAN_DEBUG or execute talisman binary with --debug flag and show us the output. You can paste it in a public/private gist and give us access (@jmatias or @svishwanath-tw ) to it. I suggest that you sanitize(clean/remove) any sensitive information in the output first.

@nasreenwahab
Copy link

@venkatn087 @svishwanath-tw I am also getting out of memory error when running the scan
image

@svishwanath-tw
Copy link
Collaborator

svishwanath-tw commented Jul 22, 2022

@nasreenwahab : Thanks for the update. I'm not convinced you and @venkatn087 are facing the same underlying issue.

From what little I've worked on the codebase, I remember that files once read remain in memory.
IMO, to fix this issue, a re-architecting of the way file (git object) data is read and streamed through various detectors is warranted.
This might have OS specific repercussions and I worry the fix won't be quick.

I have some follow up questions to plan next course of action:
How many files / objects are in your code base ?
How many commits ?
How many files per commit ?
What is average/median size of files ?
What is the size of the largest object in the repo ?

The scan feature was built in a haste and its repercussions are showing now.

@svishwanath-tw
Copy link
Collaborator

Additional complications to factor in. If the decision is to go with blocked reads of files, how to deal with secrets at the edges of a block ? This shouldn't (usually) be a problem with source code. Unless talisman is being used to scan individual files with more than a 100 thousand lines of code. (100K is an arbitrary number)

How many files do should be maintain in memory ? One early request from promoters of talisman within thoughtworks was to display the actual secret in the output (which was later limited to upto 50 chars of output with a ... overflow indicator) . I believe this to be the reason to hold blobs in memory after the detector chain execution completes.

A (comparatively) recent re-factoring I made prevented git sub-process proliferation by using a single sub-process to read contents using batch mode of git cat-file. This replaced the earlier just in time file IO (by spawning a git cat-file at will for each detector/detector type).

The number of variables to consider are huge and I believe going forward without data is not going to yield good results.

@nasreenwahab , @venkatn087, @jmatias , @harinee

@svishwanath-tw
Copy link
Collaborator

The refactoring was an attempt to quash some annoying race condition bugs reported and analysed differently by various users of talisman both internal and external to thoughtworks.

Some related issues and PR's :
#181
#203
#270

@svishwanath-tw
Copy link
Collaborator

@jmatias : FYI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs further info Need more information from the reporter
Projects
None yet
Development

No branches or pull requests

4 participants