large inputs cause hang and never finish #5

Open
fanleung opened this issue Jun 29, 2018 · 9 comments

Comments

@fanleung

fanleung commented Jun 29, 2018

I installed multidiff and ran this command on Windows 10:
“multidiff test1.bin test2.bin”
It doesn't work...

@juhakivekas
Owner

juhakivekas commented Jul 5, 2018

Are you getting any output at all, for example an error or exception?

Does the program exit successfully or does it hang?

What’s your Python version? I haven’t tested with anything lower than 3.5

How big are your files? The required memory will be much more than the combined file sizes, so test with files that are a few kilobytes maximum.

I haven’t tested on Windows at all, and some implementation details may have been missed. I’ll get a VM on Monday to test this, but if you could provide the above info it would help me :)

@fanleung
Author

fanleung commented Jul 9, 2018

Hi, here are more details.

  1. Windows 10, 64-bit, Python 3.6.4
  2. I have installed multidiff. When I run the command multidiff -h, it outputs:
    usage: multidiff [-h] [-p PORT] [-s] [-m MODE] [-i INFORMAT] [-o OUTFORMAT]
                     [--html]
                     [file [file ...]]

    N E O N S E N S E
    augmentations inc
    ┌───────────────┐
    │ M U L T I │
    │ D I F F │
    │ sensor module │
    └───────────────┘

    positional arguments:
      file                  file or directory to include in multidiff

    optional arguments:
      -h, --help            show this help message and exit
      -p PORT, --port PORT  start a local socket server on the given port
      -s, --stdin           read data from stdin, objects split by newlines
      -m MODE, --mode MODE  mode of operation, either "baseline" or "sequence"
      -i INFORMAT, --informat INFORMAT
                            input data format:
                              utf8 (stdin default)
                              raw (file and server default)
                              hex
                              json
      -o OUTFORMAT, --outformat OUTFORMAT
                            output data format:
                              utf8
                              hex
                              hexdump (default)
      --html                use html for colors instead of ansi codes

  3. Now I want to diff two bin files, so I ran the command multidiff 1.bin 2.bin.
    It hangs and outputs nothing...
    Only Ctrl+C can exit, and the output is:

    C:\Users\Fanleung\Desktop\multidiff-master\multidiff-master>multidiff 1.bin 2.bin
    Traceback (most recent call last):
      File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\Scripts\multidiff-script.py", line 6, in <module>
        from pkg_resources import load_entry_point
      File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\pkg_resources\__init__.py", line 70, in <module>
        from pkg_resources.extern import appdirs
      File "<frozen importlib._bootstrap>", line 971, in _find_and_load
      File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
      File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
      File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\pkg_resources\extern\__init__.py", line 43, in load_module
        __import__(extant)
      File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\pkg_resources\_vendor\appdirs.py", line 510, in <module>
        import win32com.shell
      File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\win32com\__init__.py", line 6, in <module>
        import pythoncom
      File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\pythoncom.py", line 3, in <module>
        pywintypes.__import_pywin32_system_module__("pythoncom", globals())
      File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\site-packages\win32\lib\pywintypes.py", line 123, in __import_pywin32_system_module__
        mod = imp.load_dynamic(modname, found)
      File "C:\Users\Fanleung\AppData\Local\Programs\Python\Python36\lib\imp.py", line 343, in load_dynamic
        return _load(spec)
    KeyboardInterrupt

@juhakivekas
Owner

I set up a virtual machine and tested this on Windows 10 with Python 3.7, and everything seems to work fine. Did you install multidiff by running python setup.py install? Are you sure the files are not too large?

@fanleung
Author

Ahh... When I created two test bin files of about 4KB each, it finished.
What is the file size limit?

@juhakivekas
Owner

juhakivekas commented Jul 10, 2018

The limit is constrained by the available RAM on your computer, so I’m not sure, but I’d say it’s in the tens or hundreds of megabytes. The whole file will be printed too, so a large file might be a little impractical to work with.

What kind of data are you looking at and what are you trying to find? Small differences in large files?

@fanleung
Author

Yes, I am trying to find small differences in large files.
The last time the program hung, the test file size was about 4MB.
This time I tried to diff two 1MB files. It runs, but it takes a long time (about 1 minute).
Maybe the 4MB files would also work, but I didn't wait.

@juhakivekas
Owner

Yes, that's due to the python difflib needing to always diff the whole sequence. The difflib documentation says:

Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

So in the worst case, a 4MB input takes 4^3 = 64 times as long as a 1MB input under the cubic bound of the basic algorithm, or 4^2 = 16 times as long under SequenceMatcher's quadratic worst case. Either way that is clearly too long for your use. I also assume you wouldn't really want to see all the matching parts, but only the differing ones? I think making the tool faster would need some work on the underlying diffing algorithms, which is something I'm unlikely to have time for in the near future.
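
As a rough illustration of the two points above (the scaling estimate and showing only the differing parts), here is a minimal Python sketch. It calls difflib.SequenceMatcher directly and is not multidiff's actual code; the names slowdown and changed_regions are made up for this example.

```python
from difflib import SequenceMatcher


def slowdown(size_ratio, exponent):
    """Estimated slowdown when the input grows by size_ratio at O(n**exponent)."""
    return size_ratio ** exponent


def changed_regions(a: bytes, b: bytes):
    """Return only the non-matching opcodes between two byte strings.

    The full diff is still computed, so this trims the output, not the runtime.
    """
    # autojunk=False turns off difflib's heuristic that treats very common
    # elements as junk once a sequence has 200 or more items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


if __name__ == "__main__":
    print(slowdown(4, 3))  # 64, cubic bound
    print(slowdown(4, 2))  # 16, quadratic bound

    old = bytes(range(16))
    new = old[:5] + b"\xff" + old[6:]  # overwrite a single byte
    for tag, i1, i2, j1, j2 in changed_regions(old, new):
        print(tag, hex(i1), old[i1:i2].hex(), "->", new[j1:j2].hex())
```

Filtering the opcodes only hides the matching regions. The matcher still has to align both inputs in full, so the hang on large files would remain.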

@juhakivekas changed the title from "excuse me , how to diff two bin or hex file?" to "large inputs cause hang and never finish" on Apr 25, 2019
@Utkarsh1308
Contributor

I modified the multidiff library so that the generated diff only shows the addresses where the bytes have changed. This only works for the hexdump outformat.

The output looks something like this - https://pastebin.com/csT3dpRK

I tested it on a 2MB file and it took approximately 10 minutes.
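
For files where bytes are only overwritten, never inserted or deleted, a position-by-position comparison runs in linear time and avoids difflib entirely. The sketch below illustrates that idea under that assumption. It is not the modification described above; differing_offsets and format_changes are hypothetical names, and the output format is invented for the example.

```python
def differing_offsets(a: bytes, b: bytes):
    """Return the offsets at which a and b differ, compared position by position.

    Assumes no insertions or deletions. If the lengths differ, every offset
    past the end of the shorter input is reported as differing.
    """
    common = min(len(a), len(b))
    offsets = [i for i in range(common) if a[i] != b[i]]
    offsets.extend(range(common, max(len(a), len(b))))
    return offsets


def format_changes(a: bytes, b: bytes):
    """Render each differing offset as 'offset: old -> new' in hex."""
    lines = []
    for off in differing_offsets(a, b):
        left = f"{a[off]:02x}" if off < len(a) else "--"   # "--" marks a missing byte
        right = f"{b[off]:02x}" if off < len(b) else "--"
        lines.append(f"{off:08x}: {left} -> {right}")
    return lines


if __name__ == "__main__":
    first = bytes(8)
    second = bytes([0, 0, 0, 0xFF, 0, 0, 0, 0])
    for line in format_changes(first, second):
        print(line)  # 00000003: 00 -> ff
```

If a single byte is inserted or deleted, every later offset would be reported as changed. That shifted case is what difflib's alignment handles, at the cost discussed earlier in this thread.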

@juhakivekas
Owner

Great, if you want me to merge those changes, just make a pull request, but add a command-line flag for the feature. I think the hang issue is related to calculating the diff rather than outputting the result, so I won't be closing this issue :)
