Slow execution time of reading a big file #128

Open
tomkeee opened this issue Sep 26, 2022 · 4 comments

Comments

tomkeee commented Sep 26, 2022

I've tested how fast the platform would analyze a big .csv file (532 MB, 3,235,282 lines). The execution time of the program (code below) is about 25 minutes.

The program should just print the current line number with a very simple comment.

main.py

from scramjet.streams import Stream

lines_number = 0 

def count(x):
    global lines_number
    lines_number += 1
    return lines_number

def show_line_number(x):
    global lines_number
    if lines_number < 1000:
        return f"{lines_number} \n"
    elif lines_number > 2000:
        return f"{lines_number} bigger than 2000 \n"
    return None

def run(context, input):
    x = (Stream
        .read_from(input)
        .each(count)
        .map(show_line_number)
    )
    return x

package.json

{
    "name": "@scramjet/python-big-files",
    "version": "0.22.0",
    "lang": "python",
    "main": "./main.py",
    "author": "XYZ",
    "license": "GPL-3.0",
    "engines": {
        "python3": "3.9.0"
    },
    "scripts": {
        "build:refapps": "yarn build:refapps:only",
        "build:refapps:only": "mkdir -p dist/__pypackages__/ && cp *.py dist/ && pip3 install -t dist/__pypackages__/ -r requirements.txt",
        "postbuild:refapps": "yarn prepack && yarn packseq",
        "packseq": "PACKAGES_DIR=python node ../../scripts/packsequence.js",
        "prepack": "PACKAGES_DIR=python node ../../scripts/publish.js",
        "clean": "rm -rf ./dist"
    }
}

requirements.txt
scramjet-framework-py

@Tatarinho

@tomkeee Print operations are slow in every language; if you want to print every line of a big file, keep in mind that this will drastically slow it down. What is the execution time (processing the full file) without printing?
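For reference, that baseline (reading the whole file with no printing inside the loop) can be measured with a minimal script like the one below; the file path is just a placeholder for the same 532 MB .csv:

import time

# Count lines only, with no printing inside the loop,
# to measure pure file-reading time.
lines_number = 0
start = time.time()
with open("data.csv") as file_in:  # placeholder path
    for _ in file_in:
        lines_number += 1
print(f"{lines_number} lines read in {time.time() - start:.1f}s (no printing)")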


tomkeee commented Sep 29, 2022

@Tatarinho
You are right that print operations are quite slow, but I tried a similar operation on my local machine (code below) and the execution time was 55 seconds (on Scramjet it was 25 minutes).

import time

lines_number = 0
with open("/home/sirocco/Pulpit/data.csv") as file_in:
    start = time.time()
    for i in file_in:
        if lines_number < 1000:
            print(f"{lines_number} \n")
        elif lines_number > 2000:
            print(f"{lines_number} bigger than 2000 \n")
        lines_number += 1

    print(f"the line_numbers is {lines_number}\n execution time: {time.time()-start}")

[Screenshot from 2022-09-29 09-20-57]

@MichalCz

Hi @tomkeee, we'll be looking into this next week.


MichalCz commented Oct 3, 2022

Hmm... so I did some initial digging and was able to run a test over the local network; a similar program in Node works quite fast, though not as fast as reading from disk...

We need to take into account the network connection, but that wouldn't explain 25 minutes.

Could you follow this guide: https://docs.scramjet.org/platform/self-hosted-installation

Then, based on that setup, can you try your program with the data sent to 127.0.0.1? That would let us exclude the network and the platform configuration as the culprit...
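As a complementary data point, a framework-only baseline (scramjet-framework-py reading straight from the local file, with no platform and no network in between) could be timed roughly like the sketch below. This is only a sketch: it assumes Stream.read_from() accepts an open file object and that the stream exposes an awaitable to_list(), as in the framework's README examples; adjust to your version of the API if needed.

import asyncio
import time

from scramjet.streams import Stream

async def main():
    start = time.time()
    with open("/home/sirocco/Pulpit/data.csv") as file_in:
        # Trivial pipeline: read + identity map, no printing, no platform in between.
        result = await Stream.read_from(file_in).map(lambda line: line).to_list()
    # Chunk count may differ from the line count, depending on how the
    # framework splits file input.
    print(f"{len(result)} chunks in {time.time() - start:.1f}s (framework only, local disk)")

asyncio.run(main())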
