
run node-parquet in AWS Lambda #20

Open
taureliloome opened this issue Mar 14, 2017 · 13 comments

@taureliloome

Hi,
I wanted to use this wonderful module in AWS Lambda. The key blocker is that when I compile the node-parquet module, the whole thing is over 400 MB; unfortunately, AWS Lambda allows uploading at most ~240 MB per Lambda function.
I was wondering whether there is any way to slim the output down, or is this what we get?
In any case, I'm looking through the make files to understand if I can do something on my own.
Thanks for your time!

@mvertes
Collaborator

mvertes commented Mar 14, 2017

It should be possible to make it smaller than 400 MB.

mvertes added a commit to mvertes/node-parquet that referenced this issue Apr 3, 2017
It allows removing the build_deps directory after building the
module, drastically reducing the module size, addressing skale-me#20.
@mvertes
Collaborator

mvertes commented Apr 4, 2017

Hi, can you check again and run npm run clean after npm install? It should remove most of the files that are necessary for building but useless at runtime.

@alaister

Any luck getting it to work? I am getting the following error when trying to run in Lambda:


{
  "errorMessage": "libboost_regex.so.1.62.0: cannot open shared object file: No such file or directory",
  "errorType": "Error",
  "stackTrace": [
    "Object.Module._extensions..node (module.js:597:18)",
    "Module.load (module.js:487:32)",
    "tryModuleLoad (module.js:446:12)",
    "Function.Module._load (module.js:438:3)",
    "Module.require (module.js:497:17)",
    "require (internal/module.js:20:19)",
    "Object.<anonymous> (/var/task/node_modules/node-parquet/index.js:5:17)",
    "Module._compile (module.js:570:32)",
    "Object.Module._extensions..js (module.js:579:10)"
  ]
}

@aib-nick

aib-nick commented Mar 8, 2018

@alaister make a 'lib' folder in your Lambda function package and copy that library in there; that worked for me.

@fzaffarana

@aib-nick can you give me more information about your work with AWS Lambdas and the module?

I'm getting this error:

module initialization error: Error
at Object.Module._extensions..node

I guess it is the same error @alaister got.

Thanks!

@aib-nick

aib-nick commented Apr 17, 2018

@fzaffarana this is my Lambda application layout. As you can see, I just made a lib directory and copied the missing library in there. I use AWS Cloud9 for Lambda development, so I got the library from there, and it works when deployed.

./myprogram
./myprogram/index.js
./lib
./lib/libboost_regex.so.1.53.0
./node_modules/node-parquet/...
... other modules installed with normal npm install  ...
./template.yaml
./.application.json

and then I just include and use stuff normally

I have successfully produced Parquet files on S3 with this by putting a function inside a Kinesis stream as a transformation function and then discarding all the transformed records: the Lambda function writes to S3, and the Kinesis stream does not. It almost worked, but I got a few errors where Kinesis aborted, and I couldn't really debug what was going on, so I ultimately had to abandon this method because of time constraints. But it was very close; I was able to read the resulting files from Athena.

// setup AWS access
const setRegion = "us-east-1";
const AWS = require('aws-sdk');
AWS.config.update({region: setRegion});

// setup s3 access
const s3 = new AWS.S3();

// file and timestamp helpers used below
const fs = require('fs');
const tmp = require('tmp');
const moment = require('moment');

// parquet access
const parquet = require('node-parquet');
....

exports.handler = (event, context, callback) => {

...
   
    // schema for this parquet file
    const schema = { ... };

.... loop through input and build up out_data[]  ...

            var tmpobj = tmp.fileSync();    
            var writer = new parquet.ParquetWriter(tmpobj.name, schema, 'snappy');
            writer.write(out_data[k]);
            writer.close();

... write to s3 ...

            // give s3 the ability to read the local file and stream it
            var rs = fs.createReadStream(tmpobj.name);
            
            var s3_key = "parquet/stuff/year=" + moment(k).format('YYYY');
            s3_key = s3_key + "/month=" + moment(k).format('MM');
            s3_key = s3_key + "/day=" + moment(k).format('DD');
            s3_key = s3_key + "/" + invocationId + ".snappy.parquet";

...

 s3.putObject(s3_put_params, function(err, data) {

.. throw away records so kinesis doesn't write them after we wrote ok...

                // this tells kinesis to throw away all the records we saved otherwise
                output.push({
                    recordId: record.recordId,
                    result: 'Dropped'
                });

...

callback(null, { records: output });
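The record-dropping part of the sketch above can be condensed into a small helper (dropAll is a hypothetical name; the recordId/result pair is the shape a Firehose transformation function returns per record):

```javascript
// Sketch: build a transformation response that drops every record,
// since the function has already written the data to S3 itself.
function dropAll(records) {
  return records.map((record) => ({
    recordId: record.recordId,
    result: 'Dropped',
  }));
}

// Usage inside the handler:
// callback(null, { records: dropAll(event.records) });
```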

@fzaffarana

@aib-nick thank you first of all for the help.

I can see that we have similar Lambdas (this is good), and I'm going to borrow your trick of letting S3 read the local file as a stream.

But I don't know if we have the same error.

This is mine (from the AWS console):

module initialization error: Error
at Object.Module._extensions..node (module.js:681:18)
at Module.load (module.js:565:32)
at tryModuleLoad (module.js:505:12)
at Function.Module._load (module.js:497:3)
at Module.require (module.js:596:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (/var/task/src/project/classes/node-parquet/index.js:5:17)
at Module._compile (module.js:652:30)
at Object.Module._extensions..js (module.js:663:10)
at Module.load (module.js:565:32)
at tryModuleLoad (module.js:505:12)
at Function.Module._load (module.js:497:3)
at Module.require (module.js:596:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (/var/task/src/project/classes/Tools.js:4:17)
at Module._compile (module.js:652:30)

It doesn't point at any specific missing lib. On the other hand, when I test this Lambda in my local environment, it works correctly.
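One way to surface the underlying loader message in the logs (rather than just "module initialization error") is to wrap the require in a try/catch; safeRequire is a hypothetical helper, not part of node-parquet:

```javascript
// Sketch: catch the native-addon load failure so the full error message
// (including any missing .so name) and the library search path get logged.
function safeRequire(name) {
  try {
    return require(name);
  } catch (err) {
    console.error(`${name} failed to load:`, err.message);
    console.error('LD_LIBRARY_PATH =', process.env.LD_LIBRARY_PATH);
    return null;
  }
}

const parquet = safeRequire('node-parquet');
```

With this in place, a missing shared library shows up by name in CloudWatch instead of being swallowed during module initialization.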

@visuddha

This would be a useful feature!

@palafoxernesto

Is there any fix?

@dogenius01

@aib-nick
Hi, could you list the lib files?
I'm re-running the Lambda to find the missing libs one error at a time... :(
Please help!

@mikeytag

It's been a while since this question was originally asked, but I wanted to follow up and see if anyone has a tried-and-true way of doing the npm install and adding the lib files that always works to get node-parquet running on Lambda?

I'm about to embark on this task and would love to hear the wisdom of others as far as any gotchas.

@paflopes

paflopes commented Dec 11, 2019

I've managed to run node-parquet on AWS Lambda with the NodeJS 10.x runtime; it's worth mentioning that I couldn't build it on newer NodeJS versions. You'll also need Docker installed on your machine.
The steps are the following:

Run this in the root folder of your project

$ docker run --rm -it -v "$PWD":/var/task lambci/lambda:build-nodejs10.x /bin/bash

This will give you an environment similar to the AWS Lambda.

Inside the container run the following commands:

# First we update the cmake version since this image comes with the version 2
cmake_name="cmake-3.16.1-Linux-x86_64"
cmake_tar="${cmake_name}.tar.gz"
curl -L https://github.com/Kitware/CMake/releases/download/v3.16.1/${cmake_tar} -o /opt/${cmake_tar}
mkdir -p /opt/${cmake_name}
tar xf /opt/${cmake_tar} -C /opt
chmod a+x /opt/${cmake_name}/bin/cmake
mv /bin/cmake /bin/cmake.bkp
ln -s /opt/${cmake_name}/bin/cmake /bin/cmake

# Now we install the last dependencies and build the project
yum install -y boost-devel bison flex
npm install

# Cleanup dependencies so we can actually deploy to AWS Lambda
rm -Rf ./node_modules/node-parquet/build_deps

I hope this helps!

@dreadjr

dreadjr commented Dec 11, 2019

I have done something similar to what @paflopes describes, putting the result into a Lambda layer which the application can use.
