
refactor(dev/docker/hive): shrink hive Docker image size by 420MB #3268

Merged
yuqi1129 merged 10 commits into datastrato:main from the feat-shrink-docker-image-size branch on May 22, 2024

Conversation

Contributor

@unknowntpo unknowntpo commented May 5, 2024

What changes were proposed in this pull request?

Use a multi-stage build to reduce the Hive image size from 2.27GB to 1.83GB.

The first stage downloads the archive files and unpacks them; the final stage copies the unpacked files to their destination directories.
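
A multi-stage build helps because a `COPY` of a tarball creates a layer that stays in the image even if a later `RUN` deletes the file. The sketch below writes a minimal two-stage Dockerfile to a temporary directory to show the shape. It is an illustration under assumptions, not the Dockerfile from this PR: the stage name `unpack`, the path `/opt/hadoop`, the file name `hadoop.tar.gz`, and the tag `hive-sketch` are all hypothetical, and it copies a local `packages` directory rather than downloading anything.

```shell
#!/bin/sh
set -eu

# Temporary build context for the sketch.
ctx=$(mktemp -d)

# The quoted heredoc keeps the Dockerfile text literal (no shell expansion).
cat > "$ctx/Dockerfile" <<'EOF'
# Stage 1: unpack the archive. Layers of this stage are not shipped.
FROM ubuntu:16.04 AS unpack
COPY packages /tmp/packages
RUN mkdir -p /opt/hadoop \
 && tar -xz -C /opt/hadoop --strip-components 1 -f /tmp/packages/hadoop.tar.gz

# Final stage: copy only the unpacked tree, never the tarball.
FROM ubuntu:16.04
LABEL maintainer="support@datastrato.com"
COPY --from=unpack /opt/hadoop /opt/hadoop
EOF

# To build (needs Docker and a packages/ directory inside "$ctx"):
#   docker build -t hive-sketch "$ctx"

# Count the stages declared in the generated file.
grep -c '^FROM' "$ctx/Dockerfile"   # prints 2
```

Only the final stage's layers end up in the published image, so the compressed archive adds nothing to its size.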

Why are the changes needed?

Fix: #3262

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

TODO

@unknowntpo unknowntpo changed the title refactor(dev/docker/hive): shrink hive Docker image size by 200MB refactor(dev/docker/hive): shrink hive Docker image size by 500MB May 5, 2024
@unknowntpo unknowntpo closed this May 5, 2024
@unknowntpo
Contributor Author

reopen for rerunning ci tests

@unknowntpo unknowntpo reopened this May 5, 2024
@unknowntpo unknowntpo marked this pull request as ready for review May 5, 2024 09:58
@unknowntpo
Contributor Author

@mchades Would you like to review this PR?

@@ -83,6 +83,7 @@ public void startHiveContainer() {
HiveContainer.Builder hiveBuilder =
HiveContainer.builder()
.withHostName("gravitino-ci-hive")
.withImage("unknowntpo/gravitino-ci-hive:latest")
Contributor Author

I use my own Hive image here; we need to upload the newly created gravitino-ci-hive image before merging.

Contributor

Hi @unknowntpo, it's not good practice to use your own image here. @xunliu, please help verify whether the changes to the Hive image are reasonable. If everything is OK, we can release the image before this PR is merged, and then @unknowntpo can switch to the newly released image.

@unknowntpo Another point: you may need to update the documentation for the Hive image. Please see
https://github.com/datastrato/gravitino/blob/14297cddc894a4d47851ddfbcaec50bc547e0387/docs/docker-image-details.md?plain=1#L87C1-L88C1

Contributor Author

OK. If @xunliu finds this modification reasonable and pushes the new image to Docker Hub, I'll update the changelog in the docs.

Contributor Author

I've added some TODOs to the PR description.

Collaborator

> Hi @unknowntpo, it's not good practice to use your own image here. @xunliu, please help verify whether the changes to the Hive image are reasonable. If everything is OK, we can release the image before this PR is merged, and then @unknowntpo can switch to the newly released image.
>
> @unknowntpo Another point: you may need to update the documentation for the Hive image. Please see https://github.com/datastrato/gravitino/blob/14297cddc894a4d47851ddfbcaec50bc547e0387/docs/docker-image-details.md?plain=1#L87C1-L88C1

@yuqi1129 Can you help release this Docker image to Docker Hub?

Contributor

> > Hi @unknowntpo, it's not good practice to use your own image here. @xunliu, please help verify whether the changes to the Hive image are reasonable. If everything is OK, we can release the image before this PR is merged, and then @unknowntpo can switch to the newly released image.
> > @unknowntpo Another point: you may need to update the documentation for the Hive image. Please see 14297cd/docs/docker-image-details.md?plain=1#L87C1-L88C1
>
> @yuqi1129 Can you help release this Docker image to Docker Hub?

OK

@@ -45,8 +64,6 @@ RUN apt-get update && apt-get upgrade -y && apt-get install --fix-missing -yq \
RUN mkdir /root/.ssh
RUN cat /dev/zero | ssh-keygen -q -N "" > /dev/null && cat /root/.ssh/id_rsa.pub > /root/.ssh/authorized_keys

COPY packages /tmp/packages
Contributor

You can use ADD. If you ADD a local tar archive in a Dockerfile, the Docker builder copies and unpacks it in one step, which reduces the image size.

Then it seems we do not need a two-stage build.

Contributor Author

@unknowntpo unknowntpo May 8, 2024

Thanks for your advice! I used ADD at first, but I found that we need --strip-components to remove the outer directory, and ADD cannot do this.
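
The difference can be reproduced outside Docker. The sketch below builds a tiny stand-in archive with a single top-level directory, then extracts it twice: once as-is, which is the layout a plain unpack such as ADD produces, and once with `tar --strip-components 1`. The archive name and the `hadoop-x.y.z` directory are made up for illustration.

```shell
#!/bin/sh
set -eu

# Build a stand-in archive with one wrapper directory (hypothetical names).
work=$(mktemp -d)
mkdir -p "$work/src/hadoop-x.y.z/bin"
echo 'echo hadoop' > "$work/src/hadoop-x.y.z/bin/hadoop"
tar -czf "$work/hadoop.tar.gz" -C "$work/src" hadoop-x.y.z

# Plain extraction: the wrapper directory is kept.
mkdir "$work/plain"
tar -xzf "$work/hadoop.tar.gz" -C "$work/plain"

# Extraction with --strip-components 1: the wrapper directory is dropped.
mkdir "$work/stripped"
tar -xzf "$work/hadoop.tar.gz" -C "$work/stripped" --strip-components 1

ls "$work/plain"      # hadoop-x.y.z
ls "$work/stripped"   # bin
```

With the wrapper directory stripped, the files land directly under the target directory, which is what a fixed `HADOOP_HOME` expects.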

Contributor

Yes, --strip-components does not work with ADD, so I used a soft link in the Doris image.

Contributor Author

Okay, I'll try the soft-link approach, thanks 😀

@@ -3,14 +3,33 @@
# This software is licensed under the Apache License version 2.
#

FROM ubuntu:16.04
LABEL maintainer="support@datastrato.com"
Collaborator

I think we need to keep LABEL maintainer="support@datastrato.com" here.

Contributor Author

Since I no longer use a two-stage build, this review comment is outdated.

@@ -3,14 +3,33 @@
# This software is licensed under the Apache License version 2.
#

FROM ubuntu:16.04
LABEL maintainer="support@datastrato.com"
FROM ubuntu:16.04 AS packages
Collaborator

I think it is better to change this to FROM ubuntu:16.04 AS install-hadoop. The stage name packages is easily confused with the /tmp/packages directory.

Contributor Author

Since I no longer use a two-stage build, this is not needed.


# hadoop
RUN mkdir ${HADOOP_HOME}
RUN tar -xz -C ${HADOOP_HOME} --strip-components 1 -f /tmp/packages/${HADOOP_PACKAGE_NAME} && rm -rf /tmp/packages/${HADOOP_PACKAGE_NAME}
Collaborator

Why do we need the --strip-components parameter?

Contributor Author

same reason as #3268 (comment)

@unknowntpo unknowntpo force-pushed the feat-shrink-docker-image-size branch from b4fd3a5 to 8c3ffa5 Compare May 12, 2024 03:37
@unknowntpo
Contributor Author

unknowntpo commented May 12, 2024

I switched to the soft-link approach that @zhoukangcn mentioned at #3268 (comment). With it, we don't need to maintain a lot of environment variables such as JAVA_HOME=/usr/local/jdk in both the first stage and the final stage, which was a little ugly.
cc @xunliu, @mchades
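
The soft-link idea can be sketched in plain shell: unpack the archive as-is, keeping its versioned wrapper directory, then point a stable path at it with `ln -s`. This is only an illustration; the `jdk-x.y.z` directory, the archive name, and the temporary `opt` location are hypothetical stand-ins for the real packages and for a path such as `/usr/local/jdk`.

```shell
#!/bin/sh
set -eu

# Build a stand-in JDK archive with a versioned wrapper directory.
work=$(mktemp -d)
mkdir -p "$work/src/jdk-x.y.z/bin"
echo '#!/bin/sh' > "$work/src/jdk-x.y.z/bin/java"
tar -czf "$work/jdk.tar.gz" -C "$work/src" jdk-x.y.z

# Unpack as-is; the versioned directory name is kept.
mkdir -p "$work/opt"
tar -xzf "$work/jdk.tar.gz" -C "$work/opt"

# Give the versioned directory a stable name via a symbolic link.
ln -s "$work/opt/jdk-x.y.z" "$work/opt/jdk"

# Environment variables can now use the stable path.
JAVA_HOME="$work/opt/jdk"
ls "$JAVA_HOME/bin"   # java
```

Because the stable path is just a link, the environment variables are defined once and do not depend on the version in the archive's directory name.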

@xunliu
Collaborator

xunliu commented May 17, 2024

Hi @unknowntpo, thank you for your continued work.
Please attach a screenshot of your latest Docker image size in the comments.
Does the soft-link approach also shrink the image by 500MB?

@unknowntpo
Contributor Author

@xunliu Shrunk Hive Docker image: 1.85 GB (with soft link)

Original Docker image: 2.27 GB
image

It's not 500 MB, but 420 MB 😭
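
The 420 MB figure follows directly from the two sizes reported above. A quick check in shell arithmetic, assuming 1 GB = 1000 MB (the variable names are only illustrative):

```shell
#!/bin/sh
set -eu

original_mb=2270   # 2.27 GB, original image
shrunk_mb=1850     # 1.85 GB, image built with the soft-link approach

# Absolute saving in MB.
saved_mb=$((original_mb - shrunk_mb))
echo "saved ${saved_mb} MB"    # saved 420 MB

# Relative saving, rounded down by integer division.
pct=$((saved_mb * 100 / original_mb))
echo "about ${pct}% smaller"   # about 18% smaller
```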

@unknowntpo unknowntpo changed the title refactor(dev/docker/hive): shrink hive Docker image size by 500MB refactor(dev/docker/hive): shrink hive Docker image size by 420MB May 17, 2024
@xunliu
Collaborator

xunliu commented May 21, 2024

Hi @unknowntpo, please fix the branch conflicts. Thanks.

@unknowntpo unknowntpo force-pushed the feat-shrink-docker-image-size branch from 4fc6112 to 38fde2d Compare May 21, 2024 14:29
xunliu previously approved these changes May 22, 2024
Collaborator

@xunliu xunliu left a comment

LGTM

@yuqi1129
Contributor

yuqi1129 commented May 22, 2024

@unknowntpo
Could you also modify the change log?
please see:

- gravitino-ci-hive:0.1.11
  - Remove `yarn` from the startup script; Remove `yarn-site.xml` and `yarn-env.sh` files;
  - Change the value of `mapreduce.framework.name` from `yarn` to `local` in the `mapred-site.xml` file.

@yuqi1129
Contributor

I have released gravitino-ci-hive:0.1.12. Please upgrade the image used in Gravitino from version 0.1.11 to 0.1.12.

@unknowntpo unknowntpo force-pushed the feat-shrink-docker-image-size branch from 87acb60 to 200c0e8 Compare May 22, 2024 05:03
@unknowntpo
Contributor Author

> I have released gravitino-ci-hive:0.1.12. Please upgrade the image used in Gravitino from version 0.1.11 to 0.1.12.

@yuqi1129 Done.

@yuqi1129
Contributor

> > I have released gravitino-ci-hive:0.1.12. Please upgrade the image used in Gravitino from version 0.1.11 to 0.1.12.
>
> @yuqi1129 Done.

I mean you need to change the following place:

image

There should be several similar places.

@unknowntpo unknowntpo force-pushed the feat-shrink-docker-image-size branch from 200c0e8 to acf79aa Compare May 22, 2024 06:01
Contributor

@yuqi1129 yuqi1129 left a comment

LGTM

@unknowntpo
Contributor Author

@yuqi1129 Done, the tests have passed.

@yuqi1129 yuqi1129 merged commit 506963e into datastrato:main May 22, 2024
22 checks passed
github-actions bot pushed a commit that referenced this pull request May 22, 2024

### What changes were proposed in this pull request?

Use a multi-stage build to reduce the Hive image size from 2.27GB to 1.83GB.

The first stage downloads the archive files and unpacks them; the final
stage copies the unpacked files to their destination directories.

### Why are the changes needed?

Fix: #3262 

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.


## TODO
- [ ] @mchades needs to push new image release to docker hub
- [ ] @unknowntpo should update changelog
@unknowntpo unknowntpo deleted the feat-shrink-docker-image-size branch May 22, 2024 07:03
Development

Successfully merging this pull request may close these issues.

[Subtask] [Improvement] Use multi-staged build to shrink Hive image size
4 participants