XPK update for hybridsim #102
@@ -123,10 +123,20 @@
volumeMounts:
- mountPath: /dev/shm
  name: dshm-2
- mountPath: /var/run/docker.sock
We need to talk about the architecture; hardcoding this doesn't make sense to me.
+1 to avoiding hardcoding this in the common path for TPUs/CPUs.
It seems like both GPU and Pathways mount the /tmp folder:
https://github.com/google/xpk/blob/main/xpk.py#L325-L327
https://github.com/google/xpk/blob/main/xpk.py#L271-L272
Maybe I should mount the /tmp folder instead of /tmp/xla_dump/?
GPU and Pathways have their own workload-create files (also a sad state of duplication), which is why those mounts are OK there. The issue here is that we are mounting something that doesn't always exist.
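To make the concern above concrete, here is a minimal sketch of keeping these mounts out of the common path and adding them only on request. It is not code from xpk: the function name `build_hybridsim_mounts`, the `enable_hybridsim` switch, and the volume names are hypothetical. The only things taken from this thread are the two paths. The `hostPath` `type` values (`Socket`, `DirectoryOrCreate`) are standard Kubernetes options, used here on the assumption that the dump directory may be missing on some nodes.

```python
from typing import Any


def build_hybridsim_mounts(enable_hybridsim: bool) -> dict[str, list[dict[str, Any]]]:
    """Return extra pod volumes and container volumeMounts for Hybridsim.

    Hypothetical helper: returns empty lists unless Hybridsim is explicitly
    requested, so the common TPU/CPU path is left unchanged.
    """
    if not enable_hybridsim:
        return {"volumes": [], "volumeMounts": []}

    volumes = [
        {
            "name": "docker-sock",  # hypothetical volume name
            # "Socket" requires the socket to already exist on the node.
            "hostPath": {"path": "/var/run/docker.sock", "type": "Socket"},
        },
        {
            "name": "xla-dump",  # hypothetical volume name
            # "DirectoryOrCreate" creates the directory if it is missing.
            "hostPath": {"path": "/tmp/xla_dump/", "type": "DirectoryOrCreate"},
        },
    ]
    volume_mounts = [
        {"name": "docker-sock", "mountPath": "/var/run/docker.sock"},
        {"name": "xla-dump", "mountPath": "/tmp/xla_dump/"},
    ]
    return {"volumes": volumes, "volumeMounts": volume_mounts}
```

With this shape, a workload that never asks for Hybridsim gets no extra mounts, and a node without a Docker socket is only affected when the option is turned on.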
I assume this PR is not needed anymore? @tonyjohnchen, let me know if we can close it!
Yeah, I can close it.
Fixes / Features
For AOT+Hybridsim integration, we need to pull the Hybridsim docker image, so we need to mount /var/run/docker.sock (https://screenshot.googleplex.com/BD3SMwL57tP5cQY)
For running Hybridsim, we need to mount /tmp/xla_dump/ so the Hybridsim docker image can access the HLO graph (https://screenshot.googleplex.com/64Fwrjwt6g55orn)
Testing / Documentation
Tested AOT+Hybridsim e2e on a GKE node: https://screenshot.googleplex.com/5LQwBNBp4p6iyLq
This should be a combo PR of Maxtext and XPK:
google/maxtext#564
#102