Support Synchronization Dependency with Holistic Trace Analysis #57

Open · wants to merge 4 commits into base: main

Conversation

JoongunPark (Contributor) commented May 10, 2024

Summary

This PR processes synchronization dependencies between Chakra nodes.
To do so, it uses the CriticalPathAnalyzer from Holistic Trace Analysis (https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/hta/analyzers/critical_path_analysis.py).
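Below is a minimal sketch of how HTA's critical path analysis is typically invoked. The `TraceAnalysis` entry point and its `critical_path_analysis` call are from HTA's public API; the trace directory and the exact edge-attribute layout of the returned graph are assumptions for illustration.

```python
# Minimal sketch: obtain a critical-path graph from HTA, whose edges
# include synchronization dependencies among other dependency types.
from hta.trace_analysis import TraceAnalysis

# Assumed: a directory holding per-rank Kineto traces.
analyzer = TraceAnalysis(trace_dir="Resnet-50")
cp_graph, success = analyzer.critical_path_analysis(
    rank=0, annotation="ProfilerStep", instance_id=0
)
if success:
    # CPGraph is a networkx DiGraph; synchronization dependencies appear
    # as one of its edge types alongside launch and data dependencies.
    print(cp_graph.number_of_edges(), "critical-path edges")
```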

Please note that,

Test Plan

Download and install HTA (versions newer than commit 5c7898abbc52a1d4051ef6c93365477feb6c08a8 cause an error; a fix is WIP):

```bash
git clone https://github.com/facebookresearch/HolisticTraceAnalysis.git
cd HolisticTraceAnalysis
git checkout 5c7898abbc52a1d4051ef6c93365477feb6c08a8
git submodule update --init
pip install -r requirements.txt
pip install -e .
```
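
To confirm the pinned revision installed correctly, a quick import check helps. The module path below is the one cited in the summary; the check itself is just an assumption about what a healthy install looks like.

```python
# Sanity check: the pinned HTA commit should expose the critical path
# analysis module cited above; an ImportError means the install is broken.
import hta.analyzers.critical_path_analysis  # noqa: F401

print("HTA critical path analysis module imported OK")
```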

Run chakra_trace_link and chakra_converter on the Resnet-50 traces:

```bash
# Resnet-50
# Trace Link
chakra_trace_link --pytorch-et-file Resnet-50/eg.rank_0.pt.trace.json --kineto-file Resnet-50/kineto.rank_0_* --rank 0 --output-file Resnet-50/rank_0.json
chakra_trace_link --pytorch-et-file Resnet-50/eg.rank_1.pt.trace.json --kineto-file Resnet-50/kineto.rank_1_* --rank 1 --output-file Resnet-50/rank_1.json

# Converter
chakra_converter --input_filename Resnet-50/rank_0.json --output_filename Resnet-50/rank.0.et --input_type PyTorch
chakra_converter --input_filename Resnet-50/rank_1.json --output_filename Resnet-50/rank.1.et --input_type PyTorch
```
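
To spot-check a converted trace, the dependencies recorded in the .et output can be counted with Chakra's bundled protobuf helpers. Import paths and field names vary across Chakra versions; the ones below are assumptions to be adjusted to your checkout.

```python
# Sketch: count dependency edges recorded in a converted Chakra trace.
# Assumed import paths and schema fields; adjust to your Chakra version.
from chakra.et_def.et_def_pb2 import GlobalMetadata, Node
from chakra.third_party.utils.protolib import decodeMessage, openFileRd

et = openFileRd("Resnet-50/rank.0.et")

meta = GlobalMetadata()
decodeMessage(et, meta)  # the stream begins with global metadata

node = Node()
total_nodes = total_deps = 0
while decodeMessage(et, node):
    total_nodes += 1
    total_deps += len(node.data_deps)
et.close()

print(f"{total_nodes} nodes, {total_deps} data dependencies")
```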

Run chakra_trace_link and chakra_converter on the Llama2 traces:

```bash
# Trace Link
chakra_trace_link --pytorch-et-file llama2/eg.rank_0.pt.trace.json --kineto-file llama2/kineto.rank_0_* --rank 0 --output-file llama2/rank_0.json
chakra_trace_link --pytorch-et-file llama2/eg.rank_1.pt.trace.json --kineto-file llama2/kineto.rank_1_* --rank 1 --output-file llama2/rank_1.json

# Converter
chakra_converter --input_filename llama2/rank_0.json --output_filename llama2/rank.0.et --input_type PyTorch
chakra_converter --input_filename llama2/rank_1.json --output_filename llama2/rank.1.et --input_type PyTorch
```

Here are the traces I used:

Resnet-50.zip
llama2.zip

Test Result on ASTRA-Sim

The test results show that supporting synchronization dependencies with these traces does not change the results in ASTRA-Sim.
There are a few possible reasons: i) the Chakra converter already enforces very strict dependencies; ii) these traces contain few synchronization dependencies; iii) dependencies within the same GPU group (in trace_link) are excluded to avoid circular dependencies, since those nodes share the same PyTorch node ID (see the sketch below).
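
As an aside on reason iii), the exclusion amounts to a simple endpoint check: a sync edge between nodes that trace_link mapped to the same PyTorch node ID would point a merged node at itself. The helper below is hypothetical, not the converter's actual code.

```python
# Hypothetical illustration of reason iii): skip synchronization edges
# whose endpoints resolve to the same PyTorch node ID, because such an
# edge would create a cycle between nodes trace_link merged together.
def keep_sync_edge(src_pytorch_id: int, dst_pytorch_id: int) -> bool:
    return src_pytorch_id != dst_pytorch_id

assert keep_sync_edge(10, 11)      # cross-node sync dependency: keep
assert not keep_sync_edge(42, 42)  # same merged node: drop to avoid a cycle
```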

Resnet-50

```
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
sys[0] finished, 581291000 cycles
sys[1] finished, 585752000 cycles
```

Llama2

```
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
sys[1] finished, 361330259 cycles
sys[0] finished, 375773000 cycles
```

@JoongunPark requested a review from a team as a code owner on May 10, 2024 at 22:10.

github-actions bot commented May 10, 2024

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
