feat: Dagster Data pipeline #798

Draft · wants to merge 18 commits into main

Conversation

yan91083 (Contributor):

There are 5 assets (sketched below):

  • model_predict: choose a model and language, and run prediction for all three files.
  • matching: after prediction, choose a model and language to analyze how the predictions match the ground truth.
  • tabby_eval_result: generate a report (CSV file) for all models and languages.
  • tabby_dataset: read the CSV into a dataframe.
  • tabby_jupyter: a Jupyter notebook to show the result.
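
A minimal sketch of how these assets might chain together in Dagster (bodies omitted; wiring the dependencies by parameter name is an assumption about the actual implementation):

from dagster import asset

@asset
def model_predict():
    # Run prediction for all three files with the chosen model & language.
    ...

@asset
def matching(model_predict):
    # Compare predictions against the ground truth.
    ...

@asset
def tabby_eval_result(matching):
    # Generate the CSV report for all models and languages.
    ...

@asset
def tabby_dataset(tabby_eval_result):
    # Read the CSV into a dataframe.
    ...

@asset
def tabby_jupyter(tabby_dataset):
    # Show the results in a Jupyter notebook.
    ...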

wsxiaoys (Member) left a comment:

First round of feedback, focusing on the prediction part.

"pandas"
)
.copy_local_file(local_path="/tmp/tabby_model_id", remote_path="/tmp/tabby_model_id")
.run_function(download_model)
wsxiaoys (Member):

Use https://modal.com/docs/reference/modal.Image#env to pass MODEL_ID as an environment variable; in download_model you can then read the model id with os.environ.get("MODEL_ID").

wsxiaoys (Member):

.env({"MODEL_ID": os.environ.get("MODEL_ID")})

)
.dockerfile_commands("ENTRYPOINT []")
.pip_install(
"git+https://github.com/TabbyML/tabby.git#egg=tabby-python-client&subdirectory=experimental/eval/tabby-python-client",
wsxiaoys (Member):

Embed this directory inside tabby-data-pipeline, use https://modal.com/docs/reference/modal.Image#copy_local_dir to copy it into the image, and run pip install on the copy.
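
For example (a sketch; the local and remote paths here are assumptions):

.copy_local_dir("./tabby-python-client", "/src/tabby-python-client")
.run_commands("pip install /src/tabby-python-client")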


my_env = os.environ.copy()
my_env["TABBY_DISABLE_USAGE_COLLECTION"] = "1"
MODEL_ID = os.popen("cat /tmp/tabby_model_id").read().strip()
wsxiaoys (Member):

ditto

from tabby_python_client.api.v1 import health

resp = await health.asyncio(client=self.client)
return resp.to_dict()
wsxiaoys (Member):

just return resp?

return resp.to_dict()

@method()
async def complete(self, language, crossfile_context, index, row):
wsxiaoys (Member) on Nov 15, 2023:

Add a type annotation for every argument. For row, define a named tuple: https://docs.python.org/3/library/typing.html#typing.NamedTuple
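
Something like the following (a sketch; the column names and types here are hypothetical, not taken from the actual dataframe):

from typing import NamedTuple, Optional

class CompletionRow(NamedTuple):
    # Hypothetical input/output columns; use the dataframe's real ones.
    prompt: str
    groundtruth: str
    prediction: Optional[str] = None

@method()
async def complete(self, language: str, crossfile_context: str, index: int, row: CompletionRow):
    ...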

yan91083 (Contributor, Author):

Do you want to list all column names of the dataframe in the NamedTuple?

wsxiaoys (Member):

Just the columns used for input / output.

yan91083 (Contributor, Author):

What if there's a dynamic column? For example, "prediction" doesn't exist the first time you run the file; it gets added to the file afterward. On the next run, we need to pass it through the row.

df = pd.DataFrame(objs)

outputs = await asyncio.gather(*[model.complete.remote.aio(language, crossfile_context, index, row) for index, row in df.iterrows()])
wsxiaoys (Member):

Is it still necessary to run this in chunks?



@stub.local_entrypoint()
async def main(language, file):
wsxiaoys (Member):

type hints
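
i.e. something like (a sketch; the parameter types are assumed):

@stub.local_entrypoint()
async def main(language: str, file: str):
    ...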

wsxiaoys (Member):

Put the prediction code under the modal directory.

LAUNCH_FLAGS = ["serve", "--model", MODEL_ID, "--port", "8000", "--device", "cuda"]
self.launcher = subprocess.Popen(["/opt/tabby/bin/tabby"] + LAUNCH_FLAGS, env=my_env)
self.client = Client("http://127.0.0.1:8000", timeout=240)

# Poll until webserver at 127.0.0.1:8000 accepts connections before running inputs.
def webserver_ready():
    try:
-        socket.create_connection(("127.0.0.1", 8000), timeout=1).close()
+        socket.create_connection(("127.0.0.1", 8000), timeout=30).close()
wsxiaoys (Member):

Suggested change:

-        socket.create_connection(("127.0.0.1", 8000), timeout=30).close()
+        socket.create_connection(("127.0.0.1", 8000), timeout=1).close()

Line 91 already contains retry logic; there's no need to increase the timeout.


@@ -111,15 +115,10 @@ async def complete(self, language, crossfile_context, index, row):
from tabby_python_client import errors
import pandas as pd

if 'prediction' in row and not pd.isnull(row['prediction']):
    # if prediction exists, just skip
wsxiaoys (Member):

If the prediction exists, you can simply skip calling the complete function at the call site.
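
For instance, filtering at the dispatch site (a sketch built on the asyncio.gather call shown earlier):

# Only dispatch rows that don't already have a prediction.
pending = [(index, row) for index, row in df.iterrows()
           if 'prediction' not in row or pd.isnull(row['prediction'])]
outputs = await asyncio.gather(
    *[model.complete.remote.aio(language, crossfile_context, index, row)
      for index, row in pending])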


for file in ['line_completion.jsonl', 'line_completion_rg1_bm25.jsonl', 'line_completion_oracle_bm25.jsonl']:
wsxiaoys (Member):

Please extract a function for this, e.g. read_pandas_frame.

yan91083 (Contributor, Author):

I'm not sure what this means. Extract a function for the for loop, or a function for reading all three files?

wsxiaoys (Member):

Either interface is fine; it's just good to split the main function into smaller chunks for better readability.
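
For example (a sketch; read_pandas_frame is the name suggested above, and its body is an assumption):

import pandas as pd

FILES = ['line_completion.jsonl', 'line_completion_rg1_bm25.jsonl', 'line_completion_oracle_bm25.jsonl']

def read_pandas_frame(file: str) -> pd.DataFrame:
    # The evaluation files are newline-delimited JSON.
    return pd.read_json(file, lines=True)

for file in FILES:
    df = read_pandas_frame(file)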

yan91083 changed the title from "Dagster Data pipeline" to "feat: Dagster Data pipeline" on Nov 25, 2023
wsxiaoys (Member):

This directory should be inside .gitignore.


context.add_output_metadata(metadata={"model_id": MetadataValue.md(model_id)})

files = 'line_completion.jsonl, line_completion_rg1_bm25.jsonl, line_completion_oracle_bm25.jsonl'
wsxiaoys (Member):

Where are these files? Should they be added as assets?



model = model_id.split("/")[-1]
for file in ["line_completion.jsonl", "line_completion_rg1_bm25.jsonl", "line_completion_oracle_bm25.jsonl"]:
wsxiaoys (Member):

Shouldn't each file itself be an asset?
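
i.e. one asset per evaluation file, something like (a sketch; asset names derived from the file names, bodies omitted):

from dagster import asset

@asset
def line_completion():
    ...

@asset
def line_completion_rg1_bm25():
    ...

@asset
def line_completion_oracle_bm25():
    ...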

Int,
file_relative_path
)
from . import analyze, create_csv
wsxiaoys (Member):

There's no need to extract utility functions. You could just define each asset as a Python function and organize them in individual Python files.

yan91083 (Contributor, Author):

They are individual Python files, but I have to import them and call them from the assets.

wsxiaoys (Member):

They don't have to be. They can be defined as assets directly.
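
i.e., instead of importing analyze and calling it from a wrapper, the logic can be the asset itself (a sketch; the body is assumed):

from dagster import asset

@asset
def matching(context):
    # The analysis logic lives here directly, rather than in a separate
    # utility module that the asset merely calls into.
    ...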

@@ -0,0 +1,2 @@
tmp*
tabby_data_pipeline.egg-info
wsxiaoys (Member):

You still need to remove the directory from this commit.

wsxiaoys (Member):

Don't remove this file.

wsxiaoys (Member):

This should be in .gitignore as well.
