Embedding factory script looping through each MGRS tile #125
base: main
Conversation
Jupyter notebook script to generate GeoParquet embedding files on a per-MGRS-tile basis. The script first generates an mgrs_world.txt file with a list of MGRS tile codes like 12ABC. A for-loop then goes through each MGRS tile, with the model running predictions to generate GeoParquet files that are uploaded to s3. About 947,019 rows of embeddings were generated from the clay-small-70MT-1100T-10E.ckpt model checkpoint in Dec 2023.
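In pseudocode, the overall flow looks something like the sketch below. The function name `generate_embeddings`, the output paths, and the s3 bucket are placeholders for illustration, not the notebook's actual code:

```python
# Minimal sketch of the notebook's overall flow. Function names, file
# paths and the s3 bucket below are illustrative assumptions only.
import os
import subprocess


def generate_embeddings(mgrs_code: str) -> str:
    """Placeholder for the notebook's prediction step.

    Runs the Clay model over one MGRS tile and writes a GeoParquet file,
    returning the path of the written .gpq file.
    """
    outpath = f"data/embeddings/{mgrs_code}_v01.gpq"
    # ... model prediction and GeoParquet writing would happen here ...
    return outpath


# 1. Read the list of MGRS tile codes (e.g. "12ABC"), one per line
with open("mgrs_world.txt") as f:
    mgrs_codes = [line.strip() for line in f if line.strip()]

# 2. Loop through each MGRS tile, generate embeddings, upload to s3
for mgrs_code in mgrs_codes:
    geoparquet_file = generate_embeddings(mgrs_code)
    if os.path.exists(geoparquet_file):  # skip upload for the stub above
        subprocess.run(
            ["aws", "s3", "cp", geoparquet_file, "s3://example-bucket/embeddings/"],
            check=True,
        )
```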
"# !aws s3 cp s3://clay-model-ckpt/v0/clay-small-70MT-1100T-10E.ckpt checkpoints/\n", | ||
"trainer = L.Trainer(precision=\"bf16-mixed\", logger=False)\n", | ||
"model: L.LightningModule = CLAYModule.load_from_checkpoint(\n", | ||
" checkpoint_path=\"checkpoints/clay-small-70MT-1100T-10E.ckpt\"\n", |
Should be possible to load the checkpoint directly from HuggingFace now, instead of manually downloading from the s3 bucket, as mentioned at #116 (comment)
"# !aws s3 cp s3://clay-model-ckpt/v0/clay-small-70MT-1100T-10E.ckpt checkpoints/\n", | |
"trainer = L.Trainer(precision=\"bf16-mixed\", logger=False)\n", | |
"model: L.LightningModule = CLAYModule.load_from_checkpoint(\n", | |
" checkpoint_path=\"checkpoints/clay-small-70MT-1100T-10E.ckpt\"\n", | |
"trainer = L.Trainer(precision=\"bf16-mixed\", logger=False)\n", | |
"model: L.LightningModule = CLAYModule.load_from_checkpoint(\n", | |
" checkpoint_path=\"https://huggingface.co/made-with-clay/Clay/resolve/main/Clay_v0.1_epoch-24_val-loss-0.46.ckpt\"\n", |
Might want to move this into the scripts/ folder, and will need to update the paths below accordingly.
"#!mamba install triton\n", | ||
"# model.model.encoder = torch.compile(model=model.model.encoder)" |
Can probably remove this torch.compile line. Was trying to speed up the model by compiling it, but some layers didn't work with compilation.
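If compiling is revisited later, one possible workaround (untested here, just an assumption) is to let TorchDynamo fall back to eager mode on the layers that fail to compile:

```python
# Sketch only: tell TorchDynamo to fall back to eager execution on graphs
# it can't compile, instead of raising. Assumes PyTorch 2.x and that
# `model` is the already-loaded CLAYModule from the cells above.
import torch
import torch._dynamo

torch._dynamo.config.suppress_errors = True  # eager fallback on failing layers
model.model.encoder = torch.compile(model=model.model.encoder)
```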
"import os\n", | ||
"import warnings\n", | ||
"\n", | ||
"import duckdb\n", |
DuckDB is optional and can be removed, but it's nice to get a quick count of all the rows across the GeoParquet files (see last cell).
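For reference, that quick row count can be done with a query along these lines (the data/embeddings/*.gpq glob is an assumed output location, not necessarily what the notebook uses):

```python
# Quick sanity check: count embedding rows across all GeoParquet files.
# The "data/embeddings/*.gpq" glob is an assumed output location.
import duckdb

total = duckdb.sql(
    "SELECT COUNT(*) FROM read_parquet('data/embeddings/*.gpq')"
).fetchone()[0]
print(f"Total embedding rows: {total}")
```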
Steps:

1. Generate an mgrs_world.txt file listing the MGRS tile codes to process.
2. Loop through each MGRS tile, running the model to predict embeddings and write a GeoParquet file per tile.
3. Upload the GeoParquet files to s3.

Notes:

- Run on a g5.4xlarge EC2 instance with 1 NVIDIA A10G GPU, which allows for bfloat16 dtype calculations.

Closes #120