
Add support for shortening dimensions for AOAI embeddings #502

Open

tomconte wants to merge 9 commits into development

Conversation

@tomconte (Member) commented Apr 22, 2024

This is a proposal to add support for shortening dimensions when using the newer Azure OpenAI embedding models, e.g. text-embedding-3-large.

Also includes a fix for the AML experiment name in eval.py (it was using the index prefix instead of the experiment name).

Fixes #503

Note: I believe CI fails because of #452, but I did run the CI steps manually in the dev container.
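For reference, shortening works by passing the optional dimensions parameter to the embeddings API. Below is a minimal sketch using the openai Python SDK against an Azure OpenAI deployment; the endpoint, key, API version and deployment name are placeholders, not values from this repository.

    # Minimal sketch, not code from this PR: requesting shortened embeddings
    # from Azure OpenAI via the `dimensions` parameter of the openai SDK.
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://my-resource.openai.azure.com",  # placeholder
        api_key="<key>",                                         # placeholder
        api_version="2024-02-01",  # illustrative; use a version that supports `dimensions`
    )

    response = client.embeddings.create(
        model="text-embedding-3-large",  # deployment name (placeholder)
        input="hello world",
        dimensions=256,  # shorten the native 3072-dim vector down to 256 dimensions
    )

    print(len(response.data[0].embedding))  # 256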

@@ -83,5 +84,5 @@ def from_index_name(cls, index_name: str, config: Any) -> "IndexConfig":
     generate_title=bool(int(cls.__get_index_value(values[7]))),
     generate_summary=bool(int(cls.__get_index_value(values[8]))),
     override_content_with_summary=bool(int(cls.__get_index_value(values[9]))),
-    embedding_model=config._find_embedding_model_by_name(values[10].strip()),
+    embedding_model=config._find_embedding_model_by_name(values[11].strip()),
Member

Why isn't the dimension being returned as part of the IndexConfig?
Does this break "round tripping"?

Member Author

The dimensions are already part of the embedding_model; they are read from the config file. I only added them to the index name so that we can have two indices that use the same model but a different number of dimensions, e.g.

surface_p-0_cs-1000_o-200_efc-400_efs-400_sp-0_t-0_s-0_oc-0_dim-3072_text-3-large

and

surface_p-0_cs-1000_o-200_efc-400_efs-400_sp-0_t-0_s-0_oc-0_dim-256_text-3-large
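For illustration, here is a small sketch (a hypothetical helper, not the repository's actual from_index_name logic) of how the dim-<n> segment can be read back out of such a name:

    # Hypothetical sketch, not the repository's actual parser: recover the
    # dimension count from an index name containing a "dim-<n>" segment.
    def parse_dimensions(index_name: str) -> int | None:
        for part in index_name.split("_"):
            if part.startswith("dim-"):
                return int(part.removeprefix("dim-"))
        return None

    name = "surface_p-0_cs-1000_o-200_efc-400_efs-400_sp-0_t-0_s-0_oc-0_dim-3072_text-3-large"
    print(parse_dimensions(name))  # 3072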

@@ -774,7 +774,7 @@ def evaluate_prompts(
         os.path.join(config.EVAL_DATA_LOCATION, f"sum_{name_suffix}.csv")
     )
     draw_hist_df(sum_df, run_id, client)
-    generate_metrics(config.INDEX_NAME_PREFIX, run_id, client)
+    generate_metrics(config.EXPERIMENT_NAME, run_id, client)
Member

What's the reason for this change?
Is it related to the shortened dimensions, or something else?

Member

Is there a bug related to this change?

@tomconte (Member, Author) commented Apr 25, 2024

This is for an unrelated issue I had when running the process with my config.

The experiment name is set like this in evaluation.py:

    mlflow.set_experiment(config.EXPERIMENT_NAME)

Later, generate_metrics() in eval.py complained that the experiment did not exist, I believe because INDEX_NAME_PREFIX was being passed instead of EXPERIMENT_NAME.
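A minimal sketch of the failure mode, assuming the standard mlflow tracking client (the names mirror the CI run shown later in this thread: "baseline" is the experiment name, "ci" the index prefix):

    # Minimal sketch of the failure mode, assuming the mlflow tracking client.
    # get_experiment_by_name() returns None when no experiment has that name,
    # so looking it up by the index prefix instead of the experiment name fails.
    import mlflow
    from mlflow.tracking import MlflowClient

    mlflow.set_experiment("baseline")  # creates/selects the experiment by name
    client = MlflowClient()

    print(client.get_experiment_by_name("baseline"))  # <Experiment ...>
    print(client.get_experiment_by_name("ci"))        # None -> dict(None) raises TypeError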

Member Author

> Is there a bug related to this change?

I can't see one, no. I can remove this commit until the issue is confirmed.

Member

It's probably fine. I wanted to check that fixing this wouldn't cause its own issues (e.g. that the change in name wouldn't have a back-compat issue for someone using it today). I also tend to think that it's better to avoid too many "while I'm here" bug fixes, unless they're already logged somewhere.

That said, if @ritesh-modi is happy with this, I'm also happy. :)

@tomconte (Member, Author) commented Apr 26, 2024

To clarify, here is the output of the final step, using the CI configuration, on the current development branch:

vscode ➜ /workspaces/rag-experiment-accelerator (development) $ python 04_evaluation.py --data_dir='data-ci' --config_path=.github/workflows/config.json
2024/04/26 09:19:22 INFO mlflow.tracking.fluent: Experiment with name 'baseline' does not exist. Creating a new experiment.
2024-04-26 09:19:24,632 - INFO - rag_experiment_accelerator.run.evaluation - Evaluating Index: ci_p-0_cs-1000_o-200_efc-400_efs-400_sp-0_t-1_s-1_oc-0_all-mpnet-base-v2
2024-04-26 09:19:24,643 - INFO - rag_experiment_accelerator.artifact.handlers.artifact_handler - Loading artifacts from path: .github/workflows/artifacts/query_data/eval_output_ci_p-0_cs-1000_o-200_efc-400_efs-400_sp-0_t-1_s-1_oc-0_all-mpnet-base-v2_baseline_baseline_job.jsonl
2024-04-26 09:19:24,645 - INFO - rag_experiment_accelerator.artifact.handlers.artifact_handler - Loaded 6 artifacts from path: .github/workflows/artifacts/query_data/eval_output_ci_p-0_cs-1000_o-200_efc-400_efs-400_sp-0_t-1_s-1_oc-0_all-mpnet-base-v2_baseline_baseline_job.jsonl
Traceback (most recent call last):
  File "/workspaces/rag-experiment-accelerator/04_evaluation.py", line 35, in <module>
    run(
  File "/workspaces/rag-experiment-accelerator/rag_experiment_accelerator/run/evaluation.py", line 36, in run
    eval.evaluate_prompts(
  File "/workspaces/rag-experiment-accelerator/rag_experiment_accelerator/evaluation/eval.py", line 777, in evaluate_prompts
    generate_metrics(config.INDEX_NAME_PREFIX, run_id, client)
  File "/workspaces/rag-experiment-accelerator/rag_experiment_accelerator/evaluation/eval.py", line 377, in generate_metrics
    experiment = dict(client.get_experiment_by_name(experiment_name))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not iterable
