[Bug] Sampling fails with ValueError: [E1041] Expected a string, Doc, #487

Open
1 task done
guybartal opened this issue Apr 17, 2024 · 0 comments · May be fixed by #538
Labels: bug (Something isn't working), Must have, Sprint 2 May 2 to May 22 2024


guybartal commented Apr 17, 2024

Steps to reproduce:

Set the following sampling settings in config.json:

    "sampling": {
        "sample_data": true,
        "sample_percentage": 5,
        "optimum_k": "auto",
        "min_cluster": 2,
        "max_cluster": 30
    },

Then run make all in the terminal. After a while, the run fails with the following error:

Traceback (most recent call last):
  File "/workspaces/rag-experiment-accelerator/01_index.py", line 22, in <module>
    index_dict = run(environment, config, index_config, file_paths)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/rag-experiment-accelerator/rag_experiment_accelerator/run/index.py", line 68, in run
    docs = cluster(docs, config)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/rag-experiment-accelerator/rag_experiment_accelerator/sampling/clustering.py", line 244, in cluster
    df["processed_text"] = df["text"].progress_apply(spacy_tokenizer)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/tqdm/std.py", line 917, in inner
    return getattr(df, df_function)(wrapper, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/pandas/core/series.py", line 4915, in apply
    ).apply()
      ^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/pandas/core/apply.py", line 1427, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/pandas/core/apply.py", line 1507, in apply_standard
    mapped = obj._map_values(
             ^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/pandas/core/base.py", line 921, in _map_values
    return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/pandas/core/algorithms.py", line 1743, in map_array
    return lib.map_infer(values, mapper, convert=convert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib.pyx", line 2972, in pandas._libs.lib.map_infer
  File "/home/vscode/.local/lib/python3.11/site-packages/tqdm/std.py", line 912, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/rag-experiment-accelerator/rag_experiment_accelerator/sampling/clustering.py", line 41, in spacy_tokenizer
    mytokens = parser(sentence)
               ^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/spacy/language.py", line 1037, in __call__
    doc = self._ensure_doc(text)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/spacy/language.py", line 1131, in _ensure_doc
    raise ValueError(Errors.E1041.format(type=type(doc_like)))
ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'dict'>
make: *** [Makefile:30: index] Error 1
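
One possible reading of the traceback: at least some values in df["text"] are dicts rather than strings, so spaCy rejects them when spacy_tokenizer calls parser(sentence). The sketch below shows one way to guard against that by normalizing each value to a string before tokenizing. It is not the fix from the linked pull request. The helper names to_text and safe_tokenize and the "content" key are hypothetical, since the issue does not show the actual shape of the dicts.

```python
from typing import Any, Callable


def to_text(value: Any, text_key: str = "content") -> str:
    """Coerce a value into a plain string that a tokenizer can accept.

    ASSUMPTION: when the value is a dict, the text lives under `text_key`.
    "content" is a hypothetical key name; the issue does not say which
    keys the dicts in df["text"] actually contain.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        # Decode defensively; undecodable bytes become replacement characters.
        return value.decode("utf-8", errors="replace")
    if isinstance(value, dict):
        inner = value.get(text_key)
        if isinstance(inner, str):
            return inner
        raise TypeError(
            f"dict has no string under key {text_key!r}; keys={list(value)}"
        )
    raise TypeError(f"unsupported input type: {type(value).__name__}")


def safe_tokenize(
    value: Any,
    tokenizer: Callable[[str], Any],
    text_key: str = "content",
) -> Any:
    """Normalize `value` to a string, then hand it to `tokenizer`."""
    return tokenizer(to_text(value, text_key))
```

Under these assumptions, the failing line in clustering.py could apply lambda v: safe_tokenize(v, spacy_tokenizer) instead of passing spacy_tokenizer directly. Raising TypeError on unexpected shapes, rather than silently calling str() on the value, keeps a malformed document from being clustered on its repr.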

Tasks

  1. breaking change
guybartal changed the title from "Sampling fails with ValueError: [E1041] Expected a string, Doc," to "[Bug] Sampling fails with ValueError: [E1041] Expected a string, Doc," on Apr 17, 2024
guybartal added the bug (Something isn't working) and Must have labels on Apr 17, 2024
shanepeckham self-assigned this on Apr 24, 2024
Lep06fg added the Sprint 2 May 2 to May 22 2024 label on May 2, 2024
shanepeckham linked a pull request on May 6, 2024 that will close this issue
shanepeckham added a commit that referenced this issue on May 17, 2024
3 participants