
feat: Llm operator #1313

Draft · gaurav274 wants to merge 8 commits into staging
Conversation

gaurav274 (Member) commented Oct 23, 2023

  • Add support for LLMs in SELECT as an operator: `SELECT DummyLLM({prompt}, data) FROM fruitTable;`
  • Provide LLM token usage and cost for the above queries.

```python
def exec(self, *args, **kwargs) -> Iterator[Batch]:
    child_executor = self.children[0]
    for batch in child_executor.exec(**kwargs):
        llm_result = self.llm_expr.evaluate(batch)
```
Collaborator:
Is the batch optimization done in the LLMExecutor, and will it be added in future PRs?

Member Author:

Yes
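
For orientation, a hedged sketch of how such an executor loop might complete — the merge-and-yield step is an assumption, since the diff excerpt above stops at the `evaluate` call, and `Batch.merge_column_wise` is assumed to combine the columns of two batches:

```python
def exec(self, *args, **kwargs) -> Iterator[Batch]:
    child_executor = self.children[0]
    for batch in child_executor.exec(**kwargs):
        # Evaluate the LLM expression against each incoming batch.
        llm_result = self.llm_expr.evaluate(batch)
        # Assumption: merge the LLM output columns into the input batch
        # and stream the combined result to the parent operator.
        yield Batch.merge_column_wise([batch, llm_result])
```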

```python
llm_exprs = []
for expr in exprs:
    if is_llm_expression(expr):
        llm_exprs.append(expr.copy())
```
Collaborator:

Add a note here: chained function calls will not work, e.g. `STRTODATAFRAME(LLM('EXTRACT SOME COLUMN', data))`.

Member Author:

Yes, I'll add that in the next PR.
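
For reference, one plausible shape for the `is_llm_expression` helper used above, assuming name-based detection against the `LLM_FUNCTIONS` constant added later in this diff (the `FunctionExpression` type check and import path are also assumptions):

```python
from evadb.expression.function_expression import FunctionExpression  # assumed path

LLM_FUNCTIONS = ["chatgpt", "completion"]


def is_llm_expression(expr) -> bool:
    # Assumption: an expression counts as an LLM call when it is a
    # function expression whose registered name is in LLM_FUNCTIONS.
    return (
        isinstance(expr, FunctionExpression)
        and expr.name.lower() in LLM_FUNCTIONS
    )
```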

```python
new_root.append_child(plan_root)
plan_root = new_root
self._plan = plan_root
```

Collaborator:

IMO, the generic way is to do this in the optimizer with apply-and-merge. What will the plan look like if we have `SELECT id, LLM(...) FROM some_table;`?

Member Author:

It will be `Project(id, llm.response) -> LLMExec() -> Get`.
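
Rendered as an operator tree (the same plan, drawn out):

```
Project(id, llm.response)
 └── LLMExec()
      └── Get(some_table)
```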

```python
def generate(self, prompts: List[str]) -> List[str]:
    import openai

    @retry(tries=6, delay=20)
```
Contributor:

It might be a good time to also add logging to the retry logic:
https://tenacity.readthedocs.io/en/latest/#before-and-after-retry-and-logging

This will log the retry attempts in our logger, so the user knows when rate-limiting errors occur. I found this helpful when waiting through long backoffs. The downside is that we must add the tenacity library to the requirements.
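
For illustration, a minimal sketch of that suggestion using tenacity's logging hooks (the `generate_one` function, retry parameters, and log levels are placeholders, not this PR's code):

```python
import logging

from tenacity import after_log, before_log, retry, stop_after_attempt, wait_fixed

logger = logging.getLogger(__name__)


# Roughly mirrors @retry(tries=6, delay=20) from the diff above, but
# logs each attempt so rate-limit waits are visible to the user.
@retry(
    stop=stop_after_attempt(6),
    wait=wait_fixed(20),
    before=before_log(logger, logging.DEBUG),
    after=after_log(logger, logging.WARNING),
)
def generate_one(prompt: str) -> str:
    # Placeholder for the actual OpenAI completion call.
    raise NotImplementedError
```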

```python
try_to_import_tiktoken()
import tiktoken

encoding = tiktoken.encoding_for_model(self.model_name)
```
Contributor:

If we already have the response, we can directly compute the cost using the `response["usage"]` field.
Tiktoken would be good for estimating the cost before executing the query (helpful for query optimization). Maybe we can have two functions, `estimate_cost` and `get_cost`. Estimating the cost is not simple, though, because we do not know the completion tokens apriori; we would then need a heuristic for the estimated completion tokens.

Member Author:

Yes, a valid concern. I was also thinking about it.
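
For illustration, a hedged sketch of the two functions proposed above (the per-1K-token prices and the 256-token completion heuristic are placeholder assumptions, not actual model pricing):

```python
import tiktoken

# Placeholder per-1K-token prices; real values depend on the model.
PROMPT_PRICE_PER_1K = 0.0015
COMPLETION_PRICE_PER_1K = 0.002


def get_cost(response: dict) -> float:
    # Exact cost, computed from the token usage the API reports back.
    usage = response["usage"]
    return (
        usage["prompt_tokens"] / 1000 * PROMPT_PRICE_PER_1K
        + usage["completion_tokens"] / 1000 * COMPLETION_PRICE_PER_1K
    )


def estimate_cost(prompt: str, model_name: str, est_completion_tokens: int = 256) -> float:
    # Pre-execution estimate: prompt tokens are exact via tiktoken, but
    # completion tokens are unknown apriori, hence the heuristic default.
    encoding = tiktoken.encoding_for_model(model_name)
    prompt_tokens = len(encoding.encode(prompt))
    return (
        prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
        + est_completion_tokens / 1000 * COMPLETION_PRICE_PER_1K
    )
```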

```diff
@@ -21,3 +21,4 @@
 IFRAMES = "IFRAMES"
 AUDIORATE = "AUDIORATE"
 DEFAULT_FUNCTION_EXPRESSION_COST = 100
+LLM_FUNCTIONS = ["chatgpt", "completion"]
```
Contributor:

Are we not adding the LLM operator to the parser? So the allowed LLM names are restricted to the ones listed here?
