Sweep: add docstrings to vector_db.py #3142

Open · 2 tasks done
wwzeng1 opened this issue Feb 23, 2024 · 1 comment · May be fixed by #3145

Labels
sweep Assigns Sweep to an issue or pull request.

Comments

wwzeng1 (Contributor) commented Feb 23, 2024

Checklist
  • Create sweepai/vector_db.py (e02e1d1)
  • Running GitHub Actions for sweepai/vector_db.py

wwzeng1 added the sweep label Feb 23, 2024

sweep-nightly bot commented Feb 23, 2024

🚀 Here's the PR! #3145

See Sweep's progress at the progress dashboard!
💎 Sweep Pro: I'm using GPT-4. You have unlimited GPT-4 tickets. (tracking ID: None)

Tip

I can email you next time I complete a pull request if you set up your email here!


Actions

  • ↻ Restart Sweep

Step 1: 🔎 Searching

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I think are relevant, in decreasing order of relevance. If some file is missing from here, you can mention the path in the ticket description.

import modal
from sweepai.config.server import DOCS_MODAL_INST_NAME

stub = modal.Stub(DOCS_MODAL_INST_NAME)
# doc_url = "https://docs.anthropic.com/claude"
doc_url = "https://modal.com/docs/guide"
update = modal.Function.lookup(DOCS_MODAL_INST_NAME, "daily_update")
search = modal.Function.lookup(DOCS_MODAL_INST_NAME, "search_vector_store")
write = modal.Function.lookup(DOCS_MODAL_INST_NAME, "write_documentation")
# print(write.call(doc_url))
results = search.call(
    doc_url,
    "In get_relevant_snippets parallelize the computation of the query embedding with the vector store. Do this using Modal primitives",
)
# metadatas = results["metadata"]
# docs = results["text"]
# vector_scores = results["score"]
# url_and_docs = [(metadata["url"], doc) for metadata, doc in zip(metadatas, docs)]
# ix = prepare_index_from_docs(url_and_docs)
# docs_to_scores = search_docs("How do I add random particles", ix)
# max_score = max(docs_to_scores.values())
# min_score = min(docs_to_scores.values()) if min(docs_to_scores.values()) < max_score else 0
# max_vector_score = max(vector_scores)
# min_vector_score = min(vector_scores) if min(vector_scores) < max_vector_score else 0
# text_to_final_score = []
# for idx, (url, doc) in enumerate(url_and_docs):
#     lexical_score = docs_to_scores[url] if url in docs_to_scores else 0
#     vector_score = vector_scores[idx]
#     normalized_lexical_score = (lexical_score - (min_score / 2)) / (max_score + min_score)
#     normalized_vector_score = (vector_score - (min_vector_score / 2)) / (max_vector_score + min_vector_score)
#     final_score = normalized_lexical_score * normalized_vector_score
#     text_to_final_score.append((doc, final_score))
# sorted_docs = sorted(text_to_final_score, key=lambda x: x[1], reverse=True)
# sorted_docs = [doc for doc, _ in sorted_docs]
# # get docs until you reach a 20k character count
# final_docs = []
# for doc in sorted_docs:
#     if len("".join(final_docs)) + len(doc) < 20000:
#         final_docs.append(doc)
#     else:
#         break
# new_docs = []
# for doc in docs:
#     if doc not in new_docs:
#         new_docs.append(doc)
# new_docs = new_docs[:min(5, len(new_docs))]
# for doc in new_docs:
#     print(doc + "\n\n\n")
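The commented-out block above sketches a hybrid ranking experiment: each lexical and vector score is normalized against its min/max, and the two are multiplied. As a self-contained illustration using the same normalization formula as those comments (the scores here are made up):

```python
def normalize(score: float, lo: float, hi: float) -> float:
    # Same shifted min/max normalization as the commented-out code above.
    return (score - lo / 2) / (hi + lo)

# Made-up scores for two documents, in the same order for both rankers.
lexical_scores = [3.0, 1.0]
vector_scores = [0.91, 0.42]

min_l, max_l = min(lexical_scores), max(lexical_scores)
min_v, max_v = min(vector_scores), max(vector_scores)

final_scores = [
    normalize(l, min_l, max_l) * normalize(v, min_v, max_v)
    for l, v in zip(lexical_scores, vector_scores)
]
print(final_scores)  # a higher product ranks the document earlier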

```
2:     print_items(a)
3:     print(a) # copied twice by accident!
4:     return a
5:
6: def print_items(a):
7:     for item in a:
8:         print(item)
9:     print(a) # copied twice by accident!
```
It’s possible to rely on a unit test feedback loop to fix this (write the new code, check the tests pass, and then fix anything broken), but it’s a bit too chaotic in practice. We might not have unit tests for the code, or the unit tests might pass but not cover this case (arguably worse!).
We wanted something deterministic, so we used [Rope](https://github.com/python-rope/rope). Rope is a refactoring library for Python that uses the Python AST, allowing for refactors that work 100% of the time.
To use Rope, we can ask GPT-4 to define sections of the code that it wants to extract. These should correspond to the request.
Say we want to extract the code snippet and metadata preparation code out of our `get_vector_database` function.
```markdown {2-3}
Request:
Extract the code that prepares documents, metadatas, and ids into
a new function named `prepare_documents_metadata_ids`.
This function should take `snippets`, `cloned_repo`, and `files_to_scores` as parameters
and return `documents`, `metadatas`, and `ids`.
```
We have Sweep generate this response:
```xml {3,7,12-28}
<contextual_request_analysis>
The user request is to refactor the `get_deeplake_vs_from_repo` function in the `vector_db.py` file by extracting certain sections of the code into separate functions. The sections to be extracted are as follows:
The code that prepares documents, metadatas, and ids. This section starts with the line `documents = []` and ends with the line `ids.append(f"{gh_file_path}:{snippet.start}:{snippet.end}")`.
</contextual_request_analysis>
<new_function_names>
"prepare_documents_metadata_ids"
</new_function_names>
<extractions>
<<<<<<< EXTRACT
documents = []
metadatas = []
ids = []
for snippet in snippets:
    documents.append(snippet.get_snippet(add_ellipsis=False, add_lines=False))
    metadata = {
        "file_path": snippet.file_path[len(cloned_repo.cache_dir) + 1 :],
        "start": snippet.start,
        "end": snippet.end,
        "score": files_to_scores[snippet.file_path],
    }
    metadatas.append(metadata)
    gh_file_path = snippet.file_path[len("repo/") :]
    ids.append(f"{gh_file_path}:{snippet.start}:{snippet.end}")
```
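Under the hood, applying an extraction like this with Rope looks roughly like the following (a minimal sketch; the project path and the marker strings used to compute offsets are illustrative, not Sweep's actual code):

```python
from rope.base.project import Project
from rope.refactor.extract import ExtractMethod

project = Project(".")  # repository root (illustrative)
resource = project.get_resource("sweepai/vector_db.py")
source = resource.read()

# Character offsets of the region to extract, located here by marker strings.
start = source.index("documents = []")
end_marker = 'ids.append(f"{gh_file_path}:{snippet.start}:{snippet.end}")'
end = source.index(end_marker) + len(end_marker)

# Rope replaces the region with a call to the new function and defines it.
extractor = ExtractMethod(project, resource, start, end)
changes = extractor.get_changes("prepare_documents_metadata_ids")
project.do(changes)
```

Because Rope works on the AST, the rewrite is deterministic: the extracted region is replaced by a call site and the new function is defined with the right parameters.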

import modal
from sweepai.config.server import DB_MODAL_INST_NAME

stub = modal.Stub(DB_MODAL_INST_NAME)
get_relevant_snippets = modal.Function.lookup(
    DB_MODAL_INST_NAME, "get_relevant_snippets"
)
dev = 36855882
staging = 40419656
test_query = """Sweep: Fix Uncaught SyntaxError: Unexpected token '&' (at examples:14:21). https://github.com/sweepai/sweep/blob/d9d53a78b4fab18b89e4003268cf6ba50da4f068/docs/theme.config.tsx#L15
Fix Uncaught SyntaxError: Unexpected token '&' (at examples:14:21)
In docs/theme.config.tsx"""
lexical = get_relevant_snippets.call(
    repo_name="sweepai/sweep",
    query=test_query,
    n_results=5,
    installation_id=dev,
    username="wwzeng1",
    lexical=True,
)
vector = get_relevant_snippets.call(
    repo_name="sweepai/sweep",
    query=test_query,
    n_results=5,
    installation_id=dev,
    username="wwzeng1",
    lexical=False,
)
# format vector and lexical titles one by one
print("Lexical Results:")
for result in lexical[:5]:
    print(result)
print("Vector Results:")
for result in vector[:5]:
    print(result)

---
Request: "Complete the implementation of the 'Client' class by adding the remaining methods that mirror the functionality of the Python client found in /vectordb/client.py.".
You previously edited this file. First indicate whether the request was fulfilled, then suggest a set of changes. Respond in the following format:
Thoughts on additional required changes:
1. Thought 1: Summary 1
2. Thought 2: Summary 2
...
Additional proposed changes: (prefer few small changes over one large change)
```
<<<< ORIGINAL
line_before
old_code
line_after
====
line_before
new_code
line_after
>>>> UPDATED
```
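For instance, a filled-in block in this format might look like the following (hypothetical code):

```
<<<< ORIGINAL
class Client:
    def connect(self):
        raise NotImplementedError
====
class Client:
    def connect(self):
        return Connection(self.host, self.port)
>>>> UPDATED
```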

raise_error_schema = {
    "name": "raise_error",
    "parameters": {
        "type": "object",
        "properties": {
            "message": {
                "type": "string",
                "description": "Message for the user describing the error, either indicating that there's an internal error or that you do not have the necessary information to complete the task. Add all potentially relevant details and use markdown for formatting.",
            }
        },
        "required": ["message"],
    },
    "description": "Use this when you absolutely cannot complete the task on your own.",
}

search_and_replace_schema = {
    "name": "search_and_replace",
    "parameters": {
        "type": "object",
        "properties": {
            "analysis_and_identification": {
                "type": "string",
                "description": "Identify and list the minimal changes that need to be made to the file, by listing all locations that should receive these changes and the changes to be made. Be sure to consider all imports that are required to complete the task.",
            },
            "replaces_to_make": {
                "type": "array",
                "description": "Array of sections to modify",
                "items": {
                    "type": "object",
                    "properties": {
                        "section_id": {
                            "type": "string",
                            "description": "The section ID the original code belongs to.",
                        },
                        "old_code": {
                            "type": "string",
                            "description": "The old lines of code that belongs to section with ID section_id. Be sure to add lines before and after to disambiguate the change.",
                        },
                        "new_code": {
                            "type": "string",
                            "description": "The new code to replace the old code.",
                        },
                    },
                    "required": ["section_id", "old_code", "new_code"],
                },
            },
        },
        "required": ["analysis_and_identification", "replaces_to_make"],
    },
    "description": "Make edits to the code file.",
}

keyword_search_schema = {
    "name": "keyword_search",
    "parameters": {
        "type": "object",
        "properties": {
            "justification": {
                "type": "string",
                "description": "Justification for searching the keyword.",
            },
            "keyword": {
                "type": "string",
                "description": "The keyword to search for.",
            },
        },
        "required": ["justification", "keyword"],
    },
    "description": "Searches for all lines in the file containing the keyword.",
}

python_code = '''
import io
import os
import zipfile

import openai
import requests
from loguru import logger

from sweepai.core.gha_extraction import GHAExtractor
from sweepai.events import CheckRunCompleted
from sweepai.handlers.on_comment import on_comment
from sweepai.utils.config.client import SweepConfig, get_gha_enabled
from sweepai.utils.github_utils import get_github_client, get_token

openai.api_key = os.environ.get("OPENAI_API_KEY")

log_message = """GitHub actions yielded the following error.
{error_logs}
This is likely a linting or type-checking issue with the source code but if you are updating the GitHub Actions or versioning, this could be an issue with the GitHub Action yaml files."""

def download_logs(repo_full_name: str, run_id: int, installation_id: int):
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {get_token(installation_id)}",
        "X-GitHub-Api-Version": "2022-11-28"
    }
    response = requests.get(
        f"https://api.github.com/repos/{repo_full_name}/actions/runs/{run_id}/logs",
        headers=headers
    )
    logs_str = ""
    if response.status_code == 200:
        zip_file = zipfile.ZipFile(io.BytesIO(response.content))
        for file in zip_file.namelist():
            if "/" not in file:
                with zip_file.open(file) as f:
                    logs_str += f.read().decode("utf-8")
    else:
        logger.warning(f"Failed to download logs for run id: {run_id}")
    return logs_str

def clean_logs(logs_str: str):
    log_list = logs_str.split("\\n")
    truncated_logs = [log[log.find(" ") + 1:] for log in log_list]
    patterns = [
        # for docker
        "Already exists",
        "Pulling fs layer",
        "Waiting",
        "Download complete",
        "Verifying Checksum",
        "Pull complete",
        # For github
        "remote: Counting objects",
        "remote: Compressing objects:",
        "Receiving objects:",
        "Resolving deltas:"
    ]
    return "\\n".join([log.strip() for log in truncated_logs if not any(pattern in log for pattern in patterns)])

def on_check_suite(request: CheckRunCompleted):
    logger.info(f"Received check run completed event for {request.repository.full_name}")
    g = get_github_client(request.installation.id)
    repo = g.get_repo(request.repository.full_name)
    if not get_gha_enabled(repo):
        logger.info(f"Skipping github action for {request.repository.full_name} because it is not enabled")
        return None
    pr = repo.get_pull(request.check_run.pull_requests[0].number)
    num_pr_commits = len(list(pr.get_commits()))
    if num_pr_commits > 20:
        logger.info(f"Skipping github action for PR with {num_pr_commits} commits")
        return None
    logger.info(f"Running github action for PR with {num_pr_commits} commits")
    logs = download_logs(
        request.repository.full_name,
        request.check_run.run_id,
        request.installation.id
    )
    if not logs:
        return None
    logs = clean_logs(logs)
    extractor = GHAExtractor()
    logger.info(f"Extracting logs from {request.repository.full_name}, logs: {logs}")
    problematic_logs = extractor.gha_extract(logs)
    if problematic_logs.count("\\n") > 15:
        problematic_logs += "\\n\\nThere are a lot of errors. This is likely a larger issue with the PR and not a small linting/type-checking issue."
    comments = list(pr.get_issue_comments())
    if len(comments) >= 2 and problematic_logs == comments[-1].body and comments[-2].body == comments[-1].body:
        comment = pr.as_issue().create_comment(log_message.format(error_logs=problematic_logs) + "\\n\\nI'm getting the same errors 3 times in a row, so I will stop working on fixing this PR.")
        logger.warning("Skipping logs because it is duplicated")
        raise Exception("Duplicate error logs")
    print(problematic_logs)
    comment = pr.as_issue().create_comment(log_message.format(error_logs=problematic_logs))
    on_comment(
        repo_full_name=request.repository.full_name,
        repo_description=request.repository.description,
        comment=problematic_logs,
        pr_path=None,
        pr_line_position=None,
        username=request.sender.login,
        installation_id=request.installation.id,
        pr_number=request.check_run.pull_requests[0].number,
        comment_id=comment.id,
        repo=repo,
    )
    return {"success": True}
'''
js_text = """
import { Document, BaseNode } from "../Node";
import { v4 as uuidv4 } from "uuid";
import { BaseRetriever } from "../Retriever";
import { ServiceContext } from "../ServiceContext";
import { StorageContext } from "../storage/StorageContext";
import { BaseDocumentStore } from "../storage/docStore/types";
import { VectorStore } from "../storage/vectorStore/types";
import { BaseIndexStore } from "../storage/indexStore/types";
import { BaseQueryEngine } from "../QueryEngine";
import { ResponseSynthesizer } from "../ResponseSynthesizer";
/**
* The underlying structure of each index.
*/
export abstract class IndexStruct {
indexId: string;
summary?: string;
constructor(indexId = uuidv4(), summary = undefined) {
this.indexId = indexId;
this.summary = summary;
}
toJson(): Record<string, unknown> {
return {
indexId: this.indexId,
summary: this.summary,
};
}
getSummary(): string {
if (this.summary === undefined) {
throw new Error("summary field of the index dict is not set");
}
return this.summary;
}
}
export enum IndexStructType {
SIMPLE_DICT = "simple_dict",
LIST = "list",
}
export class IndexDict extends IndexStruct {
nodesDict: Record<string, BaseNode> = {};
docStore: Record<string, Document> = {}; // FIXME: this should be implemented in storageContext
type: IndexStructType = IndexStructType.SIMPLE_DICT;
getSummary(): string {
if (this.summary === undefined) {
throw new Error("summary field of the index dict is not set");
}
return this.summary;
}
addNode(node: BaseNode, textId?: string) {
const vectorId = textId ?? node.id_;
this.nodesDict[vectorId] = node;
}
toJson(): Record<string, unknown> {
return {
...super.toJson(),
nodesDict: this.nodesDict,
type: this.type,
};
}
}
export function jsonToIndexStruct(json: any): IndexStruct {
if (json.type === IndexStructType.LIST) {
const indexList = new IndexList(json.indexId, json.summary);
indexList.nodes = json.nodes;
return indexList;
} else if (json.type === IndexStructType.SIMPLE_DICT) {
const indexDict = new IndexDict(json.indexId, json.summary);
indexDict.nodesDict = json.nodesDict;
return indexDict;
} else {
throw new Error(`Unknown index struct type: ${json.type}`);
}
}
export class IndexList extends IndexStruct {
nodes: string[] = [];
type: IndexStructType = IndexStructType.LIST;
addNode(node: BaseNode) {
this.nodes.push(node.id_);
}
toJson(): Record<string, unknown> {
return {
...super.toJson(),
nodes: this.nodes,
type: this.type,
};
}
}
export interface BaseIndexInit<T> {
serviceContext: ServiceContext;
storageContext: StorageContext;
docStore: BaseDocumentStore;
vectorStore?: VectorStore;
indexStore?: BaseIndexStore;
indexStruct: T;
}
/**
* Indexes are the data structure that we store our nodes and embeddings in so
* they can be retrieved for our queries.
*/
export abstract class BaseIndex<T> {
serviceContext: ServiceContext;
storageContext: StorageContext;
docStore: BaseDocumentStore;
vectorStore?: VectorStore;
indexStore?: BaseIndexStore;
indexStruct: T;
constructor(init: BaseIndexInit<T>) {
this.serviceContext = init.serviceContext;
this.storageContext = init.storageContext;
this.docStore = init.docStore;
this.vectorStore = init.vectorStore;
this.indexStore = init.indexStore;
this.indexStruct = init.indexStruct;
}
/**
* Create a new retriever from the index.
* @param retrieverOptions
*/
abstract asRetriever(options?: any): BaseRetriever;
/**
* Create a new query engine from the index. It will also create a retriever
* and response synthezier if they are not provided.
* @param options you can supply your own custom Retriever and ResponseSynthesizer
*/
abstract asQueryEngine(options?: {
retriever?: BaseRetriever;
responseSynthesizer?: ResponseSynthesizer;
}): BaseQueryEngine;
}
export interface VectorIndexOptions {
nodes?: BaseNode[];
indexStruct?: IndexDict;
indexId?: string;
serviceContext?: ServiceContext;
storageContext?: StorageContext;
}
export interface VectorIndexConstructorProps extends BaseIndexInit<IndexDict> {
vectorStore: VectorStore;
}
"""
# if __name__ == "__main__":
#     chunks, metadata, _id = chunker.call(js_text, "main.py")
#     for chunk in chunks:

import json
import traceback
from loguru import logger
from sweepai.agents.assistant_functions import (
    keyword_search_schema,
    search_and_replace_schema,
)
from sweepai.agents.assistant_wrapper import openai_assistant_call
from sweepai.core.entities import AssistantRaisedException, Message
from sweepai.utils.chat_logger import ChatLogger, discord_log_error
from sweepai.utils.diff import generate_diff
from sweepai.utils.progress import AssistantConversation, TicketProgress
from sweepai.utils.utils import check_code, chunk_code
# Pre-amble using ideas from https://github.com/paul-gauthier/aider/blob/main/aider/coders/udiff_prompts.py
# Doesn't regress on the benchmark but improves average code generated and avoids empty comments.
instructions = """You are an expert software developer assigned to write code to complete the user's request.
You are diligent and tireless and always COMPLETELY IMPLEMENT the needed code!
You NEVER leave comments describing code without implementing it!
Always use best practices when coding.
Respect and use existing conventions, libraries, etc that are already present in the code base.
Your job is to make edits to the file to complete the user "# Request".
# Instructions
Modify the snippets above according to the request by calling the search_and_replace function.
* Keep whitespace and comments.
* Make the minimum necessary search_and_replaces to make changes to the snippets. Only write diffs for lines that have been asked to be changed.
* Write multiple small changes instead of a single large change."""
def int_to_excel_col(n):
    result = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        result = chr(65 + remainder) + result
    return result

def excel_col_to_int(s):
    result = 0
    for char in s:
        result = result * 26 + (ord(char) - 64)
    return result - 1
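# Illustration: these helpers map between integers and spreadsheet-style
# column labels (used as section IDs in the prompt). Note the one-off
# asymmetry between the two directions.
assert int_to_excel_col(27) == "AA"  # generation is 1-based: 1 -> "A", 27 -> "AA"
assert excel_col_to_int("AA") == 26  # parsing is 0-based: "A" -> 0, "AA" -> 26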
MAX_CHARS = 32000
TOOLS_MAX_CHARS = 20000
# @file_cache(ignore_params=["file_path", "chat_logger"])


Step 2: ⌨️ Coding

Create sweepai/vector_db.py with contents:
• Begin by adding a module-level docstring at the top of the `vector_db.py` file, explaining the purpose of the module and any important information about its usage.
• For each class in the file, add a class-level docstring immediately below the class definition. The docstring should describe the purpose of the class and any important attributes or behaviors.
• Within each class, add method-level docstrings for every public method (those not prefixed with an underscore). The docstring should explain what the method does, its parameters, any exceptions it may raise, and what it returns.
• For standalone functions in the module, add function-level docstrings that describe the purpose of the function, its parameters, any exceptions it may raise, and its return value.
• Ensure that all docstrings are formatted correctly with triple quotes and are placed immediately below the class or function signature.
• Use descriptive language that provides clear and useful information to anyone who might use or modify the code in the future.
• If there are any complex algorithms or data structures used within the functions or methods, include a brief explanation within the docstring.
• After adding docstrings, review the entire file to ensure consistency in style and level of detail across all docstrings.
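For illustration, the docstring shape this plan describes might look like the following (the module summary, function name, and wording here are hypothetical, not taken from the actual vector_db.py):

```python
"""Utilities for building and querying the repository vector database."""


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Compute an embedding vector for each input text.

    Args:
        texts: The raw snippet contents to embed.

    Returns:
        One embedding vector per input text, in the same order.

    Raises:
        ValueError: If `texts` is empty.
    """
    ...
```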
  • Running GitHub Actions for sweepai/vector_db.py
Check sweepai/vector_db.py with contents:

Ran GitHub Actions for e02e1d1e8d3c9747a6fcdf6ca0d091f3d2404fa7:
• Vercel Preview Comments:


Step 3: 🔁 Code Review

I have finished reviewing the code for completeness. I did not find errors for sweep/add_docstrings_to_vector_dbpy.


🎉 Latest improvements to Sweep:
  • New dashboard launched for real-time tracking of Sweep issues, covering all stages from search to coding.
  • Integration of OpenAI's latest Assistant API for more efficient and reliable code planning and editing, improving speed by 3x.
  • Use the GitHub issues extension for creating Sweep issues directly from your editor.

💡 To recreate the pull request, edit the issue title or description.
Something wrong? Let us know.

This is an automated message generated by Sweep AI.
