Add support for Vision API #500

Open
wants to merge 5 commits into
base: main


lectrician1

@lectrician1 lectrician1 commented Nov 26, 2023

Adds the ability to use the vision API in the client. Resolves #488 and #473

Message interface: (screenshot)

Editing interface: (screenshot)

  • Convert an uploaded image file to base64 and store it in the browser (localStorage). This isn't perfect, since localStorage may run out of space, but I don't have a better way to keep images from previous chats, and IndexedDB looks too hard to implement.
  • Add an image detail selector so users can choose the detail level at which their image is processed. See: https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding
  • Add a new data structure that differentiates models by the types of input they accept, and update the editor to check whether the selected model supports images. I added a const modelTypes and check it in EditView.tsx.
  • Clean up the image upload UI (it's not perfect, but it works).
  • Still needed: a data structure version migration function so that clients with the old data structures get theirs updated (I need help with this).
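
A minimal sketch of the model-capability check described in the list above. Only the name modelTypes comes from this PR; the ModelType values, the model names in the table, and the supportsImages helper are assumptions made for illustration and may not match the actual code in EditView.tsx.

```typescript
// Hypothetical capability labels: 'text' = text-only input, 'image' = text plus images.
type ModelType = 'text' | 'image';

// Illustrative table; the real modelTypes const in the PR may list other models.
const modelTypes: Record<string, ModelType> = {
  'gpt-3.5-turbo': 'text',
  'gpt-4': 'text',
  'gpt-4-vision-preview': 'image',
};

// Hypothetical helper: models missing from the table are treated as text-only.
function supportsImages(model: string): boolean {
  return modelTypes[model] === 'image';
}
```

An editor component could call supportsImages(currentModel) to decide whether to show the image upload controls.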

Notable changes

This PR changes the data structures of ChatInterface and MessageInterface, and adds ContentInterface, TextContentInterface and ImageContentInterface, so that they follow the updated API JSON structure for prompts that contain images.

Before:

export interface MessageInterface {
  role: Role;
  content: string;
}

After:

export type Content = 'text' | 'image_url';

export interface ImageContentInterface extends ContentInterface {
  type: 'image_url';
  image_url: {
    url: string;
  }
}

export interface TextContentInterface extends ContentInterface {
  type: 'text';
  text: string;
}

export interface ContentInterface {
  type: Content;
}

export interface MessageInterface {
  role: Role;
  content: ContentInterface[];
}
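
A short sketch of how the new structure might be used, repeating the interfaces above so the block stands alone. The Role union, the type guards, the messageText helper, and the example blob URL are assumptions for illustration and are not taken from the PR.

```typescript
type Role = 'user' | 'assistant' | 'system'; // assumed; Role is not shown in this PR description
type Content = 'text' | 'image_url';

interface ContentInterface {
  type: Content;
}

interface TextContentInterface extends ContentInterface {
  type: 'text';
  text: string;
}

interface ImageContentInterface extends ContentInterface {
  type: 'image_url';
  image_url: {
    url: string;
  };
}

interface MessageInterface {
  role: Role;
  content: ContentInterface[];
}

// Hypothetical type guards that narrow a content part by its `type` field.
function isTextContent(part: ContentInterface): part is TextContentInterface {
  return part.type === 'text';
}

function isImageContent(part: ContentInterface): part is ImageContentInterface {
  return part.type === 'image_url';
}

// Hypothetical helper: join all text parts of a message, ignoring images.
function messageText(message: MessageInterface): string {
  return message.content
    .filter(isTextContent)
    .map((part) => part.text)
    .join('\n');
}

const question: TextContentInterface = { type: 'text', text: 'What is in this picture?' };
const picture: ImageContentInterface = {
  type: 'image_url',
  image_url: { url: 'blob:example-image' }, // placeholder URL
};

const example: MessageInterface = { role: 'user', content: [question, picture] };
```

Code that previously read message.content as a string would need something like messageText to get the text back out.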

The url field stores the local URL of the image (the blob: URL). At generation time, the client converts every blob URL to base64 so the API can accept it.
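
A rough sketch of the substitution step only, assuming the blob contents have already been read and encoded elsewhere (that part is browser-specific and is left out). The inlineImages name, the dataUrls lookup, and the simplified content types are hypothetical, not the PR's implementation.

```typescript
interface TextPart {
  type: 'text';
  text: string;
}

interface ImagePart {
  type: 'image_url';
  image_url: { url: string };
}

type MessagePart = TextPart | ImagePart;

// Replace each blob: URL with its base64 data URL from `dataUrls`
// (assumed to map blob URL -> data URL). Other parts pass through unchanged.
function inlineImages(
  content: MessagePart[],
  dataUrls: Record<string, string>
): MessagePart[] {
  return content.map((part): MessagePart => {
    if (part.type !== 'image_url' || part.image_url.url.slice(0, 5) !== 'blob:') {
      return part;
    }
    const dataUrl = dataUrls[part.image_url.url];
    if (dataUrl === undefined) {
      throw new Error(`No base64 data found for ${part.image_url.url}`);
    }
    return { type: 'image_url', image_url: { url: dataUrl } };
  });
}
```

Returning new objects rather than mutating the parts keeps the stored messages pointing at their blob URLs while only the outgoing request carries base64.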

@OldKrab

OldKrab commented Jan 19, 2024

yarn run build now fails with type errors.

@Ahmet-Dedeler

Have there been any updates on this?

@lectrician1
Author

Uhhh well haven't had time to finish this up but maybe will try this weekend...

@Ahmet-Dedeler

@lectrician1 That would be amazing, this feature would literally be the revolutionary feature for this repo.

Lately the repo hasn't been very active, so this might just change the future of how active and innovative it becomes.

Please keep us updated here, will try to help :)

@lectrician1 lectrician1 marked this pull request as ready for review February 19, 2024 08:27
@lectrician1
Author

lectrician1 commented Feb 19, 2024

@Ahmet-Dedeler it's almost ready.

I just need help from @ztjhz @akira0245 or @ayaka14732 to write the migration code in migrate.ts and chat.ts to migrate the old data structure to the new one for ChatInterface and associated interfaces (see the PR description). I tried doing this and it's really confusing. I'm not sure how the whole migrate process works.
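
For reference, a minimal sketch of what the per-message part of such a migration could look like, assuming old messages store content as a plain string and new ones store an array of content parts. The type and function names are hypothetical, and this does not show how migrate.ts or the store versioning in this repo actually work.

```typescript
type Role = 'user' | 'assistant' | 'system'; // assumed

interface TextPart {
  type: 'text';
  text: string;
}

interface ImagePart {
  type: 'image_url';
  image_url: { url: string };
}

type Part = TextPart | ImagePart;

// Old shape: content is a plain string.
interface LegacyMessage {
  role: Role;
  content: string;
}

// New shape: content is a list of typed parts.
interface Message {
  role: Role;
  content: Part[];
}

// Wrap a legacy string in a single text part; already-migrated
// messages are copied unchanged, so running this twice is harmless.
function migrateMessage(message: LegacyMessage | Message): Message {
  if (typeof message.content === 'string') {
    return {
      role: message.role,
      content: [{ type: 'text', text: message.content }],
    };
  }
  return { role: message.role, content: message.content };
}

// Apply the per-message migration to every message in a chat.
function migrateMessages(messages: Array<LegacyMessage | Message>): Message[] {
  return messages.map(migrateMessage);
}
```

A full migration would still need to walk every stored chat and bump whatever version marker the persisted state uses.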

@ztjhz
Owner

ztjhz commented Feb 25, 2024

I'll take a look at this.

@Ahmet-Dedeler

Sorry for tagging, but did any of you get a chance to look at this @ztjhz @akira0245 @ayaka14732 ?


Successfully merging this pull request may close these issues.

BetterChatGPT compatibility with GPT-4 Turbo Vision API for image and text processing
4 participants