Text Generation Server

Serving Large Language Models as an API Endpoint for Inference

Table of Contents

  • Introduction
  • Getting Started
  • Installation
  • Requirements
  • Usage
  • Contributing
  • License

Introduction

This project aims to effortlessly deploy and serve large language models in the cloud as an API endpoint, with a simple chat interface for inference. This repository provides streamlined server-side code to deploy your LLMs in a cloud environment, enabling seamless inference through a chat interface on your local system.

Getting Started

If you want to deploy and host your model in the cloud, follow the steps below to get started.

Installation

First, clone this repository and change into its directory:

git clone https://github.com/TheFaheem/TGS.git && cd TGS

Requirements

Make sure you install all the required libraries by running:

pip install -r requirements.txt

Now you are good to go.

Usage

After setting up this repository on your cloud machine, you can start deploying your model by following the steps below.

Serving:

You can start deploying your model as an API endpoint in the cloud by running the following command with the appropriate arguments:

python serve.py --model_type ${MODEL TYPE} --repo_id ${REPO ID} --revision ${REVISION} --model_basename ${MODEL BASENAME} --trust_remote_code ${TRUST REMOTE CODE} --safetensors ${SAFETENSOR}

Argument Details:

  • MODEL TYPE - Type of the model, e.g., llama, mpt, falcon, rwkv
  • REPO ID - Hugging Face repo id of the model
  • REVISION - Specific branch of the model repo to download from
  • MODEL BASENAME - Name of the safetensors file, without the '.safetensors' extension
  • TRUST REMOTE CODE - Whether or not to trust remote code
  • SAFETENSOR - Whether or not to use safetensors
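
An illustrative invocation (the repo id, revision, and basename below are placeholder values for a hypothetical GPTQ model; substitute the details of the model you want to serve):

python serve.py --model_type llama --repo_id TheBloke/Llama-2-7B-Chat-GPTQ --revision main --model_basename model --trust_remote_code False --safetensors True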

Inferencing:

You can start the chat interface backed by your model from your local system by running the following command in your terminal. The inference.py file takes care of all the API calls behind the scenes.

python inference.py --endpoint ${ENDPOINT} --streaming ${STREAMING} --max_tokens ${MAX TOKENS} --ht_ws ${HTWS} --temperature ${TEMPERATURE} --top_p ${TOP_P} --top_k ${TOP_K}

Argument Details:

  • ENDPOINT - URL provided by the cloud shortly after you start deploying your model
  • STREAMING - Whether to stream the result or not
  • MAX TOKENS - Maximum number of tokens to generate
  • HTWS - Whether to use HTTP ("http") or websockets ("ws")
  • TEMPERATURE - Temperature for sampling; a temperature of 0.0 produces a concise, deterministic response, whereas a temperature close to 1.0 increases randomness in the output
  • TOP_P - Top cumulative probability of logits to sample from
  • TOP_K - Top K logits used for sampling
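
An illustrative invocation (the endpoint URL below is a placeholder; use the URL your cloud provider returns after deployment):

python inference.py --endpoint https://your-deployment.example.com --streaming True --max_tokens 512 --ht_ws http --temperature 0.7 --top_p 0.95 --top_k 40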

Contributing

If you have any ideas, have found a bug, or want to improve this project further, I encourage you to contribute by creating a fork of this repo. When you are done with your work, just create a pull request; I'll check it and merge it as soon as I can.

License

This project is licensed under the terms of the MIT License.

If you find this repo useful, just a reminder: there's a Star button up there. I hope this will be useful for you :)
