export model to fp16 #347

Open · kroggen wants to merge 3 commits into master
Conversation

kroggen (Contributor) commented Aug 23, 2023

No description provided.

rdentato (Contributor):

Ok. I see you went for a much deeper change.

Did you manage to test it?

kroggen (Contributor, Author) commented Aug 23, 2023

It is not tested. I am trying to implement the loading of the model (version 0 and maybe 1).

karpathy (Owner):

Question: what is the benefit of fp16?

  • As the Llama 2 models were trained in bf16 I find fp16 conversion sketchy. For newly trained models this is less of a concern
  • The file sizes are ofc ~2X smaller
  • The code is a little bit more bloated

Am I missing some considerations?
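For context on why the bf16-to-fp16 cast can be lossy: bf16 keeps the fp32 exponent range with fewer mantissa bits, while fp16 has finer mantissa but a maximum of about 65504 (and a subnormal floor around 6e-8). A minimal sketch, not part of the PR, showing the overflow side of that trade-off:

import torch

# bf16 shares fp32's exponent range; fp16 saturates at ~65504,
# so anything beyond that overflows to inf when cast down
x = torch.tensor([1e5, 3.0, -2.5e4], dtype=torch.bfloat16)
print(x.to(torch.float16))  # first element becomes inf; the others survive

For typical weight magnitudes this rarely triggers, but very small values can also flush to zero in fp16, which is the other half of the "sketchy" concern.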

    b = struct.pack(f'{len(d)}b', *d)

def serialize(file, tensor, type):
    """ writes one tensor to file that is open in wb mode """
    if type == 'fp32':
karpathy (Owner):

feels simplifiable
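One possible simplification, sketched under the assumption that the type string maps cleanly onto a numpy dtype (bf16 would still need special handling, since numpy has no bfloat16); the DTYPES table is illustrative, not the PR's code:

import numpy as np
import torch

# hypothetical lookup table: one code path instead of per-type struct.pack branches
DTYPES = {'fp32': np.float32, 'fp16': np.float16}

def serialize(file, tensor, type='fp32'):
    """ writes one tensor to file that is open in wb mode """
    d = tensor.detach().cpu().view(-1).to(torch.float32).numpy()
    file.write(d.astype(DTYPES[type]).tobytes())  # native (little-endian) byte order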

@@ -129,7 +133,7 @@ def legacy_export(model, filepath):
 # -----------------------------------------------------------------------------
 # new version

-def version1_export(model, filepath):
+def version1_export(model, filepath, type):
karpathy (Owner):

we'd have to serialize the type to the header too
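A sketch of what serializing the type into the header could look like; the magic/version fields follow the new-format header, but the field names, function name, and DTYPE_CODES numbering are assumptions for illustration:

import struct

DTYPE_CODES = {'fp32': 0, 'fp16': 1, 'bf16': 2}  # hypothetical numbering

def write_header(file, version, p, shared_classifier, dtype='fp32'):
    # illustrative layout: magic, version, config ints, flags, then the dtype code;
    # a real header would also keep its fixed-size padding consistent
    file.write(struct.pack('I', 0x616b3432))   # magic ("ak42")
    file.write(struct.pack('i', version))
    file.write(struct.pack('7i', p.dim, p.hidden_dim, p.n_layers, p.n_heads,
                           p.n_kv_heads, p.vocab_size, p.max_seq_len))
    file.write(struct.pack('B', int(shared_classifier)))
    file.write(struct.pack('i', DTYPE_CODES[dtype]))  # new: weight data type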

rdentato (Contributor) commented Aug 25, 2023

The point is that they can be loaded directly onto the GPU. Not needing on-the-fly conversion (and having a smaller file to load) significantly reduces the load time, which for my Tesla T4 is around 2 min for the llama2_7b models.

Also, I tested llama2.c on an ARM machine using its native support for fp16 and it works like a charm (and ARM CPUs are cheaper on AWS).
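To illustrate the "no on-the-fly conversion" point, here is a minimal sketch; the function name and the offset/shape arguments are hypothetical, not llama2.c's loader:

import numpy as np
import torch

def load_fp16_weights(path, offset, shape):
    # map the fp16 bytes straight from disk and move them to the GPU
    # without ever materializing a float32 copy
    w = np.memmap(path, dtype=np.float16, mode='r', offset=offset, shape=shape)
    return torch.from_numpy(np.ascontiguousarray(w)).cuda()  # stays float16 on device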

@@ -450,6 +454,7 @@ def torchscript_export(model, filepath, zero_params=False, gzip_output=False):
 parser = argparse.ArgumentParser()
 parser.add_argument("filepath", type=str, help="the output filepath")
 parser.add_argument("--version", default=0, type=int, help="the version to export with")
+parser.add_argument("--type", default='fp32', type=str, help="the data type to export to (fp32, fp16, bfloat16)")
karpathy (Owner):

I'm not 100% decided if type should be a separate variable that is written into the header, or if it should just be absorbed into version, e.g.:

  • version 0: original, float32
  • version 1: original, float16
  • version 2: new header, int8

etc., and just go that way.
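The "absorb type into version" option would look roughly like this; the table contents and numbering are hypothetical, following the comment above:

# hypothetical mapping: each version number fixes both the header layout
# and the weight data type
VERSION_TABLE = {
    0: ('original header', 'float32'),
    1: ('original header', 'float16'),
    2: ('new header', 'int8'),
}

def resolve_version(version):
    layout, dtype = VERSION_TABLE[version]
    return layout, dtype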

Contributor:

There's another PR that just uses "--version" for this.

kroggen (Contributor, Author):

Don't forget support for bf16 (and maybe others to come).

If a version number is used for each data type, it is both unintuitive and will end up with a lot of "versions".
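A quick way to see the proliferation concern: with separate fields, file layouts and data types combine additively rather than multiplicatively (the lists below are illustrative):

layouts = ['legacy', 'new-header']          # file-layout revisions
dtypes  = ['fp32', 'fp16', 'bf16']          # plus int8, fp8, ... later

print(len(layouts) * len(dtypes))  # 6 version numbers if type is folded into version
print(len(layouts) + len(dtypes))  # 5 codes if version and type stay separate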
