google / sentencepiece Public

Issues: google/sentencepiece

25 Open 696 Closed

Issues list

Runtime error on iOS

#1010 opened May 15, 2024 by l3utterfly

Tokenization for phonetic languages

#1009 opened May 14, 2024 by divyeshrajpura4114

pip subprocess to install build dependencies did not run successfully. │ exit code: 1

#989 opened Mar 21, 2024 by Anubiiss

coredump when build with CXXFLAGS -Wp,-D_GLIBCXX_ASSERTIONS

#987 opened Mar 20, 2024 by Henry-ZHR

Allow whitespace-only pieces

#984 opened Feb 26, 2024 by bauwenst

High frequency token segmented into letter sequence when input is a tsv file

Labels: bug

#967 opened Jan 30, 2024 by TingxunShi

Evaluate Profile-Guided Optimization (PGO)

#961 opened Jan 9, 2024 by zamazan4ik

Tips for Termux installation

#950 opened Dec 14, 2023 by Manamama

A library that conflicts with the use of protobuf in vcpkg

#927 opened Oct 31, 2023 by hhxdestiny

A recent EMNLP work to share about task-adaptive tokenization with variable segmentation

#924 opened Oct 24, 2023 by lsy641

Unexpected behavior with sampling of repeated character sequence.

#904 opened Aug 14, 2023 by kellymarchisio

Duplicate tokens in BPE vocabulary

#881 opened Jun 9, 2023 by astanic

Patches carried by conda-forge for packaging sentencepiece

#876 opened Jun 6, 2023 by h-vetinari

Tokens Chunking to respect Language Word Boundaries

#866 opened May 17, 2023 by loretoparisi

Python from source on armv7l raises ' undefined symbol: __atomic_fetch_add_8 '

#865 opened May 17, 2023 by FrancescoScandiffio

Only support 64bit?

#842 opened Apr 6, 2023 by logicvv

tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type

Labels: bug

#801 opened Dec 15, 2022 by lintangsutawika

split_by_number doesn't match documentation?

#776 opened Aug 29, 2022 by ywrt

bazel support for C++ API

Labels: feature request

#683 opened Aug 17, 2021 by BBerabi

user defined char set

Labels: feature request

#649 opened Apr 18, 2021 by wenjie-p

Would plan to support BBPE

Labels: feature request

#621 opened Feb 2, 2021 by MrRace

Sentencepiece with pre-defined vocabulary

Labels: feature request, help wanted

#571 opened Oct 22, 2020 by vladmosin

How to create new model file with restricted vocabulary?

Labels: feature request, help wanted

#522 opened Jul 23, 2020 by sshleifer

can we train by Parallel Computing or Multithreading or multi-Progress

Labels: feature request

#366 opened Jul 12, 2019 by joytianya

Guidance on how to implement subword sampling at train time

Labels: sample code

#103 opened Jun 14, 2018 by sooheon
