Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Matching content in our doctests #197

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

k0pernicus
Copy link

This PR solves issue #189 in order to match the content in our doctests.

I updated all the sources in the texthero folder - the main issue is in the scatterplot function, in visualization.py, where the 3D representation on the browser does not show anything (WiP).

I also updated the file CONTRIBUTING.md in order to inform the project contributors to match as much as possible the doctests in their examples / tests.

To finish, I also updated some doctests in order to add a new line between the doc and the source code sometimes (for clarity), to remove extra whitespaces, etc...

@k0pernicus k0pernicus changed the title WiP Matching content in our doctests [WIP] Matching content in our doctests Dec 4, 2020
@k0pernicus
Copy link
Author

k0pernicus commented Dec 5, 2020

Hi @jbesomi,
I have an issue about the return of replace_stopwords for Python 3.6.

Based on the runners, the doctests for Python 3.6 are not valid due that replace_stopwords filters the punctuations: https://travis-ci.com/github/jbesomi/texthero/jobs/454997146.

However, the doctests for Python 3.X (with X >= 7) are valid because replace_stopwords does not filter the punctuations, which is, checking directly in the regex, seems the good behaviour.
As an example, this is output of the runner for Python 3.8: https://travis-ci.com/github/jbesomi/texthero/jobs/454997148.

Do you have any idea about this issue?
It does not seems, in the code, that there is a specific code version for Python 3.6 and another one for the other Python versions...

@jbesomi
Copy link
Owner

jbesomi commented Dec 8, 2020

Hi @k0pernicus, thank you for your PR! Amazing 🎉

Regarding the issue with replace_stopwords, probably this is due to the regex pattern. I will investigate that and get back to you. For you to know, we were thinking about adding a tokenization function and require all preprocessing functions to receive an already tokenized function, to avoid this kind of problems (for this function, for instance, we would have to go through the list of tokens and remove the stopwords).

@k0pernicus
Copy link
Author

Hi @jbesomi, thank you for the update :)

For you to know, we were thinking about adding a tokenization function and require all preprocessing functions to receive an already tokenized function, to avoid this kind of problems[...].

Great!
Do not hesitate if you want to test this feature and integrate in this PR, I would be glad to help!

@k0pernicus
Copy link
Author

k0pernicus commented Jan 9, 2021

Hi @jbesomi ,
Do you have any news about the regex issue please? :)

@jbesomi
Copy link
Owner

jbesomi commented Jan 12, 2021

Hi @k0pernicus,

I've intensively thought about this and come to the conclusion that it's better for both of us developers and for all Texthero users to have all preprocessing functions to accept an already Tokenized Series. See here #145 for a complete discussion on the subject. Would you like to help with #145 too? Once implemented, writing replace_stopwords will be much easier, hence we will be able to integrate this PR too.

@k0pernicus
Copy link
Author

Hi @jbesomi,
I would be glad to help for this issue, I can take a look starting tomorrow about #145.

@jbesomi
Copy link
Owner

jbesomi commented Jan 12, 2021

Thank you @k0pernicus !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants