
Introduction to Text Analysis for Non-English and Multilingual Texts #612

Open
hawc2 opened this issue Mar 23, 2024 · 7 comments

Comments

@hawc2
Collaborator

hawc2 commented Mar 23, 2024

Programming Historian in English has received a proposal for a lesson, 'Introduction to Text Analysis for Non-English and Multilingual Texts' by @ian-nai.

I have circulated this proposal for feedback within the English team. We have considered this proposal for:

  • Openness: we advocate for use of open source software, open programming languages and open datasets
  • Global access: we serve a readership working with different operating systems and varying computational resources
  • Multilingualism: we celebrate methodologies and tools that can be applied or adapted for use in multilingual research contexts
  • Sustainability: we're committed to publishing learning resources that can remain useful beyond present-day graphical user interfaces and current software versions

We are pleased to have invited @ian-nai to develop this Proposal into a Submission under the guidance of @lachapot as editor.

The Submission package should include:

  • Lesson text (written in Markdown)
  • Figures: images / plots / graphs (if using)
  • Data assets: codebooks, sample dataset (if using)

We ask @ian-nai to share their Submission package with our Publishing team by email, copying in @lachapot.

We've agreed on a submission date of mid-to-late April. We ask @ian-nai to contact us if they need to revise this deadline.

When the Submission package is received, our Publishing team will process the new lesson materials, and prepare a Preview of the initial draft. They will post a comment in this Issue to provide the locations of all key files, as well as a link to the Preview where contributors can read the lesson as the draft progresses.

If we have not received the Submission package by late April, @lachapot will attempt to contact @ian-nai. If we do not receive any update, this Issue will be closed.

Our dedicated Ombudspersons are Ian Milligan (English), Silvia Gutiérrez De la Torre (español), Hélène Huet (français), and Luis Ferla (português). Please feel free to contact them at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudspersons will have no impact on the outcome of any peer review.

@charlottejmc
Collaborator

charlottejmc commented Apr 19, 2024

Hello @lachapot and @ian-nai,

You can find the key files here:

You can review a preview of the lesson here:


There are a couple of small things I noticed while processing this lesson, which I will outline below:

  • I believe the lesson file is missing the 'core' section of the lesson, Sample Code and Exercises (and its subheadings). I can see that this section is indeed provided in the associated Google Colab notebook, but we would want to see it in the main text.

    We’ve developed some guidelines for authors who choose to integrate codebooks in their lessons. Our aim is to support maintenance, future translatability, and flexible usability. The guidelines are based on a key understanding that we want our readers to be able to make the choice to work in Google Colab, work in their preferred alternative cloud-based development environment, or opt to run the code locally. If authors provide codebooks to accompany their lesson, we ask that:

    • Codebooks consist of the code + line comments only
    • Headings and subheadings mirror those of the lesson to support readers' navigation
    • Codebooks do not extend or replicate commentary from the lesson

    @ian-nai, when you make changes to the notebook, please share the new version with me (publishing.assistant[@]programminghistorian.org): we'll want to save a new copy of it in the lesson's /assets folder.

  • Is it necessary that readers download the full corpus from Wikipedia? If so, we could consider hosting this asset directly in the lesson's assets folder. (Alternatively, if this code is specifically written to download / scrape data assets from a webpage, it is fine for the data to remain outside our repository as long as it is open access.) If not, it may be helpful to make it clearer that downloading the full corpus is optional.

@anisa-hawes
Contributor

anisa-hawes commented Apr 19, 2024

Hello Ian @ian-nai,

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 2: Initial Edit. In this Phase, your editor Laura @lachapot will read your lesson, and provide some initial feedback.

(I see that Charlotte has raised a couple of queries above: 1. a core section appears to be missing from the Markdown file, and 2. would the sample data be useful to host on our repository, or is downloading that dataset from the web intended to be part of the learning actions? I imagine that Laura will have thoughts on these, and you can take the conversation forward together.)

Laura will post feedback and suggestions as a comment in this Issue, so that you can revise your draft in the following Phase 3: Revision 1.

```mermaid
%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
    Section Phase 1 <br> Submission
        Who worked on this? : Publishing Assistant (@charlottejmc)
        All Phase 1 tasks completed? : Yes
    Section Phase 2 <br> Initial Edit
        Who's working on this? : Laura (@lachapot)
        Expected completion date? : May 19
    Section Phase 3 <br> Revision 1
        Who's responsible? : Author (@ian-nai)
        Expected timeframe? : ~30 days after feedback is received
```

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

@lachapot

Hi everyone,

Thank you very much for your lesson, @ian-nai. And thank you @charlottejmc and @anisa-hawes for setting it all up. I will provide some initial feedback in about two weeks' time. Looking forward to working with you on this!

@charlottejmc
Collaborator

Hi @lachapot,

Just an update to say that @ian-nai has very helpfully worked on solving the two points I flagged in my comment above.

  • Ian has opted to embed the code into the Markdown file directly rather than include a separate codebook. I've now updated the lesson file with the new copy of his lesson, which he shared over email, and deleted the codebook from the assets folder.

  • We confirmed that readers don't need to download the whole corpus from Wikipedia (only the excerpt from the text file which we host in the assets folder), and this is now reflected in the lesson's wording.

@lachapot

Great, thank you very much for this update @charlottejmc and @ian-nai!

@lachapot

lachapot commented May 6, 2024

Thank you very much for your lesson submission, @ian-nai! This is an exciting contribution that addresses important gaps in the field of text analysis/NLP. Here are some suggestions for revisions as we start preparing the lesson for external peer-review.

Overall, I’d suggest that there are three broad areas of revision we could focus on at this point:

  • The first is to define more specifically the scope of the lesson: what will readers learn from the lesson, and what foundational generalizable skills does it provide for them to apply to their own research projects? In essence, this is to help more clearly and directly “advertise” the lesson so that readers can more immediately recognize what the lesson is about and whether it’s relevant to them.

  • Currently, the lesson leaves quite a lot unexplained. It could flesh out both the broader context and the specific methodological procedures it presents: the political and technical stakes of doing text analysis with different languages, and the core steps and methods of text analysis (e.g. digitization/OCR, preprocessing procedures, text analysis methods) and how the lesson fits into them (e.g. some preprocessing procedures are problematic when working with different languages; how do we address this?). Without this broader context and background information, the lesson is a little hard to follow, and there isn't an obvious methodological narrative to guide the reader: it's difficult to understand the rationale for each step (language recognition, POS tagging, lemmatization), how the steps fit together, and how they might be useful in a text analysis workflow. Once a clearer scope is defined, some restructuring and expansion could give the lesson more flow, laying out more fully the different steps covered and the rationale for each (that is, why these steps are useful to know about and how they fit into a text analysis workflow more generally).

  • Somewhat relatedly, there is also a question around the level of difficulty of the lesson — is this an introductory lesson aimed at beginners or is this an intermediate lesson that assumes some prior knowledge? Currently, it seems aimed at beginners, but background information and specialist concepts are not consistently explained in lay terms and complex procedures are not always broken down into simple steps with each step fully explained (e.g. the packages introduction mentions pre-trained models, pipelines, processing times, which might not necessarily be beginner-friendly terminology and could either be explained in more simple, lay terms or could provide links to external sources for more information). I’m assuming this lesson is aimed at beginners, so the suggestions I make below are for a beginner level lesson.

Here are some more specific suggestions for revisions for each section of the lesson to address the areas outlined above (when I mention paragraph numbers I’m referring to the lesson preview):

Lesson Goals section

  • As mentioned above, the “Lesson Goals” could be more specific. It seems to me that what’s particularly exciting and unique about the lesson is the focus on multilingual text (specifically, text that includes both Russian and French) and this lesson shows how you can perform two fundamental preprocessing steps that are widely used in text analysis (POS tagging and lemmatization) for multilingual text. Rather than simply presenting this lesson as an introduction to text analysis, this section could be clearer about the specific goals and methods the lesson covers so that readers immediately know what specific skills are covered and why.

  • Similarly, I might suggest tweaking the title of the lesson to be a bit more descriptive and specific. For example, it might be useful to name the specific tools used, and perhaps also follow the Bender rule and explicitly name the specific languages that are addressed in the lesson.

  • Don’t forget to flag up any prerequisites here (perhaps in a separate section of its own), i.e. give some indication of the level of difficulty and of what users need (in terms of skills/knowledge, tools/packages, and data) to follow this lesson. Cf. for example the “Preparation” section in this lesson or this lesson. It might also make sense to move all initial installation information to this section.

Basics of Text Analysis and Working with Non-English and Multilingual Text sections

I’d suggest restructuring these two sections — perhaps merging parts of both sections and potentially also breaking them down into subsections to expand on particular points and examples — in order to set up the focus of the lesson more clearly and get more quickly to the heart of the lesson, i.e. working with multilingual text. As I understand it, this section should be an introduction to text analysis and why text analysis is a useful skill (providing examples of how people have applied it and what it could be used for), but it should also introduce this in context of multilingual text analysis and issues of language diversity in text analysis/NLP so that the reader can understand the broader issues and how the methods presented in this lesson address these issues.

  • Paragraphs 3 and 4 of “Basics of Text Analysis” could be condensed and could also offer more concrete examples of projects and applications — you could point to other Programming Historian lessons (or provide examples of other projects) that illustrate the many applications of text analysis for further reference for readers.

  • Then I’d suggest adding more general context and discussion of linguistic bias in computational text analysis, and of the challenges and considerations to take into account when working with different languages or with multilingual texts specifically. You don’t have to cover everything in detail: you can sketch out the main points and link to further reading/resources for people who want more information. Providing this crucial background for readers unfamiliar with issues of language diversity in text analysis will strengthen the narrative flow of the lesson and clarify the lesson’s own stance in relation to these issues. The information currently in bullet points in “Working with Non-English and Multilingual Text” could be expanded into flowing prose, with some of it integrated into these contextual discussions of linguistic bias in NLP. The examples you provide (encoding, right-to-left scripts, logographic languages, etc.) could also be expanded further (potentially with further reading/resources, images and more specific examples) to illustrate more concretely the challenges that people might encounter when working with different languages.

  • I’d also suggest adding here a section that introduces key steps and concepts of text analysis relevant to the lesson (e.g. laying out how parts of speech tagging and lemmatization are fundamental steps in text analysis amongst others, but that these can be difficult to realize with multilingual texts because of issues outlined above). This can be an occasion to introduce and explain any specialist vocabulary or fundamental concepts, and make clear what specific methodologies are presented in this lesson and how they might fit into a broader text analysis workflow.
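
As a concrete illustration of what working across scripts can involve, here is a minimal, hypothetical sketch (not the lesson's code) that guesses the dominant script of a string from Unicode character names, using only the Python standard library. A real lesson would more likely use a dedicated language-identification package; the `dominant_script` function and its two-script table are illustrative assumptions only.

```python
import unicodedata

def dominant_script(text):
    """Guess the dominant script of `text` by inspecting Unicode character names."""
    counts = {"CYRILLIC": 0, "LATIN": 0}
    for char in text:
        if char.isalpha():
            # Unicode names encode the script, e.g. "CYRILLIC CAPITAL LETTER VE".
            name = unicodedata.name(char, "")
            for script in counts:
                if name.startswith(script):
                    counts[script] += 1
    return max(counts, key=counts.get)

print(dominant_script("Война и мир"))     # CYRILLIC
print(dominant_script("Guerre et paix"))  # LATIN
```

This only distinguishes two scripts and ignores mixed-script edge cases, but it makes tangible why script detection is a meaningful first step before applying language-specific tools.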

Tools We’ll Cover section

  • When discussing the different packages, perhaps try to consistently add links to documentation and information on the different languages these tools work with (perhaps also note whether these packages have documentation in languages other than English if possible). Also make sure to explain in beginner terms the general features of the packages and how that might be relevant to the user’s considerations.

  • It might also help with the structure and flow of the lesson to have a few introductory sentences to this section that link back to the previous discussion and clarify why we’re comparing these packages and what the payoff is of comparing these different packages (e.g. that different libraries exist for NLP, these are widely used NLP packages, they cover different languages, they might be more or less difficult to use etc.).

Sample Code and Exercises section

  • Add the original Russian title of War and Peace, with the English title in brackets after the Russian, in paragraph 7.

  • Since the sample data is only a few lines of text, might it be clearer to say “We will take a few lines of text from…” rather than “a corpus of text”?

  • For clarity, I’d suggest using the comments in code snippets strictly for explaining the code, and moving any more contextual comments out of the code blocks and into the main body of the lesson (e.g. in the code snippet at paragraph 9, the comment “we are using minimally preprocessed excerpt..” could be moved below the code snippet rather than kept inside it). Some of this more contextual information could also be more fully explained and linked back to discussions introduced in the previous sections (e.g. summarize the key points of the article you reference: what constitutes typical preprocessing steps, and how is this sometimes problematic in relation to questions of language diversity specifically?).

  • Consider splitting longer blocks of code where appropriate and, as mentioned above, moving code comments to markdown text where appropriate (e.g. code block at paragraph 24 could be split and unpacked further, similarly paragraph 28 could be split into two or three blocks, etc.).

  • Make sure that code comments are as descriptive and explicit as possible about what the code is actually doing. For example, rather than just “Russian only” in the code at paragraph 14, specify that this stores in the variable rus_sent the sentence at index 5 (the sixth sentence in the list). Similarly, when setting up spaCy, add clarifying comments breaking down the procedure: downloading the relevant model, loading it, and creating a spaCy document that contains rich linguistic information we can use for further analysis/processing (e.g. POS tagging); then link to the spaCy website and note that readers can choose the model relevant to their research. In general, keep in mind, as far as possible, how readers might want to generalize to their own projects, and provide pointers to help them do so where possible.

  • Similarly, make sure to clarify any specialist concepts and terminology by providing explanations in the lesson itself and/or by linking to external resources (e.g. tokenization, regex, etc).

  • Perhaps add more explanation for some of the outputs (e.g. what do the lemmatization outputs show specifically?). It might also be useful for readers to have some brief discussion, or at least a flagging, of limitations they might want to consider (e.g. how well is lemmatization performing for each language?), and perhaps pointers to how lemmatization or POS tagging can be used in further analysis (by linking to other Programming Historian lessons, for example).

  • Consider perhaps renaming or adding in titles for your sections to bring out more explicitly the methodological narrative you’re presenting here. E.g. “Identifying Languages” could be something like “How to automatically detect different languages and scripts”… It also seems that perhaps a section title could be added, after loading and tokenizing the text, to indicate the part of the lesson that demonstrates or compares how well the different packages detect languages and the limitations when working with multilingual text.

  • One small problem with the code at paragraph 30: the output shows only the Russian POS tags (you probably need to iterate over each of the processed docs?).

  • For the spaCy code at paragraph 32, I get the error 'Document' object is not iterable (perhaps check the naming of your variables: fre_nlp is defined, but nlp is then used for creating the doc…)?
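
A hedged sketch of how these two fixes might look. It uses a stand-in pipeline class so it runs without downloading spaCy models; `FakePipeline`, `FakeToken`, `rus_nlp`, and `fre_nlp` are hypothetical names for illustration, not the lesson's actual code:

```python
class FakeToken:
    """Stand-in for a spaCy Token, exposing .text and .pos_."""
    def __init__(self, text, pos):
        self.text, self.pos_ = text, pos

class FakePipeline:
    """Stand-in for a loaded spaCy pipeline (e.g. spacy.load('fr_core_news_sm'))."""
    def __init__(self, tagged):
        self.tagged = tagged
    def __call__(self, text):
        # A real pipeline would tokenize and tag `text`; this returns canned tags.
        return [FakeToken(t, p) for t, p in self.tagged]

rus_nlp = FakePipeline([("Война", "NOUN"), ("и", "CCONJ"), ("мир", "NOUN")])
fre_nlp = FakePipeline([("Guerre", "NOUN"), ("et", "CCONJ"), ("paix", "NOUN")])

# Create each doc with the *matching* pipeline variable (fre_nlp, not nlp).
rus_doc = rus_nlp("Война и мир")
fre_doc = fre_nlp("Guerre et paix")

# Iterate over both processed docs, so the output covers both languages
# rather than only the last one processed.
for language, doc in [("Russian", rus_doc), ("French", fre_doc)]:
    for token in doc:
        print(f"{language}: {token.text} -> {token.pos_}")
```

Keeping one clearly named pipeline variable per language makes it harder to accidentally process a French sentence with the wrong pipeline, and the final loop shows the POS tags for both documents.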

Otherwise the code runs smoothly!
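
To illustrate the kind of descriptive comment suggested above for the sentence-indexing step, here is a hypothetical sketch (naive full-stop splitting; the sample text and variable names are illustrative, not taken from the lesson):

```python
# Split the sample text into sentences (naively, on full stops, for illustration).
text = "One. Two. Three. Four. Five. Шестое предложение на русском."
sentences = [s.strip() for s in text.split(".") if s.strip()]

# Store in `rus_sent` the sentence at index 5 (the sixth sentence in the
# list), which in this sample is the Russian-language sentence.
rus_sent = sentences[5]
print(rus_sent)  # Шестое предложение на русском
```

A comment that spells out "index 5 is the sixth sentence, the Russian one" tells readers both what the line does and why, which makes the step easier to adapt to their own data.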

I hope this is helpful! Let me know if there’s anything you’d like to discuss or if you have any questions. Ideally, this first round of revisions would happen within 30 days so we can move swiftly on to the next phase, but let us know if there are any adjustments you need to make on the timeline.

Thanks again for this exciting contribution and looking forward to working on this with you!

Laura

@anisa-hawes
Contributor

What's happening now?

Hello Ian @ian-nai. Your lesson has been moved to the next phase of our workflow which is Phase 3: Revision 1.

This Phase is an opportunity for you to revise your draft in response to @lachapot's initial feedback.

I've sent you an invitation to join us as an Outside Collaborator here on GitHub. This will give you the Write access you'll need to edit your lesson directly.

We ask authors to work on their own files with direct commits: we prefer you don't fork our repo, or use the Pull Request system to edit in ph-submissions. You can make direct commits to your file here: /en/drafts/originals/non-english-and-multilingual-text-analysis.md. @charlottejmc and I can help if you encounter any practical problems!

When you and Laura @lachapot are both happy with the revised draft, we will move forward to Phase 4: Open Peer Review.

```mermaid
%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
    Section Phase 2 <br> Initial Edit
        Who worked on this? : Editor (@lachapot)
        All Phase 2 tasks completed? : Yes
    Section Phase 3 <br> Revision 1
        Who's working on this? : Author (@ian-nai)
        Expected completion date? : June 8
    Section Phase 4 <br> Open Peer Review
        Who's responsible? : Reviewers (TBC)
        Expected timeframe? : ~60 days after request is accepted
```

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.
