
Introduction to Text Analysis for Non-English and Multilingual Texts #612

Open
hawc2 opened this issue Mar 23, 2024 · 7 comments

Comments

@hawc2
Collaborator

hawc2 commented Mar 23, 2024

Programming Historian in English has received a proposal for a lesson, 'Introduction to Text Analysis for Non-English and Multilingual Texts' by @ian-nai.

I have circulated this proposal for feedback within the English team. We have considered this proposal for:

  • Openness: we advocate for use of open source software, open programming languages and open datasets
  • Global access: we serve a readership working with different operating systems and varying computational resources
  • Multilingualism: we celebrate methodologies and tools that can be applied or adapted for use in multilingual research contexts
  • Sustainability: we're committed to publishing learning resources that can remain useful beyond present-day graphical user interfaces and current software versions

We are pleased to have invited @ian-nai to develop this Proposal into a Submission under the guidance of @lachapot as editor.

The Submission package should include:

  • Lesson text (written in Markdown)
  • Figures: images / plots / graphs (if using)
  • Data assets: codebooks, sample dataset (if using)

We ask @ian-nai to share their Submission package with our Publishing team by email, copying in @lachapot.

We've agreed on a submission date of mid-to-late April. We ask @ian-nai to contact us if they need to revise this deadline.

When the Submission package is received, our Publishing team will process the new lesson materials, and prepare a Preview of the initial draft. They will post a comment in this Issue to provide the locations of all key files, as well as a link to the Preview where contributors can read the lesson as the draft progresses.

If we have not received the Submission package by late April, @lachapot will attempt to contact @ian-nai. If we do not receive any update, this Issue will be closed.

Our dedicated Ombudspersons are Ian Milligan (English), Silvia Gutiérrez De la Torre (español), Hélène Huet (français), and Luis Ferla (português). Please feel free to contact them at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudspersons will have no impact on the outcome of any peer review.

@charlottejmc
Collaborator

charlottejmc commented Apr 19, 2024

Hello @lachapot and @ian-nai,

You can find the key files here:

You can review a preview of the lesson here:


There are a couple of small things I noticed while processing this lesson, which I will outline below:

  • I believe the lesson file is missing the 'core' section of the lesson, Sample Code and Exercises (and its subheadings). I can see that this section is indeed provided in the associated Google Colab notebook, but we would want to see it in the main text.

    We’ve developed some guidelines for authors who choose to integrate codebooks in their lessons. Our aim is to support maintenance, future translatability, and flexible usability. The guidelines are based on a key understanding that we want our readers to be able to make the choice to work in Google Colab, work in their preferred alternative cloud-based development environment, or opt to run the code locally. If authors provide codebooks to accompany their lesson, we ask that:

    • Codebooks consist of the code + line comments only
    • Headings and subheadings mirror those of the lesson to support readers' navigation
    • Codebooks do not extend or replicate commentary from the lesson

    @ian-nai, when you make changes to the notebook, please share the new version with me (publishing.assistant[@]programminghistorian.org): we'll want to save a new copy of it in the lesson's /assets folder.

  • Is it necessary that readers download the full corpus from Wikipedia? If so, we could consider hosting this asset directly in the lesson's assets folder. (Alternatively, if this code is specifically written to download / scrape data assets from a webpage, it is fine for the data to remain outside our repository as long as it is open access.) If not, it may be helpful to make it clearer that downloading the full corpus is optional.

@anisa-hawes
Contributor

anisa-hawes commented Apr 19, 2024

Hello Ian @ian-nai,

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 2: Initial Edit. In this Phase, your editor Laura @lachapot will read your lesson, and provide some initial feedback.

(I see that Charlotte has raised a couple of queries above: 1. a core section appears to be missing from the Markdown file, and 2. would the sample data be useful to host on our repository, or is downloading that dataset from the web intended to be part of the learning actions? I imagine that Laura will have thoughts on these, and you can take the conversation forward together.)

Laura will post feedback and suggestions as a comment in this Issue, so that you can revise your draft in the following Phase 3: Revision 1.

```mermaid
%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
    Section Phase 1 <br> Submission
        Who worked on this? : Publishing Assistant (@charlottejmc)
        All Phase 1 tasks completed? : Yes
    Section Phase 2 <br> Initial Edit
        Who's working on this? : Laura (@lachapot)
        Expected completion date? : May 19
    Section Phase 3 <br> Revision 1
        Who's responsible? : Author (@ian-nai)
        Expected timeframe? : ~30 days after feedback is received
```

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

@lachapot

Hi everyone,

Thank you very much for your lesson, @ian-nai. And thank you @charlottejmc and @anisa-hawes for setting it all up. I will provide some initial feedback in about two weeks' time. Looking forward to working with you on this!

@charlottejmc
Collaborator

Hi @lachapot,

Just an update to say that @ian-nai has very helpfully worked on solving the two points I flagged in my comment above.

  • Ian has opted to embed the code into the Markdown file directly rather than include a separate codebook. I've now updated the lesson file with the new copy of his lesson, which he shared over email, and deleted the codebook from the assets folder.

  • We confirmed that readers don't need to download the whole corpus from Wikipedia (only the excerpt from the text file which we host in the assets folder), and this is now reflected in the lesson's wording.

@lachapot

Great, thank you very much for this update @charlottejmc and @ian-nai!

@lachapot

lachapot commented May 6, 2024

Thank you very much for your lesson submission, @ian-nai! This is an exciting contribution that addresses important gaps in the field of text analysis/NLP. Here are some suggestions for revisions as we start preparing the lesson for external peer-review.

Overall, I’d suggest that there are three broad areas of revision we could focus on at this point:

  • The first is to define more specifically the scope of the lesson: what will readers learn from the lesson, and what foundational generalizable skills does it provide for them to apply to their own research projects? In essence, this is to help more clearly and directly “advertise” the lesson so that readers can more immediately recognize what the lesson is about and whether it’s relevant to them.

  • Currently, the lesson leaves quite a lot unexplained. It could flesh out both the broader context and the specific methodological procedures it presents: the political and technical stakes of doing text analysis with different languages, and the core steps and methods of text analysis (e.g. digitization/OCR, preprocessing procedures, text analysis methods) and how the lesson fits into them (e.g. some preprocessing procedures are problematic when working with different languages; how do we address this?). Without this broader context and background information, the lesson is a little hard to follow, and there isn't an obvious methodological narrative to guide the reader: it's difficult to understand the rationale for each step (language recognition, POS tagging, lemmatization), how the steps fit together, and how they might be useful in a text analysis workflow. Once a clearer scope is defined, some restructuring and expansion could give the lesson more flow, laying out more fully the different steps covered and the rationale for each (that is, why these steps are useful to know about and how they fit into a text analysis workflow more generally).

  • Somewhat relatedly, there is also a question around the level of difficulty of the lesson — is this an introductory lesson aimed at beginners or is this an intermediate lesson that assumes some prior knowledge? Currently, it seems aimed at beginners, but background information and specialist concepts are not consistently explained in lay terms and complex procedures are not always broken down into simple steps with each step fully explained (e.g. the packages introduction mentions pre-trained models, pipelines, processing times, which might not necessarily be beginner-friendly terminology and could either be explained in more simple, lay terms or could provide links to external sources for more information). I’m assuming this lesson is aimed at beginners, so the suggestions I make below are for a beginner level lesson.

Here are some more specific suggestions for revisions for each section of the lesson to address the areas outlined above (when I mention paragraph numbers I’m referring to the lesson preview):

Lesson Goals section

  • As mentioned above, the “Lesson Goals” could be more specific. It seems to me that what’s particularly exciting and unique about the lesson is the focus on multilingual text (specifically, text that includes both Russian and French) and this lesson shows how you can perform two fundamental preprocessing steps that are widely used in text analysis (POS tagging and lemmatization) for multilingual text. Rather than simply presenting this lesson as an introduction to text analysis, this section could be clearer about the specific goals and methods the lesson covers so that readers immediately know what specific skills are covered and why.

  • Similarly, I might suggest tweaking the title of the lesson to be a bit more descriptive and specific. For example, it might be useful to name the specific tools used, and perhaps also follow the Bender rule and explicitly name the specific languages that are addressed in the lesson.

  • Don’t forget to flag up any prerequisites here (perhaps in a separate section of its own), i.e. give some indication of the level of difficulty and of what users need (in terms of skills/knowledge, tools/packages, and data) to follow this lesson. Cf. for example the “Preparation” section in this lesson or this lesson. It might also make sense to move all initial installation information to this section.

Basics of Text Analysis and Working with Non-English and Multilingual Text sections

I’d suggest restructuring these two sections — perhaps merging parts of both sections and potentially also breaking them down into subsections to expand on particular points and examples — in order to set up the focus of the lesson more clearly and get more quickly to the heart of the lesson, i.e. working with multilingual text. As I understand it, this section should be an introduction to text analysis and why text analysis is a useful skill (providing examples of how people have applied it and what it could be used for), but it should also introduce this in context of multilingual text analysis and issues of language diversity in text analysis/NLP so that the reader can understand the broader issues and how the methods presented in this lesson address these issues.

  • Paragraphs 3 and 4 of “Basics of Text Analysis” could be condensed and could also offer more concrete examples of projects and applications — you could point to other Programming Historian lessons (or provide examples of other projects) that illustrate the many applications of text analysis for further reference for readers.

  • Then I’d suggest adding more general context and discussion of linguistic bias in computational text analysis, and of the challenges and considerations to take into account when working with different languages or with multilingual texts specifically. You don’t have to cover everything in detail: you can sketch out the main points and link to further reading/resources for people who want more information. Providing this crucial background for readers unfamiliar with issues of language diversity in text analysis will strengthen the narrative flow of the lesson and clarify the lesson’s own stance in relation to these issues. The information currently in bullet points in “Working with Non-English and Multilingual Text” could be expanded into flowing prose, with some of it integrated into these contextual discussions of linguistic bias in NLP. The examples you provide (encoding, right-to-left scripts, logographic languages, etc.) could also be expanded further (potentially with further reading/resources, images and more specific examples) to illustrate more concretely the challenges that people might encounter when working with different languages.

  • I’d also suggest adding here a section that introduces key steps and concepts of text analysis relevant to the lesson (e.g. laying out how parts of speech tagging and lemmatization are fundamental steps in text analysis amongst others, but that these can be difficult to realize with multilingual texts because of issues outlined above). This can be an occasion to introduce and explain any specialist vocabulary or fundamental concepts, and make clear what specific methodologies are presented in this lesson and how they might fit into a broader text analysis workflow.
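
As a concrete illustration of what working across scripts can involve, here is a minimal, hypothetical sketch (not the lesson's code) that guesses the dominant script of a string from Unicode character names, using only the Python standard library. A real lesson would more likely use a dedicated language-identification package; the `dominant_script` function and its two-script table are illustrative assumptions only.

```python
import unicodedata

def dominant_script(text):
    """Guess the dominant script of `text` by inspecting Unicode character names."""
    counts = {"CYRILLIC": 0, "LATIN": 0}
    for char in text:
        if char.isalpha():
            # Unicode names encode the script, e.g. "CYRILLIC CAPITAL LETTER VE".
            name = unicodedata.name(char, "")
            for script in counts:
                if name.startswith(script):
                    counts[script] += 1
    return max(counts, key=counts.get)

print(dominant_script("Война и мир"))     # CYRILLIC
print(dominant_script("Guerre et paix"))  # LATIN
```

This only distinguishes two scripts and ignores mixed-script edge cases, but it makes tangible why script detection is a meaningful first step before applying language-specific tools.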

Tools We’ll Cover section

  • When discussing the different packages, perhaps try to consistently add links to documentation and information on the different languages these tools work with (perhaps also note whether these packages have documentation in languages other than English if possible). Also make sure to explain in beginner terms the general features of the packages and how that might be relevant to the user’s considerations.

  • It might also help with the structure and flow of the lesson to have a few introductory sentences to this section that link back to the previous discussion and clarify why we’re comparing these packages and what the payoff is of comparing these different packages (e.g. that different libraries exist for NLP, these are widely used NLP packages, they cover different languages, they might be more or less difficult to use etc.).

Sample Code and Exercises section

  • Add the original Russian title of War and Peace, with the English title in brackets after the Russian, in paragraph 7.

  • Since the sample data is only a few lines of text, might it be clearer to say “We will take a few lines of text from…” rather than “a corpus of text”?

  • For clarity, I’d suggest using the comments in code snippets strictly for explaining the code, and moving any more contextual comments out of the code blocks and into the main body of the lesson (e.g. in the code snippet at paragraph 9, the comment “we are using minimally preprocessed excerpt..” could be moved below the code snippet rather than kept inside it). Some of this more contextual information could also be more fully explained and linked back to discussions introduced in the previous sections (e.g. summarize the key points of the article you reference: what constitutes typical preprocessing steps, and how is this sometimes problematic in relation to questions of language diversity specifically?).

  • Consider splitting longer blocks of code where appropriate and, as mentioned above, moving code comments to markdown text where appropriate (e.g. code block at paragraph 24 could be split and unpacked further, similarly paragraph 28 could be split into two or three blocks, etc.).

  • Make sure that code comments are as descriptive and explicit as possible about what the code is actually doing. For example, rather than just “Russian only” in the code at paragraph 14, specify that this stores in the variable rus_sent the sentence at index 5 (the sixth sentence in the list). Similarly, when setting up spaCy, add clarifying comments breaking down the procedure: downloading the relevant model, loading it, and creating a spaCy document that contains rich linguistic information we can use for further analysis/processing (e.g. POS tagging); then link to the spaCy website and note that readers can choose the model relevant to their research. In general, keep in mind, as far as possible, how readers might want to generalize to their own projects, and provide pointers to help them do so where possible.

  • Similarly, make sure to clarify any specialist concepts and terminology by providing explanations in the lesson itself and/or by linking to external resources (e.g. tokenization, regex, etc).

  • Perhaps add more explanation for some of the outputs (e.g. what do the lemmatization outputs show specifically?). It might also be useful for readers to have some brief discussion, or at least a flagging, of limitations they might want to consider (e.g. how well is lemmatization performing for each language?), and perhaps pointers to how lemmatization or POS tagging can be used in further analysis (by linking to other Programming Historian lessons, for example).

  • Consider perhaps renaming or adding in titles for your sections to bring out more explicitly the methodological narrative you’re presenting here. E.g. “Identifying Languages” could be something like “How to automatically detect different languages and scripts”… It also seems that perhaps a section title could be added, after loading and tokenizing the text, to indicate the part of the lesson that demonstrates or compares how well the different packages detect languages and the limitations when working with multilingual text.

  • One small problem with the code at paragraph 30: the output shows only the Russian POS tags (you probably need to iterate over each of the processed docs?).

  • For the spaCy code at paragraph 32, I get the error 'Document' object is not iterable (perhaps check the naming of your variables: fre_nlp is defined, but nlp is then used for creating the doc…)?
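
A hedged sketch of how these two fixes might look. It uses a stand-in pipeline class so it runs without downloading spaCy models; `FakePipeline`, `FakeToken`, `rus_nlp`, and `fre_nlp` are hypothetical names for illustration, not the lesson's actual code:

```python
class FakeToken:
    """Stand-in for a spaCy Token, exposing .text and .pos_."""
    def __init__(self, text, pos):
        self.text, self.pos_ = text, pos

class FakePipeline:
    """Stand-in for a loaded spaCy pipeline (e.g. spacy.load('fr_core_news_sm'))."""
    def __init__(self, tagged):
        self.tagged = tagged
    def __call__(self, text):
        # A real pipeline would tokenize and tag `text`; this returns canned tags.
        return [FakeToken(t, p) for t, p in self.tagged]

rus_nlp = FakePipeline([("Война", "NOUN"), ("и", "CCONJ"), ("мир", "NOUN")])
fre_nlp = FakePipeline([("Guerre", "NOUN"), ("et", "CCONJ"), ("paix", "NOUN")])

# Create each doc with the *matching* pipeline variable (fre_nlp, not nlp).
rus_doc = rus_nlp("Война и мир")
fre_doc = fre_nlp("Guerre et paix")

# Iterate over both processed docs, so the output covers both languages
# rather than only the last one processed.
for language, doc in [("Russian", rus_doc), ("French", fre_doc)]:
    for token in doc:
        print(f"{language}: {token.text} -> {token.pos_}")
```

Keeping one clearly named pipeline variable per language makes it harder to accidentally process a French sentence with the wrong pipeline, and the final loop shows the POS tags for both documents.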

Otherwise the code runs smoothly!
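
To illustrate the kind of descriptive comment suggested above for the sentence-indexing step, here is a hypothetical sketch (naive full-stop splitting; the sample text and variable names are illustrative, not taken from the lesson):

```python
# Split the sample text into sentences (naively, on full stops, for illustration).
text = "One. Two. Three. Four. Five. Шестое предложение на русском."
sentences = [s.strip() for s in text.split(".") if s.strip()]

# Store in `rus_sent` the sentence at index 5 (the sixth sentence in the
# list), which in this sample is the Russian-language sentence.
rus_sent = sentences[5]
print(rus_sent)  # Шестое предложение на русском
```

A comment that spells out "index 5 is the sixth sentence, the Russian one" tells readers both what the line does and why, which makes the step easier to adapt to their own data.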

I hope this is helpful! Let me know if there’s anything you’d like to discuss or if you have any questions. Ideally, this first round of revisions would happen within 30 days so we can move swiftly on to the next phase, but let us know if there are any adjustments you need to make on the timeline.

Thanks again for this exciting contribution and looking forward to working on this with you!

Laura

@anisa-hawes
Contributor

What's happening now?

Hello Ian @ian-nai. Your lesson has been moved to the next phase of our workflow which is Phase 3: Revision 1.

This Phase is an opportunity for you to revise your draft in response to @lachapot's initial feedback.

I've sent you an invitation to join us as an Outside Collaborator here on GitHub. This will give you the Write access you'll need to edit your lesson directly.

We ask authors to work on their own files with direct commits: we prefer you don't fork our repo, or use the Pull Request system to edit in ph-submissions. You can make direct commits to your file here: /en/drafts/originals/non-english-and-multilingual-text-analysis.md. @charlottejmc and I can help if you encounter any practical problems!

When you and Laura @lachapot are both happy with the revised draft, we will move forward to Phase 4: Open Peer Review.

```mermaid
%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
    Section Phase 2 <br> Initial Edit
        Who worked on this? : Editor (@lachapot)
        All Phase 2 tasks completed? : Yes
    Section Phase 3 <br> Revision 1
        Who's working on this? : Author (@ian-nai)
        Expected completion date? : June 8
    Section Phase 4 <br> Open Peer Review
        Who's responsible? : Reviewers (TBC)
        Expected timeframe? : ~60 days after request is accepted
```

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.
