CornellNLP/wiki-talk-parser


Wikipedia-Talk-Parser

Description

A general-purpose parser for Wikipedia Talk pages that extracts comments together with their metadata and can be tuned through configuration options

  • It backtracks through the entire conversation (with metadata) leading up to the comment the user wants to reply to
  • Comment metadata format (if available)
author: string
id: string
level: int
replies: array
timestamp: date
type: string
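
For illustration, a single comment's metadata might look like the object below. This is only a sketch: the field names and types come from the list above, but every value is invented, and the helper hasMetadataShape is hypothetical rather than part of the parser. Since fields are only present "if available", real output may omit some of them.

```javascript
// Hypothetical metadata object; all values are made up for illustration.
const exampleMetadata = {
  author: "ExampleUser",
  id: "example-comment-1",
  level: 1,
  replies: [],
  timestamp: new Date("2020-01-05T12:34:00Z"),
  type: "comment",
};

// Hypothetical helper: checks that an object carries every listed field
// with the listed type. It is not part of the parser.
function hasMetadataShape(m) {
  return (
    typeof m.author === "string" &&
    typeof m.id === "string" &&
    Number.isInteger(m.level) &&
    Array.isArray(m.replies) &&
    m.timestamp instanceof Date &&
    typeof m.type === "string"
  );
}
```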

Parser Configurations (in the source code)

"include_title":false,         //include section title of a talk
"bind_comments_to_users":true, //all comments are associated with inferred users using timestamp
"remove_comment_info":true,    //remove comment username and timestamp
"debug": false,                //troubleshooting only
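
The options above are set directly in the source code. If you use the parser as a library, one possible way to manage them is to keep the defaults in one object and layer overrides on top. The sketch below assumes this pattern; DEFAULT_CONFIG and withOverrides are hypothetical names, and only the option names and default values are taken from this README.

```javascript
// Default values copied from the README; the object name is hypothetical.
const DEFAULT_CONFIG = {
  include_title: false,         // include section title of a talk
  bind_comments_to_users: true, // associate comments with inferred users
  remove_comment_info: true,    // remove comment username and timestamp
  debug: false,                 // troubleshooting only
};

// Hypothetical helper: returns a new config with overrides applied,
// rejecting option names that are not listed in the defaults.
function withOverrides(overrides = {}) {
  for (const key of Object.keys(overrides)) {
    if (!(key in DEFAULT_CONFIG)) {
      throw new Error(`Unknown option: ${key}`);
    }
  }
  return { ...DEFAULT_CONFIG, ...overrides };
}
```

Returning a fresh object leaves the defaults untouched, so several differently configured parser runs can coexist.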

Getting Started

  1. Install the Tampermonkey extension for your browser

  2. Install Userscript.user.js in Tampermonkey

  3. Go to any Wikipedia talk page, open the browser's developer tools, and switch to the Console tab. The console shows the parsed results whenever a reply button is clicked on the talk page

  4. To build on the parser, add code inside the userscript that processes the parsed results


Alternatively, you can use it as a library (see the example files in the lib folder)


Parser Algorithm

  • It starts by checking if the comment is at the root level, since the comment DOM layout at the root level is different from other levels
  • It creates deep copies of original nodes to process them without side effects
  • Starting from the selected comment, it backtracks through the preceding comments until the root level is reached
    • All comments preceding it at the same level are traversed before jumping back towards the root level
    • Comment texts (with or without the comment info) and metadata of all traversed comments are saved
  • After reaching the root level, it traverses backwards until the section title is reached
    • Comment texts (with or without the comment info) and metadata of all traversed comments are saved
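
The backtracking steps above can be sketched on a simplified model. This is not the parser's actual code, which works on DOM nodes: here a section is assumed to be a flat array of comments in page order, each with a level (0 for the root level) and a text. The names backtrackConversation, section, and selectedIndex are hypothetical.

```javascript
// Sketch of the backtracking idea on a flat, page-ordered comment list.
// Assumption: level 0 is the root level and replies have higher levels.
function backtrackConversation(section, selectedIndex, config = { include_title: false }) {
  const comments = section.comments;
  if (selectedIndex < 0 || selectedIndex >= comments.length) {
    throw new RangeError("selectedIndex is outside the section");
  }
  // Work on copies so the input is never modified.
  const stack = [{ ...comments[selectedIndex] }];
  let level = comments[selectedIndex].level;
  for (let i = selectedIndex - 1; i >= 0; i--) {
    const c = comments[i];
    // Deeper comments are replies nested under an earlier sibling: skip them.
    if (c.level > level) continue;
    // Same level (a sibling) or a shallower level (a step towards the root).
    level = c.level;
    stack.unshift({ ...c });
  }
  const texts = stack.map((c) => c.text);
  if (config.include_title) texts.unshift(section.title);
  return texts;
}

const section = {
  title: "Example section",
  comments: [
    { level: 0, text: "Root post" },
    { level: 1, text: "Reply 1" },
    { level: 2, text: "Nested under reply 1" },
    { level: 1, text: "Reply 2" },
    { level: 2, text: "Reply to reply 2" },
    { level: 0, text: "Second root" },
  ],
};

// Replying to "Reply to reply 2" (index 4):
console.log(backtrackConversation(section, 4));
// ["Root post", "Reply 1", "Reply 2", "Reply to reply 2"]
```

"Nested under reply 1" is left out because it belongs to a different branch, which mirrors what remove_nested_comments is described as doing for non-primary comments.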

Helper Functions

remove_nested_comments(node, metadata): removes all non-primary (nested) comments from node and adds the metadata of the first comment to metadata

prepend_if_valid(stack, node): adds node to the front of the stack if it passes validation

  • Configurations available
  • Validation checking
  • Catches some edge cases
  • Calls the helper function preprocess(node)

preprocess(node): removes noise from node

  • Configurations available
  • Catches some edge cases
  • Permits clean usage of node.innerText
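
To give a rough idea of what removing comment info (the remove_comment_info option) can involve, the sketch below strips a trailing signature from plain text. It is an assumption-laden illustration, not the parser's preprocess function: it works on strings rather than DOM nodes, it only handles the default English signature form "Username (talk) HH:MM, D Month YYYY (UTC)", and the name stripCommentInfo is hypothetical.

```javascript
// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: removes a trailing default-style signature from text.
// Custom signatures and non-English timestamps are not handled.
function stripCommentInfo(text, author) {
  // Trailing timestamp such as "12:34, 5 January 2020 (UTC)".
  const timestamp = /\s*\d{1,2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\)\s*$/;
  let out = text.replace(timestamp, "");
  if (author) {
    // Trailing "Username (talk)" left over once the timestamp is gone.
    const signature = new RegExp(`\\s*${escapeRegExp(author)} \\(talk\\)$`);
    out = out.replace(signature, "");
  }
  return out;
}
```

Text without a recognizable signature is returned unchanged, so the helper is safe to apply to every comment.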
