sraj50/simple-parser


simple-parser

A basic parser to investigate natural-language posts from a Q&A site

The dataset to be parsed comes from https://hardwarerecs.stackexchange.com, a Q&A site for hardware recommendations. The data is in XML format.

Each line represents one post record, written as a single self-closing XML element that ends with />

Each post has four attributes:

Id: unique identifier of each post
PostTypeId: 1 = question, 2 = answer, 3-8 = others
CreationDate: creation date and time of the post (yyyy-mm-ddThh:mm:ss)
Body: content of the post
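As an illustration of the record layout, the sketch below reads the four attributes from one line using Python's standard xml.etree.ElementTree module. The element name row and the sample values are assumptions made up for this example, and parse_post is a hypothetical helper, not a function from process-data.py. Note that ElementTree decodes XML character references while parsing, so the Body it returns already contains literal HTML tags.

```python
import xml.etree.ElementTree as ET

# Made-up sample record; the element name "row" is an assumption.
SAMPLE_LINE = (
    '<row Id="1" PostTypeId="1" '
    'CreationDate="2015-09-08T18:31:25" '
    'Body="&lt;p&gt;Which SSD?&lt;/p&gt;" />'
)

def parse_post(line: str) -> dict:
    """Return the four attributes of one post record as a dict."""
    element = ET.fromstring(line.strip())
    return {
        "Id": element.get("Id"),
        "PostTypeId": element.get("PostTypeId"),
        "CreationDate": element.get("CreationDate"),
        "Body": element.get("Body"),
    }

post = parse_post(SAMPLE_LINE)
print(post["PostTypeId"])  # "1" means the post is a question
print(post["Body"])        # the body still contains HTML tags at this point
```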


Process Data
The first step is to clean up the XML data so that only the body of each post remains.

Special characters ("&#xA", "&#xD"), which encode a line feed and a carriage return, are each replaced by a single space.

XML character references are converted back to the characters they represent. For example, "&amp" to &, "&quot" to ", "&apos" to ', "&gt" to >, "&lt" to <

All HTML tags are also removed.
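The three clean-up steps above can be sketched roughly as follows. This is a minimal illustration, not the code in process-data.py: clean_body is a hypothetical name, the sketch assumes the references carry their trailing semicolon as in standard XML, and it uses html.unescape from the standard library in place of a hand-written replacement table. The tag pattern is deliberately simple and would also delete ordinary text that happens to sit between a literal < and >.

```python
import html
import re

# Naive pattern for anything that looks like an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")

def clean_body(raw: str) -> str:
    """Clean one raw Body value taken directly from the XML text."""
    # Step 1: encoded line feed / carriage return become a single space.
    text = raw.replace("&#xA;", " ").replace("&#xD;", " ")
    # Step 2: character references such as &amp; or &lt; become real characters.
    text = html.unescape(text)
    # Step 3: drop the HTML tags that step 2 exposed.
    return TAG_RE.sub("", text)

raw = "&lt;p&gt;Tom &amp; Jerry&#xA;say &quot;hi&quot;&lt;/p&gt;"
print(clean_body(raw))  # Tom & Jerry say "hi"
```

The order matters in this sketch: the tags only become visible as < and > after the references are converted, so tag removal has to come last.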

process-data.py reads the input XML file, pre-processes it to clean the body of each post, and splits the posts into questions and answers based on PostTypeId.
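One way the question/answer split could look is sketched below. It is an assumption about the approach, not a copy of process-data.py: split_posts is a hypothetical helper that inspects the PostTypeId attribute of each raw line with a regular expression and ignores lines without one (for example an XML declaration). The sample lines are made up.

```python
import re
from typing import Iterable, List, Tuple

POST_TYPE_RE = re.compile(r'PostTypeId="(\d+)"')

def split_posts(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (questions, answers); other post types are dropped."""
    questions: List[str] = []
    answers: List[str] = []
    for line in lines:
        match = POST_TYPE_RE.search(line)
        if match is None:
            continue  # not a post record
        if match.group(1) == "1":
            questions.append(line)
        elif match.group(1) == "2":
            answers.append(line)
    return questions, answers

# Made-up input lines for illustration only.
sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<row Id="1" PostTypeId="1" Body="Which SSD?" />',
    '<row Id="2" PostTypeId="2" Body="Any SATA model." />',
    '<row Id="3" PostTypeId="5" Body="Tag wiki text." />',
]
questions, answers = split_posts(sample)
print(len(questions), len(answers))  # 1 1
```

In practice the same function could be fed an open file object, since iterating over a file yields one line at a time.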
