simple-parser

A basic parser to investigate natural-language posts from a Q&A site

The dataset to be parsed comes from https://hardwarerecs.stackexchange.com, a Q&A site for hardware recommendations. The data is in XML format.

Each line represents the record of one post, beginning with and ending with />

Each post has four attributes:

Id: unique identifier of each post
PostTypeId: 1 = question, 2 = answer, 3-8 = others
CreationDate: creation date and time of the post (yyyy-mm-ddThh:mm:ss)
Body: content of the post
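As a rough illustration of the record layout, the sketch below parses one record with Python's standard xml.etree.ElementTree module. The sample line, the element name "row", and the helper parse_record are assumptions made for this example; they are not taken from process-data.py.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample record; the element name "row" is an assumption.
SAMPLE_LINE = (
    '<row Id="1" PostTypeId="1" '
    'CreationDate="2015-09-01T19:05:32" '
    'Body="&lt;p&gt;Which SSD should I buy?&lt;/p&gt;&#xA;" />'
)


def parse_record(line: str) -> dict:
    """Parse one record line into a dict of its four attributes."""
    element = ET.fromstring(line.strip())
    return {
        "Id": int(element.attrib["Id"]),
        "PostTypeId": int(element.attrib["PostTypeId"]),
        "CreationDate": datetime.fromisoformat(element.attrib["CreationDate"]),
        # The XML parser already decodes XML-level references here,
        # so Body still contains the HTML markup of the post.
        "Body": element.attrib["Body"],
    }


record = parse_record(SAMPLE_LINE)
print(record["Id"], record["PostTypeId"], record["CreationDate"].year)
```

Converting Id and PostTypeId to int and CreationDate to a datetime is a design choice of this sketch; keeping all four values as strings would work equally well for the clean-up described next.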

Process Data
The first step is to clean up the XML data so that only the body of each post remains.

Special characters ("&#xA", "&#xD") are each replaced by a single space.

XML character references are converted back to the characters they represent. For example, "&amp" to &, "&quot" to ", "&apos" to ', "&gt" to >, "&lt" to <

All HTML tags are also removed.
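The three clean-up steps above can be sketched as one small function. This is a minimal sketch and not the code in process-data.py: the name clean_body is hypothetical, the references are assumed to carry their trailing semicolon as in standard XML, and the regular expression is a simple approximation of tag removal that would also strip any other text enclosed in < and >.

```python
import re

# References that stand for line breaks; each becomes a single space.
WHITESPACE_REFS = ("&#xA;", "&#xD;")

# "&amp;" is handled last so that text such as "&amp;lt;" is not
# unescaped twice.
ENTITY_MAP = (
    ("&quot;", '"'),
    ("&apos;", "'"),
    ("&gt;", ">"),
    ("&lt;", "<"),
    ("&amp;", "&"),
)

# Approximate match for an HTML tag: "<", anything but ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_body(raw_body: str) -> str:
    """Return the raw Body attribute text as plain text (hypothetical helper)."""
    text = raw_body
    for ref in WHITESPACE_REFS:
        text = text.replace(ref, " ")
    for entity, char in ENTITY_MAP:
        text = text.replace(entity, char)
    return TAG_PATTERN.sub("", text)


raw = "&lt;p&gt;Need a &quot;quiet&quot; fan&lt;/p&gt;&#xA;"
cleaned = clean_body(raw)
print(repr(cleaned))
```

The input here is the raw attribute text as it appears in the file. If the records are read with an XML parser instead, the parser already decodes these references, and only the tag removal and line-break handling remain.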

process-data.py reads the input XML file, pre-processes it to clean each post body, and splits the posts into questions and answers based on PostTypeId.
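The split by post type might look roughly like the following. It is a sketch under stated assumptions rather than the actual script: split_posts is a hypothetical name, the element name "row" and the enclosing "posts" tag in the sample are assumed, and post types 3-8 are simply skipped here because the description only mentions questions and answers.

```python
import xml.etree.ElementTree as ET

QUESTION, ANSWER = "1", "2"  # PostTypeId values from the attribute list


def split_posts(lines):
    """Return (questions, answers) as lists of Body strings."""
    questions, answers = [], []
    for line in lines:
        try:
            element = ET.fromstring(line.strip())
        except ET.ParseError:
            # Skip lines that are not complete records, such as the
            # XML declaration or an enclosing wrapper tag.
            continue
        post_type = element.attrib.get("PostTypeId")
        body = element.attrib.get("Body", "")
        if post_type == QUESTION:
            questions.append(body)
        elif post_type == ANSWER:
            answers.append(body)
        # Other post types (3-8) are ignored in this sketch.
    return questions, answers


# Hypothetical sample input, one record per line.
sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    "<posts>",
    '<row Id="1" PostTypeId="1" Body="Which SSD?" />',
    '<row Id="2" PostTypeId="2" Body="Try a SATA model." />',
    '<row Id="3" PostTypeId="5" Body="Tag wiki text." />',
    "</posts>",
]
questions, answers = split_posts(sample)
print(len(questions), len(answers))
```

To process a real file, the same function can be given an open file object, since iterating over a file yields one line at a time.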
