A basic parser to investigate natural-language posts from a Q&A site
The dataset to be parsed comes from https://hardwarerecs.stackexchange.com, a Q&A site for hardware recommendations. The data is in XML format.
Each line represents one post record, beginning with <row and ending with />
Each post has four attributes:
Id: unique identifier of the post
PostTypeId: 1 = question, 2 = answer, 3-8 = others
CreationDate: creation date and time of the post (yyyy-mm-ddThh:mm:ss)
Body: content of the post
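As an illustration of the record layout described above, here is a minimal sketch that reads one record into a dictionary with typed values. The function name parse_row and the sample record are hypothetical and are not taken from process-data.py; the sketch assumes each record is a well-formed, self-closing XML element on a single line.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def parse_row(line):
    """Parse a single post record into a dict (hypothetical helper)."""
    # ElementTree decodes XML character references in attribute values.
    row = ET.fromstring(line.strip())
    return {
        "Id": int(row.attrib["Id"]),
        "PostTypeId": int(row.attrib["PostTypeId"]),
        # Assumes the ISO-style timestamp format described above.
        "CreationDate": datetime.fromisoformat(row.attrib["CreationDate"]),
        "Body": row.attrib["Body"],
    }


# Made-up sample record for illustration only.
sample = (
    '<row Id="7" PostTypeId="2" '
    'CreationDate="2015-09-08T18:30:00" '
    'Body="&lt;p&gt;Hi&lt;/p&gt;" />'
)
post = parse_row(sample)
print(post["Id"], post["PostTypeId"], post["CreationDate"], post["Body"])
```

Note that the body still contains HTML tags at this point; only the XML-level escaping has been undone.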
Process Data
The first step is to clean up the XML data so that only the body of each post remains.
Line-break characters (line feed and carriage return) are replaced by a single space.
XML character references are converted back to the characters they represent. For example, "&amp;" to &, "&quot;" to ", "&apos;" to ', "&gt;" to >, "&lt;" to <
All HTML tags are also removed.
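The clean-up steps above could be sketched roughly as follows. This is not the code from process-data.py: the function name clean_body is hypothetical, it assumes line breaks may appear either as raw characters or as the references &#xA; and &#xD;, and it uses a simple regular expression for tag removal, which is only adequate for well-formed, non-nested markup.

```python
import html
import re

# Assumption: line breaks show up as raw characters or as numeric references.
LINE_BREAK_RE = re.compile(r"&#xA;|&#xD;|\r|\n", re.IGNORECASE)
# Simplistic tag matcher: anything between "<" and the next ">".
TAG_RE = re.compile(r"<[^>]+>")


def clean_body(raw_body):
    """Apply the three clean-up steps to one raw (still escaped) body."""
    # Step 1: replace line-break characters with a single space.
    text = LINE_BREAK_RE.sub(" ", raw_body)
    # Step 2: turn character references back into the original characters.
    text = html.unescape(text)
    # Step 3: remove HTML tags.
    text = TAG_RE.sub("", text)
    return text


raw = "&lt;p&gt;Need a &quot;quiet&quot; fan&lt;/p&gt;&#xA;"
print(repr(clean_body(raw)))
```

The order matters in this sketch: the tags only become visible as tags after the references are decoded, so tag removal runs last.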
process-data.py
Reads the input XML file, pre-processes each post to clean its body, and splits the posts into questions
and answers based on PostTypeId.
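A rough sketch of the splitting step is shown below. The function name split_posts and the sample lines are hypothetical; the actual script may read and write files differently. The sketch assumes one record per line and skips any line that is not a post record, such as the XML declaration or the enclosing root element.

```python
import xml.etree.ElementTree as ET


def split_posts(lines):
    """Return (questions, answers) as two lists of post bodies."""
    questions, answers = [], []
    for line in lines:
        line = line.strip()
        # Skip the XML declaration, the root element, and blank lines.
        if not line.startswith("<row"):
            continue
        row = ET.fromstring(line)
        body = row.attrib.get("Body", "")
        post_type = row.attrib.get("PostTypeId")
        if post_type == "1":
            questions.append(body)
        elif post_type == "2":
            answers.append(body)
        # Post types 3-8 are ignored.
    return questions, answers


# Made-up sample input for illustration only.
sample_lines = [
    '<?xml version="1.0" encoding="utf-8"?>',
    "<posts>",
    '  <row Id="1" PostTypeId="1" CreationDate="2015-09-08T18:30:00" Body="Which fan?" />',
    '  <row Id="2" PostTypeId="2" CreationDate="2015-09-08T19:00:00" Body="This one." />',
    '  <row Id="3" PostTypeId="4" CreationDate="2015-09-08T19:30:00" Body="Tag wiki" />',
    "</posts>",
]
questions, answers = split_posts(sample_lines)
print(len(questions), len(answers))
```

In practice the lines would come from iterating over the opened input file, and each body would be cleaned before being written out.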