04-stages-of-data-processedtransformed.md


Stages of Data: Raw   |   Side Note on Data Structures: Tidy Data


4. Stages of Data: Processed/Transformed

Processing puts data into a state that is easier to read and more readily available for analysis. For instance, it could be rendered as structured data, which can itself take many forms, such as a table. Here are a few formats you're likely to come across, all representing the same data:

XML

XML, or eXtensible Markup Language, uses a nested structure in which "tags" like <Cat> contain other tags, like <firstName>. This format is good for organizing a document as a tree, just like HTML, where we want to nest elements such as a sentence within a paragraph. XML does not carry any information about how it should be displayed, so it can be used in a variety of presentation scenarios.

<Cats>
    <Cat>
        <firstName>Smally</firstName>
        <lastName>McTiny</lastName>
    </Cat>
    <Cat>
        <firstName>Kitty</firstName>
        <lastName>Kitty</lastName>
    </Cat>
    <Cat>
        <firstName>Foots</firstName>
        <lastName>Smith</lastName>
    </Cat>
    <Cat>
        <firstName>Tiger</firstName>
        <lastName>Jaws</lastName>
    </Cat>
</Cats>

Screenshot of XML cats file

This file is viewed in an online XML Viewer. If you would like to, you can either copy the code chunk above to try it out in XML Viewer or download the XML file to try it out in other viewers. To save the file onto your local computer, right-click the Raw button (top right-hand corner of the data set) and click Save Link As....

For example, after downloading the file, can you try to open this file in your browser? (Psst! Try right clicking on cats.xml in your local directory and choosing Open with Other Application in the drop down menu to select the browser of your choice.)
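
Beyond viewing the file, the nested structure can also be read by a program. The following is a minimal sketch, assuming Python and its standard-library `xml.etree.ElementTree` module; the variable and function names (`CATS_XML`, `read_cats_xml`) are made up for illustration, and the XML is held in a string so the example runs without downloading anything.

```python
import xml.etree.ElementTree as ET

# The same XML shown above, kept in a string so no file is needed.
CATS_XML = """
<Cats>
    <Cat>
        <firstName>Smally</firstName>
        <lastName>McTiny</lastName>
    </Cat>
    <Cat>
        <firstName>Kitty</firstName>
        <lastName>Kitty</lastName>
    </Cat>
    <Cat>
        <firstName>Foots</firstName>
        <lastName>Smith</lastName>
    </Cat>
    <Cat>
        <firstName>Tiger</firstName>
        <lastName>Jaws</lastName>
    </Cat>
</Cats>
"""


def read_cats_xml(xml_text):
    """Return a list of (firstName, lastName) tuples, one per <Cat> tag."""
    root = ET.fromstring(xml_text.strip())  # root is the <Cats> element
    return [
        (cat.findtext("firstName"), cat.findtext("lastName"))
        for cat in root.findall("Cat")  # direct <Cat> children only
    ]


xml_cats = read_cats_xml(CATS_XML)
for first, last in xml_cats:
    print(first, last)
```

Each `<Cat>` tag becomes one tuple, which mirrors how the inner tags are nested inside the outer ones.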

JSON

JSON, or JavaScript Object Notation, also uses a nested structure, but with the addition of key/value pairs, like the "firstName" key, which is tied to the value "Smally" (at least for the first cat!). JSON is popular with web applications that save and send data between your browser and web servers, because its syntax comes from JavaScript, the main language of web browsers.

{
    "Cats": [
        {
            "firstName": "Smally",
            "lastName": "McTiny"
        },
        {
            "firstName": "Kitty",
            "lastName": "Kitty"
        },
        {
            "firstName": "Foots",
            "lastName": "Smith"
        },
        {
            "firstName": "Tiger",
            "lastName": "Jaws"
        }
    ]
} 

Screenshot of JSON cats file

This file is viewed in the Firefox browser from a local directory. To view it in your browser, you can drag and drop the local file onto an open tab or window. You can also download the JSON file and try opening it in other viewers (e.g. RStudio, or web viewers like Code Beautify's JSON Viewer). To save the file onto your local computer, right-click the Raw button (top right-hand corner of the data set) and click Save Link As....
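
The key/value pairs described above map directly onto dictionaries and lists in many programming languages. As a rough sketch, assuming Python and its standard-library `json` module (the names `CATS_JSON` and `read_cats_json` are invented for this example), the same data can be loaded and queried like this:

```python
import json

# The same JSON shown above, kept in a string so no file is needed.
CATS_JSON = """
{
    "Cats": [
        {"firstName": "Smally", "lastName": "McTiny"},
        {"firstName": "Kitty", "lastName": "Kitty"},
        {"firstName": "Foots", "lastName": "Smith"},
        {"firstName": "Tiger", "lastName": "Jaws"}
    ]
}
"""


def read_cats_json(json_text):
    """Parse the JSON text and return the list stored under the "Cats" key."""
    data = json.loads(json_text)  # the outer object becomes a dict
    return data["Cats"]           # the array becomes a list of dicts


json_cats = read_cats_json(CATS_JSON)

# Look up values by key, just as described in the text.
first_names = [cat["firstName"] for cat in json_cats]
print(first_names)  # ['Smally', 'Kitty', 'Foots', 'Tiger']
```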

CSV

CSV, or Comma-Separated Values, uses (you guessed it!) commas to separate values. The first line (First Name,Last Name) names the columns; each line after it is a new "record," and each value within a line (separated by a comma) is a new "field." This data format stores tabular data in a clean way that facilitates transfer between different data architectures. As data formats go, it is very rudimentary (even predating personal computers!) and is easy to type, without needing special characters beyond a comma.

First Name,Last Name
Smally,McTiny
Kitty,Kitty
Foots,Smith
Tiger,Jaws

Screenshot of CSV cats file

This file is viewed in VS Code with the Excel Viewer extension. To view it in VS Code, install the extension, open the .csv file, and then right-click the file and click Open Preview. You can also download the CSV file to open it in other viewers (e.g. Microsoft Excel, Notepad). To save the file onto your local computer, right-click the Raw button (top right-hand corner of the data set) and click Save Link As....
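
To see records and fields from a program's point of view, here is a small sketch, assuming Python and its standard-library `csv` and `json` modules. The helper names (`read_cats_csv`, `records_to_json`) are hypothetical. The second helper rebuilds the JSON layout from earlier in this section, to illustrate that the formats hold the same data.

```python
import csv
import io
import json

# The same CSV shown above, kept in a string so no file is needed.
CATS_CSV = """First Name,Last Name
Smally,McTiny
Kitty,Kitty
Foots,Smith
Tiger,Jaws
"""


def read_cats_csv(csv_text):
    """Return one dict per record, keyed by the names in the header line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)


def records_to_json(records):
    """Re-express the CSV records in the JSON layout used earlier."""
    cats = [
        {"firstName": row["First Name"], "lastName": row["Last Name"]}
        for row in records
    ]
    return json.dumps({"Cats": cats}, indent=4)


records = read_cats_csv(CATS_CSV)
print(records[0]["First Name"])  # Smally

as_json = records_to_json(records)
print(as_json)
```

The header line supplies the keys, so it is consumed by the reader rather than returned as a record.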

The Importance of Using Open Data Formats

A small detour to discuss data formats. Open data formats are usually available to anyone free of charge and allow for easy reuse. Proprietary formats are often covered by copyrights, patents, or other restrictions, and depend on (expensive) licensed software. If the licensed software ceases to support its proprietary format, or the format becomes obsolete, you may be stuck with a file that cannot easily be opened or (re)used (e.g. .mac). For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration:

  1. Open this file in a text editor (e.g. Visual Studio Code, TextEdit (macOS), Notepad (Windows)), and then in an app like Excel. This is a CSV, an open, text-only file format. To save the file onto your local computer, right-click cats.csv and click Save Link As (it's the same cats.csv from above!).
  2. Now do the same with this Excel file. Unlike the previous, this is a proprietary format!

Sustainable formats are generally unencrypted, uncompressed, and follow an open standard.
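
The difference in the demonstration above can also be checked programmatically. The sketch below, in Python, rests on one stated assumption: the .xlsx files written by recent versions of Excel are ZIP containers holding XML parts, which is why a text editor shows unreadable bytes. Because no real workbook is available here, the code builds a stand-in ZIP file that is not a valid spreadsheet; `describe_file` and the file names are invented for illustration.

```python
import tempfile
import zipfile
from pathlib import Path

# First four bytes of a non-empty ZIP archive.
ZIP_SIGNATURE = b"PK\x03\x04"


def describe_file(path):
    """Classify a file as 'zip container', 'plain text', or 'other binary'."""
    raw = Path(path).read_bytes()
    if raw.startswith(ZIP_SIGNATURE):
        return "zip container"
    try:
        raw.decode("utf-8")  # plain text decodes cleanly
    except UnicodeDecodeError:
        return "other binary"
    return "plain text"


with tempfile.TemporaryDirectory() as tmp:
    # A small CSV, written as ordinary text.
    csv_path = Path(tmp) / "cats.csv"
    csv_path.write_text("First Name,Last Name\nSmally,McTiny\n", encoding="utf-8")

    # Stand-in for a spreadsheet file: a ZIP archive with one XML member.
    # This is NOT a valid workbook; it only mimics the container layer.
    stand_in_path = Path(tmp) / "cats_stand_in.xlsx"
    with zipfile.ZipFile(stand_in_path, "w") as archive:
        archive.writestr("sheet.xml", "<sheet/>")

    csv_kind = describe_file(csv_path)
    xlsx_kind = describe_file(stand_in_path)

print(csv_kind)   # plain text
print(xlsx_kind)  # zip container
```

Pointing `describe_file` at the cats.csv and Excel files you downloaded would be the natural next step.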

A small list of open formats (more information on each file format is linked in its entry):

| Types of multimedia | Examples | Common file extensions |
| --- | --- | --- |
| Images | TIFF (Tagged Image File Format) | `.tiff`, `.tif` |
| | JPEG2000 | `.jp2`, `.jpf`, `.jpx` |
| | PNG (Portable Network Graphics) | `.png` |
| Text | ASCII (American Standard Code for Information Interchange) | `.ascii`, `.dat`, `.txt` |
| | PDF (Portable Document Format) | `.pdf` |
| | CSV (Comma-Separated Values) | `.csv` |
| Audio | FLAC (Free Lossless Audio Codec) | `.flac` |
| | Ogg | `.ogg` |
| Video | MPEG-4 | `.mp4` |
| Others | XML (Extensible Markup Language) | `.xml` |
| | JSON (JavaScript Object Notation) | `.json` |
| | STL (STereoLithography file format, used in 3D modeling) | `.stl` |

For a fuller list of file formats, consult the Library of Congress's Sustainability of Digital Formats.

Evaluation

Structured data can be:

  • an XML list.*
  • an Excel table.*
  • an email chain.
  • a collection of text files.

We may choose to store our data in open data formats because they:

  • are sustainable.
  • allow for easy reusability.
  • are free-of-charge to use.
  • All of the above.*

Challenge: Processed/Transformed

  1. How do you decide the formats to store your data when you transition from 'raw' to 'processed/transformed' data? What are some of your considerations?
  2. Explore the moSmall.csv dataset. What questions might you ask with this dataset? What columns (variables) will you keep?
  3. If you are saving the file moSmall.csv in a proprietary spreadsheet application like Microsoft Excel (Windows/macOS) or Numbers (macOS), you may be prompted to save the file as .xlsx or .numbers. What format would you choose to save it in? Why would you choose to do so?

Solution:

  1. I usually go with the conventions of the field as it allows me to share my "in progress" work easily with my research lab and collaborators. The file conventions can range from .csv to .json.
  2. I will keep columns (variables) relevant to my question, such as the Artist Gender, Is Public Domain and Rights and Reproduction columns. I will also keep some of the descriptive columns such as Object ID and Artist Role to help contextualize the results (e.g. what kind of roles do female artists tend to take on?)
  3. I will choose to keep it in a .csv file type, as it can be opened by more programs, and if Microsoft stops supporting .xlsx file types I may no longer be able to open the dataset. Alternatively, I will choose to switch to a .xlsx format, as it is easier to use in a graphical user interface like Microsoft Excel. Any stylistic changes I've made to the file, such as highlighting alternating rows for readability or bolding column headings, will remain as well.
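
Keeping only the columns named in solution 2 can be sketched in a few lines. The example below assumes Python and its standard-library `csv` module. The column names come from the solution above, but the rows in `SAMPLE_CSV` and the `Extra Column` field are placeholders invented for illustration, not values from moSmall.csv. `keep_columns` is a hypothetical helper.

```python
import csv
import io

# Placeholder rows for illustration only; these are NOT real dataset values.
SAMPLE_CSV = """Object ID,Artist Role,Artist Gender,Is Public Domain,Rights and Reproduction,Extra Column
1,Maker,Female,True,,x
2,Artist,,False,placeholder rights text,y
"""

# Columns named in the solution above.
KEEP = [
    "Object ID",
    "Artist Role",
    "Artist Gender",
    "Is Public Domain",
    "Rights and Reproduction",
]


def keep_columns(csv_text, columns):
    """Return CSV text containing only the requested columns, in that order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [name for name in columns if name not in reader.fieldnames]
    if missing:
        raise KeyError(f"columns not found in header: {missing}")

    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop columns we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


trimmed = keep_columns(SAMPLE_CSV, KEEP)
print(trimmed)
```

Writing the result back out as .csv keeps the processed data in an open format, in line with the first option in solution 3.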

Keywords

Do you remember the glossary terms from this section?

