6 Lesson 06

6.1 Understanding Structured Data

Doing Digital Humanities practically always means working with structured data of some kind. In the most general terms, structured data means data with some explicit annotation or classification that the machine can understand and therefore use effectively. When we see the word “Vienna”, we are likely to assume automatically that this is the name of the capital of Austria. The machine cannot know that unless there is something else in the data that allows it to figure it out (here, an XML tag): <settlement country="Austria" type="capital city">Vienna</settlement>. From this annotation (and its attributes) the machine can be instructed to interpret the string Vienna as a settlement of the type capital city in the country of Austria. It is important to understand the most common data formats in order to be able to create and generate them, as well as to convert between different formats.

When we decide which format we want to work with, we need to consider the following: the ease of working with a given format (manual editing); suitability for specific analytical software; human-friendliness and readability; open vs. proprietary. In general, it does not make much sense to engage in format wars (arguing that one format is inherently better than another); one should rather develop an understanding that almost every format has its use and value in specific contexts or for specific tasks. Likewise, we do not want to stick to a single format and try to do everything with it and only it; rather, we want to be able to write scripts with which we can generate data in suitable formats or convert our data from one format into another.

Let’s take a look at a simple example in some of the most common formats.

6.1.2 CSV/TSV (Comma-Separated Values / Tab-Separated Values)

to,from,heading,body
Tove,Jani,Reminder,Don’t forget me this weekend!

6.1.4 YML or YAML (originally “Yet Another Markup Language”, later reinterpreted as “YAML Ain’t Markup Language”)
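For comparison, the same record as in the CSV example above could be written in YAML roughly like this (a sketch; YAML relies on key: value pairs instead of a header row):

to: Tove
from: Jani
heading: Reminder
body: Don’t forget me this weekend!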

6.2 Larger Examples

NB: the data example below is taken from here.

There are some online converters that can help you to convert one format into another. For example: http://www.convertcsv.com/.

6.2.1 CSV / TSV

city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
New York,4.8%,40.7127837,-74.0059413,8405837,1,New York
Los Angeles,4.8%,34.0522342,-118.2436849,3884307,2,California
Chicago,-6.1%,41.8781136,-87.6297982,2718782,3,Illinois

TSV is often a better option than CSV, since TAB characters are very unlikely to appear inside the values themselves (unlike commas, which are common in running text).

Neither TSV nor CSV is good for preserving new line characters (\n), or, in other words, text split into multiple lines. As a workaround, one can convert \n into some unlikely-to-occur character combination (for example, ;;;), which would allow you to restore \n later, if necessary.
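A minimal sketch of this workaround in Python (;;; is just the example combination mentioned above; any sequence that cannot occur in your data will do):

# before writing a value into a CSV/TSV cell: hide the line breaks
cell = "first line\nsecond line"
stored = cell.replace("\n", ";;;")    # "first line;;;second line"

# after reading the file back: restore the original line breaks
restored = stored.replace(";;;", "\n")
print(restored == cell)               # prints: True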

6.2.3 YML/YAML

YAML is often used to store just a single set of parameters, for example the configuration settings of a script.
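For instance, the configuration file z_config.yml used by the conversion script later in this lesson needs little more than a path to the bibliography file; a minimal sketch (the key bib_all is the one the script below expects, while the actual path value here is only an assumption):

# z_config.yml: a single set of parameters for the conversion scripts
bib_all: zotero_biblatex_sample.bib   # path to the bibTeX file that we want to convert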

But it can also be used for the storage of serialized data. It combines advantages of both JSON and CSV: the overall simplicity of the format (no tricky syntax) is similar to that of CSV/TSV, but it is more readable than CSV/TSV in any text editor, and it is more difficult to break, again due to the simplicity of the format.
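To illustrate, the city data from the CSV example above could be serialized in YAML roughly like this (only one record is shown; the layout is one common way of writing it):

New York:
  growth_from_2000_to_2013: 4.8%
  latitude: 40.7127837
  longitude: -74.0059413
  population: 8405837
  rank: 1
  state: New York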

YAML files can be read with Python into dictionaries, for example like so (a minimal sketch; fileNameYml is a placeholder for your own file name):
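import yaml

fileNameYml = "z_config.yml"  # placeholder: any YAML file will do

# safe_load is the form that current versions of PyYAML recommend;
# plain yaml.load(...) now requires an explicit Loader argument
with open(fileNameYml, "r", encoding="utf8") as f1:
    dictionary = yaml.safe_load(f1)

print(dictionary)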

You will most likely need to install the yaml library (the package is called pyyaml); it is also quite easy to write a script that would read such serialized data.

Note on installing libraries for Python. In general, it should be as easy as running the following command in your command line tool:

pip install --upgrade libraryName
  • pip is the standard package installer for Python; if you are running version 3.xx of Python, it may be pip3 instead of pip. If you have Anaconda installed, you can also use the Anaconda interface to install packages;
  • install is the command to install a package that you need;
  • --upgrade is an optional argument that you need only when you upgrade an already installed package;
  • libraryName is the name of the library that you want to install.

This should work just fine, but sometimes it does not, usually when you have multiple versions of Python installed and they start conflicting with each other (another good reason to handle your Python installations via Anaconda). There is, luckily, a workaround that seems to do the trick. You can modify your command in the following manner:

python -m pip install --upgrade libraryName
  • python here is whatever alias you are using for running Python (e.g., in my case it is python3, so the full command will look like: python3 -m pip install --upgrade libraryName)

6.3 In-Class Practice

Let’s try to convert this BibTeX file into other formats. Before we begin, however, let’s break this task down into smaller tasks and organize them together in some form of pseudocode.

  • Which of the above-discussed formats would be most suitable? Why, or why not?

6.4 Homework

  • Take your bibliography in BibTeX format and convert it into: csv/tsv, json, and yaml;
    • Hint: you should load your data into a dictionary;
    • Additionally, you might want to create (manually) a dictionary of BibTeX field names: some fields are named differently even though they contain the same type of information. You want to identify those fields and unify them for your output format, which will improve the quality of your data (a process usually called normalization). Hint: in order to figure out how to identify those fields, you may want to look into the Word Frequency program in Chapter 11. You can use this approach to identify all fields and count their frequencies, which will help you to determine which fields to keep and which to normalize (i.e., merge low-frequency fields into high-frequency fields). A sketch of such a mapping is given after this list.
  • Upload your results together with your scripts to your homework GitHub repository.
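A minimal sketch of such a normalization dictionary (the field names and groupings below are only an illustration; your own frequency counts will tell you which fields actually occur in your bibliography and how to merge them):

# hypothetical mapping: variant bibTeX field names on the left, preferred names on the right
fieldMap = {
    "author": "author",
    "title": "title",
    "date": "date",
    "year": "date",             # "year" carries the same information as "date"
    "journal": "journal",
    "journaltitle": "journal",  # biblatex name for the same field
}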

Python

  • Work through Chapters 8 and 11 of Zelle’s book: read the chapters carefully, go over the chapter summaries and exercises, and complete the following programming exercises: 1-8 in Chapter 8 and 1-11 in Chapter 11;
  • Watch Dr. Vierthaler’s videos:
    • Episode 12: Functions
    • Episode 13: Libraries and NLTK
    • Episode 14: Regular Expressions
  • Note: the sequence of topics is somewhat different in Zelle’s textbook and in Vierthaler’s videos. I would recommend that you always check Vierthaler’s videos and watch those which cover the topics that you read about in Zelle’s book.

Webscraping (optional)

Submitting homework:

  • Homework assignment must be submitted by the beginning of the next class;
  • Now that you know how to use GitHub, you will be submitting your homework by pushing it to GitHub:
    • Create a relevant subfolder in your HW070172 repository and place your HW files there; push them to your GitHub account;
      • Email me the link to your repository with a short message (Something like: I have completed homework for Lesson 3, which is uploaded to my repository … in subfolder L03)
      • In the subject of your email, please, add the following: CCXXXXX-LXX-HW-YourLastName-YourMatriculationNumber, where CCXXXXX is the numeric code of the course; LXX is the lesson for which the homework is being submitted; YourLastName is your last name, and YourMatriculationNumber is your matriculation number.

6.5 Homework Solution

Before we proceed, let’s make sure that you have the same folder structure on your machine. This will help to ensure that we do not run into other issues and can focus on solving one problem at a time. (Please download these files for the L06_Conversion folder: unzip them and move them all into that folder.) The structure should be as follows:

.
├── MEMEX_SANDBOX
│   └── data
├── L06_Conversion
│   ├── comments.md
│   ├── pseudocode.md
│   ├── z_1_preliminary.py
│   ├── z_2_conversion_simple.py
│   ├── z_config.yml
│   └── zotero_biblatex_sample.bib
└── L07_Memex_Step1
    ├── ... your scripts ...
    └── ... your scripts ...

NB: ./MEMEX_SANDBOX/data/ is the target folder for all other assignments to follow. This is where we will be creating our memex.

6.5.1 Pseudocode solution

  1. look at the file, i.e. check the file in order to understand its structure
  2. create a holder for our data, which will be a dictionary (or a list, etc.)
  3. read as one big string
  4. split into records using \n@
    • we will get a list of strings
  5. loop through all the records; NB: each record is a string that needs to be converted into something else.
    • we need to split each record using ,\n
    • now we loop through the list of “key-value” pairs
    • process the type{citationKey element:
      • grab list element with index 0 (citationkey = record[0])
      • split the element on {
        • recordType = element[0]
        • citationKey = element[1]
    • add a record into our dictionary using citationKey as its key
    • add recordType into the newly created record
    • process the rest of the record:
      • loop through the record, starting with 1:
        • for r in record[1:]:
      • split every element on =
        • key = element[0].strip()
        • value = element[1].strip()
        • add our key-value pair into the dictionary
  6. Save dictionary into CSV, JSON, YAML

6.5.2 Scripts

Script 1: analyzing BibTeX data (z_1_preliminary.py)

This script will create the file bibtex_analysis.txt, which will be a frequency list of keys from all BibTeX records. We then want to convert this frequency list into a YML file which we can load with the yaml library (make sure to install it!). Loading yml data into a Python dictionary is as easy as: dictionary = yaml.safe_load(open(fileNameYml)). (Older tutorials often show yaml.load(open(fileNameYml)); current versions of PyYAML no longer accept that form without an explicit Loader argument, which is why safe_load is used here.)
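The full z_1_preliminary.py is not reproduced here, but a minimal sketch of this kind of frequency analysis could look as follows (it assumes the sample file zotero_biblatex_sample.bib from the folder structure above and a simple key<TAB>count output format; the actual script may differ in details):

# count how often each field name (author, title, date, etc.) occurs in the bibTeX file
counts = {}
with open("zotero_biblatex_sample.bib", "r", encoding="utf8") as f1:
    for line in f1:
        if "=" in line:
            key = line.split("=", 1)[0].strip()
            counts[key] = counts.get(key, 0) + 1

# save a frequency list: most frequent keys first
with open("bibtex_analysis.txt", "w", encoding="utf8") as f2:
    for key, freq in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        f2.write("%s\t%d\n" % (key, freq))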

You can convert the frequency list into a proper yml file using regular expressions:
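For example, assuming each line of bibtex_analysis.txt has the form key<TAB>count (as in the sketch above), a short snippet along these lines turns it into a key: key mapping, which you can then adjust by hand (merging variant fields) and save as zotero_biblatex_keys.yml:

import re

with open("bibtex_analysis.txt", "r", encoding="utf8") as f1:
    data = f1.read()

# "author<TAB>245" becomes "author: author"; the right-hand side is what you edit manually
dataYml = re.sub(r"(\w+)\t\d+", r"\1: \1", data)

with open("zotero_biblatex_keys.yml", "w", encoding="utf8") as f2:
    f2.write(dataYml)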

Script 2: loading BibTeX data and converting it to other formats (z_2_conversion_simple.py)


import re
import yaml

"""
1. load bibtex file
    - bibliography should be curated in Zotero (one can program cleaning procedures into the script, but this is not as reliable);
    - loading bibtex data, keep only those records that have PDFs;
    - some processing might be necessary (like picking one file out of two and more)
2. convert into other formats
    - csv
    - json
    - yml
"""

###########################################################
# VARIABLES ###############################################
###########################################################

settingsFile = "z_config.yml"
# NB: newer versions of PyYAML require a Loader argument for yaml.load();
# yaml.safe_load() is the simplest way to avoid that issue
settings = yaml.safe_load(open(settingsFile))
bibKeys = yaml.safe_load(open("zotero_biblatex_keys.yml"))

###########################################################
# FUNCTIONS ###############################################
###########################################################

# load bibTeX data into a dictionary
def bibLoad(bibTexFile):

    bibDic = {}

    with open(bibTexFile, "r", encoding="utf8") as f1:
        records = f1.read().split("\n@")

        for record in records[1:]:
            # let's process ONLY those records that have PDFs
            if ".pdf" in record.lower():

                record = record.strip().split("\n")[:-1]

                rType = record[0].split("{")[0].strip()
                rCite = record[0].split("{")[1].strip().replace(",", "")

                bibDic[rCite] = {}
                bibDic[rCite]["rCite"] = rCite
                bibDic[rCite]["rType"] = rType

                for r in record[1:]:
                    # split only on the first "=", since values may also contain "="
                    key = r.split("=", 1)[0].strip()
                    val = r.split("=", 1)[1].strip()
                    # strip the enclosing curly braces (and a trailing comma, if any)
                    val = re.sub(r"^\{|\},?", "", val)

                    # map the original bibTeX field name onto its normalized form
                    fixedKey = bibKeys[key]

                    bibDic[rCite][fixedKey] = val


    print("="*80)
    print("NUMBER OF RECORDS IN BIBLIGORAPHY: %d" % len(bibDic))
    print("="*80)
    return bibDic

###########################################################
# CONVERSION FUNCTIONS ####################################
###########################################################

import json
def convertToJSON(bibTexFile):
    data = bibLoad(bibTexFile)
    with open(bibTexFile.replace(".bib", ".json"), 'w', encoding='utf8') as f9:
        json.dump(data, f9, sort_keys=True, indent=4, ensure_ascii=False)


import yaml
def convertToYAML(bibTexFile):
    data = bibLoad(bibTexFile)
    with open(bibTexFile.replace(".bib", ".yaml"), 'w', encoding='utf8') as f9:
        yaml.dump(data, f9)

# CSV is the trickiest, because bibTeX data is not symmetrical: different records may have different fields
def convertToCSV(bibTexFile):
    data = bibLoad(bibTexFile)
    # let's handpick the fields that we want to save: citeKey, type, author, title, date
    headerList = ['citeKey', 'type', 'author', 'title', 'date']
    # NB: values are joined with TABs (so, strictly speaking, this produces TSV),
    # since commas are likely to appear inside titles and author lists
    header = "\t".join(headerList)

    dataNew = [header]

    for k,v in data.items():
        citeKey = k

        # not every record has every field, so fall back to "NA" when a field is missing
        rType = v.get('rType', "NA")
        author = v.get('author', "NA")
        title = v.get('title', "NA")
        date = v.get('date', "NA")

        tempVal = "\t".join([citeKey, rType, author, title, date])
        dataNew.append(tempVal)

    finalData = "\n".join(dataNew)
    with open(bibTexFile.replace(".bib", ".csv"), 'w', encoding='utf8') as f9:
        f9.write(finalData)


###########################################################
# RUN EVERYTHING ##########################################
###########################################################

print(settings["bib_all"])

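# uncomment the conversion(s) that you want to run: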
#convertToJSON(settings["bib_all"])
#convertToYAML(settings["bib_all"])
#convertToCSV(settings["bib_all"])


print("Done!")