6 L06: Data II

6.1 Modeling & Manipulating

6.2 Goals

Getting to know the basics of working with data: modeling, manipulating

6.3 Software

R
Excel, Google Spreadsheets, or any other alternative

6.4 Additional Materials

Two excellent books on data visualization with R (both openly available online):
- Kieran Healy. Data Visualization: A Practical Guide, https://socviz.co/.
  - This book is more conceptual and is more of a textbook
  - Everybody should read the first chapter “Look at Data!” (https://socviz.co/lookatdata.html)
- Rob Kabacoff. Data Visualization with R, https://rkabacoff.github.io/datavis/
  - This book is more of a reference and a cookbook

6.5 In Class I: Theoretical and Conceptual

6.6 Ways of modeling data: Categorization

“[Modeling is] a continual process of coming to know by manipulating representations.”

Willard McCarty, “Modeling: A Study in Words and Meanings,” in Susan Schreibman, Ray Siemens, and John Unsworth, A New Companion to Digital Humanities, 2nd ed. (Chichester, UK, 2016), http://www.digitalhumanities.org/companion/.

One of the most common ways of modeling data in historical research is joining items into broader categories. Categorization is important because it allows one to group low-frequency items into higher-frequency categories and, through those, to discern patterns and trends. Additionally, alternative categorizations allow one to test different perspectives on historical data.

The overall process is rather simple in terms of technological implementation, but it is quite complex in terms of subject knowledge: specialized expertise is required to make well-informed decisions.

For example, let’s say we have the following categories: baker, blacksmith, coppersmith, confectioner, and goldsmith.
- These can be categorized as occupations;
- Additionally, blacksmith, coppersmith, and goldsmith can also be categorized as ‘metal industry’, while baker and confectioner can be categorized as ‘food industry’;
- Further still, one might want to introduce additional categories, such as luxury production to include items like goldsmith and confectioner, and regular production for items like baker, blacksmith, and coppersmith.
Such categorizations can be created in two different ways, with each having its advantages:
- first, one can create them as additional columns. This approach keeps the original (or alternative) classifications always at hand, which is helpful for re-thinking classifications and creating alternative ones where items are reclassified differently, based on a different set of assumptions about your subject.
- second, these can be created in separate files, which might be easier as one does not have to stare at existing classifications and therefore will be less influenced by them in making new classification decisions.
Additionally, one can use some pre-existing classifications that have already been created in academic literature. These most likely need to be digitized and converted into properly formatted data, as we discussed in the previous lesson.
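
The two approaches can be combined in practice. The following is a minimal sketch in base R, not a prescribed method: it keeps a classification in a separate lookup table and joins it to the data as additional columns. The counts, the column names (`industry`, `production`), and the object names are invented for illustration only.

```r
# Toy data: the occupations from the example above, with invented counts.
people <- data.frame(
  occupation = c("baker", "blacksmith", "coppersmith", "confectioner", "goldsmith"),
  count      = c(12, 7, 3, 2, 1),
  stringsAsFactors = FALSE
)

# A classification kept as a separate lookup table (it could live in its own file).
# The assignments follow the example above; the column names are hypothetical.
classification <- data.frame(
  occupation = c("baker", "blacksmith", "coppersmith", "confectioner", "goldsmith"),
  industry   = c("food", "metal", "metal", "food", "metal"),
  production = c("regular", "regular", "regular", "luxury", "luxury"),
  stringsAsFactors = FALSE
)

# Joining adds the categories as new columns; the original column stays untouched.
categorized <- merge(people, classification, by = "occupation", all.x = TRUE)

# Low-frequency items now add up to larger groups.
by_industry   <- aggregate(count ~ industry,   data = categorized, FUN = sum)
by_production <- aggregate(count ~ production, data = categorized, FUN = sum)

print(by_industry)
print(by_production)
```

Swapping in a different `classification` table is enough to test an alternative categorization against the same data.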

6.7 Normalization

This is a rather simple, yet important procedure, which is, on the technical side, very similar to what was described above. In essence, the main goal of normalization is to remove insignificant differences that may hinder analysis.

Most common examples would be:
- bringing information to the same format (e.g., dates, names, etc.)
- unifying spelling differences

The best practice is to preserve the initial data, creating normalized data in separate columns (or tables).
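
As a rough illustration of both kinds of normalization, the sketch below (base R) unifies spelling variants and brings dates into one format, writing the results into new columns so that the initial values are preserved. The records, the `variants` lookup, and the helper `to_iso()` are hypothetical and made up for this example; real data will need its own list of variants and formats.

```r
# Invented records with inconsistent spellings and date formats.
records <- data.frame(
  name = c("Wien", "vienna ", "VIENNA", "Prag", "Prague"),
  date = c("1683-09-12", "12.09.1683", "1683/09/12", "1618-05-23", "23.05.1618"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: cleaned spelling variant -> normalized form.
variants <- c(wien = "Vienna", vienna = "Vienna", prag = "Prague", prague = "Prague")

# Unify spelling: trim whitespace, lowercase, then look the variant up.
key <- tolower(trimws(records$name))
records$name_norm <- unname(variants[key])

# Unify dates: try each known format in turn and return an ISO string (YYYY-MM-DD).
to_iso <- function(x) {
  formats <- c("%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d")
  for (f in formats) {
    d <- as.Date(x, format = f)
    if (!is.na(d)) return(format(d, "%Y-%m-%d"))
  }
  NA_character_  # no known format matched
}
records$date_norm <- vapply(records$date, to_iso, character(1), USE.NAMES = FALSE)

# The original columns (name, date) are still there next to the normalized ones.
print(records)
```

Values that match no known variant or format come out as `NA`, which makes the cases that still need a decision easy to find.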

6.8 Note: Proxies, Features, Abstractions

These three terms refer to the same idea. The notion of proxies is used in data visualization, that of features in computer science, and that of abstractions in the humanities.

The main idea behind these terms is that some simple features of an object can act as proxies to some complex phenomena.

For example, Ian Morris uses the size of cities as a proxy for the complexity of social organization. The logic is as follows: the larger a city, the more complex the social, economic, and technical organization required to keep that city functional; city size can therefore be used as an indicator of social complexity.

While proxies are selected from what is available (usually not much, especially when it comes to historical data) as a way to approach something more complex, it may be argued that abstractions are often arrived at from the opposite direction. We start with an object which is available in its complexity and reduce that complexity to a more manageable form which, we expect, will represent a particular aspect of the initial complex object. Most commonly this is applied to texts in a natural language. For example, in stylometry texts are reduced to frequency lists of their most frequent features, which are expected to represent an authorial fingerprint.

The complexity of texts can be reduced in a number of ways: into a list of lemmas (e.g., for topic modeling), frequency lists (e.g., for document distance comparison), syntactic structures, n-grams, etc. This list is never closed: researchers can create multiple abstractions depending on their research questions.
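
To make the idea of reducing a text more concrete, here is a small base R sketch that turns one invented sentence into two abstractions: a word frequency list and a list of word bigrams. The tokenization (lowercasing and splitting on anything that is not a letter a-z) is a deliberate simplification and an assumption of this example; real texts usually need more careful processing.

```r
# An invented sentence standing in for a full text.
text <- "The city grew, and the city walls grew with the city."

# Tokenize: lowercase, then split on every run of non-letter characters.
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Abstraction 1: a frequency list of words, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)

# Abstraction 2: word bigrams (each token paired with the one that follows it).
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

print(freq)
print(bigram_freq)
```

Both tables describe the same sentence, but each keeps a different aspect of it: the first discards word order entirely, the second preserves it only locally.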

6.9 In Class II: Practical

Data for the practical session and homework: Bosker_Data.zip. The data is available in open access and is a supplement to a study (see Reference Materials).

The zipped file includes everything you need for the practical session. Download and unzip (read the article at home!).

Note: create a notebook and work through the following questions. Group work is encouraged. Please explain in one or two sentences what you do in each step, so that your work is also human-readable. Please submit this notebook as your homework. Make sure to name your file in the following manner: 070184-LXX-HW-YourLastName-YourMatriculationNumber.EXT, where LXX is the number of the lesson for which you submit homework; YourLastName is your last name; YourMatriculationNumber is your matriculation number; and EXT is the extension of your file (you can submit it either as HTML or as a PDF).

6.10 Bosker et al. Dataset

Please, provide your answers (a few sentences) and/or working code to the following questions:

Describe and provide working R code:
- Can you figure out which file contains the data?
- In which format is the data?
- How can we load it into R?
Describe and provide working R code: What is the chronological extent of this data?
Describe: [easy-ish] What periods can it be divided into? How can we do that?
Describe, provide more than one strategy. (Comment: Sometimes you must be precise, sometimes approximation may be sufficient; also, consider such issues as reliability and availability of data. Is the data that you have reliable? Are necessary values available? Etc.) How can we introduce the following categories into this data:
- [easy] North Africa and Europe?
- [a bit more complicated] the Austro-Hungarian Empire?
  - Optional coding When did the Empire have the largest number of cities (based on the data set)?
  - Optional coding When was its population at the highest?
- [a tad tricky] Christendom and Islamdom?
  - Optional coding What are the largest cities of Islamdom for each reported period?
  - Optional coding What are the largest western cities of Islamdom between 1000 and 1500 CE?

6.11 Reference Materials

Bosker, Maarten, Eltjo Buringh, and Jan Luiten van Zanden. 2012. “From Baghdad to London: Unraveling Urban Development in Europe, the Middle East, and North Africa, 800–1800.” The Review of Economics and Statistics 95 (4): 1418–37. https://doi.org/10.1162/REST_a_00284.
Bosker, Maarten, Eltjo Buringh, and Jan Luiten Van Zanden. 2014. “Replication Data for: From Baghdad to London: Unraveling Urban Development in Europe, the Middle East, and North Africa, 800-1800.” Harvard Dataverse. https://doi.org/10.7910/DVN/24747.

6.12 Additional Readings

Moretti, Franco. 2007. Graphs, Maps, Trees: Abstract Models for Literary History. London - New York: Verso.
Moretti, Franco. 2013. Distant Reading. London ; New York: Verso.
Romanov, Maxim G. 2017. “Algorithmic Analysis of Medieval Arabic Biographical Collections.” Speculum 92 (S1): S226–46. https://doi.org/10.1086/693970.

6.13 Homework

Bosker et al. Dataset. Finish the practical assignment on the Bosker et al. Dataset.
Viennese Dataset Assignments. Collectively, you should now have all the data on Vienna’s districts. The next assignment is as follows:
1. Work together to create one dataset with all the data on houses and inhabitants in Vienna; what would be the best structure for this dataset?
2. Make graphs of the growth of all districts — both, for houses and for inhabitants
3. Make a graph of the growth of Vienna using this data. Here, again, you should have two graphs (two perspectives): one based on the number of houses, and another — on the number of inhabitants. After you have the graph, provide your interpretation (this implies looking for additional information, not simply relying on the graph).
Data visualizations:
1. Please, make sure to read Kieran Healy’s “Look at Data!” (https://socviz.co/lookatdata.html)
2. Using what you have learned in the chapter, take a close look at the following datasets (given in the order of increasing size and complexity) and think about how they can be visualized. The main goal is to come up with verbal descriptions of how this data can be visualized and what you may expect to learn from these visualizations. You are, of course, welcome to generate those visualizations. Again, group work is encouraged — brainstorming is particularly good for this task.
  1. historydata::early_colleges
  2. europop::europop
  3. historydata::us_cities_pop
Note: you can put everything in one notebook. Do not forget to send me your work!

6.14 Submitting homework

Homework assignments must be submitted by the beginning of the next class.
Email your homework to the instructor as attachments.
- In the subject of your email, please add the following: 070184-LXX-HW-YourLastName-YourMatriculationNumber, where LXX is the number of the lesson for which you submit homework; YourLastName is your last name; and YourMatriculationNumber is your matriculation number.