The main goal of OpenArabic mARkdown is to provide a simple system for tagging structural elements in premodern and early modern Arabic texts that are being prepared within the framework of the OpenArabic Project. The use of OpenArabic mARkdown will allow one to engage in the computational analysis of classical Arabic texts in the same way as more complex and time-consuming tagging schemes (like TEI XML); OpenArabic mARkdown will also facilitate the conversion of the large volume of Arabic texts into TEI XML, which is now the standard format for digital editions. In principle, OpenArabic mARkdown does not require any special editor, but the current implementation relies on EditPad Pro, which supports right-to-left languages, Unicode, and large files. However, it is the support of custom highlighting and navigation schemes that makes this text editor particularly convenient for OpenArabic mARkdown.
%APPDATA%\JGsoft\EditPad Pro 7(make sure that EditPad Pro in not running: you need to do
File > Exitto completely close it).
### |), sections (
### ||), subsections (
### |||), etc.
### $), descriptions of events (
### @), and dictionary entries (also
EditPad Pro is a commercial text editor for Windows. It can be used on Mac and Linux through some virtualization solution (for example, Parallels or Fusion on Mac, or Wine on Linux). You can download a fully functional trial version to try OpenArabic mARkdown (it, however, will not work with EditPad Lite, a free light-weight version of the same text editor).
The most recent OpenArabic mARkdown settings files for EditPad Pro can be found in the mARkdown repository on GitHub. You can download the entire repository or just the
*.zip file (the date is in the filename:
YYYYMMDD_HHMMSS.zip). Unzip its content, and copy all the files into
%APPDATA%\JGsoft\EditPad Pro 7. You will need to restart the program (use:
File > Exit). If OpenArabic mARkdown settings work, the main window should look as shown below. Note the yellow-ish color of the main window and a large Find/Replace panel on the right).
In a text file, OpenArabic mARkdown is activated by the “magic value” at the beginning of the text file, which is
#####ARABICA#. (In other words, the very first line of the text should consist exclusively of this value.) The highlighting scheme relies on regular expressions (RE) to color a limited number of pattern in the text.
Note: You can try
test_textFile from the github repository. The file does not have an extension to avoid accidental opening in Notepad—the default text editor on Windows, which cannot handle large files. Drug-n-drop the file onto EditPad Pro to open it. The OpenArabic mARkdown should be automatically activated. (If you start a new file, you may need to reopen it after adding the “magic value”.)
OpenArabic mARkdown includes structural elements and in-text elements—all are highlighted with different colors. This helps one to avoid mistakes while tagging texts, and, additionally, makes the structure of the text easily discernible to the human eye. Configured in a custom highlighting scheme, all elements are activated with regular expressions (RE).
EditPad Pro also supports folding of structural elements that span over multiple lines. This allows one to contract—or fold—the entire text into a table of content and then open—or unfold—only elements that one wants to work with.
All texts in
OpenArabic have automatically inserted metadata blocks in the beginning of the file. This metadata is collected from initial collections and is neither fixed nor vetted.
Headers are for the titles of main structural units, like chapters, subchapters, etc. in the text of the source. The entire header must be on one line and must include three main elements:
###— three hashtags at the beginning of the line;
|— pipes, whose number corresponds to the level of a header;
Header— the text of a header (followed by
\n, “new line” character)
### | First Level Header (red) ### || Second Level Header (orange) ### ||| Third Level Header (yellow) ### |||| Fourth Level Header (green) ### ||||| Fifth Level Header (blue)
A gif-image below shows how headers get highlighted with different colors when tagged properly. Colors follow the rainbow spectre (they stop changing after Level V):
Many books also include sections written by editors. These sections are not parts of the actual source and they must be tagged distinctively so that they could be easily excluded from automated forms of analysis. The tag can also be used for any “in-edition” information, which is not relevant to the actual text of the source (for example, there are often lines saying this is such and such volume of a such and such book—these can be tagged as “editorial headers”). The tag is added in the same manner as source headers and is:
### === (at the beginning of the line).
Most of the texts that I have worked with so far have a very clear structure (chapters > subchapters > subsubchapters, etc.), where the lowest level structural units (subchapters or subsubchapters) are made of “information units,” such as biographies in biographical collections, descriptions of events in chronicles, and dictionary entries (on lexical items, names, toponyms, book titles, etc.)
Using OpenArabic mARkdown, one needs only to mark the beginning of each information unit. Each unit can be marked with either a simplified or full tag. Simplified tags are short, which makes them ideal for manual tagging. Simplified tags are, however, ambiguous. Full tags are more readable and source independent. Full tags are particularly important when information from multiple sources is processed at the same time.
OpenArabic mARkdown has been first developed for biographical collections, chronicles, bibliographical collections, and dictionaries of different types. In most cases, these sources contain one type of information units and in such cases one can use only one type of tag
### $ [information unit], which can be later converted—using find/replace—into corresponding full tags. Below is the description of used tags.
Arabic dictionaries usually include information units of the same types, so one simplified tag—
### $ [a dictionary item]—is sufficient. The full tags depend on the nature of each dictionary and at the moment include “descriptive names,” toponyms, lexical items, and book titles. Tags for them are as follows:
### $DIC_NIS$ [a descriptive name entry] ### $DIC_TOP$ [a toponym entry] ### $DIC_LEX$ [a lexical entry] ### $DIC_BIB$ [a book title]
Biographical collections often include several types of information units. Moreover, there are plenty of sources that combine features of both biographical collections and chronicles (“obituary chronicles”), so one often has to deal with a variety of information units in the same text. For this reason, the main simplified tags are as follows:
### $ [a biography of a man] ### $$ [a biography of a woman] ### $$$ [a cross-reference and/or repetition, for both men and women] ### $$$$ [a list of names] ### @ [a historical event] ### @ RAW [a batch of historical events]
### @ RAW can be used to tag blocks of historical events when it is not immediately clear when one information unit ends and another begins. With these tags in place, one can return to an unfinished batch later, read it more carefully, and split properly into single units. There is also, of course, a conceptual and methodological issue with regard to what constitutes an ‘event’. For the purposes of algorithmic analysis, [the “description of an] event” is a structurally and thematically complete unit of text that describes an entity that has 5 properties: subject, predicate, object, time and place. In other words, it is something that can be grouped into categories, graphed across time, and mapped in space.
Full tags are as follows:
### $BIO_MAN$ [a biography of a man] ### $BIO_WOM$ [a biography of a woman] ### $BIO_REF$ [a cross-reference, for both men and women] ### $BIO_NLI$ [a list of names] ### $CHR_EVE$ [a historical event] ### $CHR_RAW$ [a batch of historical events]
Each riwāyaŧ/ḥadīṯ report should be treated as a separate paragraph:
new line + # $RWY$ ; in order to demarkate the boundary between isnād and matn,
@MATN@ tag is to be inserted between isnād and matn. Since it is not uncommon to have an evaluation of reported material, tag
@HUKM@ can be used to tag the beginning of the ḥukm-statement. All three elements of a riwāyaŧ/ḥadīṯ must remain the part of the same
# $RWY$ this section contains isnād @MATN@ this section contains matn @HUKM@ this section contains ḥukm .
It is not uncomon that either isnād or matn is missing. In such cases
@MATN@ tag still must be inserted: in the case of missing isnād,
@MATN@ directly follows
# $RWY$ ; in the case of missing matn,
@MATN@ becomes the last element in the ḥadīṯ paragraph.
@HUKM@ is optional and inserted only when there is a ḥukm-statement.
The following tags are section tags to be places at the beginning of a relevant section: (1)
### $DOX_POS$ for sections dealing woth doxographical/theological positions; (2)
### $DOX_SEC$ for descriptions of religious groups (“sects”). These can be used equally in doxographical texts and texts that address doxographical issues among many other things.
Chapters and information units may have internal divisions that are marked either in the original editions or being introduced by a researcher. These can be tagged with
# --- description either on a separate line right before a sub-unit (where
description is a researchers own characterization of a unit, i.e.
# --- teachers marks a section of a biography that includes the names of a biographee’s teachers)
Most of the in-text elements are already in pre-formatted texts. In most cases, you do not need to worry about them, but it is important to understant how they function.
In premodern Arabic texts paragraphs as units are not particularly reliable. Yet, if a certain electronic text reproduces a printed edition, it is worth preserving its division into paragraphs. Each paragraph begins with a hashtag,
While EditPad Pro handles large files very well, it has problems with long paragraphs (or, more correctly, lines). For this reason, long paragraphs are split into shorter lines, where each line starts with
~~ (two tildas).
Poetry lines can be tagged in the following manner: one line of poetry per line/paragraph, beginning and ending with
%~% and with hemistiches divided by
%~% (when applicable).
# %~% hemistich1 %~% hemistich2 %~%
an image to be added
@QB@ .*? @QE@)
Qur’anic verses can be tagged in the following manner: 1)
@QB@—at the beginning of a verse; 2)
@QE@—at the end.
NB: it is not recommended to tag verses of the Qurʾān at this point; the tags that may occur in the existing texts simply preserve residual and inconsistent tagging of Qurʾānic verses in the initial versions of texts.
an image to be added
Most of electronic texts preserve page numbers. OpenArabic mARkdown supports the following format
V## is the volume number, and
P### is the page number. Volume number must be two digits, page number—three (padded with zeros when necessary). Thus,
PageV05P022 stands for Vol. 5, p.22. Page number tags are inserted at the end of the corresponding page.
In a very similar manner notes—both footnotes and endnotes—can be tagged:
V## is the volume number, and
P### is the page number, and
N## is the note number. The content of notes is put into the end of the text, as endnotes.
to be updated
All non-letter characters are automatically highlighted as elements that may have some value for the structural understanding of a text. These include: punctuation, numbers, non-Arabic letters.
OpenArabic mARkdown also includes a series of patterns/tags to highlight results generated with/for algorithmic analysis. Unlike previous elements, these additions to the text that are either generated automatically, or inserted manually.
Geographical texts—such as comprehensive geographies—contain a lot of data that can be used for modeling historical processes in space. Of particular importance are administrative divisions and trade routes, which often come with distances.
#\$#(PROV|REG\d)# .*? #\$#TYPE .*? #\$#(REG\d|STTL) ([\w# ]+) $)
Most descriptions fit into the following scheme
WORLD: PROVINCE > TYPE > (REGION) > TYPE > SETTLEMENT. In the actual text, relevant information is tagged essentially as ‘triples’ of
SUBJECT > PREDICATE > OBJECT (with multiple OBJECTs that will be parsed out at a later stage):
#$#PROV toponym #$#TYPE type_of_region #$#REG1 (toponym #)+ #$#REGX toponym #$#TYPE type_of_region #$#REGX (toponym #)+ #$#REGX toponym #$#TYPE type_of_settlement #$#STTL (toponym #)+
Note: Clip collection can be used to insert relevant patterns into the text.
#$#FROM .*? #$#TOWA .*? #$#DIST .*)
Route sections with distances are tagged in the following manner:
#$#FROM toponym #$#TOWA toponym #$#DIST distance_as_recorded
At the moment these include toponyms (
@TXX), individuals (
@PXX) and social/onomastic/biographical characteristics (
All tags have similar structure
CODE + two numbers.
CODEs have two variations: 1) long (three letters) and 2) short (one letter). Triliteral
CODEs are used for automatic tagging with scripts and entity lists; one-letter short
CODEs are used for manual tagging and disambiguation of automatic tags.
XX, two numbers, indicate:
bi-are attached to an entity, the first number must be
1, which means that 1 character from the beginning must be removed)
2must be used (both words will be automatically highlighted).
Years AH can be tagged in the following manner (where
# is a digit; there must be spaces on both sides of the year tag):
YB####— a year of birth, as mentioned in a biography; for example,
@YB510means that a biographee was born in the year 510 of the Islamic hijrī calendar.
YD####—a year of death, as mentioned in a biography; for example,
@YD597means that a biographee died in the year 597 of the Islamic hijrī calendar.
YY####—any other type of year references mentioned in a text.
This pattern is easily adjustable for tagging specific semantic patterns. The overall pattern is
USER is an alias of a researcher introducing a pattern (three-letter initials is an option—
MGR in my case);
CAT_SUBCAT... represents categorical branching, where categories are defined by a researcher for specific purposes.
MGR ├── WACZ │ ├── education │ │ ├── quran │ │ ├── hadith │ │ └── fiqh │ └── sermons │ ├── hell │ ├── paradise │ └── love ├── KHUTBA │ ├── education │ │ ├── punishmenst │ │ ├── hadith │ │ └── fiqh │ └── sermons │ ├── hell │ ├── paradise │ └── love └ etc.
Above is a snapshot of a scheme that I used for tagging biographies of preachers. In this example, I am looking into individuals involved into two different forms of preaching—waʿẓ and ḫuṭbaŧ, their education backgrounds and topics of their sermons. Thus, my tag in the case of a wāʿiẓ who had training in fiqh will look as follows:
@MGR@WACZ_education_fiqh@. In combination with chronological, geographical and social information (derived from nisbaŧs), one can get a perspective on how many individuals involved in waʿẓ had training in fiqh and if there are any chronological, geographical and social peculiarities to this information.
It is vital to keep track of tags one introduces and avoid creating too many of them. Another useful stategy is to keep them short, for instance:
@MGR@WCZ_ed_fqh@ (it is your own decision, of course, how to balance brevity and readability).
Occasionally it happens that highlighting scheme loses its track and gets shifted, highlighting wrong elements. One of the things that may trigger this behavior is a long regular expression in a search field, but sometimes it happens for some other reasons. This can be fixed through refreshing the highlighting scheme by going through the following steps:
[Options >] Configure File Types... > Color Syntax > Refresh (
Folding lines is a very convenient tool, but it seems to take a while before for it to load when one opens EditPad Pro. This can be fixed by going through the following steps:
[Options >] Configure File Types... > Navigation > Refresh (