12 Lesson 12

12.1 Building Memex - Step 6 (or, actually, back to Step 3)

All that’s left is to update our interface. We need to update the index page (our main entry point into the memex), which will include both a) the table [of contents] for our searches (links to pages with results for individual searches), and b) the table [of contents] for all the publications that we are including in our memex. Additionally, we will need to update our publication pages so that they include our similarity results.

12.2 The Index Page

Our index page should contain two automatically generated elements: 1) the table [of contents] for the searches that we have run; 2) the table [of contents] for all the publications that we have included in the memex. The way it may look is shown in the screenshot below. You are, of course, more than welcome to change it so that it better fits your own vision (the page must still contain all the major components shown below).

In order to generate our two tables we could write two functions: one to process searches and another to process publications. Below is a fully working function that processes searches (i.e., generates an HTML page for each search file) and returns a table [of contents] which we can then include in our main index file:

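# generate search pages and TOC
# (requires `import json, os`; `settings`, `functions`, `generalTemplate` and
# `searchesTemplate` are defined elsewhere in the script)
def formatSearches(pathToMemex):
    with open(settings["template_search"], "r", encoding="utf8") as f1:
        indexTmpl = f1.read()
    dof = functions.dicOfRelevantFiles(pathToMemex, ".searchResults")

    toc = []
    for file, pathToFile in dof.items():
        searchResults = []
        # load the saved results, closing the file once they are read
        with open(pathToFile, "r", encoding="utf8") as f2:
            data = json.load(f2)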
        
        # collect toc
        template = "<tr> <td>%s</td> <td>%s</td> <td>%s</td> <td>%s</td></tr>"

        linkToSearch = os.path.join("searches", file+".html")
        pathToPage = '<a href="%s"><i>read</i></a>' % linkToSearch
        searchString = '<div class="searchString">%s</div>' % data.pop("searchString")
        timeStamp = data.pop("timestamp")
        tocItem = template % (pathToPage, searchString, len(data), timeStamp)
        toc.append(tocItem)

        # generate the results page
        keys = sorted(data.keys(), reverse=True)
        for k in keys:
            searchResSingle = []
            results = data[k]
            temp = k.split("::::")
            header = "%s (pages with results: %d)" % (temp[1], int(temp[0]))
            for page, excerpt in results.items():
                pdfPage = int(page)
                linkToPage = '<a href="../%s"><i>go to the original page...</i></a>' % excerpt["pathToPage"]
                searchResSingle.append("<li><b><hr>(pdfPage: %d)</b><hr> %s <hr> %s </li>" % (pdfPage, excerpt["result"], linkToPage))
            searchResSingle = "<ul>\n%s\n</ul>" % "\n".join(searchResSingle)
            searchResSingle = generalTemplate.replace("@ELEMENTHEADER@", header).replace("@ELEMENTCONTENT@", searchResSingle)
            searchResults.append(searchResSingle)
        
        searchResults = "<h2>SEARCH RESULTS FOR: <i>%s</i></h2>\n\n" % searchString + "\n\n".join(searchResults)
        with open(pathToFile.replace(".searchResults", ".html"), "w", encoding="utf8") as f9:
            f9.write(indexTmpl.replace("@MAINCONTENT@", searchResults))

    toc = searchesTemplate.replace("@TABLECONTENTS@", "\n".join(toc))
    return toc
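
Judging from how the function reads it, each .searchResults file holds a `searchString`, a `timestamp`, and one entry per publication whose key has the form `count::::citeKey`. The sketch below walks through that structure with an invented sample; all values, citation keys and paths are made up, and the zero-padded counts are an assumption.

```python
# Invented sample that mimics the structure formatSearches() expects;
# every value here is made up for illustration.
sample = {
    "searchString": "example",
    "timestamp": "2020-01-01 12:00:00",
    "00002::::keyB": {
        "5": {"pathToPage": "keyB/pages/0005.html", "result": "first excerpt"},
        "9": {"pathToPage": "keyB/pages/0009.html", "result": "second excerpt"},
    },
    "00001::::keyA": {
        "3": {"pathToPage": "keyA/pages/0003.html", "result": "third excerpt"},
    },
}

data = dict(sample)                      # work on a copy, since pop() mutates the dict
searchString = data.pop("searchString")
timeStamp = data.pop("timestamp")
publicationCount = len(data)             # only the publication keys remain

# Build the per-publication headers in the same way as formatSearches()
headers = []
for k in sorted(data.keys(), reverse=True):
    count, citeKey = k.split("::::")
    headers.append("%s (pages with results: %d)" % (citeKey, int(count)))

print(publicationCount)
print(headers)
```

Because the keys are sorted as strings, publications with the most matching pages come first only if the counts are zero-padded to the same width, as assumed here.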

And this is the content of generalTemplate and searchesTemplate used in the function:

There is nothing overly complicated in generalTemplate, but we want to take a closer look at searchesTemplate. Please note the <thead> and <tbody> tags, which surround the table header and the table main content, respectively. These are important elements without which the dynamic table filtering and [re]ordering will not work properly. Also, if you want to change the number of columns in your table, you will need to modify both the template and your Python code so that the number of columns that you generate with the script corresponds to the number of columns given in the <thead> element. Currently the template has four columns: link, search string, # of publications with matches, and time stamp.
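
As a rough illustration, templates of this kind could look like the sketch below. Only the placeholders (`@ELEMENTHEADER@`, `@ELEMENTCONTENT@`, `@TABLECONTENTS@`) and the four column names come from this lesson. The surrounding markup, the `id` and `class` values, and the sample row are assumptions.

```python
# Hypothetical templates: only the @...@ placeholders are taken from formatSearches();
# the markup, ids and class names around them are illustrative assumptions.
generalTemplate = """
<button class="collapsible">@ELEMENTHEADER@</button>
<div class="content">
@ELEMENTCONTENT@
</div>
"""

searchesTemplate = """
<table id="searchesTable" class="display">
<thead>
<tr><th>link</th><th>search string</th><th># of publications with matches</th><th>time stamp</th></tr>
</thead>
<tbody>
@TABLECONTENTS@
</tbody>
</table>
"""

# One row in the same four-column format that formatSearches() produces
# (the link, search string, count and date are made up)
rowTemplate = "<tr> <td>%s</td> <td>%s</td> <td>%s</td> <td>%s</td></tr>"
row = rowTemplate % ('<a href="searches/example.html"><i>read</i></a>',
                     "example", 3, "2020-01-01")
table = searchesTemplate.replace("@TABLECONTENTS@", row)

# The number of header cells must match the number of cells in each row
assert searchesTemplate.count("<th>") == rowTemplate.count("<td>")
print(table)
```

The final check is a cheap way to catch a mismatch between the template and the script when a column is added to one but not the other.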

Using this function as a model, write another one that generates the table [of contents] for all included publications.

You can use this page as an example of what you need to generate. Look at the HTML code carefully; compare it with the HTML template; identify elements that are static and those that must be generated from your data; write the code that generates your elements.
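
One possible starting point for the row-generating part is sketched below. It is not the course solution: the function name `formatPublicationsToc`, the record fields (`path`, `author`, `date`, `title`), the column choice and the demo data are all assumptions, and you will need to replace them with whatever your own bibliography data provides.

```python
import html

# Hypothetical sketch: the record fields, columns and template are assumptions
def formatPublicationsToc(publications, contentsTemplate):
    """Turn {citeKey: record} into table rows and insert them into the template."""
    rowTemplate = "<tr> <td>%s</td> <td>%s</td> <td>%s</td> <td>%s</td></tr>"
    rows = []
    for citeKey, record in sorted(publications.items()):
        link = '<a href="%s"><i>read</i></a>' % record["path"]
        rows.append(rowTemplate % (
            link,
            html.escape(record["author"]),   # escape &, < and > in bibliographic data
            html.escape(record["date"]),
            html.escape(record["title"]),
        ))
    return contentsTemplate.replace("@TABLECONTENTS@", "\n".join(rows))

# Made-up template and records, for demonstration only
demoTemplate = (
    "<table>\n<thead>\n"
    "<tr><th>link</th><th>author</th><th>date</th><th>title</th></tr>\n"
    "</thead>\n<tbody>\n@TABLECONTENTS@\n</tbody>\n</table>"
)
demoPublications = {
    "doe2001": {"path": "doe2001/index.html", "author": "Doe, Jane",
                "date": "2001", "title": "A Made-Up Title & Subtitle"},
    "able1999": {"path": "able1999/index.html", "author": "Able, John",
                 "date": "1999", "title": "Another Invented Title"},
}

publicationsToc = formatPublicationsToc(demoPublications, demoTemplate)
print(publicationsToc)
```

Escaping the text fields matters because titles often contain characters such as `&` that would otherwise produce invalid HTML.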

12.3 The Publication Pages

Essentially, you need to adapt the same procedure to [re]generate publication pages: you will need to add tables with the most similar results based on the tf-idf similarity distances that we generated before. The final results should look like what you see in the screenshots below.

Keep in mind that, as before, the very first page differs from all the other pages. It contains:

  • the author(s) and the title as the main header
  • the bibliographical record in bibTeX format (folded on the screenshot)
  • the wordcloud of main keywords
  • and the table of distances with most similar publications

All the other pages look slightly different: they include the image of the page and a table of distances to clusters of pages, rather than to complete publications (as shown in the two screenshots below):

You can use this and this page as examples of what you need to generate. (Note: Look at the HTML code carefully; compare it with the HTML template; identify elements that are static and those that must be generated from your data; write the code that generates your elements.)
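
The table of distances can be generated in much the same way as the table of searches. The following is a minimal sketch, assuming that the distances for one publication are available as a dictionary of `{citeKey: distance}` in which smaller values mean closer publications. The function name `formatDistanceTable`, the `limit` parameter, the table markup and the sample values are all hypothetical.

```python
import html

# Hypothetical sketch: assumes {citeKey: distance}, where smaller means more similar
def formatDistanceTable(distances, limit=5):
    """Return an HTML table with the `limit` closest publications."""
    rowTemplate = "<tr> <td>%s</td> <td>%.4f</td></tr>"
    closest = sorted(distances.items(), key=lambda item: item[1])[:limit]
    rows = [rowTemplate % (html.escape(citeKey), value) for citeKey, value in closest]
    return (
        "<table>\n<thead>\n"
        "<tr><th>publication</th><th>distance</th></tr>\n"
        "</thead>\n<tbody>\n%s\n</tbody>\n</table>" % "\n".join(rows)
    )

# Made-up distances, for demonstration only
demoDistances = {"pubA": 0.5, "pubB": 0.1, "pubC": 0.3}
distanceTable = formatDistanceTable(demoDistances, limit=2)
print(distanceTable)
```

If your saved values are similarity scores rather than distances, sort in descending order instead (for example, by passing `reverse=True` to `sorted`).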

12.4 Homework

  • write scripts that generate: 1) the index page; 2) publication pages.

  • additionally, take the solution scripts from the previous lesson and annotate every line of code; submit your annotations together with the main assignment; if you have any suggestions for improvements, please share them (this will count as extra points :).
  • upload your results to your memex GitHub repository
    • place annotated scripts into the _misc subfolder

12.5 Homework Solution

The solution will be added to the main repository later.