1. Outline
This archive was populated computationally using a spreadsheet provided by ACADIA that lists all articles published between 1985 and 2020. A Python program was written to parse this spreadsheet, extract the necessary data, and create archive entries for all papers (articles), all authors (contributors), and all publications (issues). The original spreadsheet was edited manually several times to correct errors and inconsistencies, add missing information, and fix illegible characters.
All data used and the software written to create this archive are described and provided below.
2. Code outline
For detailed information about the parsing process, see the comments integrated in the Python Jupyter Notebook file (⁄python parser (version 1)).
Here is an outline (part numbers correspond to sections of the Jupyter Notebook); illustrative code sketches of the main steps follow the outline:
- Part 1: Setup; load CSV catalogue of articles; create articles dataframe.
- Part 2: Parse `source` column and create Issues dataframe:
  - identify individual publication sources/proceedings names (column `clean issue`),
  - get page range for each paper (column `clean pages`),
  - create a new dataframe for individual issues (⁄issues dataframe (CSV)).
- Revised Part 2: Load CSV catalogue of ACADIA issues (⁄acadia issues list (CSV)) and compare against the `source` column:
  - find article page range,
  - get issue editors,
  - match papers with source publications,
  - get paper indexes per issue,
  - get item library id,
  - export dataframe (⁄issues dataframe (CSV)).
- Part 3: Parse and clean up article keywords:
  - add publication year to keywords,
  - for cases that feature ‘category’ in the keywords, keep that as a keyword with the prefix `category:`,
  - if no keywords, add the keyword `archive-note-no-tags`,
  - if no abstract, add the keyword `archive-note-no-abstract`,
  - add clean keyword array to column `clean keywords`.
- Part 4: Clean up article titles:
  - remove double spaces, special characters, etc., and add to column `clean title`,
  - generate short title to use as filename (column `filename`), and check filename uniqueness.
- Part 5: Clean up and identify individual authors:
  - manually scan similar names to identify author aliases and create a dictionary of names and aliases; match author aliases and reduce the author list (see §⁄aliases),
  - create a new dataframe (⁄authors dataframe (CSV)) listing individual authors, aliases (if applicable), and indexes of papers.
- Part 6: Create Markdown files for all entries:
  - Contributor entries, including:
    - a list of author name aliases (if applicable),
    - list of articles,
    - list of co-authors.
  - Article entries, including:
    - abstract,
    - keywords (including publication year),
    - contributors (by metadata association),
    - link to the library item of the publication.
  - Editor entries:
    - names of editors were matched against the list of contributors; if a name appears as both editor and author, a link is provided from the editor entry to the contributor page.
  - Issue entries, including:
    - list of articles (by metadata association),
    - list of editors,
    - publication year as keyword,
    - a table with key information such as proceedings ISBN, conference location and date,
    - link to library entry of proceedings PDF.
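To make the first steps concrete, here is a minimal sketch of Parts 1–2 under assumed file names; the page-range pattern and the `issue index` helper column are illustrative (not part of the actual notebook), while the other column names follow §4:

```python
import re
import pandas as pd

# Part 1: load the edited ACADIA catalogue into the articles dataframe.
articles = pd.read_csv("acadia-publication-list-edited.csv")

# Part 2: extract a page range such as "123-134" from the free-text
# `source` column into `clean pages`; the notebook's actual pattern may differ.
PAGE_RANGE = re.compile(r"(\d+)\s*[-–]\s*(\d+)")

def clean_pages(source):
    match = PAGE_RANGE.search(str(source))
    return f"{match.group(1)}-{match.group(2)}" if match else ""

articles["clean pages"] = articles["source"].apply(clean_pages)

# Revised Part 2: match each paper to an issue by comparing `source`
# against the issues list's `source title` column (see §4.2).
issues = pd.read_csv("acadia-issues-list.csv")

def match_issue(source):
    for i, title in issues["source title"].items():
        if str(title).lower() in str(source).lower():
            return i
    return None

# `issue index` is a hypothetical helper column, not one listed in §4.
articles["issue index"] = articles["source"].apply(match_issue)
```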
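Similarly, a sketch of the keyword clean-up (Part 3) and filename generation (Part 4); the splitting rule and slug length are assumptions, and only the fallback keywords are taken directly from the outline above:

```python
import re

def clean_keywords(raw, category, abstract, year):
    """Part 3: normalise one article's keywords (splitting rule assumed)."""
    raw = raw if isinstance(raw, str) else ""
    keywords = [k.strip().lower() for k in raw.split(",") if k.strip()]
    if category:                                     # keep category, prefixed
        keywords.append(f"category:{str(category).strip().lower()}")
    if not keywords:
        keywords.append("archive-note-no-tags")      # no-keywords fallback
    if not abstract:
        keywords.append("archive-note-no-abstract")  # no-abstract fallback
    keywords.append(str(year))                       # publication year as keyword
    return keywords

def make_filename(title, taken):
    """Part 4: derive a short, unique, filesystem-safe slug from a title."""
    slug = re.sub(r"[^a-z0-9]+", "-", str(title).lower()).strip("-")
    slug = "-".join(slug.split("-")[:6])             # keep only the first words
    base, n = slug, 2
    while slug in taken:                             # enforce filename uniqueness
        slug = f"{base}-{n}"
        n += 1
    taken.add(slug)
    return slug

# Example: build slugs for all titles while sharing one `taken` set.
taken = set()
slugs = [make_filename(t, taken) for t in ["Digital Craft", "Digital Craft"]]
# -> ["digital-craft", "digital-craft-2"]
```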
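Finally, a sketch of the Markdown generation (Part 6). The front-matter layout, output path, and helper name `write_article_entry` are assumptions; the notebook's actual entry templates may differ:

```python
from pathlib import Path

def write_article_entry(row, out_dir="content/articles"):
    """Render one article row as a Markdown file (layout assumed)."""
    front_matter = "\n".join([
        "---",
        f'title: "{row["clean title"]}"',
        f'contributors: {row["clean authors"]}',   # metadata association
        f'keywords: {row["clean keywords"]}',      # includes publication year
        "---",
    ])
    summary = row.get("summary")                   # abstract, when present
    body = summary if isinstance(summary, str) else ""
    path = Path(out_dir) / f'{row["filename"]}.md'
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{front_matter}\n\n{body}\n", encoding="utf-8")
```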
3. Author aliases
Individual author (contributor) names were identified and matched throughout the catalogue.
The author list was then scanned manually to identify different aliases corresponding to the same author.
These were collected in a dictionary listing a contributor’s preferred name and its aliases (form: `author name: [alias1, alias2]`), which was used to reduce the author list by 171 entries.
The author aliases dictionary was exported as a JSON file for preview (see §⁄author aliases dictionary).
Contributor entries for authors with aliases feature a table with the author’s aliases and are marked with the keyword `archive-note-aliases`.
Note: The author aliases dictionary might contain errors. Also, it is highly likely that some aliases were not properly identified.
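As an illustration, a minimal sketch of how such a dictionary can be applied; the names shown are placeholders, not entries from the actual dictionary:

```python
import json

# Form described above: preferred author name -> list of aliases.
author_aliases = {
    "Jane Q. Doe": ["J. Doe", "Jane Doe"],   # placeholder entry
}

# Invert the dictionary so any alias resolves to the preferred name.
alias_to_name = {alias: name
                 for name, aliases in author_aliases.items()
                 for alias in aliases}

def canonical(name):
    """Map an alias to the preferred name; identity for non-aliases."""
    return alias_to_name.get(name, name)

# Export for preview, as with `author-aliases.json` (§4.3).
with open("author-aliases.json", "w", encoding="utf-8") as fh:
    json.dump(author_aliases, fh, indent=2, ensure_ascii=False)
```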
4. Files used in the making of this archive
4.1. Parser program
- ⁄Python parser (version 1): The Python (Jupyter Notebook) program authored to parse the catalogue of publications and generate archive entries, including documentation and comments. Note: save with extension `.ipynb`.
4.2. Data input
- ⁄acadia publication list original (CSV): The original metadata spreadsheet provided by ACADIA (in CSV format; contains illegible characters, various errors and inconsistencies).
- ⁄acadia publication list edited (CSV): A copy of the previous file incorporating multiple corrections, which was fed to the parser to populate the archive. Notes about the corrections carried out were added in the column `NOTE`.
- ⁄acadia issues list (CSV): Spreadsheet (added in the revision of Part 2) listing issue information. Columns: `year, short title, source title, issue name, subtitle, editors-straight-names, editors, proceedings title, ISBN, publisher, location, date, host, website, keywords, Original PDF, Compressed & OCR PDF Size, OCRd, in library, library id`.
4.3. Author aliases dictionary
- ⁄author-aliases.json: JSON dictionary listing author names and aliases (key: author name, value: list of aliases).
4.4. Spreadsheets generated (dataframes)
- ⁄authors dataframe (CSV): A spreadsheet of individual authors. Columns: `author name, paper indexes, aliases, unicode name, filename` (see the sketch below).
- ⁄issues dataframe (CSV): Version of ⁄acadia issues list (CSV), with additional columns: `paper indexes, filename, clean editors, editor filenames`.
- ⁄articles dataframe (CSV): A list of individual articles with cleaned-up metadata. Columns: `authors, year, title, source, summary, keywords, NOTES, clean issue, clean pages, clean keywords, clean title, filename, clean authors`.
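For instance, the `paper indexes` column of the authors dataframe can be assembled roughly as follows; this is a sketch with placeholder data, assuming `clean authors` holds a list of names per article:

```python
import pandas as pd

# Placeholder articles dataframe; `clean authors` holds a list per row.
articles = pd.DataFrame({
    "clean authors": [["Jane Q. Doe", "John Roe"], ["Jane Q. Doe"]],
})

# Collect the indexes of each author's papers (column `paper indexes`).
paper_indexes = {}
for idx, names in articles["clean authors"].items():
    for name in names:
        paper_indexes.setdefault(name, []).append(idx)

authors = pd.DataFrame({
    "author name": list(paper_indexes),
    "paper indexes": list(paper_indexes.values()),
})
authors.to_csv("authors-dataframe.csv", index=False)
```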