Warwick Week Seven

From MarineLives
Revision as of 13:22, May 1, 2018 by ColinGreenstreet (Talk | contribs)


Proposed Agenda: 4-6 pm, Tuesday, May 1st, 2018

Reviewing work from last few weeks and trying some new material


Opportunities to work with MarineLives and/or the Signs of Literacy community over the summer


Assisting with the organisation of a MarineLives Virtual Summer School in July, with a possible repeat in August


I am musing about running a Virtual Summer School in July 2018, with a possible repeat in August 2018. This would be organised by Chronoscopic Education & MarineLives, with the potential for involvement by other organisations as faculty or supporters.

MarineLives has run successful summer events before, ranging from Summer Transcription programmes in 2013, 2014 and 2015, to a Digital PopUp Lab in 2016 (with a focus on exploiting the technical features of our Semantic MediaWiki).

The Virtual Summer School idea is at an early stage, and if I am going to go with it I would want to decide by the end of May at the latest. I don't require funding: the cash cost is minuscule (simply creating a Google Group and using our existing Zoom Meetings capability), plus some organisational and faculty time. We would make no charge to participants. Recruitment of participants would be through the MarineLives Twitter account, and through the Twitter accounts of any supporting organisations.

Depending on how many people show an interest, I might put in place a group of three or four volunteer facilitators or moderators, who would work with participants AFTER video calls to answer technical questions and to promote discussion. Twenty participants would be a good target - I would propose to target participants in the GLAM sector, plus PhD candidates and Early Career Scholars working in history (and corpus linguistics).

The trick to these things is to create a bit of buzz AND to keep the time commitment manageable, so that participants are not put off, and to allow for varying degrees of subsequent engagement. We find such programmes a great way to build community spirit and to generate future volunteers for our activities. Hence I am proposing that the minimum commitment be attendance at four one-hour video calls spread out over four weeks in July (with a possible repeat for a second batch of participants in August). For those interested, there would be follow-on documents to read and practice exercises, made available on a private Virtual Summer School Google Group, which I (and possibly others) would moderate, facilitate and support.

Creating a tagged dataset for a Kaggle Research Competition in late 2018


Please briefly describe the overall objective of your competition:
Social venture Chronoscopic Education and the MarineLives project team have recently initiated a new Signs of Literacy community with colleagues Dr Mark Hailwood (Bristol) and Mark Ponte (Stadsarchief Amsterdam) to explore technology enhanced approaches to the study of historical literacy (1550-1750).

This is an Anglo-Dutch initiative, with data sets drawn from the English High Court of Admiralty (National Archives, Kew), the Alle Amsterdamse Akten (notarial archives at Stadsarchief Amsterdam) and the Dutch Prize Papers (data held in a Virtual Research Environment, sourced from National Archives, Kew).

We are able to provide multiple tagged large scale datasets to the Kaggle community later in 2018 to develop a Proof of Concept.

The Proof of Concept will contain two parts:

(1) Algorithmic identification of markes, initials and signatures

(2) Algorithmic discrimination between degrees of "sophistication" within the three categories of "marke", "initial(s)" and "signature"

Having proved the concept, we would then propose to seek out an image or vision oriented computational laboratory with which to develop a grant funded collaboration to take the work further in 2019 and beyond.

We originally considered a five-day Data Study Group with the Alan Turing Institute, but have been advised that a Kaggle Research Competition would be more effective for a Proof of Concept.

https://github.com/Signsofliteracy/Signoff/wiki
https://github.com/Signsofliteracy/Signoff/blob/master/MarineLives_Turing-Accenture%20DSG%20science%20challenges_DEFINITIVE_THREE_12022018.pdf
https://www.slideshare.net/ColinGreenstreet/c17th-signatures-and-markes-88587464

Please briefly describe the contents of the dataset:
Markes, initials and signatures from C17th manuscript images.

Available as snippets, or as full manuscript pages with text areas marked up in XML.

We would appreciate advice on suggested data set size and tagging

We can deliver a tagged dataset of 6,000 markes, initials and signatures relatively quickly. We aspire to work on much larger data sets of 50,000 markes, initials and signatures, and indeed on data sets one order of magnitude larger.

Examples of snippet type data:

https://github.com/Signsofliteracy/Signoff/blob/master/Signsofliteracy_C17_Alphabet_Of_Initials_4th_Edn_04042018.pdf

What is the target variable?
(1) Discrimination between marke, initial(s) and signature; (2) sophistication of execution of members of each class
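As a rough illustration only, the two target variables could be carried on each tagged snippet along the following lines. This is a minimal Python sketch: the names TaggedSnippet, SignType and class_counts, the identifier format, and the three-point sophistication scale are all assumptions made for the example and do not come from an agreed tagging scheme.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class SignType(Enum):
    """Target (1): the three categories named in the proposal."""
    MARKE = "marke"
    INITIALS = "initials"
    SIGNATURE = "signature"


@dataclass(frozen=True)
class TaggedSnippet:
    """One tagged image snippet (hypothetical record layout)."""
    snippet_id: str       # assumed identifier, e.g. a file name
    sign_type: SignType   # target (1): marke, initials or signature
    sophistication: int   # target (2): assumed ordinal scale, 1 (low) to 3 (high)

    def __post_init__(self):
        # Reject scores outside the assumed 1-3 scale.
        if not 1 <= self.sophistication <= 3:
            raise ValueError("sophistication must be between 1 and 3")


def class_counts(snippets):
    """Count snippets per category, to check class balance before training."""
    return Counter(s.sign_type for s in snippets)
```

A table of such records would let a first model be trained on sign_type alone, with sophistication treated as a second, ordinal target within each category.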

Is there anything else about your competition or dataset that would be helpful for us to know?
We are new to Kaggle (and indeed to Machine Learning), so this will be a learning experience for us. What Kaggle volunteers would get is (1) our concentrated interest and attention, (2) access to a novel and large scale tagged dataset for training and prediction, and (3) access to academic historians and linguists with a strong interest in exploring this technology.

We see potential in this proposed competition and approach to revolutionise the scale at which historians and linguists work, and to introduce them to machine learning in the context of the study of historical literacy.

Working on a conceptual framework for an Early Modern Maritime Gazetteer, in collaboration with MarineLives, Pelagios/Recogito and the World-Historical Gazetteer project


Feedback to Ben and Colin on the Warwick Transcription Programme


- What did we get right?

- What could we do better?

- Would you recommend the programme to friends, and if so, why?

- Should this programme be run again in 2019/20?

And finally, a big thanks to you all