Tools: Collaboration with Transkribus

From MarineLives
Revision as of 16:07, February 6, 2018 by ColinGreenstreet (Talk | contribs)


MarineLives is exploring the use of Transkribus tools to scan signatures and markes at the bottom of High Court of Admiralty depositions as input to a proposed April 2018 Data Study Group with the Alan Turing Institute.

We are also structuring a project with the READ/Transkribus project team to explore key word spotting technology applied to English language legal records from the early and mid-C17th. The project is planned to start in April 2018 and to run through to October 2018.




Signature/Marke recognition



Goals


  • Structure a challenge for an Alan Turing Institute led Data Study Group to explore educational, occupational and other segmentation in MarineLives data concerning early C17th High Court of Admiralty deponents, without a priori assumptions as to possible groupings.
  • Provide a MarineLives data set to the Alan Turing Institute to support the challenge to the proposed Data Study Group, April 16th-20th, 2018. The data set would include digital files of signatures and markes affixed to the depositions by deponents, which will have been pre-processed using Transkribus' Handwritten Text Recognition (HTR) engine.
  • Data Study Groups take place three times a year and are a week long. Participating researchers are drawn from the Turing Institute's five founding universities (University of Cambridge, University of Edinburgh, University of Oxford, University College London, the University of Warwick), and the wider academic community. Director of the Data Study Groups is Turing Fellow and Associate Professor at Warwick, Dr Sebastian Vollmer. Past Data Study Group partners have been drawn from industry, and include Codecheck, Dstl and Inmarsat. The April 2018 DSG is unusual in that the theme is 'Data Science for Social Good', with Accenture covering the costs of participants.


Characterisation of data


  • MarineLives has a sizeable semi-structured data set of deponents who made witness statements in the English High Court of Admiralty in the first half of the seventeenth century. These data include information on name, place of residence, occupation and age, as well as the date of the deposition. The statements are handwritten, recorded in bound manuscript volumes held at the National Archives, Kew. They have been digitally imaged and are available in the MarineLives Semantic MediaWiki. Depositions are acknowledged with the signature or marke of the deponent at the end of the deposition, which would have been affixed after the written deposition was read aloud to the deponent.
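As a rough illustration of the fields listed above, one deponent record could be modelled as follows. This is a minimal sketch only: the class name `DeponentRecord`, its field names and the `image_ref` field are assumptions made for illustration, not the actual MarineLives schema. The two example records use only what the sample data on this page states (signatures of Nathaniel Morecocke at ages thirty-one and forty-one); unknown fields are left empty.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DeponentRecord:
    """One deposition by one deponent (hypothetical structure, not the wiki's schema)."""
    name: str
    age: Optional[int] = None
    occupation: Optional[str] = None
    place_of_residence: Optional[str] = None
    deposition_date: Optional[date] = None
    attestation: str = "signature"      # "signature" or "marke"
    image_ref: Optional[str] = None     # hypothetical pointer to the digital image

    def __post_init__(self) -> None:
        # A deposition is acknowledged either by a signature or by a marke.
        if self.attestation not in ("signature", "marke"):
            raise ValueError(f"unknown attestation type: {self.attestation!r}")


records = [
    DeponentRecord(name="Nathaniel Morecocke", age=31, attestation="signature"),
    DeponentRecord(name="Nathaniel Morecocke", age=41, attestation="signature"),
]

# Group depositions by deponent so that signatures made at different ages
# by the same person can be compared side by side.
by_name = defaultdict(list)
for record in records:
    by_name[record.name].append(record)

ages = [r.age for r in by_name["Nathaniel Morecocke"]]
print(ages)
```

Keeping the attestation type as an explicit field would let any later grouping of signatures and markes be related back to occupation, residence or age without re-reading the manuscript images.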




SMW platform and API


Ask API generated output from MarineLives wiki

The MarineLives wiki is built on a PHP-based stack:
- MediaWiki
- Semantic MediaWiki extension to allow storage and querying of data across pages
- Semantic Forms extension to allow editing of pages as structured data
- Custom extensions for folio navigation, basic transcription, and improved behaviour to match transcription expectations.
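Because the wiki runs Semantic MediaWiki, structured data can in principle be retrieved through the `action=ask` module of the MediaWiki API. The sketch below only builds such a request URL and makes no network call. The endpoint `https://example.org/w/api.php`, the category `Deposition` and the property names `Occupation`, `Age` and `Place of residence` are placeholders chosen for illustration; the real MarineLives endpoint and property names are not given on this page.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the wiki's real api.php location.
API_URL = "https://example.org/w/api.php"


def build_ask_url(conditions, printouts, limit=10, api_url=API_URL):
    """Build a URL for Semantic MediaWiki's `action=ask` API module.

    conditions -- selection criteria, e.g. "Category:Deposition"
    printouts  -- properties to return for each matching page
    limit      -- maximum number of results requested
    """
    # Ask syntax: [[condition]]...|?Printout...|limit=N
    query = "".join(f"[[{c}]]" for c in conditions)
    query += "".join(f"|?{p}" for p in printouts)
    query += f"|limit={limit}"
    params = {"action": "ask", "query": query, "format": "json"}
    return f"{api_url}?{urlencode(params)}"


# Category and property names below are assumptions, not the wiki's own.
url = build_ask_url(
    conditions=["Category:Deposition", "Occupation::Mariner"],
    printouts=["Age", "Place of residence"],
)
print(url)
```

The JSON returned by an Ask query lists matching pages under `query` and `results`, each with a `printouts` entry holding the requested property values.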




Sample data


Eleven signatures and one marke from mariners & landsmen on board the English whaling ship the Owners Adventure, 1656 - Could AI be used to categorise these signatures & markes visually and to relate any groupings to occupational, residential, age or other data?
Signatures of Nathaniel Morecocke at ages of thirty-one and forty-one, High Court of Admiralty


Methodology


[ADD DATA]



Team


Colin Greenstreet (MarineLives)



Opportunities to get involved


[ADD DATA]



Bibliography


[ADD DATA]



Key word spotting


The Transkribus Handwritten Text Recognition (HTR) engine has been trained on mid-C17th High Court of Admiralty records.


Goals


  • Build and test an interface for historical researchers to interact with Transkribus key word spotting technology.
  • Enable searching for words directly in the manuscript image, rather than in a text-rendered transcription.


Methodology


[ADD DATA]



Team


Colin Greenstreet (MarineLives)
Michael Bennett (Sheffield)



Opportunities to get involved


[ADD DATA]



Bibliography


Günter Mühlberger, Presentation: Transkribus - the status quo and future plans, YouTube video, pub. Jan. 23, 2018[1]

Transkribus/VCG-DUTH-Word_Spotting_By_Example, GitHub, 2018[2]

Interactive Keywords Spotting Tool, showcased at European Researchers' Night, READ blog entry, Oct. 23, 2017[3]

K. Zagoris, I. Pratikakis and B. Gatos, "Segmentation-Based Historical Handwritten Word Spotting Using Document-Specific Local Features," 2014 14th International Conference on Frontiers in Handwriting Recognition, Heraklion, 2014, pp. 9-14.

K. Zagoris, I. Pratikakis, and B. Gatos, “A framework for efficient transcription of historical documents using keyword spotting,” in Historical Document Imaging and Processing (HIP’15), 3rd International Workshop on, August 2015, pp. 9–14.

K. Zagoris, I. Pratikakis and B. Gatos, "Unsupervised Word Spotting in Historical Handwritten Document Images using Document-oriented Local Features," Transactions on Image Processing, 2017, under review.



Transkribus background


  1. Günter Mühlberger, Presentation: Transkribus - the status quo and future plans, YouTube video, pub. Jan. 23, 2018, accessed 06/02/2018
  2. Transkribus/VCG-DUTH-Word_Spoting_By_Example, GitHub, 2018, accessed 06/02/2018
  3. Interactive Keywords Spotting Tool, showcased at European Researchers' Night, READ blog entry, Oct. 23, 2017, accessed 06/02/2018