Difference between revisions of "Tools: Collaboration with Transkribus"

From MarineLives

Revision as of 15:12, February 6, 2018

MarineLives is exploring the use of Transkribus tools to scan signatures and markes at the bottom of High Court of Admiralty depositions as input to a proposed April 2018 Data Study Group with the Alan Turing Institute.

We are also structuring a project with the READ/Transkribus project team to explore key word spotting technology applied to English language legal records from the early and mid-C17th. The project is planned to start in April 2018 and to run through to October 2018.




Signature/Marke recognition



Goals


  • Structure a challenge for an Alan Turing Institute-led Data Study Group to explore educational, occupational and other segmentation in MarineLives data concerning early C17th High Court of Admiralty deponents, without a priori assumptions as to possible groupings.
  • Provide a MarineLives data set to the Alan Turing Institute to support the challenge to the proposed Data Study Group, April 16th-20th, 2018. The data set would include digital files of signatures and markes affixed to the depositions by deponents, which will have been pre-processed using Transkribus' Handwriting Text Recognition engine.
  • Data Study Groups take place three times a year and are a week long. Participating researchers are drawn from the Turing Institute's five founding universities (University of Cambridge, University of Edinburgh, University of Oxford, University College London, the University of Warwick), and the wider academic community. The Director of the Data Study Groups is Dr Sebastian Vollmer, Turing Fellow and Associate Professor at Warwick. Past Data Study Group partners have been drawn from industry, and include Codecheck, Dstl and Inmarsat. The April 2018 DSG is unusual in that the theme is 'Data Science for Social Good', with Accenture covering the costs of participants.


Characterisation of data


MarineLives has a sizeable semi-structured data set of deponents who made witness statements in the English High Court of Admiralty in the first half of the seventeenth century. The data include each deponent's name, place of residence, occupation and age, as well as the date of the deposition. The statements are handwritten in bound manuscript volumes held at the National Archives, Kew. They have been digitally imaged and are available in the MarineLives Semantic MediaWiki.
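
As a rough illustration of how one deponent record might be held once exported, the sketch below models the fields named above (name, place of residence, occupation, age, deposition date). The class name, field names, the extra signed_with_marke flag and the sample rows are all assumptions made for this example; they are not the wiki's actual schema or real MarineLives data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deponent:
    """One deponent record. Field names are illustrative, not the wiki's schema."""
    name: str
    residence: Optional[str]          # semi-structured data: any field may be missing
    occupation: Optional[str]
    age: Optional[int]
    deposition_date: Optional[date]
    signed_with_marke: Optional[bool] = None  # hypothetical flag: marke rather than signature


def occupation_counts(records: Iterable[Deponent]) -> Counter:
    """Tally records by occupation, grouping missing values under 'unknown'."""
    return Counter(r.occupation or "unknown" for r in records)


# Invented placeholder rows, used only to exercise the code above.
SAMPLE = [
    Deponent("Deponent A", "Wapping", "mariner", 31, date(1656, 3, 1), False),
    Deponent("Deponent B", "Ratcliff", "mariner", 45, date(1656, 3, 2), True),
    Deponent("Deponent C", None, None, 28, None, None),
]

if __name__ == "__main__":
    for occupation, count in occupation_counts(SAMPLE).most_common():
        print(occupation, count)
```

Keeping every field optional reflects the semi-structured nature of the data: a record with gaps can still be loaded and counted.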



SMW platform and API


Ask API generated output from MarineLives wiki

The MarineLives wiki is built on a PHP-based stack:
- MediaWiki
- Semantic MediaWiki extension to allow storage and querying of data across pages
- Semantic Forms extension to allow editing of pages as structured data
- Custom extensions for folio navigation, basic transcription, and improved behaviour to match transcription expectations.

For more information, see the MarineLives wiki page "Tools: Semantic Media Wiki".
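
Semantic MediaWiki exposes its query language through the MediaWiki web API as action=ask. The sketch below shows one way a query URL could be built and a JSON response flattened into rows. The endpoint address, the category name "Deponent", the property names "Occupation" and "Age", and the sample response are assumptions for illustration only; the real MarineLives property names and endpoint may differ.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the api.php address of the wiki being queried.
API_ENDPOINT = "https://wiki.example.org/api.php"


def build_ask_url(conditions, printouts, limit=10, endpoint=API_ENDPOINT):
    """Build a Semantic MediaWiki action=ask URL from conditions and printouts."""
    query = conditions + "".join(f"|?{p}" for p in printouts) + f"|limit={limit}"
    params = {"action": "ask", "query": query, "format": "json"}
    return endpoint + "?" + urlencode(params)


def rows_from_ask_response(payload):
    """Flatten an ask response into one dict per result page."""
    results = payload.get("query", {}).get("results", {})
    if not isinstance(results, dict):
        # An empty result set may arrive as a list rather than an object.
        return []
    rows = []
    for page, data in results.items():
        row = {"page": page}
        row.update(data.get("printouts", {}))
        rows.append(row)
    return rows


# Invented response in the general shape of ask output; not real wiki data.
SAMPLE_RESPONSE = json.loads(
    '{"query": {"results": {"Example page": '
    '{"printouts": {"Occupation": ["Mariner"], "Age": [31]}, '
    '"fulltext": "Example page"}}}}'
)

if __name__ == "__main__":
    print(build_ask_url("[[Category:Deponent]]", ["Occupation", "Age"]))
    print(rows_from_ask_response(SAMPLE_RESPONSE))
```

The example deliberately stops short of making a network request, so it can be run offline; fetching the URL with any HTTP client and passing the decoded JSON to rows_from_ask_response would complete the round trip.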



Sample data


Eleven signatures and one marke from mariners & landsmen on board the English whaling ship the Owners Adventure, 1656 - Could AI be used to categorise these signatures & markes visually and to relate any groupings to occupational, residential, age or other data?
Signatures of Nathaniel Morecocke at ages of thirty-one and forty-one, High Court of Admiralty


Methodology


[ADD DATA]



Team


Colin Greenstreet (MarineLives)



Opportunities to get involved


[ADD DATA]



Bibliography


[ADD DATA]



Key word spotting


Transkribus Handwriting Text Recognition (HTR) engine has been trained on mid-C17th High Court of Admiralty Court records


Goals


  • Build and test an interface for historical researchers to interact with Transkribus key word spotting technology.
  • Search for words directly in the manuscript image, rather than in the text-rendered transcription.
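
To make the idea of searching the image rather than the transcription concrete, here is a deliberately simple query-by-example sketch. It is not the Transkribus method: it swaps in a toy descriptor (the fraction of inked pixels per column) and ranks candidate word regions by Euclidean distance to a query image. The function names, the tiny binary images and the region labels are all invented for this example.

```python
import math


def column_profile(image):
    """Fraction of inked pixels in each column of a binary image (rows of 0/1)."""
    height = len(image)
    width = len(image[0])
    return [sum(row[x] for row in image) / height for x in range(width)]


def resample(profile, n=8):
    """Nearest-index resampling so that words of different widths can be compared."""
    size = len(profile)
    return [profile[min(size - 1, int(i * size / n))] for i in range(n)]


def distance(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def spot(query, candidates, n=8):
    """Return (distance, label) pairs for all candidates, closest match first."""
    q = resample(column_profile(query), n)
    scored = [
        (distance(q, resample(column_profile(img), n)), label)
        for label, img in candidates.items()
    ]
    return sorted(scored)


# Tiny invented "word images": 1 = ink, 0 = background.
QUERY = [[1, 0, 0, 1], [1, 1, 1, 1], [1, 0, 0, 1]]
CANDIDATES = {
    "region_a": [[1, 0, 0, 1], [1, 1, 1, 1], [1, 0, 0, 1]],
    "region_b": [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]],
    "region_c": [[1, 0, 0, 0, 0, 1], [1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 1]],
}

if __name__ == "__main__":
    for score, label in spot(QUERY, CANDIDATES):
        print(f"{label}: {score:.3f}")
```

The point of the sketch is the workflow: the researcher supplies an example image of a word, and the system returns image regions ranked by visual similarity, with no transcription involved. Production systems use far richer local features than a column profile.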


Methodology


[ADD DATA]



Team


Colin Greenstreet (MarineLives)
Michael Bennett (Sheffield)



Opportunities to get involved


[ADD DATA]



Bibliography


Günter Mühlberger, Presentation: Transkribus - the status quo and future plans, YouTube video, pub. Jan. 23, 2018[1]

Transkribus/VCG-DUTH-Word-Spotting-by-Example, GitHub, 2018[2]

Interactive Keywords Spotting Tool, showcased at European Researchers' Night, READ blog entry, Oct. 23, 2017[3]

K. Zagoris, I. Pratikakis and B. Gatos, "Segmentation-Based Historical Handwritten Word Spotting Using Document-Specific Local Features," 2014 14th International Conference on Frontiers in Handwriting Recognition, Heraklion, 2014, pp. 9-14.

K. Zagoris, I. Pratikakis and B. Gatos, "A framework for efficient transcription of historical documents using keyword spotting," in Historical Document Imaging and Processing (HIP'15), 3rd International Workshop on, August 2015, pp. 9-14.

K. Zagoris, I. Pratikakis and B. Gatos, "Unsupervised Word Spotting in Historical Handwritten Document Images using Document-oriented Local Features," Transactions on Image Processing, 2017. Under review.



Transkribus background


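Transkribus Key Figures, slide from Günter Mühlberger, Presentation: Transkribus - the status quo and future plans, YouTube video, pub. Jan. 23, 2018

  1. Günter Mühlberger, Presentation: Transkribus - the status quo and future plans, YouTube video, pub. Jan. 23, 2018, https://www.youtube.com/watch?v=_7V8iXFzx_E&feature=youtu.be, accessed 06/02/2018
  2. Transkribus/VCG-DUTH-Word-Spotting-by-Example, GitHub, 2018, https://github.com/Transkribus/VCG-DUTH-Word-Spotting-by-Example, accessed 06/02/2018
  3. Interactive Keywords Spotting Tool, showcased at European Researchers' Night, READ blog entry, Oct. 23, 2017, https://read.transkribus.eu/2017/10/23/researchers-night/, accessed 06/02/2018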