http://www.marinelives.org/api.php?action=feedcontributions&user=ColinGreenstreet&feedformat=atomMarineLives - User contributions [en-gb]2024-03-29T11:26:46ZUser contributionsMediaWiki 1.25alphahttp://www.marinelives.org/index.php?title=MarineLives&diff=130972MarineLives2022-03-29T22:04:36Z<p>ColinGreenstreet: /* MarineLives research agenda for 2022 */</p>
<hr />
<div>__TOC__<br />
----<br />
==Celebrating our tenth anniversary this year==<br />
<br />
'''To celebrate the tenth anniversary of the founding of Marine Lives we aim to make publicly available and searchable by the end of 2022 a high quality machine transcription of 34 million words of English High Court of Admiralty depositions, 1574 to 1688.'''<br />
<br />
[[File:Our HTR Goals Revised 11092022.jpg|1000px|thumb|left|Celebrating our tenth anniversary]]<br />
<br />
[[File:Model 3.322 Three 12032022.jpg|1000px|thumb|left|We are already getting excellent results from our machine transcription models]]<br />
<br />
[[File:Model 3.322 Two 12032022.jpg|1000px|thumb|left|This machine transcribed test page is the start of a deposition made in 1630 in the English High Court of Admiralty. There has been no manual correction of the transcription.]]<br />
<br />
[[File:Model 3.322 Four 12032022.jpg|1000px|thumb|left|Here are the learning curves and formal performance characteristics of our Series Three C17th Secretary Hand models optimised for English High Court of Admiralty depositions]]<br />
----<br />
==MarineLives research agenda for 2022==<br />
<br />
'''Marine Lives celebrates its tenth anniversary in 2022. To mark reaching the remarkable digital age of ten we announced in March 2022 our goal to publish by the end of 2022 a high quality machine transcription of sixty volumes of English High Court of Admiralty depositions for the period 1574 to 1688. This initiative will use handwriting recognition models we are developing for C17th Secretary Hand and links to the first point in our four point research agenda for 2022:'''<br />
<br />
1. Developing and optimising keyword search algorithms for C17th Secretary Hand in English language English High Court of Admiralty manuscript documents.<br />
<br />
2. Completing the next phase of our metadata for the HCA 13/ series. Currently, we have metadata for 22,000 deponents from the years 1570 to 1688, supported by 50,000 digital images held offline.<br />
<br />
3. Publishing a [http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page#Work_product_by_the_HCA_30.2F636_team hybrid digital edition of three ship account books from the 1620s and 1630s], taken from HCA 30/636.<br />
<br />
4. Organising two workshops with the Oldenburg Prize Papers team on (a) Developing a broadly applicable digital C16th and C17th commercial document ontology (b) Forming and developing an online international community to study mariner letters.<br />
<br />
----<br />
<br />
==Creating a Ground Truth==<br />
<br />
'''We are creating a bespoke HTR model to read C17th English Secretary Hand. We plan two models. The first will use 500,000 words from our existing diplomatic transcriptions of HCA 13/72. The second will be twice the size, adding a further 500,000 words from an earlier volume of HCA depositions.'''<br />
<br />
[[File:Contact Us 10032022.jpg|750px|thumb|left|Contact Colin Greenstreet (Marine Lives) or Holly Brewer (UMD) if you would like to learn more about our work on machine recognition of C17th Secretary Hand]]<br />
<br />
Below is an image showing the first manuscript page from HCA 13/72, which we have now entered into Transkribus.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth Click here] to view pages we are developing to illustrate practical aspects of creating a Ground Truth<br />
<br />
[[File:HCA 1372 f1r GROUND TRUTH 02032022.png|750px|thumb|left|HCA 13/72 f.1r: entered into Transkribus as part of the Ground Truth we are creating]]<br />
<br />
<br />
----<br />
<br />
==Fantasy Early Modern book competition==<br />
<br />
'''Announcing a one week FANTASY EARLY MODERN BOOK COMPETITION.'''<br />
<br />
'''Rules: List the chapter titles of a fantasy book you WISH EXISTED and post to the [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.'''<br />
<br />
'''Winner(s) will be those book titles/chapters which get MOST LIKES ON THE [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.''' <br />
<br />
Here is my starter:<br />
<br />
TITLE: <u>Early Modern Economic Lives</u><br />
<br />
CHAPTER 1: Talking about work: Early Modern workers describing their occupations and work places<br />
<br />
CHAPTER 2: The melding of work and home and its implications for participation of men and women in commercial life<br />
<br />
CHAPTER 3: Everyone lies: The importance of accurate record keeping<br />
<br />
CHAPTER 4: The role of the Early Modern bookkeeper [https://www.amazon.co.uk/Accounting-Growth-Stripping-Camouflage-Accounts/dp/0712652809 accounting for growth]<br />
<br />
CHAPTER 5: Service driven functional literacy: How Early Modern international trade drove and was a product of literacy<br />
<br />
CHAPTER 6: Risk and probability: How people thought about individual and group risk, and how it affected their commercial decision making<br />
<br />
CHAPTER 7: Free will, contract, indentured labour, enslavement: Concepts underpinning the Early Modern labour market<br />
<br />
CHAPTER 8: How to make Early Modern money (and keep it)<br />
<br />
'''Here's a possible plan. Assemble a group of authors, recruited through Twitter, who will write and publish the Fantasy EM book as an open source online book. The book would be peer reviewed by its readers, and would go through various iterations as readers comment, suggest and offer to improve.'''<br />
<br />
'''If we go with Early Modern Economic Lives (and I'm happy to go with a different Fantasy EM book), I (Colin Greenstreet) would be happy to write a draft of chapters three, four and five, but would need collaborators to write the remaining chapters.'''<br />
<br />
'''What do you think?'''<br />
----<br />
<br />
==New Year's wishes 2022==<br />
<br />
'''2022 is the tenth anniversary of the founding of Marine Lives. So here are our New Year's wishes for the next ten years for Marine Lives and for Early Modern studies generally.'''<br />
<br />
<u>New Year's Wish One</u><br />
<br />
Digitisation and open licence publication of the C17th manuscript records of the [https://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+CLA~2F004?SESSIONSEARCH Mansion House Justice Room] and of the [https://www.westminster-abbey.org/about-the-abbey/library-research/muniment-collection Westminster Abbey Muniments].<br />
<br />
<u>New Year's Wish Two</u><br />
<br />
Development of AI-enabled search tools for discovery within and synthesis of large scale digitised manuscript collections, bypassing the need for the manual creation of archival metadata.<br />
<br />
<u>New Year's Wish Three</u><br />
<br />
Implementation of federated search using AI-enabled search tools across multiple large scale digitised manuscript collections.<br />
<br />
<u>New Year's Wish Four</u><br />
<br />
Development of Early Modern Material Lives to complement Early Modern Marine Lives, broadening scope from marine to land based occupations, and emphasising the interaction of Early Modern workers with the physical world.<br />
<br />
'''[http://www.marinelives.org/wiki/New_Year%27s_wishes,_2018 Click here] to see our 2018 New Year's wishes.'''<br />
<br />
'''Please Tweet your comments on our New Year's wishes to [https://twitter.com/Marinelivesorg @Marinelivesorg] and share your own New Year's wishes for technologies to support historical research'''<br />
<br />
----<br />
<br />
==New MarineLives project: Researching three ship account books from the 1620s and 1630s: HCA 30/636/==<br />
<br />
[[File:Ship Diamond August 1637 Setting Out To Mallaga f,61r Ratcatcher Image 20212029 145150.JPG|620px|thumb|left|Item: Paid the Ratcatcher: 4''s'' - 4''d'', Account book of the ship the ''Diamond'': setting out expenses to Mallaga, 1637; HCA 13/635/2]]<br />
<br />
[[File:Account Books 13102021.JPG|800px|thumb|left|HCA 30/636 in all its glory, just waiting for some collaborative work by volunteers]]<br />
<br />
'''Are you interested in a startup collaborative online project to look at, partially transcribe and understand three ship account books from the 1620s and 1630s? You have come to the right place. Marine Lives is launching a new project and is seeking volunteer collaborators. This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''<br />
<br />
HCA 30/636 is a document category which has been created to cover certain papers generated by the Prize Court jurisdiction of the English High Court of Admiralty. It contains nine sub-references. We have imaged all the documents within HCA 30/636 and will be making these available to volunteer collaborators online. Documents include three beautifully leather bound account books of various sizes, further paper bound account books, a letter copy book of letters written from on board ship, and various miscellaneous accounting documents relating to multiple voyages. In all we have over one thousand images.<br />
<br />
We are in start up mode. Our current thinking is to make the images available on DropBox or OneDrive and to use this MarineLives wiki as our collaboration platform - to share ideas, to provide support, and to be the vehicle to publish our transcriptions and synthesis. But we are open to your ideas about how to organise this project and nothing will be finalised until we have our team in place. You can get up to speed on our thinking by reading this [https://twitter.com/Marinelivesorg/status/1448320605384753156 Twitter Thread].<br />
<br />
We have had expressions of interest from people from many places - Mexico, Michigan, Texas, London, Newcastle to list a few - which is perfect given the virtual nature of our project and the broad geographic scope of the papers which include multiple voyages from England to the West Indies, the Mediterranean and to Northern Europe.<br />
<br />
We will be sending out an email this weekend (Saturday, October 16th 2021) to everyone who has expressed interest, and will invite people in that email to take a look at some sample images and to tell us about their research interests, skills and ideas for this project.<br />
<br />
This is going to be a very relaxed project running through to the middle of 2022 in which people are welcome to dip in and out, and to do as little or as much as they have time and interest for.<br />
<br />
If you are interested in learning more, '''[https://twitter.com/Marinelivesorg follow Marine Lives on Twitter]''', tweet your interest and we will get in touch with you by Twitter direct message.<br />
<br />
You can also '''[http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page access our HCA 30/636 account book project home page here]''', which we are starting to populate. Come join us and help us fill in the blanks.<br />
----<br />
==Women and Early Modern record keeping==<br />
<br />
'''In November 2021, MarineLives participated in an online seminar on Women and Early Modern recordkeeping, co-hosted by [https://oieahc.wm.edu/fellowships/fellows-since-1945/postdoctoral/caylin-carbonell/ Caylin Carbonell] and [http://www.marinelives.org/wiki/Volunteer_research_interests_and_goals#Colin_Greenstreet Colin Greenstreet]. We hope to develop some of the themes of this seminar in 2022.'''<br />
<br />
Our contribution to the seminar was a series of case studies from C17th English High Court of Admiralty depositions, in which [http://www.marinelives.org/wiki/Women%27s_investment_activities_and_record_keeping '''women testify about their investment activities and record keeping.''']<br />
----<br />
==About MarineLives==<br />
<br />
[[File:About MarineLives 22012018.JPG|800px|thumb|left|MarineLives volunteers, past and present, 2015-2018. [[Tools: Biographies|Read more about them here]]]]<br />
<br />
'''MarineLives is a collaborative volunteer driven project. The project started as a spinoff from a National Archives hackathon in early 2012. We are exploring lives touched by the marine world between 1540 and 1690: commerce, materials, language and correspondence.'''<br />
<br />
At the core of MarineLives is the collaborative transcription, linkage and enrichment of primary manuscripts from the English High Court of Admiralty, together with thematically related manuscripts from international manuscript and printed document collections.<br />
<br />
In the past ten years over 250 volunteers have contributed to our transcriptions and to our synthesis of the many themes which constitute lives in the Early Modern marine world. Currently, we have {{NUMBEROFARTICLES}} text pages and {{NUMBEROFFILES}} images available and nearly six million words of full text transcriptions on the MarineLives wiki.<br />
<br />
We have finding aids for themes as varied as Early Modern women in the marine world; Materials handling; The Early Modern River Thames; Commercial record keeping; Mariners' letters; and many more.<br />
<br />
We have also developed a database of 22,185 depositions drawn from the HCA 13/ series covering the period 1575 to 1684, which provides quantitative and qualitative insights into this important series of Admiralty Court depositions.<br />
<br />
'''[[Tools: Basic wiki skills & palaeographical tips|Sample our training material to see if this could be for you.]]'''<br />
<br />
----<br />
<div style="float: right; vertical-align: bottom; margin-left: 20px; margin-bottom: 20px; background:#dcdcdc; border: 1px solid #b0c4de; width: 530px;"><br />
<div style="background: #b0c4de; padding: 5px 10px 5px 10px; font-size: larger; font-weight: bold;">Sample images</div><br />
<div style="padding: 10px;"><br />
<div style="font-size: normal;">'''This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''</div><br />
[[File:Abraham Hardy Account Book Wages HCA 30 636.JPG|500px|thumb|right|HCA 30/636/3 Andrew Hardey's account book for voyage to Barbados in the ship the Abraham- extract from wages schedules]]<br />
<br />
[[File:Abraham Handwriting Sample HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Handwriting sample]]<br />
<br />
[[File:Abraham Letter Copy Book HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Letter copy book from on board the ship the Abraham]]<br />
<br />
</div><br />
</div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=MarineLives&diff=130971MarineLives2022-03-12T11:56:24Z<p>ColinGreenstreet: </p>
<hr />
<div>__TOC__<br />
----<br />
==Celebrating our tenth anniversary this year==<br />
<br />
'''To celebrate the tenth anniversary of the founding of Marine Lives we aim to make publicly available and searchable by the end of 2022 a high quality machine transcription of 34 million words of English High Court of Admiralty depositions, 1574 to 1688.'''<br />
<br />
[[File:Our HTR Goals Revised 11092022.jpg|1000px|thumb|left|Celebrating our tenth anniversary]]<br />
<br />
[[File:Model 3.322 Three 12032022.jpg|1000px|thumb|left|We are already getting excellent results from our machine transcription models]]<br />
<br />
[[File:Model 3.322 Two 12032022.jpg|1000px|thumb|left|This machine transcribed test page is the start of a deposition made in 1630 in the English High Court of Admiralty. There has been no manual correction of the transcription.]]<br />
<br />
[[File:Model 3.322 Four 12032022.jpg|1000px|thumb|left|Here are the learning curves and formal performance characteristics of our Series Three C17th Secretary Hand models optimised for English High Court of Admiralty depositions]]<br />
----<br />
==MarineLives research agenda for 2022==<br />
<br />
'''Marine Lives celebrates its tenth anniversary in 2022. To mark reaching the remarkable digital age of ten we announced in March 2022 our goal to publish by the end of 2022 a high quality machine transcription of fifty nine volumes of English High Court of Admiralty depositions for the period 1574 to 1688. This initiative will use handwriting recognition models we are developing for C17th Secretary Hand and links to the first point in our four point research agenda for 2022:'''<br />
<br />
1. Developing and optimising keyword search algorithms for C17th Secretary Hand in English language English High Court of Admiralty manuscript documents.<br />
<br />
2. Completing the next phase of our metadata for the HCA 13/ series. Currently, we have metadata for 22,000 deponents from the years 1570 to 1688, supported by 50,000 digital images held offline.<br />
<br />
3. Publishing a [http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page#Work_product_by_the_HCA_30.2F636_team hybrid digital edition of three ship account books from the 1620s and 1630s], taken from HCA 30/636.<br />
<br />
4. Organising two workshops with the Oldenburg Prize Papers team on (a) Developing a broadly applicable digital C16th and C17th commercial document ontology (b) Forming and developing an online international community to study mariner letters.<br />
<br />
----<br />
<br />
==Creating a Ground Truth==<br />
<br />
'''We are creating a bespoke HTR model to read C17th English Secretary Hand. We plan two models. The first will use 500,000 words from our existing diplomatic transcriptions of HCA 13/72. The second will be twice the size, adding a further 500,000 words from an earlier volume of HCA depositions.'''<br />
<br />
[[File:Contact Us 10032022.jpg|750px|thumb|left|Contact Colin Greenstreet (Marine Lives) or Holly Brewer (UMD) if you would like to learn more about our work on machine recognition of C17th Secretary Hand]]<br />
<br />
Below is an image showing the first manuscript page from HCA 13/72, which we have now entered into Transkribus.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth Click here] to view pages we are developing to illustrate practical aspects of creating a Ground Truth<br />
<br />
[[File:HCA 1372 f1r GROUND TRUTH 02032022.png|750px|thumb|left|HCA 13/72 f.1r: entered into Transkribus as part of the Ground Truth we are creating]]<br />
<br />
<br />
----<br />
<br />
==Fantasy Early Modern book competition==<br />
<br />
'''Announcing a one week FANTASY EARLY MODERN BOOK COMPETITION.'''<br />
<br />
'''Rules: List the chapter titles of a fantasy book you WISH EXISTED and post to the [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.'''<br />
<br />
'''Winner(s) will be those book titles/chapters which get MOST LIKES ON THE [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.''' <br />
<br />
Here is my starter:<br />
<br />
TITLE: <u>Early Modern Economic Lives</u><br />
<br />
CHAPTER 1: Talking about work: Early Modern workers describing their occupations and work places<br />
<br />
CHAPTER 2: The melding of work and home and its implications for participation of men and women in commercial life<br />
<br />
CHAPTER 3: Everyone lies: The importance of accurate record keeping<br />
<br />
CHAPTER 4: The role of the Early Modern bookkeeper [https://www.amazon.co.uk/Accounting-Growth-Stripping-Camouflage-Accounts/dp/0712652809 accounting for growth]<br />
<br />
CHAPTER 5: Service driven functional literacy: How Early Modern international trade drove and was a product of literacy<br />
<br />
CHAPTER 6: Risk and probability: How people thought about individual and group risk, and how it affected their commercial decision making<br />
<br />
CHAPTER 7: Free will, contract, indentured labour, enslavement: Concepts underpinning the Early Modern labour market<br />
<br />
CHAPTER 8: How to make Early Modern money (and keep it)<br />
<br />
'''Here's a possible plan. Assemble a group of authors, recruited through Twitter, who will write and publish the Fantasy EM book as an open source online book. The book would be peer reviewed by its readers, and would go through various iterations as readers comment, suggest and offer to improve.'''<br />
<br />
'''If we go with Early Modern Economic Lives (and I'm happy to go with a different Fantasy EM book), I (Colin Greenstreet) would be happy to write a draft of chapters three, four and five, but would need collaborators to write the remaining chapters.'''<br />
<br />
'''What do you think?'''<br />
----<br />
<br />
==New Year's wishes 2022==<br />
<br />
'''2022 is the tenth anniversary of the founding of Marine Lives. So here are our New Year's wishes for the next ten years for Marine Lives and for Early Modern studies generally.'''<br />
<br />
<u>New Year's Wish One</u><br />
<br />
Digitisation and open licence publication of the C17th manuscript records of the [https://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+CLA~2F004?SESSIONSEARCH Mansion House Justice Room] and of the [https://www.westminster-abbey.org/about-the-abbey/library-research/muniment-collection Westminster Abbey Muniments].<br />
<br />
<u>New Year's Wish Two</u><br />
<br />
Development of AI-enabled search tools for discovery within and synthesis of large scale digitised manuscript collections, bypassing the need for the manual creation of archival metadata.<br />
<br />
<u>New Year's Wish Three</u><br />
<br />
Implementation of federated search using AI-enabled search tools across multiple large scale digitised manuscript collections.<br />
<br />
<u>New Year's Wish Four</u><br />
<br />
Development of Early Modern Material Lives to complement Early Modern Marine Lives, broadening scope from marine to land based occupations, and emphasising the interaction of Early Modern workers with the physical world.<br />
<br />
'''[http://www.marinelives.org/wiki/New_Year%27s_wishes,_2018 Click here] to see our 2018 New Year's wishes.'''<br />
<br />
'''Please Tweet your comments on our New Year's wishes to [https://twitter.com/Marinelivesorg @Marinelivesorg] and share your own New Year's wishes for technologies to support historical research'''<br />
<br />
----<br />
<br />
==New MarineLives project: Researching three ship account books from the 1620s and 1630s: HCA 30/636/==<br />
<br />
[[File:Ship Diamond August 1637 Setting Out To Mallaga f,61r Ratcatcher Image 20212029 145150.JPG|620px|thumb|left|Item: Paid the Ratcatcher: 4''s'' - 4''d'', Account book of the ship the ''Diamond'': setting out expenses to Mallaga, 1637; HCA 13/635/2]]<br />
<br />
[[File:Account Books 13102021.JPG|800px|thumb|left|HCA 30/636 in all its glory, just waiting for some collaborative work by volunteers]]<br />
<br />
'''Are you interested in a startup collaborative online project to look at, partially transcribe and understand three ship account books from the 1620s and 1630s? You have come to the right place. Marine Lives is launching a new project and is seeking volunteer collaborators. This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''<br />
<br />
HCA 30/636 is a document category which has been created to cover certain papers generated by the Prize Court jurisdiction of the English High Court of Admiralty. It contains nine sub-references. We have imaged all the documents within HCA 30/636 and will be making these available to volunteer collaborators online. Documents include three beautifully leather bound account books of various sizes, further paper bound account books, a letter copy book of letters written from on board ship, and various miscellaneous accounting documents relating to multiple voyages. In all we have over one thousand images.<br />
<br />
We are in start up mode. Our current thinking is to make the images available on DropBox or OneDrive and to use this MarineLives wiki as our collaboration platform - to share ideas, to provide support, and to be the vehicle to publish our transcriptions and synthesis. But we are open to your ideas about how to organise this project and nothing will be finalised until we have our team in place. You can get up to speed on our thinking by reading this [https://twitter.com/Marinelivesorg/status/1448320605384753156 Twitter Thread].<br />
<br />
We have had expressions of interest from people from many places - Mexico, Michigan, Texas, London, Newcastle to list a few - which is perfect given the virtual nature of our project and the broad geographic scope of the papers which include multiple voyages from England to the West Indies, the Mediterranean and to Northern Europe.<br />
<br />
We will be sending out an email this weekend (Saturday, October 16th 2021) to everyone who has expressed interest, and will invite people in that email to take a look at some sample images and to tell us about their research interests, skills and ideas for this project.<br />
<br />
This is going to be a very relaxed project running through to the middle of 2022 in which people are welcome to dip in and out, and to do as little or as much as they have time and interest for.<br />
<br />
If you are interested in learning more, '''[https://twitter.com/Marinelivesorg follow Marine Lives on Twitter]''', tweet your interest and we will get in touch with you by Twitter direct message.<br />
<br />
You can also '''[http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page access our HCA 30/636 account book project home page here]''', which we are starting to populate. Come join us and help us fill in the blanks.<br />
----<br />
==Women and Early Modern record keeping==<br />
<br />
'''In November 2021, MarineLives participated in an online seminar on Women and Early Modern recordkeeping, co-hosted by [https://oieahc.wm.edu/fellowships/fellows-since-1945/postdoctoral/caylin-carbonell/ Caylin Carbonell] and [http://www.marinelives.org/wiki/Volunteer_research_interests_and_goals#Colin_Greenstreet Colin Greenstreet]. We hope to develop some of the themes of this seminar in 2022.'''<br />
<br />
Our contribution to the seminar was a series of case studies from C17th English High Court of Admiralty depositions, in which [http://www.marinelives.org/wiki/Women%27s_investment_activities_and_record_keeping '''women testify about their investment activities and record keeping.''']<br />
----<br />
==About MarineLives==<br />
<br />
[[File:About MarineLives 22012018.JPG|800px|thumb|left|MarineLives volunteers, past and present, 2015-2018. [[Tools: Biographies|Read more about them here]]]]<br />
<br />
'''MarineLives is a collaborative volunteer driven project. The project started as a spinoff from a National Archives hackathon in early 2012. We are exploring lives touched by the marine world between 1540 and 1690: commerce, materials, language and correspondence.'''<br />
<br />
At the core of MarineLives is the collaborative transcription, linkage and enrichment of primary manuscripts from the English High Court of Admiralty, together with thematically related manuscripts from international manuscript and printed document collections.<br />
<br />
In the past ten years over 250 volunteers have contributed to our transcriptions and to our synthesis of the many themes which constitute lives in the Early Modern marine world. Currently, we have {{NUMBEROFARTICLES}} text pages and {{NUMBEROFFILES}} images available and nearly six million words of full text transcriptions on the MarineLives wiki.<br />
<br />
We have finding aids for themes as varied as Early Modern women in the marine world; Materials handling; The Early Modern River Thames; Commercial record keeping; Mariners' letters; and many more.<br />
<br />
We have also developed a database of 22,185 depositions drawn from the HCA 13/ series covering the period 1575 to 1684, which provides quantitative and qualitative insights into this important series of Admiralty Court depositions.<br />
<br />
'''[[Tools: Basic wiki skills & palaeographical tips|Sample our training material to see if this could be for you.]]'''<br />
<br />
----<br />
<div style="float: right; vertical-align: bottom; margin-left: 20px; margin-bottom: 20px; background:#dcdcdc; border: 1px solid #b0c4de; width: 530px;"><br />
<div style="background: #b0c4de; padding: 5px 10px 5px 10px; font-size: larger; font-weight: bold;">Sample images</div><br />
<div style="padding: 10px;"><br />
<div style="font-size: normal;">'''This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''</div><br />
[[File:Abraham Hardy Account Book Wages HCA 30 636.JPG|500px|thumb|right|HCA 30/636/3 Andrew Hardey's account book for voyage to Barbados in the ship the Abraham- extract from wages schedules]]<br />
<br />
[[File:Abraham Handwriting Sample HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Handwriting sample]]<br />
<br />
[[File:Abraham Letter Copy Book HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Letter copy book from on board the ship the Abraham]]<br />
<br />
</div><br />
</div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=File:Model_3.322_Four_12032022.jpg&diff=130970File:Model 3.322 Four 12032022.jpg2022-03-12T11:54:39Z<p>ColinGreenstreet: </p>
<hr />
<div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=MarineLives&diff=130969MarineLives2022-03-12T10:29:32Z<p>ColinGreenstreet: </p>
<hr />
<div>__TOC__<br />
----<br />
==Celebrating our tenth anniversary this year==<br />
<br />
'''To celebrate the tenth anniversary of the founding of Marine Lives we aim to make publicly available and searchable by the end of 2022 a high quality machine transcription of 34 million words of English High Court of Admiralty depositions, 1574 to 1688.'''<br />
<br />
[[File:Our HTR Goals Revised 11092022.jpg|1000px|thumb|left|Celebrating our tenth anniversary]]<br />
<br />
[[File:Model 3.322 Three 12032022.jpg|1000px|thumb|left|We are already getting excellent results from our machine transcription models]]<br />
<br />
[[File:Model 3.322 Two 12032022.jpg|1000px|thumb|left|This machine transcribed test page is the start of a deposition made in 1630 in the English High Court of Admiralty. There has been no manual correction of the transcription.]]<br />
----<br />
==MarineLives research agenda for 2022==<br />
<br />
'''Marine Lives celebrates its tenth anniversary in 2022. To mark reaching the remarkable digital age of ten we announced in March 2022 our goal to publish by the end of 2022 a high quality machine transcription of fifty nine volumes of English High Court of Admiralty depositions for the period 1574 to 1688. This initiative will use handwriting recognition models we are developing for C17th Secretary Hand and links to the first point in our four point research agenda for 2022:'''<br />
<br />
1. Developing and optimising keyword search algorithms for C17th Secretary Hand in English language English High Court of Admiralty manuscript documents.<br />
<br />
2. Completing the next phase of our metadata for the HCA 13/ series. Currently, we have metadata for 22,000 deponents from the years 1570 to 1688, supported by 50,000 digital images held offline.<br />
<br />
3. Publishing a [http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page#Work_product_by_the_HCA_30.2F636_team hybrid digital edition of three ship account books from the 1620s and 1630s], taken from HCA 30/636.<br />
<br />
4. Organising two workshops with the Oldenburg Prize Papers team on (a) developing a broadly applicable digital C16th and C17th commercial document ontology and (b) forming and developing an online international community to study mariner letters.<br />
<br />
----<br />
<br />
==Creating a Ground Truth==<br />
<br />
'''We are creating a bespoke HTR model to read C17th English Secretary Hand. We plan two models. The first will use 500,000 words from our existing diplomatic transcriptions of HCA 13/72. The second will be twice the size, adding a further 500,000 words from an earlier volume of HCA depositions.'''<br />
<br />
[[File:Contact Us 10032022.jpg|750px|thumb|left|Contact Colin Greenstreet (Marine Lives) or Holly Brewer (UMD) if you would like to learn more about our work on machine recognition of C17th Secretary Hand]]<br />
<br />
Below is an image showing the first manuscript page from HCA 13/72, which we have now entered into Transkribus.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth Click here] to view pages we are developing to illustrate practical aspects of creating a Ground Truth.<br />
<br />
[[File:HCA 1372 f1r GROUND TRUTH 02032022.png|750px|thumb|left|HCA 13/72 f.1r: entered into Transkribus as part of the Ground Truth we are creating]]<br />
<br />
<br />
----<br />
<br />
==Fantasy Early Modern book competition==<br />
<br />
'''Announcing a one week FANTASY EARLY MODERN BOOK COMPETITION.'''<br />
<br />
'''Rules: List the chapter titles of a fantasy book you WISH EXISTED and post to the [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.'''<br />
<br />
'''Winner(s) will be those book titles/chapters which get MOST LIKES ON THE [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.''' <br />
<br />
Here is my starter:<br />
<br />
TITLE: <u>Early Modern Economic Lives</u><br />
<br />
CHAPTER 1: Talking about work: Early Modern workers describing their occupations and work places<br />
<br />
CHAPTER 2: The melding of work and home and its implications for participation of men and women in commercial life<br />
<br />
CHAPTER 3: Everyone lies: The importance of accurate record keeping<br />
<br />
CHAPTER 4: The role of the Early Modern bookkeeper [https://www.amazon.co.uk/Accounting-Growth-Stripping-Camouflage-Accounts/dp/0712652809 accounting for growth]<br />
<br />
CHAPTER 5: Service driven functional literacy: How Early Modern international trade drove and was a product of literacy<br />
<br />
CHAPTER 6: Risk and probability: How people thought about individual and group risk, and how it affected their commercial decision making<br />
<br />
CHAPTER 7: Free will, contract, indentured labour, enslavement: Concepts underpinning the Early Modern labour market<br />
<br />
CHAPTER 8: How to make Early Modern money (and keep it)<br />
<br />
'''Here's a possible plan. Assemble a group of authors, recruited through Twitter, who will write and publish the Fantasy EM book as an open source online book. The book would be peer reviewed by its readers, and would go through various iterations as readers comment, suggest and offer improvements.'''<br />
<br />
'''If we go with Early Modern Economic Lives (and I'm happy to go with a different Fantasy EM book), I (Colin Greenstreet) would be happy to write a draft of chapters three, four and five, but would need collaborators to write the remaining chapters.'''<br />
<br />
'''What do you think?'''<br />
----<br />
<br />
==New Year's wishes 2022==<br />
<br />
'''2022 is the tenth anniversary of the founding of Marine Lives. So here are our New Year's wishes for the next ten years for Marine Lives and for Early Modern studies generally.'''<br />
<br />
<u>New Year's Wish One</u><br />
<br />
Digitisation and open licence publication of the C17th manuscript records of the [https://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+CLA~2F004?SESSIONSEARCH Mansion House Justice Room] and of the [https://www.westminster-abbey.org/about-the-abbey/library-research/muniment-collection Westminster Abbey Muniments].<br />
<br />
<u>New Year's Wish Two</u><br />
<br />
Development of AI-enabled search tools for discovery within and synthesis of large scale digitised manuscript collections, bypassing the need for the manual creation of archival metadata.<br />
<br />
<u>New Year's Wish Three</u><br />
<br />
Implementation of federated search using AI-enabled search tools across multiple large scale digitised manuscript collections.<br />
<br />
<u>New Year's Wish Four</u><br />
<br />
Development of Early Modern Material Lives to complement Early Modern Marine Lives, broadening scope from marine to land based occupations, and emphasising the interaction of Early Modern workers with the physical world.<br />
<br />
'''[http://www.marinelives.org/wiki/New_Year%27s_wishes,_2018 Click here] to see our 2018 New Year's wishes.'''<br />
<br />
'''Please Tweet your comments on our New Year's wishes to [https://twitter.com/Marinelivesorg @Marinelivesorg] and share your own New Year's wishes for technologies to support historical research.'''<br />
<br />
----<br />
<br />
==New MarineLives project: Researching three ship account books from the 1620s and 1630s: HCA 30/636/==<br />
<br />
[[File:Ship Diamond August 1637 Setting Out To Mallaga f,61r Ratcatcher Image 20212029 145150.JPG|620px|thumb|left|Item: Paid the Ratcatcher: 4''s'' - 4''d'', Account book of the ship the ''Diamond'': setting out expenses to Mallaga, 1637; HCA 13/635/2]]<br />
<br />
[[File:Account Books 13102021.JPG|800px|thumb|left|HCA 30/636 in all its glory, just waiting for some collaborative work by volunteers]]<br />
<br />
'''Are you interested in a startup collaborative online project to look at, partially transcribe and understand three ship account books from the 1620s and 1630s? You have come to the right place. Marine Lives is launching a new project and is seeking volunteer collaborators. This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''<br />
<br />
HCA 30/636 is a document category which has been created to cover certain papers generated by the Prize Court jurisdiction of the English High Court of Admiralty. It contains nine sub-references. We have imaged all the documents within HCA 30/636 and will be making these available to volunteer collaborators online. Documents include three beautifully leather bound account books of various sizes, further paper bound account books, a letter copy book of letters written from on board ship, and various miscellaneous accounting documents relating to multiple voyages. In all we have over one thousand images.<br />
<br />
We are in start up mode. Our current thinking is to make the images available on DropBox or OneDrive and to use this MarineLives wiki as our collaboration platform - to share ideas, to provide support, and to be the vehicle to publish our transcriptions and synthesis. But we are open to your ideas about how to organise this project and nothing will be finalised until we have our team in place. You can get up to speed on our thinking by reading this [https://twitter.com/Marinelivesorg/status/1448320605384753156 Twitter Thread].<br />
<br />
We have had expressions of interest from people in many places - Mexico, Michigan, Texas, London, Newcastle to list a few - which is perfect given the virtual nature of our project and the broad geographic scope of the papers, which include multiple voyages from England to the West Indies, the Mediterranean and Northern Europe.<br />
<br />
This weekend (Saturday, October 16th 2021) we will be sending out an email to everyone who has expressed interest, inviting them to take a look at some sample images and to tell us about their research interests, skills and ideas for this project.<br />
<br />
This is going to be a very relaxed project running through to the middle of 2022 in which people are welcome to dip in and out, and to do as little or as much as they have time and interest for.<br />
<br />
If you are interested in learning more, '''[https://twitter.com/Marinelivesorg follow Marine Lives on Twitter]''', tweet your interest and we will get in touch with you by Twitter direct message.<br />
<br />
You can also '''[http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page access our HCA 30/636 account book project home page here]''', which we are starting to populate. Come join us and help us fill in the blanks.<br />
----<br />
==Women and Early Modern record keeping==<br />
<br />
'''In November 2021, MarineLives participated in an online seminar on Women and Early Modern record keeping, co-hosted by [https://oieahc.wm.edu/fellowships/fellows-since-1945/postdoctoral/caylin-carbonell/ Caylin Carbonell] and [http://www.marinelives.org/wiki/Volunteer_research_interests_and_goals#Colin_Greenstreet Colin Greenstreet]. We hope to develop some of the themes of this seminar in 2022.'''<br />
<br />
Our contribution to the seminar was a series of case studies from C17th English High Court of Admiralty depositions, in which [http://www.marinelives.org/wiki/Women%27s_investment_activities_and_record_keeping '''women testify about their investment activities and record keeping.''']<br />
----<br />
==About MarineLives==<br />
<br />
[[File:About MarineLives 22012018.JPG|800px|thumb|left|MarineLives volunteers, past and present, 2015-2018. [[Tools: Biographies|Read more about them here]]]]<br />
<br />
'''MarineLives is a collaborative volunteer driven project. The project started as a spinoff from a National Archives hackathon in early 2012. We are exploring lives touched by the marine world between 1540 and 1690: commerce, materials, language and correspondence.'''<br />
<br />
At the core of MarineLives is the collaborative transcription, linkage and enrichment of primary manuscripts from the English High Court of Admiralty, together with thematically related manuscripts from international manuscript and printed document collections.<br />
<br />
In the past ten years over 250 volunteers have contributed to our transcriptions and to our synthesis of the many themes which constitute lives in the Early Modern marine world. Currently, we have {{NUMBEROFARTICLES}} text pages and {{NUMBEROFFILES}} images available and nearly six million words of full text transcriptions on the MarineLives wiki.<br />
<br />
We have finding aids for themes as varied as Early Modern women in the marine world; Materials handling; The Early Modern River Thames; Commercial record keeping; Mariners' letters; and many more.<br />
<br />
We have also developed a database of 22,185 depositions drawn from the HCA 13/ series covering the period 1575 to 1684, which provides quantitative and qualitative insights into this important series of Admiralty Court depositions.<br />
<br />
'''[[Tools: Basic wiki skills & palaeographical tips|Sample our training material to see if this could be for you.]]'''<br />
<br />
----<br />
<div style="float: right; vertical-align: bottom; margin-left: 20px; margin-bottom: 20px; background:#dcdcdc; border: 1px solid #b0c4de; width: 530px;"><br />
<div style="background: #b0c4de; padding: 5px 10px 5px 10px; font-size: larger; font-weight: bold;">Sample images</div><br />
<div style="padding: 10px;"><br />
<div style="font-size: normal;">'''This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''</div><br />
[[File:Abraham Hardy Account Book Wages HCA 30 636.JPG|500px|thumb|right|HCA 30/636/3 Andrew Hardey's account book for voyage to Barbados in the ship the Abraham - extract from wages schedules]]<br />
<br />
[[File:Abraham Handwriting Sample HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Handwriting sample]]<br />
<br />
[[File:Abraham Letter Copy Book HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Letter copy book from on board the ship the Abraham]]<br />
<br />
</div><br />
</div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=File:Model_3.322_Two_12032022.jpg&diff=130968File:Model 3.322 Two 12032022.jpg2022-03-12T10:25:42Z<p>ColinGreenstreet: </p>
<hr />
<div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=File:Model_3.322_Three_12032022.jpg&diff=130967File:Model 3.322 Three 12032022.jpg2022-03-12T10:24:11Z<p>ColinGreenstreet: </p>
<hr />
<div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=MarineLives&diff=130966MarineLives2022-03-11T11:21:09Z<p>ColinGreenstreet: /* About MarineLives */</p>
<hr />
<div>__TOC__<br />
<div style="float: right; vertical-align: bottom; margin-left: 20px; margin-bottom: 20px; background:#dcdcdc; border: 1px solid #b0c4de; width: 530px;"><br />
<div style="background: #b0c4de; padding: 5px 10px 5px 10px; font-size: larger; font-weight: bold;">Sample images</div><br />
<div style="padding: 10px;"><br />
<div style="font-size: normal;">'''This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''</div><br />
[[File:Abraham Hardy Account Book Wages HCA 30 636.JPG|500px|thumb|right|HCA 30/636/3 Andrew Hardey's account book for voyage to Barbados in the ship the Abraham - extract from wages schedules]]<br />
<br />
[[File:Abraham Handwriting Sample HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Handwriting sample]]<br />
<br />
[[File:Abraham Letter Copy Book HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Letter copy book from on board the ship the Abraham]]<br />
<br />
</div><br />
</div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=MarineLives&diff=130965MarineLives2022-03-11T11:19:09Z<p>ColinGreenstreet: /* MarineLives research agenda for 2022 */</p>
<hr />
<div>__TOC__<br />
----<br />
==Celebrating our tenth anniversary this year==<br />
<br />
'''To celebrate the tenth anniversary of the founding of Marine Lives we aim to make publicly available and searchable by the end of 2022 a high quality machine transcription of 34 mill words of English High Court of Admiralty depositions, 1574 to 1688.'''<br />
<br />
[[File:Our HTR Goals Revised 11092022.jpg|750px|thumb|left|Celebrating our tenth anniversary]]<br />
----<br />
==MarineLives research agenda for 2022==<br />
<br />
'''Marine Lives celebrates its tenth anniversary in 2022. To mark reaching the remarkable digital age of ten we announced in March 2022 our goal to publish by the end of 2022 a high quality machine transcription of fifty nine volumes of English High Court of Admiralty depositions for the period 1574 to 1688. This initiative will use handwriting recognition models we are developing for C17th Secretary Hand and links to the first point in our four point research agenda for 2022:<br />
<br />
1. Developing and optimising keyword search algorithms for C17th Secretary Hand in English language English High Court of Admiralty manuscript documents.<br />
<br />
2. Completing the next phase of our metadata for the HCA 13/ series. Currently, we have metadata for 22,000 deponents from the years 1570 to 1688, supported by 50,000 digital images held offline.<br />
<br />
3. Publishing a [http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page#Work_product_by_the_HCA_30.2F636_team hybrid digital edition of three ship account books from the 1620s and 1630s], taken from HCA 30/636.<br />
<br />
4. Organising two workshops with the Oldenburg Prize Papers team on (a) Developing a broadly applicable digital C16th and C17th commercial document ontology (b) Forming and developing an online international community to study mariner letters.<br />
<br />
----<br />
<br />
==Creating a Ground Truth==<br />
<br />
'''We are creating a bespoke HTR model to read C17th English Secretarial hand. We plan two models. The first using 500,000 words from our existing diplomatic transcriptions of HCA 13/72. The second will be twice the size, and will add an additional 500,000 words from an earlier volume of HCA depositions.'''<br />
<br />
[[File:Contact Us 10032022.jpg|750px|thumb|left|Contact Colin Greenstreet (Marine Lives) or Holly Brewer (UMD) if you would like to learn more about our work on machine recognition of C17th Secretary Hand]]<br />
<br />
Below an image showing the first manuscript page from HCA 13/72, which we have now entered into Transkribus.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth Click here] to view pages we are developing to illustrate practical aspects of creating a Ground Truth<br />
<br />
[[File:HCA 1372 f1r GROUND TRUTH 02032022.png|750px|thumb|left|HCA 13/72 f.1r: entered into Transkribus as part of the Ground Truth we are creating]]<br />
<br />
<br />
----<br />
<br />
==Fantasy Early Modern book competition==<br />
<br />
'''Announcing a one week FANTASY EARLY MODERN BOOK COMPETITION.'''<br />
<br />
'''Rules: List the chapter titles of a fantasy book you WISH EXISTED and post to the [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.'''<br />
<br />
'''Winner(s) will be those book titles/chapters which get MOST LIKES ON THE [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.''' <br />
<br />
Here is my starter:<br />
<br />
TITLE: <u>Early Modern Economic Lives</u><br />
<br />
CHAPTER 1: Talking about work: Early Modern workers describing their occupations and work places<br />
<br />
CHAPTER 2: The melding of work and home and its implications for participation of men and women in commercial life<br />
<br />
CHAPTER 3: Everyone lies: The importance of accurate record keeping<br />
<br />
CHAPTER 4: The role of the Early Modern bookkeeper [https://www.amazon.co.uk/Accounting-Growth-Stripping-Camouflage-Accounts/dp/0712652809 accounting for growth]<br />
<br />
CHAPTER 5: Service driven functional literacy: How Early Modern international trade drove and was a product of literacy<br />
<br />
CHAPTER 6: Risk and probability: How people thought about individual and group risk, and how it affected their commercial decision making<br />
<br />
CHAPTER 7: Free will, contract, indentured labour, enslavement: Concepts underpinning the Early Modern labour market<br />
<br />
CHAPTER 8: How to make Early Modern money (and keep it)<br />
<br />
'''Here's a possible plan. Assemble a group of authors, recruited through Twitter, who will write and publish the Fantasy EM book as an open source online book. The book would be peer reviewed by its readers, and will go through various iterations as readers comment, suggest and offer to improve.'''<br />
<br />
'''If we go with Early Modern Economic Lives (and I'm happy to go with a different Fantasy EM book), I (Colin Greenstreet) would be happy to write a draft of chapters three, four and five, but would need collaborators to write the remaining chapters.'''<br />
<br />
'''What do you think?'''<br />
----<br />
<br />
==New Year's wishes 2022==<br />
<br />
'''2022 is the tenth anniversary of the founding of Marine Lives. So here are our New Year's wishes for the next ten years for Marine Lives and for Early Modern studies generally.'''<br />
<br />
<u>New Year's Wish One</u><br />
<br />
Digitisation and open licence publication of the C17th manuscript records of the [https://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+CLA~2F004?SESSIONSEARCH Mansion House Justice Room] and of the [https://www.westminster-abbey.org/about-the-abbey/library-research/muniment-collection Westminster Abbey Muniments].<br />
<br />
<u>New Year's Wish Two</u><br />
<br />
Development of AI-enabled search tools for discovery within and synthesis of large scale digitised manuscript collections, bypassing the need for the manual creation of archival metadata.<br />
<br />
<u>New Year's Wish Three</u><br />
<br />
Implementation of federated search using AI-enabled search tools across multiple large scale digitised manuscript collections.<br />
<br />
<u>New Year's Wish Four</u><br />
<br />
Development of Early Modern Material Lives to complement Early Modern Marine Lives, broadening scope from marine to land based occupations, and emphasising the interaction of Early Modern workers with the physical world.<br />
<br />
'''[http://www.marinelives.org/wiki/New_Year%27s_wishes,_2018 Click here] to see our 2018 New Year's wishes.'''<br />
<br />
'''Please Tweet your comments on our New Year's wishes to [https://twitter.com/Marinelivesorg @Marinelivesorg] and share your own New Year's wishes for technologies to support historical research'''<br />
<br />
----<br />
<br />
==New MarineLives project: Researching three ship account books from the 1620s and 1630s: HCA 30/636/==<br />
<br />
[[File:Ship Diamond August 1637 Setting Out To Mallaga f,61r Ratcatcher Image 20212029 145150.JPG|620px|thumb|left|Item: Paid the Ratcatcher: 4''s'' - 4''d'', Account book of the ship the ''Diamond'': setting out expenses to Mallaga, 1637; HCA 13/635/2]]<br />
<br />
[[File:Account Books 13102021.JPG|800px|thumb|left|HCA 30/636 in all its glory, just waiting for some collaborative work by volunteers]]<br />
<br />
'''Are you interested in a startup collaborative online project to look at, partially transcribe and understand three ship account books from the 1620s and 1630s? You have come to the right place. Marine Lives is launching a new project and is seeking volunteer collaborators. This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''<br />
<br />
HCA 30/636 is a document category created to cover certain papers generated by the Prize Court jurisdiction of the English High Court of Admiralty. It contains nine sub-references. We have imaged all the documents within HCA 30/636 and will be making these available to volunteer collaborators online. Documents include three beautifully leather bound account books of various sizes, further paper bound account books, a letter copy book of letters written from on board ship, and various miscellaneous accounting documents relating to multiple voyages. In all we have over one thousand images.<br />
<br />
We are in start up mode. Our current thinking is to make the images available on DropBox or OneDrive and to use this MarineLives wiki as our collaboration platform - to share ideas, to provide support, and to be the vehicle to publish our transcriptions and synthesis. But we are open to your ideas about how to organise this project and nothing will be finalised until we have our team in place. You can get up to speed on our thinking by reading this [https://twitter.com/Marinelivesorg/status/1448320605384753156 Twitter Thread].<br />
<br />
We have had expressions of interest from people from many places - Mexico, Michigan, Texas, London, Newcastle to list a few - which is perfect given the virtual nature of our project and the broad geographic scope of the papers which include multiple voyages from England to the West Indies, the Mediterranean and to Northern Europe.<br />
<br />
This weekend (Saturday, October 16th 2021) we will be sending out an email to everyone who has expressed interest, inviting them to take a look at some sample images and to tell us about their research interests, skills and ideas for this project.<br />
<br />
This is going to be a very relaxed project running through to the middle of 2022 in which people are welcome to dip in and out, and to do as little or as much as they have time and interest for.<br />
<br />
If you are interested in learning more, '''[https://twitter.com/Marinelivesorg follow Marine Lives on Twitter]''', tweet your interest and we will get in touch with you by Twitter direct message.<br />
<br />
You can also '''[http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page access our HCA 30/636 account book project home page here]''', which we are starting to populate. Come join us and help us fill in the blanks.<br />
----<br />
==Women and Early Modern record keeping==<br />
<br />
'''In November 2021, MarineLives participated in an online seminar on Women and Early Modern recordkeeping, co-hosted by [https://oieahc.wm.edu/fellowships/fellows-since-1945/postdoctoral/caylin-carbonell/ Caylin Carbonell] and [http://www.marinelives.org/wiki/Volunteer_research_interests_and_goals#Colin_Greenstreet Colin Greenstreet]. We hope to develop some of the themes of this seminar in 2022.'''<br />
<br />
Our contribution to the seminar was a series of case studies from C17th English High Court of Admiralty depositions, in which [http://www.marinelives.org/wiki/Women%27s_investment_activities_and_record_keeping '''women testify about their investment activities and record keeping.''']<br />
----<br />
==About MarineLives==<br />
<br />
[[File:About MarineLives 22012018.JPG|800px|thumb|left|MarineLives volunteers, past and present, 2015-2018. [[Tools: Biographies|Read more about them here]]]]<br />
<br />
'''MarineLives is a collaborative volunteer driven project. The project started as a spinoff from a National Archives hackathon in early 2012. We are exploring lives touched by the marine world between 1540 and 1690. Commerce, materials, language and correspondence.'''<br />
<br />
At the core of MarineLives is the collaborative transcription, linkage and enrichment of primary manuscripts from the English High Court of Admiralty, together with thematically related manuscripts from international manuscript and printed document collections.<br />
<br />
In the past ten years over 250 volunteers have contributed to our transcriptions and to our synthesis of the many themes which constitute lives in the Early Modern marine world. Currently, we have {{NUMBEROFARTICLES}} text pages and {{NUMBEROFFILES}} images available and nearly six million words of full text transcriptions on the MarineLives wiki.<br />
<br />
We have finding aids for themes as varied as Early Modern women in the marine world; Materials handling; The Early Modern River Thames; Commercial record keeping; Mariners' letters; and many more.<br />
<br />
We have also developed a database of 21,250 depositions drawn from the HCA 13/ series covering the period 1575 to 1684, which provides quantitative and qualitative insights into this important series of Admiralty Court depositions.<br />
<br />
'''[[Tools: Basic wiki skills & palaeographical tips|Sample our training material to see if this could be for you.]]'''<br />
<br />
----<br />
<div style="float: right; vertical-align: bottom; margin-left: 20px; margin-bottom: 20px; background:#dcdcdc; border: 1px solid #b0c4de; width: 530px;"><br />
<div style="background: #b0c4de; padding: 5px 10px 5px 10px; font-size: larger; font-weight: bold;">Sample images</div><br />
<div style="padding: 10px;"><br />
<div style="font-size: normal;">'''This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''</div><br />
[[File:Abraham Hardy Account Book Wages HCA 30 636.JPG|500px|thumb|right|HCA 30/636/3 Andrew Hardey's account book for voyage to Barbados in the ship the Abraham - extract from wages schedules]]<br />
<br />
[[File:Abraham Handwriting Sample HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Handwriting sample]]<br />
<br />
[[File:Abraham Letter Copy Book HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Letter copy book from on board the ship the Abraham]]<br />
<br />
</div><br />
</div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=MarineLives&diff=130964MarineLives2022-03-11T10:57:07Z<p>ColinGreenstreet: /* Creating a Ground Truth */</p>
<hr />
<div>__TOC__<br />
----<br />
==Celebrating our tenth anniversary this year==<br />
<br />
'''To celebrate the tenth anniversary of the founding of Marine Lives we aim to make publicly available and searchable by the end of 2022 a high quality machine transcription of 34 million words of English High Court of Admiralty depositions, 1574 to 1688.'''<br />
<br />
[[File:Our HTR Goals Revised 11092022.jpg|750px|thumb|left|Celebrating our tenth anniversary]]<br />
----<br />
==MarineLives research agenda for 2022==<br />
<br />
'''Marine Lives celebrates its tenth anniversary in 2022. To mark reaching the remarkable digital age of ten we announced in March 2022 our goal to publish by the end of 2022 a high quality machine transcription of fifty nine volumes of English High Court of Admiralty depositions for the period 1574 to 1688. This initiative will use handwriting recognition models we are developing for C17th Secretary Hand and links to the first point in our four point research agenda for the year:'''<br />
<br />
1. Developing and optimising keyword search algorithms for C17th Secretary Hand in English language English High Court of Admiralty manuscript documents.<br />
<br />
2. Completing the next phase of our metadata for the HCA 13/ series. Currently, we have metadata for 22,000 deponents from the years 1570 to 1688, supported by 50,000 digital images held offline.<br />
<br />
3. Publishing a [http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page#Work_product_by_the_HCA_30.2F636_team hybrid digital edition of three ship account books from the 1620s and 1630s], taken from HCA 30/636.<br />
<br />
4. Organising two workshops with the Oldenburg Prize Papers team on (a) Developing a broadly applicable digital C16th and C17th commercial document ontology (b) Forming and developing an online international community to study mariner letters.<br />
<br />
----<br />
<br />
==Creating a Ground Truth==<br />
<br />
'''We are creating a bespoke HTR model to read C17th English Secretary Hand. We plan two models. The first will use 500,000 words from our existing diplomatic transcriptions of HCA 13/72. The second will be twice the size, adding a further 500,000 words from an earlier volume of HCA depositions.'''<br />
<br />
[[File:Contact Us 10032022.jpg|750px|thumb|left|Contact Colin Greenstreet (Marine Lives) or Holly Brewer (UMD) if you would like to learn more about our work on machine recognition of C17th Secretary Hand]]<br />
<br />
Below is an image showing the first manuscript page from HCA 13/72, which we have now entered into Transkribus.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth Click here] to view pages we are developing to illustrate practical aspects of creating a Ground Truth.<br />
<br />
[[File:HCA 1372 f1r GROUND TRUTH 02032022.png|750px|thumb|left|HCA 13/72 f.1r: entered into Transkribus as part of the Ground Truth we are creating]]<br />
<br />
<br />
----<br />
<br />
==Fantasy Early Modern book competition==<br />
<br />
'''Announcing a one week FANTASY EARLY MODERN BOOK COMPETITION.'''<br />
<br />
'''Rules: List the chapter titles of a fantasy book you WISH EXISTED and post to the [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.'''<br />
<br />
'''Winner(s) will be those book titles/chapters which get MOST LIKES ON THE [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.''' <br />
<br />
Here is my starter:<br />
<br />
TITLE: <u>Early Modern Economic Lives</u><br />
<br />
CHAPTER 1: Talking about work: Early Modern workers describing their occupations and work places<br />
<br />
CHAPTER 2: The melding of work and home and its implications for participation of men and women in commercial life<br />
<br />
CHAPTER 3: Everyone lies: The importance of accurate record keeping<br />
<br />
CHAPTER 4: The role of the Early Modern bookkeeper [https://www.amazon.co.uk/Accounting-Growth-Stripping-Camouflage-Accounts/dp/0712652809 accounting for growth]<br />
<br />
CHAPTER 5: Service driven functional literacy: How Early Modern international trade drove and was a product of literacy<br />
<br />
CHAPTER 6: Risk and probability: How people thought about individual and group risk, and how it affected their commercial decision making<br />
<br />
CHAPTER 7: Free will, contract, indentured labour, enslavement: Concepts underpinning the Early Modern labour market<br />
<br />
CHAPTER 8: How to make Early Modern money (and keep it)<br />
<br />
'''Here's a possible plan. Assemble a group of authors, recruited through Twitter, who will write and publish the Fantasy EM book as an open source online book. The book would be peer reviewed by its readers and would go through various iterations as readers comment, suggest changes and offer improvements.'''<br />
<br />
'''If we go with Early Modern Economic Lives (and I'm happy to go with a different Fantasy EM book), I (Colin Greenstreet) would be happy to write a draft of chapters three, four and five, but would need collaborators to write the remaining chapters.'''<br />
<br />
'''What do you think?'''<br />
----<br />
<br />
==New Year's wishes 2022==<br />
<br />
'''2022 is the tenth anniversary of the founding of Marine Lives. So here are our New Year's wishes for the next ten years for Marine Lives and for Early Modern studies generally.'''<br />
<br />
<u>New Year's Wish One</u><br />
<br />
Digitisation and open licence publication of the C17th manuscript records of the [https://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+CLA~2F004?SESSIONSEARCH Mansion House Justice Room] and of the [https://www.westminster-abbey.org/about-the-abbey/library-research/muniment-collection Westminster Abbey Muniments].<br />
<br />
<u>New Year's Wish Two</u><br />
<br />
Development of AI-enabled search tools for discovery within and synthesis of large scale digitised manuscript collections, bypassing the need for the manual creation of archival metadata.<br />
<br />
<u>New Year's Wish Three</u><br />
<br />
Implementation of federated search using AI-enabled search tools across multiple large scale digitised manuscript collections.<br />
<br />
<u>New Year's Wish Four</u><br />
<br />
Development of Early Modern Material Lives to complement Early Modern Marine Lives, broadening scope from marine to land based occupations, and emphasising the interaction of Early Modern workers with the physical world.<br />
<br />
'''[http://www.marinelives.org/wiki/New_Year%27s_wishes,_2018 Click here] to see our 2018 New Year's wishes.'''<br />
<br />
'''Please Tweet your comments on our New Year's wishes to [https://twitter.com/Marinelivesorg @Marinelivesorg] and share your own New Year's wishes for technologies to support historical research'''<br />
<br />
----<br />
<br />
==New MarineLives project: Researching three ship account books from the 1620s and 1630s: HCA 30/636/==<br />
<br />
[[File:Ship Diamond August 1637 Setting Out To Mallaga f,61r Ratcatcher Image 20212029 145150.JPG|620px|thumb|left|Item: Paid the Ratcatcher: 4''s'' - 4''d'', Account book of the ship the ''Diamond'': setting out expenses to Mallaga, 1637; HCA 13/635/2]]<br />
<br />
[[File:Account Books 13102021.JPG|800px|thumb|left|HCA 30/636 in all its glory, just waiting for some collaborative work by volunteers]]<br />
<br />
'''Are you interested in a startup collaborative online project to look at, partially transcribe and understand three ship account books from the 1620s and 1630s? You have come to the right place. Marine Lives is launching a new project and is seeking volunteer collaborators. This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''<br />
<br />
HCA 30/636 is a document category created to cover certain papers generated by the Prize Court jurisdiction of the English High Court of Admiralty. It contains nine sub-references. We have imaged all the documents within HCA 30/636 and will be making these available to volunteer collaborators online. Documents include three beautifully leather bound account books of various sizes, further paper bound account books, a letter copy book of letters written from on board ship, and various miscellaneous accounting documents relating to multiple voyages. In all we have over one thousand images.<br />
<br />
We are in start up mode. Our current thinking is to make the images available on DropBox or OneDrive and to use this MarineLives wiki as our collaboration platform - to share ideas, to provide support, and to be the vehicle to publish our transcriptions and synthesis. But we are open to your ideas about how to organise this project and nothing will be finalised until we have our team in place. You can get up to speed on our thinking by reading this [https://twitter.com/Marinelivesorg/status/1448320605384753156 Twitter Thread].<br />
<br />
We have had expressions of interest from people from many places - Mexico, Michigan, Texas, London, Newcastle to list a few - which is perfect given the virtual nature of our project and the broad geographic scope of the papers which include multiple voyages from England to the West Indies, the Mediterranean and to Northern Europe.<br />
<br />
This weekend (Saturday, October 16th 2021) we will be sending out an email to everyone who has expressed interest, inviting them to take a look at some sample images and to tell us about their research interests, skills and ideas for this project.<br />
<br />
This is going to be a very relaxed project running through to the middle of 2022 in which people are welcome to dip in and out, and to do as little or as much as they have time and interest for.<br />
<br />
If you are interested in learning more, '''[https://twitter.com/Marinelivesorg follow Marine Lives on Twitter]''', tweet your interest and we will get in touch with you by Twitter direct message.<br />
<br />
You can also '''[http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page access our HCA 30/636 account book project home page here]''', which we are starting to populate. Come join us and help us fill in the blanks.<br />
----<br />
==Women and Early Modern record keeping==<br />
<br />
'''In November 2021, MarineLives participated in an online seminar on Women and Early Modern recordkeeping, co-hosted by [https://oieahc.wm.edu/fellowships/fellows-since-1945/postdoctoral/caylin-carbonell/ Caylin Carbonell] and [http://www.marinelives.org/wiki/Volunteer_research_interests_and_goals#Colin_Greenstreet Colin Greenstreet]. We hope to develop some of the themes of this seminar in 2022.'''<br />
<br />
Our contribution to the seminar was a series of case studies from C17th English High Court of Admiralty depositions, in which [http://www.marinelives.org/wiki/Women%27s_investment_activities_and_record_keeping '''women testify about their investment activities and record keeping.''']<br />
----<br />
==About MarineLives==<br />
<br />
[[File:About MarineLives 22012018.JPG|800px|thumb|left|MarineLives volunteers, past and present, 2015-2018. [[Tools: Biographies|Read more about them here]]]]<br />
<br />
'''MarineLives is a collaborative volunteer driven project. The project started as a spinoff from a National Archives hackathon in early 2012. We are exploring lives touched by the marine world between 1540 and 1690. Commerce, materials, language and correspondence.'''<br />
<br />
At the core of MarineLives is the collaborative transcription, linkage and enrichment of primary manuscripts from the English High Court of Admiralty, together with thematically related manuscripts from international manuscript and printed document collections.<br />
<br />
In the past ten years over 250 volunteers have contributed to our transcriptions and to our synthesis of the many themes which constitute lives in the Early Modern marine world. Currently, we have {{NUMBEROFARTICLES}} text pages and {{NUMBEROFFILES}} images available and nearly six million words of full text transcriptions on the MarineLives wiki.<br />
<br />
We have finding aids for themes as varied as Early Modern women in the marine world; Materials handling; The Early Modern River Thames; Commercial record keeping; Mariners' letters; and many more.<br />
<br />
We have also developed a database of 21,250 depositions drawn from the HCA 13/ series covering the period 1575 to 1684, which provides quantitative and qualitative insights into this important series of Admiralty Court depositions.<br />
<br />
'''[[Tools: Basic wiki skills & palaeographical tips|Sample our training material to see if this could be for you.]]'''<br />
<br />
----<br />
<div style="float: right; vertical-align: bottom; margin-left: 20px; margin-bottom: 20px; background:#dcdcdc; border: 1px solid #b0c4de; width: 530px;"><br />
<div style="background: #b0c4de; padding: 5px 10px 5px 10px; font-size: larger; font-weight: bold;">Sample images</div><br />
<div style="padding: 10px;"><br />
<div style="font-size: normal;">'''This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''</div><br />
[[File:Abraham Hardy Account Book Wages HCA 30 636.JPG|500px|thumb|right|HCA 30/636/3 Andrew Hardey's account book for voyage to Barbados in the ship the Abraham - extract from wages schedules]]<br />
<br />
[[File:Abraham Handwriting Sample HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Handwriting sample]]<br />
<br />
[[File:Abraham Letter Copy Book HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Letter copy book from on board the ship the Abraham]]<br />
<br />
</div><br />
</div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=File:Contact_Us_10032022.jpg&diff=130963File:Contact Us 10032022.jpg2022-03-11T10:54:53Z<p>ColinGreenstreet: </p>
<hr />
<div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=MarineLives&diff=130962MarineLives2022-03-11T10:51:13Z<p>ColinGreenstreet: </p>
<hr />
<div>__TOC__<br />
----<br />
==Celebrating our tenth anniversary this year==<br />
<br />
'''To celebrate the tenth anniversary of the founding of Marine Lives we aim to make publicly available and searchable by the end of 2022 a high quality machine transcription of 34 million words of English High Court of Admiralty depositions, 1574 to 1688.'''<br />
<br />
[[File:Our HTR Goals Revised 11092022.jpg|750px|thumb|left|Celebrating our tenth anniversary]]<br />
----<br />
==MarineLives research agenda for 2022==<br />
<br />
'''Marine Lives celebrates its tenth anniversary in 2022. To mark reaching the remarkable digital age of ten we announced in March 2022 our goal to publish by the end of 2022 a high quality machine transcription of fifty nine volumes of English High Court of Admiralty depositions for the period 1574 to 1688. This initiative will use handwriting recognition models we are developing for C17th Secretary Hand and links to the first point in our four point research agenda for the year:'''<br />
<br />
1. Developing and optimising keyword search algorithms for C17th Secretary Hand in English language English High Court of Admiralty manuscript documents.<br />
<br />
2. Completing the next phase of our metadata for the HCA 13/ series. Currently, we have metadata for 22,000 deponents from the years 1570 to 1688, supported by 50,000 digital images held offline.<br />
<br />
3. Publishing a [http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page#Work_product_by_the_HCA_30.2F636_team hybrid digital edition of three ship account books from the 1620s and 1630s], taken from HCA 30/636.<br />
<br />
4. Organising two workshops with the Oldenburg Prize Papers team on (a) Developing a broadly applicable digital C16th and C17th commercial document ontology (b) Forming and developing an online international community to study mariner letters.<br />
<br />
----<br />
<br />
==Creating a Ground Truth==<br />
<br />
'''We are creating a bespoke HTR model to read C17th English Secretary Hand. We plan two models. The first will use 500,000 words from our existing diplomatic transcriptions of HCA 13/72. The second will be twice the size, adding a further 500,000 words from an earlier volume of HCA depositions.'''<br />
<br />
Below is an image showing the first manuscript page from HCA 13/72, which we have now entered into Transkribus.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth Click here] to view pages we are developing to illustrate practical aspects of creating a Ground Truth.<br />
<br />
[[File:HCA 1372 f1r GROUND TRUTH 02032022.png|750px|thumb|left|HCA 13/72 f.1r: entered into Transkribus as part of the Ground Truth we are creating]]<br />
<br />
[[File:Ground Truth One 02032022.png|750px|thumb|left|We are adding 100 manuscript pages per week to our Ground Truth]]<br />
<br />
----<br />
<br />
==Fantasy Early Modern book competition==<br />
<br />
'''Announcing a one week FANTASY EARLY MODERN BOOK COMPETITION.'''<br />
<br />
'''Rules: List the chapter titles of a fantasy book you WISH EXISTED and post to the [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.'''<br />
<br />
'''Winner(s) will be those book titles/chapters which get MOST LIKES ON THE [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.''' <br />
<br />
Here is my starter:<br />
<br />
TITLE: <u>Early Modern Economic Lives</u><br />
<br />
CHAPTER 1: Talking about work: Early Modern workers describing their occupations and work places<br />
<br />
CHAPTER 2: The melding of work and home and its implications for participation of men and women in commercial life<br />
<br />
CHAPTER 3: Everyone lies: The importance of accurate record keeping<br />
<br />
CHAPTER 4: The role of the Early Modern bookkeeper [https://www.amazon.co.uk/Accounting-Growth-Stripping-Camouflage-Accounts/dp/0712652809 accounting for growth]<br />
<br />
CHAPTER 5: Service driven functional literacy: How Early Modern international trade drove and was a product of literacy<br />
<br />
CHAPTER 6: Risk and probability: How people thought about individual and group risk, and how it affected their commercial decision making<br />
<br />
CHAPTER 7: Free will, contract, indentured labour, enslavement: Concepts underpinning the Early Modern labour market<br />
<br />
CHAPTER 8: How to make Early Modern money (and keep it)<br />
<br />
'''Here's a possible plan. Assemble a group of authors, recruited through Twitter, who will write and publish the Fantasy EM book as an open source online book. The book would be peer reviewed by its readers and would go through various iterations as readers comment, suggest changes and offer improvements.'''<br />
<br />
'''If we go with Early Modern Economic Lives (and I'm happy to go with a different Fantasy EM book), I (Colin Greenstreet) would be happy to write a draft of chapters three, four and five, but would need collaborators to write the remaining chapters.'''<br />
<br />
'''What do you think?'''<br />
----<br />
<br />
==New Year's wishes 2022==<br />
<br />
'''2022 is the tenth anniversary of the founding of Marine Lives. So here are our New Year's wishes for the next ten years for Marine Lives and for Early Modern studies generally.'''<br />
<br />
<u>New Year's Wish One</u><br />
<br />
Digitisation and open licence publication of the C17th manuscript records of the [https://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+CLA~2F004?SESSIONSEARCH Mansion House Justice Room] and of the [https://www.westminster-abbey.org/about-the-abbey/library-research/muniment-collection Westminster Abbey Muniments].<br />
<br />
<u>New Year's Wish Two</u><br />
<br />
Development of AI-enabled search tools for discovery within and synthesis of large scale digitised manuscript collections, bypassing the need for the manual creation of archival metadata.<br />
<br />
<u>New Year's Wish Three</u><br />
<br />
Implementation of federated search using AI-enabled search tools across multiple large scale digitised manuscript collections.<br />
<br />
<u>New Year's Wish Four</u><br />
<br />
Development of Early Modern Material Lives to complement Early Modern Marine Lives, broadening scope from marine to land based occupations, and emphasising the interaction of Early Modern workers with the physical world.<br />
<br />
'''[http://www.marinelives.org/wiki/New_Year%27s_wishes,_2018 Click here] to see our 2018 New Year's wishes.'''<br />
<br />
'''Please Tweet your comments on our New Year's wishes to [https://twitter.com/Marinelivesorg @Marinelivesorg] and share your own New Year's wishes for technologies to support historical research'''<br />
<br />
----<br />
<br />
==New MarineLives project: Researching three ship account books from the 1620s and 1630s: HCA 30/636/==<br />
<br />
[[File:Ship Diamond August 1637 Setting Out To Mallaga f,61r Ratcatcher Image 20212029 145150.JPG|620px|thumb|left|Item: Paid the Ratcatcher: 4''s'' - 4''d'', Account book of the ship the ''Diamond'': setting out expenses to Mallaga, 1637; HCA 13/635/2]]<br />
<br />
[[File:Account Books 13102021.JPG|800px|thumb|left|HCA 30/636 in all its glory, just waiting for some collaborative work by volunteers]]<br />
<br />
'''Are you interested in a startup collaborative online project to look at, partially transcribe and understand three ship account books from the 1620s and 1630s? You have come to the right place. Marine Lives is launching a new project and is seeking volunteer collaborators. This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''<br />
<br />
HCA 30/636 is a document category created to cover certain papers generated by the Prize Court jurisdiction of the English High Court of Admiralty. It contains nine sub-references. We have imaged all the documents within HCA 30/636 and will be making these available to volunteer collaborators online. Documents include three beautifully leather bound account books of various sizes, further paper bound account books, a letter copy book of letters written from on board ship, and various miscellaneous accounting documents relating to multiple voyages. In all we have over one thousand images.<br />
<br />
We are in start up mode. Our current thinking is to make the images available on DropBox or OneDrive and to use this MarineLives wiki as our collaboration platform - to share ideas, to provide support, and to be the vehicle to publish our transcriptions and synthesis. But we are open to your ideas about how to organise this project and nothing will be finalised until we have our team in place. You can get up to speed on our thinking by reading this [https://twitter.com/Marinelivesorg/status/1448320605384753156 Twitter Thread].<br />
<br />
We have had expressions of interest from people from many places - Mexico, Michigan, Texas, London, Newcastle to list a few - which is perfect given the virtual nature of our project and the broad geographic scope of the papers which include multiple voyages from England to the West Indies, the Mediterranean and to Northern Europe.<br />
<br />
This weekend (Saturday, October 16th 2021) we will be sending out an email to everyone who has expressed interest, inviting them to take a look at some sample images and to tell us about their research interests, skills and ideas for this project.<br />
<br />
This is going to be a very relaxed project running through to the middle of 2022 in which people are welcome to dip in and out, and to do as little or as much as they have time and interest for.<br />
<br />
If you are interested in learning more, '''[https://twitter.com/Marinelivesorg follow Marine Lives on Twitter]''', tweet your interest and we will get in touch with you by Twitter direct mail.<br />
<br />
You can also '''[http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page access our HCA 30/636 account book project home page here]''', which we are starting to populate. Come join us and help us fill in the blanks.<br />
----<br />
==Women and Early Modern record keeping==<br />
<br />
'''In November 2021, MarineLives participated in an online seminar on Women and Early Modern recordkeeping, co-hosted by [https://oieahc.wm.edu/fellowships/fellows-since-1945/postdoctoral/caylin-carbonell/ Caylin Carbonell] and [http://www.marinelives.org/wiki/Volunteer_research_interests_and_goals#Colin_Greenstreet Colin Greenstreet]. We hope to develop some of the themes of this seminar in 2022.'''<br />
<br />
Our contribution to the seminar was a series of case studies from C17th English High Court of Admiralty depositions, in which [http://www.marinelives.org/wiki/Women%27s_investment_activities_and_record_keeping '''women testify about their investment activities and record keeping.''']<br />
----<br />
==About MarineLives==<br />
<br />
[[File:About MarineLives 22012018.JPG|800px|thumb|left|MarineLives volunteers, past and present, 2015-2018. [[Tools: Biographies|Read more about them here]]]]<br />
<br />
'''MarineLives is a collaborative volunteer driven project. The project started as a spinoff from a National Archives hackathon in early 2012. We are exploring lives touched by the marine world between 1540 and 1690. Commerce, materials, language and correspondence.'''<br />
<br />
At the core of MarineLives is the collaborative transcription, linkage and enrichment of primary manuscripts from the English High Court of Admiralty, together with thematically related manuscripts from international manuscript and printed document collections.<br />
<br />
In the past ten years over 250 volunteers have contributed to our transcriptions and to our synthesis of the many themes which constitute lives in the Early Modern marine world. Currently, we have {{NUMBEROFARTICLES}} text pages and {{NUMBEROFFILES}} images available and nearly six million words of full text transcriptions on the MarineLives wiki.<br />
<br />
We have finding aids for themes as varied as Early Modern women in the marine world; Materials handling; The Early Modern River Thames; Commercial record keeping; Mariners letters; and many more.<br />
<br />
We have also developed a database of 21,250 depositions drawn from the HCA 13/ series covering the period 1575 to 1684, which provides quantitative and qualitative insights into this important series of Admiralty Court depositions.<br />
<br />
'''[[Tools: Basic wiki skills & palaeographical tips|Sample our training material to see if this could be for you.]]'''<br />
<br />
----<br />
<div style="float: right; vertical-align: bottom; margin-left: 20px; margin-bottom: 20px; background:#dcdcdc; border: 1px solid #b0c4de; width: 530px;"><br />
<div style="background: #b0c4de; padding: 5px 10px 5px 10px; font-size: larger; font-weight: bold;">Sample images</div><br />
<div style="padding: 10px;"><br />
<div style="font-size: normal;">'''This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''</div><br />
[[File:Abraham Hardy Account Book Wages HCA 30 636.JPG|500px|thumb|right|HCA 30/636/3 Andrew Hardey's account book for voyage to Barbados in the ship the Abraham- extract from wages schedules]]<br />
<br />
[[File:Abraham Handwriting Sample HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Handwriting sample]]<br />
<br />
[[File:Abraham Letter Copy Book HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Letter copy book from on board the ship the Abraham]]<br />
<br />
</div><br />
</div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=File:Our_HTR_Goals_Revised_11092022.jpg&diff=130961File:Our HTR Goals Revised 11092022.jpg2022-03-11T10:50:27Z<p>ColinGreenstreet: </p>
<hr />
<div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=MarineLives&diff=130960MarineLives2022-03-11T10:49:19Z<p>ColinGreenstreet: /* MarineLives research agenda for 2022 */</p>
<hr />
<div>__TOC__<br />
----<br />
==Celebrating our tenth anniversary this year==<br />
<br />
'''To celebrate the tenth anniversary of the founding of Marine Lives we aim to make publicly available and searchable by the end of 2022 a high quality machine transcription of 34 mill words of English High Court of Admiralty depositions, 1574 to 1688.'''<br />
<br />
[[File:Our HTR Goals 11092022.jpg|750px|thumb|left|Celebrating our tenth anniversary]]<br />
----<br />
==MarineLives research agenda for 2022==<br />
<br />
'''Marine Lives celebrates its tenth anniversary in 2022. To mark reaching the remarkable digital age of ten we announced in March 2022 our goal to publish by the end of 2022 a high quality machine transcription of fifty nine volumes of English High Court of Admiralty depositions for the period 1574 to 1688. This initiative will use handwriting recognition models we are developing for C17th Secretary Hand and links to the first point in our four point research agenda for the year:<br />
<br />
1. Developing and optimising keyword search algorithms for C17th Secretary Hand in English language English High Court of Admiralty manuscript documents.<br />
<br />
2. Completing the next phase of our metadata for the HCA 13/ series. Currently, we have metadata for 22,000 deponents from the years 1570 to 1688, supported by 50,000 digital images held offline.<br />
<br />
3. Publishing a [http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page#Work_product_by_the_HCA_30.2F636_team hybrid digital edition of three ship account books from the 1620s and 1630s], taken from HCA 30/636.<br />
<br />
4. Organising two workshops with the Oldenburg Prize Papers team on (a) Developing a broadly applicable digital C16th and C17th commercial document ontology (b) Forming and developing an online international community to study mariner letters.<br />
<br />
----<br />
<br />
==Creating a Ground Truth==<br />
<br />
'''We are creating a bespoke HTR model to read C17th English Secretarial hand. We plan two models. The first will use 500,000 words from our existing diplomatic transcriptions of HCA 13/72. The second will be twice the size, adding a further 500,000 words from an earlier volume of HCA depositions.'''<br />
<br />
Below is an image showing the first manuscript page from HCA 13/72, which we have now entered into Transkribus.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth Click here] to view pages we are developing to illustrate practical aspects of creating a Ground Truth.<br />
<br />
[[File:HCA 1372 f1r GROUND TRUTH 02032022.png|750px|thumb|left|HCA 13/72 f.1r: entered into Transkribus as part of the Ground Truth we are creating]]<br />
<br />
[[File:Ground Truth One 02032022.png|750px|thumb|left|We are adding 100 manuscript pages per week to our Ground Truth]]<br />
<br />
----<br />
<br />
==Fantasy Early Modern book competition==<br />
<br />
'''Announcing a one week FANTASY EARLY MODERN BOOK COMPETITION.'''<br />
<br />
'''Rules: List the chapter titles of a fantasy book you WISH EXISTED and post to the [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.'''<br />
<br />
'''Winner(s) will be those book titles/chapters which get MOST LIKES ON THE [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.''' <br />
<br />
Here is my starter:<br />
<br />
TITLE: <u>Early Modern Economic Lives</u><br />
<br />
CHAPTER 1: Talking about work: Early Modern workers describing their occupations and work places<br />
<br />
CHAPTER 2: The melding of work and home and its implications for participation of men and women in commercial life<br />
<br />
CHAPTER 3: Everyone lies: The importance of accurate record keeping<br />
<br />
CHAPTER 4: The role of the Early Modern bookkeeper [https://www.amazon.co.uk/Accounting-Growth-Stripping-Camouflage-Accounts/dp/0712652809 accounting for growth]<br />
<br />
CHAPTER 5: Service driven functional literacy: How Early Modern international trade drove and was a product of literacy<br />
<br />
CHAPTER 6: Risk and probability: How people thought about individual and group risk, and how it affected their commercial decision making<br />
<br />
CHAPTER 7: Free will, contract, indentured labour, enslavement: Concepts underpinning the Early Modern labour market<br />
<br />
CHAPTER 8: How to make Early Modern money (and keep it)<br />
<br />
'''Here's a possible plan. Assemble a group of authors, recruited through Twitter, who will write and publish the Fantasy EM book as an open source online book. The book would be peer reviewed by its readers, and will go through various iterations as readers comment, suggest and offer to improve.'''<br />
<br />
'''If we go with Early Modern Economic Lives (and I'm happy to go with a different Fantasy EM book), I (Colin Greenstreet) would be happy to write a draft of chapters three, four and five, but would need collaborators to write the remaining chapters.'''<br />
<br />
'''What do you think?'''<br />
----<br />
<br />
==New Year's wishes 2022==<br />
<br />
'''2022 is the tenth anniversary of the founding of Marine Lives. So here are our New Year's wishes for the next ten years for Marine Lives and for Early Modern studies generally.'''<br />
<br />
<u>New Year's Wish One</u><br />
<br />
Digitisation and open licence publication of the C17th manuscript records of the [https://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+CLA~2F004?SESSIONSEARCH Mansion House Justice Room] and of the [https://www.westminster-abbey.org/about-the-abbey/library-research/muniment-collection Westminster Abbey Muniments].<br />
<br />
<u>New Year's Wish Two</u><br />
<br />
Development of AI-enabled search tools for discovery within and synthesis of large scale digitised manuscript collections, bypassing the need for the manual creation of archival metadata.<br />
<br />
<u>New Year's Wish Three</u><br />
<br />
Implementation of federated search using AI-enabled search tools across multiple large scale digitised manuscript collections.<br />
<br />
<u>New Year's Wish Four</u><br />
<br />
Development of Early Modern Material Lives to complement Early Modern Marine Lives, broadening scope from marine to land based occupations, and emphasising the interaction of Early Modern workers with the physical world.<br />
<br />
'''[http://www.marinelives.org/wiki/New_Year%27s_wishes,_2018 Click here] to see our 2018 New Year's wishes.'''<br />
<br />
'''Please Tweet your comments on our New Year's wishes to [https://twitter.com/Marinelivesorg @Marinelivesorg] and share your own New Year's wishes for technologies to support historical research'''<br />
<br />
----<br />
<br />
==New MarineLives project: Researching three ship account books from the 1620s and 1630s: HCA 30/636/==<br />
<br />
[[File:Ship Diamond August 1637 Setting Out To Mallaga f,61r Ratcatcher Image 20212029 145150.JPG|620px|thumb|left|Item: Paid the Ratcatcher: 4''s'' - 4''d'', Account book of the ship the ''Diamond'': setting out expenses to Mallaga, 1637; HCA 13/635/2]]<br />
<br />
[[File:Account Books 13102021.JPG|800px|thumb|left|HCA 30/636 in all its glory, just waiting for some collaborative work by volunteers]]<br />
<br />
'''Are you interested in a startup collaborative online project to look at, partially transcribe and understand three ship account books from the 1620s and 1630s? You have come to the right place. Marine Lives is launching a new project and is seeking volunteer collaborators. This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''<br />
<br />
HCA 30/636 is a document category which has been created to cover certain papers generated by the Prize Court jurisdiction of the English High Court of Admiralty. It contains nine sub-references. We have imaged all the documents within HCA 30/636 and will be making these available to volunteer collaborators online. Documents include three beautifully leather bound account books of various sizes, further paper bound account books, a letter copy book of letters written from on board ship, and various miscellaneous accounting documents relating to multiple voyages. In all we have over one thousand images.<br />
<br />
We are in start up mode. Our current thinking is to make the images available on DropBox or OneDrive and to use this MarineLives wiki as our collaboration platform - to share ideas, to provide support, and to be the vehicle to publish our transcriptions and synthesis. But we are open to your ideas about how to organise this project and nothing will be finalised until we have our team in place. You can get up to speed on our thinking by reading this [https://twitter.com/Marinelivesorg/status/1448320605384753156 Twitter Thread].<br />
<br />
We have had expressions of interest from people from many places - Mexico, Michigan, Texas, London, Newcastle to list a few - which is perfect given the virtual nature of our project and the broad geographic scope of the papers which include multiple voyages from England to the West Indies, the Mediterranean and to Northern Europe.<br />
<br />
This weekend (Saturday, October 16th 2021) we will be sending out an email to everyone who has expressed interest, inviting them to take a look at some sample images and to tell us about their research interests, skills and ideas for this project.<br />
<br />
This is going to be a very relaxed project running through to the middle of 2022 in which people are welcome to dip in and out, and to do as little or as much as they have time and interest for.<br />
<br />
If you are interested in learning more, '''[https://twitter.com/Marinelivesorg follow Marine Lives on Twitter]''', tweet your interest and we will get in touch with you by Twitter direct mail.<br />
<br />
You can also '''[http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page access our HCA 30/636 account book project home page here]''', which we are starting to populate. Come join us and help us fill in the blanks.<br />
----<br />
==Women and Early Modern record keeping==<br />
<br />
'''In November 2021, MarineLives participated in an online seminar on Women and Early Modern recordkeeping, co-hosted by [https://oieahc.wm.edu/fellowships/fellows-since-1945/postdoctoral/caylin-carbonell/ Caylin Carbonell] and [http://www.marinelives.org/wiki/Volunteer_research_interests_and_goals#Colin_Greenstreet Colin Greenstreet]. We hope to develop some of the themes of this seminar in 2022.'''<br />
<br />
Our contribution to the seminar was a series of case studies from C17th English High Court of Admiralty depositions, in which [http://www.marinelives.org/wiki/Women%27s_investment_activities_and_record_keeping '''women testify about their investment activities and record keeping.''']<br />
----<br />
==About MarineLives==<br />
<br />
[[File:About MarineLives 22012018.JPG|800px|thumb|left|MarineLives volunteers, past and present, 2015-2018. [[Tools: Biographies|Read more about them here]]]]<br />
<br />
'''MarineLives is a collaborative volunteer driven project. The project started as a spinoff from a National Archives hackathon in early 2012. We are exploring lives touched by the marine world between 1540 and 1690. Commerce, materials, language and correspondence.'''<br />
<br />
At the core of MarineLives is the collaborative transcription, linkage and enrichment of primary manuscripts from the English High Court of Admiralty, together with thematically related manuscripts from international manuscript and printed document collections.<br />
<br />
In the past ten years over 250 volunteers have contributed to our transcriptions and to our synthesis of the many themes which constitute lives in the Early Modern marine world. Currently, we have {{NUMBEROFARTICLES}} text pages and {{NUMBEROFFILES}} images available and nearly six million words of full text transcriptions on the MarineLives wiki.<br />
<br />
We have finding aids for themes as varied as Early Modern women in the marine world; Materials handling; The Early Modern River Thames; Commercial record keeping; Mariners letters; and many more.<br />
<br />
We have also developed a database of 21,250 depositions drawn from the HCA 13/ series covering the period 1575 to 1684, which provides quantitative and qualitative insights into this important series of Admiralty Court depositions.<br />
<br />
'''[[Tools: Basic wiki skills & palaeographical tips|Sample our training material to see if this could be for you.]]'''<br />
<br />
----<br />
<div style="float: right; vertical-align: bottom; margin-left: 20px; margin-bottom: 20px; background:#dcdcdc; border: 1px solid #b0c4de; width: 530px;"><br />
<div style="background: #b0c4de; padding: 5px 10px 5px 10px; font-size: larger; font-weight: bold;">Sample images</div><br />
<div style="padding: 10px;"><br />
<div style="font-size: normal;">'''This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''</div><br />
[[File:Abraham Hardy Account Book Wages HCA 30 636.JPG|500px|thumb|right|HCA 30/636/3 Andrew Hardey's account book for voyage to Barbados in the ship the Abraham- extract from wages schedules]]<br />
<br />
[[File:Abraham Handwriting Sample HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Handwriting sample]]<br />
<br />
[[File:Abraham Letter Copy Book HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Letter copy book from on board the ship the Abraham]]<br />
<br />
</div><br />
</div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=MarineLives&diff=130959MarineLives2022-03-11T10:43:41Z<p>ColinGreenstreet: </p>
<hr />
<div>__TOC__<br />
----<br />
==Celebrating our tenth anniversary this year==<br />
<br />
'''To celebrate the tenth anniversary of the founding of Marine Lives we aim to make publicly available and searchable by the end of 2022 a high quality machine transcription of 34 mill words of English High Court of Admiralty depositions, 1574 to 1688.'''<br />
<br />
[[File:Our HTR Goals 11092022.jpg|750px|thumb|left|Celebrating our tenth anniversary]]<br />
----<br />
==MarineLives research agenda for 2022==<br />
<br />
'''Marine Lives celebrates its tenth anniversary in 2022. To mark reaching the remarkable digital age of ten we are planning several initiatives, which we will be announcing in January 2022. These initiatives will be linked to our research agenda for the year, which is as below:<br />
<br />
1. Developing and optimising keyword search algorithms for C17th Secretary Hand in English language English High Court of Admiralty manuscript documents.<br />
<br />
2. Completing the next phase of our metadata for the HCA 13/ series. Currently, we have metadata for 22,000 deponents from the years 1570 to 1688, supported by 50,000 digital images held offline.<br />
<br />
3. Publishing a [http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page#Work_product_by_the_HCA_30.2F636_team hybrid digital edition of three ship account books from the 1620s and 1630s], taken from HCA 30/636.<br />
<br />
4. Organising two workshops with the Oldenburg Prize Papers team on (a) Developing a broadly applicable digital C16th and C17th commercial document ontology (b) Forming and developing an online international community to study mariner letters.<br />
<br />
----<br />
==Creating a Ground Truth==<br />
<br />
'''We are creating a bespoke HTR model to read C17th English Secretarial hand. We plan two models. The first will use 500,000 words from our existing diplomatic transcriptions of HCA 13/72. The second will be twice the size, adding a further 500,000 words from an earlier volume of HCA depositions.'''<br />
<br />
Below is an image showing the first manuscript page from HCA 13/72, which we have now entered into Transkribus.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth Click here] to view pages we are developing to illustrate practical aspects of creating a Ground Truth.<br />
<br />
[[File:HCA 1372 f1r GROUND TRUTH 02032022.png|750px|thumb|left|HCA 13/72 f.1r: entered into Transkribus as part of the Ground Truth we are creating]]<br />
<br />
[[File:Ground Truth One 02032022.png|750px|thumb|left|We are adding 100 manuscript pages per week to our Ground Truth]]<br />
<br />
----<br />
<br />
==Fantasy Early Modern book competition==<br />
<br />
'''Announcing a one week FANTASY EARLY MODERN BOOK COMPETITION.'''<br />
<br />
'''Rules: List the chapter titles of a fantasy book you WISH EXISTED and post to the [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.'''<br />
<br />
'''Winner(s) will be those book titles/chapters which get MOST LIKES ON THE [https://twitter.com/Marinelivesorg @Marinelivesorg Twitter] account.''' <br />
<br />
Here is my starter:<br />
<br />
TITLE: <u>Early Modern Economic Lives</u><br />
<br />
CHAPTER 1: Talking about work: Early Modern workers describing their occupations and work places<br />
<br />
CHAPTER 2: The melding of work and home and its implications for participation of men and women in commercial life<br />
<br />
CHAPTER 3: Everyone lies: The importance of accurate record keeping<br />
<br />
CHAPTER 4: The role of the Early Modern bookkeeper [https://www.amazon.co.uk/Accounting-Growth-Stripping-Camouflage-Accounts/dp/0712652809 accounting for growth]<br />
<br />
CHAPTER 5: Service driven functional literacy: How Early Modern international trade drove and was a product of literacy<br />
<br />
CHAPTER 6: Risk and probability: How people thought about individual and group risk, and how it affected their commercial decision making<br />
<br />
CHAPTER 7: Free will, contract, indentured labour, enslavement: Concepts underpinning the Early Modern labour market<br />
<br />
CHAPTER 8: How to make Early Modern money (and keep it)<br />
<br />
'''Here's a possible plan. Assemble a group of authors, recruited through Twitter, who will write and publish the Fantasy EM book as an open source online book. The book would be peer reviewed by its readers, and will go through various iterations as readers comment, suggest and offer to improve.'''<br />
<br />
'''If we go with Early Modern Economic Lives (and I'm happy to go with a different Fantasy EM book), I (Colin Greenstreet) would be happy to write a draft of chapters three, four and five, but would need collaborators to write the remaining chapters.'''<br />
<br />
'''What do you think?'''<br />
----<br />
<br />
==New Year's wishes 2022==<br />
<br />
'''2022 is the tenth anniversary of the founding of Marine Lives. So here are our New Year's wishes for the next ten years for Marine Lives and for Early Modern studies generally.'''<br />
<br />
<u>New Year's Wish One</u><br />
<br />
Digitisation and open licence publication of the C17th manuscript records of the [https://search.lma.gov.uk/scripts/mwimain.dll/144/LMA_OPAC/web_detail/REFD+CLA~2F004?SESSIONSEARCH Mansion House Justice Room] and of the [https://www.westminster-abbey.org/about-the-abbey/library-research/muniment-collection Westminster Abbey Muniments].<br />
<br />
<u>New Year's Wish Two</u><br />
<br />
Development of AI-enabled search tools for discovery within and synthesis of large scale digitised manuscript collections, bypassing the need for the manual creation of archival metadata.<br />
<br />
<u>New Year's Wish Three</u><br />
<br />
Implementation of federated search using AI-enabled search tools across multiple large scale digitised manuscript collections.<br />
<br />
<u>New Year's Wish Four</u><br />
<br />
Development of Early Modern Material Lives to complement Early Modern Marine Lives, broadening scope from marine to land based occupations, and emphasising the interaction of Early Modern workers with the physical world.<br />
<br />
'''[http://www.marinelives.org/wiki/New_Year%27s_wishes,_2018 Click here] to see our 2018 New Year's wishes.'''<br />
<br />
'''Please Tweet your comments on our New Year's wishes to [https://twitter.com/Marinelivesorg @Marinelivesorg] and share your own New Year's wishes for technologies to support historical research'''<br />
<br />
----<br />
<br />
==New MarineLives project: Researching three ship account books from the 1620s and 1630s: HCA 30/636/==<br />
<br />
[[File:Ship Diamond August 1637 Setting Out To Mallaga f,61r Ratcatcher Image 20212029 145150.JPG|620px|thumb|left|Item: Paid the Ratcatcher: 4''s'' - 4''d'', Account book of the ship the ''Diamond'': setting out expenses to Mallaga, 1637; HCA 13/635/2]]<br />
<br />
[[File:Account Books 13102021.JPG|800px|thumb|left|HCA 30/636 in all its glory, just waiting for some collaborative work by volunteers]]<br />
<br />
'''Are you interested in a startup collaborative online project to look at, partially transcribe and understand three ship account books from the 1620s and 1630s? You have come to the right place. Marine Lives is launching a new project and is seeking volunteer collaborators. This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''<br />
<br />
HCA 30/636 is a document category which has been created to cover certain papers generated by the Prize Court jurisdiction of the English High Court of Admiralty. It contains nine sub-references. We have imaged all the documents within HCA 30/636 and will be making these available to volunteer collaborators online. Documents include three beautifully leather bound account books of various sizes, further paper bound account books, a letter copy book of letters written from on board ship, and various miscellaneous accounting documents relating to multiple voyages. In all we have over one thousand images.<br />
<br />
We are in start up mode. Our current thinking is to make the images available on DropBox or OneDrive and to use this MarineLives wiki as our collaboration platform - to share ideas, to provide support, and to be the vehicle to publish our transcriptions and synthesis. But we are open to your ideas about how to organise this project and nothing will be finalised until we have our team in place. You can get up to speed on our thinking by reading this [https://twitter.com/Marinelivesorg/status/1448320605384753156 Twitter Thread].<br />
<br />
We have had expressions of interest from people from many places - Mexico, Michigan, Texas, London, Newcastle to list a few - which is perfect given the virtual nature of our project and the broad geographic scope of the papers which include multiple voyages from England to the West Indies, the Mediterranean and to Northern Europe.<br />
<br />
This weekend (Saturday, October 16th 2021) we will be sending out an email to everyone who has expressed interest, inviting them to take a look at some sample images and to tell us about their research interests, skills and ideas for this project.<br />
<br />
This is going to be a very relaxed project running through to the middle of 2022 in which people are welcome to dip in and out, and to do as little or as much as they have time and interest for.<br />
<br />
If you are interested in learning more, '''[https://twitter.com/Marinelivesorg follow Marine Lives on Twitter]''', tweet your interest and we will get in touch with you by Twitter direct mail.<br />
<br />
You can also '''[http://www.marinelives.org/wiki/HCA_30/636_Project_Home_Page access our HCA 30/636 account book project home page here]''', which we are starting to populate. Come join us and help us fill in the blanks.<br />
----<br />
==Women and Early Modern record keeping==<br />
<br />
'''In November 2021, MarineLives participated in an online seminar on Women and Early Modern recordkeeping, co-hosted by [https://oieahc.wm.edu/fellowships/fellows-since-1945/postdoctoral/caylin-carbonell/ Caylin Carbonell] and [http://www.marinelives.org/wiki/Volunteer_research_interests_and_goals#Colin_Greenstreet Colin Greenstreet]. We hope to develop some of the themes of this seminar in 2022.'''<br />
<br />
Our contribution to the seminar was a series of case studies from C17th English High Court of Admiralty depositions, in which [http://www.marinelives.org/wiki/Women%27s_investment_activities_and_record_keeping '''women testify about their investment activities and record keeping.''']<br />
----<br />
==About MarineLives==<br />
<br />
[[File:About MarineLives 22012018.JPG|800px|thumb|left|MarineLives volunteers, past and present, 2015-2018. [[Tools: Biographies|Read more about them here]]]]<br />
<br />
'''MarineLives is a collaborative volunteer driven project. The project started as a spinoff from a National Archives hackathon in early 2012. We are exploring lives touched by the marine world between 1540 and 1690. Commerce, materials, language and correspondence.'''<br />
<br />
At the core of MarineLives is the collaborative transcription, linkage and enrichment of primary manuscripts from the English High Court of Admiralty, together with thematically related manuscripts from international manuscript and printed document collections.<br />
<br />
In the past ten years over 250 volunteers have contributed to our transcriptions and to our synthesis of the many themes which constitute lives in the Early Modern marine world. Currently, we have {{NUMBEROFARTICLES}} text pages and {{NUMBEROFFILES}} images available and nearly six million words of full text transcriptions on the MarineLives wiki.<br />
<br />
We have finding aids for themes as varied as Early Modern women in the marine world; Materials handling; The Early Modern River Thames; Commercial record keeping; Mariners letters; and many more.<br />
<br />
We have also developed a database of 21,250 depositions drawn from the HCA 13/ series covering the period 1575 to 1684, which provides quantitative and qualitative insights into this important series of Admiralty Court depositions.<br />
<br />
'''[[Tools: Basic wiki skills & palaeographical tips|Sample our training material to see if this could be for you.]]'''<br />
<br />
----<br />
<div style="float: right; vertical-align: bottom; margin-left: 20px; margin-bottom: 20px; background:#dcdcdc; border: 1px solid #b0c4de; width: 530px;"><br />
<div style="background: #b0c4de; padding: 5px 10px 5px 10px; font-size: larger; font-weight: bold;">Sample images</div><br />
<div style="padding: 10px;"><br />
<div style="font-size: normal;">'''This will be a project about co-creation of a public resource, which will be published on the Marine Lives wiki and made available to all - public and academic historians alike (and those just intrigued by our past).'''</div><br />
[[File:Abraham Hardy Account Book Wages HCA 30 636.JPG|500px|thumb|right|HCA 30/636/3 Andrew Hardey's account book for voyage to Barbados in the ship the Abraham - extract from wages schedules]]<br />
<br />
[[File:Abraham Handwriting Sample HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Handwriting sample]]<br />
<br />
[[File:Abraham Letter Copy Book HCA 30 636.JPG|500px|thumb|right|HCA 30/636/ Letter copy book from on board the ship the Abraham]]<br />
<br />
</div><br />
</div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=File:Our_HTR_Goals_11092022.jpg&diff=130958File:Our HTR Goals 11092022.jpg2022-03-11T10:41:15Z<p>ColinGreenstreet: </p>
<hr />
<div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130957Running List of Questions2022-03-04T21:03:17Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This can be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?
<br />
As an alternative, can we start by defining by hand all the Text Regions we want on a given page and THEN run the automatic layout recognition, with the recognition program identifying only base lines and assigning them to our manually chosen Text Regions?<br />
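The bespoke numbering idea above can be sketched outside Transkribus as a small post-processing step. This is only an illustration under stated assumptions, not a Transkribus feature: the names <code>TextRegion</code>, <code>zigzag_order</code> and <code>band_tolerance</code> are hypothetical, the coordinates are assumed to be pixel positions of each region's bounding box, and the example page is invented to mirror the typical deposition layout described above. Regions whose top edges lie close together are grouped into a horizontal band, and each band is then read across the page.

```python
from dataclasses import dataclass


@dataclass
class TextRegion:
    """Hypothetical bounding box for one Text Region (pixel coordinates)."""
    label: str
    left: float
    top: float


def zigzag_order(regions, band_tolerance=50.0, right_to_left=False):
    """Order regions band by band down the page, reading across each band.

    A region joins the current band when its top edge is within
    band_tolerance pixels of the top edge of the band's first region.
    """
    ordered = []
    band = []
    band_top = None
    for region in sorted(regions, key=lambda r: r.top):
        if band_top is None or region.top - band_top <= band_tolerance:
            if band_top is None:
                band_top = region.top
            band.append(region)
        else:
            # Close the current band and start a new one lower down the page.
            ordered.extend(sorted(band, key=lambda r: r.left, reverse=right_to_left))
            band = [region]
            band_top = region.top
    ordered.extend(sorted(band, key=lambda r: r.left, reverse=right_to_left))
    return ordered


# Invented example page following the implied three column structure.
regions = [
    TextRegion("date", left=500, top=100),
    TextRegion("cause name", left=50, top=120),
    TextRegion("personal front matter", left=350, top=140),
    TextRegion("main body", left=350, top=300),
    TextRegion("merchant mark", left=60, top=320),
    TextRegion("signoff", left=600, top=900),
    TextRegion("legal boilerplate", left=50, top=910),
]

order = [region.label for region in zigzag_order(regions)]
print(order)
```

Setting <code>right_to_left=True</code> reads each band from the right hand side instead, which may suit a zig zag that starts from the top RH corner. The band tolerance would need tuning against real page images.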
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants' Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.35 How should we handle "j" with the baseline?<br />
<br />
7.4 Does the machine learning process when applied to the Ground Truth make any reference to Text Regions as defined by Text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.51 What about ornate signatures with flourishes? Which part of a signature should be in the baseline?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.61 Marks and initials usually have text like "the marke of", as well as the actual mark or initial(s).<br />
<br />
7.7 How should we handle initials (as signoffs)?<br />
<br />
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?<br />
<br />
7.9 Is it possible to use a different HTR engine for predefined structure types in a document? For example, a signature-HTR, a Legal-Latin-HTR, a secretary-hand-HTR?<br />
<br />
7.95 Good practice in cropping text region boxes? Especially where in curved or dark margins?<br />
<br />
7.96 Dashes at end of lines, used to fill gaps?<br />
<br />
7.97 Blacked out words versus crossed out words?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130956Running List of Questions2022-03-04T16:23:59Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This can be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?
<br />
As an alternative, can we start by defining by hand all the Text Regions we want on a given page and THEN run the automatic layout recognition, with the recognition program identifying only base lines and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants' Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.35 How should we handle "j" with the baseline?<br />
<br />
7.4 Does the machine learning process when applied to the Ground Truth make any reference to Text Regions as defined by Text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.51 What about ornate signatures with flourishes? Which part of a signature should be in the baseline?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.61 Marks and initials usually have text like "the marke of", as well as the actual mark or initial(s).<br />
<br />
7.7 How should we handle initials (as signoffs)?<br />
<br />
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?<br />
<br />
7.9 Is it possible to use a different HTR engine for predefined structure types in a document? For example, a signature-HTR, a Legal-Latin-HTR, a secretary-hand-HTR?<br />
<br />
7.95 Good practice in cropping text region boxes? Especially where in curved or dark margins?<br />
<br />
7.96 Dashes at end of lines, used to fill gaps?<br />
<br />
7.97 Blacked out words versus crossed out words?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130955Running List of Questions2022-03-04T16:20:10Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This can be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?
<br />
As an alternative, can we start by defining by hand all the Text Regions we want on a given page and THEN run the automatic layout recognition, with the recognition program identifying only base lines and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants' Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.35 How should we handle "j" with the baseline?<br />
<br />
7.4 Does the machine learning process when applied to the Ground Truth make any reference to Text Regions as defined by Text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.51 What about ornate signatures with flourishes? Which part of a signature should be in the baseline?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.61 Marks and initials usually have text like "the marke of", as well as the actual mark or initial(s).<br />
<br />
7.7 How should we handle initials (as signoffs)?<br />
<br />
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?<br />
<br />
7.9 Is it possible to use a different HTR engine for predefined structure types in a document? For example, a signature-HTR, a Legal-Latin-HTR, a secretary-hand-HTR?<br />
<br />
7.95 Good practice in cropping text region boxes? Especially where in curved or dark margins?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130954Running List of Questions2022-03-04T16:10:18Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This can be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?
<br />
As an alternative, can we start by defining by hand all the Text Regions we want on a given page and THEN run the automatic layout recognition, with the recognition program identifying only base lines and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants' Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.35 How should we handle "j" with the baseline?<br />
<br />
7.4 Does the machine learning process when applied to the Ground Truth make any reference to Text Regions as defined by Text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.51 What about ornate signatures with flourishes? Which part of a signature should be in the baseline?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.61 Marks and initials usually have text like "the marke of", as well as the actual mark or initial(s).<br />
<br />
7.7 How should we handle initials (as signoffs)?<br />
<br />
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?<br />
<br />
7.9 Is it possible to use a different HTR engine for predefined structure types in a document? For example, a signature-HTR, a Legal-Latin-HTR, a secretary-hand-HTR?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130953Running List of Questions2022-03-04T15:49:00Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or is it better to keep rectangular shapes, aligned horizontally with the image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically laid out in three implied columns of text, and most HCA deposition image pages follow this structure. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, occupation and age. These data, which for convenience we call the "Personal Front Matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page.<br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting at the top left hand side of a page, then moving their eyes to the first block of text on the left and then to the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size, shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If there is no page number in the top right hand corner, look for a pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create a Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from the top RH corner of a page, with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
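<br />
To make the idea of a bespoke numbering order concrete, here is a minimal sketch in Python. It is not a Transkribus feature or API: the <code>Region</code> class, the <code>reading_order</code> function and the <code>band_height</code> tolerance are all hypothetical names chosen for illustration. It assumes each Text Region is known only by the position of its top left corner, groups regions into horizontal bands, and reads each band from left to right, which gives a zig zag starting at the top left of the page.<br />

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical Text Region, described by its top left corner in pixels."""
    name: str
    x: float  # distance of the left edge from the left of the page
    y: float  # distance of the top edge from the top of the page


def reading_order(regions, band_height=50.0):
    """Return the regions in a zig zag order: top to bottom, left to right.

    Regions whose top edges lie within band_height of the first region
    in the current band are treated as being on the same horizontal level.
    band_height is an assumed tolerance and would need tuning per image size.
    """
    bands = []
    for region in sorted(regions, key=lambda r: r.y):
        if bands and region.y - bands[-1][0].y <= band_height:
            bands[-1].append(region)   # same horizontal level as the band start
        else:
            bands.append([region])     # start a new horizontal band
    ordered = []
    for band in bands:
        ordered.extend(sorted(band, key=lambda r: r.x))  # left to right
    return ordered


# Illustrative coordinates only, loosely following the three column layout.
page = [
    Region("body", 400, 300),
    Region("signoff", 500, 900),
    Region("cause", 20, 100),
    Region("date", 400, 20),
    Region("boilerplate", 20, 890),
    Region("front_matter", 400, 110),
    Region("merchant_mark", 20, 320),
]

for number, region in enumerate(reading_order(page), start=1):
    print(number, region.name)
```

A zig zag starting from the top right hand corner, as suggested in the list above, could be sketched the same way by sorting each band on <code>x</code> in descending order.<br />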
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument for including images in diverse lighting conditions and with different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants' Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Regions and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.35 How should the letter "j" be handled in relation to the baseline?<br />
<br />
7.4 Does the machine learning process, when applied to the Ground Truth, make any reference to Text Regions as defined by text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries, is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.51 What about ornate signatures with flourishes? Which part of a signature should be on the baseline?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.7 How should we handle initials (as signoffs)?<br />
<br />
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?<br />
<br />
7.9 Is it possible to use a different HTR engine for predefined structure types in a document, for example a signature-HTR, a Legal-Latin-HTR or a secretary-hand-HTR?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130952Running List of Questions2022-03-04T15:12:14Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretary Hand Ground Truth.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or is it better to keep rectangular shapes, aligned horizontally with the image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically laid out in three implied columns of text, and most HCA deposition image pages follow this structure. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, occupation and age. These data, which for convenience we call the "Personal Front Matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page.<br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting at the top left hand side of a page, then moving their eyes to the first block of text on the left and then to the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size, shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If there is no page number in the top right hand corner, look for a pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create a Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from the top RH corner of a page, with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument for including images in diverse lighting conditions and with different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants' Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Regions and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.4 Does the machine learning process, when applied to the Ground Truth, make any reference to Text Regions as defined by text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries, is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.51 What about ornate signatures with flourishes? Which part of a signature should be on the baseline?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.7 How should we handle initials (as signoffs)?<br />
<br />
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?<br />
<br />
7.9 Is it possible to use a different HTR engine for predefined structure types in a document, for example a signature-HTR, a Legal-Latin-HTR or a secretary-hand-HTR?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130951Running List of Questions2022-03-04T13:57:45Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretary Hand Ground Truth.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or is it better to keep rectangular shapes, aligned horizontally with the image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically laid out in three implied columns of text, and most HCA deposition image pages follow this structure. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, occupation and age. These data, which for convenience we call the "Personal Front Matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page.<br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting at the top left hand side of a page, then moving their eyes to the first block of text on the left and then to the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size, shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If there is no page number in the top right hand corner, look for a pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create a Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from the top RH corner of a page, with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument for including images in diverse lighting conditions and with different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants' Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Regions and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.4 Does the machine learning process, when applied to the Ground Truth, make any reference to Text Regions as defined by text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries, is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.7 How should we handle initials (as signoffs)?<br />
<br />
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?<br />
<br />
7.9 Is it possible to use a different HTR engine for predefined structure types in a document, for example a signature-HTR, a Legal-Latin-HTR or a secretary-hand-HTR?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130950Running List of Questions2022-03-04T13:57:27Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretary Hand Ground Truth.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or is it better to keep rectangular shapes, aligned horizontally with the image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically laid out in three implied columns of text, and most HCA deposition image pages follow this structure. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, occupation and age. These data, which for convenience we call the "Personal Front Matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page.<br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting at the top left hand side of a page, then moving their eyes to the first block of text on the left and then to the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size, shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If there is no page number in the top right hand corner, look for a pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create a Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from the top RH corner of a page, with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument for including images in diverse lighting conditions and with different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants' Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Regions and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.4 Does the machine learning process, when applied to the Ground Truth, make any reference to Text Regions as defined by text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries, is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.7 How should we handle initials (as signoffs)?<br />
<br />
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?<br />
<br />
7.9 Is it possible to use a different HTR engine for predefined structure types in a document, for example a signature-HTR, a Legal-Latin-HTR or a secretary-hand-HTR?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130949Running List of Questions2022-03-04T13:47:51Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretary Hand Ground Truth.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or is it better to keep rectangular shapes, aligned horizontally with the image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically laid out in three implied columns of text, and most HCA deposition image pages follow this structure. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, occupation and age. These data, which for convenience we call the "Personal Front Matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page.<br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting at the top left hand side of a page, then moving their eyes to the first block of text on the left and then to the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size, shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.4 Does the machine learning process when applied to the Ground Truth make any reference to Text Regions as defined by Text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.7 How should we handle initials (as signoffs)?<br />
<br />
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130948Running List of Questions2022-03-04T12:55:49Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.4 Does the machine learning process when applied to the Ground Truth make any reference to Text Regions as defined by Text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.7 How should we handle initials (as signoffs)?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130947Running List of Questions2022-03-04T12:35:31Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.4 Does the machine learning process when applied to the Ground Truth make any reference to Text Regions as defined by Text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?<br />
<br />
7.5 How should we handle baselines for signatures, and indeed signatures generally?<br />
<br />
7.6 How should we handle marks (as signoffs)?<br />
<br />
7.7 How should we handle initials (as signoffs)?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130946Running List of Questions2022-03-04T12:19:33Z<p>ColinGreenstreet: /* Question Seven */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?<br />
<br />
7.4 Does the machine learning process when applied to the Ground Truth make any reference to Text Regions as defined by Text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130945Running List of Questions2022-03-04T11:40:01Z<p>ColinGreenstreet: /* Question Six */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically pictograms) referred to in the main body.<br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?<br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----<br />
==Question Seven==<br />
<br />
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?<br />
<br />
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?<br />
<br />
7.3 Can additional points be inserted into a baseline, and not just at either end?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130944Running List of Questions2022-03-04T07:28:43Z<p>ColinGreenstreet: /* Question Three */</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text, and most deposition image pages follow this structure. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column. For example, merchants' marks (typically these are pictograms), referred to in the main body. <br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting at the top left hand side of a page, then moving their eyes to the first block of text on the left and then to the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach (programme) an automatic layout recogniser to look for certain types of Text Region (by size, shape and amount of text) and to number the Text Regions it detects in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
 Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
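<br />
As a rough illustration of the zig zag numbering idea, here is a minimal Python sketch. It is not a Transkribus feature: the <code>Region</code> class, the <code>zigzag_order</code> function, the <code>band_tolerance</code> value and all pixel coordinates are hypothetical, chosen only to mimic the three column structure described above. Regions are read top to bottom, and regions that start at roughly the same height are read left to right.<br />

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical rectangular Text Region, in image pixel coordinates."""
    name: str
    left: int
    top: int
    width: int
    height: int


def zigzag_order(regions, band_tolerance=50):
    """Order regions top to bottom; regions whose tops lie within
    band_tolerance pixels of each other are read left to right."""
    remaining = sorted(regions, key=lambda r: r.top)
    ordered = []
    while remaining:
        band_top = remaining[0].top
        # Collect every region that starts at roughly the same height.
        band = [r for r in remaining if r.top - band_top <= band_tolerance]
        remaining = [r for r in remaining if r.top - band_top > band_tolerance]
        ordered.extend(sorted(band, key=lambda r: r.left))
    return ordered


# Invented layout of one deposition page (all numbers are made up).
page = [
    Region("main body", 400, 320, 1200, 1500),
    Region("cause name", 50, 170, 300, 120),
    Region("date", 600, 60, 400, 80),
    Region("personal front matter", 500, 190, 1100, 100),
    Region("signoff", 900, 1850, 500, 150),
    Region("boilerplate", 50, 1880, 300, 200),
]

reading_order = [r.name for r in zigzag_order(page)]
print(reading_order)
```

The tolerance is the main design choice: too small and a cause name that sits slightly lower than the front matter is numbered after it, too large and separate levels merge into one band.<br />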
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
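<br />
To make the alternative concrete, the assignment step could in principle work like the following Python sketch. This is an assumption about the geometry only, not a description of how Transkribus allocates lines: the function names, the rectangular region boxes and the sample coordinates are all hypothetical. Each base line is given to the manually drawn region that contains the most of its points.<br />

```python
def points_inside(box, points):
    """Count how many (x, y) points fall inside a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return sum(1 for x, y in points if left <= x <= right and top <= y <= bottom)


def assign_baselines(regions, baselines):
    """Map each base line id to the id of the region holding most of its points.

    A base line that touches no region at all is mapped to None, so that it
    can be reviewed by hand.
    """
    assignment = {}
    for line_id, points in baselines.items():
        best_region, best_count = None, 0
        for region_id, box in regions.items():
            count = points_inside(box, points)
            if count > best_count:
                best_region, best_count = region_id, count
        assignment[line_id] = best_region
    return assignment


# Two hand-drawn regions and three automatically detected base lines (invented values).
regions = {"margin": (0, 0, 350, 2000), "body": (351, 0, 1600, 2000)}
baselines = {
    "l1": [(60, 400), (200, 402), (330, 405)],
    "l2": [(340, 500), (700, 503), (1500, 507)],
    "l3": [(1700, 900), (1800, 900)],
}

assignment = assign_baselines(regions, baselines)
print(assignment)
```

Irregular (non-rectangular) regions would need a point-in-polygon test in place of the simple box check.<br />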
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy? <br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
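<br />
When comparing answers to these questions it may help to fix one measure of accuracy. One common measure for HTR output is the character error rate: the edit distance between a reference transcription and the machine transcription, divided by the length of the reference. The Python sketch below is a minimal illustration under that assumption; the function names and the sample strings are invented and are not taken from our Ground Truth.<br />

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions turning reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented example line: one substitution and one dropped letter.
ground_truth = "the said shipp"
machine_output = "the sayd ship"
print(character_error_rate(ground_truth, machine_output))
```

Measured this way, the effect of adding interlineation or poorly lit images to the Ground Truth could be compared on the same held-back test pages.<br />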
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Customized_Structural_Analysis&diff=130943Customized Structural Analysis2022-03-03T22:44:36Z<p>ColinGreenstreet: </p>
<hr />
<div>We are experimenting with a range of Transkribus tools related to layout analysis and HTR.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
----<br />
<br />
[[File:HCA 1372 Ground Truth Customized Structure Tags Input 03032022.png|750px|thumb|left|Text regions (and other structural elements of text) can be labelled with customizable structural tags within Transkribus Expert Client]]<br />
<br />
[[File:HCA 1372 Ground Truth Customized Structure Tages 03032022.png|750px|thumb|left|Three pages from our growing HCA 13/72 Ground Truth marked up with customized structural tags]]<br />
<br />
One tool we are trying out is the manual naming of structural elements in the legal depositions which form our corpus.<br />
<br />
We are using the customizable structural analysis tools available in Transkribus Expert Client, and hope to train a model to recognise these different structural types in our data.<br />
<br />
Transkribus has a useful guide: [https://readcoop.eu/transkribus/howto/how-to-use-the-structural-tagging-feature-and-how-to-train-it/ How To Use the Structural Tagging Feature and How To Train It with P2PaLA]</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130942Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T22:25:48Z<p>ColinGreenstreet: </p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
----<br />
__TOC__<br />
----<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).<br />
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
==Tools==<br />
<br />
We are working with several related Transkribus Tools and with our own semantic media wiki<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Expert_Client 1. Transkribus Expert Client]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Lite_version_2.0 2. Transkribus Lite version 2.0]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Marine_Lives_wiki 3. Marine Lives semantic media wiki]<br />
<br />
----<br />
==Work Process==<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Automatic_layout_recognition_of_all_1518_images_in_HCA_13.2F72 1 Automatic layout recognition of all 1518 images in HCA 13/72]<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Input_of_existing_semi-diplomatic_transcriptions_of_HCA_13.2F72_manuscript_pages_into_Transkribus_Expert_Client 2 Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client]<br />
<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to [http://www.marinelives.org/wiki/Customized_Structural_Analysis layout analysis and HTR].<br />
<br />
----<br />
==Questions==<br />
<br />
'''We are developing a [http://www.marinelives.org/wiki/Running_List_of_Questions running list of questions.]'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Customized_Structural_Analysis&diff=130941Customized Structural Analysis2022-03-03T22:21:28Z<p>ColinGreenstreet: </p>
<hr />
<div>We are experimenting with a range of Transkribus tools related to layout analysis and HTR.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
----<br />
<br />
[[File:HCA 1372 Ground Truth Customized Structure Tags Input 03032022.png|750px|thumb|left|Text regions (and other structural elements of text) can be labelled with customizable structural tags within Transkribus Expert Client]]<br />
<br />
[[File:HCA 1372 Ground Truth Customized Structure Tages 03032022.png|750px|thumb|left|Three pages from our growing HCA 13/72 Ground Truth marked up with customized structural tags]]<br />
<br />
One tool we are trying out is the manual naming of structural elements in the legal depositions which form our corpus.<br />
<br />
We are using the customizable structural analysis tools available in Transkribus Expert Client, and hope to train a model to recognise these different structural types in our data.</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=File:HCA_1372_Ground_Truth_Customized_Structure_Tags_Input_03032022.png&diff=130940File:HCA 1372 Ground Truth Customized Structure Tags Input 03032022.png2022-03-03T22:19:50Z<p>ColinGreenstreet: </p>
<hr />
<div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Ground_Truth_Work_Process&diff=130939Ground Truth Work Process2022-03-03T22:02:55Z<p>ColinGreenstreet: </p>
<hr />
<div>'''We have set up a simple work process'''<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
----<br />
__TOC__<br />
----<br />
==Automatic layout recognition of all 1518 images in HCA 13/72==<br />
<br />
- Used the CITlab Advanced Tool<br />
<br />
[[File:CITlab Advanced Tool ML 03032022.png|500px|thumb|left|Layout Analysis controls in Tools section of Transkribus Expert Client controls panel]]<br />
<br />
- Modified the layout page by page after manual inspection of automatically generated layouts<br />
<br />
We are only just beginning to think through what makes sense in terms of use of Text Regions when creating our Ground Truth<br />
 We are finding that the automatic tool is typically producing between one and three Text Regions per manuscript image<br />
Typically the tool is NOT identifying text blocks on the left hand side of an image as separate from structurally separate text in the main body of text<br />
Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this<br />
In the short term, we are manually adding Text Regions, and changing the shape and size of Text Regions<br />
However, base lines of text have already been recognised and allocated to specific text regions. <br />
We have found an easy way using Transkribus layout tools to reallocate the base lines [see below]<br />
<br />
- The key modifications we are making are<br />
<br />
 (a) Adjusting number, size and shape of Text Regions<br />
 (b) Checking all automatically generated base lines (which themselves are "children" of a parent Text Region)<br />
Look for breaks in base lines<br />
Look for incomplete base lines<br />
Connect broken base lines<br />
Extend incomplete base lines<br />
(c) Reallocating base lines to our newly created and/or adjusted Text Regions<br />
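<br />
The step of connecting broken base lines can be pictured with the following Python sketch. It shows the geometry only and is not a Transkribus function: <code>join_baselines</code>, the <code>max_step</code> spacing and the coordinates are hypothetical. Two partial base lines are joined end to start, and when the gap between them is wide, evenly spaced extra points are added across the join.<br />

```python
import math


def join_baselines(left_part, right_part, max_step=100):
    """Join two partial base lines (lists of (x, y) points).

    If the gap between the end of left_part and the start of right_part is
    wider than max_step pixels, evenly spaced points are inserted so that no
    step across the join exceeds max_step.
    """
    (x0, y0), (x1, y1) = left_part[-1], right_part[0]
    gap = math.hypot(x1 - x0, y1 - y0)
    extra = max(0, math.ceil(gap / max_step) - 1)
    bridge = [
        (
            round(x0 + (x1 - x0) * k / (extra + 1)),
            round(y0 + (y1 - y0) * k / (extra + 1)),
        )
        for k in range(1, extra + 1)
    ]
    return left_part + bridge + right_part


# Two fragments of one line of text, broken by a gap of about 300 pixels (invented values).
left_part = [(100, 500), (300, 502)]
right_part = [(600, 510), (800, 512)]

joined = join_baselines(left_part, right_part)
print(joined)
```

A straight bridge is the simplest assumption; where the handwriting curves across the gap, the inserted points would still need adjusting by eye.<br />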
<br />
[[File:Transkribus Expert Client Layout HCA 1372 f.14v.png|750px|thumb|left|Layout of HCA 13/72 f.14v once we have manually adjusted the Text Regions, creating six Text Regions and reallocating lines to those regions]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions One 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part One]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions Two 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part Two]]<br />
<br />
----<br />
==Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client==<br />
<br />
Once the automatically generated Text Regions have been adjusted for a specific image page<br />
<br />
* Input the semi-diplomatic Marine Lives transcription for the relevant page, matching each line of transcribed text to the correct automatically generated line within the correct Text Region<br />
<br />
* The chart below shows our workflow for manuscript page HCA 13/72 f.11v. <br />
We have the Marine Lives wiki open at the correct page on the left hand side of our screen. <br />
In the middle and on the right hand of our screen we have the Transkribus Expert Client open with the Layout Tab open in Transcription View. <br />
This enables us to see the relevant part of the image, with the relevant Text Region.<br />
We are pasting transcribed text against the correct lines. <br />
To ensure a good human overview, we have pasted two or three lines of transcribed text into each Text Region<br />
This gives us good human oversight of the document.<br />
Then we work methodically through all the text<br />
<br />
[[File:Workflow Page HCA1372f.11v.png|750px|thumb|left|Our workflow showing Marine Lives wiki page and Transkribus Expert Client with Layout Tab open in Transcription View: Part Two]]<br />
<br />
----</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130938Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T22:01:42Z<p>ColinGreenstreet: </p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
----<br />
__TOC__<br />
----<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).<br />
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
==Tools==<br />
<br />
We are working with several related Transkribus Tools and with our own semantic media wiki<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Expert_Client 1. Transkribus Expert Client]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Lite_version_2.0 2. Transkribus Lite version 2.0]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Marine_Lives_wiki 3. Marine Lives semantic media wiki]<br />
<br />
----<br />
==Work Process==<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Automatic_layout_recognition_of_all_1518_images_in_HCA_13.2F72 1 Automatic layout recognition of all 1518 images in HCA 13/72]<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Input_of_existing_semi-diplomatic_transcriptions_of_HCA_13.2F72_manuscript_pages_into_Transkribus_Expert_Client 2 Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client]<br />
<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to [http://www.marinelives.org/wiki/Customized_Structural_Analysis layout analysis and HTR].<br />
<br />
----<br />
==Questions==<br />
<br />
'''We are developing a [http://www.marinelives.org/wiki/Running_List_of_Questions running list of questions.]'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130937Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T22:00:55Z<p>ColinGreenstreet: </p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
<br />
__TOC__<br />
<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).<br />
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
==Tools==<br />
<br />
We are working with several related Transkribus Tools and with our own semantic media wiki<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Expert_Client 1. Transkribus Expert Client]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Lite_version_2.0 2. Transkribus Lite version 2.0]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Marine_Lives_wiki 3. Marine Lives semantic media wiki]<br />
<br />
----<br />
==Work Process==<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Automatic_layout_recognition_of_all_1518_images_in_HCA_13.2F72 1 Automatic layout recognition of all 1518 images in HCA 13/72]<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Input_of_existing_semi-diplomatic_transcriptions_of_HCA_13.2F72_manuscript_pages_into_Transkribus_Expert_Client 2 Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client]<br />
<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to [http://www.marinelives.org/wiki/Customized_Structural_Analysis layout analysis and HTR].<br />
<br />
----<br />
==Questions==<br />
<br />
'''We are developing a [http://www.marinelives.org/wiki/Running_List_of_Questions running list of questions.]'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130936Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T22:00:35Z<p>ColinGreenstreet: </p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
<br />
<br />
__TOC__<br />
<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).<br />
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
==Tools==<br />
<br />
We are working with several related Transkribus Tools and with our own semantic media wiki<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Expert_Client 1. Transkribus Expert Client]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Lite_version_2.0 2. Transkribus Lite version 2.0]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Marine_Lives_wiki 3. Marine Lives semantic media wiki]<br />
<br />
----<br />
==Work Process==<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Automatic_layout_recognition_of_all_1518_images_in_HCA_13.2F72 1 Automatic layout recognition of all 1518 images in HCA 13/72]<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Input_of_existing_semi-diplomatic_transcriptions_of_HCA_13.2F72_manuscript_pages_into_Transkribus_Expert_Client 2 Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client]<br />
<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to [http://www.marinelives.org/wiki/Customized_Structural_Analysis layout analysis and HTR].<br />
<br />
----<br />
==Questions==<br />
<br />
'''We are developing a [http://www.marinelives.org/wiki/Running_List_of_Questions running list of questions.]'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130935Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T21:55:24Z<p>ColinGreenstreet: </p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
<br />
__TOC__<br />
<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).<br />
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
==Tools==<br />
<br />
We are working with several related Transkribus Tools and with our own semantic media wiki<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Expert_Client 1. Transkribus Expert Client]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Lite_version_2.0 2. Transkribus Lite version 2.0]<br />
<br />
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Marine_Lives_wiki 3. Marine Lives semantic media wiki]<br />
<br />
----<br />
==Work Process==<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Automatic_layout_recognition_of_all_1518_images_in_HCA_13.2F72 1 Automatic layout recognition of all 1518 images in HCA 13/72]<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Input_of_existing_semi-diplomatic_transcriptions_of_HCA_13.2F72_manuscript_pages_into_Transkribus_Expert_Client 2 Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client]<br />
<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to [http://www.marinelives.org/wiki/Customized_Structural_Analysis layout analysis and HTR].<br />
<br />
----<br />
==Questions==<br />
<br />
'''We are developing a [http://www.marinelives.org/wiki/Running_List_of_Questions running list of questions.]'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Tools_to_create_our_Ground_Truth&diff=130934Tools to create our Ground Truth2022-03-03T21:53:50Z<p>ColinGreenstreet: </p>
<hr />
<div>We are working with several related Transkribus Tools and with our own semantic media wiki<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
----<br />
__TOC__<br />
----<br />
==Using Transkribus Expert Client==<br />
<br />
We are using Transkribus Expert Client as our main tool to perform automated layout recognition, manual correction of these layouts, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.<br />
<br />
We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.<br />
<br />
----<br />
==Using Transkribus Lite version 2.0==<br />
<br />
[[File:Transkribus Lite HCA1372 Done Thumbnails 03032022.png|750px|thumb|left|Transkribus Lite Version 2.0 Thumbnail display showing images with status "Done" in our work process]]<br />
<br />
<br />
Transkribus has recently introduced an improved version of its web browser interface ([https://transkribus.eu/lite/ Transkribus Lite Version 2.0]). <br />
<br />
Transkribus has a [https://readcoop.eu/transkribus/howto/getting-started-with-transkribus-lite/ useful online guide to using Transkribus Lite Version 2.0].<br />
<br />
We are finding this improved browser interface to be pretty responsive in terms of short lag times as we browse images.<br />
<br />
The browser interface also has useful functionality not available in Transkribus Expert Client.<br />
<br />
Most useful to date are:<br />
<br />
Large thumbnails<br />
Ability to display thumbnails by status of manuscript pages within our work process<br />
----<br />
<br />
==Marine Lives wiki==<br />
<br />
The [http://www.marinelives.org/wiki/MarineLives Marine Lives wiki] is a Semantic Media Wiki. It is organised into volumes and pages.<br />
<br />
We are working from volume [http://www.marinelives.org/wiki/HCA_13/72 HCA 13/72] and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Customized_Structural_Analysis&diff=130933Customized Structural Analysis2022-03-03T21:50:38Z<p>ColinGreenstreet: Created page with "We are experimenting with a range of Transkribus tools related to layout analysis and HTR. [http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Groun..."</p>
<hr />
<div>We are experimenting with a range of Transkribus tools related to layout analysis and HTR.<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
----<br />
[[File:HCA 1372 Ground Truth Customized Structure Tages 03032022.png|750px|thumb|left|Text regions (and other structural elements of text) can be labelled with customizable structural tags within Transkribus Expert Client]]<br />
<br />
One tool we are trying out is the manual naming of structural elements in the legal depositions which form our corpus. We are using the customizable structural analysis tools available in Transkribus Expert Client, and hope to train a model to recognise these different structural types in our data.</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=File:HCA_1372_Ground_Truth_Customized_Structure_Tages_03032022.png&diff=130932File:HCA 1372 Ground Truth Customized Structure Tages 03032022.png2022-03-03T21:48:52Z<p>ColinGreenstreet: </p>
<hr />
<div></div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Tools_to_create_our_Ground_Truth&diff=130931Tools to create our Ground Truth2022-03-03T21:41:01Z<p>ColinGreenstreet: Created page with "We are working with several related Transkribus Tools and with our own semantic media wiki [http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Groun..."</p>
<hr />
<div>We are working with several related Transkribus Tools and with our own semantic media wiki<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
----<br />
===Using Transkribus Expert Client===<br />
<br />
We are using Transkribus Expert Client as our main tool to perform automated layout recognition, manual correction of these layouts, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.<br />
<br />
We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.<br />
<br />
----<br />
===Using Transkribus Lite version 2.0===<br />
<br />
[[File:Transkribus Lite HCA1372 Done Thumbnails 03032022.png|750px|thumb|left|Transkribus Lite Version 2.0 Thumbnail display showing images with status "Done" in our work process]]<br />
<br />
<br />
Transkribus has recently introduced an improved version of its web browser interface ([https://transkribus.eu/lite/ Transkribus Lite Version 2.0]). <br />
<br />
Transkribus has a [https://readcoop.eu/transkribus/howto/getting-started-with-transkribus-lite/ useful online guide to using Transkribus Lite Version 2.0].<br />
<br />
We are finding this improved browser interface to be pretty responsive in terms of short lag times as we browse images.<br />
<br />
The browser interface also has useful functionality not available in Transkribus Expert Client.<br />
<br />
Most useful to date are:<br />
<br />
Large thumbnails<br />
Ability to display thumbnails by status of manuscript pages within our work process<br />
----<br />
<br />
===Marine Lives wiki===<br />
<br />
The [http://www.marinelives.org/wiki/MarineLives Marine Lives wiki] is a Semantic Media Wiki. It is organised into volumes and pages.<br />
<br />
We are working from volume [http://www.marinelives.org/wiki/HCA_13/72 HCA 13/72] and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130930Running List of Questions2022-03-03T21:37:49Z<p>ColinGreenstreet: </p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[http://www.marinelives.org/wiki/Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which we for convenience call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically these are pictograms) referred to in the main body. <br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we, as an alternative, start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
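<br />
The numbering rules listed above could be prototyped outside Transkribus before deciding whether they are worth pursuing. The following is only a rough sketch in Python, not a Transkribus feature: the <code>Region</code> class, the <code>kind</code> labels, the pixel coordinates and the <code>row_tolerance</code> value are all hypothetical and would need tuning against real HCA 13/72 pages. It puts a page number region first, groups the remaining regions into horizontal bands by their top edge, and reads each band from left to right, which approximates the zig zag described above.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical rectangular Text Region, measured in image pixels."""
    name: str
    kind: str   # e.g. "page_number", "margin", "body" (illustrative labels)
    x: float    # left edge
    y: float    # top edge


def zigzag_order(regions, row_tolerance=50.0):
    """Return regions in reading order: page number first, then band by band."""
    page_numbers = [r for r in regions if r.kind == "page_number"]
    others = sorted((r for r in regions if r.kind != "page_number"),
                    key=lambda r: r.y)

    # Group regions whose top edges lie close to the first region of a band.
    bands = []
    for region in others:
        if bands and abs(region.y - bands[-1][0].y) <= row_tolerance:
            bands[-1].append(region)
        else:
            bands.append([region])

    ordered = list(page_numbers)
    for band in bands:
        # Within a band, read from the left hand column to the right.
        ordered.extend(sorted(band, key=lambda r: r.x))
    return ordered


# Illustrative layout loosely modelled on the three implied columns.
regions = [
    Region("main body", "body", 400, 300),
    Region("cause name", "margin", 50, 200),
    Region("personal front matter", "body", 400, 190),
    Region("page number", "page_number", 900, 20),
    Region("date", "body", 450, 100),
    Region("legal boilerplate", "margin", 50, 900),
    Region("signoff", "body", 600, 880),
]

reading_order = [r.name for r in zigzag_order(regions)]
for number, name in enumerate(reading_order, start=1):
    print(number, name)
```

A band-based ordering like this is deliberately simple. It would not cope with overlapping or irregularly shaped Text Regions, which is one reason Question Two matters for any automatic numbering scheme.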
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy? <br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130929Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T21:37:26Z<p>ColinGreenstreet: </p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
<br />
__TOC__<br />
<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).<br />
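<br />
The figures above imply a working assumption of about 500 words per manuscript image. A short Python sketch makes the arithmetic explicit and shows how small each Ground Truth is relative to the 80,000 image collection. The function name <code>summarise</code> is ours, introduced purely for illustration; the word and image counts are the rough estimates quoted above.

```python
TOTAL_IMAGES = 80_000  # size of the HCA deposition image collection


def summarise(words, images, total_images=TOTAL_IMAGES):
    """Return (words per image, percentage of the collection covered)."""
    words_per_image = words / images
    share_percent = images * 100 / total_images
    return words_per_image, share_percent


ground_truths = {
    "first model": (500_000, 1_000),
    "second model": (1_000_000, 2_000),
}

for label, (words, images) in ground_truths.items():
    per_image, share = summarise(words, images)
    print(f"{label}: {per_image:.0f} words per image, "
          f"{share:.2f}% of the collection")
```

On these estimates the first Ground Truth covers 1.25% of the images and the second 2.5%.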
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
<br />
==Tools==<br />
----<br />
===Using Transkribus Expert Client===<br />
<br />
We are using Transkribus Expert Client as our main tool to perform automated layout recognition, manual correction of these layouts, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.<br />
<br />
We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.<br />
<br />
----<br />
===Using Transkribus Lite version 2.0===<br />
<br />
[[File:Transkribus Lite HCA1372 Done Thumbnails 03032022.png|750px|thumb|left|Transkribus Lite Version 2.0 Thumbnail display showing images with status "Done" in our work process]]<br />
<br />
<br />
Transkribus has recently introduced an improved version of its web browser interface ([https://transkribus.eu/lite/ Transkribus Lite Version 2.0]). <br />
<br />
Transkribus has a [https://readcoop.eu/transkribus/howto/getting-started-with-transkribus-lite/ useful online guide to using Transkribus Lite Version 2.0].<br />
<br />
We are finding this improved browser interface to be pretty responsive in terms of short lag times as we browse images.<br />
<br />
The browser interface also has useful functionality not available in Transkribus Expert Client.<br />
<br />
Most useful to date are:<br />
<br />
Large thumbnails<br />
Ability to display thumbnails by status of manuscript pages within our work process<br />
----<br />
<br />
===Marine Lives wiki===<br />
<br />
The [http://www.marinelives.org/wiki/MarineLives Marine Lives wiki] is a Semantic Media Wiki. It is organised into volumes and pages.<br />
<br />
We are working from volume [http://www.marinelives.org/wiki/HCA_13/72 HCA 13/72] and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.<br />
<br />
----<br />
==Work Process==<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Automatic_layout_recognition_of_all_1518_images_in_HCA_13.2F72 1 Automatic layout recognition of all 1518 images in HCA 13/72]<br />
<br />
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Input_of_existing_semi-diplomatic_transcriptions_of_HCA_13.2F72_manuscript_pages_into_Transkribus_Expert_Client 2 Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client]<br />
<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to layout analysis and HTR.<br />
<br />
One tool we are trying out is the manual naming of structural elements in the legal depositions which form our corpus. We are using the customizable structural analysis tools available in Transkribus Expert Client, and hope to train a model to recognise these different structural types in our data.<br />
<br />
[[File:Structural Analysis HCA Depositions 03032022.png|750px|thumb|left|Text regions (and other structural elements of text) can be labelled with customizable structural tags within Transkribus Expert Client]]<br />
<br />
----<br />
==QUESTIONS==<br />
<br />
'''We are developing a [http://www.marinelives.org/wiki/Running_List_of_Questions running list of questions.]'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Running_List_of_Questions&diff=130928Running List of Questions2022-03-03T21:36:10Z<p>ColinGreenstreet: Created page with "We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth [BACK TO GROUND TRUTH MAIN PAGE] ---- ==Question One== '''What does a se..."</p>
<hr />
<div>We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth<br />
<br />
[BACK TO GROUND TRUTH MAIN PAGE]<br />
<br />
----<br />
==Question One==<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
==Question Two==<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
==Question Three==<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which we for convenience call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically these are pictograms) referred to in the main body. <br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we, as an alternative, start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
==Question Four==<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy? <br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
==Question Five==<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
==Question Six==<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Ground_Truth_Work_Process&diff=130927Ground Truth Work Process2022-03-03T21:31:40Z<p>ColinGreenstreet: Created page with "'''We have set up a simple work process''' __TOC__ ==Automatic layout recognition of all 1518 images in HCA 13/72== - Used the CITlab Advanced Tool File:CITlab Advanced..."</p>
<hr />
<div>'''We have set up a simple work process'''<br />
<br />
__TOC__<br />
<br />
==Automatic layout recognition of all 1518 images in HCA 13/72==<br />
<br />
- Used the CITlab Advanced Tool<br />
<br />
[[File:CITlab Advanced Tool ML 03032022.png|500px|thumb|left|Layout Analysis controls in Tools section of Transkribus Expert Client controls panel]]<br />
<br />
- Modified the layout page by page after manual inspection of automatically generated layouts<br />
<br />
We are only just beginning to think through what makes sense in terms of use of Text Regions when creating our Ground Truth<br />
We are finding that the automatic tool is typically producing between one and three Text Regions per manuscript image<br />
Typically the tool is NOT identifying text blocks on the left hand side of an image as separate from structurally separate text in the main body of text<br />
Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this<br />
In the short term, we are manually adding Text Regions, and changing the shape and size of Text Regions<br />
However, base lines of text have already been recognised and allocated to specific text regions. <br />
We have found an easy way using Transkribus layout tools to reallocate the base lines [see below]<br />
<br />
- The three key modifications we are making are:<br />
<br />
(a) Adjusting the number, size and shape of Text Regions<br />
(b) Checking all automatically generated base lines (which themselves are "children" of a parent Text Region)<br />
Look for breaks in base lines<br />
Look for incomplete base lines<br />
Connect broken base lines<br />
Extend incomplete base lines<br />
(c) Reallocating base lines to our newly created and/or adjusted Text Regions<br />
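<br />
Checks (b) and (c) above are done by eye and by hand in Transkribus Expert Client, but the logic behind them can be illustrated with a short Python sketch. This is an assumption-laden illustration and not how Transkribus itself works: base lines are represented as plain lists of (x, y) points, Text Regions as hypothetical rectangles given as (left, top, right, bottom), and the <code>max_gap</code> and <code>max_drift</code> thresholds are invented values in pixels.

```python
def midpoint(baseline):
    """Midpoint between the first and last point of a base line."""
    (x1, y1), (x2, y2) = baseline[0], baseline[-1]
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def assign_to_region(baseline, regions):
    """Name of the first region whose rectangle contains the midpoint, else None."""
    mx, my = midpoint(baseline)
    for name, (left, top, right, bottom) in regions.items():
        if left <= mx <= right and top <= my <= bottom:
            return name
    return None


def find_breaks(baselines, max_gap=40, max_drift=10):
    """Pairs (i, j) where base line j starts just after base line i ends,
    at nearly the same height: candidates for connecting into one line."""
    breaks = []
    for i, first in enumerate(baselines):
        for j, second in enumerate(baselines):
            if i == j:
                continue
            gap = second[0][0] - first[-1][0]
            drift = abs(second[0][1] - first[-1][1])
            if 0 <= gap <= max_gap and drift <= max_drift:
                breaks.append((i, j))
    return breaks


# Hypothetical page: a left hand margin region and a main body region.
regions = {"margin": (0, 0, 300, 1000), "main": (300, 0, 1000, 1000)}
baselines = [
    [(320, 100), (600, 102)],   # 0: first half of a broken line
    [(620, 103), (950, 105)],   # 1: second half of the same line
    [(40, 400), (250, 400)],    # 2: marginal note
    [(320, 200), (950, 204)],   # 3: complete line
]

broken_pairs = find_breaks(baselines)
allocation = [assign_to_region(b, regions) for b in baselines]
print("Candidate breaks:", broken_pairs)
print("Region for each base line:", allocation)
```

Allocating by midpoint is one simple choice among several. A base line that straddles two regions would need a different rule, such as the region containing the larger share of its length.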
<br />
[[File:Transkribus Expert Client Layout HCA 1372 f.14v.png|750px|thumb|left|Layout of HCA 13/72 f.14v once we have manually adjusted the Text Regions, creating six Text Regions and reallocating lines to those regions]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions One 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part One]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions Two 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part Two]]<br />
<br />
----<br />
==Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client==<br />
<br />
Once the automatically generated Text Regions have been adjusted for a specific image page<br />
<br />
* Input the semi-diplomatic Marine Lives transcription for the relevant page, matching each line of transcribed text to the correct automatically generated line within the correct Text Region<br />
<br />
* The chart below shows our workflow for manuscript page HCA 13/72 f.11v. <br />
We have the Marine Lives wiki open at the correct page on the left hand side of our screen. <br />
In the middle and on the right hand of our screen we have the Transkribus Expert Client open with the Layout Tab open in Transcription View. <br />
This enables us to see the relevant part of the image, with the relevant Text Region.<br />
We are pasting transcribed text against the correct lines. <br />
To ensure a good human overview, we have pasted two or three lines of transcribed text into each Text Region<br />
This gives us good human oversight of the document.<br />
Then we work methodically through all the text<br />
<br />
[[File:Workflow Page HCA1372f.11v.png|750px|thumb|left|Our workflow showing Marine Lives wiki page and Transkribus Expert Client with Layout Tab open in Transcription View: Part Two]]<br />
<br />
----</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130926Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T20:01:18Z<p>ColinGreenstreet: /* Question Six */</p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
<br />
__TOC__<br />
<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).<br />
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
<br />
==Tools==<br />
----<br />
===Using Transkribus Expert Client===<br />
<br />
We are using Transkribus Expert Client as our main tool to perform automated layout recognition, manual correction of these layouts, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.<br />
<br />
We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.<br />
<br />
----<br />
===Using Transkribus Lite version 2.0===<br />
<br />
[[File:Transkribus Lite HCA1372 Done Thumbnails 03032022.png|750px|thumb|left|Transkribus Lite Version 2.0 Thumbnail display showing images with status "Done" in our work process]]<br />
<br />
<br />
Transkribus has recently introduced an improved version of its web browser interface ([https://transkribus.eu/lite/ Transkribus Lite Version 2.0]). <br />
<br />
Transkribus has a [https://readcoop.eu/transkribus/howto/getting-started-with-transkribus-lite/ useful online guide to using Transkribus Lite Version 2.0].<br />
<br />
We are finding this improved browser interface to be pretty responsive in terms of short lag times as we browse images.<br />
<br />
The browser interface also has useful functionality not available in Transkribus Expert Client.<br />
<br />
Most useful to date are:<br />
<br />
Large thumbnails<br />
Ability to display thumbnails by status of manuscript pages within our work process<br />
----<br />
<br />
===Marine Lives wiki===<br />
<br />
The [http://www.marinelives.org/wiki/MarineLives Marine Lives wiki] is a Semantic Media Wiki. It is organised into volumes and pages.<br />
<br />
We are working from volume [http://www.marinelives.org/wiki/HCA_13/72 HCA 13/72] and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.<br />
<br />
----<br />
<br />
==Work process==<br />
<br />
We have set up a simple work process<br />
<br />
1. Automatic layout recognition of all 1518 images in HCA 13/72<br />
<br />
- Used the CITlab Advanced Tool<br />
<br />
[[File:CITlab Advanced Tool ML 03032022.png|500px|thumb|left|Layout Analysis controls in Tools section of Transkribus Expert Client controls panel]]<br />
<br />
- Modified the layout page by page after manual inspection of automatically generated layouts<br />
<br />
We are only just beginning to think through what makes sense in terms of use of Text Regions when creating our Ground Truth<br />
We are finding that the automatic tool is typically producing between one and three Text Regions per manuscript image<br />
Typically the tool is NOT identifying text blocks on the left hand side of an image as separate from structurally separate text in the main body of text<br />
Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this<br />
In the short term, we are manually adding Text Regions, and changing the shape and size of Text Regions<br />
However, base lines of text have already been recognised and allocated to specific text regions. <br />
We have found an easy way using Transkribus layout tools to reallocate the base lines [see below]<br />
<br />
- The three key modifications we are making are:<br />
<br />
(a) Adjusting the number, size and shape of Text Regions<br />
(b) Checking all automatically generated base lines (which themselves are "children" of a parent Text Region)<br />
Look for breaks in base lines<br />
Look for incomplete base lines<br />
Connect broken base lines<br />
Extend incomplete base lines<br />
(c) Reallocating base lines to our newly created and/or adjusted Text Regions<br />
<br />
[[File:Transkribus Expert Client Layout HCA 1372 f.14v.png|750px|thumb|left|Layout of HCA 13/72 f.14v once we have manually adjusted the Text Regions, creating six Text Regions and reallocating lines to those regions]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions One 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part One]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions Two 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part Two]]<br />
<br />
----<br />
2. Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client<br />
<br />
Once the automatically generated Text Regions have been adjusted for a specific image page<br />
<br />
* Input the semi-diplomatic Marine Lives transcription for the relevant page, matching each line of transcribed text to the correct automatically generated line within the correct Text Region<br />
<br />
* The chart below shows our workflow for manuscript page HCA 13/72 f.11v. <br />
We have the Marine Lives wiki open at the correct page on the left hand side of our screen. <br />
In the middle and on the right hand side of our screen we have the Transkribus Expert Client open with the Layout Tab open in Transcription View. <br />
This enables us to see the relevant part of the image, with the relevant Text Region.<br />
We are pasting transcribed text against the correct lines. <br />
To ensure a good human overview, we have pasted two or three lines of transcribed text into each Text Region<br />
This gives us good human oversight of the document.<br />
Then we work methodically through all the text<br />
<br />
[[File:Workflow Page HCA1372f.11v.png|750px|thumb|left|Our workflow showing Marine Lives wiki page and Transkribus Expert Client with Layout Tab open in Transcription View: Part Two]]<br />
<br />
----<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to layout analysis and HTR.<br />
<br />
One tool we are trying out is the manual naming of structural elements in the legal depositions which form our corpus. We are using the customizable structural analysis tools available in Transkribus Expert Client, and hope to train a model to recognise these different structural types in our data.<br />
<br />
[[File:Structural Analysis HCA Depositions 03032022.png|750px|thumb|left|Text regions (and other structural elements of text) can be labelled with customizable structural tags within Transkribus Expert Client]]<br />
<br />
----<br />
<br />
==QUESTIONS==<br />
<br />
'''We are developing a running list of questions.'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''<br />
<br />
----<br />
===Question One===<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
<br />
===Question Two===<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
===Question Three===<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which we for convenience call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically these are pictograms) referred to in the main body. <br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This can be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
===Question Four===<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy? <br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
===Question Five===<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
===Question Six===<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130925Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T20:00:44Z<p>ColinGreenstreet: /* Question One */</p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
<br />
__TOC__<br />
<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first will be based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second will be based on one million words (roughly 2,000 manuscript images).<br />
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
<br />
==Tools==<br />
----<br />
===Using Transkribus Expert Client===<br />
<br />
We are using Transkribus Expert Client as our main tool to perform automated layout recognition, manual correction of these layouts, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.<br />
<br />
We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.<br />
<br />
----<br />
===Using Transkribus Lite version 2.0===<br />
<br />
[[File:Transkribus Lite HCA1372 Done Thumbnails 03032022.png|750px|thumb|left|Transkribus Lite Version 2.0 Thumbnail display showing images with status "Done" in our work process]]<br />
<br />
<br />
Transkribus has recently introduced an improved version of its web browser interface ([https://transkribus.eu/lite/ Transkribus Lite Version 2.0]). <br />
<br />
Transkribus has a [https://readcoop.eu/transkribus/howto/getting-started-with-transkribus-lite/ useful online guide to using Transkribus Lite Version 2.0].<br />
<br />
We are finding this improved browser interface to be pretty responsive in terms of short lag times as we browse images.<br />
<br />
The browser interface also has useful functionality not available in Transkribus Expert Client.<br />
<br />
Most useful to date are:<br />
<br />
Large thumbnails<br />
Ability to display thumbnails by status of manuscript pages within our work process<br />
----<br />
<br />
===Marine Lives wiki===<br />
<br />
The [http://www.marinelives.org/wiki/MarineLives Marine Lives wiki] is a Semantic Media Wiki. It is organised into volumes and pages.<br />
<br />
We are working from volume [http://www.marinelives.org/wiki/HCA_13/72 HCA 13/72] and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.<br />
<br />
----<br />
<br />
==Work process==<br />
<br />
We have set up a simple work process<br />
<br />
1. Automatic layout recognition of all 1518 images in HCA 13/72<br />
<br />
- Used the CITlab Advanced Tool<br />
<br />
[[File:CITlab Advanced Tool ML 03032022.png|500px|thumb|left|Layout Analysis controls in Tools section of Transkribus Expert Client controls panel]]<br />
<br />
- Modified the layout page by page after manual inspection of automatically generated layouts<br />
<br />
We are only just beginning to think through what makes sense in terms of use of Text Regions when creating our Ground Truth<br />
We are finding that the automatic tool is typically producing between one and three Text Regions per manuscript image<br />
Typically the tool is NOT identifying text blocks on the left hand side of an image as separate from structurally separate text in the main body of text<br />
Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this<br />
In the short term, we are manually adding Text Regions, and changing the shape and size of Text Regions<br />
However, base lines of text have already been recognised and allocated to specific text regions. <br />
We have found an easy way using Transkribus layout tools to reallocate the base lines [see below]<br />
<br />
- The three key modifications we are making are<br />
<br />
(a) Adjusting the number, size and shape of Text Regions<br />
(b) Checking all automatically generated base lines (which themselves are "children" of a parent Text Region)<br />
Look for breaks in base lines<br />
Look for incomplete base lines<br />
Connect broken base lines<br />
Extend incomplete base lines<br />
(c) Reallocating base lines to our newly created and/or adjusted Text Regions<br />
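<br />
In Transkribus Expert Client we do this reallocation by hand in the layout tools. Purely to illustrate the logic of step (c), the sketch below assigns each base line to the Text Region whose bounding box contains the average of the base line's points. The names <code>TextRegion</code> and <code>reallocate_baselines</code>, and the use of simple rectangles, are our own assumptions for illustration and are not part of any Transkribus interface.<br />
<br />
```python
from dataclasses import dataclass, field


@dataclass
class TextRegion:
    """Hypothetical rectangular Text Region (pixel coordinates)."""
    region_id: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    baseline_ids: list = field(default_factory=list)

    def contains(self, x, y):
        # True if the point lies inside (or on the edge of) the rectangle.
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def baseline_centre(points):
    """Average of the (x, y) points that make up a base line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def reallocate_baselines(regions, baselines):
    """Assign each base line to the first region containing its centre.

    baselines maps a base line id to a list of (x, y) points.
    Returns the ids of base lines that fall in no region, so that
    they can be checked manually.
    """
    for region in regions:
        region.baseline_ids.clear()
    unassigned = []
    for baseline_id, points in baselines.items():
        x, y = baseline_centre(points)
        for region in regions:
            if region.contains(x, y):
                region.baseline_ids.append(baseline_id)
                break
        else:
            unassigned.append(baseline_id)
    return unassigned
```
<br />
Base lines returned as unassigned would still need a human decision, which matches our current practice of inspecting every page manually.<br />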
<br />
[[File:Transkribus Expert Client Layout HCA 1372 f.14v.png|750px|thumb|left|Layout of HCA 13/72 f.14v once we have manually adjusted the Text Regions, creating six Text Regions and reallocating lines to those regions]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions One 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part One]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions Two 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part Two]]<br />
<br />
----<br />
2. Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client<br />
<br />
Once the automatically generated Text Regions have been adjusted for a specific image page<br />
<br />
* Input the semi-diplomatic Marine Lives transcription for the relevant page, matching each line of transcribed text to the correct automatically generated line within the correct Text Region<br />
<br />
* The chart below shows our workflow for manuscript page HCA 13/72 f.11v. <br />
We have the Marine Lives wiki open at the correct page on the left hand side of our screen. <br />
In the middle and on the right hand side of our screen we have the Transkribus Expert Client open with the Layout Tab open in Transcription View. <br />
This enables us to see the relevant part of the image, with the relevant Text Region.<br />
We are pasting transcribed text against the correct lines. <br />
To ensure a good human overview, we have pasted two or three lines of transcribed text into each Text Region<br />
This gives us good human oversight of the document.<br />
Then we work methodically through all the text<br />
<br />
[[File:Workflow Page HCA1372f.11v.png|750px|thumb|left|Our workflow showing Marine Lives wiki page and Transkribus Expert Client with Layout Tab open in Transcription View: Part Two]]<br />
<br />
----<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to layout analysis and HTR.<br />
<br />
One tool we are trying out is the manual naming of structural elements in the legal depositions which form our corpus. We are using the customizable structural analysis tools available in Transkribus Expert Client, and hope to train a model to recognise these different structural types in our data.<br />
<br />
[[File:Structural Analysis HCA Depositions 03032022.png|750px|thumb|left|Text regions (and other structural elements of text) can be labelled with customizable structural tags within Transkribus Expert Client]]<br />
<br />
----<br />
<br />
==QUESTIONS==<br />
<br />
'''We are developing a running list of questions.'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''<br />
<br />
----<br />
===Question One===<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
<br />
===Question Two===<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
===Question Three===<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which we for convenience call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically these are pictograms) referred to in the main body. <br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This can be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
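<br />
As a rough illustration of what a bespoke numbering order might mean in practice, the sketch below numbers Text Regions in a zig zag: regions are grouped into horizontal bands by their top edge, and each band is read left to right, following the human reading pattern described above. The function name <code>zigzag_order</code>, the tuple format and the <code>row_tolerance</code> value are assumptions of ours, not Transkribus settings.<br />
<br />
```python
def zigzag_order(regions, row_tolerance=50):
    """Return region names in a zig zag reading order.

    regions is a list of (name, x, y) tuples, where x and y are the
    top left corner of a Text Region in pixels. Regions whose top
    edges lie within row_tolerance pixels of the first region in a
    band are treated as being on the same horizontal level.
    """
    rows = []
    # Work down the page from the top.
    for region in sorted(regions, key=lambda r: r[2]):
        if rows and abs(region[2] - rows[-1][0][2]) <= row_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])

    ordered = []
    for row in rows:
        # Within a band, read from left to right.
        ordered.extend(sorted(row, key=lambda r: (r[1], r[2])))
    return [name for name, _, _ in ordered]


# Hypothetical coordinates for a typical deposition page.
page = [
    ("main body", 300, 260),
    ("date", 300, 20),
    ("cause name", 20, 180),
    ("personal front matter", 300, 170),
    ("signoff", 400, 900),
    ("legal boilerplate", 20, 880),
]
reading_order = zigzag_order(page)
```
<br />
A rule starting from the top right hand corner, as suggested for recto page numbers, would only need the sort within each band to be reversed.<br />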
<br />
Can we as an alternative start by hand defining all the Text Regions we want on a given page and THEN running the automatic layout recognition, with the recognition program identifying only base lines, and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
===Question Four===<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy? <br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
===Question Five===<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
===Question Six===<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Region and base lines?<br />
----</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130924Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T20:00:25Z<p>ColinGreenstreet: /* Question One */</p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
<br />
__TOC__<br />
<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first will be based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second will be based on one million words (roughly 2,000 manuscript images).<br />
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
<br />
==Tools==<br />
----<br />
===Using Transkribus Expert Client===<br />
<br />
We are using Transkribus Expert Client as our main tool to perform automated layout recognition, manual correction of these layouts, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.<br />
<br />
We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.<br />
<br />
----<br />
===Using Transkribus Lite version 2.0===<br />
<br />
[[File:Transkribus Lite HCA1372 Done Thumbnails 03032022.png|750px|thumb|left|Transkribus Lite Version 2.0 Thumbnail display showing images with status "Done" in our work process]]<br />
<br />
<br />
Transkribus has recently introduced an improved version of its web browser interface ([https://transkribus.eu/lite/ Transkribus Lite Version 2.0]). <br />
<br />
Transkribus has a [https://readcoop.eu/transkribus/howto/getting-started-with-transkribus-lite/ useful online guide to using Transkribus Lite Version 2.0].<br />
<br />
We are finding this improved browser interface to be pretty responsive in terms of short lag times as we browse images.<br />
<br />
The browser interface also has useful functionality not available in Transkribus Expert Client.<br />
<br />
Most useful to date are:<br />
<br />
Large thumbnails<br />
Ability to display thumbnails by status of manuscript pages within our work process<br />
----<br />
<br />
===Marine Lives wiki===<br />
<br />
The [http://www.marinelives.org/wiki/MarineLives Marine Lives wiki] is a Semantic Media Wiki. It is organised into volumes and pages.<br />
<br />
We are working from volume [http://www.marinelives.org/wiki/HCA_13/72 HCA 13/72] and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.<br />
<br />
----<br />
<br />
==Work process==<br />
<br />
We have set up a simple work process<br />
<br />
1. Automatic layout recognition of all 1518 images in HCA 13/72<br />
<br />
- Used the CITlab Advanced Tool<br />
<br />
[[File:CITlab Advanced Tool ML 03032022.png|500px|thumb|left|Layout Analysis controls in Tools section of Transkribus Expert Client controls panel]]<br />
<br />
- Modified the layout page by page after manual inspection of automatically generated layouts<br />
<br />
We are only just beginning to think through what makes sense in terms of use of Text Regions when creating our Ground Truth<br />
We are finding that the automatic tool is typically producing between one and three Text Regions per manuscript image<br />
Typically the tool is NOT identifying text blocks on the left hand side of an image as separate from structurally separate text in the main body of text<br />
Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this<br />
In the short term, we are manually adding Text Regions, and changing the shape and size of Text Regions<br />
However, base lines of text have already been recognised and allocated to specific text regions. <br />
We have found an easy way using Transkribus layout tools to reallocate the base lines [see below]<br />
<br />
- The three key modifications we are making are<br />
<br />
(a) Adjusting the number, size and shape of Text Regions<br />
(b) Checking all automatically generated base lines (which themselves are "children" of a parent Text Region)<br />
Look for breaks in base lines<br />
Look for incomplete base lines<br />
Connect broken base lines<br />
Extend incomplete base lines<br />
(c) Reallocating base lines to our newly created and/or adjusted Text Regions<br />
<br />
[[File:Transkribus Expert Client Layout HCA 1372 f.14v.png|750px|thumb|left|Layout of HCA 13/72 f.14v once we have manually adjusted the Text Regions, creating six Text Regions and reallocating lines to those regions]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions One 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part One]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions Two 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part Two]]<br />
<br />
----<br />
2. Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client<br />
<br />
Once the automatically generated Text Regions have been adjusted for a specific image page<br />
<br />
* Input the semi-diplomatic Marine Lives transcription for the relevant page, matching each line of transcribed text to the correct automatically generated line within the correct Text Region<br />
<br />
* The chart below shows our workflow for manuscript page HCA 13/72 f.11v. <br />
We have the Marine Lives wiki open at the correct page on the left hand side of our screen. <br />
In the middle and on the right hand side of our screen we have the Transkribus Expert Client open with the Layout Tab open in Transcription View. <br />
This enables us to see the relevant part of the image, with the relevant Text Region.<br />
We are pasting transcribed text against the correct lines. <br />
To ensure a good human overview, we have pasted two or three lines of transcribed text into each Text Region<br />
This gives us good human oversight of the document.<br />
Then we work methodically through all the text<br />
<br />
[[File:Workflow Page HCA1372f.11v.png|750px|thumb|left|Our workflow showing Marine Lives wiki page and Transkribus Expert Client with Layout Tab open in Transcription View: Part Two]]<br />
<br />
----<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to layout analysis and HTR.<br />
<br />
One tool we are trying out is the manual naming of structural elements in the legal depositions which form our corpus. We are using the customizable structural analysis tools available in Transkribus Expert Client, and hope to train a model to recognise these different structural types in our data.<br />
<br />
[[File:Structural Analysis HCA Depositions 03032022.png|750px|thumb|left|Text regions (and other structural elements of text) can be labelled with customizable structural tags within Transkribus Expert Client]]<br />
<br />
----<br />
<br />
==QUESTIONS==<br />
<br />
'''We are developing a running list of questions.'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''<br />
<br />
----<br />
===Question One===<br />
<br />
'''What does a sensible division of a page into Text Regions look like?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
<br />
===Question Two===<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
<br />
----<br />
===Question Three===<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which we for convenience call the "Personal Front matter" typically runs across the central and right hand column.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column. For example, merchant markes (typically these are pictograms), referred to in the main body. <br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be can be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting at the top left hand side of a page, then moving their eyes to the first block of text on the left and then to the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size, shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
<br />
Can we, as an alternative, start by defining by hand all the Text Regions we want on a given page and THEN run the automatic layout recognition, with the recognition program identifying only base lines and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
===Question Four===<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy? <br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
===Question Five===<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
===Question Six===<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Regions and base lines?<br />
----</div>ColinGreenstreethttp://www.marinelives.org/index.php?title=Marine_Lives_guide_to_creating_a_Transkribus_Ground_Truth&diff=130923Marine Lives guide to creating a Transkribus Ground Truth2022-03-03T19:57:59Z<p>ColinGreenstreet: /* Objective */</p>
<hr />
<div>'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''<br />
<br />
__TOC__<br />
<br />
==Objective==<br />
<br />
[[File:Transkribus Lite Search Bahia 03032022.png|750px|thumb|left|Transkribus web capability has simple, but useful search functionality]]<br />
<br />
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]<br />
<br />
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''<br />
<br />
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''<br />
<br />
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).<br />
<br />
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.<br />
----<br />
<br />
==Tools==<br />
----<br />
===Using Transkribus Expert Client===<br />
<br />
We are using Transkribus Expert Client as our main tool to perform automated layout recognition, manual correction of these layouts, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.<br />
<br />
We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.<br />
<br />
----<br />
===Using Transkribus Lite version 2.0===<br />
<br />
[[File:Transkribus Lite HCA1372 Done Thumbnails 03032022.png|750px|thumb|left|Transkribus Lite Version 2.0 Thumbnail display showing images with status "Done" in our work process]]<br />
<br />
<br />
Transkribus has recently introduced an improved version of its web browser interface ([https://transkribus.eu/lite/ Transkribus Lite Version 2.0]). <br />
<br />
Transkribus has a [https://readcoop.eu/transkribus/howto/getting-started-with-transkribus-lite/ useful online guide to using Transkribus Lite Version 2.0].<br />
<br />
We are finding this improved browser interface to be pretty responsive in terms of short lag times as we browse images.<br />
<br />
The browser interface also has useful functionality not available in Transkribus Expert Client.<br />
<br />
Most useful to date are:<br />
<br />
Large thumbnails<br />
Ability to display thumbnails by status of manuscript pages within our work process<br />
----<br />
<br />
===Marine Lives wiki===<br />
<br />
The [http://www.marinelives.org/wiki/MarineLives Marine Lives wiki] is a Semantic Media Wiki. It is organised into volumes and pages.<br />
<br />
We are working from volume [http://www.marinelives.org/wiki/HCA_13/72 HCA 13/72] and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.<br />
<br />
----<br />
<br />
==Work process==<br />
<br />
We have set up a simple work process:<br />
<br />
1. Automatic layout recognition of all 1518 images in HCA 13/72<br />
<br />
- Used the CITlab Advanced Tool<br />
<br />
[[File:CITlab Advanced Tool ML 03032022.png|500px|thumb|left|Layout Analysis controls in Tools section of Transkribus Expert Client controls panel]]<br />
<br />
- Modified the layout page by page after manual inspection of automatically generated layouts<br />
<br />
We are only just beginning to think through what makes sense in terms of use of Text Regions when creating our Ground Truth<br />
We are finding that the automatic tool is typically producing between one and three Text Regions per manuscript image
Typically the tool is NOT identifying text blocks on the left hand side of an image as separate from structurally separate text in the main body of text<br />
Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this<br />
In the short term, we are manually adding Text Regions, and changing the shape and size of Text Regions<br />
However, base lines of text have already been recognised and allocated to specific text regions. <br />
We have found an easy way using Transkribus layout tools to reallocate the base lines [see below]<br />
<br />
- The three key modifications we are making are:
<br />
(a) Adjusting the number, size and shape of Text Regions
(b) Checking all automatically generated base lines (which themselves are "children" of a parent Text Region)
Look for breaks in base lines<br />
Look for incomplete base lines<br />
Connect broken base lines<br />
Extend incomplete base lines<br />
(c) Reallocating base lines to our newly created and/or adjusted Text Regions<br />
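<br />
The reallocation in step (c) can be pictured as a simple geometric rule. The sketch below is only an illustration in Python and is not how Transkribus itself works: it assumes rectangular Text Regions and attaches each base line to the first region that contains the base line's midpoint. The names <code>Region</code>, <code>midpoint</code> and <code>assign_baselines</code>, and all coordinates, are hypothetical.<br />

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A rectangular Text Region (hypothetical structure, pixel coordinates)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float
    baselines: list = field(default_factory=list)

    def contains(self, point):
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def midpoint(baseline):
    """Midpoint between the first and last point of a base line."""
    (x0, y0), (x1, y1) = baseline[0], baseline[-1]
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def assign_baselines(regions, baselines):
    """Attach each base line to the first region containing its midpoint.

    Returns the base lines that fall inside no region, so that they can be
    checked by hand.
    """
    unassigned = []
    for line in baselines:
        mid = midpoint(line)
        for region in regions:
            if region.contains(mid):
                region.baselines.append(line)
                break
        else:
            unassigned.append(line)
    return unassigned


# Illustrative page: a left hand margin region and a main body region.
margin = Region("left margin", 0, 0, 300, 2000)
body = Region("main body", 300, 0, 1500, 2000)
lines = [
    [(50, 400), (280, 405)],     # short marginal line
    [(320, 400), (1450, 410)],   # main body line
    [(1600, 100), (1700, 100)],  # outside both regions
]
leftover = assign_baselines([margin, body], lines)
```

Base lines returned in <code>leftover</code> are the ones a human would still need to place manually.<br />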
<br />
[[File:Transkribus Expert Client Layout HCA 1372 f.14v.png|750px|thumb|left|Layout of HCA 13/72 f.14v once we have manually adjusted the Text Regions, creating six Text Regions and reallocating lines to those regions]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions One 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part One]]<br />
<br />
[[File:Reallocating Base Lines To New Text Regions Two 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part Two]]<br />
<br />
----<br />
2. Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client<br />
<br />
Once the automatically generated Text Regions have been adjusted for a specific image page<br />
<br />
* Input the semi-diplomatic Marine Lives transcription for the relevant page, matching each line of transcribed text to the correct automatically generated line within the correct Text Region<br />
<br />
* The chart below shows our workflow for manuscript page HCA 13/72 f.11v. <br />
We have the Marine Lives wiki open at the correct page on the left hand side of our screen. <br />
In the middle and on the right hand of our screen we have the Transkribus Expert Client open with the Layout Tab open in Transcription View. <br />
This enables us to see the relevant part of the image, with the relevant Text Region.<br />
We are pasting transcribed text against the correct lines. <br />
To ensure a good human overview, we have pasted two or three lines of transcribed text into each Text Region<br />
This gives us good human oversight of the document.<br />
Then we work methodically through all the text<br />
<br />
[[File:Workflow Page HCA1372f.11v.png|750px|thumb|left|Our workflow showing Marine Lives wiki page and Transkribus Expert Client with Layout Tab open in Transcription View: Part Two]]<br />
<br />
----<br />
==We are experimenting==<br />
<br />
We are experimenting with a range of Transkribus tools related to layout analysis and HTR.<br />
<br />
One tool we are trying out is the manual naming of structural elements in the legal depositions which form our corpus. We are using the customizable structural analysis tools available in Transkribus Expert Client, and hope to train a model to recognise these different structural types in our data.<br />
<br />
[[File:Structural Analysis HCA Depositions 03032022.png|750px|thumb|left|Text regions (and other structural elements of text) can be labelled with customizable structural tags within Transkribus Expert Client]]<br />
<br />
----<br />
<br />
==QUESTIONS==<br />
<br />
'''We are developing a running list of questions.'''<br />
<br />
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''<br />
<br />
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''<br />
<br />
----<br />
===Question One===<br />
<br />
'''Is this a sensible division of this page into Text Regions?'''<br />
<br />
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]<br />
----<br />
<br />
===Question Two===<br />
<br />
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''<br />
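<br />
To see how often the rectangular option would actually produce overlaps on a given page, the regions can be compared pairwise. The following is a minimal sketch in Python, assuming each Text Region is reduced to a bounding rectangle given as (left, top, right, bottom); the function names and coordinates are hypothetical and are not part of Transkribus.<br />

```python
from itertools import combinations


def rectangles_overlap(a, b):
    """True if rectangles a and b, each (left, top, right, bottom), share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(regions):
    """Return the names of every pair of regions whose rectangles overlap.

    regions maps a region name to its (left, top, right, bottom) rectangle.
    """
    return [
        (name_a, name_b)
        for (name_a, rect_a), (name_b, rect_b) in combinations(sorted(regions.items()), 2)
        if rectangles_overlap(rect_a, rect_b)
    ]


# Illustrative coordinates only: the cause name box runs slightly into the front matter box.
page_regions = {
    "cause name": (50, 300, 420, 400),
    "front matter": (400, 310, 1500, 400),
    "main body": (400, 420, 1500, 1450),
}
pairs = overlapping_pairs(page_regions)
```

Rectangles that merely touch along an edge are not counted as overlapping in this sketch.<br />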
<br />
----<br />
===Question Three===<br />
<br />
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''<br />
<br />
<u>Typical structure and variations</u><br />
<br />
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.<br />
<br />
Most HCA deposition image pages have an implied three column structure.<br />
<br />
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which we for convenience call the "Personal Front matter", typically run across the central and right hand columns.<br />
<br />
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page. <br />
<br />
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically pictograms) referred to in the main body. <br />
<br />
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.<br />
<br />
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.<br />
<br />
<u>Human reading of our documents</u><br />
<br />
Human beings read legal depositions by starting at the top left hand side of a page, then moving their eyes to the first block of text on the left and then to the right, in a zig zag.<br />
<br />
<u>Machine reading of layout</u><br />
<br />
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size, shape and amount of text) and to number the Text Regions detected in a bespoke order?<br />
<br />
For example:<br />
<br />
Always look to the top right hand corner of a page, if the page is a recto page<br />
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page<br />
Look for single leads detached from other text blocks and create Text Region<br />
Always create a Text Region if a certain minimum of non-text surrounds the text<br />
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region<br />
Work in a Zig Zag from top RH corner of a page with the Text Regions numbered according to their presence (or not) on that Zig Zag pattern?<br />
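<br />
One way to picture the bespoke numbering is as a sort over the detected regions. The sketch below is an illustration in Python under stated assumptions, not a Transkribus feature: each region is reduced to a name plus the left and top coordinates of its box, regions whose top edges lie within a chosen tolerance are treated as one horizontal band, and each band is read left to right, following the human reading pattern described above. The function name, the tolerance value and the coordinates are all hypothetical.<br />

```python
def reading_order(regions, band_tolerance=50):
    """Order regions top to bottom in bands, left to right within each band.

    regions is a list of (name, left, top) tuples in pixel coordinates.
    A region joins the current band when its top edge is within
    band_tolerance pixels of the top edge of the band's first region.
    """
    bands = []
    for region in sorted(regions, key=lambda r: r[2]):  # sort by top edge
        if bands and abs(region[2] - bands[-1][0][2]) <= band_tolerance:
            bands[-1].append(region)
        else:
            bands.append([region])

    ordered_names = []
    for band in bands:
        for name, _left, _top in sorted(band, key=lambda r: r[1]):  # sort by left edge
            ordered_names.append(name)
    return ordered_names


# Illustrative deposition page using the structural elements described above.
page = [
    ("main body", 400, 420),
    ("cause name", 50, 300),
    ("date", 600, 100),
    ("front matter", 400, 310),
    ("boilerplate", 50, 1500),
    ("signoff", 700, 1480),
    ("merchant mark", 60, 430),
]
order = reading_order(page)
```

Reading each band right to left instead, as a zig zag starting from the top right hand corner would require, only needs the inner sort reversed.<br />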
<br />
Can we, as an alternative, start by defining by hand all the Text Regions we want on a given page and THEN run the automatic layout recognition, with the recognition program identifying only base lines and assigning them to our manually chosen Text Regions?<br />
<br />
----<br />
<br />
===Question Four===<br />
<br />
'''How should we choose the best range of documents to include in our Ground Truth?'''<br />
<br />
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?<br />
<br />
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy? <br />
<br />
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?<br />
<br />
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?<br />
<br />
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?<br />
----<br />
===Question Five===<br />
<br />
'''How should signatures be treated?'''<br />
<br />
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?<br />
----<br />
===Question Six===<br />
<br />
'''How should pictograms containing some text be treated?'''<br />
<br />
English High Court of Admiralty depositions often contain Merchants Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Regions and base lines?<br />
----</div>ColinGreenstreet