Running List of Questions
Latest revision as of 21:03, March 4, 2022
We are developing a running list of questions as we work on our C17th Secretarial Hand Ground Truth.
Question One
What does a sensible division of a page into Text Regions look like?
Question Two
Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or is it better to keep rectangular shapes, aligned horizontally with the image, and to accept overlapping Text Regions?
Question Three
How can we train the Transkribus automatic layout tools to understand the range of document structures we have?
Typical structure and variations
HCA depositions are typically structured with three implied columns of text, and most HCA deposition image pages show this implied three column structure. Depositions can be as short as a quarter of an image page, or as long as ten image pages.
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front Matter", typically run across the central and right hand columns.
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page.
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchants' marks (typically pictograms) referred to in the main body.
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.
Human reading of our documents
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag pattern.
Machine reading of layout
Can we manually teach or programme an automatic layout recogniser to look for certain types of Text Region (by size and shape and amount of text) and to number the Text Regions detected in a bespoke order?
For example:
Always look to the top right hand corner of a page, if the page is a recto page
If no page number in top right hand corner, look for pencil page number elsewhere at the top of the page
Look for single leads detached from other text blocks and create a Text Region
Always create a Text Region if a certain minimum of non-text surrounds the text
Recognise curly brackets and always place text which has curly brackets on the right of it within a discrete Text Region
Work in a zig zag from the top right hand corner of a page, with the Text Regions numbered according to their presence (or not) on that zig zag pattern?
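As a way of making the numbering idea above concrete, here is a minimal sketch in Python. It is not a Transkribus feature or API. The `Region` class, the `band_height` parameter and the sample coordinates are all hypothetical, and the sketch assumes each Text Region has already been reduced to a bounding box in pixels. Regions whose top edges fall within the same horizontal band are treated as being on the same level, and are then numbered from the right hand side.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical Text Region described by its bounding box in pixels."""
    name: str
    left: int
    top: int
    right: int
    bottom: int


def zigzag_order(regions, band_height, start_right=True):
    """Order regions band by band down the page.

    Regions whose top edges fall in the same horizontal band count as being
    on the same level. Within a band they are visited from the right, or
    from the left if start_right is False.
    """
    def key(region):
        band = region.top // band_height
        horizontal = -region.right if start_right else region.left
        return (band, horizontal)

    return sorted(regions, key=key)


# Invented coordinates, loosely following the three column structure.
regions = [
    Region("cause name", 50, 210, 300, 320),
    Region("date", 400, 40, 700, 90),
    Region("personal front matter", 350, 230, 1100, 330),
    Region("page number", 1000, 30, 1080, 80),
]
for number, region in enumerate(zigzag_order(regions, band_height=200), start=1):
    print(number, region.name)
```

Passing `start_right=False` gives the left-to-right order described under "Human reading of our documents". A fixed band height is a crude assumption and would need tuning against real pages.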
Can we, as an alternative, start by defining by hand all the Text Regions we want on a given page and THEN run the automatic layout recognition, with the recognition program identifying only base lines and assigning them to our manually chosen Text Regions?
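The assignment step in this alternative can be sketched as follows. This is an illustration only, not a description of how Transkribus works. It assumes, hypothetically, that each manually drawn Text Region is available as a named polygon and each detected base line as a list of (x, y) points, and it places a base line in the first region that contains the midpoint of its two end points.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Does this edge cross the horizontal ray running right from the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_baselines(baselines, regions):
    """Map each base line index to the first region containing its midpoint.

    A base line is a list of (x, y) points; a region is a (name, polygon)
    pair. Base lines that fall in no region are mapped to None.
    """
    assignment = {}
    for index, baseline in enumerate(baselines):
        (x_start, y_start), (x_end, y_end) = baseline[0], baseline[-1]
        midpoint = ((x_start + x_end) / 2, (y_start + y_end) / 2)
        assignment[index] = next(
            (name for name, polygon in regions
             if point_in_polygon(midpoint, polygon)),
            None,
        )
    return assignment


# Invented example: two rectangular regions and three base lines.
regions = [
    ("left column", [(0, 0), (300, 0), (300, 1000), (0, 1000)]),
    ("main body", [(320, 0), (1100, 0), (1100, 1000), (320, 1000)]),
]
baselines = [
    [(20, 100), (280, 105)],
    [(340, 100), (700, 102), (1080, 110)],
    [(1200, 50), (1300, 50)],
]
print(assign_baselines(baselines, regions))
```

Because the test works on polygons, the same sketch applies whether the regions are rectangles or irregular shapes. Where regions overlap, the first match wins, which is an arbitrary choice.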
Question Four
How should we choose the best range of documents to include in our Ground Truth?
Should we choose a wide range of handwriting types under the umbrella of "Notarial Secretary Hand"?
Should we choose only well lit, high quality images, or is there an argument to include images in diverse lighting conditions and different degrees of focal accuracy?
How sensitive is the machine learning process (and the subsequent accuracy of an HTR model) to very accurately defined base lines, versus very accurately transcribed text?
How does the machine learning process treat text in the Ground Truth which is either blacked out (no text visible) or struck out (but where the text is still visible)?
Is it important to have plenty of examples of interlineation and marginalia to train an HTR model to be flexible, or do these actually reduce the accuracy of an HTR model?
Question Five
How should signatures be treated?
Should signatures be in their own Text Region, or should they be included in the Text Region which contains the main text with which they are associated?
Question Six
How should pictograms containing some text be treated?
English High Court of Admiralty depositions often contain Merchants' Marks, which are essentially pictograms. These pictograms sometimes include letters, numbers or words. How should these be treated in terms of Text Regions and base lines?
Question Seven
7.1 Is it good practice when manually creating a baseline to have one dot point per letter in the text, so that the number of letters in the text matches the number of points in the baseline?
7.2 What is best practice for joining two partial base lines together to make a combined base line? Is there a way of inserting extra points into the middle of the join, if the join width is quite wide?
7.3 Can additional points be inserted into a baseline, and not just at either end?
7.35 How should we handle "j" with the baseline?
7.4 Does the machine learning process when applied to the Ground Truth make any reference to Text Regions as defined by Text boxes? If so, is it important for the text boxes to be drawn as tightly as possible round the text? To get tight boundaries is it good practice to use irregular shapes, or is it always best to use rectangles aligned horizontally with the page?
7.5 How should we handle baselines for signatures, and indeed signatures generally?
7.51 What about ornate signatures with flourishes? Which part of a signature should be in the baseline?
7.6 How should we handle marks (as signoffs)?
7.61 Marks and initials usually have text like "the marke of", as well as the actual mark or initial(s)
7.7 How should we handle initials (as signoffs)?
7.8 Is there a way to add more "process flow" steps between "In progress" and "Done"?
7.9 Is it possible to use a different HTR-engine for predefined structure types in a document? For example, a signature-HTR, a Legal-Latin-HTR, a secretary-hand-HTR?
7.95 What is good practice in cropping text region boxes, especially where the text lies in curved or dark margins?
7.96 How should we handle dashes at the end of lines, used to fill gaps?
7.97 How should we handle blacked out words versus crossed out words?
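To illustrate what questions 7.2 and 7.3 are asking for, the sketch below joins two partial baselines and adds evenly spaced points across the gap. It works only on plain lists of (x, y) coordinates. Whether the Transkribus editor itself offers such an operation is exactly what remains open here, and the function names and the `max_gap` parameter are hypothetical.

```python
import math


def insert_points(baseline, max_gap):
    """Return a copy of the baseline with evenly spaced points added so that
    no two neighbouring points are further apart than max_gap."""
    if not baseline:
        return []
    result = [baseline[0]]
    for (x1, y1), (x2, y2) in zip(baseline, baseline[1:]):
        distance = math.hypot(x2 - x1, y2 - y1)
        pieces = max(1, math.ceil(distance / max_gap))
        for step in range(1, pieces + 1):
            t = step / pieces
            result.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return result


def join_baselines(first, second, max_gap):
    """Join two partial baselines, filling only the gap between them."""
    bridge = insert_points([first[-1], second[0]], max_gap)
    # bridge starts and ends with points we already have, so drop both ends.
    return first + bridge[1:-1] + second


# Invented example: a gap of 200 pixels filled with one extra point.
first = [(0, 100), (100, 100)]
second = [(300, 100), (400, 100)]
print(join_baselines(first, second, max_gap=100))
```

Calling `insert_points` on a whole baseline adds points anywhere along it, not just at either end, which is the case raised in 7.3.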