Tools: Corpus linguistics
Latest revision as of 08:16, January 20, 2018
The MarineLives wiki contains nearly six million words of semi-diplomatically transcribed early and mid-C17th legal and commercial text. This is one of the larger collections of text derived from C17th manuscript sources, and is certainly the largest English-language collection derived from early and mid-C17th legal and commercial manuscript sources.
In the next five years we would like to grow our full text corpus to twenty-five million words, supported by 50,000 images, all derived from primary manuscript material from the period 1627-1677.
The MarineLives project team is keen to explore the corpus linguistic potential of the material, and welcomes approaches from corpus and historical linguists interested in discussing this potential.
Building corpus/historical linguistic & data science thinking into MarineLives planning
Past policy decisions
- Choice of semi-diplomatic transcription retains original C17th orthography, punctuation and capitalisation
- Editorial policy expands contractions and spells out abbreviations
- Considered and rejected the use of TEI encoding when the project started in 2012
Issues for consideration
- Technical and editorial requirements of corpus and historical linguistic practitioners and of data scientists?
- Need for parallel diplomatic and semi-diplomatic texts? (See Folger Institute's Early Modern Manuscripts Online initiative)[1]
- How should our digital image acquisition policies be affected by corpus/historical linguistic and data science potential, if at all?
-- Should it affect the type of documents we are seeking to ingest (e.g. legal depositions; letters)?
-- Should it affect the archives and libraries we are seeking to collaborate with?
-- Should it affect the scale of our imaging and transcription plans over the next five years? (We are seeking to move from ca. 11,600 images to 50,000 images, and ca. 6 million to 25 million full text words)
Potential etymological use
- ASSEVERATION
- "...the sayd Dirick Dobler in discourse with this deponent touching the premisses did assure this deponent with much asseveration that the sayd goods were all free and for Hamburgh and merchants there living onely or to that purpose..." (Jan 1653/54)[2]
- DAMASKILLIAS
- "...sixe peeces of damaskillias or floramides..." (Apr 1657)[3]
- HURRICANO; HURRICANOE; HURRICANE; HURRICAN[4]
- "...a furious hurricano, which continued in very great extremity for about five or six and thirty howres..." (Mar 1655/56)[5]
- "...the sayd shipp under the command of George Boys arlate departed with her lading of sugars from the Barbadoes bound for this Port of London in the month of August last past and that afterwards being upon her Course a violent storme or hurricano..." (Mar 1655/56)[6]
- "...the sayd sugars were not all dry and well conditioned after the sayd hurricano but were damnifiyed (as aforesayd) thereby, howbeit some of them as he conceyveth did remain dry. And saith that the sayd shipp was not tight after the sayd hurrican (sic)..." (Mar 1655/56)[7]
- "...at the arrivall of the said ship at Nevis this deponent found that hurricanoes and stormes which had bin very violent in those parts had blowne downe the tobacco plants indicoes, and Sugar canes there, and spoyled most of them..." (Mar 1659)[8]
- "...the first storme predeposed of was a hurricano which is usually soe violent that few shipps are able to withstand them..." (Jun 1656)[9]
- "[WITHIN THE EAST INDIES]...goeing for another port shee was by violence of the hurricanes cast away and utterly lost..." (Nov 1660)[10]
- MANNAGERIE; MANNAGERY; MANNAGEMENT[11]
- "...the same were wholely left (as beleeveth) to the mannagerie of the foresayd William Warren who disposed of them at the place predeposed and went with the proceede of them to the Canaries..." (Jun 1658)[12]
- "...hee this deponent is a factor and agent to the articulate Antonio Rodrigues Robles here at London and soe hath bin for these five yeares last past or thereabouts and imployed by him in keepeing his accompts and mannagerie of his merchandizeing affayres..." (Jul 1658)[13]
- "...hee was by the sayd Brewer and Crispe intrusted to goe supracargo upon the sayd shipp and voyage, as well for the mannagement of their sayd five eighth parts as for him selfe and the mannagery of the sayd three eighth parts of the sayd shipp soe lett to him the sayd Wood by the sayd Grove..." (Jan 1658/59)[14]
- OFFAYRES; AFFARES; AFFAYRES[15]
- "...sent letters of advice thither to two factors there about his merchandizing offayres under that name..." (Jul 1658)[16]
- "...you cannot expect any acting as to merchandizing affares trading having beene at a stand ffor this 12 month past..." (Jan 1665/66)[17]
- "...intend now to give you a breefe relation what past since in that affare..." (Jan 1665/66)[18]
- RUCKOO
- "[At Brazil] tooke in a sort of ffish called mannettee and dying-stuff called ruckoo..." (Oct 1655)[19]
- STIRK; STOAT[20]; [21]
- "[SHIPPED FROM BERWICK UPON TWEED TO LONDON] [Item] 21 Stephen Jackson master 16 last and 9 barrells of salmon by last hyde and stoe(or a. or r)ks] skinns 52 dry horse skins calfe and stirks 65 skins: 160 sherlinge and lambeskins per certificate 2 Ocober 1657"[22]
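Potential contribution to Ngram analysis
Figure: Twitter exchange with Dr Samuel McLean, social media editor of Global Maritime History, on frequency of terms "Junior" and "Senior" in MarineLives corpus vs. other document types and periods
Figure: Google Ngram Viewer display: "the younger", "the Younger", "the elder", "the Elder", 1600-2000
Figure: Google Ngram Viewer display: "junior", "Junior", "senior", "Senior", 1600-2000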
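One way to make term frequencies from the MarineLives corpus comparable with Ngram-style counts from other corpora is to normalise them per million words. The following is a minimal sketch only: the function name `per_million`, the simple letter-based tokeniser, and the sample sentence are all assumptions made for illustration, and the sample is invented rather than quoted from the corpus.

```python
import re
from collections import Counter


def per_million(text, terms):
    """Return occurrences per million tokens for each term (case-insensitive).

    Hypothetical helper: tokens are runs of the letters a-z after lowercasing,
    which is a deliberate simplification for C17th spelling.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        # Avoid division by zero on empty input.
        return {term: 0.0 for term in terms}
    return {term: counts[term.lower()] * 1_000_000 / total for term in terms}


# Invented sample sentence, not a transcription from the corpus.
sample = "Thomas Browne junior, Thomas Browne senior and William Browne junior"
rates = per_million(sample, ["junior", "senior"])
print(rates)
```

On real data the same function would be applied to the full text of a volume or date range, so that rates for different periods or document types can be set side by side.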
Potential creation of glossaries
- EXAMPLE: Communal C17th Textiles, Garments and Dyestuffs Glossary
Potential ground base for machine learning
- Source of data on C17th handwritten orthographical variation
[ADD DATA]
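As a rough illustration of how orthographical variants such as those listed under "Potential etymological use" might be grouped automatically, the sketch below clusters spellings by string similarity using Python's standard `difflib.SequenceMatcher`. The function `group_variants`, the 0.8 threshold, and the greedy comparison against the first member of each group are assumptions chosen for brevity, not a method used by the project.

```python
from difflib import SequenceMatcher


def group_variants(words, threshold=0.8):
    """Greedily group spellings whose similarity to a group's first member
    is at least `threshold` (hypothetical helper, illustrative only)."""
    groups = []
    for word in words:
        for group in groups:
            # Compare against the first spelling seen for this group.
            if SequenceMatcher(None, word, group[0]).ratio() >= threshold:
                group.append(word)
                break
        else:
            # No existing group was similar enough: start a new one.
            groups.append([word])
    return groups


# Spellings taken from the headwords listed on this page.
variants = [
    "hurricano", "hurricanoe", "hurricane", "hurrican",
    "mannagerie", "mannagery", "mannagement",
    "offayres", "affares", "affayres",
]
groups = group_variants(variants)
for group in groups:
    print(group)
```

A similarity threshold is a blunt instrument: it can separate forms that an editor would treat as one headword, and merge forms that are distinct words, so any output of this kind would need checking against the manuscripts.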
Potential Natural Language Processing use
In 2014, MarineLives collaborated with a team at the University of Mannheim Informatics Department, led by Professor Kai Eckert, to explore the application of Natural Language Processing to the MarineLives corpus.
The output of this collaboration was a paper given in Reykjavik at a workshop on Language Resources and Technologies for Processing and Linking Historical Documents and Archives in association with the LREC Conference, May 2014.[23]
- ↑ Early Modern Manuscripts Online (EMMO), Folgerpedia entry, accessed 14/01/2018; Early Modern Manuscripts Online (EMMO) Beta website, accessed 14/01/2018
- ↑ HCA 13/68 f.555r
- ↑ HCA 13/71 f.574v; Edward A. Roberts, A Comprehensive Etymological Dictionary of the Spanish Language with Families of Words based on Indo-European Roots: Volume I (A-G) (2014), p.481, accessed 10/01/2018
- ↑ 'Hurricane', Memidex, online dictionary, thesaurus and more, accessed 10/01/2018
- ↑ HCA 13/71 f.122v
- ↑ HCA 13/71 f.123r
- ↑ HCA 13/71 f.124r
- ↑ HCA 13/73 f.106v
- ↑ HCA 13/71 f.256v
- ↑ HCA 13/73 f.643v
- ↑ 'Menagery, -ie, Mannagerie, Managary, n.', Dictionary of the Scots Language, Dictionar o the Scots Leid, online web resource, accessed 10/01/2018
- ↑ HCA 13/72 f.377v
- ↑ HCA 13/72 f.386r
- ↑ HCA 13/73 f.1r
- ↑ 'Affair, Affare, n. Also: affayr, affeaire, auffaire. See further Effair n. [Late ME. affayre (Caxton), e.m.E. affaire, OF. afaire. A later form of Affere n.]', in Dictionary of the Scots Language, Dictionar o the Scots Leid, online web resource, accessed 10/01/2018
- ↑ HCA 13/72 f.386r
- ↑ BL, Add. MS. 40, 700, vol. v, ff. 6-7, '3rd January 1665/66, Letter from William Ryder, Bethnall Green, to Sir George Oxenden, Surat'
- ↑ BL, Add. MS. 40, 700, vol. v, ff. 6-7, '3rd January 1665/66, Letter from William Ryder, Bethnall Green, to Sir George Oxenden, Surat'
- ↑ HCA 13/70 f.612r
- ↑ 'Stirk, Styrk, n. Also: stirke, stirc-, stirck, sterk, steirk, stark, strik.', in A Dictionary of the Older Scottish Tongue (up to 1700), online web resource, accessed 12/01/2018
- ↑ 'Stot, Stoit, n.1 Also: stott, stote, stoitt, stoyt, stuit, stoat, stoot. [ME and e.m.E. stott- (1222), stot (Chaucer), stotte (c1450), OE stot(t. Cf. ON stútr bull.]', in A Dictionary of the Older Scottish Tongue (up to 1700), online web resource, accessed 12/01/2018
- ↑ E 190/46/2 f.17v
- ↑ Ritze, Dominique and Zirn, Cäcilia and Greenstreet, Colin and Eckert, Kai and Ponzetto, Simone Paolo (2014) Named Entities in Court: The MarineLives Corpus. In: Language Resources and Technologies for Processing and Linking Historical Documents and Archives - Deploying Linked Open Data in Cultural Heritage Workshop : associated with the LREC 2014 Conference, 26 - 30 May 2014, Reykjavik 2014 Reykjavik, Conference or workshop item, accessed 10/01/2018