Difference between revisions of "User:Mcostag1/Neapolitan"
Revision as of 22:30, 8 May 2017
Overview
I created a prototype Neapolitan transducer using lttoolbox for my final project. The transducer consists of a monolingual dictionary of approximately 200 stems. Its final coverage is 56.22% over a corpus of 647424 tokenised words.
My approach to this final project was as follows:
- Research on the grammar and morphology of Neapolitan. For this I referenced multiple books and online resources, including:
- Grammatica diacronica del napoletano - Ledgeway, Adam 2009
- Dizionario dialettale napoletano - Altamura, Antonio 1956
- Vocabolario napoletano-italiano - Gaetano Ceraso 1906
From this I noted all key and interesting grammar points in my #Grammar Documentation section. I primarily used resource #2.
- Created a corpus of Neapolitan text. More details can be found in my Corpus Assembly section.
- Bootstrapped a transducer using lttoolbox. With this transducer, I worked on making a monolingual dictionary of the morphological analysis of lemmas in Neapolitan.
- Evaluated my transducer by testing coverage using aq-covtest.
Source Code
Work on my Neapolitan transducer can be found in this GitHub repository.
Corpus Assembly
I obtained a majority of my corpus from the Neapolitan Wiki Site, which is a Wikipedia site written entirely in the Neapolitan dialect. I used a Wiki Extractor to extract just the text of every Wikipedia article on the site. I had to manually strip some superfluous numbers, but eventually ended up with a cleaned corpus of 556092 words. My corpus is the file nap.wiki.current.txt.
One difficulty I had with this corpus is that consonant gemination (or raddoppiamento) is explicitly written into the orthography of this wiki site. One of the resources I used, [1], states: "In genere, tutte le consonanti che formano sillaba con una vocale tonica ne subiscono la forza e si pronunziano raddoppiate: ca, la, ragione (ma non mi sembra corretto scrivere lla, cca, raggione, perche anche in italiano molte consonanti si pronunziano doppie ma si scrivono scempie)" (p. 16). This roughly translates to: "In general, all consonants that form a syllable with a stressed vowel are subject to its force and are pronounced geminated: ca, la, ragione (but it does not seem correct to me to write lla, cca, raggione, because even in Italian many consonants are pronounced geminated but are written single)."
Notes on orthography
Neapolitan is primarily a spoken language. While it has a rich history in literature and especially music, there is no formal orthography. Most written Neapolitan is spelled phonetically using the Latin alphabet with a few diacritics, but phonetic spelling with an alphabet other than the IPA can produce wildly different results. Since my corpus was drawn from one source (the Neapolitan wiki), the orthography was somewhat more uniform, but not by much. At times I would find 4 or 5 ways to spell one word. An example is the word for "more", most commonly spelled cchiù, but also seen as chiù, cchiu, or chiu. I resolved most of these issues by treating the most common spelling as the canonical form and having all other spellings analyzed as that form. This way, only cchiù would be generated, and not all 4 different spellings.
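The idea of analysing every spelling but generating only one can be modelled in a few lines. The sketch below is plain Python, not lttoolbox; in an lttoolbox dictionary this behaviour is normally expressed with direction restrictions on individual entries. The names VARIANTS, analyse, and generable_forms are hypothetical and only illustrate the behaviour described above.

```python
# Hypothetical variant table (not taken from the actual dictionary):
# every attested spelling maps to the most common, canonical spelling.
VARIANTS = {
    "cchiù": "cchiù",
    "chiù": "cchiù",
    "cchiu": "cchiù",
    "chiu": "cchiù",
}


def analyse(surface):
    """Return the canonical form for a known spelling, or None if unknown."""
    return VARIANTS.get(surface)


def generable_forms():
    """Return the spellings that generation can produce (canonical ones only)."""
    return sorted(set(VARIANTS.values()))


for spelling in ["cchiù", "chiù", "cchiu", "chiu"]:
    print(spelling, "->", analyse(spelling))
print("generated:", generable_forms())
```

All four spellings are accepted on input, while the set of forms that can be produced contains only the canonical spelling.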
Pre-Evaluation
Corpus Coverage
I wanted to analyze the results of my coverage prior to adding anything to my monolingual dictionary other than punctuation and numbers. These are the results of my coverage test, when my dictionary only analyzed non-lemma particles.
Number of tokenised words in the corpus: 650726
Coverage: 23.00%
Top unknown words in the corpus:
38311 'e
13821 è
12144 e
10957 a
10498 nu
9050 'o
8194 pruvincia
8035 d''a
8016 comune
7063 'a
4449 ca
4024 abitante
4009 crestiane
3535 d
2968 o
2755 se
2590 gl
2571 de
2373 d'
2319 na
Therefore, 23.00% is my starting corpus coverage. Seeing as this much corpus coverage can be attributed to punctuation and numbers, my goal is to reach 65-70% coverage on the remainder of the corpus.
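For reference, the two quantities in this report (the coverage percentage and the ranked list of unknown words) can be computed as in the minimal Python sketch below. It is not the aq-covtest implementation: it only assumes that coverage means the share of tokens that receive at least one analysis. The function coverage_report and the toy token list are hypothetical.

```python
from collections import Counter


def coverage_report(tokens, known, top_n=20):
    """Return (coverage percentage, most frequent unknown tokens)."""
    # Count every token that has no entry in the set of known forms.
    unknown = Counter(t for t in tokens if t not in known)
    total = len(tokens)
    covered = total - sum(unknown.values())
    pct = 100.0 * covered / total if total else 0.0
    return pct, unknown.most_common(top_n)


# Toy example: only punctuation and numbers are known, as in the starting dictionary.
tokens = ["'e", "nu", ",", "'e", "comune", ".", "1956", "nu", "'e", "."]
known = {",", ".", "1956"}
pct, top = coverage_report(tokens, known, top_n=2)
print(f"Coverage: {pct:.2f}%")
for word, count in top:
    print(count, word)
```

Because the unknown list is sorted by frequency, adding the words at the top of it gives the largest coverage gain per dictionary entry.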
Increasing coverage over corpus
In this section, I keep track of how my corpus coverage evolved as I added more lemmas. After adding all indefinite and definite articles, my corpus coverage was as follows:
Number of tokenised words in the corpus: 650707
Coverage: 36.93%
Top unknown words in the corpus:
12144 e
10957 a
8194 pruvincia
8035 d''a
8016 comune
4449 ca
4024 abitante
4009 crestiane
3535 d
2968 o
2755 se
2590 gl
2571 de
2373 d'
2228 cu
2074 cchiù
1942 la
1889 da
1738 pe
1597 Napule
After adding this list of top 20 unknown words, and a few more, my coverage reached 51.58%.
Number of tokenised words in the corpus: 650666
Coverage: 51.58%
Top unknown words in the corpus:
1545 che
1485 San
1405 pure
1355 di
1273 br
1143 é
1141 AC
1026 ô
997 pe'
988 comme
983 Muorte
867 in
835 ll'anno
813 e'
771 nun
759 '
712 ma
698 parte
679 le
676 o'
Post-Evaluation
Corpus Coverage
I added a significant number of additional stems, but my coverage increased only slightly. Since my only source for my corpus was the Neapolitan Wikipedia site, many of my words are proper nouns that occur only a few times. Adding more of these entries to my transducer would take a long time and would increase my coverage only slightly, but ultimately I think it is the only way to keep increasing the coverage of my corpus. My goal at the beginning was to reach around 40-50% coverage over my corpus. I hope to spend a bit more time before Wednesday raising my coverage to 60%, which will move me into Apertium's "development" phase for transducers, but for the purposes of this assignment I have met my goal.
Final corpus coverage: 56.22%
Number of tokenised words in the corpus: 647424
Coverage: 56.22%
Top unknown words in the corpus:
626 ne
610 sta
554 fine
540 cità
524 ce
521 ë
489 ha
481 fa
465 int'ô
461 juorno
407 secunno
402 munno
394 l'anne
391 addò
385 calannario
385 int'
382 a'
380 tutte
380 greguriano
379 ’e
Precision/Recall
I randomly selected 416 words from my corpus. After hand-annotating these words in nap.annotated.txt and cross-referencing them with corpus.out.txt, I received these precision/recall measurements:
Totals: 13 tp, 0 fp, 0 tn, 436 fn
Precision: 100.00000%
Recall: 2.89532%
As seen above, the recall measurement for my transducer over my corpus is very low. Currently, I have ~56% coverage over my corpus, but this comes primarily from frequently seen prepositions, pronouns, and verbs. My corpus is made up of stripped-down wiki articles from the Neapolitan Wikipedia. Each article is relatively short, which means proper nouns are abundant in the corpus. Most of these proper nouns seem to appear 1-5 times in the corpus, which I did not think warranted adding them to my transducer at this stage, because it would not have meaningfully increased my coverage. In the randomly selected sample for my precision/recall measurements, 124 words, or 30% of the sample, were proper nouns. I believe this contributed to the low recall score.
My precision score is 100% because the analyses of the words that were in my transducer were also correct in corpus.out.txt.
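The precision and recall figures reported in this section follow from the standard definitions, precision = tp / (tp + fp) and recall = tp / (tp + fn). The short Python check below reproduces them from the reported totals; the function name precision_recall is just an illustrative helper, not part of any evaluation script used here.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from raw counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Totals reported in this section: 13 tp, 0 fp, 436 fn.
p, r = precision_recall(tp=13, fp=0, fn=436)
print(f"Precision: {p:.5%}")
print(f"Recall: {r:.5%}")
```

With zero false positives, precision is 100% regardless of how few forms are analysed, which is why recall is the more informative number at this stage.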
Grammar Points
Parts of Speech
This section comes primarily from the
Neapolitan, similar to Italian, has both indefinite and definite articles, in feminine and masculine forms. Neapolitan also has a widely used neuter form.
Articles
(Italian equivalents noted in parentheses)
Definite articles: 'a (la), 'o (lo), 'e/'i (le/li).
Indefinite articles: 'na (una), 'no/'nu (uno)
Definite articles never join with a preposition:
- d' 'a (de la)
- d' 'o (de lo)
- d' 'e/d' 'i (de le/de li)
The same pattern holds for the prepositions per, con, and a.
Nouns
Feminine noun formations
m. >> f.
-ié- >> -è- (lieggo >> leggia) (to read)
-i- >> -é- (frisco >> fresca) (cool)
-u- >> -ó- (duje >> doje) (two)
-uó- >> -ò- (luongo >> longa) (long)
Pluralization
Pluralization of singular nouns follows the general principles of Italian.
sg. >> pl
-è- >> -ié- (verme >> vierme) (worm >> worms)
-é- >> -i- (cecere >> cicere) (chickpea >> chickpeas)
-ó- >> -u- (nepote >> nepute)
-ò- >> -uó- (ommo >> uómmene)
-u- >> -ó- (pertuso >> pertóse)
-uó- >> -ò- (uóvo >> ove)
Altered nouns
Augmentative suffixes:
-óne >> -une (piattóne >> piattune) (big plate >> big plates)
-óna >> -óne (cammaróna >> cammaróne)
Compound nouns are plentiful in the Neapolitan dialect, often formed by joining two nouns, a noun and an adjective, or a preposition and a noun.
Pluralization of these compound nouns follows a distinct set of rules.
a). If the second noun is a complement to the first one, pluralize both.
b). If the first noun is a complement to the second one, pluralize the second.
c). If the compound noun consists of an adjective and noun, pluralize both.
d). In any other case, pluralize only a noun that follows a verb or preposition.
Verbs
Similar to Italian, Neapolitan verbs end in one of three endings in the infinitive form: -ire, -ere, or -are. Also similar to Italian, Neapolitan is a pro-drop dialect, meaning the subject does not need to be explicitly stated for it to be clear who the subject is. The various conjugations that are present in the Italian language are also present.
[1] Dizionario Dialettale Napoletano - Antonio Altamura 1956