Difference between revisions of "Mcostag1/Neapolitan"
Revision as of 20:11, 8 May 2017
Source Code
Work on my Neapolitan transducer can be found in this GitHub repository.
Pre-Evaluation
Corpus
I obtained most of my corpus from the Neapolitan Wiki Site, a Wikipedia written entirely in the Neapolitan dialect. I used a Wiki Extractor to extract just the text of every article on the site. I had to manually strip some superfluous numbers, but eventually ended up with a cleaned corpus of 556092 words. My corpus is the file nap.wiki.current.txt.
One difficulty I had with this corpus is that consonant gemination (raddoppiamento) is written explicitly in the orthography of this wiki site, whereas the resources I have used advise against writing it. Specifically, [1] states: "In genere, tutte le consonanti che formano sillaba con una vocale tonica ne subiscono la forza e si pronunziano raddoppiate: ca, la, ragione (ma non mi sembra corretto scrivere lla, cca, raggione, perche anche in italiano molte consonanti si pronunziano doppie ma si scrivono scempie)" (p. 16). This roughly translates to: "In general, all consonants that form a syllable with a stressed vowel are affected by its force and are pronounced doubled: ca, la, ragione (but it does not seem correct to me to write lla, cca, raggione, because in Italian, too, many consonants are pronounced double but written single)."
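To get a feel for how often written gemination shows up, one could flag tokens that begin with a doubled consonant (for example cchiù and ll'anno, both of which appear in the unknown-word lists on this page). The sketch below is only an illustration: the helper names are hypothetical, and it assumes that a word-initial doubled consonant always reflects raddoppiamento, which will not hold for every token.

```python
VOWELS = set("aeiouàèéìòóù")

def starts_with_geminate(token):
    """True if the token begins with the same consonant letter written twice."""
    if len(token) < 2:
        return False
    first, second = token[0].lower(), token[1].lower()
    return first == second and first.isalpha() and first not in VOWELS

def degeminate(token):
    """Drop the first letter of a word-initial doubled consonant (assumption: it is orthographic gemination)."""
    return token[1:] if starts_with_geminate(token) else token

for tok in ["cchiù", "ll'anno", "ca", "lla"]:
    print(tok, "->", degeminate(tok))
```

Whether to normalize the corpus this way or to list both spellings in the dictionary is a design decision this sketch does not settle.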
Results
I wanted to analyze the results of my coverage prior to adding anything to my monolingual dictionary other than punctuation and numbers. These are the results of my coverage test, when my dictionary only analyzed non-lemma particles.
Number of tokenised words in the corpus: 650726
Coverage: 23.00%
Top unknown words in the corpus:
38311 'e
13821 è
12144 e
10957 a
10498 nu
9050 'o
8194 pruvincia
8035 d''a
8016 comune
7063 'a
4449 ca
4024 abitante
4009 crestiane
3535 d
2968 o
2755 se
2590 gl
2571 de
2373 d'
2319 na
Therefore, 23.00% is my starting corpus coverage. Seeing as this much corpus coverage can be attributed to punctuation and numbers, my goal is to reach 65-70% coverage on the remainder of the corpus.
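For readers unfamiliar with how a coverage figure like this is derived, here is a minimal sketch. It assumes the analyser output is in the Apertium stream format, where each token appears as ^surface/analysis$ and an unknown word is marked with a leading asterisk (^comune/*comune$). The function name, the sample string, and the tags in it are illustrative assumptions, not the script actually used for the numbers on this page.

```python
import re
from collections import Counter

# One lexical unit: ^surface/analysis1/analysis2...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")

def coverage_report(analysed_text, top_n=20):
    """Return (token count, coverage in percent, most frequent unknown words)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if analyses.startswith("*"):  # the analyser had no entry for this form
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return total, coverage, unknown.most_common(top_n)

# Hypothetical analyser output, for illustration only.
sample = ("^'o/'o<det><def><m><sg>$ ^comune/*comune$ ^e/*e$ "
          "^'a/'a<det><def><f><sg>$ ^comune/*comune$")
total, coverage, top = coverage_report(sample)
print(f"Number of tokenised words in the corpus: {total}")
print(f"Coverage: {coverage:.2f}%")
for word, count in top:
    print(count, word)
```

Because coverage is counted per token, adding a handful of very frequent function words moves the number far more than adding many rare stems.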
Increasing coverage over corpus
In this section, I track how my corpus coverage evolves as I add more lemmas. After adding all indefinite and definite articles, my corpus coverage was as follows:
Number of tokenised words in the corpus: 650707
Coverage: 36.93%
Top unknown words in the corpus:
12144 e
10957 a
8194 pruvincia
8035 d''a
8016 comune
4449 ca
4024 abitante
4009 crestiane
3535 d
2968 o
2755 se
2590 gl
2571 de
2373 d'
2228 cu
2074 cchiù
1942 la
1889 da
1738 pe
1597 Napule
After adding this list of top 20 unknown words, and a few more, my coverage reached 51.58%.
Number of tokenised words in the corpus: 650666
Coverage: 51.58%
Top unknown words in the corpus:
1545 che
1485 San
1405 pure
1355 di
1273 br
1143 é
1141 AC
1026 ô
997 pe'
988 comme
983 Muorte
867 in
835 ll'anno
813 e'
771 nun
759 '
712 ma
698 parte
679 le
676 o'
Post-Evaluation
Corpus Coverage
I added a significant number of additional stems, but my coverage has only increased slightly. My goal at the beginning was to reach around 40-50% coverage over my corpus. I hope to spend a bit more time before Wednesday raising my coverage to 60%, which will move me into Apertium's "development" phase for transducers, but for the purposes of this assignment I have met my goal.
Number of tokenised words in the corpus: 647477
Coverage: 56.17%
Top unknown words in the corpus:
626 ne
610 sta
554 fine
540 cità
524 ce
521 ë
489 ha
481 fa
465 int'ô
461 juorno
407 secunno
402 munno
394 l'anne
391 addò
386 so'
385 int'
385 calannario
382 a'
380 greguriano
380 tutte
Precision/Recall
After randomly generating and analyzing 500 stems:
Precision:
Recall:
As noted above, the recall measurement for my transducer over my corpus is very low. Currently, I have ~56% coverage over my corpus, but this comes primarily from frequently seen prepositions, pronouns, and verbs.
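For reference, the two measures can be computed from a set of analyses produced by the transducer and a hand-checked gold set. The sketch below uses the standard definitions (precision = correct / produced, recall = correct / expected); the function name, the tag strings, and the toy data are hypothetical and are not results from this evaluation.

```python
def precision_recall(predicted, gold):
    """predicted, gold: sets of (surface form, analysis) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data for illustration only; the tags are made up.
gold = {
    ("'a", "det.def.f.sg"),
    ("'o", "det.def.m.sg"),
    ("juorno", "n.m.sg"),
    ("munno", "n.m.sg"),
}
predicted = {
    ("'a", "det.def.f.sg"),
    ("'o", "det.def.m.sg"),
    ("juorno", "n.f.sg"),  # wrong analysis: lowers precision
}                          # "munno" is missing: lowers recall

p, r = precision_recall(predicted, gold)
print(f"Precision: {p:.2%}")
print(f"Recall: {r:.2%}")
```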
Grammar Documentation
Articles
(Italian equivalent noted in parentheses)
Definite articles: 'a (la), 'o (lo), 'e/'i (le/li).
Indefinite articles: 'na (una), 'no/'nu (uno)
Definite articles never join with a preposition:
- d' 'a (de la)
- d' 'o (de lo)
- d' 'e/d' 'i (de le/de li)
The same pattern holds for the prepositions per, con, and a.
Nouns
Feminine noun formations
m. >> f.
-ié- >> -è- (lieggo >> leggia) (to read)
-i- >> -é- (frisco >> fresca) (cool)
-u- >> -ó- (duje >> doje) (two)
-uó- >> -ò- (luongo >> longa) (long)
Pluralization
Pluralization of singular nouns follows the general principles of Italian.
sg. >> pl
-è- >> -ié- (verme >> vierme) (worm >> worms)
-é- >> -i- (cecere >> cicere) (chickpea >> chickpeas)
-ó- >> -u- (nepote >> nepute)
-ò- >> -uó- (ommo >> uómmene)
-u- >> -ó- (pertuso >> pertóse)
-uó- >> -ò- (uóvo >> ove)
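The first four alternations in the list above can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name is hypothetical, the stressed vowel must be written with its accent in the input (the corpus orthography often omits it), the ending changes (-e, -ene) are not handled, and the two reverse patterns (-u- >> -ó-, -uó- >> -ò-) are left out.

```python
import unicodedata

# Stressed-vowel alternations sg. >> pl., taken from the first four rows of the list.
ALTERNATIONS = {"è": "ié", "é": "i", "ó": "u", "ò": "uó"}

def alternate_stressed_vowel(word):
    """Apply the sg. >> pl. alternation to the first accented vowel, if any."""
    word = unicodedata.normalize("NFC", word)  # make accented letters single code points
    if "ié" in word or "uó" in word:
        return word  # already shows the diphthongized form; leave untouched
    for index, char in enumerate(word):
        if char in ALTERNATIONS:
            return word[:index] + ALTERNATIONS[char] + word[index + 1:]
    return word  # no marked stressed vowel: nothing to do

for singular in ["vèrme", "nepóte", "piattóne"]:
    print(singular, ">>", alternate_stressed_vowel(singular))
```

In a transducer these alternations would more naturally live in the lexicon or in phonological rules; the lookup here only makes the pattern explicit.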
Altered nouns
Augmentative suffixes:
-óne >> -une (piattóne >> piattune) (big plate >> big plates)
-óna >> -óne (cammaróna >> cammaróne)
Pejorative nouns -illo/-èlla:
Diminutive nouns
Compound nouns are plentiful in the Neapolitan dialect, often formed by joining two nouns, a noun and an adjective, or a preposition and a noun.
Pluralization of these compound nouns follows a distinct set of rules.
a) If the second noun is a complement to the first one, pluralize both.
b) If the first noun is a complement to the second one, pluralize the second.
c) If the compound noun consists of an adjective and noun, pluralize both.
d) In any other case, pluralize only a noun that follows a verb or preposition.
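The four rules above can be written as a small decision function. The sketch assumes two-part compounds and, for rule (d), that the noun is the second element (following the verb or preposition); the enum and function names are hypothetical.

```python
from enum import Enum

class CompoundKind(Enum):
    SECOND_COMPLEMENTS_FIRST = "a"   # noun + noun, second is complement of the first
    FIRST_COMPLEMENTS_SECOND = "b"   # noun + noun, first is complement of the second
    ADJECTIVE_AND_NOUN = "c"         # adjective and noun, in either order
    OTHER = "d"                      # e.g. verb + noun, preposition + noun

def parts_to_pluralize(kind):
    """Return (pluralize_first_part, pluralize_second_part)."""
    if kind in (CompoundKind.SECOND_COMPLEMENTS_FIRST, CompoundKind.ADJECTIVE_AND_NOUN):
        return (True, True)
    # Rules (b) and (d): only the second part takes plural marking.
    return (False, True)

for kind in CompoundKind:
    print(kind.value, parts_to_pluralize(kind))
```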
Source for this entire section: Dizionario Dialettale Napoletano [1]
Verbs
As in Italian, Neapolitan verbs take one of three endings in the infinitive: -ire, -ere, or -are. Also as in Italian, Neapolitan is a pro-drop dialect, meaning the subject does not need to be stated explicitly for it to be clear who the subject is. The various conjugations that are present in the Italian language are also present in Neapolitan.
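As a small illustration of the three infinitive classes, a lemma can be sorted into its class by its ending. The function name is hypothetical, and the sketch assumes full (untruncated) infinitives as described above.

```python
def conjugation_class(infinitive):
    """Return '-are', '-ere' or '-ire' for a full infinitive, else None."""
    for ending in ("are", "ere", "ire"):
        if infinitive.endswith(ending):
            return "-" + ending
    return None  # not recognisable as a full infinitive

print("essere", conjugation_class("essere"))
```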
Auxiliary verbs
essere
pres-
[1] Dizionario Dialettale Napoletano - Antonio Altamura 1956