User:Mcostag1/Neapolitan

From LING073

Overview

I created a prototype Neapolitan transducer using lttoolbox for my final project. The transducer consists of a monolingual dictionary of approximately 200 stems. Its final coverage is 56.22% over a corpus of 647424 tokenised words.

My approach to this final project was as follows:

  1. Researched the grammar and morphology of Neapolitan. For this I referenced multiple books and online resources, including:
    1. Grammatica diacronica del napoletano - Ledgeway, Adam 2009
    2. Dizionario dialettale napoletano - Altamura, Antonio 1956
    3. Vocabolario napoletano-italiano - Gaetano Ceraso 1906
      From this I noted all key and interesting grammar points in my #Grammar Documentation section. I primarily used resource #2.
  2. Created a corpus of Neapolitan text. More details can be found in my #Corpus Assembly section.
  3. Bootstrapped a transducer using lttoolbox. With this transducer, I worked on making a monolingual dictionary of the morphological analysis of lemmas in Neapolitan.
  4. Evaluated my transducer by testing coverage using aq-covtest.

Source Code

Work on my Neapolitan transducer can be found in this GitHub repository

Corpus Assembly

I obtained the majority of my corpus from the Neapolitan Wiki Site, which is a Wikipedia site written entirely in the Neapolitan dialect. I used a Wiki Extractor to extract just the text of every Wikipedia article on the site. I had to strip some superfluous numbers manually, but eventually ended up with a cleaned corpus of 556092 words. My corpus is the file nap.wiki.current.txt.
One difficulty I had with this corpus is that consonant gemination (or raddoppiamento) is explicitly written into the orthography of this wiki site. The resources I used take a different view; specifically, [1] states: "In genere, tutte le consonanti che formano sillaba con una vocale tonica ne subiscono la forza e si pronunziano raddoppiate: ca, la, ragione (ma non mi sembra corretto scrivere lla, cca, raggione, perché anche in italiano molte consonanti si pronunziano doppie ma si scrivono scempie)" (p. 16). This roughly translates to: "In general, all consonants that form a syllable with a stressed vowel are subject to its force and are pronounced doubled: ca, la, ragione (but it does not seem correct to me to write lla, cca, raggione, because in Italian, too, many consonants are pronounced double but written single)."

Notes on orthography

Neapolitan is primarily a spoken language. While it has a rich history in literature and especially music, there is no formal orthography. Most written Neapolitan is spelled phonetically using the Latin alphabet with a few diacritics, but spelling phonetically with an alphabet other than the IPA can produce wildly different results. Since my corpus was drawn from one source (the Neapolitan wiki), the orthography was somewhat more uniform, but not by much. At times I would find 4 or 5 ways to spell one word. An example is the word for "more", most commonly spelled cchiù, but also seen as chiù, cchiu, or chiu. I resolved most of these issues by treating the most common spelling as the analysis and having all other spellings analyzed as that most common form. This way, only cchiù would be generated, and not all 4 different spellings.
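The handling of spelling variants described above can be pictured as a mapping that is many-to-one for analysis and one-to-one for generation. The following is a toy Python model of that idea only, not the lttoolbox dictionary itself; the names CANONICAL, VARIANTS, analyse and generate are invented for illustration.

```python
from typing import Optional

# Toy model of analysis-only spelling variants.
# All names here are illustrative and do not come from the project.
CANONICAL = "cchiù"
VARIANTS = {"cchiù", "chiù", "cchiu", "chiu"}  # spellings seen in the corpus


def analyse(surface: str) -> Optional[str]:
    """Map any attested spelling to the canonical form; None if unknown."""
    return CANONICAL if surface in VARIANTS else None


def generate(lemma: str) -> Optional[str]:
    """Generation only ever yields the canonical spelling."""
    return CANONICAL if lemma == CANONICAL else None


for spelling in sorted(VARIANTS):
    print(spelling, "->", analyse(spelling))
print("generated:", generate("cchiù"))
```

In an lttoolbox dictionary, this kind of one-way behaviour is usually expressed by restricting the direction of the variant entries (the r="LR" attribute on an entry); whether this project's dictionary does exactly that is an assumption.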

Pre-Evaluation

Corpus Coverage

I wanted to analyze the results of my coverage prior to adding anything to my monolingual dictionary other than punctuation and numbers. These are the results of my coverage test, when my dictionary only analyzed non-lemma particles.

Number of tokenised words in the corpus: 650726 
Coverage: 23.00%
Top unknown words in the corpus:
38311	 'e
13821	 è
12144	 e
10957	 a
10498	 nu
9050	 'o
8194	 pruvincia
8035	 d''a
8016	 comune
7063	 'a
4449	 ca
4024	 abitante
4009	 crestiane
3535	 d
2968	 o
2755	 se
2590	 gl
2571	 de
2373	 d'
2319	 na

Therefore, 23.00% is my starting corpus coverage. Seeing as this much corpus coverage can be attributed to punctuation and numbers, my goal is to reach 65-70% coverage on the remainder of the corpus.
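To make the coverage figure concrete, here is a minimal Python sketch of how a number like this can be derived from analyser output. It assumes the output is in the Apertium stream format, where each token appears as ^surface/analysis$ and an unknown word is marked with a leading asterisk in its analysis. It is not the implementation of aq-covtest, and the function name coverage, the sample string and the tags inside it are made up for illustration.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis/...$
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]*)\$")


def coverage(analysed_text, top_n=20):
    """Return (token count, coverage in percent, most frequent unknown words)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An analysis starting with '*' marks a word the analyser did not know.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    percent = 100.0 * known / total if total else 0.0
    return total, percent, unknown.most_common(top_n)


# Hypothetical analyser output; the tags are placeholders, not the project's tagset.
sample = (
    "^'o/'o<det><def><m><sg>$ ^comune/*comune$ "
    "^'e/'e<det><def><mf><pl>$ ^Napule/*Napule$"
)
total, percent, top_unknown = coverage(sample)
print(f"Number of tokenised words in the corpus: {total}")
print(f"Coverage: {percent:.2f}%")
for word, count in top_unknown:
    print(f"{count}\t {word}")
```

Coverage computed this way only says that a token received some analysis, not that the analysis is correct, which is why precision and recall are measured separately.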

Increasing coverage over corpus

In this section, I will keep track of the evolution of my corpus coverage as I continue to add more lemmas that will help increase my coverage. After adding all indefinite and definite articles, my corpus coverage was as follows:

Number of tokenised words in the corpus: 650707
Coverage: 36.93%
Top unknown words in the corpus:
12144	 e
10957	 a
8194	 pruvincia
8035	 d''a
8016	 comune
4449	 ca
4024	 abitante
4009	 crestiane
3535	 d
2968	 o
2755	 se
2590	 gl
2571	 de
2373	 d'
2228	 cu
2074	 cchiù
1942	 la
1889	 da
1738	 pe
1597	 Napule

After adding the top 20 unknown words listed above, and a few more, my coverage reached 51.58%.

Number of tokenised words in the corpus: 650666
Coverage: 51.58%
Top unknown words in the corpus:
1545	 che
1485	 San
1405	 pure
1355	 di
1273	 br
1143	 é
1141	 AC
1026	 ô
997	 pe'
988	 comme
983	 Muorte
867	 in
835	 ll'anno
813	 e'
771	 nun
759	 '
712	 ma
698	 parte
679	 le
676	 o'

Post-Evaluation

Corpus Coverage

I added significantly more stems, but my coverage has only increased slightly. Since my only source for my corpus was the Neapolitan Wikipedia site, many of the words are proper nouns that occur only a few times. Adding more of these entries to my transducer would take a long time and would only increase my coverage slightly, but ultimately I think it is the only way to keep increasing coverage over my corpus. My goal at the beginning was to reach around 40-50% coverage over my corpus. I hope to spend a bit more time before Wednesday raising my coverage to 60%, which will move me into Apertium's "development" phase for transducers, but for the purposes of this assignment I have met my goal.

Final corpus coverage: 56.22%

Number of tokenised words in the corpus: 647424
Coverage: 56.22%
Top unknown words in the corpus:
626	 ne
610	 sta
554	 fine
540	 cità
524	 ce
521	 ë
489	 ha
481	 fa
465	 int'ô
461	 juorno
407	 secunno
402	 munno
394	 l'anne
391	 addò
385	 calannario
385	 int'
382	 a'
380	 tutte
380	 greguriano
379	 ’e

Precision/Recall

I randomly selected 416 words from my corpus. After hand-annotating these words in nap.annotated.txt and cross-referencing them with corpus.out.txt, I obtained these precision/recall measurements:

Totals: 13 tp, 0 fp, 0 tn, 436 fn
Precision: 100.00000%
Recall: 2.89532%

As seen above, the recall of my transducer over my corpus is very low. Currently, I have ~56% coverage over my corpus, but this comes primarily from frequently seen prepositions, pronouns, and verbs. My corpus is made up of stripped-down wiki articles from the Neapolitan Wikipedia. Each article is relatively short, which means proper nouns are abundant in the corpus. Most of these proper nouns appear only 1-5 times in the corpus, so I did not think they warranted being added to my transducer at this stage, because doing so would barely have increased my coverage. In the random sample I used for my precision/recall measurements, 124 words, or 30%, were proper nouns. I believe this contributed to the low recall score.
My precision score is 100% because the analyses of the words that were in my transducer were also correct in corpus.out.txt.
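For reference, the two percentages follow directly from the reported counts using the standard definitions precision = tp / (tp + fp) and recall = tp / (tp + fn). A minimal Python sketch, with an illustrative function name that is not taken from the evaluation script actually used:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) in percent from raw counts."""
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts taken from the totals reported above.
precision, recall = precision_recall(tp=13, fp=0, fn=436)
print(f"Precision: {precision:.5f}%")
print(f"Recall: {recall:.5f}%")
```

With no false positives, precision is 100% however few analyses are produced, so the recall figure is the more informative of the two here.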

Grammar Points

Parts of Speech

This section comes primarily from the

Neapolitan, like Italian, has both indefinite and definite articles, each with feminine and masculine forms. Neapolitan also has a widely used neuter form.

Articles

(Italian equivalents noted in parentheses)
Definite articles: 'a (la), 'o (lo), 'e/'i (le/li).
Indefinite articles: 'na (una), 'no/'nu (uno)

Definite articles never join with a preposition:

  • d' 'a (de la)
  • d' 'o (de lo)
  • d' 'e/d' 'i (de le/de li)

The same pattern holds for the prepositions per, con, and a.

Nouns

Feminine noun formations

masc. >> fem.
-ié- >> -è- (lieggo >> leggia) (to read)
-i- >> -é- (frisco >> fresca) (cool)
-u- >> -ó- (duje >> doje) (two)
-uó- >> -ò- (luongo >> longa) (long)

Pluralization

Pluralization of singular nouns follows the general principles of Italian.

sg. >> pl
-è- >> -ié- (verme >> vierme) (worm >> worms)
-é- >> -i- (cecere >> cicere) (chickpea >> chickpeas)
-ó- >> -u- (nepote >> nepute)
-ò- >> -uó- (ommo >> uómmene)
-u- >> -ó- (pertuso >> pertóse)
-uó- >> -ò- (uóvo >> ove)

Altered nouns

Augmentative suffixes:
-óne >> -une (piattóne >> piattune) (big plate >> big plates)
-óna >> -óne (cammaróna >> cammaróne)
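The two augmentative rules above are plain suffix substitutions, so they are easy to sketch in code. The following Python snippet is only an illustration of those two rules, not part of the transducer; the names AUGMENTATIVE_PLURALS and pluralize_augmentative are invented here.

```python
import unicodedata
from typing import Optional

# Singular ending -> plural ending, taken from the two rules listed above.
AUGMENTATIVE_PLURALS = (("óne", "une"), ("óna", "óne"))


def pluralize_augmentative(noun: str) -> Optional[str]:
    """Pluralize a noun with an augmentative suffix; None if no rule applies."""
    # Normalize so that accented vowels compare equal however they were typed.
    noun = unicodedata.normalize("NFC", noun)
    for singular, plural in AUGMENTATIVE_PLURALS:
        if noun.endswith(singular):
            return noun[: -len(singular)] + plural
    return None


for word in ("piattóne", "cammaróna"):
    print(word, ">>", pluralize_augmentative(word))
```

The sketch relies on the stressed vowel being written with its accent, as in the examples; in corpus text where the accent is omitted, a rule keyed on the accented suffix would not fire.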

Compound nouns are plentiful in the Neapolitan dialect, often formed by joining two nouns, a noun and an adjective, or a preposition and a noun. Pluralization of these compound nouns follows a distinct set of rules.

a). If the second noun is a complement to the first one, pluralize both.

b). If the first noun is a complement to the second one, pluralize the second.

c). If the compound noun consists of an adjective and noun, pluralize both.

d). In any other case, pluralize only a noun that follows a verb or preposition.

Verbs

As in Italian, Neapolitan verbs end in one of three endings in the infinitive form: -ire, -ere, or -are. Also as in Italian, Neapolitan is a pro-drop dialect, meaning the subject does not need to be stated explicitly for it to be clear who the subject is. The various conjugations that are present in the Italian language are also present in Neapolitan.
  1. Dizionario Dialettale Napoletano - Antonio Altamura 1956