Difference between revisions of "User:Mcostag1/Neapolitan"

From LING073

Revision as of 14:53, 7 May 2017

Overview

I created a Neapolitan transducer, still in its development phase, using lttoolbox for my final project. The transducer consists of a monolingual dictionary of approximately ___ stems. The final coverage over my corpus is __%, where ___ tokenised words were found in the corpus.

My approach to this final project was as follows:

  1. Researched the grammar and morphology of Neapolitan. For this I referenced multiple books and online resources, including:
    1. Grammatica diacronica del napoletano - Ledgeway, Adam 2009
    2. Dizionario dialettale napoletano - Altamura, Antonio 1956
    3. Vocabolario napoletano-italiano - Gaetano Ceraso 1906

From this I noted all key and interesting grammar points in my #Grammar Documentation section. I primarily used resource #2.

  2. Created a corpus of Neapolitan text. More details can be found in my # section.
  3. Bootstrapped a transducer using lttoolbox. With this transducer, I worked on making a monolingual dictionary of the morphological analysis of lemmas in Neapolitan.
  4. Evaluated my transducer by testing coverage using aq-covtest.
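
As a rough illustration of what one entry in an lttoolbox monolingual dictionary looks like, the sketch below builds a single `<e>` element in Python. The lemma comes from the unknown-word lists on this page, but the helper `make_entry` and the paradigm name `comune__n` are assumptions made up for this example and are not necessarily the names used in the repository.

```python
import xml.etree.ElementTree as ET

def make_entry(lemma, stem, paradigm):
    """Build one dictionary entry: <e lm=LEMMA><i>STEM</i><par n=PARADIGM/></e>."""
    entry = ET.Element("e", {"lm": lemma})
    ET.SubElement(entry, "i").text = stem         # the invariant part of the word
    ET.SubElement(entry, "par", {"n": paradigm})  # reference to an inflection paradigm
    return entry

# "comune__n" is a hypothetical paradigm name used only for illustration.
entry = make_entry("comune", "comune", "comune__n")
print(ET.tostring(entry, encoding="unicode"))
```

Generating entries this way only makes sense once the paradigms themselves exist in the dictionary; the sketch does not define them.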

Source Code

Work on my Neapolitan transducer can be found in this GitHub repository.

Pre-Evaluation

Corpus

I obtained the majority of my corpus from the Neapolitan Wiki Site, which is a Wikipedia site written entirely in the Neapolitan dialect. I used a Wiki Extractor to extract just the text of every Wikipedia article on the site. I had to manually strip some superfluous numbers, but eventually ended up with a clean corpus of 556092 words. My corpus is the file nap.wiki.current.txt.
One difficulty I had with this corpus is that consonant gemination (or raddoppiamento) is explicitly written into the orthography of this wiki site. One of the resources I used, [1], states: "In genere, tutte le consonanti che formano sillaba con una vocale tonica ne subiscono la forza e si pronunziano raddoppiate: ca, la, ragione (ma non mi sembra corretto scrivere lla, cca, raggione, perché anche in italiano molte consonanti si pronunziano doppie ma si scrivono scempie)" (p. 16). This roughly translates to: "In general, all consonants that form a syllable with a stressed vowel are subject to its force and are pronounced doubled: ca, la, ragione (but it does not seem correct to me to write lla, cca, raggione, because in Italian too many consonants are pronounced double but are written single)."
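
One possible way to work around the written gemination, sketched here as an assumption rather than something the transducer actually does, is to also look up each token with a doubled initial consonant reduced to a single one. The helper name `degeminate_initial` is hypothetical, and the sketch covers only word-initial doubling (lla, cca), not word-internal cases such as raggione.

```python
VOWELS = set("aeiouàèéìòóùô")

def degeminate_initial(token):
    """Return the token with a doubled word-initial consonant reduced to one."""
    if len(token) >= 3 and token[0].lower() == token[1].lower():
        first = token[0].lower()
        # Only collapse letters that are consonants; leave vowels and punctuation alone.
        if first.isalpha() and first not in VOWELS:
            return token[0] + token[2:]
    return token

for word in ["lla", "cca", "raggione", "'a"]:
    print(word, "->", degeminate_initial(word))
```

A normalisation like this would have to be applied with care, since some words may legitimately begin with a doubled consonant in the chosen orthography.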

Results

I wanted to analyze the results of my coverage prior to adding anything to my monolingual dictionary other than punctuation and numbers. These are the results of my coverage test, when my dictionary only analyzed non-lemma particles.

Number of tokenised words in the corpus: 650726 
Coverage: 23.00%
Top unknown words in the corpus:
38311	 'e
13821	 è
12144	 e
10957	 a
10498	 nu
9050	 'o
8194	 pruvincia
8035	 d''a
8016	 comune
7063	 'a
4449	 ca
4024	 abitante
4009	 crestiane
3535	 d
2968	 o
2755	 se
2590	 gl
2571	 de
2373	 d'
2319	 na

Therefore, 23.00% is my starting corpus coverage. Seeing as this much corpus coverage can be attributed to punctuation and numbers, my goal is to reach 65-70% coverage on the remainder of the corpus.
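
For reference, the coverage figure is simply the share of corpus tokens that receive at least one analysis. The sketch below reproduces that arithmetic in Python on a toy token list. It is not how aq-covtest is implemented; the function name `coverage_report` and the `known` set (standing in for the compiled transducer) are assumptions for illustration.

```python
from collections import Counter

def coverage_report(tokens, known, top_n=20):
    """Return (coverage percentage, most frequent unknown tokens)."""
    unknown = Counter(t for t in tokens if t not in known)
    total = len(tokens)
    covered = total - sum(unknown.values())
    pct = 100.0 * covered / total if total else 0.0
    return pct, unknown.most_common(top_n)

# Toy data: only punctuation and numbers are "known", as in the pre-evaluation run.
tokens = ["'e", "è", ".", "1956", "'e"]
known = {".", "1956"}
pct, top_unknown = coverage_report(tokens, known)
print(f"Coverage: {pct:.2f}%")
print(top_unknown)
```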

Increasing coverage over corpus

In this section, I will keep track of the evolution of my corpus coverage as I continue to add more lemmas that will help increase my coverage. After adding all indefinite and definite articles, my corpus coverage was as follows:

Number of tokenised words in the corpus: 650707
Coverage: 36.93%
Top unknown words in the corpus:
12144	 e
10957	 a
8194	 pruvincia
8035	 d''a
8016	 comune
4449	 ca
4024	 abitante
4009	 crestiane
3535	 d
2968	 o
2755	 se
2590	 gl
2571	 de
2373	 d'
2228	 cu
2074	 cchiù
1942	 la
1889	 da
1738	 pe
1597	 Napule

After adding this list of top 20 unknown words, and a few more, my coverage reached 51.58%.

Number of tokenised words in the corpus: 650666
Coverage: 51.58%
Top unknown words in the corpus:
1545	 che
1485	 San
1405	 pure
1355	 di
1273	 br
1143	 é
1141	 AC
1026	 ô
997	 pe'
988	 comme
983	 Muorte
867	 in
835	 ll'anno
813	 e'
771	 nun
759	 '
712	 ma
698	 parte
679	 le
676	 o'
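
The counts in these lists also show how much each unknown word is worth: a word that occurs c times in a corpus of N tokens adds c / N to the coverage once it is analysed. A small sketch of that estimate follows, with a hypothetical function name; the counts are taken from the first unknown-word list above.

```python
def coverage_gain(counts, total_tokens):
    """Percentage points of coverage gained by analysing words with these counts."""
    return 100.0 * sum(counts) / total_tokens

# Counts for 'e, nu and 'o from the first unknown-word list, over 650726 tokens.
gain = coverage_gain([38311, 10498, 9050], 650726)
print(f"{gain:.2f} percentage points")
```

Because the tokenised word count shifts slightly between runs, an estimate like this is only approximate.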

Post-Evaluation

I added a significant number of additional stems, but my coverage has only increased slightly. My goal at the beginning was to reach around 40-50% coverage over my corpus. I hope to spend a bit more time before Wednesday raising my coverage to 60%, which will move me into Apertium's "development" phase for transducers, but for the purposes of this assignment I have met my goal.

Number of tokenised words in the corpus: 647477
Coverage: 56.17%
Top unknown words in the corpus:
626	 ne
610	 sta
554	 fine
540	 cità
524	 ce
521	 ë
489	 ha
481	 fa
465	 int'ô
461	 juorno
407	 secunno
402	 munno
394	 l'anne
391	 addò
386	 so'
385	 int'
385	 calannario
382	 a'
380	 greguriano
380	 tutte

Grammar Documentation

Articles

(Italian equivalents noted in parentheses)
Definite articles: 'a (la), 'o (lo), 'e/'i (le/li).
Indefinite articles: 'na (una), 'no/'nu (uno).

Definite articles never join with a preposition:

  • d' 'a (de la)
  • d' 'o (de lo)
  • d' 'e/d' 'i (de le/de li)

The same pattern holds for the prepositions per, con, and a.

Nouns

Feminine noun formations

masc. >> fem.
-ié- >> -è- (lieggo >> leggia) (to read)
-i- >> -é- (frisco >> fresca) (cool)
-u- >> -ó- (duje >> doje) (two)
-uó- >> -ò- (luongo >> longa) (long)

Pluralization

Pluralization of singular nouns follows the general principles of Italian.

sg. >> pl
-è- >> -ié- (verme >> vierme) (worm >> worms)
-é- >> -i- (cecere >> cicere) (chickpea >> chickpeas)
-ó- >> -u- (nepote >> nepute)
-ò- >> -uó- (ommo >> uómmene)
-u- >> -ó- (pertuso >> pertóse)
-uó- >> -ò- (uóvo >> ove)
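
The first four rows of the table above can be sketched as a lookup on the stressed vowel. This is only an illustration under stated assumptions: the function names are hypothetical, the input must carry an explicit accent on the stressed vowel (the everyday spelling does not mark it), and changes to the ending such as ommo >> uómmene are not handled.

```python
import unicodedata

# Stressed-vowel alternations, singular -> plural (first four rows of the table).
SG_TO_PL = {"è": "ié", "é": "i", "ó": "u", "ò": "uó"}

def strip_accents(text):
    """Remove combining accent marks, e.g. 'viérme' -> 'vierme'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def pluralize_stressed_vowel(word):
    """Replace the first accented (stressed) vowel according to SG_TO_PL."""
    word = unicodedata.normalize("NFC", word)
    for i, ch in enumerate(word):
        if ch in SG_TO_PL:
            return word[:i] + SG_TO_PL[ch] + word[i + 1:]
    return word  # no marked stressed vowel: leave the word unchanged

for singular in ["vèrme", "cécere", "nepóte"]:
    print(singular, ">>", strip_accents(pluralize_stressed_vowel(singular)))
```

In a transducer, alternations like these would normally be encoded in paradigms rather than computed at run time; the function is just a compact way to state the rule.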

Altered nouns

Augmentative suffixes:
-óne >> -une (piattóne >> piattune) (big plate >> big plates)
-óna >> -óne (cammaróna >> cammaróne)

Pejorative nouns -illo/-èlla:

Diminutive nouns

Compound Nouns are plentiful in the Neapolitan dialect, often formed by joining two nouns, a noun and an adjective, or a preposition and noun. Pluralization of these compound nouns follows a distinct set of rules.

a). If the second noun is a complement to the first one, pluralize both.

b). If the first noun is a complement to the second one, pluralize the second.

c). If the compound noun consists of an adjective and noun, pluralize both.

d). Any other case, only pluralize a noun that follows a verb or preposition.
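
Rules a) to d) amount to a small decision table. The sketch below encodes it in Python; the relation labels and the function name are hypothetical, and case d) is read here as "the noun is the second part, following a verb or preposition", which is an interpretation rather than something the source spells out.

```python
# Which parts of a two-part compound are pluralized: (first part, second part).
COMPOUND_PLURAL_RULES = {
    "second_complements_first": (True, True),   # rule a)
    "first_complements_second": (False, True),  # rule b)
    "adjective_and_noun": (True, True),         # rule c)
}

def compound_plural_targets(relation):
    """Return (pluralize_first, pluralize_second) for a compound noun."""
    # Rule d): any other case, assume only the second part (the noun) is pluralized.
    return COMPOUND_PLURAL_RULES.get(relation, (False, True))

print(compound_plural_targets("adjective_and_noun"))
print(compound_plural_targets("verb_and_noun"))
```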


Source for this entire section: Dizionario Dialettale Napoletano [1].

Verbs

As in Italian, Neapolitan verbs end in one of three endings in the infinitive form: -ire, -ere, or -are. Also like Italian, Neapolitan is a pro-drop dialect, meaning the subject does not need to be stated explicitly for it to be clear who the subject is. The various conjugations that are present in the Italian language are also present in Neapolitan.

Auxiliary verbs

essere pres


Category:sp17_FinalProjects
  1. Dizionario Dialettale Napoletano - Antonio Altamura 1956