Difference between revisions of "Central Kurdish/Transducer"

From LING073
(Final Evaluation of Morphological Generation)
 
(26 intermediate revisions by the same user not shown)
Line 3: Line 3:
 
[https://github.swarthmore.edu/Ling073-sp21/ling073-ckb GitHub Repository]
 
  
== Evaluation ==
+
== Analyser Evaluation ==
  
 
=== Stems ===
 
Line 26: Line 26:
 
=== Coverage ===
 
  
The total coverage over the corpus was 7.3%. After adding three common words for water, earth, and god (all just {{morphTest|ن{{tag|n}}}})
+
The total coverage over the ''ckb.basic'' corpus is 12.02%. This was an increase of 3 percentage points that came after adding three common words for "water", "earth", and "god" (all just {{tag|n}}), two prepositions for "which" and "on" ({{tag|pr}}), and one conjunction for "so" ({{tag|conjcoo}}). The current top unknown words are:
 +
 
 +
* بوو
 +
* هەموو
 +
* با
 +
* ئەمە
 +
* فەرمووی
 +
 
 +
The total coverage over the ''Wikipedia'' corpus is 21.48%.
  
 
=== Tests ===
 
  
The transducer currently passes 70/101 (69%) tests. It seems to do well with noun morphology and most verb morphology. The remaining 31 tests fail for the following reasons:
+
The transducer currently passes 80/101 (79%) tests on the main ''yaml'' file and 3/6 (50%) on the ''commonwords'' file. It seems to do well with noun morphology and most verb morphology.
* There is an issue with some words containing the letter 'ە' that are possibly encoded strangely in Unicode, and it is making some straightforward tests fail.
+
 
* The izafa enclitic was skipped (not implemented). All other grammar points were attempted in some way.
+
== Generator Evaluation ==
* Some verbs, particularly هاتن (to come), have irregular stems and/or different imperative/non-past stems. Because only one lexicon was used for both types of verbs, this could not be accounted for.
+
 
 +
=== Initial Evaluation of Morphological Generation ===
 +
 
 +
Morphological Analysis
 +
* 85 passes, 16 fails, 101 total (84%)
 +
* 21.48% coverage over ''Wikipedia'' corpus
 +
 
 +
Morphological Generation
 +
* 85 passes, 49 fails, 134 total (63%)
 +
 
 +
=== Final Evaluation of Morphological Generation ===
 +
* 85 passes, 33 fails, 118 total (72%)
 +
* Number of ''twol'' tests added: 3
  
 
== Notes ==
 
 +
 +
The remaining 16 morphological analysis tests fail for the following reasons:
 +
* The izafa enclitic and the demonstrative adjectives were skipped (not implemented). All other grammar points were attempted in some way.
 +
* Some verbs, particularly هاتن (to come), have different imperative/non-past stems. Because only one lexicon was used for both types of verbs, this could not be accounted for.
 +
  
 
[[Category: Sp21_Transducers]] [[Category: Central Kurdish]]
 

Latest revision as of 14:48, 21 March 2021

Code

GitHub Repository

Analyser Evaluation

Stems

The number of stems in each lexicon is listed below:

  • 8 N-Stems
  • 4 Definite/Plural
  • 4 Verbs_Inf (infinitives)
  • 4 V-Stems_1
  • 4 V-Stems_2
  • 6 Subject_Prn
  • 4 Imperatives
  • 6 Prns
  • 3 Adj-Stem
  • 2 Comparatives
  • 3 Prepositions
  • 3 Conjunctions
  • 2 Adverbs
  • 2 Npast

Coverage

The total coverage over the ckb.basic corpus is 12.02%. This is an increase of 3 percentage points, gained by adding three common words for "water", "earth", and "god" (all just <n>), two prepositions for "which" and "on" (<pr>), and one conjunction for "so" (<conjcoo>). The current top unknown words are:

  • بوو
  • هەموو
  • با
  • ئەمە
  • فەرمووی

The total coverage over the Wikipedia corpus is 21.48%.
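For readers unfamiliar with how such a figure is obtained, naive coverage is the share of corpus tokens that receive at least one analysis. The following is a minimal sketch, assuming the analyser output is in Apertium stream format, where an unknown token is marked with a leading `*` in its analysis. The function name `naive_coverage` and the sample string are made up for illustration and are not taken from the project repository.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that received at least one analysis."""
    known = 0
    total = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # Unknown tokens look like ^form/*form$, so their analyses start with "/*".
        if analyses and not analyses.startswith("/*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical two-token sample: one analysed noun, one unknown form.
sample = "^known/known<n>$ ^unknown/*unknown$"
print(f"{naive_coverage(sample):.2%}")  # prints 50.00%
```

This counts tokens, not distinct word forms, so a frequent unknown word such as those listed above lowers the figure more than a rare one.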

Tests

The transducer currently passes 80/101 (79%) tests in the main yaml file and 3/6 (50%) in the commonwords file. It appears to handle noun morphology and most verb morphology well.

Generator Evaluation

Initial Evaluation of Morphological Generation

Morphological Analysis

  • 85 passes, 16 fails, 101 total (84%)
  • 21.48% coverage over Wikipedia corpus

Morphological Generation

  • 85 passes, 49 fails, 134 total (63%)

Final Evaluation of Morphological Generation

  • 85 passes, 33 fails, 118 total (72%)
  • Number of twol tests added: 3
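The percentages above follow directly from the pass and fail counts. As a quick check, here is a small sketch that recomputes them; the helper name `pass_rate` is hypothetical, and rounding to the nearest whole percent is an assumption about how the figures were reported.

```python
def pass_rate(passes: int, fails: int) -> tuple[int, int]:
    """Return (total tests, pass rate rounded to the nearest whole percent)."""
    total = passes + fails
    return total, round(100 * passes / total)


# Counts taken from the evaluation figures in this section.
results = {
    "analysis": (85, 16),
    "generation (initial)": (85, 49),
    "generation (final)": (85, 33),
}

for name, (passes, fails) in results.items():
    total, percent = pass_rate(passes, fails)
    print(f"{name}: {passes}/{total} ({percent}%)")
```

The number of passes stays at 85 between the initial and final generation runs, so the rise from 63% to 72% comes from the total dropping from 134 to 118, not from additional passing tests.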

Notes

The remaining 16 morphological analysis tests fail for the following reasons:

  • The izafa enclitic and the demonstrative adjectives were not implemented. All other grammar points were attempted in some way.
  • Some verbs, particularly هاتن (to come), have different imperative/non-past stems. Because a single lexicon was used for both types of verbs, the transducer cannot account for this difference.
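One way to picture the stem problem in the last point is as a lookup in which each verb stores a separate stem per form class, so the form being built selects the stem. The sketch below is only an illustration of that idea in Python, not the transducer's lexc source; the class `VerbEntry`, the function `stem_for`, the tag names, and the uppercase placeholder stems are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbEntry:
    """A verb with one stem per form class (all values here are placeholders)."""
    lemma: str
    past_stem: str
    nonpast_stem: str
    imperative_stem: str


def stem_for(entry: VerbEntry, tag: str) -> str:
    """Pick the stem that matches the requested form class."""
    stems = {
        "past": entry.past_stem,
        "pres": entry.nonpast_stem,
        "imp": entry.imperative_stem,
    }
    if tag not in stems:
        raise ValueError(f"unsupported tag: {tag}")
    return stems[tag]


# Placeholder entry; real stems would come from the language's grammar.
verb = VerbEntry("LEMMA", "PAST_STEM", "NONPAST_STEM", "IMPERATIVE_STEM")
print(stem_for(verb, "imp"))  # prints IMPERATIVE_STEM
```

With a single stem per verb, all three lookups would return the same string, which is the limitation the note describes.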