User:RexxS/GCI-2019-Task08

Lua Task 8 - Name formats (advanced)

Prerequisite: Lua Task 7 - Wikibase client. This task requires a lot of research and independent learning and is considerably more difficult than the introductory seven tasks. You should have successfully and comfortably completed all of the introductory tasks before attempting any of the advanced ones. It is not suitable for beginners to programming, although students new to Lua with previous experience in other programming languages should be able to produce acceptable solutions. Read through the entire task before starting work on it.

An important skill for any programmer is to gather together data in order to test code that they write. In this task you will have to research the topic and assemble your own list of test cases to use in testing the code that you will write. The coding will be relatively simple, but you must ensure that the code is very robust by anticipating all of the possibilities that can be exceptions to the normal conventions before you start coding.

To complete this task you will need to make use of the techniques you learned in the first seven tasks, as well as doing further research on string-handling functions and patterns, and possibly making use of other libraries.

Simple names

On the English Wikipedia and on Wikidata, we sometimes need to parse a person's name into its constituent parts. In English the commonest pattern is Givennames Familyname, where Givennames is a list of one or more given names (sometimes called "first names" or "Christian names"), and Familyname (also known as "surname") is a single word, sometimes hyphenated. Examples:

John Smith
Douglas Noel Adams
Alexander Frederick Douglas-Home

The family names are: "Smith", "Adams" and "Douglas-Home" respectively.
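As a starting point, the simple convention could be sketched in Lua roughly as follows. The function names splitWords and splitSimple are made up for this illustration, and the sketch assumes the name has no suffix and follows the Givennames Familyname pattern.

```lua
-- Split a name into words on whitespace; hyphenated words stay in one piece.
local function splitWords(name)
    local words = {}
    for word in name:gmatch("%S+") do
        words[#words + 1] = word
    end
    return words
end

-- Simplest convention: the last word is the family name and
-- everything before it is the given name(s).
local function splitSimple(name)
    local words = splitWords(name)
    if #words == 0 then
        return "", ""
    end
    local family = table.remove(words)  -- removes and returns the last word
    return table.concat(words, " "), family
end
```

Because the pattern %S+ only breaks on whitespace, "Douglas-Home" is kept as a single family name.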

Sometimes there is a suffix such as "Sr" (senior) or "Jr" (junior) – sometimes "Sr." or "Jr." – or a Roman numeral. Examples:

Loudon Snowden Wainwright Jr
Loudon Snowden Wainwright III

The family name is "Wainwright".
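One possible way to separate a suffix from the rest of the name is sketched below. The SUFFIXES list is only illustrative, stripSuffix is a hypothetical name, and accepting a comma before the suffix is an assumption that goes beyond the examples above.

```lua
-- Known suffixes stored as a set for quick lookup.
-- This list is illustrative, not exhaustive.
local SUFFIXES = {
    Jr = true, ["Jr."] = true, Sr = true, ["Sr."] = true,
    II = true, III = true, IV = true, V = true,
}

-- Returns the name without a trailing suffix, plus the suffix (or nil).
local function stripSuffix(name)
    -- Capture everything before the last word, and the last word itself.
    -- The optional comma before the last word is an assumption.
    local rest, last = name:match("^(.-),?%s+(%S+)%s*$")
    if last and SUFFIXES[last] then
        return rest, last
    end
    return name, nil
end
```

A lookup table is used rather than a pattern for Roman numerals so that a family name that merely looks like a numeral is only affected if it is explicitly listed.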

More complex names

Of course, the English Wikipedia has biographies of people of all nationalities and backgrounds, and they may have different conventions in how a person's name is presented.

If you examine List of Dutch people, you should be able to see that the family names are often multiple words, such as van den Broek, but there is a way of telling that these are family names, not given names.
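One observation that seems to hold for many entries in that list is that the extra words in a compound family name (van, den, de and so on) are written in lower case when they follow a given name. The sketch below relies on that assumption only; splitDutch is a hypothetical name, and the check uses the byte-based %l class, which is enough for unaccented particles.

```lua
-- Treat everything from the first lower-case word onwards as the family name.
-- Assumption: family-name particles start with a lower-case letter or an
-- apostrophe (as in "'t"); if none is found, the last word is the family name.
local function splitDutch(name)
    local words = {}
    for word in name:gmatch("%S+") do
        words[#words + 1] = word
    end
    if #words == 0 then
        return "", ""
    end
    local start = #words              -- default: last word only
    for i = 2, #words do              -- start at 2 to keep one given name
        if words[i]:match("^[%l']") then
            start = i
            break
        end
    end
    local given = table.concat(words, " ", 1, start - 1)
    local family = table.concat(words, " ", start)
    return given, family
end
```

An array of known particles would be an alternative to the character class, at the cost of having to keep the list complete.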

Spanish names usually have two family names, see Spanish naming customs. However, it is also common to only see one family name. Examples:

Penélope Cruz
Penélope Cruz Sánchez
Federico del Sagrado Corazón de Jesús García Lorca

The family names are "Cruz", "Cruz Sánchez" and "García Lorca" respectively. See List of Spaniards for more.
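When the format is already known to be Spanish, a deliberately naive rule reproduces the three examples above: with three or more words, take the last two as family names; otherwise take the last one. This is only a sketch under that assumption (splitSpanish is a hypothetical name); a name with two given names and a single family name would be split wrongly, which is one reason to collect plenty of test cases.

```lua
-- Naive Spanish split: the last two words are family names when the
-- name has at least three words, otherwise only the last word is.
local function splitSpanish(name)
    local words = {}
    for word in name:gmatch("%S+") do
        words[#words + 1] = word
    end
    if #words == 0 then
        return "", ""
    end
    local familyCount = (#words >= 3) and 2 or 1
    local start = #words - familyCount + 1
    local given = table.concat(words, " ", 1, start - 1)
    local family = table.concat(words, " ", start)
    return given, family
end
```

Splitting on whitespace works on UTF-8 text such as "Sánchez" because the space character is a single byte that never occurs inside a multi-byte character.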

Modern Chinese names normally consist of a surname followed by a personal name, but in the 19th century and earlier there could also be a courtesy name. See Chinese name for more details. Examples:

Li Lianjie
Zu Gengzhi (Jingshuo)

The family names are "Li" and "Zu" respectively, and "Jingshuo" was Zu Gengzhi's courtesy name. There are lots of examples to be found by starting at Category:Lists of Chinese people.
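For names already known to be in Chinese order, a sketch might take the first word as the family name and treat text in parentheses as a courtesy name. splitChinese is a hypothetical name, and the sketch assumes the courtesy name is always written in parentheses as in the example above.

```lua
-- Returns given name, family name and courtesy name (or nil).
local function splitChinese(name)
    -- Capture the text inside the first pair of parentheses, if any.
    local courtesy = name:match("%((.-)%)")
    -- Remove any parenthesised part together with the space before it.
    local bare = name:gsub("%s*%b()", "")
    -- The first word is the family name; the remainder is the given name.
    local family, given = bare:match("^%s*(%S+)%s*(.-)%s*$")
    if not family then
        return "", "", courtesy
    end
    return given, family, courtesy
end
```

The %b() item matches a balanced pair of parentheses, so the courtesy name is removed even if it contains spaces.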

Requirements

This task requires you to create your own function which takes text representing a person's name and an optional format parameter. It will output the given name(s) and family name(s), labelled something like "Given = Barack Hussein -- Family = Obama" for the input "Barack Hussein Obama II".
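The overall shape of such a function might look something like the sketch below. The function name parse, the use of the first unnamed parameter for the name, and the placeholder splitting rules are all assumptions made for illustration; suffix handling is deliberately left out.

```lua
local p = {}

-- Trim leading and trailing whitespace.
local function trim(s)
    return s:match("^%s*(.-)%s*$")
end

-- Hypothetical entry point. In Scribunto, frame.args holds the parameters
-- passed to #invoke; here a plain table with an "args" field works too.
function p.parse(frame)
    local name = trim(frame.args[1] or "")
    local format = frame.args.format or ""   -- e.g. "nl", "es", "zh"
    local given, family
    if format == "zh" then
        -- Placeholder rule: family name comes first.
        family, given = name:match("^(%S*)%s*(.-)$")
    else
        -- Placeholder rule: family name is the last word.
        given, family = name:match("^(.-)%s*(%S*)$")
    end
    return "Given = " .. given .. " -- Family = " .. family
end

-- A real Scribunto module would end with: return p
```

With the module sandbox named later in this task, a call from wikitext could then look like {{#invoke:Sandbox/RexxS/Names|parse|Mao Zedong|format=zh}}.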

You need to deal with English, Dutch, Spanish and Chinese names, plus one other type of name. You will collect on average 6 examples of each type, making 30 examples in total. Use the table below as a starting point:

Name formatting
Text	Format	Given name(s) -- Family name(s)
David Trevor Price		Given = David Trevor -- Family = Price
Mao Zedong	zh	Given = Zedong -- Family = Mao

You will replace the red text in the table with the call to your function.

You will test your function against all of the examples you have collected. This is designed to show that you can deal with all of the common variants of name, so make sure that your examples cover a full range of possibilities.
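While developing, it can also help to keep the examples in a Lua table and check them in a loop, alongside the wiki table. The sketch below assumes a function under test that takes the text and the format and returns the formatted string; runTests, exampleCases and the field names are hypothetical.

```lua
-- Each case lists the input text, an optional format and the expected output.
local exampleCases = {
    { text = "David Trevor Price", format = nil,
      expected = "Given = David Trevor -- Family = Price" },
    { text = "Mao Zedong", format = "zh",
      expected = "Given = Zedong -- Family = Mao" },
}

-- Runs every case through parse(text, format) and returns a list of
-- human-readable failure messages (empty when everything passes).
local function runTests(parse, cases)
    local failures = {}
    for i, case in ipairs(cases) do
        -- pcall keeps one crashing case from hiding the others.
        local ok, actual = pcall(parse, case.text, case.format)
        if not ok or actual ~= case.expected then
            failures[#failures + 1] = string.format(
                "case %d (%s): expected %q, got %q",
                i, case.text, case.expected, tostring(actual))
        end
    end
    return failures
end
```

Returning the failure messages instead of printing them leaves it up to the caller whether to show them in the debug console or on a sandbox page.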

Your code may be able to detect the type of name from the five different types that you are working with, so it is possible that you don't need to supply a format parameter. However, if you cannot find a way of doing that for every case, then use a |format= parameter, such as nl, es, zh, etc.
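A guess-then-override approach could be sketched as follows: use the supplied format when there is one, and fall back to simple heuristics otherwise. The two lookup tables are tiny illustrative samples, guessFormat is a hypothetical name, and the code "en" for the default case is an assumption.

```lua
-- A tiny, illustrative sample of Chinese family names.
local CHINESE_SURNAMES = { Li = true, Wang = true, Zhang = true, Mao = true, Zu = true }

-- Illustrative Dutch particles. "de" is left out on purpose because it
-- also appears in Spanish names, which would make the guess ambiguous.
local DUTCH_PARTICLES = { van = true, den = true, der = true, ter = true }

-- Returns the explicit format if given, otherwise a best guess.
local function guessFormat(name, format)
    if format and format ~= "" then
        return format
    end
    local first = name:match("^%s*(%S+)")
    if not first then
        return "en"
    end
    -- Look for a particle after the first word.
    local rest = name:match("^%s*%S+%s+(.*)$") or ""
    for word in rest:gmatch("%S+") do
        if DUTCH_PARTICLES[word] then
            return "nl"
        end
    end
    if CHINESE_SURNAMES[first] then
        return "zh"
    end
    return "en"
end
```

Heuristics like these will misclassify some names, so keeping the explicit parameter as an override lets every test case still be made to pass.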

You must work in a fresh module sandbox and user sandbox. If I were doing the task, I would use Module:Sandbox/RexxS/Names and User:RexxS/Sandbox/Names.

Hints and tips

A. Plan your work in separate parts: (i) working out the format of the name from the text given; (ii) extracting the names from text which may contain suffixes; (iii) differentiating between the given names and family names using a known format.
B. You may find that you can do the third part first, and test it on names without suffixes, if you supply the format for each case. That's a good starting point.
C. Once you have the third part working, you can add on the part that strips the suffix. You'll need an array of known suffixes.
D. What do compound Dutch family names have in common? Is there a character class in a pattern that would match, or are there few enough to use an array?
E. How many Chinese family names can you find? Is it a small enough number to make an array of them?
F. Is there any other way of guessing that a name may be Chinese?
G. Is there a way of telling that a name is using Spanish naming customs? If you can't find an algorithm, don't worry about using format=es.
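The plan in hint A could be laid out as a small pipeline with one splitter function per format, so that new formats can be added without touching the rest. This is only a structural sketch: parseName and splitters are hypothetical names, the suffix list is illustrative, only two splitters are filled in, and part (i) is reduced to using the format supplied by the caller.

```lua
local SUFFIXES = {
    Jr = true, ["Jr."] = true, Sr = true, ["Sr."] = true, II = true, III = true,
}

-- Part (iii): one splitter per format. Each takes an array of words
-- and returns given name(s), family name(s).
local splitters = {
    zh = function(w)
        return table.concat(w, " ", 2), w[1]
    end,
    default = function(w)
        return table.concat(w, " ", 1, #w - 1), w[#w]
    end,
}

local function parseName(name, format)
    local w = {}
    for word in name:gmatch("%S+") do
        w[#w + 1] = word
    end
    if #w == 0 then
        return "", ""
    end
    -- Part (ii): drop a trailing suffix if it is in the known list.
    if #w > 1 and SUFFIXES[w[#w]] then
        table.remove(w)
    end
    -- Part (i): here the format is simply whatever the caller supplied;
    -- unknown or missing formats fall back to the default splitter.
    local split = splitters[format or ""] or splitters.default
    return split(w)
end
```

Keeping the three parts separate means each one can be tested on its own, in the order suggested by hints B and C.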