AP On-line

Genealogy on RISC OS

software we need
First in a series of missing software. This is the software we want - are there any programmers listening? Genealogists and Family Historians are welcome to contact us with ideas to add to the following and programmers should write outlining their problems in producing such software.
Perhaps, out of the dialogue we can produce a program of excellence.
by Tim Powys-Lybbe

The Design of a Genealogy Program

I have been asked to provide an introduction to the design of a genealogy program on RISC OS. Genealogy is in the top few activities on the internet (no prizes for guessing the top one) and RISC OS desperately needs a strong program to cope with the ever increasing expectations of the typical family historian. Might I add that I am no longer a programmer, it gave me headaches, but I would be delighted to assist anyone who wished to do something in this field.


1. What is genealogy?
There are two types of genealogy: one is exploring people and the connections between their families, and the other is exploring people of the same name. The first is also called Family History, the second is a One Name Study. The data to be stored for each is rather different and is used for different purposes, so they tend to have different programs. I'm only discussing programs to hold and portray Family History.

2. What data needs to be stored?
What is a family history program about? Families, obviously. Individuals, too.

A good family history has one major feature that makes it stand out: it tells you where to check its information. So we have to have Source information. Better genealogists like to get hold of documentation that was made in the lifetime of the person or family concerned; there are fewer steps in the information chain and a greater likelihood of its truth. Too much of the genealogy on the internet is plain false and much of it fails to show where the information came from and hence loses its credibility. If the information that someone gives you cannot be verified and if he does not tell you where he found it, then throw it away: it is not worth the paper - or the disc - it is written on.

Different people want to look at their data in different ways and to portray the results differently too. There have to be facilities to customise a Family History program.

We have now summed up the four main data storage types: Family, Individual, Source and Customisation. These may be stored in separate files or they may be combined on one Big File; but logically they are separate sets of data. The first part of the design process must be to decide on the file structures.

At this stage we should add that the system has to be portable. The data must be transferable between programs and environments. The standard way of doing this is by using the GEDCOM standard, invented by the LDS (Latter Day Saints otherwise known as Mormons), who have done so much to assist genealogists the world over. The current GEDCOM version is 5.5 and the definition of it can be found here. It should be studied carefully as it gives lots of information on the sorts of data people want to hold and your database must be designed from the start to handle all the standard GEDCOM data types. At this stage I am fairly convinced that it is not a good idea to construct your files as GEDCOM (text) files; the files can be enormous and GEDCOM has to be held in memory, which can make the project infeasible.

This introduces the size of family history files. They can be enormous. I have heard of people who have assembled an extended family history of 100,000 people; my own file is now 14,400 people of which about 4,500 are direct ancestors, 5,500 are blood relations and the rest are related by marriage. You have to design your system to handle these sorts of numbers. Inevitably this requires that the data is normally held on disc as sets of records and that you only work on one record at a time. Each record has a, usually hidden, serial number and there are index files on the above core data to speed up access. This is second nature to any database designer, of course.
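As a rough illustration of records held on disc with hidden serial numbers and an index, here is a minimal Python sketch. The `RecordStore` class, its one-JSON-record-per-line layout and the in-memory index are all assumptions made for the example, not a recommended file format.

```python
import io
import json

class RecordStore:
    """Append-only store: one JSON record per line, plus an index."""

    def __init__(self, stream):
        self.stream = stream      # any seekable binary file object
        self.index = {}           # serial number -> byte offset
        self.next_serial = 1

    def add(self, data):
        serial = self.next_serial
        self.next_serial += 1
        self.stream.seek(0, io.SEEK_END)
        self.index[serial] = self.stream.tell()
        line = json.dumps({"serial": serial, **data}) + "\n"
        self.stream.write(line.encode("utf-8"))
        return serial

    def get(self, serial):
        # Seek straight to the one record wanted; nothing else is read.
        self.stream.seek(self.index[serial])
        return json.loads(self.stream.readline().decode("utf-8"))

store = RecordStore(io.BytesIO())   # a real program would open a disc file
first = store.add({"name": "Josiah Bloggs"})
second = store.add({"name": "Mary Bloggs"})
```

Only the index needs to stay in memory, so the same approach should scale to files of many thousands of people; a real program would also save the index to disc.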

The Data on Individuals
The core of any family history is the individual. The GEDCOM specification gives a good view of the sort of data that is held about each person. It is worth discussing the structure of the sort of data that needs to be held. There are some events in a person's life that happen to everyone: at the very least they are born and they die. They may go to school(s), to college or university and have a wide variety of jobs and interests. Your program should track these in a structured way. Most of the events happen at a particular date or between a pair of dates; so events have time stamps. In addition there are facts about a person that appear to be timeless, they have blue eyes or green ones, and these facts are easiest to handle as events with a null time-stamp. Then there is a story that you may be able to write about a person; this needs a big chunk of space. Some people, for some users, may generate tens of thousands of words, whilst others have no story attached. Other large amounts of information may need to be stored separately; examples include wills and perhaps your research notes for each person. You may want to permit a field which contains a To Do list and, though perhaps one of the fact type fields should be able to handle this, you need to look at the way this information may need to be extracted. For reporting purposes you will need flags to identify people who belong to the same group: the lawyers, the clerics, the executed (yes, I have over twenty of these among my ancestors), etc; this data has a description and is usually two-valued: Yes or No, though I'm sure that three or more values could be useful in some cases.
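To make the structure concrete, here is a small Python sketch of an individual whose events carry optional dates, with timeless facts treated as events with no time-stamp. The class and field names, and the sample dates, are invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    kind: str                      # e.g. "birth", "occupation", "eye colour"
    value: str = ""
    start: Optional[str] = None    # no start and no end = a timeless fact
    end: Optional[str] = None

    def is_timeless(self):
        return self.start is None and self.end is None

@dataclass
class Individual:
    serial: int
    name: str
    events: list = field(default_factory=list)
    notes: dict = field(default_factory=dict)   # note name -> long text
    flags: dict = field(default_factory=dict)   # flag name -> value

person = Individual(1, "Josiah Bloggs")
person.events.append(Event("birth", "Plymouth", start="1801-03-04"))
person.events.append(Event("eye colour", "blue"))
person.flags["executed"] = "No"
timeless = [e.kind for e in person.events if e.is_timeless()]
```

Keeping flags as name/value pairs rather than fixed Yes/No fields leaves room for the three-or-more-valued flags mentioned above.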

There are some major fields that need to be added for individuals. First there are titles; Capt Josiah Bloggs RN has both a prefix and a suffix title and they need storing separately. Many people change their names or have an alternative name, so include an alternative; you may even treat names as events, with a start and finish date, though this is not common. Finally change dates need to be considered; it is usual to have a last change date for the whole record but it is also possible to store the change date for each field.

The last bit of information for any individual is where they fit into the great scheme of things, their links to the family data and to the source information that you have found. Each person's record must be able to link to these other sets of data. Let's explore what the data is before we talk about the links.

The Data on Families
A family is partly a biological unit and partly a social one. Biologically we each need two parents, one of each sex. Socially there can be any number from (almost) zero for people who are brought up wholly in institutions, through one for the single-parent family, via the complications of adoption and various forms of polygamy. Further the family changes, the parents die or separate. Today we are invited to concede that the social parents can be of the same sex. Your system has to handle all of these. A family is an assembly of individuals who live together, it has to consist of two or more people. The family can have one or more active parents. The family can have no or many children. At different times, the family can be differently composed. It may be that some users of the program wish to track the biological family, making the data structure nice and simple, but others may wish to track different constituents. Each member of a family has a role (of sorts) that starts at a particular time and may end at a later time. The normal roles are biological mother, social mother, biological father, social father, son and daughter, etc; adoptive father and mother are common and variations on this must be allowed for. A child can be a member of one biological family and another, quite separate, social family, as is the case with adoption - which incidentally can be formal or informal and may be permanent or piecemeal. The peculiar event (fact-with-date) normally stored as data for families is any marriage: its date, place and any other relevant details; separation or divorce also has a date and a place. There is also room for a fairly lengthy note about the family: how it came together, what it did together, etc. So the family record can only exist if it has at least two individuals linked to it. If the second individual of two gets deleted, for whatever reason, there should at least be an option to delete the family record. 
It causes havoc to have family records within your system with only one or, worse, no members. At this point you have to decide where to hold significant bits of information on the family. Should the marriage information be held on the individual record or on the family record? The critical thing is that no information is held on two records; good database design requires that each piece of information is stored in one place and one only. For my money the marriage and separation or divorce are uniquely part of a family record, whereas birth (where someone joins a family) and death (where they leave it) are part of an individual's record.
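The family record described here might be sketched in Python as follows. The `Membership` and `Family` classes are hypothetical, and the rule coded into `remove` (a family with fewer than two distinct people is void) is simply the rule argued for above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Membership:
    person: int                   # serial number of the individual
    role: str                     # e.g. "biological mother", "adoptive father"
    start: Optional[str] = None
    end: Optional[str] = None

@dataclass
class Family:
    serial: int
    members: list = field(default_factory=list)
    events: list = field(default_factory=list)   # marriage, divorce, ...

    def remove(self, person):
        """Unlink a person; return True if the family is now void."""
        self.members = [m for m in self.members if m.person != person]
        return len({m.person for m in self.members}) < 2

family = Family(1, [Membership(10, "biological father"),
                    Membership(11, "biological mother"),
                    Membership(12, "daughter")])
void_after_first = family.remove(12)
void_after_second = family.remove(11)
```

The caller can use the returned value to offer, or to perform automatically, the deletion of the family record.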

Notes, the fields where you store your stories or other long accounts
I have seen some systems where the Notes, because they can be so large, are stored entirely separately. This makes it difficult to search them for reporting purposes and is to be deprecated. I think there is no substitute for a file structure that allows hugely expandable fields - and several of them and, as above, a facility to add fields - as many as the user wants. I have five separate Notes fields: General notes, Wills, Monumental Inscriptions, Blazons of coats of arms and Biography; I may think of more as time progresses.

The Data on Sources
The last linked data is the source information: birth and death certificates, books of other people's researches, records of public events, etc. Every piece of information must have a source. It may be feasible to merely show the source as a free-hand piece of text in each individual's record but there are many advantages in having a standard description that does not change. It allows enquiry into how much one type of source is used in your database. It is now customary to have a separate file of sources. The trouble here is the comparative weakness of the current GEDCOM standard. The natural things to record of a source book are, say, its title, the author, the publisher, the publication date, the edition, and the ISBN number; but GEDCOM seems only to handle one "Source" entry. So, if the user adds in a more complex basic description of his source material, then he has immediately lost portability. Your program must warn the user when they are exceeding the current GEDCOM standards by using TAGS that don't exist.

Another issue about source information is how it is used. The best researchers insist that you record the source for every fact that you include. Others may think it sufficient merely to list the sources used for each individual or for each family record. Your program must be able to handle both sets of wishes; I have even seen a facility to put a separate source for every word.

The data stored in the source record has some flexibility between that and some data in the individual or family record. With books, it is usual to include the page number. You do not want a separate source record for every page, so it is usual to hold the (volume and) page number, or other citation information on the individual's or the family's file with a link to the title of the document in the source file. This decision gets even more complicated for birth, etc certificates: what is the source material to hold in the source file? Do you hold the parish in the source file, or is it part of the entry in the individual record? If you hold the parish in the source file, then you have a separate source record for every parish, perhaps multiplying them needlessly. In the end different users will have different policies and you will have to give enough space in each record, source, individual and family, for them to hold what they want.
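One way to picture the split between the source file and the citation detail is the Python sketch below. The `Source` and `Citation` classes, the book title and the page numbers are all made up for the example; here the citations hang off individual facts, which is only one of the policies a user might choose.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    serial: int
    title: str
    author: str = ""

@dataclass(frozen=True)
class Citation:
    source: int         # serial number of the Source record
    detail: str = ""    # volume, page, certificate number, ...

sources = {1: Source(1, "A Hypothetical County History", "A. N. Author")}

# (person serial, fact) -> citations; one Source record serves every page.
citations = {
    (10, "birth"): [Citation(1, "vol. 2, p. 131")],
    (10, "death"): [Citation(1, "vol. 2, p. 140")],
    (11, "birth"): [Citation(1, "vol. 3, p. 12")],
}

# How heavily is each source used across the whole database?
usage = Counter(c.source for cs in citations.values() for c in cs)
```

Because the description of the book lives in one place, counting how often it is used is a one-line enquiry.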

The Data on Customisation
Customisation can be of many things, of the data to be stored, of the ways it is presented on screen, of the ways it is presented on paper, of the chart formats that are used, etc. At this stage some general principles are worth stating, and can only be taken further when we have looked also at screen designs, report designs and chart designs. The first principle is to make everything customisable: if a user only wants to store names as a whole without separating out fore- and sur- names and with no prefix or suffix titles, then let them do that. The method is to provide a default set of data fields which they can add to or subtract from; then let them be able to save different active sets; you may save any data in inactive fields, but it is not normally displayed or entered. The important thing here is to note that your file structure becomes flexible, it does not have a fixed number of fields. The only problem with all this is in portability: if they use fields that do not have a standard GEDCOM tag, then that data cannot be moved across to other systems.
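A minimal Python sketch of named field sets, with a check for fields that have no GEDCOM tag, might look like this. The table and function names are assumptions for the illustration; NAME, BIRT, DEAT and OCCU are tags from the GEDCOM standard, while the "Blazon" field is deliberately left without one.

```python
# Field name -> GEDCOM tag, or None where there is no standard tag.
FIELDS = {
    "Name": "NAME",
    "Birth": "BIRT",
    "Death": "DEAT",
    "Occupation": "OCCU",
    "Blazon": None,
}

# Named sets of active fields, saved so the user can switch between them.
field_sets = {
    "default": ["Name", "Birth", "Death", "Occupation"],
    "heraldry": ["Name", "Birth", "Death", "Blazon"],
}

def non_portable(set_name):
    """Return the active fields that would be lost in a GEDCOM export."""
    return [f for f in field_sets[set_name] if FIELDS.get(f) is None]

warnings = non_portable("heraldry")
```

Changing the supplied defaults then means editing the "default" entry, not the program.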

The customisation system allows sets of choices to be made and stored under different names. The users must be able to choose the data they are holding, the report contents they may make (layouts are slightly different) and the chart contents they may make. Build this in from the start and you'll have a highly flexible system that the skilled users will delight in using. One advantage of a customisation system is that you can change the defaults for the supplied program with no trouble at all: if the bulk of your users want something different, then there is no re-programming needed, just change the defaults.

What about binary information?

Binaries are pictures, sounds, films even. Their major problem is that they take up a load of disc space and can slow things down heavily. But many family historians want to record pictures of their family members, pictures of their writing, some words they have said and even a film of them.

Is this data to be stored as part of the data on each individual and on each family? Or should it be stored in a separate file - or merely directory - and then linked to individuals in the file of individuals or to families in the file of families? This is a question for the system designer.

Then, for each binary stored, the system needs to be told what to do with it. Is it a picture, sound or film? Should it be "displayed" automatically or only on request? How should it be displayed in reports and charts? In charts, people get very excited whether a picture is displayed somewhere within the box for the data or whether it is to be displayed beside the box. Further there needs to be some control on the size of picture, or the volume of sounds, otherwise they will be invisible or swamp everything else.

There are lots of decisions on where to store the binary data; in the core files, or as a separate file system? And data on the use of binaries can be in the core files or in the report or chart files themselves. Finally the binary data needs to be stored in formats that are readily portable between systems - or in a manner whereby automatic translation can take place on export.

Dates are a minefield!
Each country has its own format of dates - or at least the Yanks do M-D-Y while the Brits do D-M-Y. Different people like to show dates differently. Some like to enter months by name or by abbreviated name, others like to enter them solely as a set of numbers. And what do you do about Chinese years? And different year counting systems: Christian, Muslim, etc. Again the rule is flexibility and customisation.
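The usual way through the minefield is to store every date in one fixed internal order and to make both entry and display follow the user's choice. The Python sketch below assumes complete numeric dates in a single calendar; the function names and style labels are invented, and partial dates and other calendars are left out entirely.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def parse_numeric(text, order="DMY"):
    """Parse '4/7/1776' style input using the user's chosen field order."""
    cleaned = text.replace("-", "/").replace(".", "/")
    parts = [int(p) for p in cleaned.split("/")]
    fields = dict(zip(order, parts))
    return (fields["Y"], fields["M"], fields["D"])   # stored order is fixed

def display(date, style="DMY-name"):
    year, month, day = date
    if style == "MDY":
        return f"{month}/{day}/{year}"
    if style == "DMY":
        return f"{day}/{month}/{year}"
    return f"{day} {MONTHS[month - 1]} {year}"

british = parse_numeric("4/7/1776", order="DMY")
american = parse_numeric("7/4/1776", order="MDY")
```

Both entries end up as the same stored value, so sorting and comparing dates never depends on how a particular user likes to type them.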

3. The user interface is critical to the acceptance of the program
There is no substitute in the design of a user interface but to have a look at how other programs do it. Don't do this on your own. Get the views of what appeals to others, to novices, to experienced users and to the real experts. (My problem is that I have been using one program for far too long and take a jaundiced view of any other display).

You have to have a core display; this will be where people start their editing. How much do you display? It is common to have clearly on the screen the children, parents and grandparents of any family. How many children? What data do you show of each? And what do you show of the family data?

What data do you show in this display? The same amount for each of grandparents, parents and children? Or a lot more for the currently displayed parents? Customisable, of course.

What happens when you click on the different parts of the display? Some systems make a click on a child or a grandparent cause a step down or up the family tree; a click on a currently displayed parent leads to the editor for that parent. Other systems make these clicks lead directly to an editor for the individual concerned; they have a different icon to step up and down the family.

How do you show that one person is connected with other families? And how do you allow the display to step between those families?

To make it even more complicated: who do you show for the grandparents, the biological parents or the social (adoptive) parents? How do you handle it when one of the parents or grandparents is not known?

On this display, how do you add a child or a parent or a grandparent? How do you handle it when the person to be added is already on your database but needs linking to the right family? (If you don't believe this happens, try doing some medieval genealogy and you'll find the one person with different names in different source documents and you may not realise they are the same person and enter them separately.) Do you do the linking by just entering the name (not very clever), by entering a unique reference number, or by selection from a browse list? If you use browse list selection, how do you make it fast when the list may contain thousands of names? If users are selecting from a browse list, how do you handle it for the early medieval people who had no surname, what other (customisable) data should you show so that the right choice is made?

Editing, but not adding, lots of people
You have a list of people whom you want to edit in a similar way; it may be that a new book has arrived on a particular family. So you want to identify the people in the family and go through them one by one. There are numerous scenarios where you may want to make a list and tick the people off as you go through them.

There are two steps here, identifying the list and then editing it. What are needed are "markers"; you may think it sufficient to have but one set of markers, so that people are either marked or not. Or you may think it useful to have three or four sets of markers so that you can be going through a few such lists at the same time.

To mark the people, you can go through them one by one and mark them. But this takes ages. The preferred way is to use part of your reporting system to make a sub-list from all the people in your database. The sub-list could be of the descendants of one person, or of their ancestors. It could be everyone born in a country after a date, as in England in 1837 when the system of Birth, etc Registration started. It could be everyone that you have flagged as solicitors, or as executed, or a signatory to Charles I's execution warrant. It could be that you wish to combine these, perhaps those already marked who are solicitors or descendants who are solicitors. Your reporting system must be designed to make these selections on any of the data stored in each user's system.

Having selected the people, they need to be marked at the click of one icon. And don't forget a facility to unmark everyone when required. And there needs to be an easy system to step through this list of marked people and to unmark each person as they are processed.

Finally it should be noted that individuals and families are different and can both be marked, entirely separately. Sometimes you just want to edit family information, not that on the individuals.

Similarly you may want to mark some of your sources in your source file for editing.

And you may wish to globally change some people
For instance I have a flag to show someone as an ancestor and another flag to show them as a blood relation and everyone else is related by marriage (or if they are not, then I have a little pocket of entirely separate people in my so-called family file!). I need this flag when I am trying to list the ancestors who were executed, etc; it gets impractical to do a search up the family tree at the same time.

I am regularly changing my family file, mostly adding people to it, sometimes moving them from one family to another, or linking up people and families that had not had those links before. So the people flagged as ancestors change and I need a global facility to update this. This can be done by reporting the list of all my ancestors, perhaps marking them and then running a routine to reset the Ancestor flag and then set it only for those currently shown as my ancestors or only those currently flagged. This sort of flag setting routine needs to be built in to the system.
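The flag-setting routine just described can be sketched in a few lines of Python. The `parents` table and the sample people are hypothetical; the routine clears the Ancestor flag everywhere and then walks up the tree from the chosen person.

```python
# person -> list of parents (serial numbers); made-up sample data.
parents = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: [], 6: [4, 5]}
flags = {p: {"ancestor": False} for p in parents}
flags[6]["ancestor"] = True     # stale flag left over from an old link

def reset_ancestor_flags(root):
    """Clear the Ancestor flag everywhere, then set it for root's ancestors."""
    for person in flags:
        flags[person]["ancestor"] = False
    stack = list(parents[root])
    while stack:
        person = stack.pop()
        if not flags[person]["ancestor"]:   # also guards against loops
            flags[person]["ancestor"] = True
            stack.extend(parents.get(person, []))

reset_ancestor_flags(1)
ancestors = sorted(p for p, f in flags.items() if f["ancestor"])
```

Person 6 here is a sibling of person 2, so the stale flag is cleared and not set again.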

Moving people from one family to another
Sometimes, you have got individuals in the wrong family: perhaps you have mistaken which wife was the mother or which husband the father.

I like the facility to put (ie drag) such people into a recoverable dustbin. Then you can pull (drag) them out later and attach them to the right family.

This is as well as the facility described above, to browse for new members of a family who are actually already on file.

Duplicates have to be deleted
You will find you have entered both people and families more than once. You need to move all (or a selection of) the data from the wrong person to the right one; this can be done either manually or a facility might be programmed in. Then you need to unlink the wrong person from the wrong family, if any, and link him to the right family. Finally the wrong person has to be deleted. Similarly for families. Remember that families certainly don't exist if they have no members and must be auto-deleted and they probably also don't exist if they have but one member, so should usefully also be deleted.

If these deletions are not done automatically, then the void people and families will appear on reports and will cause trouble!

4. Reporting.
Reports are useful for yourself and for exchanging information with others. They are not pretty so they won't go on display in your house or at family reunions but they are a mainstay of genealogy progress.

Reports can be about individuals, about collections of individuals, about ancestors and about descendants. These give three rather different formats, individual, ancestral and descendant; for each of these formats there is any amount of data fields that can be selected and there must be options to choose any of these and to save each set.

An individual report is just a list of all the items for that person that you wish to communicate, any field in their record must be selectable. There must be options to include their parents, biological or social, their spouses and the children they had; the options must include decisions on what data on each of these types of people is to be included.

Sometimes it is nice to have a narrative feature for the information, for example "Jo Soap was born on the 31st February 2232 in Igloo 23 of Station Ice Age on Mars" instead of "Jo Soap, b. 31.2.2232, location Igloo 23, Station Ice Age, Mars". I have seen customisation facilities for the words of such narrative descriptions, for each field to be reported.

An ancestor report is usually shown with ahnentafel numbers (refer to the guide for any program for how these work) to show how the generations relate to one another; each generation should include the relationship (6th great-grandfather, etc). Again there need to be options on the fields included and on the use of a narrative style.
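For readers who have not met them, ahnentafel numbers follow a simple rule: the subject is 1, and the father and mother of person n are 2n and 2n + 1. The Python sketch below works out the generation and the relationship wording from the number alone; the function names are my own.

```python
def father(n):
    return 2 * n

def mother(n):
    return 2 * n + 1

def generation(n):
    """Generations above the subject, who is number 1 (generation 0)."""
    return n.bit_length() - 1

def ordinal(k):
    if 10 <= k % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(k % 10, "th")
    return f"{k}{suffix}"

def relationship(n):
    gen = generation(n)
    if gen == 0:
        return "subject"
    base = "father" if n % 2 == 0 else "mother"   # even = male line entry
    if gen == 1:
        return base
    if gen == 2:
        return "grand" + base
    greats = gen - 2
    prefix = "great-" if greats == 1 else f"{ordinal(greats)} great-"
    return prefix + "grand" + base
```

For example, number 5 is the father's mother, so `relationship(5)` gives "grandmother", and number 256 is eight generations up in the male line.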

A descendant report is frequently shown with generation numbers and even with precise location in what is known as legal format. Sometimes each generation is indented using some special character or a space; if this is done then a fixed font is needed. As usual there need to be options on the fields included and on the use of a narrative style.

Reporting on a set of individuals requires a query system, as described above for marking people. The user must be able to specify any field he has used and any construct of the contents, such as "is", "is not", "contains", "after", "before", etc. A wild card facility is needed. Further the queries must be able to combine two or more such with AND and OR facilities - and I know people get confused if they have multiple ANDs and ORs! Messenger Pro has quite a good query system of this type. Having run the query, facilities are needed to print it, store it, or merely view it. Such storage facilities are quite different from the storage of layout customisation. It is probably convenient to store reports in RTF format, for portability reasons.
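A query system of this kind can be built from small tests that are combined with AND and OR. The Python sketch below is one possible shape, not a description of how Messenger Pro or any genealogy program does it; the function names and sample records are invented, and the wild cards come from Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase

def is_(field, pattern):
    # Wild cards: * matches any run of characters, ? a single character.
    return lambda rec: fnmatchcase(str(rec.get(field, "")), pattern)

def contains(field, text):
    return lambda rec: text in str(rec.get(field, ""))

def after(field, value):
    return lambda rec: rec.get(field) is not None and rec[field] > value

def not_(test):
    return lambda rec: not test(rec)

def all_of(*tests):     # AND
    return lambda rec: all(t(rec) for t in tests)

def any_of(*tests):     # OR
    return lambda rec: any(t(rec) for t in tests)

people = [
    {"name": "Jo Soap", "occupation": "solicitor", "birth_year": 1840},
    {"name": "Ann Soap", "occupation": "miner", "birth_year": 1812},
    {"name": "Tom Bloggs", "occupation": "solicitor", "birth_year": 1790},
]
query = all_of(is_("name", "* Soap"),
               any_of(is_("occupation", "solicitor"),
                      after("birth_year", 1837)))
selected = [p["name"] for p in people if query(p)]
```

Nesting `all_of` and `any_of` explicitly avoids the confusion over precedence that multiple ANDs and ORs cause; the same `selected` list could feed the marking facility.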

5. Charts:
Most people can understand family trees, even grannies! So every system has to have facilities for making and printing these. There are a variety of charts that different people like to use:

  1. Pedigree, showing the ancestors of a named individual for a named number of generations.
  2. Descendant, showing the off-spring of a named individual for a number of generations. To confuse matters, both this and the pedigree chart are known as family trees.
  3. Combined, showing both the ancestors and the descendants of a named individual. These are sometimes known as "hour-glass", for obvious reasons.
  4. Timeline where some group, generations even, of people are plotted against a set of publicly known events.
  5. Birth brief, which is a special case of pedigree and shows specific data on a number of generations with very little or no graphical content.


The first three types require extensive options for customisation and graphical enhancement. Personally I think the facilities within Generations (Easy Chart is the charting component) or its partner, Reunion for the Apple are worth studying in view of the massive flexibility therein.

However Generations does not handle pictures of individuals very well. Portraits need to be clearly linked to the individuals, it must be possible to decide where the portrait is to appear: inside the individual's box or beside it and north, south, east or west in each case. There must be some control on the size the portraits are to be. If the person is moved around the chart, the portrait must move with that person.

In addition there must be facilities to add other graphics and text anywhere on the chart, this enables people to provide an Illuminated Family Tree.

The data shown for each person must be customisable. For some specialist charts that I redraw as my data expands, I even have a special note field in my data for each type of these charts. And it must be possible to edit any text that appears on the individual's record on the chart.

Marriages in Britain commonly have a "=" between the spouses. In the USA they use "&". So allow choice of either or, even, of any user-specified symbol.

Sometimes it is convenient to show a name beside each field as in "Occ: miner." where "Occ" is "Occupation". But some people like to have no such field name. Options have to be provided for both and, indeed, for the user to specify what name to use in the charts for each field.

Some people like a border around their chart, so give this to them.

There need to be facilities to alter the size, colour and nature of the lines joining individuals. Similarly for any boxes surrounding individuals, plus any shadows.

The text size must be customisable, as must be the font. A small font enables more people to be put on a given size of paper.

Facilities are needed to move individuals and groups around en bloc. This can be done by branch, by single generation or by multiple generations. This enables you to put more people on a sheet of paper and to ensure that no entry straddles two sheets of paper.

It must be possible to print out on any size paper that the printer can handle. Further there has to be a facility to split the chart into, say, A4 pieces and print them out in a suitable order; then you can tape or glue the pieces together to make the Big Chart.
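The arithmetic of splitting a chart into pages is straightforward, as the Python sketch below suggests. A4 is 210 by 297 mm; the 10 mm margin, the function name and the row-by-row printing order are assumptions for the example.

```python
import math

def tiles(chart_w, chart_h, page_w=210.0, page_h=297.0, margin=10.0):
    """Yield (row, col, x, y, w, h) for each page-sized piece, in mm."""
    usable_w = page_w - 2 * margin
    usable_h = page_h - 2 * margin
    cols = math.ceil(chart_w / usable_w)
    rows = math.ceil(chart_h / usable_h)
    for row in range(rows):          # print row by row, left to right
        for col in range(cols):
            x, y = col * usable_w, row * usable_h
            yield (row, col, x, y,
                   min(usable_w, chart_w - x),
                   min(usable_h, chart_h - y))

pieces = list(tiles(500.0, 300.0))
```

A chart of 500 by 300 mm needs six sheets here; printing the row and column on each piece makes the taping together much less error-prone.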

I could go on for ever on this!

One feature that some people talk of is to feed data back from the chart to the database. If you alter the data on the chart, then this should optionally alter the data on the database. Personally I am not convinced of this as I see chart construction as an artistic process and not data-relevant.

Charts are not portable between genealogy systems. But it is probably good to have some facility around to convert the final chart file to a format that is portable between computer systems. GIF is an obvious choice, licensing restrictions apart.

6. Data (Gedcom) Transfer, In and Out
People communicate with their friends and relatives. There is no guarantee that others will have the same program. So a portable file format has been created to facilitate such transfers; this is GEDCOM, currently at version 5.5.

This does not resolve the problem, though. The worst problem is when someone creates special fields to hold special data, such as blazons for coats of arms; blazons are NOT in the GEDCOM standard and so they don't get transferred. Or if they do get transferred, because you have invented a GEDCOM tag for them, the receiving program knows nothing about that data type. The next problem is that your program may not create the GEDCOM file correctly. Generations, for instance, certainly gets the CONC and CONT tags confused.
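The difference between the two tags is small but matters: CONT starts a new line of the original text, while CONC carries on the same line with nothing inserted between the pieces. The Python sketch below writes and re-reads a long note on that basis; the function names and the width of 60 characters are chosen only for illustration.

```python
def note_lines(text, level=1, tag="NOTE", width=60):
    """Split a long note into GEDCOM-style lines using CONT and CONC."""
    out = []
    for line_no, line in enumerate(text.split("\n")):
        chunks = [line[i:i + width]
                  for i in range(0, len(line), width)] or [""]
        for chunk_no, chunk in enumerate(chunks):
            if line_no == 0 and chunk_no == 0:
                head = f"{level} {tag}"
            elif chunk_no == 0:
                head = f"{level + 1} CONT"    # a new line in the note
            else:
                head = f"{level + 1} CONC"    # same line, simply continued
            out.append(f"{head} {chunk}" if chunk else head)
    return out

def rejoin(lines):
    """Rebuild the note, as a receiving program might do it."""
    text = ""
    for n, line in enumerate(lines):
        _level, tag, *rest = line.split(" ", 2)
        value = rest[0] if rest else ""
        text += value if n == 0 or tag == "CONC" else "\n" + value
    return text

written = note_lines("abcdefghij\n\nxyz", width=4)
```

Swapping the two tags on either side either glues separate lines together or scatters stray line breaks through the text. A real writer would probably also avoid breaking a line at a space, since some receiving programs trim leading or trailing spaces.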

And even if you do create correct GEDCOM the receiving program may not handle that OK. Prefix and Suffix Titles are one candidate for this, another is the various data types for source documentation which tends to get received very poorly by most systems.

Obviously you need to offer customisation for the fields to be exported and to allow changes of GEDCOM tags at that stage. And there must be the usual facilities to save an export set of fields with their GEDCOM tags. But as important is some customisation on receiving GEDCOM data. You need to show all the tag names used in the incoming file and allow the user to change them to ones they have in their system, and perhaps also to specify the data type: event, fact or note for individuals, etc.

Some people start their computerised data storage on a spreadsheet and this is fine for small families. But interlinking the marriages and updating can become difficult once the size gets above a few hundred people. If the spreadsheet data can be put into tabular form, with a column for names, one for birth date, etc, your program should be able to load in a tab- or comma-separated values file and create the individuals directly. Obviously it is difficult to create the families from such a file, but it should not be too onerous forming the right links by hand.

Similarly it is very useful sometimes to be able to export (selected) data in tabular form and then load it into a spreadsheet to create far more useful reports. Spreadsheets usually have very nice report layout facilities.
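Loading a tabular file is the easy part, as this Python sketch using the standard `csv` module suggests. The column headings "Name", "Birth" and "Death" and the dictionary layout of each individual are assumptions; a real import would let the user map their own column headings to fields.

```python
import csv
import io

def load_individuals(stream, delimiter=","):
    """Create one individual per row; families are linked afterwards."""
    individuals = []
    reader = csv.DictReader(stream, delimiter=delimiter)
    for serial, row in enumerate(reader, 1):
        individuals.append({
            "serial": serial,
            "name": (row.get("Name") or "").strip(),
            "birth": (row.get("Birth") or "").strip() or None,
            "death": (row.get("Death") or "").strip() or None,
        })
    return individuals

sample = io.StringIO(
    "Name\tBirth\tDeath\n"
    "Jo Soap\t1840\t1901\n"
    "Ann Soap\t1812\t\n"
)
loaded = load_individuals(sample, delimiter="\t")
```

Empty cells become missing values, not empty strings, so a later report can tell "not known" from "blank".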

7. Internet site creation
Some family historians like to put their researches onto the internet. This means creating reports in linked HTML. There are programs (for IBM compatible PCs) that will do this from GEDCOM, but they lack flexibility. Broadly, such programs are of two types: first there are long linked lists of people in alphabetical order, which produce files that become very large and impractical for large databases; second there are systems that have a separate file for each family; these can be neat and fast but can have humongous numbers of files to upload to the site host, taking ages.

I have used both types but these days prefer the latter; I get over the uploading problem by zipping all the files up together and finding a site host that will unzip them on the site (www.freewebs.com).

But the overriding feature required is flexibility in choice of which data and which people you put on your site. By and large you should exclude living people, it causes upset if you include them without their permission. But you do need to be able to decide which fields from your data are to appear on the internet. And when you update, you should be able to call up this configuration by name.

One issue is the surname index, vital for searching. Some programs put all surnames into one great file and then say you can search from there. This is fine when you only have a hundred or so surnames, but when you have thousands, the file can be over 100K and take ages to load and re-load. It works reasonably well to split the surname files by their initial letter.
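Splitting the surname index by initial letter takes very little code, as in this Python sketch. The file naming scheme and the "other" page for surnames that do not begin with a letter are my own assumptions.

```python
from collections import defaultdict

def split_surname_index(surnames):
    """Group surnames into one index page per initial letter."""
    pages = defaultdict(set)
    for surname in surnames:
        initial = surname[:1].upper()
        key = initial if initial.isalpha() else "other"
        pages[key].add(surname)
    return {f"surnames_{key}.html": sorted(names)
            for key, names in sorted(pages.items())}

index = split_surname_index(["Powys", "Bloggs", "Soap", "Plantagenet"])
```

Each page then holds only a fraction of the surnames, so it loads quickly however large the whole database becomes.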

8. The competing programs

For my money none of the programs available for RISC OS machines even begin to achieve the above requirements. They are in a backwater, quite unlike the RISC OS graphics programs which remain up to the mark, and they are all primitive by comparison with the best programs available on other operating systems.

I use Generations and give lectures on advanced use of it to the Society of Genealogists. I like it but it has a few foibles. The worst problem is that its facilities to transfer GEDCOM out don't seem to be foolproof and lead to a few errors in receiving programs. But it has nice facilities for transferring GEDCOM in and for transferring in from spreadsheets or tabular data.

The two good free programs are PAF (Personal Ancestral File) from the Latter Day Saints, and Legacy. Both are good and both can be enhanced by spending small sums of money.

And there are many you have to pay for, some with demo versions: Generations (above), The Master Genealogist (very good, in its way), Family Tree Maker, Family Origins and a new British one, Family Historian. Finally there is Reunion for the Apple Mac. But never pay for the data CDROMs that come with some of these programs; everyone I know has found them useless! You will have to get some of these programs to see the sorts of things that are done and which many users will expect to find in your program.



Additional ideas and observations are added here. Please contribute to this as Family Historian or programmer. How do we produce such a program (or suite of programs) for RISC OS?
Initially published November 2002
December 2002

The contents of this page are copyright the author and AP On-line. This article must not be reproduced without the express permission of the publishers. Whilst we take care to ensure that the facts described in AP On-line are correct we appreciate that errors will occur and will try to ensure that such errors are amended as soon as possible after they are notified to us.