Tuesday, 30 January 2024

Family Tree investigations #3: How I use the FamilySearch APIs

Mechanisms for integration with FamilySearch fall into four categories:

1. Formal (supported) partner APIs

As documented here: https://www.familysearch.org/developers/docs/api/resources.

I do not have access to these at present; however, the vast majority of the published read-only APIs are available under the next category.

2. Documented browser-based APIs

These are all included in the previous category, but are also available from a browser when authenticated using normal FS user credentials.

All can return either XML or JSON, depending on the Accept header sent with the request (e.g. Accept: application/json). I use JSON.

I have successfully used them:

a) Manually from Chrome with the Requestly extension.

b) From Python via the HTTP-based mechanism (Session class) of the open source getmyancestors module: see https://github.com/Linekio/getmyancestors. (At one stage I had borrowed the relevant getmyancestors source, but for reasons unknown my version stopped working when theirs didn't, and it's much easier and more future-proof to rely on the getmyancestors module - for which many thanks.)

c) From Python via Selenium and JavaScript: see the JavaScript-related answer at https://stackoverflow.com/questions/77494543/seleniumbase-undetected-chrome-driver-how-to-set-request-header.

Some code snippets are provided below.

The APIs I use in this category are as follows:

API

Request Example and Notes

Person

/platform/tree/persons/WY9W-47K

Where a Person has been the source of a Merge, the JSON returned by the Person API will relate to the target (i.e. remaining) Person.

If all you actually want is the current Person version number to check for changes, here's an example of what to look for in the JSON:

"version" : "1706748431675"

where the numeric value (a Unix timestamp, apparently in milliseconds) is updated every time the Person (or its notes, sources, or memories) changes (or sometimes for no discernible reason at all).

The Person JSON also covers that Person's Couple Relationships and Parent/Child Relationships.

Person Notes

/platform/tree/persons/2733-MKF/notes

Frustratingly, there appears to be no way of telling from the main Person JSON whether there are any Notes present.

Person Memories

/platform/tree/persons/9XX8-DK2/memories

If one or more Memories is present, there will be entry(ies) like this in the main Person JSON:

"evidence" : [ { … "resource" : "familysearch.org/platform/memories/memories/…/personas/…"

Person Sources

/platform/tree/persons/WY9W-47K/sources

If one or more Sources is present, there will be relevant pointer(s) in the main Person JSON.

Note that the Person Sources JSON does not cover the Person's Couple Relationship or Parent/Child Relationship sources (see the next entry).

Source Descriptions

/platform/sources/descriptions/SLFX-QCX

Used to retrieve Couple Relationship or Parent/Child Relationship sources.

Search Family Tree

/platform/tree/search?q.surname=tinkham&f.sex=Male&count=100&offset=3400

Maximum offset=4800 for count=100.

The overall total number of Persons for this search will be found at the top of the JSON, e.g.

"results" : 3169

I have found it necessary to wait a couple of seconds between calls to this API to avoid throttling.

Also returns Spouse, Parent, and Child information, which I find very useful.

Extract Collections List

/platform/records/collections?count=100&start=3100

For reasons unknown there always seem to be more Collections than the relevant UI Results value would indicate. I just look for the first 4,000 (maximum start=3900 where count=100); obviously this arbitrary cut-off may need increasing at some point.
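The Search Family Tree and Extract Collections List requests above are both paged in steps of count, up to a ceiling. A minimal sketch of generating the request paths follows; the function name and its parameters are hypothetical, and the ceilings are simply the values noted above rather than documented limits.

```python
from urllib.parse import urlencode


def paged_requests(path, params, total, offset_key="offset", count=100, max_offset=4800):
    """Yield request paths covering `total` results, `count` at a time.

    Stops at `max_offset`, so anything beyond max_offset + count results
    cannot be reached with this query alone.
    """
    last = min(total - 1, max_offset)
    for offset in range(0, last + 1, count):
        # Keep the caller's search parameters first, then add the paging ones.
        query = dict(params, **{"count": count, offset_key: offset})
        yield path + "?" + urlencode(query)
```

For the search example above with "results" : 3169, this yields 32 paths ending at offset=3100. For the collections list, pass offset_key="start" and max_offset=3900. Pausing a couple of seconds between search calls would still be needed, as noted above.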

 

Code snippets:

 

2.1. Get JSON via HTTP

2.1.1. Imports (HTTP)

from getmyancestors.classes.session import Session

2.1.2. Create session and log in (HTTP)

direct = Session(args.username,args.password)

if direct.logged:

    print('Direct login successful')

2.1.3. Individual request (HTTP)

url = 'https://familysearch.org' + request

res = direct.get(url)

status_code = res.status_code

out_text = res.text
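Once out_text holds a Person response, the version check described earlier can be sketched as below. This is only an illustration: the helper names are hypothetical, it assumes the first "version" entry in the text is the Person's own, and it assumes the value is a Unix timestamp in milliseconds.

```python
import re
from datetime import datetime, timezone

# Matches entries of the form  "version" : "1706748431675"
VERSION_RE = re.compile(r'"version"\s*:\s*"(\d+)"')


def person_version(json_text):
    """Return the first "version" value found in the JSON text, or None."""
    match = VERSION_RE.search(json_text)
    return match.group(1) if match else None


def version_as_datetime(version):
    """Interpret the version as milliseconds since the Unix epoch (an assumption)."""
    return datetime.fromtimestamp(int(version) / 1000, tz=timezone.utc)
```

Comparing person_version(out_text) with the value stored from a previous run is then enough to tell whether the Person needs to be re-read.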

 

2.2. Get JSON via Selenium/JavaScript

2.2.1. Imports (Selenium)

import time

from seleniumbase import Driver

from selenium.webdriver.common.by import By

from selenium.webdriver.common.keys import Keys

2.2.2. Create session and log in (Selenium)

driver = Driver(uc=True)

driver.get('https://familysearch.org/auth/familysearch/login')

time.sleep(5)

# Fill in the sign-in form by element id (raises NoSuchElementException if the page layout has changed)

driver.find_element(By.ID,'userName').send_keys(fsusername)

driver.find_element(By.ID,'password').send_keys(fspassword)

driver.find_element(By.ID,'login').send_keys(Keys.RETURN)

time.sleep(5)

if 'Sign In' in driver.title:

    print('Not successfully signed in')

else:

    print('Successfully signed in')

2.2.3. Individual request (Selenium)

url = 'https://www.familysearch.org' + request

response = driver.execute_async_script("var callback = arguments[arguments.length - 1]; fetch('" + url + "', {method: 'GET', headers: {'Accept' : 'application/json'}}).then((response) => response.text().then((text) => callback({'status': response.status, 'text': text})))")

status_code = response['status']

out_text = response['text']

Note the slight difference in the FamilySearch base URLs used for individual requests: https://familysearch.org in 2.1.3 but https://www.familysearch.org in 2.2.3.

Approach 2.2 is faster and seems much less prone to throttling. The relevant session and login are required anyway if you want to do any screen scraping (see below).
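The script string in 2.2.3 is built by plain concatenation, which would break if a URL ever contained a quote character. A hedged sketch of a small builder that quotes the URL with json.dumps and also reports fetch failures is shown below; build_fetch_script is a hypothetical helper, and the 'status': 0 convention for errors is an assumption of this sketch.

```python
import json


def build_fetch_script(url, accept="application/json"):
    """Return JavaScript for execute_async_script that fetches `url` as text."""
    return (
        "var callback = arguments[arguments.length - 1];"
        # json.dumps yields a safely quoted JavaScript string literal.
        "fetch(" + json.dumps(url) + ", {method: 'GET', headers: {'Accept': " + json.dumps(accept) + "}})"
        ".then((response) => response.text()"
        ".then((text) => callback({'status': response.status, 'text': text})))"
        # Without a catch, a network error would leave Selenium waiting for its script timeout.
        ".catch((error) => callback({'status': 0, 'text': String(error)}));"
    )
```

It would be used as response = driver.execute_async_script(build_fetch_script(url)), with response['status'] and response['text'] read exactly as in 2.2.3.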

I have also experimented successfully with the following API, although I have not found it necessary for my use cases as listed in blog post #2, Data:

API

Request Example and Notes

Person Changes

/platform/tree/persons/2733-MKF/changes

N.B. for any Person with appreciable change history this will return loads of data; it’s not (as far as I can see) possible to filter the information by change date or by contributor.

3. Undocumented browser-based APIs

I have found two of these:

API

Request Example and Notes

Arkid JSON or XML

/ark:/61903/1:1:FHD3-WNG

The JSON or XML output is much more detailed than the corresponding HTML. In particular, for a Record (…/1:1:…) the JSON or XML identifies the Collection to which this Record belongs.

Used in the same way as the APIs in the previous category.

Record metadata TSV export

https://familysearch.org/search/webservice/hrresults/download?count=100&offset=1800&q.surname=tinkham&f.sex=Male&f.collectionId=4496122&fileType=tsv

Maximum offset=4900 for count=100.

I only use the f.sex filter where I am forced to (i.e. >5000 results in total).

Whenever throttling occurs, the response will be the first 21 records of the same search with no f.collectionId filter (‘base output’).

Thus throttling can be caught by checking the number of records, and where 21 are found (which may of course be legit) comparing the output with the relevant ‘base output’, collected for the purpose at the start of the run. I use four types of ‘base output’: overall, and with f.sex filtered to each of Male, Female, and %3F=Unspecified.

I have found that inserting a two-minute delay prevents problems with subsequent downloads, although a new session has to be started before the failed download can be retried.

Any download attempt that would return zero records will typically result in throttling (hence the need to screen scrape the record counts by collection, as described below).

Throttling also frequently occurs where there is a high record count for the relevant collection.

Note that this type of processing is also possible with Family Tree Find results, e.g.: https://familysearch.org/search/webservice/treeresults/download?count=100&offset=3800&q.surname=tinkham&fileType=tsv
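The throttling check described above (count the records, and compare with the ‘base output’ only when there are exactly 21) can be sketched as follows. The function names are hypothetical, and the sketch assumes the TSV export starts with a single header line and that the stored base output holds at least 21 records.

```python
def tsv_records(tsv_text):
    """Return the data lines of a TSV export, assuming one header line."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    return lines[1:]


def looks_throttled(download_text, base_text, throttle_count=21):
    """True if the download is identical to the unfiltered 'base output'."""
    records = tsv_records(download_text)
    if len(records) != throttle_count:
        # Any other record count cannot be the throttled fallback.
        return False
    # Exactly 21 records may be legitimate, so compare the content as well.
    return records == tsv_records(base_text)[:throttle_count]
```

When looks_throttled returns True, the delay and new session described above would apply before the download is retried.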

4. Screen scraping

Obviously the screens scraped will change over time, so regular testing/tweaking will be needed.

API

Example URL and Notes

Selenium to scrape overall Records count

https://www.familysearch.org/search/record/results?count=100&q.surname=tinkham (‘Results (’ scraped)

Assuming Selenium session/login as in section 2.2 above:

results = -1  # sentinel: no 'Results (' count found yet

driver.get(url)

time.sleep(5)

elems = driver.find_elements(By.TAG_NAME,'*')

for elem in elems:

    try:

        for resultStr in (elem.text).splitlines():

            if 'Results (' in resultStr:

                results = int(resultStr.split()[1].replace('(','').replace(')','').replace(',',''))

                break   

    except Exception as e:

        pass

    if results >= 0:

        break
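The string handling in the loop above can also be isolated into a small function that is easy to test without a browser. This is a sketch only: parse_results_count is a hypothetical name, and it assumes the count appears on the page in the form 'Results (3,169)'.

```python
import re

# Captures the digits and thousands separators inside 'Results (...)'.
RESULTS_RE = re.compile(r"Results \(([\d,]+)\)")


def parse_results_count(page_text):
    """Return the count from the first 'Results (n)' marker, or -1 if absent."""
    match = RESULTS_RE.search(page_text)
    return int(match.group(1).replace(",", "")) if match else -1
```

It could be fed the text of the whole page in one call, for example parse_results_count(driver.find_element(By.TAG_NAME,'body').text), rather than visiting every element in turn.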

 

API

Example URL and Notes

Selenium to scrape Records counts by collection

https://www.familysearch.org/search/record/results?count=100&q.surname=tinkham (same URL as previous, collection filter selected, each Collection Title then automatically entered and its resulting count scraped – this takes a while but works really well and involves very little FamilySearch server traffic).

Assuming Selenium session/login as in section 2.2 above, id and title for each collection stored in table ‘collectionList’, and the identified counts written to table ‘records’:

rows = cursor.execute('select distinct collectionId,collectionTitle from collectionList').fetchall()

url = 'https://www.familysearch.org/search/record/results?' + query

driver.get(url)

time.sleep(5)

elems = driver.find_elements(By.XPATH,'//*[@name]')

if elems:

    for elem in elems:

        if elem.get_attribute('name') == 'Collection-filter':

            elem.click()

            break

time.sleep(2)

driver.switch_to.active_element.send_keys(Keys.TAB)

driver.switch_to.active_element.send_keys(Keys.TAB)

inputBox = driver.switch_to.active_element

for row in rows:

    collection = row[1]

    inputBox.send_keys(collection)

    time.sleep(.25)

    found = 0

    collectionCount = 0

    elems = driver.find_elements(By.XPATH,'//label')

    for elem in elems:

        if elem.text == collection:

            print('Collection: ' + row[0] + ', ' + elem.text)

            found = 1

            break

    elems = driver.find_elements(By.XPATH,'//p')

    for elem in elems:

        if elem.text != '' and ' ' not in elem.text and 'Page' not in elem.text:

            collectionCount = int(elem.text.replace(',',''))

            print('Count: ' + str(collectionCount))

            break

    if found == 1:

        records = [query,row[0],collectionCount,time.strftime("%d/%m/%Y"),time.strftime("%H:%M:%S")]

        cursor.execute('insert into records (query,collection,results,date,time) values(?,?,?,?,?)', records)

        connection.commit()

    inputBox.send_keys(Keys.CONTROL + 'a', Keys.BACKSPACE)
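The code above relies on an existing database connection and on the two tables it reads and writes. A minimal sketch of that setup is given below, assuming SQLite (suggested by the ? placeholders); the column names are taken from the statements above, but the column types and the create_tables helper are assumptions.

```python
import sqlite3


def create_tables(connection):
    """Create the two tables used by the collection-count scrape, if missing."""
    cursor = connection.cursor()
    # Collection ids are treated as text, matching the string concatenation above.
    cursor.execute(
        "create table if not exists collectionList "
        "(collectionId text, collectionTitle text)"
    )
    cursor.execute(
        "create table if not exists records "
        "(query text, collection text, results integer, date text, time text)"
    )
    connection.commit()
    return cursor
```

For example, connection = sqlite3.connect('familysearch.db') followed by cursor = create_tables(connection) would provide the connection and cursor objects used above; the file name here is purely illustrative.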

 
