Mechanisms for integration with FamilySearch fall
into four categories:
1. Formal (supported) partner APIs
As documented here: https://www.familysearch.org/developers/docs/api/resources.
I do not have access to these at present; however, the vast majority of the published read-only APIs are available under the next category.
2. Documented browser-based APIs
These are all included in the previous category, but are also available from a browser when authenticated using normal FS user credentials.
All can return either XML or JSON if an appropriate header is set. I use JSON.
I have successfully used them:
a) Manually from Chrome with the Requestly extension.
b) From Python via the HTTP-based mechanism (Session class) of the open source getmyancestors module: see https://github.com/Linekio/getmyancestors. (At one stage I had borrowed the relevant getmyancestors source, but for reasons unknown my version stopped working when theirs didn't, and it's much easier and more future-proof to rely on the getmyancestors module itself, for which many thanks.)
c) From Python via Selenium and JavaScript: see the JavaScript-related answer at https://stackoverflow.com/questions/77494543/seleniumbase-undetected-chrome-driver-how-to-set-request-header.
Some code snippets are provided below.
The APIs I use in this category are as follows:
API | Request Example and Notes |
Person |
/platform/tree/persons/WY9W-47K
Where a Person has been the source of a Merge, the JSON returned by the Person API will relate to the target (i.e. remaining) Person.
If all you actually want is the current Person version number to check for changes, here's an example of what to look for in the JSON: "version" : "1706748431675", where the numeric value (a Unix timestamp in milliseconds) will be updated every time the Person (or its notes, sources, or memories) changes (or sometimes for no discernible reason at all).
The Person JSON also covers that Person's Couple Relationships and Parent/Child Relationships. |
Person Notes |
/platform/tree/persons/2733-MKF/notes
Frustratingly, there appears to be no way of telling from the main Person JSON whether there are any Notes present. |
Person Memories |
/platform/tree/persons/9XX8-DK2/memories
If one or more Memories are present, there will be one or more entries like this in the main Person JSON: "evidence" : [ { … "resource" : "familysearch.org/platform/memories/memories/…/personas/…" |
Person Sources |
/platform/tree/persons/WY9W-47K/sources
If one or more Sources are present, there will be relevant pointer(s) in the main Person JSON.
Note that the Person Sources JSON does not cover the Person's Couple Relationship or Parent/Child Relationship sources (see the next entry). |
Source Descriptions |
/platform/sources/descriptions/SLFX-QCX |
Search Family Tree |
/platform/tree/search?q.surname=tinkham&f.sex=Male&count=100&offset=3400
Maximum offset=4800 for count=100.
The overall total number of Persons for this search will be found at the top of the JSON, e.g. "results" : 3169
I have found it necessary to wait a couple of seconds between calls to this API to avoid throttling.
Also returns Spouse, Parent, and Child information, which I find very useful. |
Extract Collections List |
/platform/records/collections?count=100&start=3100
For reasons unknown there always seem to be more Collections than the relevant UI Results value would indicate. I just look for the first 4,000 (maximum offset=3900 where count=100); obviously this arbitrary cut-off may need increasing at some point. |
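The paging limits noted for Search Family Tree can be sketched as a small generator of request paths. This is only a minimal sketch: the function name `search_requests` and its defaults are hypothetical, and only the `q.surname`, `f.sex`, `count` and `offset` parameters and the offset ceiling of 4800 come from the notes above.

```python
from urllib.parse import urlencode


def search_requests(surname, sex=None, total=None, count=100, max_offset=4800):
    """Yield request paths for successive pages of a Family Tree search.

    total is the "results" value read from the first response, if known.
    Paging stops at whichever comes first: total or max_offset.
    (Hypothetical helper, not part of any FamilySearch library.)
    """
    offset = 0
    while offset <= max_offset and (total is None or offset < total):
        params = {'q.surname': surname}
        if sex:
            params['f.sex'] = sex
        params['count'] = count
        params['offset'] = offset
        yield '/platform/tree/search?' + urlencode(params)
        offset += count


# Example: a search known to have 250 results needs three pages.
paths = list(search_requests('tinkham', sex='Male', total=250))
```

Each path can then be requested in turn, with a pause of a couple of seconds between calls as noted in the table.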
Code snippets:
2.1. Get JSON via HTTP
2.1.1. Imports (HTTP)
from getmyancestors.classes.session import Session
2.1.2. Create session and log in (HTTP)
direct = Session(args.username,args.password)
if direct.logged:
    print('Direct login successful')
2.1.3. Individual request (HTTP)
# request is one of the paths listed above, e.g. '/platform/tree/persons/WY9W-47K'
url = 'https://familysearch.org' + request
res = direct.get(url)
status_code = res.status_code
out_text = res.text
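Once the Person JSON text is in hand, the version check mentioned in the Person notes can be sketched as follows. The helper names `person_version` and `has_changed` are hypothetical, and because the exact position of the "version" key within the response is not stated here, this sketch simply searches the whole structure for the first one it finds. The nesting in the sample is invented for illustration; only the version value is taken from the post.

```python
import json


def person_version(person_json_text):
    """Return the first "version" value found in a Person response, or None."""
    def find(node):
        if isinstance(node, dict):
            if 'version' in node:
                return str(node['version'])
            for value in node.values():
                hit = find(value)
                if hit is not None:
                    return hit
        elif isinstance(node, list):
            for item in node:
                hit = find(item)
                if hit is not None:
                    return hit
        return None
    return find(json.loads(person_json_text))


def has_changed(person_json_text, last_seen_version):
    """True if the version in the response differs from the stored one."""
    return person_version(person_json_text) != last_seen_version


# Hypothetical sample: the real response is far larger and may nest differently.
sample = '{"persons": [{"id": "WY9W-47K", "version": "1706748431675"}]}'
print(person_version(sample))
```

Storing the returned value alongside each Person ID would allow a later run to skip anyone whose version is unchanged.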
2.2. Get JSON via Selenium/Javascript
2.2.1. Imports (Selenium)
import time
from seleniumbase import Driver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
2.2.2. Create session and log in (Selenium)
driver = Driver(uc=True)
driver.get('https://familysearch.org/auth/familysearch/login')
time.sleep(5)
# Fill in the user name
elems = driver.find_elements(By.XPATH,'//*[@id]')
for elem in elems:
    if elem.get_attribute('id') == 'userName':
        elem.send_keys(fsusername)
        break
# Fill in the password
elems = driver.find_elements(By.XPATH,'//*[@id]')
for elem in elems:
    if elem.get_attribute('id') == 'password':
        elem.send_keys(fspassword)
        break
# Submit the login form
elems = driver.find_elements(By.XPATH,'//*[@id]')
for elem in elems:
    if elem.get_attribute('id') == 'login':
        elem.send_keys(Keys.RETURN)
        break
time.sleep(5)
if 'Sign In' in driver.title:
    print('Not successfully signed in')
else:
    print('Successfully signed in')
2.2.3. Individual request (Selenium)
url = 'https://www.familysearch.org' + request
# Run fetch() inside the logged-in browser and hand status and body back to Python
response = driver.execute_async_script(
    "var callback = arguments[arguments.length - 1]; "
    "fetch('" + url + "', {method: 'GET', headers: {'Accept' : 'application/json'}})"
    ".then((response) => response.text().then((text) => "
    "callback({'status': response.status, 'text': text})))"
)
status_code = response['status']
out_text = response['text']
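Building the JavaScript by plain string concatenation breaks if the URL ever contains a single quote (a surname such as o'brien in a query string, say). A slightly more defensive variant is sketched below. The helper name `build_fetch_script` is hypothetical, as is the `.catch` branch, which reports a network failure as status 0 so that the Python side is not left waiting for the script timeout.

```python
import json


def build_fetch_script(url, accept='application/json'):
    """Return JavaScript for execute_async_script that fetches url.

    json.dumps() turns each Python string into a safely quoted
    JavaScript string literal. (Hypothetical helper.)
    """
    return (
        "var callback = arguments[arguments.length - 1]; "
        "fetch(" + json.dumps(url) + ", {method: 'GET', headers: {'Accept': "
        + json.dumps(accept) + "}})"
        ".then((response) => response.text().then((text) => "
        "callback({'status': response.status, 'text': text})))"
        ".catch((error) => callback({'status': 0, 'text': String(error)}));"
    )


script = build_fetch_script(
    'https://www.familysearch.org/platform/tree/persons/WY9W-47K')
```

The result would be passed to the driver in the same way as before, i.e. `response = driver.execute_async_script(script)`.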
Note the slight difference in the individual request FamilySearch URLs: the HTTP approach uses https://familysearch.org, whereas the Selenium approach uses https://www.familysearch.org.
Approach 2.2 is faster and seems much less prone to throttling. The relevant session and login are required anyway if you want to do any screen scraping (see below).
I have also experimented successfully with the following, although I have not needed it for my use cases as listed in blog post #2, Data:
API | Request Example and Notes |
Person Changes |
/platform/tree/persons/2733-MKF/changes
N.B. for any Person with appreciable change history this will return loads of data; it’s not (as far as I can see) possible to filter the information by change date or by contributor. |
3. Undocumented browser-based APIs
I have found two of these:
API | Request Example and Notes |
Arkid JSON or XML |
The JSON or XML output is much more detailed than the corresponding HTML. In particular, for a Record (…/1:1:…) the JSON or XML identifies the Collection to which this Record belongs.
Used in the same way as the APIs in the previous category. |
Record metadata TSV export |
Maximum offset=4900 for count=100.
I only use the f.sex filter where I am forced to (i.e. >5000 results in total).
Wherever throttling occurs, the response will be the first 21 records of the same search with no f.collectionId filter (‘base output’). Thus throttling can be caught by checking the number of records and, where 21 are found (which may of course be legit), comparing the output with the relevant ‘base output’, collected for the purpose at the start of the run. I use four types of ‘base output’: overall, and with f.sex filtered to each of Male, Female, and %3F=Unspecified.
I have found that an inserted 2 minute delay will prevent problems with subsequent downloads, although a new session will have to be started before the failed download can be retried.
Any download attempt that would return zero records will typically result in throttling (hence the need to screen scrape the record counts by collection, as described below).
Throttling also frequently occurs where there is a high record count for the relevant collection.
Note that this type of processing is also possible with Family Tree Find results, e.g.: https://familysearch.org/search/webservice/treeresults/download?count=100&offset=3800&q.surname=tinkham&fileType=tsv |
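The throttling check described for the TSV export can be sketched as a pure comparison of a download against the relevant ‘base output’. The function names are hypothetical, and the sketch assumes that the TSV starts with one header line and that the 21 records are counted excluding it; set `has_header=False` if that assumption does not hold.

```python
def data_rows(tsv_text, has_header=True):
    """Return the non-empty lines of a TSV download, minus the header if any."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    return lines[1:] if has_header else lines


def looks_throttled(tsv_text, base_output, throttle_rows=21, has_header=True):
    """True if the download has exactly throttle_rows records AND those
    records match the 'base output' collected at the start of the run.

    A genuine 21-record result differs from the base output, so it is
    not flagged. (Hypothetical helper.)
    """
    rows = data_rows(tsv_text, has_header)
    if len(rows) != throttle_rows:
        return False
    return rows == data_rows(base_output, has_header)
```

When this returns True, the 2 minute delay and fresh session described above would be applied before the download is retried.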
4. Screen scraping
Obviously the screens scraped will change over time, so regular testing/tweaking will be needed.
API | Example URL and Notes |
Selenium to scrape overall Records count |
https://www.familysearch.org/search/record/results?count=100&q.surname=tinkham
(‘Results (’ scraped) |
Assuming Selenium session/login as in section 2.2 above:
driver.get(url)
time.sleep(5)
results = -1  # stays negative until a count has been scraped
elems = driver.find_elements(By.TAG_NAME,'*')
for elem in elems:
    try:
        for resultStr in (elem.text).splitlines():
            if 'Results (' in resultStr:
                # e.g. 'Results (3,169)' gives 3169
                results = int(resultStr.split()[1].replace('(','').replace(')','').replace(',',''))
                break
    except Exception as e:
        pass
    if results >= 0:
        break
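The string handling in that loop can also be isolated into a small function, which makes it easy to test without a browser. This is a minimal sketch using a regular expression instead of split/replace; the name `parse_results_count` is hypothetical and the sample page text is made up, with only the ‘Results (’ marker taken from the screen being scraped.

```python
import re

# Matches e.g. "Results (3,169)" and captures the digits and commas
RESULTS_PATTERN = re.compile(r'Results \(([\d,]+)\)')


def parse_results_count(page_text):
    """Return the count shown as 'Results (N)' in page_text, or -1 if absent."""
    match = RESULTS_PATTERN.search(page_text)
    if match is None:
        return -1
    return int(match.group(1).replace(',', ''))


# Made-up page text for illustration
sample_text = 'Historical Records\nResults (3,169)\nPage 1'
count = parse_results_count(sample_text)
```

With a live session, the text of the whole page could be passed in one go, e.g. `parse_results_count(driver.find_element(By.TAG_NAME, 'body').text)`, rather than looping over every element.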
API | Example URL and Notes |
Selenium to scrape Records counts by collection |
https://www.familysearch.org/search/record/results?count=100&q.surname=tinkham
(same URL as previous, collection filter selected, each Collection Title then automatically entered and its resulting count scraped; this takes a while but works really well and involves very little FamilySearch server traffic). |
Assuming Selenium session/login as in section 2.2 above, id and title for each collection stored in table ‘collectionList’, and the identified counts written to table ‘records’:
rows = cursor.execute('select distinct collectionId,collectionTitle from collectionList').fetchall()
url = 'https://www.familysearch.org/search/record/results?' + query
driver.get(url)
time.sleep(5)
# Open the Collection filter
elems = driver.find_elements(By.XPATH,'//*[@name]')
if elems:
    for elem in elems:
        if elem.get_attribute('name') == 'Collection-filter':
            elem.click()
            break
    time.sleep(2)
    # Tab twice to reach the filter's text input box
    driver.switch_to.active_element.send_keys(Keys.TAB)
    driver.switch_to.active_element.send_keys(Keys.TAB)
    inputBox = driver.switch_to.active_element
    for row in rows:
        collection = row[1]
        inputBox.send_keys(collection)
        time.sleep(.25)
        found = 0
        collectionCount = 0
        # A label matching the title means the collection has results
        elems = driver.find_elements(By.XPATH,'//label')
        for elem in elems:
            if elem.text == collection:
                print('Collection: ' + row[0] + ', ' + elem.text)
                found = 1
                break
        # The first single-token paragraph holds the count
        elems = driver.find_elements(By.XPATH,'//p')
        for elem in elems:
            if not elem.text == '' and not ' ' in elem.text and not 'Page' in elem.text:
                collectionCount = int(elem.text.replace(',',''))
                print('Count: ' + str(collectionCount))
                break
        if found == 1:
            records = [query,row[0],collectionCount,time.strftime("%d/%m/%Y"),time.strftime("%H:%M:%S")]
            cursor.execute('insert into records (query,collection,results,date,time) values(?,?,?,?,?)', records)
            connection.commit()
        # Clear the input box ready for the next collection title
        inputBox.send_keys(Keys.CONTROL + 'a', Keys.BACKSPACE)
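The snippet above takes `cursor`, `connection` and the two tables for granted. One way they might be set up with the standard sqlite3 module is sketched below. The table and column names come from the queries above, but the column types, the in-memory database and the sample rows (including the collection id and title) are assumptions made purely for illustration.

```python
import sqlite3
import time

# ':memory:' keeps the sketch self-contained; a file path would persist the data
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# Column names as used in the queries above; the types are assumed
cursor.execute('create table if not exists collectionList '
               '(collectionId text, collectionTitle text)')
cursor.execute('create table if not exists records '
               '(query text, collection text, results integer, date text, time text)')

# Placeholder collection, not a real FamilySearch id or title
cursor.execute('insert into collectionList values (?,?)',
               ('0000000', 'Example Collection Title'))
connection.commit()

rows = cursor.execute('select distinct collectionId,collectionTitle '
                      'from collectionList').fetchall()

# Record a made-up count for the placeholder collection
query = 'count=100&q.surname=tinkham'
records = [query, rows[0][0], 42,
           time.strftime("%d/%m/%Y"), time.strftime("%H:%M:%S")]
cursor.execute('insert into records (query,collection,results,date,time) '
               'values(?,?,?,?,?)', records)
connection.commit()
```

In practice the collectionList table would be filled from the Extract Collections List API rather than by hand.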