Avoiding duplication when importing from a CSV using Rhapsody
In the previous post we described a way to import resources into a FHIR® server: use Rhapsody to transform a CSV file into a transaction bundle, then POST that bundle to the FHIR® server, which extracts the resources and saves them so that they can be queried directly.
We took a certain amount of care to avoid duplicating the Patient resource by using a conditional update in the bundle, so that the resource would be created only if it didn’t already exist and updated otherwise. But we didn’t give the same attention to the Condition resource. If we were to import the CSV file twice, the Patient wouldn’t be duplicated (it would be updated again) but the Condition would be.
Let’s look at ways that we can avoid this duplication. We’ll be leaning heavily on conditional CRUD operations:
- Conditional Create. Create a resource only if there are none that match a filter
- Conditional Update. If a single matching resource already exists then update it. Otherwise create it.
- Conditional Delete. Delete one or more resources that match a filter.
All of these operations use search parameters to define the filter. (The search parameters are defined at the bottom of each resource page in the spec.)
Incidentally, if you want to prevent the Patient from being updated when it already exists, there are a few options:
- We could use the ‘conditional create’ option in the transaction Bundle. To do this, add the ‘ifNoneExist’ element to the request element of the bundle entry, with the search criteria specifying the Patient identifier.
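As a rough illustration of how the three conditional operations differ on the wire, here is a minimal sketch that only builds request descriptions (it does not call a server). The base URL and the identifier and tag values are made up for the example; the shapes follow the spec, where conditional create carries the filter in the If-None-Exist header while conditional update and delete carry it in the query string.

```python
from urllib.parse import urlencode

BASE = "http://example.org/fhir"  # hypothetical server base URL


def conditional_create(resource_type, criteria):
    """POST to the type endpoint; the filter travels in the If-None-Exist header."""
    return {
        "method": "POST",
        "url": f"{BASE}/{resource_type}",
        "headers": {"If-None-Exist": urlencode(criteria)},
    }


def conditional_update(resource_type, criteria):
    """PUT to the type endpoint; the filter is the query string."""
    return {
        "method": "PUT",
        "url": f"{BASE}/{resource_type}?{urlencode(criteria)}",
        "headers": {},
    }


def conditional_delete(resource_type, criteria):
    """DELETE on the type endpoint; the filter is the query string."""
    return {
        "method": "DELETE",
        "url": f"{BASE}/{resource_type}?{urlencode(criteria)}",
        "headers": {},
    }


# Made-up filter values, purely for illustration.
for request in (
    conditional_create("Patient", {"identifier": "a|b"}),
    conditional_update("Patient", {"identifier": "a|b"}),
    conditional_delete("Condition", {"_tag": "a|b"}),
):
    print(request["method"], request["url"], request["headers"])
```

The resource body (for create and update) is left out, since the point here is only where the search criteria go.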
- We could have specific logic on the server that only creates – it doesn’t update.
- We could break the whole importing process into 2 parts:
Part one is to create all the missing Patients first. The easiest way is to use a batch operation: create a Bundle with all the Patients and use the conditional create option, so that it is only a single call. The response Bundle should return all the Patient resources, existing and created, using the location element, which we can then use when creating the references to Patient from Condition. (Note that the spec doesn’t specify that existing resources are returned, though the reference servers seem to do so.)
Part two is then to create the Condition resources (we have the Patient ids from part one), either individually or as a transaction/batch Bundle.
- A fourth option is to use a different sort of reference: a reference by identifier. This type of reference contains the business identifier, and the server uses that to locate the resource instance. It’s not that well supported though, and the behavior isn’t well defined by the spec (for example, it doesn’t say what to do if more than one resource matches), so it’s not really recommended unless you know exactly what the server will do.
Which option you choose would depend on the server capabilities of course…
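To make the conditional create option and the two-part approach a little more concrete, here is a minimal sketch. The identifier namespace, the sample rows and the helper names are all invented for illustration, and the location parsing assumes the server returns locations of the usual `Patient/[id]/_history/[version]` shape (relative or absolute).

```python
import json

MRN_SYSTEM = "http://example.org/mrn"  # hypothetical identifier namespace


def patient_entry(mrn, family_name):
    """Bundle entry that creates the Patient only if no Patient has this identifier."""
    return {
        "resource": {
            "resourceType": "Patient",
            "identifier": [{"system": MRN_SYSTEM, "value": mrn}],
            "name": [{"family": family_name}],
        },
        "request": {
            "method": "POST",
            "url": "Patient",
            # Search criteria only, i.e. what would follow the '?' in a search URL.
            "ifNoneExist": f"identifier={MRN_SYSTEM}|{mrn}",
        },
    }


def patient_bundle(rows):
    """Part one: a single batch Bundle holding every Patient from the file."""
    return {
        "resourceType": "Bundle",
        "type": "batch",
        "entry": [patient_entry(mrn, name) for mrn, name in rows],
    }


def patient_id_from_location(location):
    """Pull the logical id out of a response location such as 'Patient/123/_history/1'."""
    parts = location.split("/")
    return parts[parts.index("Patient") + 1]


bundle = patient_bundle([("12345", "Smith"), ("67890", "Jones")])
print(json.dumps(bundle["entry"][0]["request"], indent=2))
print(patient_id_from_location("Patient/123/_history/1"))
```

In part two, the ids recovered this way would go into `Condition.subject.reference` as `Patient/[id]`.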
With that out of the way, let’s think about how we can re-run the import multiple times and not duplicate the Condition. As always, we have a few ways we could do this.
The easiest way is to use the conditional create option on the Condition. It does mean that we need something to identify the Condition though – generally this would be the Condition.identifier property.
However, in the dataset that we’ve been using, there isn’t really anything that would act as an identifier. (We could use a combination of properties, e.g. the condition code and the subject reference, but that approach could have some unpleasant side effects, so sticking to identifier seems the safest.) Of course, there’s nothing to stop us making up an identifier ourselves: so long as it’s unique and won’t change, we’re good to go. Remember that the identifier has 2 main parts: the system property (like a namespace) and the actual unique value within that namespace.
If we can’t do this (maybe the server doesn’t support conditional create) then the only option is to first delete all the Conditions that were created by earlier imports of this file. To do this, we need some way to ‘mark’ the resources so that we know which ones to delete, and then use the conditional delete operation on them. (We can’t use the ‘ordinary’ delete operation as we don’t know what the ids of the resources are.) Fortunately, all resources have a ‘meta’ element, which contains (among other useful things) a ‘tag’ element that can be used for this purpose. Tag is already used for a number of things, including security/privacy labels, but we’re quite free to create our own tags (each of which has a Coding datatype).
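One way to make up such an identifier is to derive it deterministically from the row, so that re-importing the same file produces the same value. The sketch below is only an illustration: the system URL, the column choice (patient identifier, condition code, onset date) and the helper names are assumptions, and it only works if those columns together are unique per row and never change.

```python
import uuid

IMPORT_SYSTEM = "http://example.org/csv-import/condition-id"  # hypothetical namespace


def condition_identifier(patient_mrn, condition_code, onset_date):
    """Derive a stable identifier: the same row always yields the same value."""
    key = "|".join([patient_mrn, condition_code, onset_date])
    value = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{IMPORT_SYSTEM}/{key}"))
    return {"system": IMPORT_SYSTEM, "value": value}


def condition_entry(condition, identifier):
    """Bundle entry that creates the Condition only if the identifier is not already present."""
    resource = dict(condition, identifier=[identifier])
    return {
        "resource": resource,
        "request": {
            "method": "POST",
            "url": "Condition",
            "ifNoneExist": f"identifier={identifier['system']}|{identifier['value']}",
        },
    }


# Placeholder row values, not taken from the real dataset.
ident = condition_identifier("12345", "example-code", "2016-01-01")
entry = condition_entry({"resourceType": "Condition"}, ident)
print(entry["request"]["ifNoneExist"])
```

Because the value is derived rather than random, a second import generates the same `ifNoneExist` criteria and the server should find the existing Condition instead of creating a new one.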
So the changes we need to make are as follows.
- Each CSV file is assigned a tag (code plus namespace). The code could be the filename, if that is different for each file, and the system (namespace) could identify the application performing the import.
- When we create the transaction Bundle, we add the tag to the meta element of each Condition resource, as an entry in meta.tag carrying the system and code chosen for the file.
- Before we submit the transaction Bundle, we first issue the conditional delete command. This is a DELETE against the Condition endpoint with a _tag search parameter that selects the tag, in the form DELETE [host]/Condition?_tag=[system]|[code].
- Finally we submit the transaction Bundle to the server root.
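The tagging and delete steps above can be sketched as follows. The tag system, filename and base URL are placeholders invented for the example, and the code only builds the resource and the URL rather than calling a server.

```python
import json
from urllib.parse import urlencode

TAG_SYSTEM = "http://example.org/csv-import/tags"  # hypothetical tag namespace


def import_tag(filename):
    """A Coding that marks every resource created from one CSV file."""
    return {"system": TAG_SYSTEM, "code": filename}


def tag_condition(condition, tag):
    """Return a copy of the Condition with the tag appended to meta.tag."""
    tagged = dict(condition)
    meta = dict(tagged.get("meta", {}))
    meta["tag"] = list(meta.get("tag", [])) + [tag]
    tagged["meta"] = meta
    return tagged


def conditional_delete_url(base, tag):
    """URL for DELETE [base]/Condition?_tag=[system]|[code], with the value URL-encoded."""
    return f"{base}/Condition?" + urlencode({"_tag": f"{tag['system']}|{tag['code']}"})


tag = import_tag("conditions.csv")
condition = tag_condition({"resourceType": "Condition"}, tag)
delete_url = conditional_delete_url("http://example.org/fhir", tag)

print(json.dumps(condition, indent=2))
print("DELETE", delete_url)
```

Since the filter will normally match more than one Condition, this relies on the server allowing conditional delete of multiple resources, which not every server does.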
So we’ve looked at a number of ways of avoiding duplication of resources – both as part of the usual flow (Patient) and also in the case of accidental re-running of the import (Condition). I’m quite sure that there are other options as well.
Note that in all of these discussions (including the previous post) we are quite dependent on the capabilities of the target FHIR® server. The CapabilityStatement resource (returned by a GET [host]/metadata call) should indicate most of these – but testing (as always) will be critical!
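As a starting point for that check, here is a small sketch that reads the conditional flags for one resource type out of a CapabilityStatement. The sample statement is hand-written for the example rather than taken from a real server, and treating a missing element as "not supported" is an assumption; a server may support more than it declares, or less.

```python
def conditional_support(capability_statement, resource_type):
    """Return the declared conditional capabilities for a resource type, or None if not listed."""
    for rest in capability_statement.get("rest", []):
        if rest.get("mode") != "server":
            continue
        for resource in rest.get("resource", []):
            if resource.get("type") == resource_type:
                return {
                    "conditionalCreate": resource.get("conditionalCreate", False),
                    "conditionalUpdate": resource.get("conditionalUpdate", False),
                    # A code: not-supported | single | multiple
                    "conditionalDelete": resource.get("conditionalDelete", "not-supported"),
                }
    return None


# Hand-written sample, standing in for the result of GET [host]/metadata.
sample = {
    "resourceType": "CapabilityStatement",
    "rest": [
        {
            "mode": "server",
            "resource": [
                {"type": "Patient", "conditionalCreate": True, "conditionalUpdate": True},
                {"type": "Condition", "conditionalCreate": True, "conditionalDelete": "multiple"},
            ],
        }
    ],
}

print(conditional_support(sample, "Condition"))
```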
Final word: Remember to use transaction bundles – not batch bundles – when submitting resources to be created. From the spec:
References within a Bundle.entry.resource to another Bundle.entry.resource that is being created within the batch are considered to be non-conformant.
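To illustrate the kind of reference this rule is about, here is a minimal sketch of a transaction Bundle in which the Condition points at a Patient that is created in the same Bundle. The resource contents are placeholders; the linking technique shown is the usual one of giving each entry a `urn:uuid:` fullUrl and referencing that value, which the server rewrites to the real id when it processes the transaction.

```python
import json
import uuid


def transaction_bundle(patient, condition):
    """Build a transaction Bundle where the Condition references the new Patient."""
    patient_url = f"urn:uuid:{uuid.uuid4()}"
    # Copy the Condition and point its subject at the Patient's temporary URL.
    linked_condition = dict(condition, subject={"reference": patient_url})
    return {
        "resourceType": "Bundle",
        "type": "transaction",  # not "batch": entries here depend on each other
        "entry": [
            {
                "fullUrl": patient_url,
                "resource": patient,
                "request": {"method": "POST", "url": "Patient"},
            },
            {
                "fullUrl": f"urn:uuid:{uuid.uuid4()}",
                "resource": linked_condition,
                "request": {"method": "POST", "url": "Condition"},
            },
        ],
    }


tx = transaction_bundle({"resourceType": "Patient"}, {"resourceType": "Condition"})
print(json.dumps(tx, indent=2))
```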