<fitz> Happy Thursday, all! If we start inserting computed file ids into our generated pdfs and down the road realize we chose a bad id-generation function, is it possible to migrate those annotations from the old id to a new one?
<robertknight> Do you manage the user accounts being used to annotate these documents? If so, you can use the APIs to patch existing annotations.
<robertknight> We don’t currently have APIs to rename URLs / URNs.
<fitz> Nope, afraid not. This would be for OSF preprints.
<fitz> Gotcha.
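For reference, the patching approach mentioned above might look roughly like the sketch below. It only builds the request body for a single annotation and sends nothing over the network. `retarget_annotation` and the example ids are hypothetical, and the sketch assumes the annotation JSON carries its document URI in `uri` and in each `target[].source`. The body would still have to be sent as a `PATCH` to that annotation's API endpoint while authenticated as its owner, which is why this route only works for accounts you manage.

```python
from copy import deepcopy


def retarget_annotation(annotation: dict, old_uri: str, new_uri: str) -> dict:
    """Return a PATCH body that moves one annotation from old_uri to new_uri.

    Hypothetical helper: nothing is sent over the network here.
    """
    if annotation.get("uri") != old_uri:
        raise ValueError("annotation is not attached to old_uri")

    # Copy the targets so the caller's annotation is left untouched.
    targets = deepcopy(annotation.get("target", []))
    for target in targets:
        if target.get("source") == old_uri:
            # Only the source changes; selectors (quote, position) are kept as-is.
            target["source"] = new_uri

    return {"uri": new_uri, "target": targets}


old = "urn:x-pdf:old-id"  # placeholder ids, not real fingerprints
new = "urn:x-pdf:new-id"
annotation = {
    "id": "abc123",
    "uri": old,
    "target": [
        {"source": old, "selector": [{"type": "TextQuoteSelector", "exact": "hello"}]}
    ],
}
body = retarget_annotation(annotation, old, new)
print(body["uri"])
```

Because the selectors are copied unchanged, this only makes sense when the old and new ids refer to the same document content.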
<robertknight> Do your articles already have any kind of persistent identifier?
<fitz> Alternatively, would you be interested in a PR to add `<link rel="canonical">` support to the PDF annotator?
<fitz> Not yet… sort of.
<fitz> Some of our users upload pdfs, others upload .doc or .docx files.
<robertknight> By persistent identifier I’m talking about a DOI, PMID, ISBN etc.
<robertknight> … any system for registering an identifier for an article in a database.
<fitz> My bad. We do, once they’re made public.
<fitz> But private preprints do not.
<robertknight> Do the identifiers need to persist across multiple versions of an article?
<robertknight> i.e. do you need to be able to make annotations that continue to show up on future versions of the article? Or can articles be treated as immutable and “finished”?
<fitz> That would be a nice-to-have, but I don’t believe it’s a hard requirement.
<fitz> We do allow folks to upload new versions.
<robertknight> Do you have a particular file ID generation scheme in mind already?
<fitz> On our staging environment, we’re currently using a hash of some OSF-specific metadata that should persist across revisions.
<fitz> I think it satisfies most of our use cases, but I’m trying to anticipate problems.
<fitz> It’s cheap to do for .doc and .docx files since we have to convert those anyway. One downside is that we’d have to start postprocessing pdfs and overwriting any built-in identifiers.
<robertknight> Hash of metadata sounds reasonable.
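A minimal sketch of the "hash of metadata" idea, assuming SHA-256 over a canonical JSON serialization. The chat does not say which hash function or which OSF fields are used, so `compute_file_id`, `provider`, and `preprint_guid` are all hypothetical names chosen for illustration.

```python
import hashlib
import json


def compute_file_id(metadata: dict) -> str:
    """Derive a stable hex id from metadata that persists across revisions."""
    # Sort keys and fix separators so equal metadata always serializes identically.
    canonical = json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical field names; only include values that do not change between revisions.
file_id = compute_file_id({"provider": "osf", "preprint_guid": "abc12"})
print(len(file_id))
```

For the id to survive new uploads, anything revision-specific (version number, upload timestamp, file size) would need to stay out of the hashed metadata.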
<robertknight> Do you include the original metadata in the PDF in any format (eg. XMP)?
<robertknight> We don’t currently extract such metadata for PDFs, but we do for HTML documents, and having that info may be useful for linking documents in future.
<fitz> We’re not yet doing this for pdfs, but I suppose we could.
<fitz> Would it make sense to update the PDF annotator for the h client to look for canonical rels and dc.identifiers like the document annotator does?
<robertknight> Extracting DOIs or other persistent identifiers: yes. Extracting metadata (title, author etc.): yes. Canonical rel: probably not. I think we’re going to avoid promoting this as the preferred way to identify the same document at different URLs in future. Its meaning for SEO purposes is different from what we want to use it for.
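As a rough illustration of the DOI extraction discussed here, the sketch below scans already-extracted metadata values for something shaped like a DOI. It assumes the PDF metadata (for example from XMP) is available as a plain dict of strings. `find_doi` and the sample values are hypothetical, and this is not how the h client does it.

```python
import re
from typing import Optional

# Heuristic for the common "10.<registrant>/<suffix>" DOI shape; not a validator.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(metadata: dict) -> Optional[str]:
    """Return the first DOI-looking string found in the metadata values, if any."""
    for value in metadata.values():
        if not isinstance(value, str):
            continue
        match = DOI_PATTERN.search(value)
        if match:
            # Trailing punctuation usually belongs to the surrounding sentence.
            return match.group(0).rstrip(".,;")
    return None


print(find_doi({"title": "Example", "identifier": "doi:10.1234/example.5678"}))
```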
<fitz> That makes sense.
<fitz> (For reference, I’m exploring file ids b/c the urls that the annotations are linked to are chaotic. I’m trying to clean them up to produce a canonical / predictable url, but our renderer’s design makes that difficult.)
<fitz> Thanks for your help, @robertknight!
<robertknight> np. The situation with unstable URLs is not uncommon.