Clinical Trial Master File Migration & Information Extraction

Fast, accurate, and consistent migration of large volumes of clinical trial documents from unstructured or legacy systems to the DIA TMF Reference Model, with automated information extraction along the way.

Time and labor savings


Automating Clinical Trial Master File Migration & Information Extraction

End-to-end AI-enabled solution


Data guidelines

  • Definition of document class
  • Definition of extracted data
  • Data cleaning rules
  • Annotation in Annotation Lab

Process governance

  • Quality assurance

Training of custom components

  • Training custom models
  • Development of postprocessing component
  • Training false-positive classifier


  • Accuracy evaluation
  • Performance measurements
  • Feedback loop on postprocessing rules
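
Accuracy evaluation for an extracted attribute is commonly reported as precision, recall, and F1 against a manually annotated gold set. The following is a minimal sketch of that calculation, not the product's evaluation code; the function name `field_scores`, the exact-match criterion, and the record layout are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical record layout: document id -> extracted value (None = nothing extracted).
Predictions = Dict[str, Optional[str]]


def field_scores(gold: Predictions, predicted: Predictions) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one attribute, using exact match."""
    true_pos = false_pos = false_neg = 0
    for doc_id, gold_value in gold.items():
        pred_value = predicted.get(doc_id)
        if pred_value is not None and pred_value == gold_value:
            true_pos += 1
        else:
            if pred_value is not None:
                false_pos += 1  # something was extracted, but it was wrong
            if gold_value is not None:
                false_neg += 1  # a value existed and was missed or wrong
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Scoring each attribute separately (signature date, last name, and so on) makes it easier to see which postprocessing rules need another pass in the feedback loop.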

Running the production pipelines

  • OCR text extraction
  • Document classification
  • Data extraction
  • Data postprocessing and cleaning
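
The four stages above can be pictured as a chain of functions that each enrich a document record. This is only an illustrative sketch: every name here (`Document`, `ocr`, `classify`, `extract`, `postprocess`, `run_pipeline`) is hypothetical, the OCR step is a pass-through placeholder, and the keyword and regex rules stand in for the trained models a real pipeline would use.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class Document:
    raw: str                          # stand-in for scanned page content
    text: str = ""                    # filled by the OCR stage
    artifact: Optional[str] = None    # document class
    fields: Dict[str, str] = field(default_factory=dict)


def ocr(doc: Document) -> Document:
    # Placeholder: a real pipeline would call an OCR engine here.
    doc.text = doc.raw
    return doc


def classify(doc: Document) -> Document:
    # Toy keyword rule standing in for a trained document classifier.
    lowered = doc.text.lower()
    if "curriculum vitae" in lowered:
        doc.artifact = "Principal Investigator's CV"
    elif "protocol signature" in lowered:
        doc.artifact = "Protocol Signature Page"
    return doc


def extract(doc: Document) -> Document:
    # Toy regex standing in for a trained entity extractor.
    match = re.search(r"Signature date:\s*(\d{1,2} \w{3} \d{4})", doc.text)
    if match:
        doc.fields["signature_date"] = match.group(1)
    return doc


def postprocess(doc: Document) -> Document:
    # Normalise dates to ISO 8601; leave unparseable values untouched for review.
    value = doc.fields.get("signature_date")
    if value:
        try:
            doc.fields["signature_date"] = datetime.strptime(value, "%d %b %Y").date().isoformat()
        except ValueError:
            pass
    return doc


def run_pipeline(raw: str) -> Document:
    doc = Document(raw=raw)
    for stage in (ocr, classify, extract, postprocess):
        doc = stage(doc)
    return doc
```

Keeping each stage as a separate function with the same signature means a stage can be retrained or swapped without touching the others.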

Quality control

  • Review of output quality
  • Exception and manual queue handling
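
One plausible way to feed a manual queue is to route each extraction by a confidence score. The sketch below assumes such a score is available per extracted value; the `Extraction` type, the `route` function, and the 0.9 threshold are hypothetical and would need to be set according to the project's own quality targets.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Extraction:
    doc_id: str
    attribute: str
    value: str
    confidence: float  # assumed to lie in [0, 1]


def route(
    extractions: Iterable[Extraction], threshold: float = 0.9
) -> Tuple[List[Extraction], List[Extraction]]:
    """Split extractions into (auto_accepted, manual_queue) by confidence."""
    accepted: List[Extraction] = []
    manual: List[Extraction] = []
    for item in extractions:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            manual.append(item)  # goes to a human reviewer
    return accepted, manual
```
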
GxP validation
Extracted information:
  • Artifact: Protocol Signature Page, Principal Investigator’s CV
  • Version number
  • Principal investigator’s last name
  • Signature date
  • Multiple dates present in text
  • Hand-written dates and names
  • OCR-related issues
Example of extracted information:
  • Artifact: Informed Consent Form, Site Staff Qualification Supporting Information
  • First name and last name
  • Relevant date
  • Role
  • and more.
  • Date selection (e.g. expiration dates may be extracted depending on the presence of other dates)
  • Role extracted from content or mapped from metadata
  • Depending on the case, ICF Type is extracted from the content or from the metadata
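
Date selection of the kind mentioned above can be expressed as a small priority rule over the labelled dates found in a document. The sketch below is an assumption about how such a rule might look, not the project's actual logic: the label names, the `PRIORITY` order, and the `select_relevant_date` function are all hypothetical.

```python
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical priority order; the real rules are project-specific.
PRIORITY = ("signature", "issue", "expiration")


def select_relevant_date(
    candidates: List[Tuple[str, date]]
) -> Optional[Tuple[str, date]]:
    """Pick one (label, date) pair: highest-priority label wins, latest date breaks ties."""
    for label in PRIORITY:
        matching = [d for lbl, d in candidates if lbl == label]
        if matching:
            return label, max(matching)
    return None  # no usable date: leave for manual review
```

With this ordering, an expiration date is returned only when no signature or issue date is present in the document.
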
eTMF: automatic accuracy and confidence estimation
  • Automatic detection of whether the extracted information is correct

  • Reduction of false positives is critical for business success

  • Machine learning method

State-of-the-art accuracy
  • Based on award-winning Spark NLP software

  • Combination of NLP and user-defined rules

Faster & smarter
  • 80% reduction of manual labor

  • 80% reduction of migration timeline

Secure and compliant
  • On-premises, air-gapped installation

  • Proven technology

  • GxP validated

Proven in the real world: NOVARTIS

Year-long migration project from a legacy document system to a new enterprise document management system

  • 48 artifacts (document classes) of the DIA TMF Reference Model, e.g., Site Staff Qualification Supporting Information, Sub-Investigator Curriculum Vitae, FDA 1572
  • 29 attributes, e.g., First Name, Last Name, Signature Date, Expiration Date, License Date