Linking de-identified patient records to build a holistic view of each patient offers substantial research benefits, especially for studies of drug development and patient outcomes. Today, most research data is limited to structured datasets because de-identifying all records is complex. In this talk, we present the benefits of, and lessons learned from, a de-identification pipeline built with Spark NLP and tokenization.
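
To make the linking idea concrete, here is a minimal sketch of how tokenization might let de-identified records from different sources be joined. It assumes tokenization means replacing direct identifiers with a keyed hash (HMAC-SHA256); the abstract does not say which scheme the pipeline uses. The field names, the `SECRET_KEY` value, the sample records, and the helper functions are all hypothetical, and the Spark NLP step for free text is not shown.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a managed key store.
SECRET_KEY = b"replace-with-a-managed-secret"

# Assumed set of direct identifiers to drop from each record.
IDENTIFYING_FIELDS = {"first_name", "last_name", "dob"}


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial variations hash alike."""
    return " ".join(value.strip().lower().split())


def patient_token(first_name: str, last_name: str, dob: str,
                  key: bytes = SECRET_KEY) -> str:
    """Derive a stable, non-reversible token from direct identifiers."""
    message = "|".join(normalize(p) for p in (first_name, last_name, dob))
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict) -> dict:
    """Replace the identifying fields with a token and keep the rest."""
    token = patient_token(record["first_name"], record["last_name"], record["dob"])
    kept = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    return {"patient_token": token, **kept}


# Made-up records for the same person from two hypothetical sources.
ehr_record = {"first_name": "Jane", "last_name": "Doe",
              "dob": "1980-01-31", "diagnosis_code": "E11.9"}
claims_record = {"first_name": " JANE ", "last_name": "doe",
                 "dob": "1980-01-31", "drug_code": "A10BA02"}

ehr_deid = deidentify(ehr_record)
claims_deid = deidentify(claims_record)

# The records can now be joined on the token without exposing identifiers.
linked = ehr_deid["patient_token"] == claims_deid["patient_token"]
```

A keyed hash is used here rather than a plain hash so that someone without the key cannot rebuild tokens by hashing guessed names and birth dates. A real pipeline would also need to handle typos and missing fields, which this sketch ignores.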