Due Diligence Pipeline – Generating Open Source Intelligence Dossiers for Donor Vetting

Overview

Institutions that rely on philanthropic support need to manage reputational risk through comprehensive donor due diligence. The process normally involves manually gathering, analyzing, and aggregating complex data from sources scattered across public, private, and paid platforms, which demands significant staff effort. To automate it, teams sometimes use generative AI tools, such as ChatGPT and DeepSeek, to auto-generate PDF dossiers in batch. Such models take in names and a few keywords and populate profiles for the given entities. However, two main challenges arise in practice:

  • LLM Hallucination: Models fabricate, or report conflicting, information about a person or organization.
  • Ambiguity between Named Entities: Models sometimes cannot differentiate entities that share a name, even when keywords are provided.
Hence, this project aims to develop a trustworthy due diligence pipeline that addresses both challenges.

Key Features

Chain of Trust

Ensure that every profile fact is tightly linked to the webpage it was extracted from, so that each fact is traceable and verifiable.
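One way to sketch this linkage (field names and structure here are illustrative, not the project's actual schema) is to store each extracted value together with its source URL and the verbatim snippet that supports it, and to accept a fact only when the value actually appears in its snippet:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourcedFact:
    """A single profile fact bound to the page it was extracted from."""
    field_name: str   # e.g. "employer" (illustrative field)
    value: str        # the extracted value
    source_url: str   # page the value came from
    snippet: str      # verbatim supporting text scraped from that page

def is_supported(fact: SourcedFact) -> bool:
    # Trust a fact only if its value literally appears in the cited snippet;
    # this keeps every data point traceable back to its source page.
    return fact.value.lower() in fact.snippet.lower()
```

A dossier built from `SourcedFact` records can then render each claim with a footnote to `source_url`, letting a human reviewer verify any line of the profile.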

Clustering of Profiles

Use agglomerative clustering, with a human-in-the-loop review step, to detect ambiguous profiles in the mix.
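A minimal sketch of the idea, using a simple single-linkage agglomerative merge over pairwise name similarity (the real pipeline's features and distance metric may differ; `difflib` similarity is a stand-in here). Clusters containing multiple records suggest the same entity; near-threshold pairs left unmerged are candidates for human review:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Crude string similarity as a placeholder for real profile features.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def agglomerative_cluster(profiles: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Single-linkage agglomerative clustering: start with singletons and
    repeatedly merge the most similar pair of clusters above the threshold."""
    clusters = [[p] for p in profiles]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: similarity of the closest pair across clusters.
                s = max(similarity(a, b) for a in clusters[i] for b in clusters[j])
                if s >= threshold and (best is None or s > best[0]):
                    best = (s, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

For example, "John A. Smith" and "John Smith" would merge into one cluster, while "J. Smith" stays separate and would be surfaced to a human reviewer as potentially the same or a different person.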

Looking Forward

  • Named Entity Recognition
    Use NER to improve web-scraping quality: discard paragraphs where the target named entity is not mentioned.
  • Confidence / Reliability Score
    Maintain a list of highly trustworthy sites (.gov, .edu, forbes.com) and use source reliability to label each of our data points.
  • Fuzzy Matching of Similar Inputs across Sources
    Fuzzy-match values across sources while retaining their verifiability.
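The paragraph-filtering idea can be sketched as follows. This simplified stand-in keeps only paragraphs that mention the target name; a real implementation would run an NER model (e.g. spaCy) and match recognized entity spans rather than raw substrings:

```python
def filter_paragraphs(paragraphs: list[str], entity_name: str) -> list[str]:
    # Simplified stand-in for NER: discard paragraphs that never mention
    # the target entity, so unrelated scraped text never enters the dossier.
    target = entity_name.lower()
    return [p for p in paragraphs if target in p.lower()]
```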
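A source-reliability score could be sketched as a lookup keyed on the domain of each source URL. The tiers and numeric values below are purely illustrative; the actual whitelist and weights would be curated by the team:

```python
from urllib.parse import urlparse

# Illustrative tiers, not a vetted list.
DOMAIN_SCORES = {".gov": 1.0, ".edu": 0.9, "forbes.com": 0.8}
DEFAULT_SCORE = 0.3  # assumed baseline for unknown sources

def reliability(url: str) -> float:
    """Label a data point with the trustworthiness of its source domain."""
    host = urlparse(url).netloc.lower()
    best = DEFAULT_SCORE
    for suffix, score in DOMAIN_SCORES.items():
        if host.endswith(suffix) or host == suffix.lstrip("."):
            best = max(best, score)
    return best
```

Each `SourcedFact`-style data point could then carry `reliability(source_url)` alongside its provenance, letting conflicting values be resolved in favor of more reliable sources.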
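Fuzzy matching across sources could work roughly like this: near-identical values are grouped under one canonical form, while the list of supporting sources is kept so each merged value remains verifiable. The grouping rule below (first match above a `difflib` ratio threshold) is a sketch, not the project's actual method:

```python
from difflib import SequenceMatcher

def fuzzy_merge(values_by_source: list[tuple[str, str]],
                threshold: float = 0.85) -> list[tuple[str, list[str]]]:
    """Group near-identical field values from different sources.
    Returns (canonical_value, supporting_sources) pairs so that every
    merged value stays traceable to the sources that reported it."""
    groups: list[tuple[str, list[str]]] = []
    for source, value in values_by_source:
        for canon, sources in groups:
            if SequenceMatcher(None, value.lower(), canon.lower()).ratio() >= threshold:
                sources.append(source)  # same value, new supporting source
                break
        else:
            groups.append((value, [source]))  # genuinely new value
    return groups
```

For instance, "Acme Corporation" from one site and "ACME Corporation" from another collapse into a single entry backed by both sources, rather than appearing as two conflicting facts.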