Bringing Innovations to Customs Data Analytics
To secure government revenue without interrupting legitimate trade flows, customs administrations around the world strive to develop ways to detect illicit trade. The IBS data science group has been collaborating with the World Customs Organization and National Cheng Kung University to build algorithms that predict intentional manipulation of invoices leading to the undervaluation of trade goods, the most common type of customs fraud used to avoid ad-valorem duties and taxes. We developed a fully supervised tree-aware embedding algorithm empowered by a deep attentive mechanism, which we call the DATE model. We are currently testing the DATE model on real-world import declarations collected in Africa.
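To give a flavor of the "tree-aware" part of the idea, here is a minimal sketch of one common way to turn a gradient-boosted tree ensemble into categorical features that a downstream embedding layer could consume. All names and the synthetic data are illustrative; this is not the DATE implementation itself, which additionally applies dual attention over the resulting embeddings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for import-declaration features (declared value, quantity, ...)
# with a synthetic "undervaluation" fraud label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Step 1: fit a gradient-boosted tree ensemble on the labelled declarations.
gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
gbdt.fit(X, y)

# Step 2: map each declaration to the leaf it lands in within every tree.
# These integer leaf ids are tree-aware categorical features; an attentive
# embedding model can then learn a vector per leaf.
leaves = gbdt.apply(X)[:, :, 0].astype(int)  # shape: (n_samples, n_trees)
print(leaves.shape)
```

Each row of `leaves` is a sequence of leaf indices, one per tree, which plays the role of a token sequence for the embedding and attention layers.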
For a glance at our current work, see our paper and code repository:
DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection
S. Kim, Y.-C. Tsai, K. Singh, Y. Choi, E. Ibok, C.-T. Li, and M. Cha. In Proc. of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), August 23-27, 2020. [GitHub]
We are interested in extending our work in the following directions:
- Artificial customs dataset generation using variational autoencoders or generative adversarial networks. The synthetic data could then be shared with customs officials and data scientists without exposing confidential declarations.
- Active learning for customs fraud detection. Using active learning, we want to select which import declarations to inspect, balancing short-term revenue maximization (exploitation) against long-term model performance improvement (exploration).
- Performance analysis of our DATE model by varying the testing period and identifying the factors that affect performance. Performance may shift as the distribution of customs goods changes over time, or due to external events such as natural disasters or diplomatic issues in the African nation.
- Preprocessing additional risk indicators from the import declarations dataset to help optimize our model for various objectives. We will explore ways to automatically generate diverse labels from the data using natural language processing and regular expressions.
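The exploitation/exploration trade-off in the active-learning direction can be sketched with a simple selection rule. Everything here is hypothetical (the function name, the `explore_weight` knob, and the toy scores); it only illustrates splitting an inspection budget between high-expected-revenue declarations and uncertain ones.

```python
import numpy as np

def select_for_inspection(fraud_prob, est_revenue, k, explore_weight=0.3):
    """Pick k declarations to inspect, mixing exploitation and exploration.

    fraud_prob    : predicted probability that each declaration is fraudulent
    est_revenue   : duty recoverable if the declaration is indeed fraudulent
    explore_weight: fraction of the budget spent on uncertain cases (hypothetical knob)
    """
    fraud_prob = np.asarray(fraud_prob, dtype=float)
    est_revenue = np.asarray(est_revenue, dtype=float)

    # Exploitation score: expected recovered revenue = P(fraud) * recoverable duty.
    exploit_score = fraud_prob * est_revenue
    # Exploration score: highest when the model is most unsure (P(fraud) near 0.5).
    explore_score = 1.0 - np.abs(2.0 * fraud_prob - 1.0)

    n_explore = int(round(k * explore_weight))
    n_exploit = k - n_explore

    chosen = [int(i) for i in np.argsort(-exploit_score)[:n_exploit]]
    # Fill the remaining slots with the most uncertain unchosen declarations.
    for i in np.argsort(-explore_score):
        if len(chosen) == k:
            break
        if int(i) not in chosen:
            chosen.append(int(i))
    return chosen

picked = select_for_inspection(
    fraud_prob=[0.9, 0.5, 0.1, 0.8],
    est_revenue=[100.0, 40.0, 500.0, 10.0],
    k=2,
)
print(picked)  # [0, 1]: one high-expected-revenue pick, one uncertain pick
```

In practice the exploitation score would come from a trained model such as DATE, and the split between the two objectives would itself be tuned.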
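The performance-over-time analysis can be sketched as a rolling-window evaluation: score predictions separately in consecutive periods and watch for drops that suggest the distribution of goods has drifted. The helper name and toy data below are illustrative only.

```python
import numpy as np

def rolling_accuracy(dates, y_true, y_pred, window_days=30):
    """Accuracy of fraud predictions in consecutive time windows.

    dates are integer day offsets; a drop in a later window hints at
    concept drift (e.g. the mix of imported goods changed in that period).
    """
    dates = np.asarray(dates)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for start in range(int(dates.min()), int(dates.max()) + 1, window_days):
        mask = (dates >= start) & (dates < start + window_days)
        if mask.any():
            scores.append(float((y_true[mask] == y_pred[mask]).mean()))
    return scores

# Toy example: perfect predictions in the first month, degraded in the second.
dates  = [1, 5, 20, 35, 40, 55]
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(rolling_accuracy(dates, y_true, y_pred))
```

A real analysis would use a ranking metric such as precision@k over inspected declarations rather than plain accuracy, but the windowing logic is the same.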
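The regex-based label generation in the last direction can be illustrated as follows. The keyword patterns and label names are hypothetical placeholders; real risk indicators would be defined with customs domain experts.

```python
import re

# Hypothetical keyword patterns mapping free-text item descriptions to risk labels.
RISK_PATTERNS = {
    "used_goods":   re.compile(r"\b(used|second[- ]hand|refurbished)\b", re.I),
    "bulk_textile": re.compile(r"\b(fabric|textile|garment)s?\b", re.I),
    "electronics":  re.compile(r"\b(phone|laptop|tablet)s?\b", re.I),
}

def extract_risk_labels(description):
    """Return the set of risk-indicator labels matched in an item description."""
    return {name for name, pat in RISK_PATTERNS.items() if pat.search(description)}

labels = extract_risk_labels("Second-hand laptops and used phones, 3 pallets")
print(sorted(labels))  # ['electronics', 'used_goods']
```

Labels derived this way could serve as auxiliary prediction targets or as extra input features when optimizing the model for different objectives.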
Snapshot from our weekly meeting: