Collaborative Research: Privacy-Sensitive Data Mining from Multi-Party Distributed Data

Project Award Number
National Science Foundation Grant IIS-0350533
Collaborative award (UMBC)

Principal Investigator
Krishnamoorthy Sivakumar
School of Electrical Engineering and Computer Science
Washington State University
Pullman, WA 99164-2752
Voice: (509) 335-4969, Fax: (509) 335-3818, Email: siva @  eecs. wsu. edu

Hillol Kargupta
Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County
1000 Hilltop Circle, Baltimore,  MD 21250
Voice: (410) 455-3972, Fax : (410) 455-3969, Email: hillol @ cs. umbc. edu

WSU: Da Meng, Jianjie Ma, Wang Qi, Taikun Cheng
UMBC: Kun Liu, Souptik Dutta.

Post Docs (UMBC): Ran Wolff, Chris Giannella

Privacy preserving
distributed data mining

Project Summary
A growing number of data mining applications need to deal with data sources that are distributed, possibly proprietary, and sensitive to privacy. Financial transactions, health-care records, and network communication traffic are a few examples. Privacy is also becoming an increasingly important issue in data mining applications for counter-terrorism and homeland defense that may require creating profiles, constructing social network models, detecting terrorist communications from distributed privacy sensitive multi-party data. Combining such diverse data sets belonging to different parties may violate the privacy laws. Therefore we need algorithms that can mine the data in a distributed fashion while guaranteeing that the privacy of the data is not compromised.

This has resulted in the development of several privacy-preserving data mining techniques. Many of these techniques work using randomized techniques to perturb the data and preserve the data privacy while still guaranteeing the invariance of the underlying patterns. This research pointed out that the popular naive randomization of the data (more specifically, additive random matrices) may preserve little data privacy in many cases. It proposes a framework to understand these data masking techniques using the theory of random matrices to shows the problems of some existing privacy-preserving data mining techniques and potential research directions for solving the problems. Current research effort includes development of a systematic framework for designing privacy-preserving data mining algorithms. A post randomization based scheme has been proposed towards that end.

This proposal also suggests the development of a collection of randomized techniques that are provably correct in supporting privacy-sensitive applications. It proposes a randomized projection-based technique to compute statistical aggregates from multi-party distributed data. It also suggests research on a privacy-sensitive distributed Bayesian network learning algorithm for similar applications. The proposed algorithms will be developed, tested, and evaluated using a collection of testbeds that PIs have been developing for the last several years.
This research will be a collaborative effort between University of Maryland Baltimore County and Washington State University. Professor Kargupta will take the lead on the random matrices and random projection research. Professor Sivakumar will take the lead on the privacy-sensitive Bayesian network-based distributed data mining.

Publications and Products

  1. K. Das, K. Bhaduri, K. Liu, H. Kargupta. (2008). Distributed Identification of Top-l Inner Product Elements and its Application in a Peer-to-Peer Network. IEEE Transactions on Knowledge and Data Engineering. Volume 19, Number 3.
  2. K. Das, H. Kargupta, K. Liu, (2008). A Game Theoretic Framework for Practical Privacy Preserving Distributed Data Mining. (In communication)
  3. H. Kargupta, K. Das, and K. Liu. (2007). A Game Theoretic Approach toward Multi-Party Privacy-Preserving Distributed Data Mining. In the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, September 2007.
  4. K. Liu, C. Giannella, H. Kargupta. (2007). On the Privacy of Euclidean Distance Preserving Data Perturbation. (In communication).
  5. K. Liu, K Bhaduri, K. Das, P. Nguyen, H. Kargupta (2006). Client-side Web Mining for Community Formation in Peer-to-Peer Environments. SIGKDD Explorations. Volume 8, Issue 2, Pages 11-20.
  6. K. Liu, H. Kargupta, and J. Ryan. (2006) Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE), volume 18, number 1, pp. 92-106, January 2006.
  7. K. Liu, C. Giannella, and H. Kargupta. (2006). An attacker's view of distance preserving maps for privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September 2006.
  8. J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, and M. Klusch. (2005) Distributed Data Mining and Agents, Invited submission, Engineering Applications of Artificial Intelligence Journal. volume 18, pp. 791--807.
  9. H. Kargupta, K. Liu, S. Datta, J. Ryan, K. Sivakumar(2003) Link Analysis, Privacy Preservation, and Random Perturbations.  Proceedings of the  KDD Workshop on Link Analysis for Detecting Complex Behavior  (LinkKDD’03), Washington D.C., July  2003. pdf
  10. H. Kargupta, K. Liu, S. Datta, J. Ryan, and K. Sivakumar. (2003). Homeland security and privacy sensitive data mining from multi-party distributed resources. In Proceedings of the 12th IEEE International Conference on Fuzzy Systems, volume 2, pp. 1257-1260, St. Louis, MO, May 2003. pdf
  11. H. Kargupta, H. Dutta, S. Datta , K. Sivakumar (2003) Privacy Preserving Data Mining and Random Perturbation. Proceedings of the Workshop on Privacy in the Electronic Society (WPES'03), Washington DC,October,2003. pdf
  12. S. Datta, H. Kargupta and K. Sivakumar. (2003) .Homeland Defense, Privacy-Sensitive Data Mining, and Random Value Distortion. Proceedings of the SIAM Workshop on Data Mining for Counter Terrorism and Security  (SDM'03),San Francisco, CA, May, 2003. pdf
  13. D. Meng, K. Sivakumar, and H. Kargupta. (2004). Privacy Sensitive Bayesian Network Parameter Learning. Proceedings of the Fourth IEEE International Conference on Data Mining. Brighton, UK, pages 427--430. pdf
  14. H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. (2004). Random Data Perturbation Techniques and Privacy Preserving Data Mining. Knowledge and Information Systems Journal, volume 7, number 4, pages 387--414. pdf
  15. H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. (2003). On the Privacy Preserving Properties of Random Data Perturbation Techniques. Proceedings of the IEEE International Conference on Data Mining. Melbourne, Florida, USA, pages 99--106. (Winner of 2003 IEEE Data Mining Conference Best Paper Award.) pdf
  16. J. Ma and K. Sivakumar, A Framework of Privacy-Preserving Bayesian Network Parameter Learning using Post Randomization, In WSEAS Transaction of Information Technologies, January, 2006. pdf
  17. J. Ma and K. Sivakumar, Privacy-preserving Bayesian Network Learning from Heterogeneously Distributed data, In the proceeding of the 2nd International conference on Data Mining, Las Vegas, NV, (DMIN06), 2006. pdf
  18. H. Kargupta and K. Sivakumar, (2004) Existential Pleasures of Distributed Data Mining. Data Mining: Next Generation Challenges and Future Directions. Editors: H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha. AAAI/MIT Press.
  19. J. Ma, Learning from perturbed data for privacy-preserving data mining, PhD Thesis, Washington State University, 2006. pdf
  20. H. Kargupta, K. Liu , and J. Ryan. (2003). Privacy Sensitive Distributed Data Mining from Multi-party Data. Proceedings of Intelligence and Security Informatics, First NSF/NIJ Symposium 2003:336-342, Tucson, AZ. K. Liu, H. Kargupta, and J. Ryan. (2003).
  21. Random projection and privacy preserving correlation computation from distributed data. Technical Report TR-CS-03-24, Computer Science and Electrical Engineering Department, University of Maryland, Baltimore County. In communication.

Some of the above publications can be downloaded from this link....

Goals, Objectives and Targeted Activities

1. Develop a framework to understand the privacy-preserving properties of randomized data perturbation techniques using the theory of random matrices.
Exploration of Shannon's Channel Coding theorem to quantify data perturbation privacy.

3. Develop randomized data perturbation techniques that are provably correct and work in a distributed environment.
4. Develop a class random projection-based distributed algorithms for computing common statistical aggregates.
5. Develop privacy-sensitive distributed Bayesian network-based mining.
6. Implement and incorporate the proposed techniques into an experimental distributed data mining system.
7. Evaluate the developed techniques using available test beds.

Educational and Outreach Activities:
1. Broadening the participation of the under-represented groups, web presence.
2. Promoting undergraduate and graduate education through curriculum development and research participation.
3. Enhancing the infrastructure through the development of research and instructional laboratory.

Area References
C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu, Tools for Privacy Preserving Distributed Data Mining, ACM SIGKDD Explorations 4(2), January 2003.

Project Websites

Other Resources
Privacy Preserving Data Mining Bibliography
Distributed Data Mining Bibliography
Distributed Data Mining Wiki