The Social Implications of Data Mining

This post explores how data mining, a rapidly changing discipline of new technologies and concepts, affects the individual right to privacy. As technology becomes more enmeshed in the daily lives of individuals, information on their activities is being stored, accessed, and used. Society is developing a new definition of privacy in this information environment, with few laws specifying privacy protection for electronic transmission and storage. Collecting and using data without limitations is unacceptable, but norms have changed enough that data collection has been accepted without much opposition.

Data collection and privacy issues are at the forefront of international discussions. Interest groups that believe voluntary restraints are sufficient struggle against privacy advocates who argue that controls must be backed by legislation to be effective. Advocacy groups are alarmed about the government’s potential for invading privacy, but data collection by businesses has expanded that concern to both the public and private sectors.

Under the guise of protecting the public interest, government and business are revising regulations to expand the use of data once considered private, such as bank records and medical files. Advocacy groups counter this trend by urging that individuals be allowed to participate in any collection and distribution of their personal data.

With so much data already stored and transmitted, privacy advocates feel it is no longer possible to control the process. This lack of confidence indicates the need to strengthen watchdog efforts, if privacy rights are to be retained. What are ethical information practices that can satisfy data mining users and privacy advocates? Can data mining principles be developed to dictate how personal data can be protected in terms of quality, purpose, use, security, participation, and accountability?

Data Mining Defined

The information age has enabled many organizations to gather large amounts of data, but its usefulness is negligible if knowledge cannot be extracted. Data mining attempts to answer this need by connecting the fields of databases, artificial intelligence, and statistics. It has steadily evolved since the 1960s.

Data mining is the discovery of actionable patterns in large amounts of data using statistical and artificial intelligence tools (Berry & Linoff, 1997). More specifically, it is “the process of nontrivial extraction of implicit, previously unknown and potentially useful information such as knowledge rules, constraints, and regularities from data stored in repositories using pattern recognition technologies as well as statistical and mathematical techniques” (Lee & Siau, 2001, 41). Data mining can also be defined as “the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner” (Hand, Mannila, & Smyth, 2001, 1). Observational data—rather than experimental—refers to data that has already been collected for an original purpose, such as bank transactions. As a secondary use, data mining harvests this information to find actionable patterns.

There are two main categories of data mining tasks: description and prediction (Mena, 2004). In description, the goal is to discover patterns by seeking out variables common to different individuals or groups that exhibit certain characteristics. Examples of descriptive methods include association rule discovery and clustering. Association rule discovery finds connections among sets of items or objects in a database, indicating how likely several events are to occur together (Agrawal, Imielinski, & Swami, 1993). Clustering creates groupings of objects; objects that belong to the same cluster are similar to each other and differ from objects in other clusters (Berkhin, 2001). Prediction is used to make statements about the unknown based upon the known; it can forecast the future or explain the present (Weiss & Indurkhya, 1998).
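To make association rule discovery more concrete, here is a minimal sketch of the two measures usually attached to a rule: support (how often the items appear together) and confidence (how often the consequent appears when the antecedent does). The shopping baskets and the function names are invented for illustration; they are not taken from the cited sources.

```python
# Hypothetical market-basket data, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the whole itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent (rule: antecedent -> consequent)."""
    with_antecedent = count_containing(antecedent, transactions)
    if with_antecedent == 0:
        return 0.0  # the rule never applies in this data
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / with_antecedent


if __name__ == "__main__":
    print("support(bread, milk):", support({"bread", "milk"}, transactions))
    print("confidence(bread -> milk):",
          confidence({"bread"}, {"milk"}, transactions))
```

Real association rule miners search for all rules above chosen support and confidence thresholds; this sketch only scores rules that are supplied by hand.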

Privacy Defined

The term “privacy” is used frequently in ordinary language, yet it has no single definition (Kemp & Moore, 2007). The concept of privacy has broad historical roots in sociological and anthropological discussions about how it is preserved in various cultures. Some argue that the values of privacy are distinctly Western, culturally relative, and not universally accepted (Brey, 2009). Globalization and the emergence of the Internet have created an international community, which requires a moral system. German theologian Hans Küng stresses the need for global ethics, a shared moral framework agreed upon by all cultures, especially because actions in cyberspace are not local (Küng, 2001, as cited in Brey, 2009). How global ethics will mold a universal definition of privacy has yet to be seen.

The historical use of the word “privacy” is not uniform, and confusion over its meaning, value, and scope remains. “Outside of narrow academic circles, constructing an exact definition of privacy has proven less important than addressing concrete claims for privacy protection. Seclusion, solitude, anonymity, confidentiality, modesty, intimacy, secrecy, autonomy, and reserve—securing these social goods is what privacy is all about” (Allen, 2006, 1).

Different conceptions of privacy typically fall into six categories: the right to be let alone, limited access to the self, secrecy, control of personal information, personhood, and intimacy (Solove, 2002). In this post, privacy refers to the right of users to conceal their personal information and have some degree of control over the use of any personal information disclosed to others. Data mining techniques should be effective without dispensing with the need to preserve privacy, one of the basic values of modern free societies.

Data Mining Problems

When used responsibly, data mining can be beneficial to society. The explosion of data generated by new technologies, decreasing costs of computer storage, and increasing capabilities of search tools have made data mining an important instrument of government anti-terrorism efforts since September 11, 2001.

Additionally, data mining is important for business because it reveals information about past performance that can predict future functioning. It exposes emerging trends from which the company might profit and allows for statistical predictions, groupings, and classifications of data. Data mining tools allow businesses to make proactive, knowledge-driven decisions and answer business questions that were previously too time-consuming to resolve. While some values of data mining are evident, the ethical repercussions of its usage have been slower to emerge.

For example, the government accesses an extraordinary volume of personal data to search for terrorist activities. Privacy statutes fail to limit the government’s access to personal data, and amendments made in the post-9/11 world have weakened those limits further. The Fourth Amendment, the constitutional guarantee of individual privacy, has been interpreted by the Supreme Court to not apply to routine data collection, accessing data from third parties, or sharing data, even if illegally gathered. Data mining produces false positives, in which innocent people are investigated, and these can have serious consequences (Quinn, 2009). Privacy advocates believe that data mining exposes ordinary people to ever more scrutiny by authorities while skirting legal protections designed to limit the government’s collection and use of personal data.

Data mining that allows companies to identify their best customers could easily be used by businesses to categorize vulnerable customers such as the elderly, poor, or sick. Unscrupulous businesses could use the information to offer people inferior deals or to discriminate against certain populations. One person’s file may be confused with another’s, causing an individual with good credit to be rejected for a loan. Although this problem may be corrected in time, the mistake will have negatively affected the individual’s life. Some companies compile data, use it for their own purposes, and then sell it for profit. Other organizations do not have enough security measures in place to protect against unauthorized access. Since information about customers has become a commodity, businesses have increased incentives to acquire information from customers, making it more difficult to protect their privacy (Rosenberg, 2000, as cited in Quinn, 2009).

Another area where data mining presents ethical problems is the mining of employees’ health-related data. In the past, employers have used data mining to determine the frequency of sicknesses and possible illnesses that may result. This information is useful when purchasing health insurance for an organization, but there is the potential that the findings may be used when making hiring decisions. For example, an employer may decide not to hire a candidate who is likely to have certain expensive health problems. Similarly, insurance companies could refuse to sell policies to those they identify as being at high risk for disease.

Data mining tools also make inference easier (Clifton & Marks, 1996). “Inference is the process of users posing queries and deducing unauthorized information from the legitimate response that they receive” (Thuraisingham, 2005). Thuraisingham (2005) points out that inference problems, which mainly deal with confidentiality, have parallels to problems with privacy. While data mining is an important tool for many applications, the information extracted should be used ethically.
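One textbook way to see the inference problem is a "differencing" query: two aggregate queries that are each allowed can be subtracted to expose one person's value. The sketch below is an illustration under invented assumptions (the records, the `total_salary` function, and the minimum group size rule are all hypothetical); it is not drawn from Thuraisingham (2005) or Clifton and Marks (1996).

```python
# Hypothetical employee records, invented for illustration only.
records = [
    {"name": "Ana", "dept": "sales", "age": 34, "salary": 52000},
    {"name": "Ben", "dept": "sales", "age": 45, "salary": 61000},
    {"name": "Caro", "dept": "sales", "age": 29, "salary": 48000},
    {"name": "Dev", "dept": "sales", "age": 63, "salary": 75000},
    {"name": "Eli", "dept": "engineering", "age": 38, "salary": 90000},
]


def total_salary(records, predicate, min_group_size=2):
    """Aggregate query that refuses to answer about very small groups,
    a simple (assumed) safeguard against singling out one person."""
    group = [r for r in records if predicate(r)]
    if len(group) < min_group_size:
        raise PermissionError("query rejected: group is too small")
    return sum(r["salary"] for r in group)


# Both queries cover groups of three or more, so each one is permitted.
all_sales = total_salary(records, lambda r: r["dept"] == "sales")
sales_under_60 = total_salary(
    records, lambda r: r["dept"] == "sales" and r["age"] < 60
)

# Only one sales employee is 60 or older, so the difference between the
# two legitimate answers is exactly that person's salary.
inferred_salary = all_sales - sales_under_60

if __name__ == "__main__":
    print("Inferred salary of the one sales employee aged 60+:",
          inferred_salary)
```

The point of the sketch is that screening each query in isolation is not enough; the sensitive fact emerges only from combining legitimate responses.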

Data Mining Problems and the Internet

The Internet has enabled privacy threats to occur on a broader scale. Cavoukian (1998) notes that one of the purposes of data mining is to map Internet patterns. When considering privacy threats related to data mining, it is important to explore how the Internet facilitates data mining and exacerbates privacy issues.

According to Slane (1998), the four privacy issues in data mining with the Internet are security, accuracy, transparency, and fairness. Before the Internet, access to databases was reasonably limited to a few authorized people. However, the Internet makes it easier for more people to access databases. Without strong access control, private information can be disclosed, manipulated, and misused. Accuracy is also a problem because, with the growth of the Internet, data mining involves large amounts of data from a variety of sources. The more databases involved, the greater the risk that the data is inaccurate and the more difficult it is to clean the data, which may lead to errors and misinterpretation. Transparency becomes difficult because people cannot correct data about themselves, and they cannot express concern with the use of their information. In data mining, no one can predict what kinds of relationships or patterns of data will emerge. It is questionable whether data subjects are being treated fairly when they are unaware that personal data about them is being mined.

Ethics and Data Mining

Ethical inquiry provides a basis for choosing proper actions based on rational principles and sound arguments. With this in mind, the ethics of data mining will be examined using utilitarianism and Kantianism.

“Act utilitarianism is the ethical theory that an action is good if its net effect (over all affected beings) is to produce more happiness than unhappiness” (Quinn, 2009, 75). In other words, an action is right from an ethical point of view if the sum total of utilities produced by that act is greater than the sum total of utilities produced by any other act the agent could have performed in its place. Utilitarianism holds that in the final analysis only one action is right: that one action whose net benefits are greatest by comparison to the net benefits of other alternatives. Both the immediate and foreseeable future costs and benefits that the alternative will provide for individuals must be taken into account as well as any significant indirect effects. The alternative that produces the greatest sum total of utility must be chosen as the ethically appropriate course of action. Utilitarianism also has the advantage of being able to explain why we hold that certain types of activities are generally morally wrong, such as lying, while others are generally morally right. Actions are never always right or always wrong; it depends on the circumstances.

From a utilitarian viewpoint, data mining is ethical because it enables corporations to minimize risk and increase profits, helps the government strengthen security, and benefits society with technological advancements. The invasion of personal privacy and the risk of having people misuse the data would be considered a small downside. Based on this theory, since the majority benefits from data mining, data mining is ethical.

Kantianism is the belief that “people’s actions ought to be guided by moral laws, and that these moral laws were universal” (Quinn, 2009, 69). The first formulation of the categorical imperative is “act only from moral rules that you can at the same time will to be universal moral laws” (Quinn, 2009, 70). The second formulation of the categorical imperative is “act so that you always treat both yourself and other people as ends in themselves and never only as means to an end” (Quinn, 2009, 71). In other words, people should not use others to achieve their goals. Kant set forth that “it is morally obligatory to respect every person as a rational agent” (Davis, 1993, 211). Kantianism requires individuals to consider the impact of their actions on other persons and to modify their actions to reflect the respect and concern they have for others.

Under Kantianism, data mining is unethical because those who mine the data advance their own interests without regard for people’s privacy, using the data subjects as a means to an end.

Privacy Preserving Data Mining

Since the ethical nature of data mining is questionable, privacy preserving data mining may be a suitable solution that benefits both data mining users and individuals concerned with their right to privacy. Privacy preserving data mining refers to data mining techniques that protect sensitive data while allowing useful information to be extracted from the data set. Many schemes have been proposed for privacy preserving data mining, but there is no paradigm for research. “Although there is an extensive pool of literature that addresses many aspects of both privacy and data mining, it is often unclear as to how this literature relates and integrates to define an integrated privacy preserving data mining research discipline” (Fu, Nemati, & Sadri, 2007, 48). Rakesh Agrawal at IBM Almaden, Johannes Gehrke at Cornell University, and Christopher Clifton at Purdue University are pioneers in developing privacy preserving data mining by modifying algorithms so that they maintain some level of privacy (Thuraisingham, 2005). The field is young, but promising.

Vaidya and Clifton (2004) argue that data mining does not inherently threaten privacy and discuss two strategies by which data can reveal patterns without revealing private information: randomization and secure multiparty computation (SMC). Randomization arbitrarily samples from datasets that share certain characteristics with the original data. SMC allows parties to cooperate in data mining without revealing data to parties that do not already know it.
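A common introductory example of SMC is a "secure sum," in which several parties learn the total of their private numbers without any party seeing another's number. The sketch below simulates this with additive secret sharing in a single process. It is an illustration under stated assumptions, not the protocol described by Vaidya and Clifton: the function names and the modulus are hypothetical, values are assumed to be non-negative integers whose total is below the modulus, and the parties are assumed to follow the protocol honestly.

```python
import secrets

# Assumed modulus; it only needs to exceed the largest possible true total.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split value into n_parties random-looking shares that add up to
    value modulo modulus. Any subset of fewer than n_parties shares
    reveals nothing about value."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last_share = (value - sum(shares)) % modulus
    return shares + [last_share]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate the protocol: party i sends its j-th share to party j,
    each party publishes only the sum of the shares it received, and the
    published partial sums add up to the true total."""
    n = len(private_values)
    # shares_sent[i][j] is the share that party i sends to party j.
    shares_sent = [make_shares(v, n, modulus) for v in private_values]
    partial_sums = [
        sum(shares_sent[i][j] for i in range(n)) % modulus
        for j in range(n)
    ]
    return sum(partial_sums) % modulus


if __name__ == "__main__":
    # Hypothetical counts held privately by three organizations.
    print(secure_sum([120, 45, 300]))
```

In a real deployment the shares would travel over a network between separate organizations, and a production protocol would also need to handle parties that deviate from the rules or collude.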

Additionally, algorithms can be used to extract data patterns without directly accessing the original data (Kargupta, Liu, Datta, Ryan, & Sivakumar, 2003). Other approaches are based on perturbation, which adds random noise from a known distribution to the privacy sensitive data, and the data miner uses the reconstructed distribution for data mining purposes (Liu, Kantarcioglu, & Thuraisingham, 2008). Sensitive data can be masked in a number of statistical ways, yet still provide workable information to data mining users.
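The perturbation idea can be sketched in a few lines: each sensitive value is released only after random noise from a known distribution is added, and the data miner then corrects for that noise in aggregate. The example below recovers only the mean and variance of the original data, which is a much simpler step than the full distribution reconstruction discussed in the literature. The function names, the Gaussian noise, and the synthetic ages are assumptions made for illustration.

```python
import random
import statistics


def perturb(values, noise_sd, rng):
    """Release each value with zero-mean Gaussian noise added, so no
    individual released value can be trusted as the true one."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]


def estimate_original_moments(perturbed, noise_sd):
    """Estimate the mean and variance of the original values using only
    the perturbed data and the known noise standard deviation.

    The noise has mean zero, so the mean carries over directly. The noise
    is independent of the data, so its variance adds to the data's
    variance and can be subtracted back out."""
    mean_estimate = statistics.fmean(perturbed)
    variance_estimate = max(
        statistics.pvariance(perturbed) - noise_sd ** 2, 0.0
    )
    return mean_estimate, variance_estimate


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Synthetic "sensitive" ages, invented for illustration only.
    ages = [rng.randint(20, 70) for _ in range(10000)]
    released = perturb(ages, noise_sd=10.0, rng=rng)
    print("true mean and variance:",
          statistics.fmean(ages), statistics.pvariance(ages))
    print("estimated from perturbed data:",
          estimate_original_moments(released, noise_sd=10.0))
```

The trade-off is between privacy and accuracy: more noise hides individuals better but makes the aggregate estimates less precise for a given number of records.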

Social Implications of Data Mining

Society is redefining privacy to conform to concepts compatible with the information era. Individuals, governments, and corporations are trying to find common ground, balancing the individual’s right to privacy against the need of government and industry to disseminate the information necessary to best serve public interests.

The increasing use of data mining tools in both the public and private sectors raises concerns regarding the potentially sensitive nature of the data being mined. The utility gained from data mining comes into conflict with an individual’s right to privacy. Privacy preserving data mining solutions achieve a paradox: enabling data mining algorithms to use data without accessing it. Thus, the benefits of data mining may be enjoyed, without compromising privacy.

Works Cited

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. SIGMOD Record, 22(2), 207.

Allen, A. L. (2006). Privacy, definition of. In W. G. Staples (Ed.), Encyclopedia of Privacy A-M (pp. 393-403). Westport, CT: Greenwood Press.

Berkhin, P. (2001). Survey of clustering data mining techniques. Retrieved February 17, 2019, from http://www.accrue.com/products/rp_cluster_review.pdf

Berry, M., & Linoff, G. (1997). Data mining techniques for marketing, sales, and customer support. New York: Wiley.

Brey, P. (2009). Is information ethics culturally relative? In E. Eyob (Ed.), Social implications of data mining and information privacy: Interdisciplinary frameworks and solutions (pp. 1-14). Hershey, PA: IGI Global.

Cavoukian, A. (1998). Data mining: Staking a claim on your privacy. Information and Privacy Commissioner. Retrieved March 1, 2019, from http://www.ipc.on.ca/images/Resources/datamine.pdf

Clifton, C., & Marks, D. (1996). Security and privacy implications of data mining. Proceedings of the ACM SIGMOD Conference Workshop on Research Issues in Data Mining and Knowledge Discovery.

Davis, N. (1993). Contemporary deontology. In P. Singer (Ed.), A companion to ethics (pp. 205-218). Oxford: Blackwell.

Fu, L., Nemati, H., & Sadri, F. (2007). Privacy-preserving data mining and the need for confluence of research and practice. International Journal of Information Security and Privacy, 1(1), 47-64.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Kargupta, H., Liu, K., Datta, S., Ryan, J., & Sivakumar, K. (2003). Homeland security and privacy sensitive data mining from multi-party distributed resources. The IEEE International Conference on Fuzzy Systems: Vol. 2. (pp. 1257-1260).

Kemp, R., & Moore, A. D. (2007). Privacy. Library Hi Tech, 25(1), 58-78.

Küng, H. (2001). A global ethic for global politics and economies. Hong Kong: Logos and Pneuma Press.

Lee, S. J., & Siau, K. (2001). A review of data mining techniques. Industrial Management & Data Systems, 100(1), 41-46.

Liu, L., Kantarcioglu, M., & Thuraisingham, B. (2008). The applicability of the perturbation based privacy preserving data mining for real-world data. Data & Knowledge Engineering, 65(1), 5-21.

Mena, J. (2004). Homeland security techniques and technologies. Hingham, MA: Charles River Media.

Quinn, M. J. (2009). Ethics for the information age. (3rd ed.). New York: Pearson Education.

Rosenberg, A. (2000). Privacy as a matter of taste and right. In E. F. Paul, F. D. Miller, Jr., & J. Paul (Eds.), The right to privacy (pp. 68-90). Cambridge: Cambridge University Press.

Slane, B. H. (1998). Data mining and fair information practices: Good business sense. Retrieved March 1, 2019, from http://www.privacy.org.nz/data-mining-and-fair-information-practices-good-business-sense/

Solove, D. J. (2002). Conceptualizing privacy. California Law Review, 90, 1087-1156.

Thearling, K. (n.d.). An introduction to data mining: Discovering hidden value in your data warehouse. Retrieved February 17, 2019, from http://www.thearling.com/text/dmwhite/dmwhite.htm

Thuraisingham, B. (2005). Privacy-preserving data mining: Developments and directions. Journal of Database Management, 16(1), 75-87.

Vaidya, J., & Clifton, C. (2004). Privacy-preserving data mining: Why, how and when. IEEE Security and Privacy, 19-27.

Weiss, S. M., & Indurkhya, N. (1998). Predictive data mining: A practical guide. San Francisco: Morgan Kaufmann.
