
Privacy Preserving Data Mining - Science topic

Questions related to Privacy Preserving Data Mining
  • asked a question related to Privacy Preserving Data Mining
Question
3 answers
I was exploring differential privacy (DP), which is an excellent technique for preserving data privacy. However, I am wondering what performance metrics could be used to compare schemes with DP against schemes without DP.
Are there any performance metrics by which a scheme with DP can be compared against a scheme without DP?
Thanks in advance.
Relevant answer
Answer
Dear Anik Islam Abhi,
You may want to review the data below:
What is differential data privacy?
Differential privacy (DP) is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
Why is differential privacy so important?
Preventing attackers from accessing perfect data: this deniability aspect of differential privacy is important in cases like linkage attacks, where attackers leverage multiple sources to identify the personal information of a target.
What is privacy budget in differential privacy?
Also known as the privacy parameter or the privacy budget. When ε is small, (ε,0)-differential privacy asserts that for all pairs of adjacent databases x, y and all outputs M, an adversary cannot distinguish which is the true database on the basis of observing the output.
What is differential privacy in machine learning?
Differential privacy is a notion that allows quantifying the degree of privacy protection provided by an algorithm on the underlying (sensitive) data set it operates on. Through the lens of differential privacy, we can design machine learning algorithms that responsibly train models on private data.
How much is enough choosing Epsilon for differential privacy?
... The recommended values for ε vary over a wide interval, from as small as 0.01 and 0.1 to as large as 7.
Who uses differential privacy?
Apple launched differential privacy for the first time in macOS Sierra and iOS 10, and has since expanded it to other use cases such as Safari and Health data types.
Differential Privacy: General Survey and Analysis of Practicability in the Context of Machine Learning
Franziska Boenisch
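To make the comparison the question asks about concrete, here is a minimal sketch (not taken from the survey above; all names and numbers are illustrative) of the Laplace mechanism applied to a count query. One simple performance metric for "scheme with DP vs. scheme without DP" is utility loss: the mean absolute error of the noisy answer relative to the exact, non-private answer, measured at different privacy budgets ε.

```python
import math
import random

# Hedged sketch of the Laplace mechanism for a differentially private count.
# All data and function names here are illustrative, not from any library.

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(data, predicate, epsilon, rng):
    """Noisy count query; a count has sensitivity 1, so scale = 1/epsilon."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
ages = [23, 35, 45, 52, 29, 61, 38, 47, 33, 55]
true_answer = sum(1 for a in ages if a > 40)  # exact, non-private answer

def mean_abs_error(epsilon, trials=2000):
    """Utility metric: mean absolute error over repeated noisy releases."""
    return sum(abs(dp_count(ages, lambda a: a > 40, epsilon, rng) - true_answer)
               for _ in range(trials)) / trials

print(mean_abs_error(0.1))  # small epsilon: strong privacy, large average error
print(mean_abs_error(1.0))  # larger epsilon: weaker privacy, smaller error
```

Other metrics used in the literature include the accuracy gap of a model trained with and without DP noise, and precision/recall of released query answers; mean absolute (or squared) error against the non-private answer is the simplest starting point.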
  • asked a question related to Privacy Preserving Data Mining
Question
5 answers
Currently, I am exploring federated learning (FL). FL seems likely to become a major trend soon because of its promising functionality. Please share your valuable opinions on the following concerns.
  • What are the current trends in FL?
  • What are the open challenges in FL?
  • What are the open security challenges in FL?
  • Which emerging technology can be a suitable candidate to merge with FL?
Thanks for your time.
Relevant answer
Answer
I agree; there is already a publication under the name "swarm learning". The authors applied blockchain technology for security.
  • asked a question related to Privacy Preserving Data Mining
Question
6 answers
I was exploring federated learning algorithms and reading this paper (https://arxiv.org/pdf/1602.05629.pdf). In it, the authors average the weights received from clients, as shown in the attached file. In the marked part, they use the total number of client samples and each individual client's sample count. As far as I understand, federated learning was introduced to keep data on the client side in order to maintain privacy. How, then, does the server know this information? I am confused about this concept.
Any clarification?
Thanks in advance.
Relevant answer
Answer
Thanks for your input. I have their code, and they follow the same approach. I have attached their code below.
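The server-side averaging step discussed in the question (the FedAvg aggregation from the McMahan et al. paper) can be sketched as below. This is an illustrative sketch with made-up client data, not the authors' code. The key point answering the question: each client uploads only its model weights together with its local sample count n_k, never the raw training data, so the server can form the weighted average without ever seeing the data itself.

```python
# Hedged sketch of server-side FedAvg aggregation. Clients report
# (n_k, weight_vector) pairs; the raw local datasets stay on the clients.

def fedavg(client_updates):
    """client_updates: list of (n_k, weight_vector) pairs."""
    total = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    # Weighted mean: each client's contribution is scaled by n_k / total.
    return [sum(n / total * w[i] for n, w in client_updates) for i in range(dim)]

updates = [
    (10, [1.0, 2.0]),  # client with 10 local samples
    (30, [3.0, 4.0]),  # client with 30 local samples
    (60, [5.0, 6.0]),  # client with 60 local samples
]
print(fedavg(updates))  # weighted mean, dominated by the largest client
```

Sharing n_k does leak the size of each client's dataset, which is generally considered far less sensitive than the data itself; secure aggregation protocols can hide even that if needed.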
  • asked a question related to Privacy Preserving Data Mining
Question
26 answers
With Covid-19, we have seen digital transformation happen much faster: everyone able to do so was suddenly working from home, using the internet and the available digital access and platforms. Education was also put on hold unless it moved online: online schooling and online education got a tremendous stimulus.
Now, what is fueling the data economy, what is the "new green oil" of this digitally transformed world? It's DATA, and it's data fairly priced for the stakeholders, starting with identified data owners.
Please see here a link to some books on the subject, reviewing the rationale for data use in every aspect of life, business, markets, and society, looking at the creation of data marketplaces, and diving into the detailed equations of how to price data and how to make it happen in economic terms, as data microeconomics:
Relevant answer
Answer
Exactly. However, more planning is needed from now on. Thank you so much for your enlightening and innovative thoughts.
  • asked a question related to Privacy Preserving Data Mining
Question
5 answers
My research group sequenced the genome of a plant with potential commercial interest. We want to publish an article describing the genome sequence, its completeness, assembly statistics, and genes related to some pathways we looked for. But we do not want to deposit it in a public databank like GenBank; we want to maintain a private one. Most journals do not accept this and require that the sequence be available in a public databank, or in a private one with public access. We do not have a patent.
Can someone help me figure out how to keep my sequence private without a patent?
Relevant answer
Answer
The journals are correct. Publish your data publicly so that it actually has value.
  • asked a question related to Privacy Preserving Data Mining
Question
5 answers
What is the difference between p-sensitive k-anonymity and l-diversity?
Relevant answer
Answer
There are other variations, such as t-closeness.
l-diversity guarantees that each group contains at least l different sensitive attribute values (or combinations), in addition to the basic k-anonymity guarantee.
p-sensitivity relates to location queries, but could probably be generalised to a larger set of problems.
What do you want to do?
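The distinction is easy to see in code. Below is a minimal illustrative sketch (toy table with made-up, generalized quasi-identifier values) of checking both properties: the table is 2-anonymous, but one equivalence class carries only a single sensitive value, so it is not 2-diverse.

```python
from collections import defaultdict

# Hedged sketch: k-anonymity vs. l-diversity on a toy table.
# Rows are (quasi_identifiers, sensitive_value); all values are invented.

def group_by_qi(rows):
    """Group rows into equivalence classes by their quasi-identifiers."""
    groups = defaultdict(list)
    for qi, sensitive in rows:
        groups[qi].append(sensitive)
    return groups

def is_k_anonymous(rows, k):
    """Every equivalence class has at least k rows."""
    return all(len(vals) >= k for vals in group_by_qi(rows).values())

def is_l_diverse(rows, l):
    """Every equivalence class contains at least l distinct sensitive values."""
    return all(len(set(vals)) >= l for vals in group_by_qi(rows).values())

table = [
    (("3****", "<30"), "flu"),
    (("3****", "<30"), "cancer"),
    (("4****", "30+"), "flu"),
    (("4****", "30+"), "flu"),  # 2 rows, but only ONE sensitive value
]
print(is_k_anonymous(table, 2))  # True: every class has 2 rows
print(is_l_diverse(table, 2))    # False: second class has a single diagnosis
```

This is exactly the homogeneity weakness of plain k-anonymity that l-diversity (and, further, t-closeness on the value distribution) was designed to close.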
  • asked a question related to Privacy Preserving Data Mining
Question
8 answers
Hi,
Could anyone help me find a comprehensive study (survey) of privacy-preserving data-analytics models in cloud computing?
Thank you
Relevant answer
Answer
As a follow-up to Bob's answer, there are some useful papers on the A4Cloud website 
  • asked a question related to Privacy Preserving Data Mining
Question
2 answers
de Finetti's theorem is considered the "final nail in the coffin" for its impact on syntactic privacy-preserving data-mining techniques, for example k-anonymity, l-diversity, and t-closeness. Can anyone explain why?
  • asked a question related to Privacy Preserving Data Mining
Question
5 answers
Differential privacy is usually defined in an interactive setting where the user asks queries and the system returns noisy answers. Has any study or research been done on whether such systems are prone to inferential attacks?
Relevant answer
Answer
This paper reviews certain databases and their proneness to inferential attacks, gives examples of such query systems being tampered with, and discusses improving comparative effectiveness research.
You should also check the references in this paper about anonymized data:
Privacy Technology to Support Data Sharing for Comparative Effectiveness Research: A Systematic Review - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3728160/#R32
  • asked a question related to Privacy Preserving Data Mining
Question
3 answers
Our current problem is to classify a data stream with privacy preservation using k-anonymity/perturbation. I am searching for a large stream dataset for this purpose. The dataset should contain partially labelled data.
  • asked a question related to Privacy Preserving Data Mining
Question
3 answers
We are studying functional encryption for application in privacy-preserving data mining (PPDM). Are there any developments or deployments in this area? Are there any formal or practical resources for understanding the performance measures and complexity calculations of the various functional-encryption schemes, viz. key-policy attribute-based encryption, ciphertext-policy attribute-based encryption, and inner-product encryption, for PPDM? We would also like to know about simulation tools suitable for this task.
Relevant answer
Answer
To me, functional encryption is not the correct tool for this. Functional encryption lets you decrypt a ciphertext if your secret key satisfies a function specified when the ciphertext was encrypted, but it does not allow you to compute a function on the ciphertext. As Kato and Dadid suggested, homomorphic encryption, anonymization, or multiparty computation protocols are more appropriate.
  • asked a question related to Privacy Preserving Data Mining
Question
2 answers
We are working on privacy-preserving issues in temporal multilevel association mining and want to know which algorithm is currently the most effective in practice, in real deployments, and in research for this purpose.
Relevant answer
Answer
Mr. Robert,
Thanks for giving this answer, your suggestion is really useful to me.
  • asked a question related to Privacy Preserving Data Mining
Question
3 answers
What are the major parameters to test the efficiency of one such algorithm? Is there any appropriate tool?
Relevant answer
Answer
You have to calculate entropy on both the original and the anonymized dataset. The following papers by Fung et al. can help you:
1. Privacy-Preserving Data Publishing: A Survey of Recent Developments
2. Anonymity for Continuous Data Publishing
To compare against an existing algorithm: if you are using the same standard dataset, the original authors' published results can be used; otherwise you have to implement the existing algorithm as well.
If you implement an existing algorithm, your results should closely match the original results.
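As a concrete illustration of the entropy comparison suggested above (toy ZIP-code column with made-up values), Shannon entropy drops as generalization removes distinctions between records; the before/after difference is one simple information-loss measure.

```python
import math
from collections import Counter

# Hedged sketch: Shannon entropy of an attribute column, as one possible
# information-loss metric for anonymization. Values below are invented.

def entropy(values):
    """Shannon entropy (in bits) of the empirical value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

original   = ["13053", "13068", "13053", "13067", "13068", "13067"]
anonymized = ["130**", "130**", "130**", "130**", "130**", "130**"]

print(entropy(original))    # higher entropy: more information retained
print(entropy(anonymized))  # 0.0: full generalization removed all distinctions
```

Beyond raw entropy, the cited Fung et al. survey discusses further utility measures (e.g. discernibility and classification accuracy on the anonymized data) that are worth reporting alongside it.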
  • asked a question related to Privacy Preserving Data Mining
Question
4 answers
I am looking for state-of-the-art algorithms so I can use them to demonstrate that we have privacy-preserving issues, and then make adjustments to the algorithm to make it privacy-proof.
Relevant answer
Answer
Hi Ahmad,
Here is the state of the art in data mining, which does NOT preserve privacy at all, at least not in the European Union.
Most people were introduced to the arcane world of data mining when National Security Agency contractor Edward Snowden allegedly leaked classified documents that detail how the U.S. government uses the technique to track terrorists. The security breach revealed that the government gathers billions of pieces of data—phone calls, emails, photos, and videos—from Google, Facebook, Microsoft, and other communications giants, then combs through the information for leads on national security threats. The disclosure caused a global uproar over the sanctity of privacy, the need for security, and the perils of government secrecy. People rightfully have been concerned about where the government gets the data—from all of us—but equal attention has not been paid to what it actually does with it. Here's a guide to big-data mining, NSA-style.
The Information Landscape
Just how much data do we produce? A recent study by IBM estimates that humanity creates 2.5 quintillion bytes of data every day. (If these data bytes were pennies laid out flat, they would blanket the earth five times.) That total includes stored information—photos, videos, social-media posts, word-processing files, phone-call records, financial records, and results from science experiments—and data that normally exists for mere moments, such as phone-call content and Skype chats.
Veins of Useful Information
The concept behind the NSA's data-mining operation is that this digital information can be analyzed to establish connections between people, and these links can generate investigative leads. But in order to examine data, it has to be collected—from everyone. As the data-mining saying goes: To find a needle in a haystack, you first need to build a haystack.
Data Has to Be Tagged Before It's Bagged
Data mining relies on metadata tags that enable algorithms to identify connections. Metadata is data about data—for example, the names and sizes of files on your computer. In the digital world, the label placed on data is called a tag. Tagging data is a necessary first step to data mining because it enables analysts (or the software they use) to classify and organize the information so it can be searched and processed. Tagging also enables analysts to parse the information without examining the contents. This is an important legal point in NSA data mining because the communications of U.S. citizens and lawful permanent resident aliens cannot be examined without a warrant. Metadata on a tag has no such protection, so analysts can use it to identify suspicious behavior without fear of breaking the law.
Finding Patterns in the Noise
The data-analysis firm IDC estimates that only 3 percent of the information in the digital universe is tagged when it's created, so the NSA has a sophisticated software program that puts billions of metadata markers on the info it collects. These tags are the backbone of any system that makes links among different kinds of data—such as video, documents, and phone records. For example, data mining could call attention to a suspect on a watch list who downloads terrorist propaganda, visits bomb-making websites, and buys a pressure cooker. (This pattern matches the behavior of the Tsarnaev brothers, who are accused of planting bombs at the Boston Marathon.) This tactic assumes terrorists have well-defined data profiles—something many security experts doubt.
Open Source and Top Secret
The NSA has been a big promoter of software that can manage vast databases. One of these programs is called Accumulo, and while there is no direct evidence that it is being used in the effort to monitor global communications, it was designed precisely for tagging billions of pieces of unorganized, disparate data. The secretive agency's custom tool, which is based on Google programming, is actually open-source. This year a company called Sqrrl commercialized it and hopes the healthcare and finance industries will use it to manage their own big-data sets.
The Miners: Who Does What
The NSA, home to the federal government's codemakers and code-breakers, is authorized to snoop on foreign communications. The agency also collects a vast amount of data—trillions of pieces of communication generated by people across the globe. The NSA does not chase the crooks, terrorists, and spies it identifies; it sifts information on behalf of other government players such as the Pentagon, CIA, and FBI. Here are the basic steps: To start, one of 11 judges on a secret Foreign Intelligence Surveillance (FISA) Court accepts an application from a government agency to authorize a search of data collected by the NSA. Once authorized—and most applications are—data-mining requests first go to the FBI's Electronic Communications Surveillance Unit (ECSU), according to PowerPoint slides taken by Snowden. This is a legal safeguard—FBI agents review the request to ensure no U.S. citizens are targets. The ECSU passes appropriate requests to the FBI Data Intercept Technology Unit, which obtains the information from Internet company servers and then passes it to the NSA to be examined with data-mining programs. (Many communications companies have denied they open their servers to the NSA; federal officials claim they cooperate. As of press time, it's not clear who is correct.) The NSA then passes relevant information to the government agency that requested it.
What the NSA Is Up To
Phone-Metadata Mining Dragged Into the Light
The NSA controversy began when Snowden revealed that the U.S. government was collecting the phone-metadata records of every Verizon customer—including millions of Americans. At the request of the FBI, FISA Court judge Roger Vinson issued an order compelling the company to hand over its phone records. The content of the calls was not collected, but national security officials call it "an early warning system" for detecting terror plots (see "Connecting the Dots: Phone-Metadata Tracking").
PRISM Goes Public
On the heels of the metadata-mining leak, Snowden exposed another NSA surveillance effort, called US-984XN. Every collection platform or source of raw intelligence is given a name, called a Signals Intelligence Activity Designator (SIGAD), and a code name. SIGAD US-984XN is better known by its code name: PRISM. PRISM involves the collection of digital photos, stored data, file transfers, emails, chats, videos, and video conferencing from nine Internet companies. U.S. officials say this tactic helped snare Khalid Ouazzani, a naturalized U.S. citizen who the FBI claimed was plotting to blow up the New York Stock Exchange. Ouazzani was in contact with a known extremist in Yemen, which brought him to the attention of the NSA. It identified Ouazzani as a possible conspirator and gave the information to the FBI, which "went up on the electronic surveillance and identified his coconspirators," according to congressional testimony by FBI deputy director Sean Joyce. (Details of how the agency identified the others have not been disclosed.) The NYSE plot fizzled long before the FBI intervened, but Ouazzani and two others pleaded guilty to laundering money to support al-Qaida. They were never charged with anything related to the bomb plot.
Mining Data as It's Created
The slides disclosed by Snowden indicate the NSA also operates real-time surveillance tools. NSA analysts can receive "real-time notification of an email event such as a login or sent message" and "real-time notification of a chat login," the slides say. That's pretty straightforward use, but whether real-time information can stop unprecedented attacks is subject to debate. Alerting a credit-card holder of sketchy purchases in real time is easy; building a reliable model of an impending attack in real time is infinitely harder.
What is XKeyscore?
In late July Snowden released a 32-page, top-secret PowerPoint presentation that describes software that can search hundreds of databases for leads. Snowden claims this program enables low-level analysts to access communications without oversight, circumventing the checks and balances of the FISA court. The NSA and White House vehemently deny this, and the documents don't indicate any misuse. The slides do describe a powerful tool that NSA analysts can use to find hidden links inside troves of information. "My target speaks German but is in Pakistan—how can I find him?" one slide reads. Another asks: "My target uses Google Maps to scope target locations—can I use this information to determine his email address?" This program enables analysts to submit one query to search 700 servers around the world at once, combing disparate sources to find the answers to these questions.
How Far Can the Data Stretch?
Oops—False Positives
Bomb-sniffing dogs sometimes bark at explosives that are not there. This kind of mistake is called a false positive. In data mining, the equivalent is a computer program sniffing around a data set and coming up with the wrong conclusion. This is when having a massive data set may be a liability. When a program examines trillions of connections between potential targets, even a very small false-positive rate equals tens of thousands of dead-end leads that agents must chase down—not to mention the unneeded incursions into innocent people's lives.
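A back-of-the-envelope sketch makes the base-rate problem concrete. All numbers below are assumptions for illustration, not figures from any agency.

```python
# Hedged sketch of the false-positive base-rate problem with assumed numbers.
connections = 1_000_000_000_000  # one trillion scanned connections (assumed)
fp_per = 10_000_000              # one false positive per ten million checks (assumed)
true_threats = 100               # actual threats hidden in the data (assumed)

false_alarms = connections // fp_per          # dead-end leads to chase down
precision = true_threats / (true_threats + false_alarms)

print(false_alarms)  # 100000 false leads from a seemingly tiny error rate
print(precision)     # roughly 0.001: about 1 in 1000 flagged leads is real
```

Even with a one-in-ten-million error rate, the sheer scale of the haystack turns a handful of real threats into a flood of false alarms, which is exactly the liability the paragraph above describes.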
Analytics to See the Future
Ever wonder where those Netflix recommendations in your email inbox or suggested reading lists on Amazon come from? Your previous interests directed an algorithm to pitch those products to you. Big companies believe more of this kind of targeted marketing will boost sales and reduce costs. For example, this year Walmart bought a predictive analytics startup called Inkiru. The company makes software that crunches data to help retailers develop marketing campaigns that target shoppers when they are most likely to buy certain products.
Pattern Recognition or Prophecy?
In 2011 British researchers created a game that simulated a van-bomb plot, and 60 percent of the "terrorist" players were spotted by a program called DScent, based on their "purchases" and "visits" to the target site. The ability of a computer to automatically match security-camera footage with records of purchases may seem like a dream to law-enforcement agents trying to save lives, but it's the kind of ubiquitous tracking that alarms civil libertarians. Although neither the NSA nor any other agency has been accused of misusing the data it collects, the public's fear over its collection remains. The question becomes, how much do you trust the people sitting at the keyboards to use this information responsibly? Your answer largely determines how you feel about NSA data mining.