Feeds:
Posts
Comments

By Kato Mivule

The latest state-of-the-art in data privacy, proposed by Cynthia Dwork (2006).

Differential privacy enforces confidentiality by:

  • Returning perturbed aggregated query results from databases.

  • Such that users cannot discern if any data item has been altered or not.

  • An attacker cannot derive info about any data item in the database.

According to Dwork (2006):

  • Two databases D1 and D2 are considered similar, if they differ in only one element or row,

  • That is D1 Δ D2 = 1.

  • Therefore, a privacy granting procedure qn satisfies  -differential privacy if results to the same query run on database D1 and again run on database D2 should probabilistically be similar, and satisfy the following condition:

P[qn(D1) ∈ R] / P[qn(D2) ∈ R] ≤ exp()

  • Where D1and D2 are the two databases.
  • P is the probability of the perturbed query results D1and D2 respectively.

  • qn() is the privacy granting procedure (perturbation).

  • qn(D1) the privacy granting procedure on query results from database D1 .

  • qn(D2) the privacy granting procedure on query results from database D2 .

  • R is the perturbed query results from the databases D1and D2 respectively.

  • exp()the exponential epsilon value.

The probability of the perturbed query results qn(D1) divided by the probability of the perturbed query results qn(D2) should be less or equal to a exponential epsilon value.

That is to say, if we run the same query on database D1, and then run the same query again on database D2, our query results should probabilistically be similar.

If the condition can be satisfied in the existence or nonexistence of the most influential observation for a specific query, then this condition will also be satisfied for any other observation.

The effect of the most influential observation for a given query is Δf, assessed as follows:

Δf = Max||f(D1) – f(D1)|| for all possible observed values of D1 and D2

According to Dwork (2006), the results to a query are given as

  • f(x) + Laplace(0, b) noise addition

  • Where b = Δf/

  • x represents a particular observed value of the database

  • f(x) represents the true result to the query

  • Then the result would satisfy -differential privacy

  • The Δf must consider all possible observed values of D1 and D1

Differential Privacy Ilustration

Differential Privacy Ilustration

Example:

What is the GPA of students at Geek Nerd State University?

We compute the maximum possible difference between two databases that differ in exactly one record for a specific query.

Δf = Max||f(D1) – f(D2)||

Let Min GPA = 2.0 for smallest possible GPA

Let Max GPA = 4.0 for largest possible GPA

Δf = | Max GPA – Min GPA|

Δf = 2.0

The parameter b of the Laplace noise is set to Δf/ = 2.0/0.01 = 200

Laplace (0, 200) noise distribution.

Variance of the noise distribution = 2* 200^2 = 80000

A small value of 0.01 is chosen. Smaller yield greater privacy by the procedure.

However, utility risks degeneration with a much more smaller value of .

For example, at 0.0001, will give b as 20000, Laplace (0, 20000) noise distribution.

The unperturbed results of the query +

Noise from Laplace (0, 200) =

Perturbed query results satisfying -differential privacy.

SQL: SELECT GPA FROM Student + Laplace Noise (0, 200) = -differential query results.

Pros and Cons

  • Grants across-the-board Privacy.

  • Easy to implement with SQL for aggregated data publication.

  • Utility a challenge as statistical properties change with a much smaller .

  • Noise takes into account the outliers and most influential observation.

  • Example, income of Warren Buffet verses income of janitor in Omaha Nebraska.

  • Balance between Privacy and Utility still an NP-Hard challenge.

References

[1] C. Dwork, “Differential privacy,” in in ICALP, vol. 2, 2006, pp. 1-12. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.83.7534

[2] K. Muralidhar and R. Sarathy, “Does differential privacy protect terry gross’ privacy?” in Privacy in Statistical Databases, ser. Lecture Notes in Computer Science, J. Domingo-Ferrer and E. Magkos, Eds.    Berlin, Heidelberg: Springer Berlin / Heidelberg, 2011, vol. 6344, ch. 18, pp. 200-209. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-15838-4_18

By Kato Mivule

  • Noise addition: A stochastic value is added to confidential numeric attributes.

  • The stochastic value is chosen from a normal distribution with zero mean and a diminutive standard deviation. First publicized by Jay Kim (1986) with the expression that

  • Z = X + ɛ

  • Where Z is the transformed data point

  • X is the original data point.

  • ɛ (epsilon)is the random variable (noise) with a distribution ε∼ N(0, σ2 ).

  • The X is then replaced with the Z for the data set to be published, Z = X + ɛ

Statistical Considerations in Noise Addition

Gaussian Noise Distribution

  • The Normal Distribution( Gaussian distribution), is a bell shaped probability distribution depicting real-valued stochastic variables clustered around a single mean…

  • Gaussian Noise Distribution
  • μ (mu) is the mean

  • σ2 (Sigma) is the variance

  • N(μ, σ2) is the normal distribution with mean μ and variance σ2

  • Transformed data has to keep the same statistical properties as the original data.
  • Covariance:Cov(X, Y): How affiliated the deviations between points X and Y.
  • Covariance
  • If Cov(X, Y) is positive, X and Y increase together, otherwise they don’t.
  • If Cov(X, Y) is zero, X and Y are each autonomous.
  • Correlation rxy calculates tendency of linear relation between two data points.
  • Correlation
  • If -1, then rxy is a negative linear relation between x and y,
  • if 0, no linear relation,
  • if +1, a strong linear relation.

Notes

[1] Jay Kim, A Method For Limiting Disclosure in Microdata Based Random Noise and Transformation, Proceedings of the Survey Research Methods, American Statistical Association, Pages 370-374, 1986.

[2] J. Domingo-ferrer, F. Sebe, and J. Castella-Roca, “On the security of noise addition for privacy in statistical databases,” in Privacy in Statistical Databases 2004, 2004, pp. 149-161. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.4575

Kato Mivule and Claude Turner
http://arxiv.org/abs/1107.3784 | PDF Download

Abstract
The growth of Information Technology(IT) in Africa has led to an increase in the utilization of communication networks for data transaction across the continent. A growing number of entities in the private sector, academia, and government, have deployed the Internet as a medium to transact in data, routinely posting statistical and non statistical data online and thereby making many in Africa increasingly dependent on the Internet for data transactions. In the country of Uganda, exponential growth in data transaction has presented a new challenge: What is the most efficient way to implement data privacy. This article discusses data privacy challenges faced by the country of Uganda and implementation of data privacy techniques for published tabular data. We make the case for data privacy, survey concepts of data privacy, and implementations that could be employed to provide data privacy in Uganda.

Some welcome news as Google makes the decision to take Privacy ‘more’ seriously after critique from Europe and other privacy watch dogs. Kato Mivule

Google taking more cautious stance on privacy – CNN
…Google, now an Internet giant, is learning that it needs to take more careful steps in regards to privacy. Prompted by intensifying scrutiny from government regulators and privacy watchdogs, Google is striving to be more sensitive to data-collection issues that often make people uneasy. It went so far as to spike a recent project Google engineers had developed because of fears of potential backlash, said Eric Schmidt, Google’s executive chairman and former CEO…
http://www.cnn.com/2011/TECH/web/05/31/google.schmidt/index.html?hpt=hp_c2

By Kato Mivule

Database Systems

During a database development process, often database developers will engage with their clients and users to solicit the requirements that the new database is supposed to accomplish. One of the best ways to communicate with clients and users is with the use of ER diagrams to check if the developers have fully incorporated all the requirements in the design as dictated by the client.

The ER diagrams become easy to study, visualize, and amend during the requirements soliciting process with the client. Therefore, ER diagrams are conceptional and abstract presentation of the database, fully giving a formal description of database components such as schema, entities, relationships, and attributes. [1] [2]

At the same time the Relational Model which was first introduced in 1969 by E.F Codd, is a conceptual view of a database based on first-order predicate logic, depicting a database as a set of predicate variants. The model stipulates a declaratory way for defining data and queries by requiring users to straightaway state what data they need to get from the data and place into the database.[4] Relational Models represent data as a collection of relations, depicting the database as a schema, the table as a relation, the columns as attributes, and rows as tuples.[2]

The Relational Model is declarative and therefore easy to implement using declarative languages like SQL, making it a major difference between ER Diagrams and the Relational Model. ER diagrams tend to be highly conceptual and though could be easily understood by clients and users who are not experts in database terminology, ER diagrams are not declarative and straight forward when it comes to implementation. It is therefore important to have the Relational Model interface between the ER diagrams and the SQL declarative language so as to make implementation faster and easier.

Secondly, any errors that come up during the design phase with the ER diagrams, can be captured when mapping to the Relational Model. Yet even more, wile both the ER and the Relational Model are mathematical, we can still map ER diagrams to the Relational Model, thus making implementation of the conceptual view to physical view much easier.

Yet still, the ease of use between the ER Model and Relational Model is still debatable. Peter P. Chen, the author of the ER model in the 1970s notes three differences between the ER Model and E.F. Codd’s Relational Model: [6] [7]

  • ER models employ mathematical relation constructs to show relationships between entities…
  • ER model incorporates more semantic data than the Relational Model…
  • ER model uses explicit linkage between entities while the Relational model uses implicit linkages between entities…

Yet still despite the advantages of more semantic information with the Chen’s ER model, Codd’s Relational model is closer to SQL implementation and has the declarative advantage. Therefore, for the conceptual view, the ER diagram work best but will work better when mapped to the Relational model for optimized physical view implementation.

Sources

[1] Seyed M. M. Tahaghoghi, Hugh E. Williams, “Learning MySQL”, O’Reilly Media, Inc., 2006, ISBN 0596008643, 9780596008642, Page 112-118

[2] Ramez Elmasri, Shamkant B. Navathe, Fundamentals of Database Systems, Edition 6, Addison Wesley Pub Co Inc, 2010, ISBN 0136086209, 9780136086208, Page 199-350

[3] Heidi Gregersen and Leo Mark and Christian S. Jensen, “Mapping Temporal ER Diagrams to Relational Schemas”, IN PROC. OF THE 9TH INT. CONF, 1998.

[4] “Relational model – Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Relational_model. [Accessed: 02-Nov-2010].

[5] “Entity-relationship model – Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Entity-relationship_model. [Accessed: 02-Nov-2010].

[6] Peter Chen, “Entity-Relationship Modeling: Historical Events, Future Trends, and Lessons Learned”, Software Pioneers: Contributions to Software Engineering, Springer, 2002.

[7] E.F. Codd, “A relational model of data for large shared data banks. 1970.,” M.D. computing : computers in medical practice, vol. 15, 1970, pp. 162-6.

Older Posts »

Follow

Get every new post delivered to your Inbox.