Feeds:
Posts
Comments

Archive for the ‘Tutorials and Papers’ Category

By Kato Mivule

The latest state-of-the-art in data privacy, proposed by Cynthia Dwork (2006).

Differential privacy enforces confidentiality by:

  • Returning perturbed aggregated query results from databases.

  • Such that users cannot discern if any data item has been altered or not.

  • An attacker cannot derive info about any data item in the database.

According to Dwork (2006):

  • Two databases D1 and D2 are considered similar, if they differ in only one element or row,

  • That is D1 Δ D2 = 1.

  • Therefore, a privacy granting procedure qn satisfies  -differential privacy if results to the same query run on database D1 and again run on database D2 should probabilistically be similar, and satisfy the following condition:

P[qn(D1) ∈ R] / P[qn(D2) ∈ R] ≤ exp()

  • Where D1and D2 are the two databases.
  • P is the probability of the perturbed query results D1and D2 respectively.

  • qn() is the privacy granting procedure (perturbation).

  • qn(D1) the privacy granting procedure on query results from database D1 .

  • qn(D2) the privacy granting procedure on query results from database D2 .

  • R is the perturbed query results from the databases D1and D2 respectively.

  • exp()the exponential epsilon value.

The probability of the perturbed query results qn(D1) divided by the probability of the perturbed query results qn(D2) should be less or equal to a exponential epsilon value.

That is to say, if we run the same query on database D1, and then run the same query again on database D2, our query results should probabilistically be similar.

If the condition can be satisfied in the existence or nonexistence of the most influential observation for a specific query, then this condition will also be satisfied for any other observation.

The effect of the most influential observation for a given query is Δf, assessed as follows:

Δf = Max||f(D1) – f(D1)|| for all possible observed values of D1 and D2

According to Dwork (2006), the results to a query are given as

  • f(x) + Laplace(0, b) noise addition

  • Where b = Δf/

  • x represents a particular observed value of the database

  • f(x) represents the true result to the query

  • Then the result would satisfy -differential privacy

  • The Δf must consider all possible observed values of D1 and D1

Differential Privacy Ilustration

Differential Privacy Ilustration

Example:

What is the GPA of students at Geek Nerd State University?

We compute the maximum possible difference between two databases that differ in exactly one record for a specific query.

Δf = Max||f(D1) – f(D2)||

Let Min GPA = 2.0 for smallest possible GPA

Let Max GPA = 4.0 for largest possible GPA

Δf = | Max GPA – Min GPA|

Δf = 2.0

The parameter b of the Laplace noise is set to Δf/ = 2.0/0.01 = 200

Laplace (0, 200) noise distribution.

Variance of the noise distribution = 2* 200^2 = 80000

A small value of 0.01 is chosen. Smaller yield greater privacy by the procedure.

However, utility risks degeneration with a much more smaller value of .

For example, at 0.0001, will give b as 20000, Laplace (0, 20000) noise distribution.

The unperturbed results of the query +

Noise from Laplace (0, 200) =

Perturbed query results satisfying -differential privacy.

SQL: SELECT GPA FROM Student + Laplace Noise (0, 200) = -differential query results.

Pros and Cons

  • Grants across-the-board Privacy.

  • Easy to implement with SQL for aggregated data publication.

  • Utility a challenge as statistical properties change with a much smaller .

  • Noise takes into account the outliers and most influential observation.

  • Example, income of Warren Buffet verses income of janitor in Omaha Nebraska.

  • Balance between Privacy and Utility still an NP-Hard challenge.

References

[1] C. Dwork, “Differential privacy,” in in ICALP, vol. 2, 2006, pp. 1-12. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.83.7534

[2] K. Muralidhar and R. Sarathy, “Does differential privacy protect terry gross’ privacy?” in Privacy in Statistical Databases, ser. Lecture Notes in Computer Science, J. Domingo-Ferrer and E. Magkos, Eds.    Berlin, Heidelberg: Springer Berlin / Heidelberg, 2011, vol. 6344, ch. 18, pp. 200-209. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-15838-4_18

Read Full Post »

By Kato Mivule

  • Noise addition: A stochastic value is added to confidential numeric attributes.

  • The stochastic value is chosen from a normal distribution with zero mean and a diminutive standard deviation. First publicized by Jay Kim (1986) with the expression that

  • Z = X + ɛ

  • Where Z is the transformed data point

  • X is the original data point.

  • ɛ (epsilon)is the random variable (noise) with a distribution ε∼ N(0, σ2 ).

  • The X is then replaced with the Z for the data set to be published, Z = X + ɛ

Statistical Considerations in Noise Addition

Gaussian Noise Distribution

  • The Normal Distribution( Gaussian distribution), is a bell shaped probability distribution depicting real-valued stochastic variables clustered around a single mean…

  • Gaussian Noise Distribution
  • μ (mu) is the mean

  • σ2 (Sigma) is the variance

  • N(μ, σ2) is the normal distribution with mean μ and variance σ2

  • Transformed data has to keep the same statistical properties as the original data.
  • Covariance:Cov(X, Y): How affiliated the deviations between points X and Y.
  • Covariance
  • If Cov(X, Y) is positive, X and Y increase together, otherwise they don’t.
  • If Cov(X, Y) is zero, X and Y are each autonomous.
  • Correlation rxy calculates tendency of linear relation between two data points.
  • Correlation
  • If -1, then rxy is a negative linear relation between x and y,
  • if 0, no linear relation,
  • if +1, a strong linear relation.

Notes

[1] Jay Kim, A Method For Limiting Disclosure in Microdata Based Random Noise and Transformation, Proceedings of the Survey Research Methods, American Statistical Association, Pages 370-374, 1986.

[2] J. Domingo-ferrer, F. Sebe, and J. Castella-Roca, “On the security of noise addition for privacy in statistical databases,” in Privacy in Statistical Databases 2004, 2004, pp. 149-161. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.4575

Read Full Post »

Kato Mivule and Claude Turner
http://arxiv.org/abs/1107.3784 | PDF Download

Abstract
The growth of Information Technology(IT) in Africa has led to an increase in the utilization of communication networks for data transaction across the continent. A growing number of entities in the private sector, academia, and government, have deployed the Internet as a medium to transact in data, routinely posting statistical and non statistical data online and thereby making many in Africa increasingly dependent on the Internet for data transactions. In the country of Uganda, exponential growth in data transaction has presented a new challenge: What is the most efficient way to implement data privacy. This article discusses data privacy challenges faced by the country of Uganda and implementation of data privacy techniques for published tabular data. We make the case for data privacy, survey concepts of data privacy, and implementations that could be employed to provide data privacy in Uganda.

Read Full Post »

By Kato Mivule

Database Systems

During a database development process, often database developers will engage with their clients and users to solicit the requirements that the new database is supposed to accomplish. One of the best ways to communicate with clients and users is with the use of ER diagrams to check if the developers have fully incorporated all the requirements in the design as dictated by the client.

The ER diagrams become easy to study, visualize, and amend during the requirements soliciting process with the client. Therefore, ER diagrams are conceptional and abstract presentation of the database, fully giving a formal description of database components such as schema, entities, relationships, and attributes. [1] [2]

At the same time the Relational Model which was first introduced in 1969 by E.F Codd, is a conceptual view of a database based on first-order predicate logic, depicting a database as a set of predicate variants. The model stipulates a declaratory way for defining data and queries by requiring users to straightaway state what data they need to get from the data and place into the database.[4] Relational Models represent data as a collection of relations, depicting the database as a schema, the table as a relation, the columns as attributes, and rows as tuples.[2]

The Relational Model is declarative and therefore easy to implement using declarative languages like SQL, making it a major difference between ER Diagrams and the Relational Model. ER diagrams tend to be highly conceptual and though could be easily understood by clients and users who are not experts in database terminology, ER diagrams are not declarative and straight forward when it comes to implementation. It is therefore important to have the Relational Model interface between the ER diagrams and the SQL declarative language so as to make implementation faster and easier.

Secondly, any errors that come up during the design phase with the ER diagrams, can be captured when mapping to the Relational Model. Yet even more, wile both the ER and the Relational Model are mathematical, we can still map ER diagrams to the Relational Model, thus making implementation of the conceptual view to physical view much easier.

Yet still, the ease of use between the ER Model and Relational Model is still debatable. Peter P. Chen, the author of the ER model in the 1970s notes three differences between the ER Model and E.F. Codd’s Relational Model: [6] [7]

  • ER models employ mathematical relation constructs to show relationships between entities…
  • ER model incorporates more semantic data than the Relational Model…
  • ER model uses explicit linkage between entities while the Relational model uses implicit linkages between entities…

Yet still despite the advantages of more semantic information with the Chen’s ER model, Codd’s Relational model is closer to SQL implementation and has the declarative advantage. Therefore, for the conceptual view, the ER diagram work best but will work better when mapped to the Relational model for optimized physical view implementation.

Sources

[1] Seyed M. M. Tahaghoghi, Hugh E. Williams, “Learning MySQL”, O’Reilly Media, Inc., 2006, ISBN 0596008643, 9780596008642, Page 112-118

[2] Ramez Elmasri, Shamkant B. Navathe, Fundamentals of Database Systems, Edition 6, Addison Wesley Pub Co Inc, 2010, ISBN 0136086209, 9780136086208, Page 199-350

[3] Heidi Gregersen and Leo Mark and Christian S. Jensen, “Mapping Temporal ER Diagrams to Relational Schemas”, IN PROC. OF THE 9TH INT. CONF, 1998.

[4] “Relational model – Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Relational_model. [Accessed: 02-Nov-2010].

[5] “Entity-relationship model – Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Entity-relationship_model. [Accessed: 02-Nov-2010].

[6] Peter Chen, “Entity-Relationship Modeling: Historical Events, Future Trends, and Lessons Learned”, Software Pioneers: Contributions to Software Engineering, Springer, 2002.

[7] E.F. Codd, “A relational model of data for large shared data banks. 1970.,” M.D. computing : computers in medical practice, vol. 15, 1970, pp. 162-6.

Read Full Post »

By Kato Mivule

Database Systems

Outline

In this article we take a look at the differences between SQL, Relational Algebra, and Relational Calculus. I this article, we focus on the main differences between Relational Algebra and Relational Calculus. Since SQL is mainly an implementation language, we take note of some major differences between Relational Algebra and Relational Calculus.

Introduction

Relational Algebra (RA) and Relational Calculus (RC) are formal languages for the database relational model while SQL is the practical language in the database relational model. [1] In these formal languages a conceptual database model is expressed in mathematical terms and notations while in the practical language – SQL, the mathematical expressions of the functionality and transaction of the database operations are implemented physically. Formal languages provide a medium through which to optimize and implement queries in database transactions. [1]

Of course the first notable differences between these languages in the syntax and notation used in the expressions. Each language, that RA, RC, and SQL have their own notations to express their notations.

RA Notations – Procedural Expression Language

We begin with the notations used in RA and that are unique to RA and are procedural, requiring a sequence of operations to express a query transaction as described by Elmasri et al: [1]

SELECT

The SELECT opeartor is σ (sigma) symbol and used as an expression to choose tuples that meet the selection condition…

σ<Selection condition>(R)

 

PROJECT

The PROJECT operator in RA is ∏ (pi) symbol used to choose attributes from a relation.

<attribute list>(R )

RENAME

The RENAME operator is symbolized by ρ (rho) and is used to express a naming of query results, attributes, and relations.

ρ s(B1, B2, B3,….Bn)(R )

UNION

RA notation for UNION is symbolized by ∪ , and includes all tuples that are in A or in B, eliminating duplicate tuples, therefore set A UNION set B would be expressed as:

RESULT ← A ∪ B

INTERSECTION

The INTERSECTION operation on a relation A INTERSECTION relation B, is symbolized by A B, includes tuples that are only in A and B.

RESULT ← A B

MINUS

the MINUS operation includes tuples from one Relation that are not in another Relation and symbolized by the – (minus) symbol. Therefore A – B would be expressed as…

RESULT ← A – B

CARTERSIAN PRODUCT

Creates a relation that has all the attributes of A and B, allowing all the attainable combinations of tuples from A and B in the result. The notation used is X.

C = A X B

JOIN

The JOIN operation is denoted by the ⟗ symbol and is used to compound similar tuples from two Relations into single longer tuples.

A ⟗ <join condition> B

NATURAL JOIN

The NATURAL JOIN operation returns results that does not include the JOIN attributes of the second Relation B and is symbolized by the * symbol.

    A * ⟗ <join condition> B,

OR A * ⟗ (<join attributes 1>),

(<join attributes 2>) B

OR A * B

DIVISION

The DIVISION operation will return a Relation R(X) that includes all tuples t[X] in R(Z) that appear in R1 in combination with every tuple from R2(Y), where Z = X ∪ Y. The DIVISION operator is symbolized by ∻ symbol

R1(Z) ∻ R2(Y)

RC Notation – Non Procedural Expression Language

As Elmasri et al note, RC is a non procedural expression language that does not require any sequence to be followed in the expression of query transactions. One declarative statement could be used express a query transaction. [1]

The calculus query syntax as noted by Elmasri et al is:

{t | COND(t)}

Where t is the tuple variable.

COND(t) is the conditional expression.

The AND, OR, and NOT logical operators are also used in the expressions. An example of a RC expression is:

{t | STUDENT(t) AND t. GPA > 3.0}

This query would return all students whose GPA is greater than 3.0.

Expressions and Formulas in RC as noted by Elmasri et al are in the form:

{t1.Aj, t2.Ak,…, tn.Am | COND(t1, t2, …, tn, tn+1, tn+2, …, tn+m)}

Where COND is the condition or formula.

Existential and Universal Quantifiers

Elmasri et al further note that in RC quantifies that are largely unique to RC can appear in fomulas. The quantifiers include:

Universal quantifier (∀): Is true if all tuples make the formula true.

Existential quantifier (∃): Is true if there exists some tuple that makes the formula true

Example would include:

List the first name, last name and address of Students whose Major is Computer Science.

Q1: {s.Fname, s.Lname, s.Address | STUDENTS(s) AND (∃ m) (MAJOR(m)

AND m.major=’Computer Science’ AND m.studentnumb=t.studentnumb)}

SQL Notation

SQL is a structured query language, a comprehensive database language with data definition, queries, and updates.[1] It is with SQL that RA and RC are implemented physically in a database management program. Therefore this is the major difference between RA, RC and SQL, in that SQL will take the conceptual expressions of RA and RC and implement them at the computer level or physical level. SQL includes many declarative statements that it would take another article to write all of them but below are some few included here:

  • Create: used to create schema and tables
  • Insert: used to insert data to the tables
  • Select: used to select tuples that satisfy a condition, also used to select attributes.
  • Select fname From STUDENTS Where major = ‘computer science’: selects first name of students from the students table whose major is computer science.

It is noted that RC is much easier to implement in SQL as the RC notation is similar to SQL syntax.

Conclusion

Therefore RA and RC major differences are that RC is non procedural while RA is procedural and rigid as such. However, another notable difference is that it is easy to implement RC expressions directly to SQL language.

Sources

[1] Ramez Elmasri, Shamkant B. Navathe, Fundamentals of Database Systems, Edition 6, Addison Wesley Pub Co Inc, 2010, ISBN 0136086209, 9780136086208, Page 145 – 186

Read Full Post »

Older Posts »

Follow

Get every new post delivered to your Inbox.