Tuesday, July 15, 2008

Making fudged PII

As I deal more with application development that sends data to external parties, I think more about the security of the personally identifiable information (PII) within that data. When testing web services and file transfers, there is no reason to have real PII in the stream of test data. However, the destination for that data still needs values in those elements. There are a few ways to approach this:

  1. Random data
  2. Pre-generated test data, test cases, etc.
  3. Use existing data but generate PII from a real primary key


The main problem I see with randomly generated data is "re-testing". For example: you send data for ID 1234567 to a service, randomly generating four columns of test data, then that service requests a re-test with the same data. Unless the random values were saved somewhere, they cannot be reproduced.

Pre-generated data would be a set of test cases, known trouble patterns and other data created before testing occurs. This scenario is feasible for new development on new data systems, to test min/max/null value scenarios that would otherwise never appear in live data, or to force a specific set of data. Where this scenario becomes cumbersome is when there is a tightly integrated system with numerous historical precursor processes; triple the complexity if that system is another vendor's package and not your own. For example: create customer, purchase 14 months' worth of product, run through aging and re-bill process, reconcile A/R, and then skip a month of purchases. There could be over a hundred tables touched by that set of processes, and if one is skipped or missed then the data for another group is no longer valid.

Generating test data from the primary key is feasible when the system has a long history of identified test data handy but simply needs PII altered to protect the identity of the individual or organization. By using a unique, primary key the "fudged" PII data will always match back to the source primary key for instances where "re-testing" is required. For example: ID 1234567 always generates social security number 898-75-5309 (not a real number but will validate in some systems).

The example below was written for Oracle SQL to show how to convert a seven-digit primary key identifier into a social security number, date of birth, gender and ethnic code:
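SELECT t1.id  -- VARCHAR2(7) holding a seven-digit number
     -- SSN: group and serial are computed as negative numbers so that the
     -- 'S' sign in the format mask supplies the dashes.
     , TO_CHAR(899 - FLOOR(TO_NUMBER(t1.id) / 989901), 'FM000')  -- FM drops the blank TO_CHAR reserves for the sign
       || TO_CHAR(MOD(FLOOR(TO_NUMBER(t1.id) / 9999), 99) - 99, 'S00')
       || TO_CHAR(MOD(TO_NUMBER(t1.id), 9999) - 9999, 'S0000') social_security_number
     -- Date of birth: January 1 of the year 18 years (6574.5 days) back,
     -- minus up to 6 years (2191.5 days) taken from the leading digits,
     -- minus a block of years chosen by the last digit.
     , TO_CHAR(TRUNC(TRUNC(SYSDATE - 6574.5, 'YEAR')
                     - MOD(FLOOR(TO_NUMBER(t1.id) / 10), 2191.5)
                     - CASE SUBSTR(t1.id, -1, 1)
                         WHEN '2' THEN 2191.5   --  6 years
                         WHEN '3' THEN 2191.5   --  6 years
                         WHEN '4' THEN 4383     -- 12 years
                         WHEN '5' THEN 6574.5   -- 18 years
                         WHEN '6' THEN 8766     -- 24 years
                         WHEN '7' THEN 10957.5  -- 30 years
                         WHEN '8' THEN 13149    -- 36 years
                         WHEN '9' THEN 15340.5  -- 42 years
                         ELSE 0                 -- last digit 0 or 1
                       END), 'MM/DD/YYYY') date_of_birth
     , CASE
         WHEN SUBSTR(t1.id, -1, 1) < '5' THEN 'Male'
         ELSE 'Female'
       END gender
     , CASE SUBSTR(t1.id, -1, 1)
         WHEN '0' THEN 'AI'  -- American Indian/Alaskan
         WHEN '1' THEN 'AS'  -- Asian/Pacific Islander
         WHEN '2' THEN 'BL'  -- Black/Non-Hispanic
         WHEN '3' THEN 'HS'  -- Hispanic
         WHEN '4' THEN 'NR'  -- Non-Resident Alien
         WHEN '6' THEN 'AS'
         WHEN '7' THEN 'BL'
         WHEN '8' THEN 'HS'
         ELSE 'WH'           -- last digit 5 or 9
       END ethnic_code
  FROM table_name t1
 WHERE [selection criteria];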
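The social security number is pretty straightforward: the area number (first three digits) counts down from 899, and the group and serial numbers are computed as negative values so that the sign in the format mask supplies the dashes. If area numbers higher than 772 do not validate, then use area 267 (237-267 have all groups allocated within them).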
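To check the arithmetic, the three parts of the number can be pulled apart for the sample key 1234567 used earlier. This is only a sketch: the key is hard-coded, DUAL stands in for the real table, and the column aliases are made up for illustration. The second statement is a possible variation of my own (an assumption, not tested against any validator) that counts down from 267 instead of 899.

```sql
-- Split the sample key into the three SSN parts.
-- Assumption: hard-coded key; in the real query it comes from t1.id.
SELECT 899 - FLOOR(1234567 / 989901)        AS area_number    -- 899 - 1     = 898
     , 99 - MOD(FLOOR(1234567 / 9999), 99)  AS group_number   -- 99 - 24     = 75
     , 9999 - MOD(1234567, 9999)            AS serial_number  -- 9999 - 4690 = 5309
  FROM dual;

-- Variation (assumption): start the area at 267 so that seven-digit keys
-- land in areas 257 through 267 while still stepping down per block of keys.
SELECT TO_CHAR(267 - FLOOR(1234567 / 989901), 'FM000')
       || TO_CHAR(MOD(FLOOR(1234567 / 9999), 99) - 99, 'S00')
       || TO_CHAR(MOD(1234567, 9999) - 9999, 'S0000') AS social_security_number  -- 266-75-5309
  FROM dual;
```

Because 9999 x 99 = 989901, the group and serial together repeat every 989,901 keys, and the area number steps down by one each time they do. That is what keeps every seven-digit key on its own number; hard-coding a single area such as 267 would give that up for keys that are 989,901 apart.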

The date of birth is a little complicated. The ones digit picks one of eight age brackets, with the two youngest each covered by two digits (0 and 1, 2 and 3) so they get more hits, because they are the primary age group dealt with. Starting with a base age of 18 years, subtract zero to seven six-year blocks based on the ones digit, and then subtract another zero to six years derived from the remaining digits.
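The same steps can be laid out one column at a time for the sample key 1234567. This is a sketch under one loud assumption: DATE '2008-07-15' replaces SYSDATE so the result is repeatable, and the column aliases are only illustrative.

```sql
-- Date of birth for key 1234567, step by step.
-- Assumption: a fixed reference date stands in for SYSDATE.
SELECT TRUNC(DATE '2008-07-15' - 6574.5, 'YEAR')  AS base_date        -- Jan 1 of the year 18 years back
     , MOD(FLOOR(1234567 / 10), 2191.5)           AS days_from_key    -- 123456 mod 2191.5 = 732 days
     , 10957.5                                    AS days_from_digit  -- last digit 7: 30 years
     , TO_CHAR(TRUNC(TRUNC(DATE '2008-07-15' - 6574.5, 'YEAR')
                     - MOD(FLOOR(1234567 / 10), 2191.5)
                     - 10957.5), 'MM/DD/YYYY')    AS date_of_birth
  FROM dual;
```

With those values the last column should come out as 12/30/1957. Because the base date is tied to SYSDATE, the generated birth dates move by a year around the turn of each year, so a re-test that spans a year boundary would need a fixed reference date like the one used here.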

Gender was a simple test of the ones digit to determine male or female.

Ethnic background was a simple translation of the ones digit to a code, taking into account that four of the six groups (AS, BL, HS and WH) get two digits each while the other two (AI and NR) get one.

This was a very bare example meant only to suggest direction. It would be interesting to build a library (although someone probably already has).
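As a first step toward such a library, the SSN expression could be wrapped in a stored function. This is only a sketch: the name fudged_ssn and its parameter p_id are made up for illustration, and it assumes the key is always numeric.

```sql
-- Hypothetical helper: wraps the SSN expression from the query above.
CREATE OR REPLACE FUNCTION fudged_ssn (p_id IN VARCHAR2)
  RETURN VARCHAR2
  DETERMINISTIC  -- the same key always returns the same value
IS
  v_id NUMBER := TO_NUMBER(p_id);  -- raises an error if the key is not numeric
BEGIN
  RETURN TO_CHAR(899 - FLOOR(v_id / 989901), 'FM000')
      || TO_CHAR(MOD(FLOOR(v_id / 9999), 99) - 99, 'S00')
      || TO_CHAR(MOD(v_id, 9999) - 9999, 'S0000');
END fudged_ssn;
/

-- Usage: returns 898-75-5309 for the sample key.
SELECT fudged_ssn('1234567') AS social_security_number FROM dual;
```

Date of birth, gender and ethnic code could get the same treatment, which would keep the fudging rules in one place instead of copied into every extract query.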

Sunday, July 13, 2008

No more Mr Cranky

One of my favorite Friday morning web reads is calling it quits soon. Mr Cranky and his movie reviews are going away in August. I’m so... well, the world isn’t going to come to an end, but a good source of honest movie reviews and funny caption contests will be no more. I’ve thought about writing movie reviews here and have done so for the “Blog of the Dead”, but those relate directly to zombie movies and not to movies in general.

In order to write movie reviews I would actually have to go to see movies in the theater. I’ll probably go see The Dark Knight, but I think the last movie I went out to see was Spider-Man 3 and that was only to see how badly they fucked up the Venom character. I typically wait until the movies hit the rental queue on Netflix, and I’m usually limited to horror movies and whatever movies my girlfriend rents. Theoretically I could still do reviews of rentals but they would be kind of late, wouldn’t they? I could always do mini-reviews of the rentals to see if my warped mind matches the rest of the critical world, but I find myself with barely enough time to write for this blog and my story blog.

Maybe instead of doing something in the absence of Mr Cranky I should just let it fade off and not do something in the same vein; that would be the best honor of all.

Update: Mr Cranky did not go away. Yay!

Thursday, July 10, 2008

Web 2.blow

Okay, this whole Web 2.0 "let's throw everything and the kitchen sink into a web browser and use it as a business computing platform" thing has given me a reason to get angry and blog again.

I usually have two physical machines active at work (well, at home too). I have the Windows XP based machine where I do most of my work and an older Ubuntu Linux desktop machine that I use for web browsing, music listening, scripted automation, etc. I have four desktops on the Linux machine: music player and maintenance, one or more Opera web browser sessions for various documentation and manuals, an Opera web browser for Google searches, and one empty just in case I need it. This configuration is very useful to me and it comes in handy sometimes because the Windows XP machine has about 300 megabytes of anti-virus, usage monitoring, remote management agents, network client, print client, database server, remote update agents, groupware notification agents, software update sleeper programs, and who the frak knows what else running on it before I launch the first bit of work. In fact, if I have to boot/reboot the Windows XP machine in the morning I can usually have the Linux machine booted with all the applications I use there loaded, my "Good Morning" Opera session loaded and all pages viewed, etc. just as the Windows XP machine finishes its rituals. Occasionally, usually when I'm doing really intensive work, my employer schedules scanner software to make sure I don't have any viruses, spyware, malware, underwear, etc., to inventory my machine for hardware and software, and to scan all directories for .avi, .mp3, .mpg, etc. to see how much hard disk real estate I'm using for, I'm assuming, "non-work related" stuff.

Okay, I know, get to the point. As we use more web-based applications I keep running into more examples where they a) don't work in a browser, b) do not have complete functionality in one or more browsers, or c) have functionality (possibly) broken by a feature of a browser. Here is what I deal with:

  • one environment, since it is deeply integrated into IE6, does not work in IE7 (so we cannot upgrade to IE7 yet)
  • one application simply refuses to work in Opera but works in IE6 and Firefox
  • one application works in IE but not well in Firefox or Opera due to poor JavaScript and CSS (yes, it's a Microsoft Visual Studio project, how did you guess)
  • one application works in IE but occasionally loses functionality in Firefox and Opera
  • one application requests I upgrade from IE6 to IE7
  • one application works in IE and Opera, but not Firefox (could be fixed with a GreaseMonkey script?)
  • one application does not work in Opera because of how they implement(ed) display:none in CSS for images

There are some applications that do make sense in the Web 2.0 world. However, there are also things that are better off in a "thick" client that is forced to adhere to a strict window API/widget toolkit. People have been conditioned, for good reason, to not trust the Internet, and that is why all the browsers include anti-phishing, anti-hacking, ad blocking, Flash blocking, JavaScript blocking, fraud protection, content blocking, etc., either built in or via add-on. Combine that with the plain fact that CSS, JavaScript, DOM and default fonts are not the same and/or do not work the same way on IE6, IE7, Firefox, Opera, etc. and that, IMHO, is why web based applications that try to do too much fail (and piss me off).

Oh, and don't get me started on how security and personal identity protection have been handled in this new, networked, Internet world.