From the Manhattan Project to Big Data: Engaging Bias, Ethics, & More in Data

During some downtime before an exam, when I was a young chemistry student in college, I asked my professor at the time, “We are learning how to understand the chemical world around us, and all of the powers that come along with that knowledge, but how come there is no instruction about the ethics of that knowledge and how to properly use it? I hope that everyone here will use their skills to better humanity, but why is that taken as a given?” In front of the 150 other students in the room, he said, “Frankly, we should. Every single thing you study as an adult should reckon with the gravity of its purpose, for good or for ill, and try to examine itself to prevent some of the problems that field of study has engaged with in the past.”

He continued

“Chemists specifically owe a debt to many of the scientists who, in one way or another, were involved with the Manhattan Project. And while their minds were at the forefront of nuclear engineering, the ethics of deploying those skills bounced back and forth behind their objective. It seems that not much has changed, and we just hope for the best.”

A 10 year reunion of many of the scientists from the Manhattan Project

When Robert Oppenheimer successfully navigated the development of the world’s first nuclear weapon, he understood the gravity of what he had created, captured in his now famous quote, “Now I am become Death, the destroyer of worlds.” The result of his (and many other people’s) labor was immediately apparent in the vast destructive power of their creation: they knew they were building a tool they hoped would end the war, and it was clear their bomb would further that reality.

Dr. Oppenheimer

I make that analogy because, while a flash of light and a mushroom cloud ushered in a new age of stunning developments that would change the world dramatically, we are not nearly so lucky to have a singular instant in time to measure the world that was against the world that would be. While we are able to point to a myriad of instances where the world seems to change every few years because of advancements in technology, we lack the contextual wherewithal to really process all the things happening simultaneously that move us into the next age. The self-driving cars, the slow monopolization of social networking in our lives, the further automation of labor, and the deepening application of AI and algorithms to process our lives all add up as ingredients into a different type of atomic bomb: an atomic bomb of big data. The final part of this analogy brings us back to Dr. Oppenheimer and the unintended consequences of his work. He was eventually remorseful about the bombs, but we shouldn’t have to arrive at that same takeaway when we talk about big data. That said, we have to be cognizant of the awesome responsibility we assume when we handle big data and the mechanisms that process it. While we can corral big data to help people when it comes to their health, their small businesses, and so much more, we need to think about the underlying issues that might be harmful.

The aftermath of the Trinity Test

People might assume that computers, being machines running on logic, are inherently logical and unbiased: information comes in and results come out. But there are three specific parts of data science that prevent computers from truly being impartial arbiters when it comes to big data: assumptions, biases, and contextual bad data. These aren’t always present in whatever kind of data you work with, but by and large they are pervasive and sometimes subtly reveal themselves in the output. These kinds of biases can do real damage to regular people’s lives and skew decisions that adversely affect policy.

When it comes to buying a house, getting health insurance, or avoiding being misclassified by facial recognition software when the police are looking for a suspect, it is important to get it right.

Assumptions are when we take accepted notions as authoritative and don’t reflect on the actual structure of the data and where it is derived from. Assumptions are dangerous because science is a process, and there are very few things in the world that ever truly stagnate and never shift with the times. By not being vigilant, scientists can make the mistake of letting assumptions about how things operate blind them to systemic issues that might be rotting underneath and preventing genuine understanding. Archaic methods and understandings need to be constantly scrutinized to avoid being blindsided when things eventually shift. While not specifically a data problem, the 2008 housing market collapse was predicated on layering bad tranches with “good” ones (which were also bad) so they would pass the rating agencies, which rubber-stamped them anyway to keep winning business over their competitors. Sometimes your way of life or job involves burying problematic issues in a way that doesn’t resolve them. The assumption that everything will keep working, without doing the due diligence on the underlying housing bonds, or code in our case, can build up a backlog of bad data that at the time might be a rounding error but could eventually undermine a whole division of your organization. If something is ever wrong, you have a responsibility to make sure that it doesn’t end up hurting people. Unfortunately the Upton Sinclair quote, “It is difficult to get a man to understand something when his salary depends upon his not understanding it,” extends to how people execute their duties or conduct themselves.

Bad bonds packaged with “Good” bonds so they can be repackaged and sold
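To make this concrete, here is a minimal sketch, using hypothetical column names and thresholds, of turning that kind of unstated assumption (“the new data looks like the old data”) into an explicit, testable check instead of something taken on faith:

```python
# A minimal sketch (hypothetical column names and thresholds) of turning an
# unstated assumption -- "the new data looks like the old data" -- into an
# explicit, testable check instead of something taken on faith.
import pandas as pd

HISTORICAL_DEFAULT_RATE = 0.03  # hypothetical baseline from earlier batches

def audit_assumptions(df: pd.DataFrame) -> list:
    """Return human-readable warnings instead of silently passing data through."""
    warnings = []

    # Assumption 1: loan amounts fall in a plausible range.
    implausible = df[(df["loan_amount"] <= 0) | (df["loan_amount"] > 5_000_000)]
    if not implausible.empty:
        warnings.append(f"{len(implausible)} rows have implausible loan_amount values")

    # Assumption 2: the default rate hasn't quietly drifted from the baseline.
    current_rate = df["defaulted"].mean()
    if abs(current_rate - HISTORICAL_DEFAULT_RATE) > 0.01:
        warnings.append(
            f"default rate {current_rate:.3f} drifted from baseline {HISTORICAL_DEFAULT_RATE:.3f}"
        )

    return warnings

# Tiny made-up batch to show the checks firing.
batch = pd.DataFrame({
    "loan_amount": [250_000, 310_000, -50, 180_000],
    "defaulted":   [0, 1, 0, 0],
})
for warning in audit_assumptions(batch):
    print("WARNING:", warning)
```

The point is not these particular checks; it is that almost every assumption you rely on can be written down and tested before it quietly rots underneath you.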

Biases are found in programs because humans create programs, and, whether they outwardly know it or not, humans might even program those biases in subconsciously. Facial recognition technology is one new instance where the technology, when tested on a majority-white testing group, works well, but when applied to people with darker skin, performs poorly. This can be chalked up to having too small a sample of non-white testers, but for some systems, who is to say whether something like this is more subtle and goes uncaught? Credit scores and redlining are another place where what is observed about an individual is shaped more by their environment than by anything that person has actually done. A person who doesn’t interact with women very often might make a program that doesn’t take their specific needs into account, and that program might inadvertently add extra stress for a woman where there wouldn’t be any for a man. Again, that is not to say it can’t be reworked with an extra set of eyes, but sometimes biases compound, and even groups of people might be ignorant of a problem that later develops.

The red zones were where minorities historically lived, and any houses in the red immediately had a lower value
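As a rough illustration, here is a minimal sketch, using purely synthetic data and made-up group labels, of how aggregate accuracy can hide exactly this kind of problem: a model trained mostly on one group can look fine overall while performing noticeably worse on the underrepresented group.

```python
# A minimal sketch (synthetic data, hypothetical group labels) of why overall
# accuracy can hide bias: a model trained mostly on one group can look fine
# in aggregate while performing worse on the underrepresented group.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_group(n, shift):
    # Two features whose relationship to the label differs slightly by group.
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > shift).astype(int)
    return X, y

# 90% of the training data comes from group A, only 10% from group B.
Xa, ya = make_group(9000, shift=0.0)
Xb, yb = make_group(1000, shift=1.5)
model = LogisticRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

# Evaluate on fresh samples from each group separately, not just in aggregate.
for name, shift in [("group A", 0.0), ("group B", 1.5)]:
    X_test, y_test = make_group(2000, shift)
    print(name, "accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```

Reporting metrics per group, rather than only in aggregate, is one of the cheaper habits that makes this kind of bias visible before it ships.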

Contextual bad data here isn’t data that is erroneously input, but rather data where the methods used, or the questions asked to get answers, are handled poorly or in bad faith. Similar to the biases above, redlining and credit scores are derived from rather opaque systems that have been found to structurally disadvantage minorities. Housing data and credit score data might be analyzed by someone oblivious to these issues, who therefore cannot adequately take them into account. It might just end up that updated metrics on these historically problematic datasets reinforce further bad policies, because the assumption (again) is that the data is just the data, so any results gleaned from it are just the way the world is. Especially when data intersects with society, it becomes as messy and complicated as people are known to be. But at the same time, the stakes are higher, forcing us to ask questions. Who is getting the data? How are they getting the data? Are there underlying reasons the data is the way it is, and how will that affect the efficacy of the data’s usage? A good scientist asks questions, and should always be incredulous about the answers they get to those questions.

It is often not as blatant as this, but if your data is adjacent to something like this, that is a problem.
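Here is a minimal sketch, with hypothetical column names and tiny made-up records, of what asking those questions of a historical dataset might look like in practice before anyone trains a model on it:

```python
# A minimal sketch (hypothetical column names, tiny made-up records) of
# interrogating a historical dataset before trusting it: if past outcomes
# already differ sharply by neighborhood, a model trained on "the data as it
# is" will learn that history as if it were ground truth.
import pandas as pd

# Stand-in for a real extract of historical lending decisions.
df = pd.DataFrame({
    "neighborhood":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "credit_score":  [710, 690, 720, 680, 640, 655, 630, 650],
    "median_income": [95_000, 98_000, 93_000, 96_000, 41_000, 43_000, 40_000, 42_000],
    "approved":      [1, 1, 1, 0, 0, 0, 1, 0],
})

# Who is in the data? Sparse groups produce noisy, unreliable estimates.
print(df["neighborhood"].value_counts())

# Are the historical outcomes themselves already skewed by neighborhood?
print(df.groupby("neighborhood")["approved"].mean())

# Do "neutral" features quietly track the outcome and each other?
# Strong correlations hint that a feature may be acting as a proxy for
# something the model should not be learning, e.g., location standing in for race.
print(df[["credit_score", "median_income", "approved"]].corr())
```

None of this proves bias on its own, but it surfaces the patterns you would otherwise inherit silently.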

As the digital world continues to balloon, recalibrating our expectations for how data is gathered, organized, and used is imperative. There has to be accountability at all levels when it comes to collecting and using this data ethically. As people are turned into scores for advertisers or political algorithms, now more than ever we need to be steadfast in making sure we are transparent about the origin and structure of the code, so that these scores aren’t abused. Wherever there is data, we need to be constantly questioning and checking the methodology. This should help data scientists deepen their grounding in the world they will increasingly shape, but also empower people to make better-informed decisions when it comes to their data being used. The Pandora’s box of big data is not unlike the atom bomb in 1945, ushering in a paradigm shift in the world. The analogy stretches further: while the bomb was destructive, the atom also generates power to create, and while it is easier than ever to use big data to make better decisions for people, bad or ignorant actors can have their decisions made easier as well. Our curiosity will help shed light into places that need it, and by asking the right questions often, we can create good data that will help remedy some of the issues bad data has created.

Further reading on the topic:

“Weapons of Math Destruction” by Cathy O’Neil (here as a free PDF)

“The Signal and the Noise” by Nate Silver
