Blog — Language, culture, and data science

Posts in DataScience

Transactions are people

The most basic information about customer transactions tells you what someone bought, when they bought it, and for how much. But if that’s all you see, you’ve pretty much reduced people to rows in your spreadsheet and put to bed any ambition of understanding the relationships you have with customers. This is a post about coffee, but it’s also about waking up to the meaning and motivations behind transaction data.

ArtificialIntelligence, DataScience, Industry | Tyler Schnoebelen | January 24, 2018

Ethical AI: Products that Love People

Algorithms neither love nor suffer (despite all the hype about super-intelligent sentient robots). The products that algorithms power neither love nor suffer. But people build products and people use products. And a product that loves is one that, as Tyler Schnoebelen laid out in his recent Wrangle Talk, anticipates and respects the goals of the people it impacts. For us, this means building models with evaluation metrics beyond just precision, recall, or accuracy. It means setting objective functions that maximize not only profits, but the mutual benefit between company and consumer. It means helping businesses appreciate the miraculous nuances of people so they can provide contextual experiences and offer relevant products that may just make a consumer experience enjoyable.

ArtificialIntelligence, Ethics, DataScience, MachineLearning, Industry | Tyler Schnoebelen | September 18, 2017

Cher is the queen of emoji even if she isn't

It is universally recognized by experts that Cher is the Queen of Emoji. (Hail, Cher.)

I’m pretty sure this is what Cher wears while she tweets emoji after emoji after emoji

But as far as I know, no one has (a) performed an actual analysis to prove this, nor has anyone (b) performed an adequate interpretive dance to Dark Lady. I once tried to tackle (b) at a retreat near Big Sur, but today my focus is (a).

CorpusLinguistics, DataScience, Emoji, Gender | Tyler Schnoebelen | September 10, 2017

The Ethics of Everybody Else: New video posted

I had heard about Wrangle for a while — a data science conference where folks come to talk about the hardest problems they’ve faced and how they’ve found their ways around them. It also has a rancher-rustler theme, though you can’t see the cowboy boots I wore in the newly-posted video of my talk.

Here’s how I kicked off my 20-minute talk, called “The Ethics of Everybody Else”:

ArtificialIntelligence, DataScience, Ethics, MachineLearning, Multimedia | Tyler Schnoebelen | September 7, 2017

Did Wonder Woman really cost too much for you to love it?

I was trying to get a group together to see Wonder Woman, but I ran into opposition I didn’t expect. First: a friend revealed his policy against movies on the weekend. Simultaneously, a boyfriend revealed his policy against big-budget Hollywood movies.

I wanted to prove them wrong wrong wrong.

This post is about being lasso-of-truthed by data. Uff.

DataScience, Gender, Multimedia | Tyler Schnoebelen | June 3, 2017

The carrots and sticks of ethical NLP

Professions run into ethical problems all the time. Consider engineering: the US sold $9.9b worth of arms in 2016 ($3.9b in missiles). The most optimistic reading is that instruments of death prevent death. Consider medicine: Medical research is dominated by concerns of market size and patentability, leaving basic questions like “is this fever from bacteria or virus” unanswered for people treating illnesses in low-income countries. Consider law: Lawyers upholding the law can break any normal definition of justice. Even in philosophy, ethicists are not known to be more moral than anyone else.

ArtificialIntelligence, DataScience, Ethics, MachineLearning, NLP | Tyler Schnoebelen | April 12, 2017

Ethics in machine-learning, natural language processing, and AI

This is the visual version of my 5-pg paper, “Goal-oriented design for ethical machine learning and NLP”, which you can find alongside a bunch of others by going to http://ethicsinnlp.com/program.

ArtificialIntelligence, DataScience, MachineLearning, NLP, Ethics | Tyler Schnoebelen | March 27, 2017

MGMT, SBTRKT, PWR BTTM : ils ont tué la voyelle, ils s’expliquent enfin (MGMT, SBTRKT, PWR BTTM: they killed the vowel, and now they finally explain themselves)

You don't have to know French to have fun with this article (use Google Translate!). It's about why there are so many band names in all caps with no vowels. What are the patterns?

CorpusLinguistics, DataScience, Press | Tyler Schnoebelen | January 4, 2017

Budgeting for Training Data

Organizations build machine learning systems so that they can predict and categorize data. But to get a system to do anything, you have to train it. This post is meant to help you figure out a budget for training data based on best practices.

ArtificialIntelligence, DataScience, MachineLearning, NLP | Tyler Schnoebelen | November 17, 2016

Trump does NOT talk like a woman (BREAKING NEWS: gender continues to be complicated and confusing)

Tldr: gender doesn’t make for good soundbites if you’re doing it right.

Here’s a headline from Politico that is counter-intuitive, aggravating, and compelling: Donald Trump talks like a woman. I’d like to speak out on behalf of a bunch of linguists who say KNOCK IT OFF.

CorpusLinguistics, DataScience, Gender, NLP, Politics | Tyler Schnoebelen | November 5, 2016

U.S. presidential debates through the eyes of a computer

This post wraps up a series I’ve been doing on using machine learning models to understand recent American political debates (here and here). By taking all the transcripts of the debates since last year, I show which words and phrases most distinguish debaters’ styles and issues. Training a computer to identify speakers is usually thought of as a way of doing forensics or personalization. But here, I’m interested in something closer to summarization. If you can pick one section of talk for each candidate from the last debate, which moments are most consistent with everything they’ve said up to then?

CorpusLinguistics, DataScience, MachineLearning, NLP, Politics | Tyler Schnoebelen | October 13, 2016

The most Trumpian and Clintonesque moments in the debate (according to a computer)

Let’s teach a computer to guess who-said-what in the first US presidential debate between Hillary Clinton and Donald Trump. This is a way of finding out which moments the candidates were most like themselves — as well as when they were most like Bernie Sanders or Ted Cruz.

CorpusLinguistics, NLP, Politics, DataScience, MachineLearning | Tyler Schnoebelen | September 28, 2016

Why Technology Has Not Killed the Period. Period.

“Periods are not dead,” says computational linguist Tyler Schnoebelen, who turned to his own trove of 157,305 text messages to analyze how the final period—a period at the end of a thought or sentence—was being used and shared his initial results exclusively with TIME. “They’re actually doing interesting things.”

CorpusLinguistics, DataScience, NLP, Press | Tyler Schnoebelen | September 24, 2016

More data beats better algorithms

Most academic papers and blogs about machine learning focus on improvements to algorithms and features. At the same time, the widely acknowledged truth is that throwing more training data into the mix beats work on algorithms and features. This post will get down and dirty with algorithms and features vs. training data by looking at a 12-way classification problem: people accusing banks of unfair, deceptive, or abusive practices.

DataScience, MachineLearning, NLP | Tyler Schnoebelen | September 23, 2016

Nattering Nabobs of Negativity: Bigrams, “Nots,” and Text Classification

You can get pretty far in text classification just by treating documents as bags of words where word order doesn’t matter. So you’d treat “It’s not reliable and it’s not cheap” the same as “It’s cheap and it’s not not reliable”, even though the first is a strong indictment and the second is a qualified recommendation. Surely it’s dangerous to ignore the ways words come together to make meaning, right?
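To make the point concrete, here is a minimal sketch (not the code from the post itself) showing that the two example sentences collapse into the same bag of words while their bigrams differ. The helper names `tokens`, `bag_of_words`, and `bag_of_bigrams` are made up for illustration, and the whitespace tokenizer is a deliberately naive assumption.

```python
from collections import Counter


def tokens(text):
    # Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()


def bag_of_words(text):
    # Count each word, ignoring the order in which words appear.
    return Counter(tokens(text))


def bag_of_bigrams(text):
    # Count adjacent word pairs, which keeps a little bit of word order.
    toks = tokens(text)
    return Counter(zip(toks, toks[1:]))


a = "It's not reliable and it's not cheap"
b = "It's cheap and it's not not reliable"

same_unigrams = bag_of_words(a) == bag_of_words(b)
same_bigrams = bag_of_bigrams(a) == bag_of_bigrams(b)

print(same_unigrams)  # True: the unigram counts are identical
print(same_bigrams)   # False: e.g. ("not", "cheap") appears only in the first sentence
```

A classifier that sees only the unigram counts cannot tell these two sentences apart; one that also sees bigrams at least has features like ("not", "cheap") and ("not", "not") to work with.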

CorpusLinguistics, DataScience, Emotion, MachineLearning, NLP | Tyler Schnoebelen | September 8, 2016

Training an AI doctor

Some of the earliest applications of artificial intelligence in healthcare were in diagnosis—it was a major push in expert systems, for example, where you aim to build up a knowledge base that lets software be as good as a human clinician. Expert systems hit their peak in the late 1980s, but required a lot of knowledge to be encoded by people who had lots of other things to do. Hardware was also a problem for AI in the 1980s.

ArtificialIntelligence, MachineLearning, DataScience | Tyler Schnoebelen | August 17, 2016

Feminism since 2004

A sociologist and I were recently talking about events that have affected discussions about feminism. I went over to Google Trends to check out what people have been searching for and when peaks have appeared.

DataScience, Gender | Tyler Schnoebelen | August 10, 2016

Failed vs. fighting: the linguistic differences between speeches at the RNC and the DNC conventions

We know that Republicans and Democrats talk differently, but what’s the best way to describe these differences? Commentators note the relative darkness of the Republican National Convention and the focus on optimism and higher production quality for the Democratic National Convention. Looking at the words speakers use helps, but you can’t just use simple frequency (for details, check out the methodology section at the bottom).

CorpusLinguistics, DataScience, NLP, Politics | Tyler Schnoebelen | August 1, 2016