ETL Data Matching
A major topic of ETL is string matching and data matching. While string matching returns a similarity score between two strings, data matching returns a similarity score between two tuples (rows) in a database. For example, we would like to determine whether the tuples (David Smith, 608-245-4367, Madison WI) and (D. M. Smith, 245-4367, Madison WI) refer to the same person.
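As a quick illustration of the string-matching half of the problem, here is a minimal sketch that scores the two name fields from the example above using Python's standard-library difflib (the variable name is illustrative):
import difflib
# SequenceMatcher.ratio() returns a similarity score in [0, 1]; 1.0 means identical
name_sim = difflib.SequenceMatcher(None, "David Smith", "D. M. Smith").ratio()
print(name_sim)  # ~0.64: similar, but not decisive on its own
A per-field score like this is exactly what the tuple-level matcher built below aggregates.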
The problem of data matching arises in many integration situations. In the simplest case, we may be merging multiple databases with identical schemas, but without a unique global ID, and want to decide which rows are duplicates. The problem becomes harder when we need to join rows from sources that have different schemas. Data matching may also arise at query time: a query may refer imprecisely to a particular data item, e.g., a query asking for the phone number of a David Smith who lives in Madison. We need data matching techniques to decide which tuples in the database match the query.
In this project, we will complete a cross-dataset join across two terrorism databases with different schemas: GTD (the Global Terrorism Database) and GDELT (the Global Database of Events, Language, and Tone).
Techniques
We cover several classes of solutions to the data matching problem. The first kind employs handcrafted rules, or Rule-based Matching, to match tuples. These techniques typically make heavy use of domain-specific knowledge, in domains where the complexity of the rules is manageable. The next kind of solution, Learning-based Matching, learns the appropriate rules from labeled examples using supervised learning. The third kind, Clustering, does not use training data. Instead, it iteratively assigns tuples to clusters, such that all tuples within a single cluster match and those across clusters do not.
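This project implements the rule-based approach. For contrast, here is a minimal sketch of what the learning-based flavor could look like if we had labeled match/non-match pairs; the feature values and labels below are invented for illustration:
import numpy as np
from sklearn.linear_model import LogisticRegression
# each row holds per-attribute similarity scores for one candidate pair,
# e.g. [location_sim, date_sim]; labels mark known matches (1) and non-matches (0)
X_train = np.array([[0.95, 0.90], [0.10, 0.80], [0.85, 0.20], [0.05, 0.15]])
y_train = np.array([1, 0, 0, 0])
clf = LogisticRegression().fit(X_train, y_train)
# the learned coefficients play the role of the handcrafted weights used below
print(clf.coef_)
print(clf.predict_proba([[0.90, 0.85]])[:, 1])  # estimated P(match) for a new pair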
import pandas as pd
import datetime
import time
%pylab inline
Populating the interactive namespace from numpy and matplotlib
Data
Load the two datasets
gtd = pd.read_excel("data/gtd_12to15_0616dist.xlsx")
headers = pd.read_excel("data/GDELT Metadata.xlsx").columns.values
gdelt = pd.read_csv("data/20150108.export.txt", delimiter="\t", names=headers, parse_dates=["Day"])
/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2902: DtypeWarning: Columns (8,9,10,11,14,19) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
Cleaning
Remove tuples where lat/lng or date is NA.
gtd = gtd.dropna(subset=['latitude', 'longitude', 'iday'])
gdelt = gdelt.dropna(subset=["ActionGeo_Lat", "ActionGeo_Long", "Day"])
Common Years
The two datasets intersect in the years 2014-2015:
gtd.iyear.value_counts()
2014 16737
2015 14712
2013 11892
2012 8448
Name: iyear, dtype: int64
The years in GDELT are distributed strangely:
gdelt.Year.value_counts()
2015 155997
2014 2744
2005 44
Name: Year, dtype: int64
set(gtd.iyear).intersection(set(gdelt.Year))
{2014, 2015}
Filter for 2014-2015
# limit to the overlapping years 2014-2015, since both datasets have full 2015 data
gtdf = gtd[(gtd.iyear == 2015) | (gtd.iyear == 2014)]
gdeltf = gdelt[(gdelt.Year == 2015) | (gdelt.Year == 2014)].sample(10000)  # sample a subset of the data for fast testing
Rule-Based Matching
Common Columns
We need to define the mapping between the columns that are in common between the two datasets.
I could only identify 2 column mappings from GTD to GDELT. In the list below, the key is the GTD column and the value is the GDELT column:
- iyear/imonth/iday : Day - convert to integer seconds and measure Euclidean distance
- latitude/longitude : ActionGeo_Lat/ActionGeo_Long - convert to a 2D vector and measure Euclidean distance
Similarity Metric
We begin by covering approaches that employ handcrafted matching rules. A simple yet popular type of rule computes the similarity score between a pair of tuples $x$ and $y$ as a linearly weighted combination of the individual similarity scores:
$$sim(x, y) = \sum_{i=1}^{n} \alpha_i \cdot s_i(x, y)$$
where $n$ is the number of common columns between tables $X$ and $Y$ (in this case 2), $s_i(x,y) ∈ [0,1]$ is a similarity score between the $i$th attributes of $x$ and $y$, and $\alpha_i ∈ [0, 1]$ is a prespecified weight that indicates the importance of the $i$th attribute to the overall similarity score, such that $\sum_{i=1}^n \alpha_i = 1$. We declare $x$ and $y$ matched if $sim(x,y) ≥ β$ for a prespecified threshold $β$, and not matched otherwise.
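For example, with $n = 2$ and equal weights $\alpha_1 = \alpha_2 = 0.5$, a pair that scores $s_1(x,y) = 0.9$ on location and $s_2(x,y) = 0.7$ on date gets $sim(x,y) = 0.5 \cdot 0.9 + 0.5 \cdot 0.7 = 0.8$, a match for any threshold $β ≤ 0.8$.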
Python Implementation
Lat/Lng
First we take each lat/lng pair and put it in a 2D vector format so we can take the Euclidean distance.
Date
We take the date data from each tuple and put it in native Python datetime format. Then we convert each datetime into an integer representing the seconds from the epoch. We can then take the Euclidean distance between the two values.
Computational Complexity
This routine is a loop within a loop, so the number of comparisons grows as O(n^2): 5,150,259,916 comparisons if we searched all of 2014-2015, which is why GDELT was sampled down to 10,000 rows above.
print (gtdf.shape, gdeltf.shape)
gtdf.shape[0] * gdeltf.shape[0]
((31449, 137), (10000, 58))
314490000
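One standard way to tame this quadratic blow-up, instead of (or in addition to) sampling, is blocking: only compare pairs that agree on some cheap key, such as the event date. A minimal sketch under that assumption; gdelt_by_day and block are my own illustrative names:
# group GDELT rows by calendar date so each GTD tuple is only compared
# against candidates from the same day, not the full cross product
gdelt_by_day = {day: group for day, group in gdeltf.groupby(gdeltf.Day.dt.date)}

def block(target):
    # return only the GDELT rows sharing the target's date (or an empty frame)
    key = datetime.date(int(target.iyear), int(target.imonth), int(target.iday))
    return gdelt_by_day.get(key, gdeltf.iloc[0:0])
This trades recall (events recorded on adjacent days would never be compared) for a large cut in the number of comparisons.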
# Python implementation of the sim function above, with 2 matching rules (lat/lng and date)
# arguments are pandas Series (rows), one from each database
def sim(target, match):
    score = 0
    latlng_sim = sim_latlng(target, match)
    score += W_LATLNG * latlng_sim
    date_sim = sim_date(target, match)
    score += W_DATE * date_sim
    return score
# lat/lng similarity: pull each coordinate pair into a 1x2 numpy array
# and score the two points with the Euclidean similarity defined below
def sim_latlng(target, match):
    target_latlng = np.array(target[["latitude", "longitude"]].values)
    match_latlng = np.array(match[["ActionGeo_Lat", "ActionGeo_Long"]].values)
    return euclid_sim(target_latlng, match_latlng)
# date similarity: build a datetime from the GTD year/month/day parts, convert
# both dates to integers representing seconds from the epoch, and score them
# with the Euclidean similarity
def sim_date(target, match):
    target_date_parts = target[["iyear", "imonth", "iday"]].values
    # cast to int: dropna can leave these columns as floats
    target_date = datetime.datetime(int(target_date_parts[0]), int(target_date_parts[1]), int(target_date_parts[2]))
    target_seconds = (target_date - EPOCH).total_seconds()
    match_date = match["Day"]
    match_seconds = (match_date - EPOCH).total_seconds()
    return euclid_sim(target_seconds, match_seconds)
# convert Euclidean distance into a similarity score in (0, 1]:
# identical inputs score 1.0 and the score decays toward 0 as distance grows
def euclid_sim(a, b):
    dist = np.linalg.norm(a - b)
    return 1 / (1 + dist)
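A quick sanity check of the distance-to-similarity mapping, with made-up points:
# identical inputs score 1.0; the score halves at distance 1 and keeps decaying
print(euclid_sim(np.array([43.07, -89.40]), np.array([43.07, -89.40])))  # 1.0
print(euclid_sim(np.array([0.0, 0.0]), np.array([1.0, 0.0])))            # 0.5
print(euclid_sim(np.array([0.0, 0.0]), np.array([9.0, 0.0])))            # 0.1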
Results
Here we iterate through our targets in GTD, compare each against every candidate match in GDELT, and keep the closest match. A match is defined as the candidate with the highest score, selected by nanargmax below. We can also configure our weights here:
%%time
# weights; should sum to 1
W_LATLNG = 0.5
W_DATE = 0.5
EPOCH = datetime.datetime.utcfromtimestamp(0)
# threshold (beta)
THRESHOLD = 0.4
# choose which GTD tuples to match against
targets = gtdf.sample(500)
matches = {}
for _, target in targets.iterrows():
    scores = []
    for _, row in gdeltf.iterrows():
        scores.append(sim(target, row))
    # index of the best-scoring GDELT candidate
    i = np.nanargmax(scores)
    high_score = scores[i]
    # if the best score clears the threshold, consider it a match
    if high_score >= THRESHOLD:
        match = gdeltf.iloc[i]
        matches[target.eventid] = (high_score, match.GlobalEventID)
    else:
        matches[target.eventid] = (None, None)
CPU times: user 1h 26min 10s, sys: 48.1 s, total: 1h 26min 58s
Wall time: 1h 26min 48s
Report
Print out the top 200 matches by score.
sorted_matches = sorted(matches.items(), key=lambda x: x[1][0], reverse=True)
top200 = sorted_matches[:200]
gtdids = [m[0] for m in top200]
scores = [m[1][0] for m in top200]
gdeltids = [m[1][1] for m in top200]
report = pd.DataFrame({"GTD": gtdids, "Score": scores, "GDELT": gdeltids})
report[["GTD", "Score", "GDELT"]].head(200)
| | GTD | Score | GDELT |
|---|---|---|---|
0 | 201501080052 | 0.999093 | 331067090 |
1 | 201501080005 | 0.989967 | 331080176 |
2 | 201501080044 | 0.79047 | 331118383 |
3 | 201501080026 | 0.774984 | 331004618 |
4 | 201401080033 | 0.638378 | 330987548 |
5 | 201501010069 | 0.528406 | 330990691 |
6 | 201404150072 | 0.499986 | 331138117 |
7 | 201405310048 | 0.499661 | 331083568 |
8 | 201510030022 | 0.499625 | 331122461 |
9 | 201512090035 | 0.499625 | 331122461 |
10 | 201501260068 | 0.499093 | 331067090 |
11 | 201411040061 | 0.498777 | 331099534 |
12 | 201501100067 | 0.498382 | 331067090 |
13 | 201512140021 | 0.498276 | 331084035 |
14 | 201501020078 | 0.498071 | 331008363 |
15 | 201512150074 | 0.497702 | 331033950 |
16 | 201406040063 | 0.497232 | 330989742 |
17 | 201401180006 | 0.4968 | 330990919 |
18 | 201401240009 | 0.496475 | 331065696 |
19 | 201405310013 | 0.496462 | 331148592 |
20 | 201412060035 | 0.496462 | 331092941 |
21 | 201504300066 | 0.496336 | 331083542 |
22 | 201402130064 | 0.496141 | 331064994 |
23 | 201412050036 | 0.495911 | 331104414 |
24 | 201407110039 | 0.495712 | 331117831 |
25 | 201512310032 | 0.495473 | 331012394 |
26 | 201404060013 | 0.495441 | 331117831 |
27 | 201506240032 | 0.495213 | 331012394 |
28 | 201511030068 | 0.494972 | 331117137 |
29 | 201403030061 | 0.494948 | 331025843 |
... | ... | ... | ... |
170 | 201505020118 | 0.415337 | 331070826 |
171 | 201506270007 | 0.412257 | 331117273 |
172 | 201506180020 | 0.410705 | 331085867 |
173 | 201507160029 | 0.409363 | 331148592 |
174 | 201401310047 | 0.40564 | 331074011 |
175 | 201405300030 | 0.404795 | 330990919 |
176 | 201407080025 | 0.404756 | 331097434 |
177 | 201505060091 | 0.404719 | 331118008 |
178 | 201508100089 | 0.404316 | 331113107 |
179 | 201509270003 | 0.403784 | 331088203 |
180 | 201507270012 | 0.401332 | 331061982 |
181 | 201512220041 | 0.400986 | 331003379 |
182 | 201405190034 | 0.400743 | 330987548 |
183 | 201510200005 | 0.400575 | 331117795 |
184 | 201501230074 | 0.400513 | 331110928 |
185 | 201402030081 | None | None |
186 | 201509120014 | None | None |
187 | 201405270039 | None | None |
188 | 201402200088 | None | None |
189 | 201506220061 | None | None |
190 | 201501100063 | None | None |
191 | 201508100134 | None | None |
192 | 201408170029 | None | None |
193 | 201502210096 | None | None |
194 | 201411070001 | None | None |
195 | 201506220087 | None | None |
196 | 201506220088 | None | None |
197 | 201512190011 | None | None |
198 | 201512020039 | None | None |
199 | 201411240023 | None | None |
200 rows × 3 columns