ETL Data Matching

Alex Egg,

A major topic in ETL is string matching and data matching. While string matching returns a similarity score between two strings, data matching returns a similarity score between two tuples (rows) in a database. For example, we would like to determine whether the tuples (David Smith, 608-245-4367, Madison WI) and (D. M. Smith, 245-4367, Madison WI) refer to the same person.
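As a quick illustration of the string-matching half of the problem, Python's stdlib `difflib` provides one of many possible string similarity metrics (this is an illustrative sketch, not the metric used later in this project):

```python
# Illustrative string similarity using the stdlib difflib module.
from difflib import SequenceMatcher

def string_sim(a, b):
    # Ratio of matching character blocks, in [0, 1]
    return SequenceMatcher(None, a, b).ratio()

print(string_sim("David Smith", "D. M. Smith"))  # high: likely the same person
print(string_sim("David Smith", "Jane Doe"))     # low: likely different people
```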

The problem of data matching arises in many integration situations. In the simplest case, we may be merging multiple databases with identical schemas, but without a unique global ID, and want to decide which rows are duplicates. The problem is complicated when we need to join rows from sources that have different schemas. Data matching may also arise at query time. For example, a query may often imprecisely refer to a particular data item, e.g., a query asking for the phone number of a David Smith who lives in Madison. We need to employ data matching techniques to decide which tuples in the database match the query.

In this project, we will complete a cross-dataset join across two terrorism databases with different schemas: GTD and GDELT.

Techniques

We cover several classes of solutions to the data matching problem. The first kind, Rule-based Matching, employs handcrafted rules to match tuples. These techniques typically make heavy use of domain-specific knowledge, in domains where the complexity of the rules is manageable. The next kind, Learning-based Matching, learns the appropriate rules from labeled examples using supervised learning. The third kind, clustering, does not use training data. Instead, it iteratively assigns tuples to clusters, such that all tuples within a single cluster match and those across clusters do not.

import datetime
import pandas as pd
import time
%pylab inline
Populating the interactive namespace from numpy and matplotlib

Data

Load the two datasets

gtd=pd.read_excel("data/gtd_12to15_0616dist.xlsx")
headers = pd.read_excel("data/GDELT Metadata.xlsx").columns.values
gdelt = pd.read_csv("data/20150108.export.txt", delimiter="\t", names=headers, parse_dates=["Day"])
/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2902: DtypeWarning: Columns (8,9,10,11,14,19) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Cleaning

Remove tuples where the lat/lng or date is NaN

gtd = gtd.dropna(subset=['latitude', 'longitude', 'iday'])
gdelt = gdelt.dropna(subset=["ActionGeo_Lat", "ActionGeo_Long", "Day"])

Common Years

The two datasets intersect for years 2014-2015

gtd.iyear.value_counts()
2014    16737
2015    14712
2013    11892
2012     8448
Name: iyear, dtype: int64

The years in GDELT are distributed strangely:

gdelt.Year.value_counts()
2015    155997
2014      2744
2005        44
Name: Year, dtype: int64
set(gtd.iyear).intersection(set(gdelt.Year))
{2014, 2015}

Filter for 2015

#limit to years 2014-2015 b/c both datasets have data for those years
gtdf = gtd[(gtd.iyear==2015) | (gtd.iyear==2014)]
gdeltf = gdelt[(gdelt.Year==2015) | (gdelt.Year==2014)].sample(10000) #sampling a subset of the data for fast testing

Rule-Based Matching

Common Columns

We need to define the mapping between the columns that are in common between the two datasets.

I could only identify 2 attributes that map GTD to GDELT: location (latitude/longitude → ActionGeo_Lat/ActionGeo_Long) and date (iyear/imonth/iday → Day).

Similarity Metric

We begin by covering approaches that employ handcrafted matching rules. A simple yet popular type of rule computes the similarity score between a pair of tuples $x$ and $y$ as a linearly weighted combination of the individual similarity scores:

$$sim(x,y) = \sum_{i=1}^{n} \alpha_i \, s_i(x,y)$$

where $n$ is the number of common columns between tables $X$ and $Y$ (in this case 2), $s_i(x,y) ∈ [0,1]$ is a similarity score between the $i$th attributes of $x$ and $y$, and $\alpha_i ∈ [0, 1]$ is a prespecified weight that indicates the importance of the $i$th attribute to the overall similarity score, such that $\sum_{i=1}^n \alpha_i = 1$. We declare $x$ and $y$ matched if $sim(x,y) ≥ β$ for a prespecified threshold $β$, and not matched otherwise.
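For example, with equal weights $\alpha_1 = \alpha_2 = 0.5$, a location similarity of $s_1(x,y) = 0.9$ and a date similarity of $s_2(x,y) = 0.7$ give

$$sim(x,y) = 0.5 \cdot 0.9 + 0.5 \cdot 0.7 = 0.8$$

which exceeds a threshold of, say, $β = 0.4$, so the pair is declared a match.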

Python Implementation

Lat/Lng

First we take each lat/lng pair and put it in a 2D vector format so we can take the Euclidean distance.

Date

We take the date data from each tuple and put it in native Python datetime format. Then we convert each datetime into an integer representing the seconds from the epoch. We can then take the Euclidean distance between two datetimes.
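This conversion can be sketched as follows (illustrative dates; the real code below reads iyear/imonth/iday from GTD and Day from GDELT):

```python
# Sketch of the date-to-epoch-seconds conversion described above.
import datetime

EPOCH = datetime.datetime.utcfromtimestamp(0)

def to_epoch_seconds(dt):
    # Seconds elapsed since 1970-01-01 00:00:00 UTC
    return (dt - EPOCH).total_seconds()

a = to_epoch_seconds(datetime.datetime(2015, 1, 8))
b = to_epoch_seconds(datetime.datetime(2015, 1, 9))
print(abs(a - b))  # 86400.0 -- the two dates are one day apart
```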

Computational Complexity

This routine is a nested loop: O(n·m) growth complexity, with 5,150,259,916 comparisons to run if we limit the search to 2014-2015; sampling 10,000 GDELT rows reduces this to 314,490,000.

print (gtdf.shape, gdeltf.shape)
gtdf.shape[0] * gdeltf.shape[0]
((31449, 137), (10000, 58))
314490000
#python implementation of the sim function above w/ 2 matching rules for lat/lng and date
#arguments are pandas series (row) from each database
def sim(target, match):
    sim = 0

    latlng_sim = sim_latlng(target, match)
    sim += (W_LATLNG * latlng_sim)
    
    date_sim = sim_date(target, match)
    sim += (W_DATE * date_sim)
    
    return sim
#define lat/lng sim function
#builds a 1x2 numpy array from each row's coordinates and compares them
def sim_latlng(target, match):
    target_latlng = np.array(target[["latitude", "longitude"]].values)
    
    match_latlng = np.array(match[["ActionGeo_Lat", "ActionGeo_Long"]].values)
    
    sim = euclid_sim(target_latlng, match_latlng)
    
    return sim
#try to match on date
#converts both dates to integers representing seconds from the epoch
def sim_date(target, match):
    #GTD stores the date as separate year/month/day integer columns
    target_date_parts = target[["iyear", "imonth", "iday"]].values
    target_date = datetime.datetime(int(target_date_parts[0]), int(target_date_parts[1]), int(target_date_parts[2]))
    target_seconds = (target_date - EPOCH).total_seconds()
    
    match_date = match["Day"]
    match_seconds = (match_date - EPOCH).total_seconds()
    
    sim = euclid_sim(target_seconds, match_seconds)
    return sim
def euclid_sim(a,b):
    #map Euclidean distance into a (0, 1] similarity score
    dist = np.linalg.norm(a-b)
    prob = 1 / (1 + dist)
    return prob
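To see how the $1/(1+dist)$ kernel behaves, here is a small self-contained illustration (the coordinates are hypothetical): identical points score exactly 1.0, and the score decays toward 0 as the points diverge.

```python
# Illustration of the 1 / (1 + distance) similarity kernel.
import numpy as np

def euclid_sim(a, b):
    dist = np.linalg.norm(a - b)
    return 1 / (1 + dist)

p = np.array([43.07, -89.40])  # hypothetical Madison, WI lat/lng
print(euclid_sim(p, p))                          # 1.0 (identical points)
print(euclid_sim(p, np.array([43.07, -89.50])))  # nearby point: close to 1
print(euclid_sim(p, np.array([40.71, -74.00])))  # distant point: close to 0
```

One caveat worth noting: the kernel is scale-sensitive. Distances measured in seconds (for dates) are orders of magnitude larger than distances in degrees, so the date term contributes meaningfully only when two events fall on (nearly) the same day.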

Results

Here we iterate through our targets in GTD, compare each against every candidate match in GDELT, and return the closest match. A match is the candidate with the highest score, computed by nanargmax below. We can also configure our weights here too:

%%time

#weights should sum to 1
W_LATLNG=0.5
W_DATE=0.5
EPOCH = datetime.datetime.utcfromtimestamp(0)

# threshold (beta)
THRESHOLD = 0.4

#choose what tuples of GTD you want to match against
targets = gtdf.sample(500)
matches = {}

for _, target in targets.iterrows(): 
    scores = []
    for _, row in gdeltf.iterrows():
        score = sim(target, row)
        scores.append(score)

    i = np.nanargmax(scores)
    high_score = scores[i]
    #if above threshold, consider it a match
    if high_score >= THRESHOLD:
        match = gdeltf.iloc[i]
        matches[target.eventid] = (high_score, match.GlobalEventID)
    else:
        matches[target.eventid] = (None, None)

CPU times: user 1h 26min 10s, sys: 48.1 s, total: 1h 26min 58s
Wall time: 1h 26min 48s
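Most of that runtime is the Python-level iterrows double loop. A vectorized alternative (a sketch under toy data, not the code that produced the results below) computes all pairwise scores at once with NumPy broadcasting; at full scale the score matrix would need to be chunked to fit in memory:

```python
# Sketch: all-pairs scoring via NumPy broadcasting instead of a double loop.
import numpy as np

W_LATLNG, W_DATE = 0.5, 0.5

# Hypothetical toy data: 3 GTD rows vs 4 GDELT rows
gtd_latlng = np.array([[43.1, -89.4], [33.9, 35.5], [48.9, 2.3]])
gdelt_latlng = np.array([[43.1, -89.4], [34.0, 35.5], [51.5, -0.1], [48.9, 2.3]])
gtd_secs = np.array([0.0, 86400.0, 172800.0])
gdelt_secs = np.array([0.0, 86400.0, 86400.0, 172800.0])

# Pairwise distances via broadcasting: both have shape (3, 4)
d_latlng = np.linalg.norm(gtd_latlng[:, None, :] - gdelt_latlng[None, :, :], axis=2)
d_date = np.abs(gtd_secs[:, None] - gdelt_secs[None, :])

# Weighted combination of the two 1/(1+dist) similarity kernels
scores = W_LATLNG / (1 + d_latlng) + W_DATE / (1 + d_date)

best = scores.argmax(axis=1)  # best GDELT candidate per GTD row
print(best)
```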

Report

Print out top 200 matches by score

sorted_matches = sorted(matches.items(), key=lambda x: x[1][0], reverse=True)
top200 = sorted_matches[0:200]
gtdids = [m[0] for m in top200]
scores = [m[1][0] for m in top200]
gdeltids = [m[1][1] for m in top200]
report = pd.DataFrame({"GTD": gtdids, "Score": scores, "GDELT": gdeltids})
report [["GTD", "Score", "GDELT"]].head(200)
GTD Score GDELT
0 201501080052 0.999093 331067090
1 201501080005 0.989967 331080176
2 201501080044 0.79047 331118383
3 201501080026 0.774984 331004618
4 201401080033 0.638378 330987548
5 201501010069 0.528406 330990691
6 201404150072 0.499986 331138117
7 201405310048 0.499661 331083568
8 201510030022 0.499625 331122461
9 201512090035 0.499625 331122461
10 201501260068 0.499093 331067090
11 201411040061 0.498777 331099534
12 201501100067 0.498382 331067090
13 201512140021 0.498276 331084035
14 201501020078 0.498071 331008363
15 201512150074 0.497702 331033950
16 201406040063 0.497232 330989742
17 201401180006 0.4968 330990919
18 201401240009 0.496475 331065696
19 201405310013 0.496462 331148592
20 201412060035 0.496462 331092941
21 201504300066 0.496336 331083542
22 201402130064 0.496141 331064994
23 201412050036 0.495911 331104414
24 201407110039 0.495712 331117831
25 201512310032 0.495473 331012394
26 201404060013 0.495441 331117831
27 201506240032 0.495213 331012394
28 201511030068 0.494972 331117137
29 201403030061 0.494948 331025843
... ... ... ...
170 201505020118 0.415337 331070826
171 201506270007 0.412257 331117273
172 201506180020 0.410705 331085867
173 201507160029 0.409363 331148592
174 201401310047 0.40564 331074011
175 201405300030 0.404795 330990919
176 201407080025 0.404756 331097434
177 201505060091 0.404719 331118008
178 201508100089 0.404316 331113107
179 201509270003 0.403784 331088203
180 201507270012 0.401332 331061982
181 201512220041 0.400986 331003379
182 201405190034 0.400743 330987548
183 201510200005 0.400575 331117795
184 201501230074 0.400513 331110928
185 201402030081 None None
186 201509120014 None None
187 201405270039 None None
188 201402200088 None None
189 201506220061 None None
190 201501100063 None None
191 201508100134 None None
192 201408170029 None None
193 201502210096 None None
194 201411070001 None None
195 201506220087 None None
196 201506220088 None None
197 201512190011 None None
198 201512020039 None None
199 201411240023 None None

200 rows × 3 columns

Last edited by Alex Egg, 2017-06-06 02:28:53