Previously in Part 1, I talked about how to clean-up the transaction data from Mint to remove duplicates and add any missing transactions.
Solving for transfers
The next phase is to solve for the transfer pairs. A transfer pair is defined with a matching credit and debit transaction on two different accounts. In Firefly, a transfer is treated separately than a credit/debit because it’s excluded from the expense and income reports.
This ended up being trickier than I expected and I went through multiple phases. First attempt, I used Pandas merge_asof, but this did not handle multiple transactions around the same time. Then I developed a new method that ignored previously used transactions. This worked better, but it still struggled with complex transaction chains. For example, if I transferred $1k from account A to B, then from C to D, it might accidentally book a transfer from A to D and C to B which is incorrect.
The basic algorithm
Here we have the first iteration. How it works is for every single transaction, it looks forward 5 days, finds all transactions that are the opposite type (credit -> debit or vice versa) and the same amount, then tags the first one as the opposite. For example, in the below table, it’d identify the two $300 transactions as a pair.
Date | Description | Amount | Type | Account |
---|
1/1/2024 | Transfer to XYZ | $300.00 | debit | ABC |
1/1/2024 | Bought Avocado Toast | $4.99 | debit | Credit Card |
1/2/2024 | Transfer from ABC | $300.00 | credit | XYZ |
The code looked like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
| def find_transfer(row, df, time_window_days=5):
start_date = row['Date']
end_date = start_date + timedelta(days=time_window_days)
opposing_filter = (
(df['Account Name'] != row['Account Name']) &
(df['Date'] >= start_date) &
(df['Date'] <= end_date) &
(df['Amount'] == row['Amount']) &
(df['Transaction Type'] != row['Transaction Type']) &
(~df['Considered'])
)
transfer_pair = df[opposing_filter]
if not transfer_pair.empty:
df.at[row.name, 'Considered'] = True
opposite = transfer_pair.iloc[0]
return opposite
else:
return None
def process_record(row, df):
if df.at[row.name, 'Considered']:
return None
pair = find_transfer(row, df)
if pair is not None:
df.at[pair.name, 'Considered'] = True
notes = [
row['Notes'],
pair['Notes'],
]
labels = list(filter(lambda x: isinstance(x, str) and x != '', [
row['Labels'],
pair['Labels']
]))
if row['Transaction Type'] == 'debit':
# Bank is on the left
date = row['Date']
process_date = pair['Date']
notes.append(pair['Original Description'])
return {
'type': 'transfer',
'date': row['Date'],
'process_date': pair['Date'],
'amount': row['Amount'],
'category': pair['Category'],
'description': row['Original Description'],
'source_id': row['AccountId'],
'destination_id': pair['AccountId'],
'tags': labels,
'format': 'transfer_debit',
'notes': '\n'.join(filter(lambda x: isinstance(x, str) and x != '', notes))
}
else:
date = pair['Date']
process_date = row['Date']
notes.append(row['Original Description'])
return {
'type': 'transfer',
'date': date,
'process_date': process_date,
'amount': row['Amount'],
'category': pair['Category'],
'description': pair['Original Description'],
'source_id': pair['AccountId'],
'destination_id': row['AccountId'],
'tags': labels,
'format': 'transfer_credit',
'notes': '\n'.join(filter(lambda x: isinstance(x, str) and x != '', notes))
}
result = {
'date': row['Date'],
'description': row['Original Description'],
'amount': row['Amount'],
'category': row['Category'],
'tags': row['Labels'],
'notes': row['Notes']
}
if row['Transaction Type'] == 'credit':
result['type'] = 'deposit'
result['destination_id'] = row['AccountId']
result['source_name'] = row['Description']
else:
result['source_id'] = row['AccountId']
result['destination_name'] = row['Description']
result['type'] = 'withdrawal'
return result
transactions['Considered'] = False
transactions.apply(func=process_record, axis=1, result_type='expand')
|
However, it made a lot of mistakes. For example, it would classify the following as a transfer from Venmo to the Credit Card. Which maybe one could say it’s a “transfer”, but I didn’t want that because the credit card would get paid by my checking account.
Date | Description | Amount | Type | Account |
---|
1/1/2024 | Bought Dinner for my friends | $50 | debit | Credit Card |
1/1/2024 | Friend paid me back | $50 | credit | Venmo |
Eventually, I extended that algorithm with rules based on how my accounts behaved. Let’s look at some of those options.
Not all transactions are transfers
Not all transactions are transfers, but sometimes it lines up that it looks like it’s a transfer because there’s a matching side. Just like the previous example. We can define some rules that match transactions that we know are not transfers.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
| def account(num: Union[int, List[int]]):
if type(num) == int:
return transactions['AccountId'] == num
else:
return transactions['AccountId'].isin(num)
def orig_descr_contains(msg: str):
return transactions['Original Description'].str.contains(msg, case=False)
# List of all Firefly account ids that are credit cards
credit_cards = [5, 10, 15, 20]
brokerages = []
account_id_401k = 25
not_a_transfer = [
# Maybe exclude an entire account. For example, your 401k may never have
# a transfer (unless you take a loan or do a Mega Backdoor Roth)
account(account_id_401k),
# Or you want to exclude reimbursements via Venmo
account(23) & orig_descr_contains('John Doe Paid '), # Venmo
account(23) & ~transactions['Category'].isna() & (transactions['Category'] != 'Transfer'), # Venmos frequently appear like transfers because a friend reimburses me
# Credit Card Payments (to the card) can be transfers, but
# maybe they're never transfers going out (unless you do a Balance Transfer)
account(credit_cards) & (transactions['Transaction Type'] == 'debit'),
# A paycheck
# While it is a transfer from your employer to you, without the
# debit side, the algorithm could incorrectly classify it as a transfer
orig_descr_contains('PAYROLL') | (transactions['Category'] == 'Paycheck'),
# Interest paid is not a transfer
orig_descr_contains('INTEREST PAYMENT'),
transactions['Category'] == 'Interest Income',
# Dividend reinvestment are not transfers
account(brokerages) & (orig_descr_contains('DIVDEND') | orig_descr_contains('REINVESTMENT')),
# Some accounts have $0 transactions as part of bookkeeping
# They are not transfers
(transactions['Amount'] == 0)
]
|
Then we can integrate it into the algorithm. While searching for new transactions, it excludes anything that is not considered relevant.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
| # Compute the exclusion set
df_filter_relevant = ~reduce(lambda x, y: x | y, not_a_transfer)
def find_transfer(row, df, time_window_days=5):
# ...
opposing_filter = (
(df['Account Name'] != row['Account Name'])
& (df['Date'] >= start_date)
& (df['Date'] <= end_date)
& (df['Amount'] == row['Amount'])
& (df['Transaction Type'] != row['Transaction Type'])
& (~df['Considered'])
& df_filter_relevant
)
# ...
def process_record(row, df):
if df.at[row.name, 'Considered']:
return None
if df_relevant.loc[row.name]:
transfer_pair = find_transfer(row, df)
# ...
# ...
output = transactions.apply(func=process_record, axis=1, result_type='expand')
|
Figuring out the rules is tricky. What I do is to actually look at both sides of the transfers in the notes:
1
| output[output['type'] == 'transfer']
|
And see what makes sense and what doesn’t. If you can’t figure out how to define a rule that ignores that transaction, then go back to your CSV file, add a new column called IsTransfer
and put true or false into rows that the algorithm struggles with (you don’t have to annotate every row.) Then reload the file and add a rule:
1
2
3
4
| not_a_transfer = [
# ...
~transactions['IsTransfer']
]
|
Heuristic: Account Id in Description
The next problem is that if I make transfers with the same amount during the same time period, it can get confused and depending on the ordering in the transactions CSV, mis-attribute the transfer between the wrong accounts.
This led to a new idea to use the description as a signal. For example, in the transaction description:
ONLINE TRANSFER REF #XYZ1234 TO SAVINGS XXXXXX1234 ON 11/2/11
we can see that I transferred to account X1234
. I know that this account corresponds to Firefly account XYZ and can seed the algorithm to search for credits in that account.
Not all transfers include the exact account id though, especially ones that cross banks will usually show something vague. Like a transfer from one bank to another bank might just show FOOBANK ONLINE TRANSFER
and that bank has multiple candidate accounts.
1
2
3
4
5
6
7
| known_account_ids = {
'X1234': 1,
'X4561': 3,
'FOO BANK $TRANSFER': [1, 2, 3, ...],
# ...
}
|
This says if we find a transaction that says X1234
, then we should look for matching transactions in account id 1
. It also says that if we have transaction in account 1
, we should try to find transactions that contain either X1234
or FOO BANK $TRANSFER
, and if neither can be find, expand the scope to any matching transactions. So in the following example we have two transfers that all have the same amount, but must be disambiguated.
# | Description | Amount | Type | Account | Pair |
---|
1 | Transfer to X1234 | $100 | debit | 4561 | 4 |
2 | Transfer | $100 | credit | 3 | 3 |
3 | Transfer to FOO BANK $TRANSFER | $100 | debit | 5678 | 2 |
4 | Deposit | $100 | credit | 1 | 1 |
In transaction #1, we’d find X1234 and look for transactions in account 1
, thus finding #4 as the pair. In transaction #2, we have no account ids, but we if we look for transactions that contain either FOO BANK $TRANSFER
or X4561
, then we find transaction #3.
Wiring this into the algorithm is hairy, but here goes nothing. First, we identify any matching account ids defined above, along with the inverse.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
| def attempt(df, base, attempt):
if len(df[base & attempt]) > 0:
return base & attempt
else:
return base
def find_transfer(row, df, time_window_days=5):
# ...
match_by_description = set()
inverted_by_account_id_in_description = []
for key, value in known_account_ids.items():
# If we find an account id in this record's description
# then refine the other side by the found account id(s)
if key in row['Original Description'].upper():
if isinstance(value, list):
for id in value:
match_by_description.add(id)
else:
match_by_description.add(value)
# Find possible descriptions based on this row's
# account id
if (isinstance(value, list) and row['AccountId'] in value) or value == row['AccountId']:
foo.append(key)
inverted_by_account_id_in_description.append(df['Original Description'].str.contains(key, case=False))
|
Then we use this information to attempt different variations of the pair finder algorithm and if any of them return a result, we use that transactions as our pair.
1
2
3
4
5
6
7
8
9
10
11
12
13
| # ...
if len(match_by_description) > 0:
opposing_filter = attempt(df, opposing_filter, df['AccountId'].isin(match_by_description))
if len(inverted_by_account_id_in_description) > 0:
filter_by_invert = reduce(lambda x, y: x | y, inverted_by_account_id_in_description)
if len(not_inverted) > 0:
filter_by_invert = filter_by_invert & reduce(lambda x, y: x & y, not_inverted)
opposing_filter = attempt(df, opposing_filter, filter_by_invert)
transfer_pair = df[opposing_filter]
if not transfer_pair.empty:
# ...
|
Ambiguous Matches
Even with all that, we can still end up with incorrect transfer matches. For example, I frequently transferred $1k blocks between different accounts and if that happened around the same time, it could still misclassify the transfer. The next step is to identify ambiguous matches– i.e. any time the solver finds two or more candidates and just picks one to proceed. These are likely candidates to investigate further. Let’s filter for those:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
| def find_transfer(row, df, time_window_days=5):
# ...
transfer_pair = df[opposing_filter]
if not transfer_pair.empty:
df.at[row.name, 'Considered'] = True
return opposite
else:
return None
def find_ambiguous(row, df):
if df.at[row.name, 'Considered']:
return None
transfer_pair = find_transfer(row, df)
if transfer_pair is None or len(transfer_pair) == 1:
return None
print(f"- {row['Date']} {row['Account Name']} \"{row['Description']}\"")
for index, option in transfer_pair.iterrows():
delta = option['Date'] - row['Date']
print(f" {delta.days} day(s): {option['Account Name']} {option['Original Description']}")
print("")
copy['Considered'] = False
copy[df_relevant].apply(func=find_ambiguous, axis=1, df=copy)
|
This gives me something like:
1
2
3
| - 2012-03-01 Savings Account A: "DepositInternet transfer from Interest Checking account XXXXXXX6789"
0 day(s) Spending Account: "WithdrawalInternet transfer to Online Savings account XXXXXXX1234"
0 day(s) Ally Spending Account: "WithdrawalInternet transfer to Online Savings account XXXXXXX5678"
|
From here, we can see that I need to create an alias for 6789
, X1234
, X5678
in known_account_ids
.
1
2
3
4
5
| account_info_map = {
'X6789': 5, # Firefly's account id
'X1234': 6,
'X5678': 7
}
|
Then rerun the transfer solver and it’ll fix those issues
Testing
How do you know if the transfers are working correctly? There’s no automatic solution for this. I built my rules based on trial and error. I looked through the data, found issues, added rules, and re-ran.
With this code, you can view the solved transfers:
1
2
| transfers = transactions[df_filter_relevant].apply(func=process_record, axis=1, result_type='expand')
transfers[transfers['type'] == 'transfer']
|
You can also write it to a CSV and review it in a spreadsheet editor:
1
| transfers[transfers['type'] == 'transfer'].to_csv('transfers.csv')
|
Once you’re satisfied, you can move on to the next step, which is loading into Firefly-iii in the next part of this series.
Conclusion & Next Steps
In the post, I walked through the Python code to identify and solve for transfers. The final code for this can be found on the GitHub repo here.
In the next post, I’ll show how to load all this into Firefly-iii.