Tips: Cache Intermediate Results with pickle
Here’s a useful pattern I’ve been getting a lot of mileage out of lately. If you’re running an analysis that has a time-consuming step, you can save the result as a Python-readable “pickle” file. Addendum: pickling a Python object can sometimes succeed in storing and retrieving data where a library’s built-in functions for saving/loading data fail.
import os
import pickle as pkl

path = "./data_intermediates/processed_data.pkl"

if os.path.exists(path):
    # Reuse the cached result if it already exists
    processed_data = pkl.load(open(path, 'rb'))
else:
    # Make `processed_data` here
    pkl.dump(processed_data, open(path, 'wb'))
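To illustrate that addendum: pickle will round-trip an arbitrary nested Python object in one call, whereas a table-oriented writer (e.g. pandas’ to_csv()) only handles a flat table. The dictionary, values, and file name below are invented for this sketch.

import pickle as pkl

# Hypothetical nested intermediate result (names and values invented for
# illustration); a flat-table writer can't store this structure directly.
intermediates = {
    'station_ids': [101, 102, 103],
    'daily_means': {'temp': [21.4, 19.8, 22.1], 'rh': [0.61, 0.58, 0.64]},
    'notes': 'toy example values'
}

pkl.dump(intermediates, open('toy_intermediates.pkl', 'wb'))
restored = pkl.load(open('toy_intermediates.pkl', 'rb'))
assert restored == intermediates  # the whole structure survives the round trip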
This also lets you batch a process so that you can do more with your resources. For example, here’s a list comprehension that will (for each day from 0-287) rearrange the weather data to be in “long” format. This is concise, but it requires processing the whole list at once, which takes a lot of resources.
sal_long_list = [_get_weather_long(results_list = res,
                                   current_day = ith_day) for ith_day in np.linspace(start = 0, stop = 287, num = 288)]
If we incorporate it into the pattern above, we can hold fewer items in memory at a time and then merge the cached batches (e.g. with list.extend()) after the fact, as sketched after the loop below.
for ii in range(3):
    file_path = '../data/result_intermediates/sal_df_W_long_part_day' + ['0-95',
                                                                         '96-191',
                                                                         '192-287'][ii] + '.pkl'

    if os.path.exists(file_path):
        sal_long_list = pkl.load(open(file_path, 'rb'))
    else:
        # The original list comprehension is here,
        # just made messier by selecting a subset of the indices.
        sal_long_list = [_get_weather_long(
            results_list = res,
            current_day = current_day) for current_day in [
                [int(e) for e in np.linspace(start = 0,   stop = 95,  num = 96)], # Batch 1
                [int(e) for e in np.linspace(start = 96,  stop = 191, num = 96)], # Batch 2
                [int(e) for e in np.linspace(start = 192, stop = 287, num = 96)]  # Batch 3
            ][ii]
        ]
        pkl.dump(sal_long_list, open(file_path, 'wb'))
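Then, to actually merge the cached batches after the fact, you can load each partial file and extend() a running list. This is a minimal sketch that assumes the three part files written by the loop above already exist on disk; sal_long_all is just a name I’m using here for the combined list.

import pickle as pkl

sal_long_all = []  # combined list across all three batches
for part in ['0-95', '96-191', '192-287']:
    file_path = '../data/result_intermediates/sal_df_W_long_part_day' + part + '.pkl'
    # Each part file holds the list for one batch of days; extend() concatenates them
    sal_long_all.extend(pkl.load(open(file_path, 'rb')))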