referencenumber | title | postdate | url | company | city | state | description |
1652398203 | Sales Associate | 2014-07-09 13:47:18 | URL link | Company Name | City | State | Our Sales Associates are… |
ID | Title |
82 | Pediatricians, General |
area | area_title | area_type | naics | naics_title | own_code | (remaining columns omitted…) |
99 | U.S. | 1 | 000000 | Cross-industry | 1235 | 00-0000 |
2010 SOC Code | 2010 SOC Title | 2010 SOC Direct Match Title | Illustrative Example |
11-1011 | Chief Executives | CEO |
# import the necessary packages
import pandas as pd
import us
import numpy as np
from multiprocessing import Pool, cpu_count, Queue, Manager

# the data in one particular column was stored in that horrible Excel number format
# where 12000 is written as '12,000', with that beautiful useless comma in there.
# did I mention Excel bothers me?
# instead of converting the numbers right away, we only convert them when we need to
def median_maker(column):
    return np.median([int(x.replace(',', '')) for x in column])

# dictionary_of_dataframes contains a dataframe with information for each title; e.g. title is 'Data Scientist'
# related_title_score_df is the dataframe of information for the title; columns = ['title','score']
### where title is a similar_title and score is how closely the two are related, e.g. 'Data Analyst', 0.871
# code_title_df contains columns ['code','title']
# oes_data_df is a HUGE dataframe with all of the Bureau of Labor Statistics (BLS) data for a given time period (YAY FREE DATA, BOO BAD CENSUS DATA!)

def job_title_location_matcher(title, location):
    try:
        related_title_score_df = dictionary_of_dataframes[title]
        # we limit related_title_score_df to only those related titles that are above
        # a previously established threshold
        related_title_score_df = related_title_score_df[related_title_score_df['score'] > 80]
        # we merge the related titles with another table and its codes
        codes_relTitles_scores = pd.merge(code_title_df, related_title_score_df)
        codes_relTitles_scores = codes_relTitles_scores.drop_duplicates()
        # merge the two dataframes by the codes
        merged_df = pd.merge(codes_relTitles_scores, oes_data_df)
        # limit the BLS data to the state we want
        all_merged = merged_df[merged_df['area_title'] == str(us.states.lookup(location).name)]
        # calculate some summary statistics for the time we want
        group_med_emp, group_mean, group_pct10, group_pct25, group_median, group_pct75, group_pct90 = all_merged[['tot_emp', 'a_mean', 'a_pct10', 'a_pct25', 'a_median', 'a_pct75', 'a_pct90']].apply(median_maker)
        row = [title, location, group_med_emp, group_mean, group_pct10, group_pct25, group_median, group_pct75, group_pct90]
        # convert it all to strings so we can combine them all when writing to file
        row_string = [str(x) for x in row]
        return row_string
    except Exception:
        # if it doesn't work for a particular title/state just throw it out;
        # there are enough to make this insignificant
        return None
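The comma-stripping step in `median_maker` can be tried on its own without loading any BLS data. The following is a minimal sketch that swaps NumPy's `np.median` for the standard library's `statistics.median` so it runs with no third-party packages. The helper names `parse_count` and `median_of_strings` and the sample values are hypothetical and do not appear in the original post.

```python
from statistics import median


def parse_count(text):
    # Hypothetical helper: turn an Excel-style string such as '12,000' into the int 12000.
    return int(text.replace(",", ""))


def median_of_strings(values):
    # Convert lazily, only at the moment the statistic is needed,
    # mirroring the idea behind median_maker in the post.
    return median(parse_count(v) for v in values)


if __name__ == "__main__":
    # Made-up sample values, purely for illustration.
    print(median_of_strings(["12,000", "8,500", "101,250"]))
```

Keeping the column as strings until a statistic is requested avoids a conversion pass over the whole table, which is the trade-off the post's comment describes.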
# name of the output file; defined at module level so the listener process can see it
filename = 'skill_TEST_POOL.txt'

# runs the function and puts the answer in the queue
def worker(row, q):
    ans = job_title_location_matcher(row[0], row[1])
    q.put(ans)

# this writes to the file while there are still things that could be in the queue
# this allows multiple processes to write to the same file without blocking each other
def listener(q):
    with open(filename, 'w') as f:
        while True:
            m = q.get()
            if m == 'kill':
                break
            if m is None:
                # the matcher failed for this title/state, so there is nothing to write
                continue
            f.write(','.join(m) + '\n')
            f.flush()

def main():
    # load all your data, then throw out all unnecessary tables/columns

    # sets up the necessary multiprocessing tasks
    manager = Manager()
    q = manager.Queue()
    pool = Pool(cpu_count() + 2)
    # start the listener in one pool process; (q,) is its argument tuple
    watcher = pool.apply_async(listener, (q,))

    jobs = []
    # titles_states is a dataframe of millions of job titles and states they were found in
    for i in titles_states.itertuples(index=False, name=None):
        job = pool.apply_async(worker, (i, q))
        jobs.append(job)

    for job in jobs:
        job.get()

    q.put('kill')
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()
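To see the worker/listener pattern in isolation, here is a minimal, self-contained sketch that uses only the standard library. It assumes a trivial stand-in computation, `compute_row`, in place of `job_title_location_matcher`. The names `compute_row`, `run_demo`, and `STOP`, the pool size of 3, and the temporary output path are all hypothetical choices made for illustration, not part of the original post.

```python
import os
import tempfile
from multiprocessing import Manager, Pool

STOP = "kill"  # sentinel that tells the listener to stop, as in the post


def compute_row(n):
    # Hypothetical stand-in for job_title_location_matcher: returns a list of strings.
    return [str(n), str(n * n)]


def worker(n, q):
    # Runs in a pool process: compute one row and hand it to the writer via the queue.
    q.put(compute_row(n))


def listener(q, path):
    # The only function that touches the output file, so writes never interleave.
    with open(path, "w") as f:
        while True:
            m = q.get()
            if m == STOP:
                break
            f.write(",".join(m) + "\n")
            f.flush()


def run_demo(path, numbers):
    manager = Manager()
    q = manager.Queue()  # a managed queue can be passed to pool workers as an argument
    pool = Pool(3)       # one process is occupied by the listener, the rest do the work
    watcher = pool.apply_async(listener, (q, path))
    jobs = [pool.apply_async(worker, (n, q)) for n in numbers]
    for job in jobs:
        job.get()        # wait for each worker and re-raise any exception it hit
    q.put(STOP)          # every row is queued by now, so the listener can finish
    watcher.get()        # wait until the file has been written and closed
    pool.close()
    pool.join()


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "demo.txt")
    run_demo(out, range(5))
    with open(out) as f:
        print(sorted(f.read().split()))
```

Rows reach the file in whatever order the workers finish, so the demo sorts them before printing. The pool needs at least two processes here, because the listener holds one for the whole run.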
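Readers may never work in the exact environment this tutorial deals with, but multiprocessing lets you push past many hardware limits. The working environment for this example was a c3.8xl Ubuntu EC2 instance with 32 cores and 60 GB of memory (that is a lot of memory, but still not enough to hold all of the data at once). The key point is that on a machine with 60 GB of memory we effectively processed about 100 GB of data, and did so roughly 25 times faster. Using multiprocessing to automatically spread a large number of tasks across the cores of a multi-core machine can substantially improve how well the machine is utilized. Some readers may already know this approach, but for everyone else, multiprocessing can bring very large gains. By the way, this post is a continuation of the blog post "skill assets in the job-market".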