sih19006/traffic_dataproc

Do not use this code! It is not properly commented or tested.

TODOS:

  1. Lint and comment the code, and properly contain it in "main".
  2. Some of the code is split between scr_ scripts and nb_ notebooks, which makes the pipeline hard to reproduce.
  3. Document the various stages of processed data (raw, processed, etc.). The documentation below is incomplete.

Files

nb_001_... was used to explore/test some things out before writing scr_002.

scr_002 is used to process the raw OD and GPS files into Pandas-readable arrays.

nb_003 was an exploration of using area.txt and matplotlib to check shape membership. nb_004 did the same with shapely.
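The shape-membership check explored in nb_003 and nb_004 can be illustrated without either library. The sketch below is not the notebooks' code: it swaps in a plain ray-casting test, and it assumes a region is available as a list of (longitude, latitude) vertices. The layout of area.txt is not documented here, so loading it is left out, and `point_in_polygon` is a hypothetical name.

```python
from typing import List, Tuple

Polygon = List[Tuple[float, float]]  # assumed layout: [(lon, lat), ...]


def point_in_polygon(lon: float, lat: float, polygon: Polygon) -> bool:
    """Ray-casting test: count how many edges a ray going east from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge meets the horizontal ray.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Made-up square region, for illustration only.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
```

Points exactly on an edge are not handled specially here; shapely is the safer choice when boundary cases matter.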

scr_005 is a script used to output the pick region and drop region of each OD file, and export them to .CSV. (This makes it easier to speed up the quantization work!)

scr_006, like scr_005, is used to output the regions. This one is applied to each GPS file. (TODO.)

Data format

The original data, unzipped, is found in ../201407OD and ../201407GPS/201407.gz.

GPS

The files in 201407GPS are titled part-r-<sequence number> and contain a time series of values. The columns are plateID, color, longitude, latitude, time, speed (presumably km/h), and noMeaning (a column that seems to hold no meaningful data).

As far as I can tell, these files are a single list ordered by time (ascending) that was cut into parts due to its size, with the last line of the last file being the most recent entry.
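As a rough illustration of the GPS column layout, here is a minimal reader that drops the noMeaning column. It is a sketch, not scr_002: the comma delimiter is an assumption, the timestamp is kept as raw text because its format is not documented here, and `GpsPoint` and `read_gps` are hypothetical names.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO

# Column order as listed above.
GPS_COLUMNS = ["plateID", "color", "longitude", "latitude", "time", "speed", "noMeaning"]


class GpsPoint(NamedTuple):
    plate_id: str
    color: str
    longitude: float
    latitude: float
    time: str  # left as raw text; the timestamp format is an open question
    speed: float


def read_gps(handle: TextIO, delimiter: str = ",") -> Iterator[GpsPoint]:
    """Yield one GpsPoint per well-formed line, dropping the noMeaning column."""
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) != len(GPS_COLUMNS):
            continue  # skip lines that do not have the expected 7 fields
        rec = dict(zip(GPS_COLUMNS, row))
        yield GpsPoint(
            rec["plateID"],
            rec["color"],
            float(rec["longitude"]),
            float(rec["latitude"]),
            rec["time"],
            float(rec["speed"]),
        )


# Made-up sample lines, for illustration only (not real data).
sample = io.StringIO("A1,blue,10.5,20.25,t0,35,0\nbroken,line\n")
points = list(read_gps(sample))
```

Because `read_gps` is a generator, a part-r-* file can be streamed line by line instead of being loaded whole.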

OD

The files in 201407OD/201407.gz are titled part-m-<sequence number> and contain a time series of values. The format is plateID, pickupTime, dropoffTime, pickLongitude, pickLatitude, dropLongitude, dropLatitude.

The files here do not seem to be ordered by either pickup or dropoff time.
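The ordering claim can be checked mechanically. The helper below is a sketch: `is_sorted_by` is a hypothetical name, rows are assumed to be already parsed into sequences, and the time values are assumed to compare correctly in whatever type they are stored as (numbers, or fixed-width timestamp strings).

```python
from typing import Any, Sequence

# Column positions from the OD format above.
PICKUP_TIME = 1
DROPOFF_TIME = 2


def is_sorted_by(rows: Sequence[Sequence[Any]], column: int) -> bool:
    """Return True if the given column is in non-decreasing order across rows."""
    values = [row[column] for row in rows]
    return all(a <= b for a, b in zip(values, values[1:]))


# Made-up rows (plateID, pickupTime, dropoffTime), for illustration only.
rows = [("A", 10, 30), ("B", 20, 25), ("C", 20, 40)]
sorted_by_pickup = is_sorted_by(rows, PICKUP_TIME)
sorted_by_dropoff = is_sorted_by(rows, DROPOFF_TIME)
```

Running this per file, and then on the last row of one file against the first row of the next, would confirm or refute the observation for both the OD and GPS parts.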

RegionData

TODO. RegionID, NumberOfPickups, long1, lat1, long2, lat2, ...

There also seems to be a file, 20190117All.csv, with format RegionID, ?, ?, ?, ?, ?, ?, long, lat, long, lat, ...
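For the planned layout (RegionID, NumberOfPickups, then alternating long/lat values), a row could be split into a polygon as sketched below. This assumes that planned layout only; it does not apply to 20190117All.csv, whose middle columns are unknown. `parse_region_row` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def parse_region_row(fields: Sequence[str]) -> Tuple[str, int, List[Tuple[float, float]]]:
    """Split one row into (region_id, number_of_pickups, [(long, lat), ...])."""
    region_id = fields[0]
    n_pickups = int(fields[1])
    coords = [float(value) for value in fields[2:]]
    if len(coords) % 2 != 0:
        raise ValueError("expected an even number of long/lat values")
    # Pair up long1, lat1, long2, lat2, ... into vertices.
    polygon = list(zip(coords[0::2], coords[1::2]))
    return region_id, n_pickups, polygon


# Made-up row, for illustration only.
example = parse_region_row(["7", "12", "0", "0", "2", "0", "2", "2"])
```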

New format plans

The dates seem to span 2014-07-03 to 2014-07-19. Hmm...

1. Preprocessing into dicts for easy access.

Convert raw data into a directly usable format.

For GPS, use a dataframe of [plateID, GPS lat, GPS lon, GPS time, speed].

For OD, use a dataframe of [plateID, pickTime, dropTime, pickLat, pickLon, dropLat, dropLon].

Warning: there are about 750k lines in each of the 12 part-m-* files, and 8.8 million lines in each of the 2 part-r-* files. This is around 2.5GB of raw data! Let's create (and then save) each dict of dataframes, one by one. Then we can build the quantized matrix piece by piece from these.
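The one-file-at-a-time idea might look like the sketch below, which keeps only a single part file in memory at any moment. It uses the standard library (lists of dicts saved with pickle) rather than pandas, so it is a stand-in for the dataframe version, not the actual scr_002. The comma delimiter, the `.pkl` output naming, and the name `convert_parts` are all assumptions.

```python
import csv
import pickle
from pathlib import Path
from typing import List

# Column order from the OD format above.
OD_COLUMNS = [
    "plateID", "pickupTime", "dropoffTime",
    "pickLongitude", "pickLatitude", "dropLongitude", "dropLatitude",
]


def convert_parts(src_dir, dst_dir, pattern: str = "part-m-*", delimiter: str = ",") -> List[Path]:
    """Convert each raw part file into its own pickle; return the written paths."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for part in sorted(Path(src_dir).glob(pattern)):
        with part.open(newline="") as handle:
            rows = [dict(zip(OD_COLUMNS, row)) for row in csv.reader(handle, delimiter=delimiter)]
        out = dst / (part.name + ".pkl")
        with out.open("wb") as fh:
            pickle.dump(rows, fh)
        written.append(out)
        # `rows` is rebound on the next iteration, so earlier files can be freed.
    return written
```

Keep `dst_dir` separate from `src_dir`; otherwise a second run would pick up the `.pkl` outputs, which also match `part-m-*`.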

2. Quantizing end goal.

We want to quantize into a timeslotted matrix. Each row corresponds to a timeslot (e.g. 1 minute long). Each file has the same number of rows, i.e. row $i$ in one file corresponds to the same timeslot as row $i$ in another file.

We want to quantize into a matrix indexed by ID and timeslot. For a given ID and timeslot, the matrix holds a status (occupied, vacant, low-battery, or charging), a demand (number of pickups during the timeslot in this region), and a supply (something a bit more difficult to calculate!)
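To make the row-to-timeslot mapping concrete, here is a small sketch. It assumes the slots start at midnight on 2014-07-03 and run through the end of 2014-07-19 (the apparent date span), with the 1-minute slot length used as the example above. Parsing raw timestamps is left out because their format is not documented here; `timeslot_index` is a hypothetical name.

```python
from datetime import datetime, timedelta

START = datetime(2014, 7, 3)   # assumed first instant covered
END = datetime(2014, 7, 20)    # exclusive, so 2014-07-19 is fully included
SLOT = timedelta(minutes=1)    # example slot length

# Every file gets this many rows, so row i always means the same timeslot.
N_SLOTS = (END - START) // SLOT


def timeslot_index(ts: datetime, start: datetime = START, slot: timedelta = SLOT) -> int:
    """Return the zero-based row index of the timeslot containing ts."""
    return (ts - start) // slot
```

With these assumptions, 17 days of 1-minute slots gives 24,480 rows per file.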

3. Quantizing for target values

We want to find, per ID per timeslot, status (occupied, vacant, low-battery or charging), demand (number of pickups for a given pickup timeslot $k$ for a given region ID $i$), and supply (something more difficult; consider later.)
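Of the three targets, demand is the most direct to compute: count pickups per (region $i$, timeslot $k$). The sketch below assumes each pickup has already been assigned a region ID (the job of scr_005) and parsed into a datetime; `demand_counts` is a hypothetical name, and status and supply are not covered.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Hashable, Iterable, Tuple


def demand_counts(
    pickups: Iterable[Tuple[Hashable, datetime]],
    start: datetime,
    slot: timedelta = timedelta(minutes=1),
) -> Counter:
    """Count pickups per (region_id, timeslot index)."""
    counts: Counter = Counter()
    for region_id, pickup_time in pickups:
        k = (pickup_time - start) // slot  # timeslot index of this pickup
        counts[(region_id, k)] += 1
    return counts


# Made-up pickups (region ID, pickup time), for illustration only.
start = datetime(2014, 7, 3)
pickups = [
    (1, datetime(2014, 7, 3, 0, 0, 10)),
    (1, datetime(2014, 7, 3, 0, 0, 50)),
    (2, datetime(2014, 7, 3, 0, 1, 5)),
]
demand = demand_counts(pickups, start)
```

A `Counter` returns 0 for any (region, timeslot) pair with no pickups, so the sparse result can be expanded into the dense matrix later.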

Hmmm

graph TD;

area.txt-->contain.py;
area.txt-->containNew.py;
area.txt-->RegionWhich3.py;

tools.py-->contain.py;
tools.py-->containNew.py;
tools.py-->RegionWhich3.py;

About

Data processing code for the Rutgers traffic project
