- Reinventing the wheel. For example, writing one-off code to read a CSV instead of using a purpose-built library that offers deeper functionality. (Python's pandas, to be specific, in this case. Interesting stuff, actually!)
- Failing to tune for performance. Slow code cuts down on the number of testing cycles you can run per day.
- Failing to understand time and timezones. Ain't that the truth.
- Manually integrating the different technologies in a solution (copying results files back and forth by hand, etc.).
- Not keeping track of data types and schemata.
- Failing to include data provenance tracking. Oooh, I like this notion.
- No testing, especially no regression testing.
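A few of these points (using pandas rather than hand-rolled CSV parsing, pinning down data types, and being explicit about timezones) can be sketched together. This is only an illustration under my own assumptions: the sample data, the column names, and the `load_events` helper are all made up for the example and do not come from the original talk.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
CSV_TEXT = """event_id,user,amount,ts
1,alice,10.50,2015-01-22T09:00:00+01:00
2,bob,3.25,2015-01-22T03:30:00-05:00
"""


def load_events(source):
    """Read events with pandas instead of hand-written CSV parsing."""
    # Declare the expected types up front so a schema change fails loudly
    # instead of silently turning a numeric column into strings.
    df = pd.read_csv(
        source,
        dtype={"event_id": "int64", "user": "object", "amount": "float64"},
    )
    # Parse timestamps that carry their own UTC offsets and normalize
    # everything to UTC, so later comparisons are unambiguous.
    df["ts"] = pd.to_datetime(df["ts"], utc=True)
    return df


events = load_events(io.StringIO(CSV_TEXT))
print(events.dtypes)
print(events)
```

Normalizing to UTC at load time is one possible convention, not the only one. The point is to make the choice once, explicitly, rather than leaving naive timestamps floating around.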
Thursday, January 22, 2015
Top Python mistakes when dealing with Big Data