
I am using Google Cloud to load data from Cloud Storage into BigQuery with the Apache Beam Python library. The ETL ran without issue in “DirectRunner” mode, but when I moved everything to Dataflow, it failed with an error.
I used the command below to upload the file, and I can see that the file is present at that location:-
gsutil cp datapip.csv.gz gs://myproject/data/datapip.csv.gz
Sadly, whenever I run the command below to execute my pipeline in cloud mode, I get an error:-
python dfmypy.py -p myproject -b mybucket -d mydataset
Correcting timestamps and writing to BigQuery dataset flights
Traceback (most recent call last):
  File "df06.py", line 171, in <module>
    run(project=args['project'], bucket=args['bucket'], dataset=args['dataset'])
  File "df06.py", line 134, in run
    | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 524, in __init__
    skip_header_lines=skip_header_lines)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 119, in __init__
    validate=validate)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", line 121, in __init__
    self._validate()
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/options/value_provider.py", line 137, in _f
    return fnc(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", line 178, in _validate
    match_result = FileSystems.match([pattern], limits=[1])[0]
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 186, in match
    filesystem = FileSystems.get_filesystem(patterns[0])
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
    raise ValueError('Unable to get the Filesystem for path %s' % path)
Resolution:-
It looks like my apache-beam library installation was incomplete. Running the command below still produced the same error:-
pip install apache-beam[gcp]
After running the command below, I could execute my code successfully:-
sudo pip install apache-beam[gcp]
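The base apache-beam package doesn’t include everything needed to read from and write to GCP. To get that, along with the runner for deploying your pipeline to Cloud Dataflow (the DataflowRunner), you need the apache-beam[gcp] extra. In my case it had to be installed via sudo pip, most likely because the Beam copy in use lives under /usr/local/lib/python2.7/dist-packages (as the traceback shows), and the install without sudo apparently did not update it. It seems weird that the error message is not clearer, so I thought I would write this up in case it helps someone.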
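If you want to confirm that the GCP extras are really available to the interpreter that runs your pipeline, a quick check like the following may help. This is only a minimal sketch: the helper name gcs_support_available is hypothetical, and it assumes that Beam's Cloud Storage filesystem lives in the module apache_beam.io.gcp.gcsfilesystem and that importing it fails with ImportError when the GCP dependencies are missing.

```python
import importlib


def gcs_support_available():
    """Return True if Beam's gs:// filesystem module can be imported.

    Hypothetical helper. The module path below is an assumption about
    how the Beam SDK is laid out and may differ between versions.
    """
    try:
        importlib.import_module("apache_beam.io.gcp.gcsfilesystem")
    except ImportError:
        # Either apache-beam is not installed for this interpreter,
        # or the [gcp] extras are missing.
        return False
    return True


if __name__ == "__main__":
    if gcs_support_available():
        print("GCS support found: gs:// paths should resolve.")
    else:
        print("GCS support missing: try installing apache-beam[gcp].")
```

Run it with the same python executable you use for the pipeline. If it reports that support is missing even after a pip install, the package probably went into a different site-packages directory than the one your pipeline imports from.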
Happy Machine Learning!